Real Pole-Balancing

Andy Barto barto at cs.umass.edu
Fri Jan 15 17:20:30 EST 1993


Jervis and Fallside recently posted an abstract on real pole-balancing that
prompted me to write this. They have implemented the learning algorithm that we
wrote about in 1983 (Barto, Sutton, and Anderson, IEEE Trans. Systems Man and
Cybern. 13, pp. 834-846) on a real pole-balancing system. I was lucky enough to
see their system work and was quite impressed. They indicate that they had some
trouble getting it to work and conclude the abstract with the following
statement, which I would like to discuss:
"This limits the usefulness of this kind of learning controller to small
problems which are likely to be better controlled by other means. Before a
learning controller can tackle more difficult problems, a more powerful learning
scheme has to be found."

Much progress has been made on this class of learning algorithms since 1983, and
we now have a much better understanding of them and their potential. I certainly
agree that the pole-balancing problem is not a very good candidate for these
algorithms. We artificially limited the information available to the controller,
in effect turning it into a problem that is harder than pole-balancing really
is (as we indicated in our 1983 paper). We now understand that learning
algorithm, and related ones developed by us and many others (e.g., Watkins,
Werbos), as methods for approximating solutions to optimal control problems by
means of approximating dynamic programming (DP). A short paper by Rich Sutton,
Ron Williams, and me appeared in Control Systems Magazine (vol. 12, April 1992)
that describes this perspective. Although much work has been done to address the
problems that Jervis and Fallside encountered in specifying a suitable state
representation, my real point is that these algorithms seem very well suited for
some classes of problems. There are more scales by which to measure problems
than "small" and "large".

Specifically, we think that these methods (not necessarily the old
pole-balancing system, but more recent versions of the same approach) make good
computational sense for stochastic optimal control problems with large state
sets. You are probably familiar with Tesauro's TD-gammon system, which uses a
system similar to the pole-balancer to learn how to play remarkably good
backgammon. This is a kind of stochastic optimal control problem (admittedly not
one of great engineering utility), and the conventional DP solution method is
infeasible due to the very large state set. Through many games of self-play, the
TD-gammon system was able to focus computation onto relevant regions of the
state set, and the multi-layer network stored the information gained in a
compact form. We think this can be a big advantage in certain classes of
stochastic optimal control problems, and we are currently working to provide
more evidence for this, as well as to develop more theory. Pole-balancing, in the
form in which it is relatively easy to solve, is a regulation problem, not a
stochastic optimal control problem of the kind that approximate DP methods might
be good at. 
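
To give a rough picture of what focusing computation by self-play can look
like, here is a small sketch. It is not Tesauro's code: the "game" is an
invented random-walk stand-in, and a linear value function takes the place of
his multi-layer network; TD(0) updates are applied only to positions that the
self-play episodes actually reach.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100                  # nominal number of positions in an invented game
    GAMMA, ALPHA = 1.0, 0.05

    def features(pos):
        """Tiny hand-made feature vector standing in for a board encoding."""
        x = pos / N
        return np.array([1.0, x, x * x])

    w = np.zeros(3)          # linear value function; a stand-in for the
                             # multi-layer network the real system used

    def self_play_episode():
        """Both 'players' move with the same crude stochastic policy, so each
        episode wanders through only the positions that policy reaches."""
        pos, visited = N // 2, []
        while 0 < pos < N:
            visited.append(pos)
            pos += rng.choice((-1, 1, 1))   # placeholder move generator
        return visited, (1.0 if pos >= N else 0.0)

    for _ in range(200):
        traj, outcome = self_play_episode()
        for t, p in enumerate(traj):
            v = features(p) @ w
            v_next = outcome if t + 1 == len(traj) else features(traj[t + 1]) @ w
            # TD(0) backup applied only to positions self-play produced
            w += ALPHA * (GAMMA * v_next - v) * features(p)

    print(features(90) @ w, features(10) @ w)

The never-visited corners of the nominal state set are simply never backed up,
and the weight vector stores whatever has been learned in a compact form.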

In summary, I agree wholeheartedly with Jervis and Fallside that pole-balancing
can be better achieved by other means. But I think the conclusion that approximate
DP methods of this kind (of which our 1983 system was a primitive example) are
suited only to small problems is not warranted. In fact, we think they are well
suited to some optimal control problems that are so big and nonlinear that
conventional control design techniques are not computationally feasible.

A. Barto