postscript preprints (Reinforcement Learning)

Rich Sutton sutton at gte.com
Mon May 8 13:59:18 EDT 1995


This is to announce the availability of two new postscript preprints:


       REINFORCEMENT LEARNING WITH REPLACING ELIGIBILITY TRACES
                          Satinder P. Singh
                          Richard S. Sutton
                    to appear in Machine Learning
  ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/singh-sutton-96.ps.gz

ABSTRACT: The eligibility trace is one of the basic mechanisms used in
reinforcement learning to handle delayed reward. In this paper we
introduce a new kind of eligibility trace, the {\it replacing} trace,
analyze it theoretically, and show that it results in faster, more
reliable learning than the conventional trace. Both kinds of trace
assign credit to prior events according to how recently they occurred,
but only the conventional trace gives greater credit to repeated events.
Our analysis is for conventional and replace-trace versions of the
offline TD(1) algorithm applied to undiscounted absorbing Markov chains.
First, we show that these methods converge under repeated presentations
of the training set to the same predictions as two well known Monte
Carlo methods. We then analyze the relative efficiency of the two Monte
Carlo methods. We show that the method corresponding to conventional TD
is biased, whereas the method corresponding to replace-trace TD is
unbiased. In addition, we show that the method corresponding to
replacing traces is closely related to the maximum likelihood solution
for these tasks, and that its mean squared error is always lower in the
long run. Computational results confirm these analyses and show that
they are applicable more generally. In particular, we show that
replacing traces significantly improve performance and reduce parameter
sensitivity on the ``Mountain-Car'' task, a full reinforcement-learning
problem with a continuous state space, when using a feature-based
function approximator.
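
For readers who want a concrete picture of the distinction between the two
kinds of trace, here is a minimal tabular TD(lambda) prediction sketch in
Python. The function name, parameters, and toy episode format are
illustrative only and are not taken from the paper; only the one-line
difference between accumulating and replacing traces is the point.

    # Sketch: tabular TD(lambda) with accumulating ("conventional") or
    # replacing eligibility traces. Episode format and names are illustrative.
    def td_lambda(episodes, n_states, alpha=0.1, gamma=1.0, lam=0.9,
                  replacing=True):
        """episodes: list of [(state, reward, next_state_or_None), ...]."""
        V = [0.0] * n_states
        for episode in episodes:
            e = [0.0] * n_states              # eligibility traces
            for (s, r, s_next) in episode:
                v_next = 0.0 if s_next is None else V[s_next]
                delta = r + gamma * v_next - V[s]   # TD error
                for i in range(n_states):          # decay all traces
                    e[i] *= gamma * lam
                if replacing:
                    e[s] = 1.0                # replacing: reset trace to 1
                else:
                    e[s] += 1.0               # accumulating: add 1 on revisit
                for i in range(n_states):    # credit prior states by eligibility
                    V[i] += alpha * delta * e[i]
        return V

On a state that is revisited within an episode, the accumulating trace can
exceed 1 and so gives extra credit to the repeated event, whereas the
replacing trace caps it at 1.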


    TD MODELS: MODELING THE WORLD AT A MIXTURE OF TIME SCALES
                        Richard S. Sutton
                     to appear in Proc. ML95
    ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-95.ps.Z

ABSTRACT: Temporal-difference (TD) learning can be used not just to
predict {\it rewards}, as is commonly done in reinforcement learning,
but also to predict {\it states}, i.e., to learn a model of the world's
dynamics. We present theory and algorithms for intermixing TD models of
the world at different levels of temporal abstraction within a single
structure. Such multi-scale TD models can be used in model-based
reinforcement-learning architectures and dynamic programming methods in
place of conventional Markov models. This enables planning at higher and
varied levels of abstraction, and, as such, may prove useful in
formulating methods for hierarchical or multi-level planning and
reinforcement learning. In this paper we treat only the {\it
prediction} problem---that of learning a model and value function for
the case of fixed agent behavior. Within this context, we establish the
theoretical foundations of multi-scale models and derive TD algorithms
for learning them. Two small computational experiments are presented to
test and illustrate the theory. This work is an extension and
generalization of the work of Singh (1992), Dayan (1993), and Sutton \&
Pinette (1985).
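
As a rough illustration of using TD methods to predict states rather than
rewards, the sketch below implements a successor-representation-style update
at a single, fixed time scale, in the spirit of Dayan (1993) cited above.
The paper's mixing of time scales is not reproduced here, and the names are
illustrative only.

    # Sketch: TD learning of a state-prediction model (one time scale only).
    import numpy as np

    def td_state_model(transitions, n_states, alpha=0.1, gamma=0.9):
        """transitions: iterable of (s, s_next) pairs from fixed behavior."""
        # M[s] predicts the discounted future occupancy of every state from s.
        M = np.zeros((n_states, n_states))
        I = np.eye(n_states)
        for (s, s_next) in transitions:
            # Same TD update as for rewards, but the target is a state vector.
            target = I[s] + gamma * M[s_next]
            M[s] += alpha * (target - M[s])
        return M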


The following previously published papers related to reinforcement learning
are also available online for the first time:

   Sutton, R.S., Barto, A.G. (1990) "Time-Derivative Models of Pavlovian
   Reinforcement," in Learning and Computational Neuroscience:
   Foundations of Adaptive Networks, M. Gabriel and J. Moore, Eds.,
   pp. 497-537.  MIT Press.
   (Main paper for the TD model of classical conditioning)
   ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-barto-90.ps.Z

   Barto, A.G., Sutton, R.S., Watkins, C.J.C.H. (1990) "Learning and Sequential
   Decision Making". In Learning and Computational Neuroscience, M. Gabriel
   and J.W. Moore, Eds., pp. 539-602, MIT Press.
   (A good introduction to RL)
   ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/barto-sutton-watkins-90.ps.Z

   Sutton, R.S. (1992b) "Gain Adaptation Beats Least Squares?", Proceedings
   of the Seventh Yale Workshop on Adaptive and Learning Systems,
   pp. 161-166, Yale University, New Haven, CT.
   (Step-size adaptation from an engineering perspective, 2 new algorithms)
   ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-92b.ps.Z

For abstracts, see the file ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/CATALOG.
If you have trouble obtaining these files, an alternate route is via the
mirror at ftp://ftp.gte.com/pub/reinforcement-learning/.



