Tech report available
Tommi Jaakkola
tommi at psyche.mit.edu
Wed Jul 21 18:26:38 EDT 1993
The following paper is now available on the neuroprose
archive as "jaakkola.convergence.ps.Z".
On the Convergence of Stochastic Iterative Dynamic
Programming Algorithms
Tommi Jaakkola
Michael I. Jordan
Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology
Satinder P. Singh
Department of Computer Science
University of Massachusetts at Amherst
Recent developments in the area of reinforcement learning have
yielded a number of new algorithms for the prediction and control
of Markovian environments. These algorithms, including the
TD($\lambda$) algorithm of Sutton (1988) and the Q-learning
algorithm of Watkins (1989), can be motivated heuristically as
approximations to dynamic programming (DP). In this paper we
provide a rigorous proof of convergence of these DP-based learning
algorithms by relating them to the powerful techniques of stochastic
approximation theory via a new convergence theorem. The theorem
establishes a general class of convergent algorithms to which
both TD($\lambda$) and Q-learning belong.
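As a rough illustration of the kind of algorithm covered by the theorem, below is a minimal sketch (not taken from the paper) of tabular Q-learning written as a stochastic approximation iteration with a decaying step size. The environment interface (env.reset, env.step, env.actions) and all names are illustrative assumptions.

    import random
    from collections import defaultdict

    def q_learning(env, n_episodes=500, gamma=0.95, epsilon=0.1):
        Q = defaultdict(float)       # Q[(state, action)] -> value estimate
        visits = defaultdict(int)    # per-pair visit counts for step sizes

        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy exploration over the environment's actions
                if random.random() < epsilon:
                    a = random.choice(env.actions(s))
                else:
                    a = max(env.actions(s), key=lambda a_: Q[(s, a_)])

                s_next, r, done = env.step(s, a)

                # Step size alpha_t = 1 / n(s, a) satisfies the usual
                # stochastic-approximation conditions
                # (sum alpha = infinity, sum alpha^2 < infinity).
                visits[(s, a)] += 1
                alpha = 1.0 / visits[(s, a)]

                # Q-learning update: move Q(s, a) toward the sampled
                # one-step DP backup r + gamma * max_a' Q(s', a').
                target = r if done else r + gamma * max(
                    Q[(s_next, a_)] for a_ in env.actions(s_next))
                Q[(s, a)] += alpha * (target - Q[(s, a)])

        return Q

The decaying per-pair step size is one standard choice meeting the step-size conditions under which convergence results of this type apply.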