New paper on Direct Reinforcement Learning

Jonathan Baxter jon at syseng.anu.edu.au
Mon Sep 20 04:12:32 EDT 1999


The following paper is available from
http://wwwsyseng.anu.edu.au/~jon/papers/drlexp.ps.gz. It is a sequel to
http://wwwsyseng.anu.edu.au/~jon/papers/drlalg.ps.gz.
All comments welcome.

Title: Direct Gradient-Based Reinforcement Learning: II.
Gradient Ascent Algorithms and Experiments

Authors:
Jonathan Baxter, Lex Weaver and Peter Bartlett
Australian National University

Abstract:
In \cite{drl1} we introduced \pomdpg, an algorithm for computing
arbitrarily accurate approximations to the performance gradient of
parameterized partially observable Markov decision processes
(\pomdps).

The algorithm's chief advantages are that it requires only a single
sample path of the underlying Markov chain, it uses only one
free parameter $\beta\in [0,1)$, which has a natural interpretation
in terms of a bias-variance trade-off, and it requires no knowledge of
the underlying state. In addition, the algorithm can be applied to
infinite state, control and observation spaces.
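To give a feel for the kind of estimator described above, here is a minimal sketch of a single-sample-path, trace-based gradient estimate on a toy two-armed bandit (a memoryless special case). The function name, toy problem, and step counts are illustrative choices, not from the paper; the point is only the shape of the update, in which an eligibility trace is discounted by $\beta$ and with $\beta = 0$ the estimate reduces to a plain likelihood-ratio estimate.

```python
import math
import random

def trace_gradient_estimate(theta, beta=0.5, steps=20_000, rng=None):
    """Illustrative single-sample-path gradient estimate using a
    beta-discounted eligibility trace (toy problem, not the paper's code).

    Policy: choose action 1 with probability sigmoid(theta); action 1
    pays reward 1, action 0 pays reward 0, so the true performance
    gradient is sigmoid(theta) * (1 - sigmoid(theta)).
    """
    rng = rng or random.Random(0)
    p = 1.0 / (1.0 + math.exp(-theta))   # P(action = 1)
    z = 0.0      # eligibility trace
    delta = 0.0  # running average of reward * trace
    for t in range(1, steps + 1):
        a = 1 if rng.random() < p else 0
        # gradient of log pi(a | theta) for the sigmoid policy
        grad_log_pi = (1.0 - p) if a == 1 else -p
        z = beta * z + grad_log_pi
        r = float(a)                  # reward of the chosen arm
        delta += (r * z - delta) / t  # incremental average of r * z
    return delta

estimate = trace_gradient_estimate(theta=0.0)
true_gradient = 0.25  # sigmoid(0) * (1 - sigmoid(0))
```

On this memoryless problem the estimate is unbiased for any $\beta$; in the general Markov-chain setting described in the abstract, $\beta$ trades bias against variance.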

In this paper we present \conjgrad, a conjugate-gradient ascent
algorithm that uses \pomdpg\ as a subroutine to estimate the gradient
direction. \conjgrad\ uses a novel line-search routine that relies
solely on gradient estimates and hence is robust to noise in the
performance estimates. \olpomdp, an on-line gradient ascent algorithm
based on \pomdpg, is also presented.
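As a rough illustration of what an on-line variant looks like, the sketch below folds the same discounted trace into an immediate parameter update, $\theta \leftarrow \theta + \gamma\, r\, z$, on the same toy bandit as above. The step size, seed, and names are my own choices; this shows only the general shape of an on-line gradient-ascent update, not the paper's exact algorithm.

```python
import math
import random

def online_gradient_ascent(theta=0.0, beta=0.5, gamma=0.1,
                           steps=5_000, rng=None):
    """Illustrative on-line gradient ascent: the policy parameter is
    updated after every step using the discounted eligibility trace.
    Toy problem: two-armed bandit where action 1 pays reward 1."""
    rng = rng or random.Random(1)
    z = 0.0  # eligibility trace
    for _ in range(steps):
        p = 1.0 / (1.0 + math.exp(-theta))  # P(action = 1)
        a = 1 if rng.random() < p else 0
        grad_log_pi = (1.0 - p) if a == 1 else -p
        z = beta * z + grad_log_pi
        r = float(a)
        theta += gamma * r * z  # immediate ascent step, no separate
                                # gradient-estimation phase
    return theta

theta = online_gradient_ascent()
p_final = 1.0 / (1.0 + math.exp(-theta))  # policy should favor arm 1
```

The contrast with the batch version is that no explicit gradient estimate is accumulated; each sampled reward nudges the parameters directly.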

The chief theoretical advantage of this gradient-based approach over
value-function-based approaches to reinforcement learning is that it
guarantees improvement in the performance of the policy at {\em every}
step. To show that this advantage is real, we give experimental
results in which \conjgrad\ was used to optimize a simple three-state
Markov chain controlled by a linear function, a two-dimensional
``puck'' controlled by a neural network, a call-admission queueing
problem, and a variation of the classical ``mountain-car'' task.  In
all cases the algorithm rapidly found optimal or near-optimal
solutions.



More information about the Connectionists mailing list