Paper Available On Reinforcement Learning

Tue Jul 27 22:03:58 EDT 1999

The following paper can be obtained from:

http://wwwsyseng.anu.edu.au/~jon/papers/drlalg.ps.gz      

Title: Direct Gradient-Based Reinforcement Learning: I. Gradient Estimation
Algorithms

Authors: Jonathan Baxter and Peter Bartlett 

Research School of Information Sciences and Engineering 
Australian National University
Jonathan.Baxter at anu.edu.au, Peter.Bartlett at anu.edu.au

Abstract: 
Despite their many empirical successes, approximate value-function
based approaches to reinforcement learning suffer from a paucity of
theoretical guarantees on the performance of the policy generated by
the value-function. In this paper we pursue an alternative approach:
first compute the gradient of the {\em average reward} with respect to
the parameters controlling the state transitions in a Markov chain (be
they parameters of a class of approximate value functions generating a
policy by some form of look-ahead, or parameters directly parameterizing
a set of policies), and then use gradient ascent to generate a new set of
parameters with increased average reward. We call this method ``direct''
reinforcement learning because we are not attempting to first find an
accurate value-function from which to generate a policy; we are instead
adjusting the parameters to directly improve the average reward.
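
As a reading aid, the objective and update described above can be written
roughly as follows. The symbols are assumptions made for this note, not
notation taken from the paper: $\pi(\theta)$ is the stationary distribution
of the chain with transition matrix $P(\theta)$ (assumed unique), $r(i)$ is
the reward in state $i$, and $\gamma > 0$ is a step size.

```latex
\eta(\theta) = \sum_i \pi_i(\theta)\, r(i),
\qquad
\theta \leftarrow \theta + \gamma\, \nabla_\theta \eta(\theta)
```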

We present an algorithm for computing approximations to the gradient of
the average reward from a single sample path of the underlying Markov
chain.  We show that the accuracy of these approximations depends on
the relationship between the discount factor used by the algorithm
and the mixing time of the Markov chain, and that the error can be made
arbitrarily small by setting the discount factor suitably close to $1$.
We extend this algorithm to the case of partially observable Markov
decision processes controlled by stochastic policies.  We prove that
both algorithms converge with probability 1.
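
The single-sample-path idea can be illustrated with a minimal sketch. The
announcement does not spell out the update rule, so the discounted
eligibility trace used below is an assumption about the general shape of
such an estimator, not the authors' algorithm as published. The two-state
chain, the constants Q and REWARD, and all function names are hypothetical
and chosen only so that the exact gradient has a closed form to compare
against.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical toy chain (not from the paper): two states, reward 1 in state 1.
# From state 0 the chain moves to state 1 with probability p = sigmoid(theta);
# from state 1 it falls back to state 0 with fixed probability Q.
Q = 0.1
REWARD = (0.0, 1.0)


def exact_gradient(theta):
    """Closed-form d(eta)/d(theta), where eta = p / (p + Q) is the average reward."""
    p = sigmoid(theta)
    return Q * p * (1.0 - p) / (p + Q) ** 2


def estimate_gradient(theta, beta, steps, seed=0):
    """Estimate the gradient from one sample path with discount factor beta."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    state, z, delta = 0, 0.0, 0.0
    for t in range(steps):
        if state == 0:
            nxt = 1 if rng.random() < p else 0
            # d/dtheta log P(state -> nxt): (1 - p) for a move, -p for a stay.
            score = (1.0 - p) if nxt == 1 else -p
        else:
            nxt = 0 if rng.random() < Q else 1
            score = 0.0  # transitions out of state 1 do not depend on theta
        z = beta * z + score                          # discounted eligibility trace
        delta += (REWARD[nxt] * z - delta) / (t + 1)  # running average of r * z
        state = nxt
    return delta


if __name__ == "__main__":
    theta = 0.0
    print("exact gradient:", exact_gradient(theta))
    for beta in (0.0, 0.9, 0.99):
        print("beta =", beta, "estimate:", estimate_gradient(theta, beta, 400000))
```

In this toy chain, beta = 0 credits each transition only with the immediate
reward and so underestimates the gradient, while values of beta closer to 1
reduce that bias at the cost of a noisier estimate, which is the trade-off
the abstract ties to the mixing time of the chain.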