two papers on reinforcement learning

Rich Sutton sutton at research.att.com
Tue Sep 28 14:18:49 EDT 1999


This is to announce the availability of two papers on reinforcement learning.

--------------------------------------------------------------------------------
Policy Gradient Methods for Reinforcement Learning with Function Approximation

  Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour

                  Accepted for presentation at NIPS'99

Function approximation is essential to reinforcement learning,
but the standard approach of approximating a value function and determining
a policy from it has so far proven theoretically intractable.  In this
paper we explore an alternative approach in which the policy is explicitly
represented by its own function approximator, independent of the value
function, and is updated according to the gradient of expected reward with
respect to the policy parameters.  Williams's REINFORCE method and
actor-critic methods are examples of this approach.  Our main new result
is to show that the gradient can be written in a form suitable for
estimation from experience aided by an approximate action-value or
advantage function.   Using this result, we prove for the first time that
a version of policy iteration with arbitrary differentiable function
approximation is convergent to a locally optimal policy.

ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/SMSM-NIPS99-submitted.ps.gz or
ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/SMSM-NIPS99-submitted.pdf
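
For readers who want a concrete picture of the policy-gradient approach,
here is a minimal REINFORCE-style sketch in Python.  The two-state MDP,
one-hot features, and step sizes are illustrative assumptions, not from
the paper; only the update along the return times grad log pi reflects
the method described above.

  import numpy as np

  # Minimal REINFORCE-style sketch: a softmax policy with linear features,
  # updated along the gradient of expected return.  The two-state MDP,
  # one-hot features, and constants below are illustrative assumptions.
  n_actions, n_features = 2, 2
  theta = np.zeros((n_features, n_actions))      # policy parameters

  def features(s):
      x = np.zeros(n_features)
      x[s] = 1.0                                  # one-hot state features
      return x

  def policy_probs(s):
      prefs = features(s) @ theta                 # action preferences
      prefs -= prefs.max()                        # numerical stability
      e = np.exp(prefs)
      return e / e.sum()

  def env_step(s, a):
      # Hypothetical dynamics: action 1 switches state; reward 1 from state 0.
      reward = 1.0 if (s == 0 and a == 1) else 0.0
      return (1 - s if a == 1 else s), reward

  alpha, gamma, horizon = 0.1, 0.9, 10
  for episode in range(500):
      s, trajectory = 0, []
      for t in range(horizon):
          p = policy_probs(s)
          a = np.random.choice(n_actions, p=p)
          s_next, r = env_step(s, a)
          trajectory.append((s, a, r))
          s = s_next
      # Monte Carlo return G_t, then
      #   theta <- theta + alpha * G_t * grad log pi(a_t | s_t; theta)
      G = 0.0
      for s_t, a_t, r_t in reversed(trajectory):
          G = r_t + gamma * G
          p = policy_probs(s_t)
          grad_log = np.outer(features(s_t), -p)  # d log pi / d theta
          grad_log[:, a_t] += features(s_t)
          theta += alpha * G * grad_log

The paper's contribution is to show that such gradient estimates remain
valid when the sampled return is replaced by a suitable approximate
action-value or advantage function, which is what allows the convergence
result with arbitrary differentiable function approximation.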

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

                       Between MDPs and Semi-MDPs:
       A Framework for Temporal Abstraction in Reinforcement Learning

            Richard S. Sutton, Doina Precup, and Satinder Singh

            Accepted for publication in Artificial Intelligence

     (a revised version of our earlier technical report on this topic)

Learning, planning, and representing knowledge at multiple levels of
temporal abstraction are key, longstanding challenges for AI.  In this
paper we
consider how these challenges can be addressed within the mathematical
framework of reinforcement learning and Markov decision processes (MDPs).  We
extend the usual notion of action in this framework to include
*options*---closed-loop policies for taking action over a period of
time.  Examples of options include picking up an object, going to lunch, and
traveling to a distant city, as well as primitive actions such as muscle
twitches and joint torques.  Overall, we show that options enable temporally
abstract knowledge and action to be included in the reinforcement
learning framework in a natural and general way. In particular, we show that
options may be used interchangeably with primitive actions in planning
methods such as dynamic programming and in learning methods such as
Q-learning.  Formally, a set of options defined over an MDP constitutes a
semi-Markov decision process (SMDP), and the theory of SMDPs provides the
foundation for the theory of options.  However, the most interesting issues
concern the interplay between the underlying MDP and the SMDP and are thus
beyond SMDP theory.  We present results for three such cases: 1) we show
that the results of planning with options can be used during execution to
interrupt options and thereby perform even better than planned, 2) we
introduce new *intra-option* methods that are able to learn about an
option from fragments of its execution, and 3) we propose a notion of
subgoal that can be used to improve the options themselves.  All of these
results have precursors in the existing literature; the contribution of
this paper is to establish them in a simpler and more general setting with
fewer changes to the existing reinforcement learning framework.  In
particular, we show that these results can be obtained without committing
to (or ruling out) any particular approach to state abstraction, hierarchy,
function approximation, or the macro-utility problem.

ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/SPS-aij.ps.gz
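
As a rough illustration of the option abstraction (not code from the
paper), the sketch below represents an option as an (initiation set,
policy, termination condition) triple and applies an SMDP-style
Q-learning update after each option terminates; the environment
interface, option contents, and constants are assumptions.

  import numpy as np
  from dataclasses import dataclass
  from typing import Callable

  @dataclass
  class Option:
      initiation: Callable[[int], bool]    # I: states where the option may start
      policy: Callable[[int], int]         # pi: closed-loop policy over actions
      termination: Callable[[int], float]  # beta: prob. of terminating in state s

  def run_option(env_step, s, option, gamma=0.9, rng=np.random):
      # Execute the option to termination and return the quantities needed
      # for the SMDP Q-learning update: next state, cumulative discounted
      # reward, and duration k.
      total, discount, k = 0.0, 1.0, 0
      while True:
          a = option.policy(s)
          s, r = env_step(s, a)
          total += discount * r
          discount *= gamma
          k += 1
          if rng.random() < option.termination(s):
              return s, total, k

  def smdp_q_update(Q, s, o, s_next, reward, k, alpha=0.1, gamma=0.9):
      # Q(s,o) <- Q(s,o) + alpha [ r + gamma^k max_o' Q(s',o') - Q(s,o) ]
      target = reward + gamma ** k * Q[s_next].max()
      Q[s, o] += alpha * (target - Q[s, o])

Treating a terminated option as a single transition discounted by
gamma^k is what makes a set of options over an MDP behave as a
semi-Markov decision process; the intra-option methods in the paper go
further and also learn from the individual steps taken inside an
option's execution.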
--------------------------------------------------------------------------------



