PhD thesis available: Reinforcement Learning with Exploration

Stuart I Reynolds S.I.Reynolds at cs.bham.ac.uk
Sat May 3 11:27:08 EDT 2003


Dear Connectionists,

My PhD thesis,

  Reinforcement Learning with Exploration.
  School of Computer Science, The University of Birmingham.

is available for download at:

  http://www.cs.bham.ac.uk/~sir/pub/thesis.abstract.html

Regards
Stuart Reynolds

Abstract:

Reinforcement Learning (RL) techniques may be used to find optimal
controllers for multistep decision problems where the task is to maximise
some reward signal. Successful applications include backgammon, network
routing and scheduling problems. In many situations it is useful or
necessary to have methods that learn about one behaviour while actually
following another (i.e. `off-policy' methods). Most commonly, the learner
may be required to follow an exploring behaviour, while its goal is to
learn about the optimal behaviour. Existing methods for learning in this
way (namely, Q-learning and Watkins' Q(lambda)) are notoriously
inefficient with their use of real experience. More efficient methods
exist but are either unsound (in that they are provably non-convergent to
optimal solutions in standard formalisms), or are not easy to apply
online. Online learning is an important factor in effective exploration.
Being able to quickly assign credit to the actions that lead to rewards
means that more informed choices between actions can be made sooner. A new
algorithm is introduced to overcome these problems. It works online,
without `eligibility traces', and has a naturally efficient
implementation. Experiments and analysis characterise when it is likely to
outperform existing related methods. New insights into the use of optimism
for encouraging exploration are also discovered. It is found that standard
practices can have a strongly negative effect on the performance of a large
class of RL methods for control optimisation.
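
For context, the canonical off-policy method mentioned above is one-step
Q-learning. The sketch below is plain tabular Q-learning (not the new
algorithm introduced in the thesis) and only illustrates the off-policy
idea: actions are chosen by an exploring, epsilon-greedy behaviour, while
the update bootstraps from the greedy action and so learns about the
optimal behaviour. The environment interface (`env.reset', `env.step') is
an illustrative assumption.

  import random

  def q_learning(env, num_states, num_actions,
                 episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
      # Tabular one-step Q-learning: behave epsilon-greedily, but update
      # towards the greedy value of the next state (off-policy learning).
      Q = [[0.0] * num_actions for _ in range(num_states)]
      for _ in range(episodes):
          s = env.reset()
          done = False
          while not done:
              if random.random() < epsilon:
                  a = random.randrange(num_actions)                   # explore
              else:
                  a = max(range(num_actions), key=lambda x: Q[s][x])  # exploit
              s_next, r, done = env.step(a)
              # Off-policy target: bootstrap from the greedy action in s_next,
              # regardless of what the exploring behaviour will actually do.
              target = r + (0.0 if done else gamma * max(Q[s_next]))
              Q[s][a] += alpha * (target - Q[s][a])
              s = s_next
      return Q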

Also examined are large and non-discrete state-space problems where
`function approximation' is needed, but where many RL methods are known to
be unstable. In particular, instability arises for control optimisation methods
and when experience is gathered under `off-policy' distributions (e.g. while
exploring). By a new choice of error measure to minimise, the well-studied
linear gradient-descent methods are shown to be `stable' when used with
any `discounted return' estimating RL method. The notion of stability is
weak (very large, but finite error bounds are shown), but the result is
significant insofar as it covers new cases such as off-policy and
multi-step methods for control optimisation.
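
For readers unfamiliar with the linear gradient-descent methods referred to
above, the sketch below shows the standard semi-gradient TD(0) update with
linear function approximation, estimating a `discounted return'. It is only
background; the thesis's contribution concerns the choice of error measure
and the resulting error bounds, which are not reproduced here. The feature
map `phi' and the transition stream are illustrative assumptions.

  import numpy as np

  def linear_td0(transitions, phi, num_features, alpha=0.01, gamma=0.95):
      # Linear gradient-descent TD(0): the discounted return from state s is
      # approximated by the dot product w . phi(s); after each transition the
      # weights move along the gradient of the squared TD error (with the
      # bootstrapped target treated as fixed).
      w = np.zeros(num_features)
      for (s, r, s_next, done) in transitions:
          v = np.dot(w, phi(s))
          v_next = 0.0 if done else np.dot(w, phi(s_next))
          td_error = r + gamma * v_next - v
          # Gradient of the prediction w . phi(s) with respect to w is phi(s).
          w += alpha * td_error * phi(s)
      return w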

New ways of viewing the goal of function approximation in RL are also
examined. Rather than treating it as a process of minimising the error between
the learned and observed reward signals, the objective is viewed as that of
finding representations that make it possible to identify the best action in
given states. A new `decision boundary partitioning' algorithm is
presented with this goal in mind. The method recursively refines the
value-function representation, increasing its resolution in areas where this
is expected to result in better decision policies.
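
To make the idea concrete, the toy sketch below is not the thesis's
`decision boundary partitioning' algorithm, only an illustration of the
general principle: a partition of a one-dimensional state space is refined
wherever neighbouring points prefer different greedy actions, so that
representational resolution is spent near decision boundaries rather than
on uniformly reducing value-prediction error. The `greedy_action' function,
standing in for a learned action-value estimate, is an illustrative
assumption.

  def refine_partition(intervals, greedy_action, min_width=1e-3):
      # Split any interval whose endpoints prefer different greedy actions,
      # i.e. an interval that straddles a decision boundary, so resolution
      # is concentrated where the choice of best action changes.
      refined = []
      for (lo, hi) in intervals:
          if hi - lo > min_width and greedy_action(lo) != greedy_action(hi):
              mid = (lo + hi) / 2.0
              # Recurse on both halves to locate the boundary more finely.
              refined += refine_partition([(lo, mid), (mid, hi)],
                                          greedy_action, min_width)
          else:
              refined.append((lo, hi))
      return refined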





