TR: On-line Q-learning using Connectionist Systems
gar@eng.cam.ac.uk
Tue Oct 4 11:20:12 EDT 1994
The following technical report is available by anonymous ftp from the
archive of the Speech, Vision and Robotics Group at the Cambridge
University Engineering Department.
ON-LINE Q-LEARNING
USING
CONNECTIONIST SYSTEMS
G. A. Rummery and M. Niranjan
Technical Report CUED/F-INFENG/TR 166
Cambridge University Engineering Department
Trumpington Street
Cambridge CB2 1PZ
England
Abstract
Reinforcement learning algorithms are a powerful class of machine
learning techniques. However, much of the work on these algorithms
has been developed with regard to discrete finite-state Markovian
problems, which is too restrictive for many real-world environments.
Therefore, it is desirable to extend these methods to
high-dimensional continuous state-spaces, which requires the use of
function approximation to generalise the information learnt by the
system. In this report, the use of back-propagation neural networks
(Rumelhart et al., 1986) is considered in this context.
We consider a number of different algorithms based around Q-Learning
(Watkins, 1989) combined with the Temporal Difference algorithm
(Sutton, 1988), including a new algorithm (Modified Connectionist
Q-Learning, or MCQ-L) and Q(lambda) (Peng and Williams, 1994). In
addition, we present algorithms for applying these updates on-line
during trials, unlike the backward replay used by Lin (1993), which
requires waiting until the end of each trial before updating can
occur. On-line updating is found to be more robust to the choice of
training parameters than backward replay, and it also enables the
algorithms to be used in continuously operating systems where no
end-of-trial conditions occur.
We compare the performance of these algorithms on a realistic robot
navigation problem, where a simulated mobile robot is trained to
guide itself to a goal position in the presence of obstacles. The
robot must rely on limited sensory feedback from its surroundings,
and make decisions that can be generalised to arbitrary layouts of
obstacles.
These simulations show that on-line updating is less sensitive to the
choice of training parameters than backward replay, and that the
alternative update rules of MCQ-L and Q(lambda) are more robust than
standard Q-learning updates.
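
To make the difference between the update rules concrete, here is a
minimal tabular sketch. It is not taken from the report itself (which
uses back-propagation networks as the function approximator rather
than a table), and the parameter values and names are illustrative
only. Standard Q-learning bootstraps from the best action in the next
state, while MCQ-L (the update later known as SARSA) bootstraps from
the action actually selected by the current policy.

import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.95, 0.1   # illustrative values only
Q = defaultdict(float)                   # Q[(state, action)] -> value estimate

def epsilon_greedy(state, actions):
    # Exploration policy used to select actions during a trial.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next, actions):
    # Standard Q-learning target: best action in the next state.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def mcql_update(s, a, r, s_next, a_next):
    # MCQ-L target: the action actually taken in the next state,
    # so the value estimate follows the policy being executed.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Either update can be applied on-line, immediately after each step of
a trial, rather than being stored for backward replay at the end of
the trial.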
************************ How to obtain a copy ************************
Via anonymous ftp:
unix> ftp svr-ftp.eng.cam.ac.uk
Name: anonymous
Password: (type your email address)
ftp> cd reports
ftp> binary
ftp> get rummery_tr166.ps.Z
ftp> quit
unix> uncompress rummery_tr166.ps.Z
unix> lpr rummery_tr166.ps (or however you print PostScript)