TR: On-line Q-learning using Connectionist Systems
gar@eng.cam.ac.uk
Tue Oct 4 11:20:12 EDT 1994
The following technical report is available by anonymous ftp from the
archive of the Speech, Vision and Robotics Group at the Cambridge
University Engineering Department.
ON-LINE Q-LEARNING
USING
CONNECTIONIST SYSTEMS
G. A. Rummery and M. Niranjan
Technical Report CUED/F-INFENG/TR 166
Cambridge University Engineering Department
Trumpington Street
Cambridge CB2 1PZ
England
Abstract
Reinforcement learning algorithms are a powerful class of machine
learning techniques. However, much of the work on these algorithms
has been developed with regard to discrete finite-state Markovian
problems, which is too restrictive for many real-world environments.
Therefore, it is desirable to extend these methods to
high-dimensional continuous state-spaces, which requires the use of
function approximation to generalise the information learnt by the
system. In this report, the use of back-propagation neural networks
(Rumelhart et al., 1986) is considered in this context.
We consider a number of different algorithms based around Q-Learning
(Watkins, 1989) combined with the Temporal Difference algorithm
(Sutton, 1988), including a new algorithm (Modified Connectionist
Q-Learning, or MCQ-L) and Q(lambda) (Peng and Williams, 1994). In
addition, we present algorithms for applying these updates on-line
during trials, unlike the backward replay used by Lin (1993), which
requires waiting until the end of each trial before updating can
occur. On-line updating is found to be more robust to the choice of
training parameters than backward replay, and it also enables the
algorithms to be used in continuously operating systems where no
end-of-trial conditions occur.
We compare the performance of these algorithms on a realistic robot
navigation problem, where a simulated mobile robot is trained to
guide itself to a goal position in the presence of obstacles. The
robot must rely on limited sensory feedback from its surroundings,
and make decisions that can be generalised to arbitrary layouts of
obstacles.
These simulations show that on-line updating is less sensitive to the
choice of training parameters than backward replay, and that the
alternative update rules of MCQ-L and Q(lambda) are more robust than
standard Q-learning updates.
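
To make the difference between the update rules concrete, here is a
minimal tabular sketch. It is not taken from the report itself (which
uses back-propagation networks as the function approximator rather
than a table), and the parameter values and names are illustrative
only. Standard Q-learning bootstraps from the best action in the next
state, while MCQ-L (the update later known as SARSA) bootstraps from
the action actually selected by the current policy.

import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.95, 0.1   # illustrative values only
Q = defaultdict(float)                   # Q[(state, action)] -> value estimate

def epsilon_greedy(state, actions):
    # Exploration policy used to select actions during a trial.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next, actions):
    # Standard Q-learning target: best action in the next state.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def mcql_update(s, a, r, s_next, a_next):
    # MCQ-L target: the action actually taken in the next state,
    # so the value estimate follows the policy being executed.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Either update can be applied on-line, immediately after each step of
a trial, rather than being stored for backward replay at the end of
the trial.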
************************ How to obtain a copy ************************
Via anonymous ftp:
unix> ftp svr-ftp.eng.cam.ac.uk
Name: anonymous
Password: (type your email address)
ftp> cd reports
ftp> binary
ftp> get rummery_tr166.ps.Z
ftp> quit
unix> uncompress rummery_tr166.ps.Z
unix> lpr rummery_tr166.ps (or however you print PostScript)