Tech Report Available
Mance E. Harmon
harmonme at aa.wpafb.af.mil
Tue Jun 25 10:47:58 EDT 1996
Multi-Agent Residual Advantage Learning With General Function Approximation
Mance E. Harmon
Wright Laboratory
WL/AACF
2241 Avionics Circle
Wright-Patterson AFB, Ohio 45433-7318
harmonme at aa.wpafb.af.mil
Leemon C. Baird III
U.S.A.F. Academy
2354 Fairchild Dr.
Suite 6K41
USAFA, Colorado 80840-6234
baird at cs.usafa.af.mil
ABSTRACT
A new algorithm, advantage learning, is presented that improves on advantage
updating by requiring that a single function be learned rather than two.
Furthermore, advantage learning requires only a single type of update, the
learning update, while advantage updating requires two different types of
updates, a learning update and a normalization update. The reinforcement
learning system uses the residual form of advantage learning. An application
of reinforcement learning to a Markov game is presented. The test-bed has
continuous states and nonlinear dynamics. The game consists of two players, a
missile and a plane; the missile pursues the plane and the plane evades the
missile. On each time step, each player chooses one of two possible actions,
turn left or turn right, resulting in an instantaneous 90-degree change in its
heading. Reinforcement is given only when the missile hits the
plane or the plane reaches an escape distance from the missile. The advantage
function is stored in a single-hidden-layer sigmoidal network. Speed of
learning is increased by a new algorithm, Incremental Delta-Delta (IDD), which
extends Jacobs' (1988) Delta-Delta for use in incremental training, and
differs from Sutton's Incremental Delta-Bar-Delta (1992) in that it does not
require the use of a trace and can be used with general function
approximation systems. The advantage learning algorithm for optimal control is
modified for Markov games in order to find the minimax point, rather than the
maximum. Empirical results gathered using the missile/aircraft test-bed
validate the theory that residual forms of reinforcement learning
algorithms converge to a local minimum of the mean squared Bellman residual
when using general function approximation systems. Also, to our knowledge,
this is the first time an approximate second-order method has been used with
residual algorithms. Empirical results are presented comparing convergence
rates with and without the use of IDD for the reinforcement learning test-bed
described above and for a supervised learning test-bed. The results of these
experiments demonstrate that IDD increased the rate of convergence and resulted
in an order-of-magnitude lower total asymptotic error than when using
backpropagation alone.
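
For orientation only, and not excerpted from the report, the sketch below
shows one gradient-descent step on the squared Bellman residual using a
pure-strategy max-min backup for the two-player game. The report's actual
advantage-learning target (including its scaling constant) and its residual
weighting are defined in the report itself; every name and constant in the
sketch (A, dA, gamma, lr, the action encoding) is an illustrative assumption.

def maxmin(w, x, A, actions=(0, 1)):
    # Pure-strategy max-min backup: the missile (pursuer) maximizes,
    # the plane (evader) minimizes.  The report's minimax computation
    # may differ (e.g., it may allow mixed strategies).
    best = None
    for am in actions:
        ap = min(actions, key=lambda a: A(w, x, am, a))
        v = A(w, x, am, ap)
        if best is None or v > best[0]:
            best = (v, am, ap)
    return best  # (value, missile action, plane action)

def residual_step(w, x, am, ap, r, x_next, done, A, dA,
                  gamma=0.99, lr=0.01, actions=(0, 1)):
    # One step of gradient descent on 0.5 * e**2, where e is the Bellman
    # residual for the joint action (am, ap) taken in state x.  Here w is
    # a numpy weight vector, A(w, x, am, ap) returns the approximated
    # advantage, and dA(...) returns its gradient with respect to w.
    if done:
        target, de_dw = r, -dA(w, x, am, ap)
    else:
        v_next, am_n, ap_n = maxmin(w, x_next, A, actions)
        target = r + gamma * v_next
        # Residual-gradient form: differentiate through the bootstrapped
        # target as well as through the prediction.
        de_dw = gamma * dA(w, x_next, am_n, ap_n) - dA(w, x, am, ap)
    e = target - A(w, x, am, ap)
    return w - lr * e * de_dw

The residual form is what the abstract's convergence claim refers to: with a
general function approximator, descending on the squared Bellman residual
itself converges to a local minimum of that residual.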
Available at http://www.aa.wpafb.af.mil/~harmonme
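
The IDD rule itself is given in the report. As a rough illustration of the
idea it names, per-weight step sizes adapted incrementally from the sign
agreement of successive gradients, with no trace, a sketch along the
following lines could be used; the up/down factors, bounds, and names are
assumptions rather than the report's constants.

import numpy as np

class PerWeightStepSizes:
    # Trace-free, per-weight step-size adaptation in the spirit of
    # Jacobs' (1988) delta-delta, applied incrementally (one example at
    # a time).  Not the report's IDD update; constants are illustrative.
    def __init__(self, n_weights, init_lr=0.01, up=1.05, down=0.7,
                 lr_min=1e-6, lr_max=1.0):
        self.lr = np.full(n_weights, init_lr)
        self.prev_grad = np.zeros(n_weights)
        self.up, self.down = up, down
        self.lr_min, self.lr_max = lr_min, lr_max

    def step(self, w, grad):
        # Apply one incremental update, then grow the step size of each
        # weight whose gradient kept its sign, shrink it where the sign
        # flipped, and leave it unchanged where either gradient is zero.
        w = w - self.lr * grad
        agree = np.sign(grad) * np.sign(self.prev_grad)
        self.lr = np.where(agree > 0, self.lr * self.up,
                           np.where(agree < 0, self.lr * self.down, self.lr))
        self.lr = np.clip(self.lr, self.lr_min, self.lr_max)
        self.prev_grad = grad.copy()
        return w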