Tech Report Available
Mance E. Harmon
harmonme at aa.wpafb.af.mil
Tue Jun 25 10:47:58 EDT 1996
Multi-Agent Residual Advantage Learning With General Function Approximation
Mance E. Harmon
Wright Laboratory
WL/AACF
2241 Avionics Circle
Wright-Patterson AFB, Ohio 45433-7318
harmonme at aa.wpafb.af.mil
Leemon C. Baird III
U.S.A.F. Academy
2354 Fairchild Dr.
Suite 6K41
USAFA, Colorado 80840-6234
baird at cs.usafa.af.mil
ABSTRACT
A new algorithm, advantage learning, is presented that improves on advantage
updating by requiring that a single function be learned rather than two.
Furthermore, advantage learning requires only a single type of update, the
learning update, while advantage updating requires two different types of
updates, a learning update and a normalization update. The reinforcement
learning system uses the residual form of advantage learning. An application
of reinforcement learning to a Markov game is presented. The test-bed has
continuous states and nonlinear dynamics. The game consists of two players, a
missile and a plane; the missile pursues the plane and the plane evades the
missile. On each time step, each player chooses one of two possible actions,
turn left or turn right, resulting in an instantaneous 90-degree change in its
heading. Reinforcement is given only when the missile hits the
plane or the plane reaches an escape distance from the missile. The advantage
function is stored in a single-hidden-layer sigmoidal network. Speed of
learning is increased by a new algorithm, Incremental Delta-Delta (IDD), which
extends Jacobs' (1988) Delta-Delta for use in incremental training, and
differs from Sutton's Incremental Delta-Bar-Delta (1992) in that it does not
require the use of a trace and can be used with general function
approximation systems. The advantage learning algorithm for optimal control is
modified for Markov games in order to find the minimax point, rather than the
maximum. Empirical results gathered using the missile/aircraft test-bed
validate the theory that residual forms of reinforcement learning
algorithms converge to a local minimum of the mean squared Bellman residual
when using general function approximation systems. Also, to our knowledge,
this is the first time an approximate second-order method has been used with
residual algorithms. Empirical results are presented comparing convergence
rates with and without the use of IDD for the reinforcement learning test-bed
described above and for a supervised learning test-bed. The results of these
experiments demonstrate that IDD increased the rate of convergence and resulted
in an order-of-magnitude lower total asymptotic error than when using
backpropagation alone.
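
For orientation only, and not excerpted from the report, the sketch below
shows one gradient-descent step on the squared Bellman residual using a
pure-strategy max-min backup for the two-player game. The report's actual
advantage-learning target (including its scaling constant) and its residual
weighting are defined in the report itself; every name and constant in the
sketch (A, dA, gamma, lr, the action encoding) is an illustrative assumption.

def maxmin(w, x, A, actions=(0, 1)):
    # Pure-strategy max-min backup: the missile (pursuer) maximizes,
    # the plane (evader) minimizes.  The report's minimax computation
    # may differ (e.g., it may allow mixed strategies).
    best = None
    for am in actions:
        ap = min(actions, key=lambda a: A(w, x, am, a))
        v = A(w, x, am, ap)
        if best is None or v > best[0]:
            best = (v, am, ap)
    return best  # (value, missile action, plane action)

def residual_step(w, x, am, ap, r, x_next, done, A, dA,
                  gamma=0.99, lr=0.01, actions=(0, 1)):
    # One step of gradient descent on 0.5 * e**2, where e is the Bellman
    # residual for the joint action (am, ap) taken in state x.  Here w is
    # a numpy weight vector, A(w, x, am, ap) returns the approximated
    # advantage, and dA(...) returns its gradient with respect to w.
    if done:
        target, de_dw = r, -dA(w, x, am, ap)
    else:
        v_next, am_n, ap_n = maxmin(w, x_next, A, actions)
        target = r + gamma * v_next
        # Residual-gradient form: differentiate through the bootstrapped
        # target as well as through the prediction.
        de_dw = gamma * dA(w, x_next, am_n, ap_n) - dA(w, x, am, ap)
    e = target - A(w, x, am, ap)
    return w - lr * e * de_dw

The residual form is what the abstract's convergence claim refers to: with a
general function approximator, descending on the squared Bellman residual
itself converges to a local minimum of that residual.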
Available at http://www.aa.wpafb.af.mil/~harmonme
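
The IDD rule itself is given in the report. As a rough illustration of the
idea it names, per-weight step sizes adapted incrementally from the sign
agreement of successive gradients, with no trace, a sketch along the
following lines could be used; the up/down factors, bounds, and names are
assumptions rather than the report's constants.

import numpy as np

class PerWeightStepSizes:
    # Trace-free, per-weight step-size adaptation in the spirit of
    # Jacobs' (1988) delta-delta, applied incrementally (one example at
    # a time).  Not the report's IDD update; constants are illustrative.
    def __init__(self, n_weights, init_lr=0.01, up=1.05, down=0.7,
                 lr_min=1e-6, lr_max=1.0):
        self.lr = np.full(n_weights, init_lr)
        self.prev_grad = np.zeros(n_weights)
        self.up, self.down = up, down
        self.lr_min, self.lr_max = lr_min, lr_max

    def step(self, w, grad):
        # Apply one incremental update, then grow the step size of each
        # weight whose gradient kept its sign, shrink it where the sign
        # flipped, and leave it unchanged where either gradient is zero.
        w = w - self.lr * grad
        agree = np.sign(grad) * np.sign(self.prev_grad)
        self.lr = np.where(agree > 0, self.lr * self.up,
                           np.where(agree < 0, self.lr * self.down, self.lr))
        self.lr = np.clip(self.lr, self.lr_min, self.lr_max)
        self.prev_grad = grad.copy()
        return w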