papers available on reinforcement learning

crites@pride.cs.umass.edu
Sat Jul 1 17:05:05 EDT 1995


The following papers are now available online:

---------------------------------------------------------------------------
Improving Elevator Performance Using Reinforcement Learning

Robert H. Crites and Andrew G. Barto
Computer Science Department
University of Massachusetts
Amherst, MA 01003-4610
crites@cs.umass.edu, barto@cs.umass.edu

Submitted to NIPS 8

8 pages

ftp://ftp.cs.umass.edu/pub/anw/pub/crites/nips8.ps.Z

ABSTRACT
This paper describes the application of reinforcement learning (RL) to the
difficult real-world problem of elevator dispatching.  The elevator domain
poses a combination of challenges not seen in most RL research to date.
Elevator systems operate in continuous state spaces and in continuous time
as discrete-event dynamic systems.  Their state is not fully observable and
they are non-stationary due to changing passenger arrival rates.  In
addition, we use a team of RL agents, each of which is responsible for
controlling one elevator car.  The team receives a global reinforcement
signal that appears noisy to each agent due to the effects of the actions
of the other agents, the random nature of the arrivals, and the incomplete
observation of the state.  In spite of all of these complications, we show
results that surpass the best of the heuristic elevator control algorithms
of which we are aware.  These results demonstrate the power of RL on a very
large-scale dynamic optimization problem of practical utility.
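
A minimal, hypothetical Python sketch of the team setup described above
follows: one Q-learning agent per elevator car, all trained from a single
global reinforcement signal.  The state encoding, action set, reward, and
hyperparameters are placeholders invented for illustration and are not
taken from the paper.

    import random
    from collections import defaultdict

    NUM_CARS = 4
    ACTIONS = ["stop_at_next_floor", "continue"]   # hypothetical action set
    ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1

    class CarAgent:
        """One Q-learning agent controlling a single elevator car."""
        def __init__(self):
            self.q = defaultdict(float)            # (state, action) -> value

        def act(self, state):
            if random.random() < EPSILON:
                return random.choice(ACTIONS)
            return max(ACTIONS, key=lambda a: self.q[(state, a)])

        def update(self, state, action, reward, next_state):
            # Every agent learns from the same team-level signal, so the
            # reward looks noisy from any single agent's point of view.
            best_next = max(self.q[(next_state, a)] for a in ACTIONS)
            target = reward + GAMMA * best_next
            self.q[(state, action)] += ALPHA * (target - self.q[(state, action)])

    def observe(car):
        # Placeholder for a car's partial observation of the building state.
        return ("floor", random.randint(0, 9), "calls", random.randint(0, 3))

    def global_reward():
        # Placeholder reward; the paper's actual reinforcement signal is
        # not reproduced here.
        return -random.random()

    agents = [CarAgent() for _ in range(NUM_CARS)]
    states = [observe(i) for i in range(NUM_CARS)]
    for _ in range(1000):
        actions = [agent.act(s) for agent, s in zip(agents, states)]
        r = global_reward()                # one shared signal for the team
        next_states = [observe(i) for i in range(NUM_CARS)]
        for agent, s, a, s2 in zip(agents, states, actions, next_states):
            agent.update(s, a, r, s2)
        states = next_states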

---------------------------------------------------------------------------
An Actor/Critic Algorithm that is Equivalent to Q-Learning

Robert H. Crites and Andrew G. Barto
Computer Science Department
University of Massachusetts
Amherst, MA 01003-4610
crites@cs.umass.edu, barto@cs.umass.edu

To appear in: G. Tesauro, D. S. Touretzky and T. K. Leen, eds., Advances in
Neural Information Processing Systems 7, MIT Press, Cambridge MA, 1995.

8 pages

ftp://ftp.cs.umass.edu/pub/anw/pub/crites/nips7.ps.Z

ABSTRACT
We prove the convergence of an actor/critic algorithm that is equivalent to
Q-learning by construction.  This equivalence is achieved by encoding
Q-values within the policy and value function of the actor and critic.  The
resulting actor/critic algorithm is novel in two ways: it updates the
critic only when the most probable action is executed from any given state,
and it rewards the actor using criteria that depend on the relative
probability of the action that was executed.
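
The following rough tabular sketch (not the construction from the paper)
shows where the two points named in the abstract would sit in an
actor/critic update: the critic is adjusted only when the executed action
is the actor's currently most probable action, and the actor's update
depends on the executed action's probability relative to that most probable
action.  The softmax parameterization and the specific update rules are
illustrative assumptions; the Q-value encoding that yields exact
equivalence to Q-learning is not reproduced here.

    import math
    import random
    from collections import defaultdict

    ACTIONS = [0, 1]
    ALPHA, BETA, GAMMA = 0.1, 0.1, 0.95

    prefs = defaultdict(float)    # actor: action preferences, softmax policy
    values = defaultdict(float)   # critic: state-value estimates

    def policy(state):
        exps = {a: math.exp(prefs[(state, a)]) for a in ACTIONS}
        z = sum(exps.values())
        return {a: e / z for a, e in exps.items()}

    def step(state, action, reward, next_state):
        probs = policy(state)
        greedy = max(ACTIONS, key=lambda a: probs[a])  # most probable action
        td_error = reward + GAMMA * values[next_state] - values[state]

        # (1) Critic update is gated: applied only when the executed action
        #     is the most probable one for this state.
        if action == greedy:
            values[state] += BETA * td_error

        # (2) Actor update weighted by the executed action's probability
        #     relative to the most probable action (illustrative form only).
        relative = probs[action] / probs[greedy]
        prefs[(state, action)] += ALPHA * relative * td_error

    # One toy update on a hypothetical transition:
    step("s0", random.choice(ACTIONS), 1.0, "s1")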

---------------------------------------------------------------------------

