TR available in neuroprose

Ronald J Williams rjw at ccs.neu.edu
Fri Sep 17 09:51:08 EDT 1993


FTP-host: archive.cis.ohio-state.edu
FTP-filename: /pub/neuroprose/williams.policy-iter.ps.Z

		**PLEASE DO NOT FORWARD TO OTHER GROUPS**

The following paper is now available in the neuroprose directory.  It is
49 pages long.  For those unable to obtain the file by ftp, hardcopies can
be obtained by contacting: Diane Burke, College of Computer Science, 161 CN,
Northeastern University, Boston, MA 02115, USA.

	        Analysis of Some Incremental Variants of
	   Policy Iteration: First Steps Toward Understanding
	            Actor-Critic Learning Systems

	   Northeastern University College of Computer Science
	            Technical Report NU-CCS-93-11

        Ronald J. Williams               Leemon C. Baird, III
        College of Computer Science      Wright Laboratory
        Northeastern University          Wright-Patterson Air Force Base
        rjw at ccs.neu.edu               bairdlc at wL.wpafb.af.mil

Abstract:
This paper studies algorithms based on an incremental dynamic programming
abstraction of one of the key issues in understanding the behavior of
actor-critic learning systems.  The prime example of such a learning system
is the ASE/ACE architecture introduced by Barto, Sutton, and Anderson (1983).
Also related are Witten's adaptive controller (1977) and Holland's bucket
brigade algorithm (1986).  The key feature of such a system is the presence
of separate adaptive components for action selection and state evaluation,
and the key issue focused on here is the extent to which their joint
adaptation is guaranteed to lead to optimal behavior in the limit.
From the incremental dynamic programming point of view taken here, this
question is formulated in terms of the use of separate data structures
for the current best choice of policy and the current best estimate of state
values, with separate operations used to update each at individual states.
Particular emphasis here is on the effect of complete asynchrony in the
updating of these data structures across states.  The main results are that,
while convergence to optimal performance is not guaranteed in general,
there are a number of situations in which such convergence is assured.
Since the algorithms investigated represent a certain idealized abstraction
of actor-critic learning systems, these results are not directly applicable
to current versions of such learning systems but may be viewed instead as
providing a useful first step toward more complete understanding of such
systems.  Another useful perspective on the algorithms analyzed here is that
they represent a broad class of asynchronous dynamic programming procedures
based on policy iteration.
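
To make the abstraction concrete, here is a minimal illustrative sketch
(in Python, with a made-up 3-state, 2-action MDP; the model, the update
schedule, and all numbers are invented for illustration and are not taken
from the report): the actor and critic are kept as separate tables, and at
each step a single state is chosen and either its value estimate or its
policy entry is updated, in no fixed order.

  import random

  n_states, n_actions, gamma = 3, 2, 0.9

  # Made-up model: P[s][a][t] = transition probability, R[s][a] = reward.
  P = [[[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
       [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0]],
       [[0.3, 0.3, 0.4], [0.0, 0.6, 0.4]]]
  R = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]

  V  = [0.0] * n_states   # critic: current best estimate of state values
  pi = [0] * n_states     # actor: current best choice of action per state

  def backup(s, a):
      # One-step lookahead value of taking action a in state s.
      return R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_states))

  random.seed(0)
  for _ in range(10000):
      s = random.randrange(n_states)      # complete asynchrony across states
      if random.random() < 0.5:
          V[s] = backup(s, pi[s])         # value update at a single state
      else:                               # policy update at a single state
          pi[s] = max(range(n_actions), key=lambda a: backup(s, a))

  print("policy:", pi, "values:", [round(v, 3) for v in V])

Whether interleavings like this one are guaranteed to converge to an
optimal policy is the question the report addresses.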

To obtain a copy:

  ftp archive.cis.ohio-state.edu
  login: anonymous
  password: <your email address>
  cd pub/neuroprose
  binary
  get williams.policy-iter.ps.Z
  quit

Then at your system:

  uncompress williams.policy-iter.ps.Z
  lpr -P<printer-name> williams.policy-iter.ps
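
If an interactive ftp session is inconvenient, the retrieval can also be
scripted; the following sketch uses Python's standard ftplib module (same
host, directory, and filename as above; substitute your own email address
for the anonymous-ftp password):

  from ftplib import FTP

  ftp = FTP("archive.cis.ohio-state.edu")
  ftp.login("anonymous", "you@your.site")   # password: your email address
  ftp.cwd("pub/neuroprose")
  ftp.voidcmd("TYPE I")                     # binary mode, as in the session above
  with open("williams.policy-iter.ps.Z", "wb") as f:
      ftp.retrbinary("RETR williams.policy-iter.ps.Z", f.write)
  ftp.quit()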


