revised paper available

Ronald J Williams rjw at ccs.neu.edu
Tue Nov 30 09:04:42 EST 1993


FTP-host: archive.cis.ohio-state.edu
FTP-filename: /pub/neuroprose/williams.perf-bound.ps.Z

		**PLEASE DO NOT FORWARD TO OTHER GROUPS**

The following improved version of the recently announced paper with the
same title is now available in the neuroprose directory.  The compressed
file has the same name and supersedes the earlier version.  In addition to
containing some new material, this new version has a corrected TR number and
an added co-author, and it is now 20 pages long.  For those unable to obtain
the file by ftp, hardcopies can be obtained by contacting Diane Burke,
College of Computer Science, 161 CN, Northeastern University,
Boston, MA 02115, USA.

		Tight Performance Bounds on Greedy Policies
		     Based on Imperfect Value Functions

	    Northeastern University College of Computer Science
	             Technical Report NU-CCS-93-14

        Ronald J. Williams               Leemon C. Baird, III
        College of Computer Science	 Wright Laboratory
        Northeastern University		 Wright-Patterson Air Force Base
        rjw at ccs.neu.edu			 bairdlc at wL.wpafb.af.mil

Abstract:
Consider a given value function on states of a Markov decision problem,
as might result from applying a reinforcement learning algorithm.
Unless this value function equals the corresponding optimal value function,
at some state there will be a discrepancy, which it is natural to call
the Bellman residual, between what the value function specifies at that
state and what is obtained by a one-step lookahead along the seemingly best
action at that state, using the given value function to evaluate all
succeeding states.  This paper derives a tight bound on how far
from optimal the discounted return for a greedy policy based on the given
value function can be, as a function of the maximum-norm magnitude
of this Bellman residual.  A corresponding result is also obtained for
value functions defined on state-action pairs, as used in Q-learning.
One significant application of these results is to problems where a function
approximator is used to learn a value function, with training of the
approximator based on trying to minimize the Bellman residual across
states or state-action pairs.  When control is based on the use of the
resulting value function, these results provide a link between how well
the objectives of function approximator training are met and the quality
of the resulting control.
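
To make the flavor of the result concrete, here is a minimal numerical
sketch (not part of the paper or this announcement).  It assumes the
standard discounted max-norm setting and the commonly cited form of the
bound for that setting, max_s [V*(s) - V_pi(s)] <= 2*gamma*eps/(1-gamma),
where eps is the maximum-norm magnitude of the Bellman residual of the
given value function V and pi is the greedy policy based on V; the random
MDP, the perturbation, and all names below are illustrative assumptions.
The Python/NumPy code builds a small MDP, measures the Bellman residual of
an imperfect value function, and checks the greedy policy's loss against
the bound.

  # Sketch only: random MDP, assumed bound form 2*gamma*eps/(1-gamma).
  import numpy as np

  rng = np.random.default_rng(0)
  n_states, n_actions, gamma = 5, 3, 0.9

  # Random transition probabilities P[s, a, s'] and rewards R[s, a].
  P = rng.random((n_states, n_actions, n_states))
  P /= P.sum(axis=2, keepdims=True)
  R = rng.random((n_states, n_actions))

  def bellman_backup(V):
      """One-step lookahead values Q[s, a] = R[s, a] + gamma * E[V(s')]."""
      return R + gamma * P @ V

  # Optimal value function V* by value iteration.
  V_star = np.zeros(n_states)
  for _ in range(2000):
      V_star = bellman_backup(V_star).max(axis=1)

  # An imperfect value function V, here V* plus noise.
  V = V_star + rng.normal(scale=0.1, size=n_states)

  # Max-norm Bellman residual of V and the greedy policy based on V.
  Q = bellman_backup(V)
  eps = np.max(np.abs(Q.max(axis=1) - V))
  pi = Q.argmax(axis=1)

  # Exact evaluation of the greedy policy: V_pi = (I - gamma*P_pi)^-1 R_pi.
  P_pi = P[np.arange(n_states), pi]
  R_pi = R[np.arange(n_states), pi]
  V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

  loss = np.max(V_star - V_pi)
  bound = 2 * gamma * eps / (1 - gamma)
  print(f"loss {loss:.4f} <= bound {bound:.4f}")
  assert loss <= bound + 1e-9

The bound is on the worst case over all MDPs and value functions with the
given residual magnitude, so on any particular random instance the observed
loss will typically be well below it.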

To obtain a copy:

  ftp archive.cis.ohio-state.edu
  login: anonymous
  password: <your email address>
  cd pub/neuroprose
  binary
  get williams.perf-bound.ps.Z
  quit

Then at your system:

  uncompress williams.perf-bound.ps.Z
  lpr -P<printer-name> williams.perf-bound.ps
