paper available

Ronald J Williams rjw at ccs.neu.edu
Tue Nov 23 13:48:58 EST 1993


FTP-host: archive.cis.ohio-state.edu
FTP-filename: /pub/neuroprose/williams.perf-bound.ps.Z

		**PLEASE DO NOT FORWARD TO OTHER GROUPS**

The following paper is now available in the neuroprose directory.  It is
17 pages long.  For those unable to obtain the file by ftp, hardcopies can
be obtained by contacting: Diane Burke, College of Computer Science, 161 CN,
Northeastern University, Boston, MA 02115, USA.

		Tight Performance Bounds on Greedy Policies
		     Based on Imperfect Value Functions

	    Northeastern University College of Computer Science
	             Technical Report NU-CCS-93-13

		 	  Ronald J. Williams
		      College of Computer Science
		        Northeastern University
			   rjw at ccs.neu.edu

Abstract:
Consider a given value function on the states of a Markov decision problem, as
might result from applying a reinforcement learning algorithm.  Unless this
value function equals the corresponding optimal value function, at some
states there will be a discrepancy, which it is natural to call the Bellman
residual, between the value this function assigns to the state and the value
obtained by a one-step lookahead along the seemingly best action at that
state, using the given value function to evaluate all succeeding states.
This paper derives a bound on how far from optimal the discounted return
for a greedy policy based on the given value function will be as a function
of the maximum norm magnitude of this Bellman residual.  A corresponding
result is also obtained for value functions defined on state-action pairs,
as are used in Q-learning, and in this case it is also shown that this bound
is tight in general.  One significant application of this result is to
problems where a function approximator is used to learn a value function,
with training of the approximator based on trying to minimize the Bellman
residual across states or state-action pairs.  When control is based on the
use of the resulting value function, this result provides a link between how
well the objectives of function approximator training are met and the quality
of the resulting control.
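The flavor of such a result can be illustrated numerically.  The sketch below
(not the paper's construction; the MDP, discount factor, and perturbation are
arbitrary choices, and the paper's exact constants may differ) builds a tiny
two-state, two-action discounted MDP, forms a greedy policy from a perturbed
value function V, and checks that the greedy policy's loss stays within the
commonly cited bound 2*eps/(1-gamma), where eps is the max-norm Bellman
residual of V:

```python
# Hedged illustration: greedy-policy loss vs. Bellman residual on a toy MDP.
# All numbers here are made up for illustration only.

GAMMA = 0.9
# P[s][a] = list of (next_state, prob); R[s][a] = immediate reward
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
R = {0: {0: 1.0, 1: 0.0},
     1: {0: 0.0, 1: 2.0}}
STATES, ACTIONS = [0, 1], [0, 1]

def backup(V, s, a):
    """One-step lookahead value of action a at state s under V."""
    return R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])

def value_iteration(tol=1e-12):
    """Compute the optimal value function V* by value iteration."""
    V = {s: 0.0 for s in STATES}
    while True:
        Vn = {s: max(backup(V, s, a) for a in ACTIONS) for s in STATES}
        if max(abs(Vn[s] - V[s]) for s in STATES) < tol:
            return Vn
        V = Vn

def greedy(V):
    """Greedy policy based on the (possibly imperfect) value function V."""
    return {s: max(ACTIONS, key=lambda a: backup(V, s, a)) for s in STATES}

def evaluate(pi, tol=1e-12):
    """Exact discounted return of the fixed policy pi, by iteration."""
    V = {s: 0.0 for s in STATES}
    while True:
        Vn = {s: backup(V, s, pi[s]) for s in STATES}
        if max(abs(Vn[s] - V[s]) for s in STATES) < tol:
            return Vn
        V = Vn

V_star = value_iteration()                       # V*(0)=18, V*(1)=20
V = {0: V_star[0] + 0.5, 1: V_star[1] - 0.5}     # an imperfect value function
# Max-norm Bellman residual of V:
eps = max(abs(max(backup(V, s, a) for a in ACTIONS) - V[s]) for s in STATES)
pi = greedy(V)                                   # picks a suboptimal action at s=0
V_pi = evaluate(pi)
loss = max(V_star[s] - V_pi[s] for s in STATES)  # greedy policy's shortfall
assert loss <= 2 * eps / (1 - GAMMA) + 1e-9      # loss = 8.0, bound = 17.0
```

Even this small residual (eps = 0.85) lets the greedy policy lock onto the
wrong action at state 0, giving a loss of 8.0 in discounted return, well
within the 2*eps/(1-gamma) = 17.0 bound.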

To obtain a copy:

  ftp cheops.cis.ohio-state.edu
  login: anonymous
  password: <your email address>
  cd pub/neuroprose
  binary
  get williams.perf-bound.ps.Z
  quit

Then at your system:

  uncompress williams.perf-bound.ps.Z
  lpr -P<printer-name> williams.perf-bound.ps

---------------------------------------------------------------------------
Ronald J. Williams                   | email: rjw at ccs.neu.edu
College of Computer Science, 161 CN  | Phone: (617) 373-8683
Northeastern University              | Fax: (617) 373-5121
Boston, MA 02115, USA                |
---------------------------------------------------------------------------