Tech Rept: Q-learning for Bandit Problems

Fri Mar 24 17:06:59 EST 1995

The following technical report is available via anonymous ftp:

		      Q-LEARNING FOR BANDIT PROBLEMS
		     (COMPSCI Technical Report 95-26)

			       Michael Duff
		      Department of Computer Science
			University of Massachusetts
			     Amherst, MA 01003
			     duff at cs.umass.edu

	  Multi-armed      bandits      may    be      viewed  as
	  decompositionally-structured Markov decision  processes
	  (MDP's) with potentially   very large  state  sets.   A
	  particularly elegant methodology  for computing optimal
	  policies  was  developed  over  twenty  ago  by Gittins
	  [Gittins \& Jones, 1974].   Gittins'  approach  reduces
	  the problem  of    finding optimal   policies for   the
	  original MDP to a  sequence of low-dimensional stopping
	  problems  whose solutions determine  the optimal policy
	  through the  so-called ``Gittins  indices.''  Katehakis
	  and Veinott [Katehakis   \& Veinott,  1987]  have shown
	  that the Gittins index for a  task  in state $i$ may be
	  interpreted   as     a  particular  component   of  the
	  maximum-value      function     associated   with   the
	  ``restart-in-$i$''  process,  a   simple MDP  to  which
	  standard solution   methods   for   computing   optimal
	  policies,  such  as  successive  approximation,  apply.
	  This paper explores the problem of learning the Gittins
	  indices on-line without the  aid of a process model; it
	  suggests      utilizing task-state-specific  Q-learning
	  agents  to solve  their respective restart-in-state-$i$
	  subproblems,  and includes  an   example in   which the
	  online reinforcement  learning approach is applied to a
	  simple problem of  stochastic scheduling---one instance
	  drawn  from  a  wide  class of  problems  that  may  be
          formulated as bandit problems.

FTP-host: envy.cs.umass.edu 
FTP-file: pub/duff/bandit.ps.Z

18 MBytes compressed / .46 MBytes uncompressed / 32 pages (8 figures)

FTP Instructions:

unix> ftp envy.cs.umass.edu 
  login: anonymous
  password: (your email address)
ftp> cd pub/duff
ftp> binary
ftp> get bandit.ps.Z
ftp> quit
unix> uncompress bandit.ps.Z
unix> lpr bandit.ps