JAIR: MAXQ Hierarchical Reinforcement Learning
Thomas G. Dietterich
tgd at cs.orst.edu
Sun Dec 3 23:01:06 EST 2000
JAIR is pleased to announce the publication of the following article:
Dietterich, T.G. (2000)
"Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition",
Volume 13, pages 227-303.
Available in PDF, PostScript and compressed PostScript.
For quick access via your WWW browser, use this URL:
http://www.jair.org/abstracts/dietterich00a.html
More detailed instructions are below.
Abstract: This paper presents a new approach to hierarchical
reinforcement learning based on decomposing the target Markov decision
process (MDP) into a hierarchy of smaller MDPs and decomposing the
value function of the target MDP into an additive combination of the
value functions of the smaller MDPs. The decomposition, known as the
MAXQ decomposition, has both a procedural semantics---as a subroutine
hierarchy---and a declarative semantics---as a representation of the
value function of a hierarchical policy. MAXQ unifies and extends
previous work on hierarchical reinforcement learning by Singh, by
Kaelbling, and by Dayan and Hinton. It is based on the assumption that
the programmer can identify useful subgoals and define subtasks that
achieve these subgoals. By defining such subgoals, the programmer
constrains the set of policies that need to be considered during
reinforcement learning. The MAXQ value function decomposition can
represent the value function of any policy that is consistent with the
given hierarchy. The decomposition also creates opportunities to
exploit state abstractions, so that individual MDPs within the
hierarchy can ignore large parts of the state space. This is
important for the practical application of the method. This paper
defines the MAXQ hierarchy, proves formal results on its
representational power, and establishes five conditions for the safe
use of state abstractions. The paper presents an online model-free
learning algorithm, MAXQ-Q, and proves that it converges with
probability 1 to a kind of locally-optimal policy known as a
recursively optimal policy, even in the presence of the five kinds of
state abstraction. The paper evaluates the MAXQ representation and
MAXQ-Q through a series of experiments in three domains and shows
experimentally that MAXQ-Q (with state abstractions) converges to a
recursively optimal policy much faster than flat Q learning. The fact
that MAXQ learns a representation of the value function has an
important benefit: it makes it possible to compute and execute an
improved, non-hierarchical policy via a procedure similar to the
policy improvement step of policy iteration. The paper demonstrates
the effectiveness of this non-hierarchical execution experimentally.
Finally, the paper concludes with a comparison to related work and a
discussion of the design tradeoffs in hierarchical reinforcement learning.
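For readers unfamiliar with the decomposition, its core recurrence is
sketched below in roughly the paper's notation, where i is a subtask, a is
a child subtask or primitive action, and C is the completion function:

    Q^pi(i, s, a) = V^pi(a, s) + C^pi(i, s, a)
    V^pi(i, s)    = Q^pi(i, s, pi_i(s))   for composite subtasks i,

with V^pi(i, s) reducing to the expected one-step reward when i is a
primitive action. MAXQ-Q learns the V and C terms recursively rather than
a single flat Q table; see the paper itself for the precise definitions.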
The article is available via:
-- comp.ai.jair.papers (also see comp.ai.jair.announce)
-- World Wide Web: The URL for our World Wide Web server is
http://www.jair.org/
For direct access to this article and related files, try:
http://www.jair.org/abstracts/dietterich00a.html
-- Anonymous FTP from either of the two sites below.
Carnegie-Mellon University (USA):
ftp://ftp.cs.cmu.edu/project/jair/volume13/dietterich00a.ps
The University of Genoa (Italy):
ftp://ftp.mrg.dist.unige.it/pub/jair/pub/volume13/dietterich00a.ps
The compressed PostScript file is named dietterich00a.ps.Z (464K).
For more information about JAIR, visit our WWW or FTP sites, or
contact jair-ed at isi.edu