new papers about exploration in active learning

Sebastian.Thrun@B.GP.CS.CMU.EDU
Tue Jun 6 06:52:25 EDT 2006


This is an announcement of three papers about exploration in
neurocontrol and reinforcement learning. I copied postscript versions
to our neuroprose archive.  Thanks to Jordan Pollack - what would
connectionism be without him??

Instructions for retrieval can be found at the end of this message.
Comments are welcome.

                                                  --- Sebastian Thrun

===========================================================================

ACTIVE EXPLORATION IN DYNAMIC ENVIRONMENTS 
by S. Thrun and K. Moeller
To appear in: Advances in Neural Information Processing Systems 4,
J. E. Moody, S. J. Hanson, and R. P. Lippmann (eds.), Morgan Kaufmann,
San Mateo, CA, 1992

Whenever an agent learns to control an unknown environment, two
opposing principles have to be combined, namely: exploration
(long-term optimization) and exploitation (short-term optimization).
Many real-valued connectionist approaches to learning control realize
exploration by randomness in action selection. This might be
disadvantageous when costs are assigned to ``negative experiences.''
The basic idea presented in this paper is to make an agent explore
unknown regions in a more directed manner. This is achieved by a
so-called competence map, which is trained to predict the controller's
accuracy, and is used for guiding exploration. Based on this, a
bistable system enables smoothly switching attention between two
behaviors -- exploration and exploitation -- depending on expected
costs and knowledge gain.  The appropriateness of this method is
demonstrated on a simple robot navigation task.

                                           archive name: thrun.nips91.ps.Z  
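
For readers who would like a concrete picture of the competence-map
idea, here is a minimal sketch in Python.  It is not the paper's
implementation; the tabular error model, the 'attention' weighting,
and all names are illustrative assumptions.  A simple error predictor
plays the role of the competence map, and its output is used as an
exploration bonus when ranking candidate actions:

    # Hypothetical sketch of competence-map-guided action selection.
    import numpy as np

    class CompetenceMap:
        """Stand-in for the competence map: predicts the controller's
        expected error per state-action pair (a table instead of a
        network)."""
        def __init__(self, n_states, n_actions, lr=0.2):
            # Optimistic initialization: unexplored pairs look erroneous.
            self.err = np.ones((n_states, n_actions))
            self.lr = lr

        def update(self, s, a, observed_error):
            self.err[s, a] += self.lr * (observed_error - self.err[s, a])

    def select_action(q_values, cmap, s, attention):
        """Blend exploitation (high value) with exploration (high
        predicted error).  'attention' in [0, 1] plays the role of the
        bistable switch: near 1 the agent explores, near 0 it exploits."""
        gain = cmap.err[s]                      # expected knowledge gain
        score = (1.0 - attention) * q_values[s] + attention * gain
        return int(np.argmax(score))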
===========================================================================


EFFICIENT EXPLORATION IN REINFORCEMENT LEARNING 
by S. Thrun
Technical Report CMU-CS-92-102, Jan. 1992, Carnegie Mellon University

Exploration plays a fundamental role in any active learning system.
This study evaluates that role and describes several local techniques
for exploration in finite, discrete domains, embedded in a
reinforcement learning framework (delayed reinforcement).
This paper distinguishes between two families of exploration schemes:
undirected and directed exploration. While the former family is
closely related to random-walk exploration, directed exploration
techniques memorize exploration-specific knowledge that is used to
guide the search. In many finite deterministic domains, any learning
technique based on undirected exploration is inefficient in terms of
learning time, i.e., learning time is expected to scale exponentially
with the size of the state space [Whitehead 91]. We prove that for all
these domains, reinforcement learning using a directed technique can
always be performed in polynomial time, demonstrating the important
role of exploration in reinforcement learning.
Subsequently, several exploration techniques found in recent
reinforcement learning and connectionist adaptive control literature
are described. In order to trade off efficiently between exploration
and exploitation -- a trade-off which characterizes many real-world
active learning tasks -- combination methods are described which
explore and avoid costs simultaneously. This includes a selective
attention mechanism, which allows smooth switching between exploration
and exploitation.
All techniques are evaluated and compared on a discrete reinforcement
learning task (robot navigation).  The empirical evaluation is
followed by an extensive discussion of benefits and limitations of
this work.

                            archive name: thrun.explor-reinforcement.ps.Z 
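
As an illustration of the distinction between undirected and directed
exploration, here is a minimal Python sketch.  Counter-based selection
is only one example of a directed scheme; the bonus weighting and all
names are illustrative assumptions rather than the report's actual
implementation.

    # Hypothetical contrast of undirected vs. directed exploration in a
    # finite, discrete domain.
    import random
    from collections import defaultdict

    q = defaultdict(float)            # learned action values
    visits = defaultdict(int)         # exploration-specific knowledge

    def undirected_action(state, actions, epsilon=0.1):
        """Undirected exploration: with small probability, pick a
        uniformly random action (a random-walk-like strategy)."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q[(state, a)])

    def directed_action(state, actions, bonus=1.0):
        """Directed (counter-based) exploration: prefer actions tried
        least often, traded off against their learned value."""
        return max(actions,
                   key=lambda a: q[(state, a)] + bonus / (1 + visits[(state, a)]))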
===========================================================================



THE ROLE OF EXPLORATION IN LEARNING CONTROL 
by S. Thrun 
To appear in: Handbook of Intelligent Control: Neural, Fuzzy and
Adaptive Approaches, D.A. White and D.A. Sofge (eds.), Van Nostrand
Reinhold, Florence, Kentucky 41022

This chapter summarizes the results described in the papers above and
surveys recent work on exploration in neurocontrol and reinforcement
learning. Here are the issues addressed in this chapter:

`[...] Let us begin with the questions characterizing exploration and
exploitation.  Exploration seeks to minimize learning time. Thus, the
central question of efficient exploration reads ``How can learning
time be minimized?''.  Accordingly, the question of exploitation is
``How can costs be minimized?''.  These two goals are usually
opposing, i.e., the smaller the learning time, the larger the costs,
and vice versa. But as we will see, pure exploration does not
necessarily minimize learning time. This is because pure exploration,
as presented in this chapter, maximizes knowledge gain, and thus may
waste much time exploring task-irrelevant parts of the environment.
If one is interested in restricting exploration to relevant parts of
the environment, it often makes sense to exploit simultaneously.
Therefore exploitation is part of efficient exploration.  On the other
hand, exploration is also part of efficient exploitation, because
costs clearly cannot be minimized over time without exploring the
environment.

The second important question to ask is ``What impact does the
exploration rule have on the speed and the costs of learning?'', or,
in other words, ``How much time should the designer of an active
learning system spend on designing an appropriate exploration
rule?''. This question will be discussed extensively, since the impact
of the exploration technique on both learning time and learning costs
can be enormous.  Depending on the structure of the environment,
``wrong'' exploration rules may result in inefficient learning times,
even if very efficient learning techniques are employed.

The third central question, relevant for any implementation of
learning control, is ``How does one trade off exploration and
exploitation?''. Since exploration and exploitation establish a
trade-off, this question needs further specification. For example, one
might ask ``How can I find the best controller in a given time?'', or
``How can I find the best controller while not exceeding a certain
amount of costs?''. Both questions constrain the trade-off dilemma in
such a way that an optimal combination of exploration and exploitation
may be found, provided the problem can be solved under these
constraints at all.

Now assume one already has an efficient exploration technique and an
efficient exploitation technique. This raises the question ``How shall
exploration and exploitation be combined?''. Shall each action explore
and exploit the environment simultaneously, or shall an agent
sometimes focus more on exploration, and sometimes focus more on
exploitation?'

                               archive name: thrun.exploration-overview.ps.Z 
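
To make the closing question concrete (should each action blend
exploration and exploitation, or should the agent switch its attention
between the two over time?), here is a small, purely illustrative
Python sketch; the scores, weights, and threshold are assumptions, not
the chapter's method.

    # Two hypothetical ways to combine exploration and exploitation.

    def blended_score(explore_score, exploit_score, weight=0.5):
        """Blend both criteria in every single action."""
        return weight * explore_score + (1.0 - weight) * exploit_score

    def switched_score(explore_score, exploit_score,
                       expected_gain, expected_cost, threshold=0.0):
        """Focus on one behavior at a time: explore while the expected
        knowledge gain justifies the expected cost, otherwise exploit."""
        if expected_gain - expected_cost > threshold:
            return explore_score
        return exploit_score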
===========================================================================



		INSTRUCTIONS FOR RETRIEVAL


unix>           ftp archive.cis.ohio-state.edu                (or 128.146.8.52)
Name:           anonymous
Password:       <your user id>
ftp>            cd pub/neuroprose
ftp>            binary
ftp>            get thrun.nips91.ps.Z                         (First paper)
ftp>            get thrun.explor-reinforcement.ps.Z           (Second paper)
ftp>            get thrun.exploration-overview.ps.Z           (Third paper)
ftp>            quit
unix>           zcat thrun.nips91.ps.Z | lpr
unix>           zcat thrun.explor-reinforcement.ps.Z | lpr
unix>           zcat thrun.exploration-overview.ps.Z | lpr


If you are unable to ftp and/or print the papers, send mail to
thrun@cs.cmu.edu or write to Sebastian Thrun, School of Computer
Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.


