exploration
Juergen Schmidhuber
juergen at idsia.ch
Mon Aug 9 04:56:52 EDT 1999
Two papers on exploration are now available in digital form:
-----------------------------------------------------------------------
Efficient Model-Based Exploration
Marco Wiering & Juergen Schmidhuber, IDSIA, Lugano, Switzerland
In R. Pfeifer, B. Blumberg, J. Meyer, S. W. Wilson, eds., From Animals
to Animats 5: Proceedings of the Fifth International Conference on
Simulation of Adaptive Behavior, p. 223-228, MIT Press, 1998.
ftp://ftp.idsia.ch/pub/juergen/sab98explore.ps.gz
Model-Based Reinforcement Learning (MBRL) can greatly profit from using
world models for estimating the consequences of selecting particular
actions: an animat can construct such a model from its experiences
and use it for computing rewarding behavior. We study the problem
of collecting useful experiences through exploration in stochastic
environments. To this end we use MBRL to maximize exploration rewards
(in addition to environmental rewards) for visiting states that promise
information gain. We also combine MBRL and the Interval Estimation
algorithm (Kaelbling, 1993). Experimental results demonstrate the
advantages of our approaches.
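
For readers who want the flavor of the approach, here is a minimal sketch
(not the paper's implementation; all names, and the 1/sqrt(count) bonus
used as a stand-in for expected information gain, are illustrative
assumptions) of tabular model-based RL with an exploration bonus added to
the environmental reward:

    # Minimal sketch: tabular model-based RL with an exploration bonus
    # rewarding state-action pairs whose model estimate is still uncertain.
    import numpy as np

    n_states, n_actions, gamma, beta = 10, 4, 0.95, 0.5

    counts  = np.zeros((n_states, n_actions, n_states))  # transition counts
    rewards = np.zeros((n_states, n_actions))             # running mean reward

    def update_model(s, a, r, s_next):
        """Update counts and the running mean of the observed reward."""
        counts[s, a, s_next] += 1
        n = counts[s, a].sum()
        rewards[s, a] += (r - rewards[s, a]) / n

    def plan(n_sweeps=50):
        """Value iteration on the estimated model plus exploration bonus."""
        q = np.zeros((n_states, n_actions))
        for _ in range(n_sweeps):
            v = q.max(axis=1)
            for s in range(n_states):
                for a in range(n_actions):
                    n = counts[s, a].sum()
                    p = counts[s, a] / n if n > 0 else np.ones(n_states) / n_states
                    bonus = beta / np.sqrt(n + 1)   # crude information-gain proxy
                    q[s, a] = rewards[s, a] + bonus + gamma * p @ v
        return q

    # Example: record one transition, then plan.
    update_model(0, 1, 1.0, 2)
    q_values = plan()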
-----------------------------------------------------------------------
Artificial Curiosity Based on Discovering Novel Algorithmic
Predictability Through Coevolution
Juergen Schmidhuber, IDSIA, Lugano, Switzerland
In P. Angeline, Z. Michalewicz, M. Schoenauer, X. Yao, Z. Zalzala,
eds., Congress on Evolutionary Computation, p. 1612-1618, IEEE Press,
Piscataway, NJ, 1999. (based on TR IDSIA-35-97, 1997)
ftp://ftp.idsia.ch/pub/juergen/cec99.ps.gz
How to explore a spatio-temporal domain? By predicting and learning from
success/failure what's predictable and what's not. I study a "curious"
embedded agent that differs from previous explorers in the sense that
it can limit its predictions to fairly arbitrary, computable aspects
of event sequences and thus can explicitly ignore almost arbitrary
unpredictable, random aspects. It constructs initially random algorithms
mapping event sequences to abstract internal representations (IRs). It
also constructs algorithms predicting IRs from IRs computed earlier.
It wants to learn novel algorithms creating IRs useful for correct
IR predictions, without wasting time on those learned before. This is
achieved by a co-evolutionary scheme involving two competing modules
COLLECTIVELY designing SINGLE algorithms to be executed. The modules
can bet on the outcome of IR predictions computed by the algorithms they
have agreed upon. If their opinions differ then the system checks who's
right, punishes the loser (the surprised one), and rewards the winner.
A reinforcement learning algorithm forces each module to maximize
reward. This motivates both modules to lure the other into agreeing upon
algorithms involving predictions that surprise it. Since each module
essentially can put in its veto against algorithms it does not consider
profitable, the system is motivated to focus on those computable aspects
of the environment where both modules still have confident but different
opinions. Once both share the same opinion on a particular issue (via
the loser's learning process, e.g., the winner is simply copied onto
the loser), the winner loses a source of reward - an incentive to shift
the focus of interest onto novel, yet unknown algorithms. Simulations
include an example where surprise generation of this kind helps to speed
up the collection of external reward.
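
To illustrate the incentive structure only, here is a minimal sketch (a
drastic simplification of the paper's coevolutionary scheme; the fixed
"truth" table and the copy-the-winner rule are illustrative assumptions)
of two modules betting on prediction outcomes, where agreement pays
nothing and the loser adopts the winner's opinion:

    # Minimal sketch: two modules bet on the outcome of prediction tests.
    # If their opinions differ, the environment decides who is right; the
    # winner gains the loser's stake and the loser copies the winner, so
    # settled tests stop paying and novel ones become attractive.
    import random

    N_TESTS = 20
    module_a = [random.randint(0, 1) for _ in range(N_TESTS)]  # opinions
    module_b = [random.randint(0, 1) for _ in range(N_TESTS)]
    truth    = [random.randint(0, 1) for _ in range(N_TESTS)]  # hidden outcomes

    reward_a = reward_b = 0
    for step in range(200):
        t = random.randrange(N_TESTS)
        if module_a[t] == module_b[t]:
            continue                        # agreement: no bet, no reward
        if module_a[t] == truth[t]:         # opinions differ: settle the bet
            reward_a += 1; reward_b -= 1
            module_b[t] = module_a[t]       # loser copies the winner
        else:
            reward_b += 1; reward_a -= 1
            module_a[t] = module_b[t]
    print(reward_a, reward_b)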
-----------------------------------------------------------------------
Several additional papers in postscript form are now available at
http://www.idsia.ch/~juergen/onlinepub.html
Juergen Schmidhuber www.idsia.ch