optimal universal reinforcement learner

Juergen Schmidhuber juergen at idsia.ch
Mon Aug 12 10:15:10 EDT 2002


I'd like to draw your attention to the first optimal, universal
reinforcement learner.
 
While traditional RL requires unrealistic Markovian assumptions,
the recent AIXI model of Marcus Hutter just needs an environment
whose reactions to control actions are sampled from an unknown
but computable distribution mu.  This includes basically
every environment we can reasonably talk about:
 
M. Hutter. Towards a Universal Theory of Artificial Intelligence
based on Algorithmic Probability and Sequential Decisions.
Proc. ECML-2001, pp. 226-238.
http://www.idsia.ch/~marcus/ai/paixi.pdf
ftp://ftp.idsia.ch/pub/techrep/IDSIA-14-00.ps.gz
 
M. Hutter. Self-Optimizing and Pareto-Optimal Policies in General
Environments based on Bayes-Mixtures. Proc. COLT-2002, pp. 364-379.
http://www.idsia.ch/~marcus/ai/selfopt.pdf
ftp://ftp.idsia.ch/pub/techrep/IDSIA-04-02.ps.gz
 
How does AIXI work?  An optimal predictor based on a universal
Bayes mixture xi predicts future events, including future reward.
Here xi is just a weighted sum of all distributions nu in a set M.
AIXI then simply selects those action sequences that maximize the
reward predicted by xi.
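 
To make the mechanism concrete, here is a minimal toy sketch in
Python.  It is not Hutter's incomputable AIXI itself: M is a small
finite set of candidate environment models, the weights are given
explicitly, and the planning horizon is a fixed constant.  All names
are illustrative and mine, not taken from the papers above.
 
def nu_prob(nu, history):
    # Probability a single model nu assigns to a history of
    # (action, (observation, reward)) pairs.  Here nu(past, action)
    # must return a dict mapping each possible percept to its
    # probability.
    p, past = 1.0, []
    for action, percept in history:
        p *= nu(past, action).get(percept, 0.0)
        past.append((action, percept))
    return p

def xi_prob(models, weights, history):
    # Bayes mixture xi(history) = sum over nu in M of w_nu * nu(history).
    return sum(w * nu_prob(nu, history) for nu, w in zip(models, weights))

def value(models, weights, history, action, horizon, actions, percepts):
    # xi-expected future reward of taking `action` now, assuming
    # expectimax-optimal behaviour for the remaining horizon - 1 steps.
    if horizon == 0:
        return 0.0
    base = xi_prob(models, weights, history)
    if base == 0.0:
        return 0.0
    total = 0.0
    for percept in percepts:             # percept = (observation, reward)
        h2 = history + [(action, percept)]
        p = xi_prob(models, weights, h2) / base  # xi(percept | history, action)
        if p == 0.0:
            continue
        reward = percept[1]
        best_future = max(value(models, weights, h2, a, horizon - 1,
                                actions, percepts) for a in actions)
        total += p * (reward + best_future)
    return total

def aixi_action(models, weights, history, horizon, actions, percepts):
    # The AIXI-style choice: the action maximizing xi-predicted reward.
    return max(actions, key=lambda a: value(models, weights, history, a,
                                            horizon, actions, percepts))
 
In AIXI proper, M contains all environments with computable
distributions and the weights shrink with the complexity of nu,
which is exactly what makes the exact model incomputable and only
approximable.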
 
It turns out that this method really is self-optimizing in the
following sense:  for all nu in the mix, the average value of the
actions it selects, given the history, asymptotically converges to
the optimal value achieved by the unknown policy that knows the
true mu in advance!
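 
In symbols (a rough paraphrase in my own notation, not Hutter's
exact formulation): for every nu in M,
 
  \frac{1}{m} V^{p_\xi}_{\nu}(1..m) - \frac{1}{m} V^{*}_{\nu}(1..m)
    \longrightarrow 0 \quad (m \to \infty),
 
where V^{p_\xi}_{\nu}(1..m) is the nu-expected reward collected by
the Bayes-mixture policy over the first m cycles, and V^{*}_{\nu}(1..m)
is that of the nu-optimal policy that knows the true distribution
from the start.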
 
The necessary condition is that M admits self-optimizing policies
at all; remarkably, this condition is also sufficient.  Moreover,
the Bayes-mixture policy is Pareto-optimal: there is no other
policy yielding higher or equal value in all environments nu and a
strictly higher value in at least one.
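 
In the same notation, Pareto optimality of the Bayes-mixture policy
p_\xi says there is no policy p with
 
  V^{p}_{\nu} \ge V^{p_\xi}_{\nu} \ \text{for all } \nu \in M
  \quad\text{and}\quad
  V^{p}_{\nu'} > V^{p_\xi}_{\nu'} \ \text{for some } \nu' \in M.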
 
Interestingly, the right way of treating the temporal horizon
is not to discount future rewards exponentially, as done in most
traditional RL work, but to let the future horizon grow in
proportion to the learner's lifetime so far.
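 
A tiny illustration of the contrast, again in Python (the constant
c and the exact scheme are my assumptions for the sketch, not a
prescription from the papers):
 
def discounted_value(future_rewards, gamma=0.95):
    # Traditional RL: geometrically discounted sum of future rewards;
    # the effective lookahead of roughly 1 / (1 - gamma) steps stays
    # fixed forever, no matter how old the agent gets.
    return sum(r * gamma ** k for k, r in enumerate(future_rewards))

def growing_horizon_value(future_rewards, age, c=2):
    # Alternative: no discounting, but only count rewards up to a
    # horizon proportional to the learner's lifetime so far.
    return sum(future_rewards[:c * age])
 
With a fixed gamma the first scheme never looks much further than
about 1/(1-gamma) steps ahead; with the second, the lookahead keeps
growing as the agent accumulates lifetime.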
 
To quote some AIXI referees: "[...] Clearly fundamental and
potentially interesting research direction with practical
applications. [...] Great theory. Extends a major theoretical
direction that led to practical MDL and MML. This approach may
do the same thing (similar thing) wrt to decision theory and
reinforcement learning, to name a few." "[...] this could be
the foundation of a theory which might inspire AI and MC for
years (decades?)."


 
Juergen Schmidhuber, IDSIA
http://www.idsia.ch/~juergen/unilearn.html



