metalearner

juergen@idsia.ch
Thu Aug 30 10:57:29 EDT 2001


I would like to draw your attention to Sepp Hochreiter's astonishing
recent result on "learning to learn."

He trains gradient-based "Long Short-Term Memory" (LSTM) recurrent
networks with roughly 5000 weights to _metalearn_ fast online learning
algorithms for nontrivial classes of functions, such as all quadratic
functions of two variables. LSTM is necessary because metalearning
typically involves huge time lags between important events, and standard
gradient-based recurrent nets cannot bridge them.  After a month
of metalearning on a PC he freezes all weights, then uses the frozen
net as follows: He selects some new function f, and feeds a sequence of
random training exemplars of the form ...data/target/data/target/data...
into the input units, one sequence element at a time. After about 30
exemplars the frozen recurrent net correctly predicts target inputs before
it sees them. No weight changes! How is this possible? After metalearning
the frozen net implements a sequential learning algorithm which apparently
computes something like error signals from data inputs and target inputs
and translates them into changes of internal estimates of f.  Parameters
of f, errors, temporary variables, counters, computations of f and of
parameter updates are all somehow represented in the form of circulating
activations. Remarkably, the new - and quite opaque - online learning
algorithm running on the frozen network is much faster than standard
backprop with optimal learning rate. This indicates that one can use
gradient descent to metalearn learning algorithms that outperform gradient
descent.  Furthermore, the metalearning procedure automatically avoids
overfitting in a principled way, since it punishes overfitting online
learners just like it punishes slow ones, simply because overfitters
and slow learners cause more cumulative errors during metalearning.
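
To make the protocol concrete, here is a minimal sketch in Python
(my own illustration, not Hochreiter's code) of the task class and
the interleaved input stream described above, plus the kind of
ordinary online gradient descent baseline the frozen net is compared
against. The quadratic parameterization, sampling ranges, and error
measure are assumptions for the example.

import numpy as np

rng = np.random.default_rng(0)

def sample_quadratic():
    # Draw a random quadratic function of two variables:
    # f(x1, x2) = a*x1^2 + b*x2^2 + c*x1*x2 + d*x1 + e*x2 + g
    coeffs = rng.uniform(-1.0, 1.0, size=6)
    def f(x):
        x1, x2 = x
        return float(coeffs @ np.array([x1*x1, x2*x2, x1*x2, x1, x2, 1.0]))
    return f

def make_sequence(f, n_exemplars=30):
    # Build the interleaved ...data/target/data/target... stream that
    # is fed to the recurrent net one element at a time.
    seq = []
    for _ in range(n_exemplars):
        x = rng.uniform(-1.0, 1.0, size=2)
        seq.append(('data', x))
        seq.append(('target', f(x)))
    return seq

def online_sgd_baseline(seq, lr=0.1):
    # Ordinary online gradient descent on the same quadratic feature
    # model: predict each target before seeing it, then take one
    # gradient step. Returns the cumulative squared prediction error,
    # the quantity that punishes both slow and overfitting learners.
    w = np.zeros(6)
    cum_err, x = 0.0, None
    for kind, value in seq:
        if kind == 'data':
            x = value
        else:
            feats = np.array([x[0]**2, x[1]**2, x[0]*x[1], x[0], x[1], 1.0])
            pred = w @ feats
            err = pred - value
            cum_err += err * err
            w -= lr * 2.0 * err * feats
    return cum_err

f = sample_quadratic()
print('cumulative squared error:', online_sgd_baseline(make_sequence(f)))

During metalearning, the analogous cumulative error (summed over many
randomly drawn functions f) is exactly what gradient descent on the
LSTM weights pushes down, which is why slow learners and overfitters
are penalized alike.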

Hochreiter himself admits the paper is not well-written. But the results
are quite amazing: http://www.cs.colorado.edu/~hochreit

@inproceedings{Hochreiter:01meta,
  author    = "S. Hochreiter and A. S. Younger and P. R. Conwell",
  title     = "Learning to learn using gradient descent",
  booktitle = "Proc. Intl. Conf. on Artificial Neural Networks (ICANN-2001)",
  series    = "Lecture Notes in Computer Science",
  volume    = "2130",
  editor    = "G. Dorffner and H. Bischof and K. Hornik",
  publisher = "Springer",
  address   = "Berlin, Heidelberg",
  pages     = "87--94",
  year      = "2001"}


-------------------------------------------------
Juergen Schmidhuber                      director
IDSIA, Galleria 2, 6928 Manno-Lugano, Switzerland
juergen@idsia.ch               www.idsia.ch/~juergen




