Bengio's paper on 'learning learning rules'

Henrik Klagges uh311ae at sunmanager.lrz-muenchen.de
Tue Jan 21 16:00:28 EST 1992


I just ftp'ed and read the above paper, which is available on the neuroprose
server. Essentially, it proposes to use an optimizer (like a gradient descent
method or a genetic algorithm) to optimize very global network parameters,
not least the learning rule itself. The latter might be accomplished by,
e.g., having a GA switch individual modules on and off, with the active
modules combined into a heterogeneous rule (a toy sketch of this idea follows
below). Unfortunately, the paper does not include any simulation data to
support this idea.
I did some experiments last year which might be of interest, because they
do support Bengio's predictions. Because a GA with expensive function
evaluations (network training tests) is hungry for resources, the results
are based on the usual set of toy problems like xor and a 16-2-16 encoder.
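
To make the module-switching idea concrete, here is a minimal sketch in
Python; the module terms, constants, and names are my own invention for
illustration and are not taken from the paper or from my runs:

    import random

    # Hypothetical rule 'modules': each maps the same local signals
    # (presynaptic activity, postsynaptic activity, error) to a weight delta.
    MODULES = [
        lambda pre, post, err: 0.1 * err * pre,    # gradient-like term
        lambda pre, post, err: 0.01 * pre * post,  # Hebbian term
        lambda pre, post, err: -0.001 * post,      # decay term
    ]

    def heterogeneous_rule(genome, pre, post, err):
        # The genome is a bit string; a 1 switches the corresponding module on.
        return sum(m(pre, post, err) for bit, m in zip(genome, MODULES) if bit)

    def random_genome():
        return [random.randint(0, 1) for _ in MODULES]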

The problem was whether, and if so with what mixing factor, to couple two
differing learning rules into a hybrid. This is not as straightforward as
simply evaluating the mixing factor, because differing algorithms typically
like different 'environments' to work in. More specifically, the two
algorithms I considered are very sensitive to the initialization range of
the network weights and prefer rather non-overlapping values. This turned
the search for a good mixing factor into a multi-parameter nobody-knows
problem, because I could not rule out a priori that a good hybrid would
exist with unknown initialization parameters. (A sketch of how such an
individual might be encoded is given after this paragraph.)
One night of 36MHz R3000 sweat produced a nice hybrid with improved
convergence for the tested problems, so Bengio's claims get some support
from me. I'd like to add, though, that more advanced searches will very
likely require very long and careful optimization runs if the GA is to
sample a sufficiently large part of the search space.
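
For illustration, this is roughly how such an individual could be encoded;
the linear blend, the key names, and the numeric ranges are assumptions of
mine, not a transcript of the actual experiments:

    import random

    def hybrid_update(delta_a, delta_b, mix):
        # Blend the weight updates proposed by the two learning rules.
        return mix * delta_a + (1.0 - mix) * delta_b

    def random_individual():
        # The mixing factor cannot be searched on its own: each rule prefers
        # its own, barely overlapping weight initialization range, so the
        # init range has to be part of the genome as well.
        return {
            "mix": random.uniform(0.0, 1.0),          # 1.0 = pure rule A, 0.0 = pure rule B
            "init_range": random.uniform(0.01, 2.0),  # weights drawn from +/- init_range
        }

    def init_weights(n_weights, individual):
        r = individual["init_range"]
        return [random.uniform(-r, r) for _ in range(n_weights)]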

A hint for the practitioner: It helps to introduce (either by hand or
dynamically) precision and range 'knobs' into the simulation, which makes it
possible to start with low precision and a large range. It also helps to
average at least 10, better 20 or more, individual network runs into a
single function evaluation. The GA could in principle deal with this noise,
but it is hard pressed when confronted with networks which sometimes do and
sometimes don't converge.
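
A minimal sketch of that averaging, assuming a hypothetical
train_once(individual) function that trains one freshly initialized network
and returns a score such as epochs to convergence (with a penalty value for
runs that never converge):

    import statistics

    def fitness(individual, train_once, n_runs=20):
        # Average many independent training runs into one function evaluation,
        # so that nets which sometimes do and sometimes don't converge do not
        # feed the GA pure noise.
        scores = [train_once(individual) for _ in range(n_runs)]
        return statistics.mean(scores)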

Cheers, Henrik Klagges

IBM Research
rick at vee.lrz-muenchen.de & henrik at mpci.llnl.gov


