No subject

Jonathon Baxter jon at johann
Tue Jun 6 06:52:25 EDT 2006


Ray White writes:

>
> Larry Fast writes:
>
> > I'm expanding the PDP Backprop program (McClelland & Rumelhart version 1.1)
> > to compensate for the following problem:
>
> > As Backprop passes the error back through multiple layers, the gradient has
> > a built-in tendency to decay.  At the output the maximum slope of
> > the 1/(1 + e^(-sum)) activation function is 0.25.
> > Each successive layer multiplies this slope by a further factor of at most 0.25.
> .....
>
> > It has been suggested (by a couple of sources) that an attempt should be
> > made to have each layer learn at the same rate. ...
>
> > The new error function is:  errorPropGain * act * (1 - act)
>
> This suggests to me that we are too strongly wedded to precisely
> f(sum) = 1/(1 + e^(-sum)) as the squashing function.  That function
> certainly does have a maximum slope of 0.25.
>
> A nice way to increase that maximum slope is to choose a slightly different
> squashing function.  For example f(sum) = 1/(1 + e^(-4*sum)) would fill
> the bill, or if you'd rather have your output run from -1 to +1, then
> tanh(sum) would work.  I think that such changes in the squashing function
> should automatically improve the maximum-slope situation, essentially by
> doing the "errorPropGain" bookkeeping for you.
>
> Such solutions are static fixes. I suggested a dynamic adjustment of the
> learning parameter for recurrent backprop at IJCNN-90 in San Diego
> (The Learning Rate in Back-Propagation Systems: an Application of Newton's
> Method, IJCNN 90, vol I, p 679). The method amounts to dividing the
> learning rate parameter by the square of the gradient of the output
> function (subject to an empirical minimum divisor). One should be able
> to do something similar with feedforward systems, perhaps on a layer by
> layer basis.
>
> - Ray White (white at teetot.acusd.edu)

The fact that the error "decays" when backpropagated through several
layers is not a "problem" with the BP algorithm; it's merely a reflection
of the fact that earlier weights contribute less to the error than later
weights.  If you go around changing the formula for the error at each
weight, the resulting learning algorithm is no longer gradient descent,
and hence there is no guarantee that it will reduce the network's error.
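
To make the point concrete, here is the usual delta recursion for a
two-layer logistic net as a small numpy sketch (the layer sizes and
variable names are arbitrary, chosen only for illustration).  The
act * (1 - act) slope factor, which never exceeds 0.25, is applied once
per layer, and that is all the "decay" amounts to:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(4, 3))      # input -> hidden weights (sizes arbitrary)
    W2 = rng.normal(size=(2, 4))      # hidden -> output weights
    x = rng.normal(size=3)
    t = np.array([0.0, 1.0])          # target

    h = sigmoid(W1 @ x)               # hidden activations
    y = sigmoid(W2 @ h)               # output activations

    # Output delta: raw error scaled by the sigmoid slope y*(1 - y) <= 0.25.
    delta_out = (y - t) * y * (1.0 - y)
    # Hidden delta: picks up another slope factor h*(1 - h), so its magnitude
    # is typically smaller again -- the "decay" under discussion.
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)

    print(np.linalg.norm(delta_out), np.linalg.norm(delta_hid))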

Ray White's solution is preferable, since it still performs gradient descent
on the network's error, although doing things on a layer-by-layer basis
would be wrong.
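
For a feedforward net I imagine White's adjustment looking roughly like
the following (the epsilon floor and the choice to apply the correction
per output unit are my own guesses, not a transcription of the IJCNN
paper):

    import numpy as np

    def adjusted_learning_rate(eta, output_act, eps=0.01):
        # Divide the base learning rate by the squared slope of the output
        # squashing function, with an empirical minimum divisor (eps) so the
        # rate cannot blow up where the unit is nearly saturated.
        slope = output_act * (1.0 - output_act)   # d/dsum of 1/(1 + e^(-sum))
        return eta / np.maximum(slope ** 2, eps)

    # A unit saturated near 1 gets a larger effective rate (capped by eps)
    # than one sitting near activation 0.5, where the slope is maximal.
    print(adjusted_learning_rate(0.1, np.array([0.5, 0.95, 0.999])))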

I have experimented a little with keeping the magnitude of the error vector
constant in feedforward backprop nets (by dividing the error vector by
its magnitude) and have found a significant (roughly 10x) speedup on small
problems (xor, encoder-decoders, etc.).  The speedup is most noticeable
on problems where the "solution" is a set of infinite weights, so that an
approximate solution is reached only by traversing vast, flat regions of
weight space.  Presumably there is a lot of literature out there on this
kind of thing.
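
The modification itself is a single line in the update.  Roughly (the small
constant guarding against a zero-length error vector, and the placement of
the rescaling before the output slope factor, are my own choices):

    import numpy as np

    def normalized_backprop_step(W1, W2, x, t, eta, eps=1e-12):
        # One weight update for a one-hidden-layer logistic net, with the
        # output error vector rescaled to unit length before it is
        # backpropagated.  Names and shapes are illustrative only.
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        h = sigmoid(W1 @ x)
        y = sigmoid(W2 @ h)

        err = y - t
        err = err / (np.linalg.norm(err) + eps)   # keep |error| constant (= 1)

        delta_out = err * y * (1.0 - y)
        delta_hid = (W2.T @ delta_out) * h * (1.0 - h)

        W2 -= eta * np.outer(delta_out, h)
        W1 -= eta * np.outer(delta_hid, x)
        return W1, W2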

Another idea is to calculate the matrix of second derivatives (the Hessian,
grad(grad E)) as well as the first derivatives (grad E), and from this
information construct the (unique) quadratic surface in weight space that has
the same derivatives.  The weights are then updated so as to jump to the
center (minimum) of that surface.  I haven't coded this idea yet; has anyone
else looked at this kind of thing, and if so, what were the results?
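
In matrix form the jump I have in mind is just the Newton step,
delta_w = -H^(-1) grad E, where H = grad(grad E).  A toy numpy
illustration on a purely quadratic error surface (the surface itself is
invented for the example):

    import numpy as np

    def newton_step(grad_E, hess_E):
        # Jump to the stationary point of the quadratic surface that matches
        # the current gradient and Hessian.  (For a non-quadratic E this point
        # can be a saddle or a maximum rather than a minimum.)
        return -np.linalg.solve(hess_E, grad_E)

    # Toy quadratic: E(w) = 0.5 * w.A.w - b.w, whose true minimum is A^(-1) b.
    # A single Newton step from anywhere lands exactly on it.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    w = np.zeros(2)
    grad = A @ w - b
    w = w + newton_step(grad, A)              # Hessian of E is A everywhere
    print(w, np.linalg.solve(A, b))           # the two agree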

Jon Baxter - jon at degas.cs.flinders.oz.au


