No subject
Ray White
white at teetot.acusd.edu
Fri Oct 25 19:49:14 EDT 1991
Larry Fast writes:
> I'm expanding the PDP Backprop program (McClelland & Rumelhart version 1.1) to
> compensate for the following problem:
> As Backprop passes the error back through multiple layers, the gradient has
> a built-in tendency to decay. At the output, the maximum slope of
> the 1/(1 + e^(-sum)) activation function is 0.5.
> Each successive layer multiplies this slope by a maximum of 0.5.
.....
> It has been suggested (by a couple of sources) that an attempt should be
> made to have each layer learn at the same rate. ...
> The new error function is: errorPropGain * act * (1 - act)
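In code, the quoted scheme amounts to scaling the usual logistic derivative
term, act * (1 - act), by a constant gain as the error is passed back through
a layer. A minimal sketch, in Python purely for illustration; the names and
the gain value here are placeholders, not identifiers from the PDP program:

    def layer_delta(act, back_error, error_prop_gain=4.0):
        # f'(sum) = act * (1 - act) for f(sum) = 1/(1 + e^(-sum)); its maximum
        # is 0.25, so a gain of 4.0 would raise that maximum to 1.0.
        return error_prop_gain * act * (1.0 - act) * back_error

    # Example: a unit at act = 0.5 now passes the error back undiminished.
    print(layer_delta(act=0.5, back_error=0.1))   # 0.1 rather than 0.025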
All this suggests to me that we are too strongly wedded to precisely
f(sum) = 1/(1 + e^(-sum)) as the squashing function. That function
certainly does have a maximum slope of 0.25.
A nice way to increase that maximum slope is to choose a slightly different
squashing function. For example, f(sum) = 1/(1 + e^(-4*sum)) would fill
the bill (its maximum slope is 1.0), or if you'd rather have your output run
from -1 to +1, then tanh(sum) would work (also with a maximum slope of 1.0).
I think such changes in the squashing function should automatically improve
the maximum-slope situation, essentially by doing the "errorPropGain"
bookkeeping for you.
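A quick numerical check (Python again, purely for illustration and not part
of the PDP code) bears this out: the standard logistic tops out at a slope of
0.25, while both alternatives top out at 1.0.

    import numpy as np

    # Numerically estimate the maximum slope of each squashing function.
    x = np.linspace(-10.0, 10.0, 200001)

    def max_slope(f):
        return np.max(np.gradient(f(x), x))

    print(max_slope(lambda s: 1.0 / (1.0 + np.exp(-s))))        # ~0.25
    print(max_slope(lambda s: 1.0 / (1.0 + np.exp(-4.0 * s))))  # ~1.0
    print(max_slope(np.tanh))                                   # ~1.0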
Such solutions are static fixes. I suggested a dynamic adjustment of the
learning parameter for recurrent backprop at IJCNN-90 in San Diego
("The Learning Rate in Back-Propagation Systems: An Application of Newton's
Method," IJCNN-90, Vol. I, p. 679). The method amounts to dividing the
learning rate parameter by the square of the gradient of the output
function (subject to an empirical minimum divisor). One should be able
to do something similar with feedforward systems, perhaps on a
layer-by-layer basis.
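In code, that adjustment might look something like the following. This is
only a sketch of the idea as stated above, assuming a logistic squashing
function; the constants base_rate and min_divisor are illustrative
placeholders, not values from the paper.

    import numpy as np

    def effective_learning_rate(act, base_rate=0.5, min_divisor=0.01):
        # Divide the base learning rate by the squared slope of the output
        # function, with an empirical floor on the divisor so that the step
        # size cannot blow up where the slope is tiny.
        slope = act * (1.0 - act)          # f'(sum) for the logistic function
        return base_rate / np.maximum(slope ** 2, min_divisor)

    # Units near saturation (act close to 0 or 1) get a much larger effective
    # rate than units near act = 0.5, where the slope is at its maximum.
    print(effective_learning_rate(np.array([0.5, 0.9, 0.99])))   # [ 8. 50. 50.]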
- Ray White (white at teetot.acusd.edu)
Please respond directly to 72247.2225 at compuserve.com
Thanks, Larry Fast