Summary of "Does backprop need the derivative ??"
fac paul munro
munro at lis.pitt.edu
Thu Feb 11 11:14:44 EST 1993
Forgive the review of college math, but there are a few points that,
while obvious to many, might be worth reviewing here...
[1] The gradient of a well-behaved single-valued function
    of N variables (here, the error as a function of the
    weights) is generally orthogonal to an (N-1)-dimensional
    manifold on which the function is constant (an iso-error
    surface).
[2] The effect on the function of an infinitesimal motion in the
    space can be computed as the inner (dot) product of the
    gradient vector with the movement vector; thus, as long as
    the dot product between the gradient and the delta-w vector
    is negative, the error will decrease.  That is, the new
    iso-error surface will correspond to a lower error value.
[3] This implies that the signs of the error terms alone are
    adequate to reduce the error, assuming the learning rate is
    sufficiently small, since any two vectors whose corresponding
    components all have the same sign must have a positive inner
    product!  [They lie in the same orthant of the space.]
    (See the sketch just after this list.)
Having said all this, I must point out that the argument pertains
only to single patterns.  That is, eliminating the derivative term
is guaranteed to reduce the error only for the pattern that is
presented.  Its effect on the error summed over the training set is
not guaranteed, even for batch learning...
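A contrived illustration of this caveat (again Python/numpy with made-up
numbers, not from the original discussion): one weight, a sigmoid output,
and two patterns, one of which is saturated.  Dropping the derivative
inflates the saturated pattern's vote, so the summed derivative-free step
moves the weight uphill on the total error, while the true batch gradient
step moves it downhill.

    # Contrived sketch of the caveat: one weight, sigmoid output, two
    # patterns.  Pattern B is saturated (tiny derivative), so dropping
    # the derivative inflates its contribution.  The summed
    # "derivative-free" batch step then points uphill on the TOTAL error
    # even though the true batch gradient points downhill.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def total_error(w, xs, ts):
        ys = sigmoid(w * xs)
        return 0.5 * np.sum((ts - ys) ** 2)

    w  = 1.0
    xs = np.array([1.0, 5.0])     # pattern A, pattern B (illustrative)
    ts = np.array([1.0, 0.0])

    ys = sigmoid(w * xs)
    errs = ts - ys
    true_grad = np.sum(-errs * ys * (1.0 - ys) * xs)  # ~ -0.02 (increase w to go downhill)
    df_step   = 0.01 * np.sum(errs * xs)              # ~ -0.047 (decreases w)

    print("total error before        :", total_error(w, xs, ts))            # ~0.5295
    print("after derivative-free step:", total_error(w + df_step, xs, ts))  # ~0.5303 (worse)
    print("after true gradient step  :", total_error(w - 0.01 * true_grad, xs, ts))  # slightly better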
One more caveat: Of course, if the nonlinear part of the units'
transfer function is non-monotonic (i.e., the sign of the
derivative varies), be sure to throw the derivative back in!
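For completeness, a tiny sketch of that last caveat (same disclaimer:
the setup and numbers are made up for illustration): with a bump-shaped
unit f(z) = exp(-z^2), the derivative is negative for positive net
input, so dropping it flips the sign of the weight change and the
pattern error goes up rather than down.

    # Sketch of the last caveat: with a non-monotonic unit f(z) = exp(-z**2)
    # the derivative can be negative, so a "derivative-free" step can point
    # the wrong way.  Here it does, and the pattern error rises.
    import numpy as np

    def f(z):
        return np.exp(-z ** 2)

    def err(w, x, t):
        return 0.5 * (t - f(w * x)) ** 2

    w, x, t = 1.0, 1.0, 1.0          # illustrative; net input w*x = 1 > 0, so f' < 0
    y = f(w * x)
    e = t - y
    fprime = -2.0 * (w * x) * y      # negative here

    step_true = 0.01 * e * fprime * x   # proper delta rule: decreases w
    step_nofp = 0.01 * e * x            # derivative dropped: increases w

    print("error before       :", err(w, x, t))
    print("with derivative    :", err(w + step_true, x, t))   # smaller
    print("derivative dropped :", err(w + step_nofp, x, t))   # larger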
- Paul Munro