Summary of "Does backprop need the derivative ??"

fac paul munro munro at lis.pitt.edu
Thu Feb 11 11:14:44 EST 1993



Forgive the review of college math, but there are a few issues that, while
obvious to many, might be worth reviewing here...

[1] The gradient of a well-behaved single-valued function
    of N variables (here, the error as a function of the
    weights) is generally orthogonal to an (N-1)-dimensional
    manifold on which the function is constant (an iso-error
    surface).

[2] The effect on the function of an infinitesimal movement in
    weight space can be computed as the inner (dot) product of
    the gradient vector with the movement vector; thus,
    as long as the dot product between the gradient and the
    delta-w vector is negative, the error will decrease.
    That is, the new iso-error surface will correspond to a lower
    error value.

[3] This implies that the signs alone are adequate to reduce
    the error, assuming the learning rate is sufficiently small,
    since any two vectors whose components agree in sign
    must have a positive inner product! [They lie in the same
    orthant of the space.]  (See the small sketch after this list.)
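
A tiny numerical illustration of [2] and [3] (my own sketch, in Python,
not part of the original argument): take an arbitrary gradient vector,
keep only the signs of its components for the update, and the dot product
with the gradient still comes out negative, so the first-order change in
the error is a decrease.

    import random

    random.seed(0)
    grad = [random.uniform(-1.0, 1.0) for _ in range(5)]  # an arbitrary gradient dE/dw
    eta = 0.01

    sign = lambda g: (g > 0) - (g < 0)
    delta_w = [-eta * sign(g) for g in grad]              # keep only the signs

    # First-order change in the error: dE ~ grad . delta_w
    dE = sum(g * d for g, d in zip(grad, delta_w))
    print(dE)  # negative: the sign-only step is still a descent direction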

Having said all this, I must point out that the argument pertains
only to single patterns.  That is, eliminating the derivative term
is guaranteed to reduce the error for the pattern that is presented.
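
As a concrete (and assumed) instance of that single-pattern guarantee:
one sigmoid unit, squared error, and an update that drops the
sigmoid-derivative factor.  Since the sigmoid's derivative is positive,
the derivative-free update agrees in sign with the true gradient, and
with a small learning rate the error on that one pattern goes down.

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    w, x, t = [0.5, -1.0], [1.0, 0.3], 0.0   # assumed weights, input, target
    eta = 0.1

    def error(weights):
        y = sigmoid(sum(wi * xi for wi, xi in zip(weights, x)))
        return 0.5 * (y - t) ** 2

    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    # Derivative-free update: (y - t) * x_i, the sigmoid-prime factor is dropped
    w_new = [wi - eta * (y - t) * xi for wi, xi in zip(w, x)]

    print(error(w), error(w_new))  # the error for this single pattern decreases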

Its effect on the error summed over the training set is not 
guaranteed, even for batch learning...  
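
A contrived sketch of how the batch guarantee can fail (my own example,
not from the post): one weight, two patterns, one of them deep in the
sigmoid's saturated region.  The saturated pattern contributes a large
derivative-free term but a tiny true gradient, so the summed
derivative-free update can point opposite to the summed true gradient,
in which case the batch step raises the total error to first order.

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    w = 6.0                                  # assumed weight, deliberately saturating
    patterns = [(1.0, 0.0), (0.03, 1.0)]     # assumed (input, target) pairs

    true_grad = 0.0   # sum over patterns of (y - t) * sigmoid'(net) * x
    no_deriv  = 0.0   # same sum with the derivative factor dropped
    for x, t in patterns:
        y = sigmoid(w * x)
        true_grad += (y - t) * y * (1.0 - y) * x
        no_deriv  += (y - t) * x

    # Here the two sums have opposite signs, so the derivative-free
    # batch step moves w against the true batch gradient.
    print(true_grad, no_deriv)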

One more caveat: Of course, if the nonlinear part of the units'
transfer function is non-monotonic (i.e., the sign of the
derivative varies), be sure to throw the derivative back in!
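
And a last sketch for that caveat (with an assumed Gaussian-shaped
transfer function of my choosing, f(z) = exp(-z^2)): where the derivative
is negative, dropping it flips the sign of the update, and the step climbs
the error surface instead of descending it.

    import math

    f      = lambda z: math.exp(-z * z)      # non-monotonic transfer function (assumed)
    fprime = lambda z: -2.0 * z * f(z)       # its derivative changes sign at z = 0

    w, x, t = 1.0, 1.0, 0.0                  # single weight, single pattern (assumed)
    y = f(w * x)

    true_grad = (y - t) * fprime(w * x) * x  # backprop term with the derivative
    no_deriv  = (y - t) * x                  # derivative dropped

    # fprime(1.0) < 0, so the two terms have opposite signs:
    # the derivative-free step would move w the wrong way here.
    print(true_grad, no_deriv)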

- Paul Munro



