Summary of "Does backprop need the derivative ??"

Kenyon Miller miller at picard.ads.com
Mon Feb 15 11:32:44 EST 1993


Paul Munro writes:

> [3] This implies that the signs of the errors is adequate to reduce
>     the error, assuming the learning rate is sufficiently small,
>     since any two vectors with all components the same sign
>     must have a positive inner product! [They lie in the same
>     orthant of the space]

I believe a critical point is being missed: the derivative is being
replaced by its sign at every stage of applying the chain rule, not
just in the initial backpropagation of the error.  Consider the
following example:

      ----n2-----
     /           \
w--n1             n4
     \           /
      ----n3-----

In other words, there is an output neuron n4 which is connected to two
neurons n2 and n3, each of which is connected to neuron n1, which has
a weight w.  Suppose the weight connecting n2 to n4 is negative and all
other connections in the diagram are positive.  Suppose further that
n2 is saturated and none of the other neurons are saturated.  Now,
suppose that n4 must be decreased in order to reduce the error.
Backpropagating along the n4-n2-n1 path, w receives an error term
which would tend to increase n1, while backpropagating along the
n4-n3-n1 path would result in a term which would tend to decrease n1.
If the true sigmoid derivative were used, the force to increase n1
would be dampened because n2 is saturated, and the net result would
be to decrease w and therefore decrease n1, n3, and n4, as desired.
However, replacing the sigmoid derivative with a constant could easily
allow the n4-n2-n1 path to dominate, so that w and n1 would increase,
n3 would increase, and the error at the output would increase.  Thus,
the substitution is not sound, regardless of how many patterns are
used for training.
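
For concreteness, here is a small numerical sketch of this network
(the sigmoid units and the particular weights below are just
illustrative assumptions, chosen so that n2 saturates and the
n4-n2-n1 path carries the larger weight magnitudes):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Illustrative parameters (assumed, not from the original post):
# w12, w13 connect n1 to n2 and n3; w24 (negative) and w34 connect
# n2 and n3 to the output n4.  w12 is large so that n2 saturates.
x = 1.0                  # input feeding n1 through weight w
w = 0.5
w12, w13 = 8.0, 1.0
w24, w34 = -3.0, 1.0
target = 0.0             # we want n4 to decrease

# Forward pass
net1 = w * x;          n1 = sigmoid(net1)
net2 = w12 * n1;       n2 = sigmoid(net2)   # saturated
net3 = w13 * n1;       n3 = sigmoid(net3)
net4 = w24 * n2 + w34 * n3
n4 = sigmoid(net4)

def backprop(deriv):
    """Return dE/dw for E = 0.5*(n4 - target)^2, using `deriv`
    in place of the sigmoid derivative at every stage."""
    d4 = (n4 - target) * deriv(net4)
    d2 = d4 * w24 * deriv(net2)      # delta at n2 (n4-n2-n1 path)
    d3 = d4 * w34 * deriv(net3)      # delta at n3 (n4-n3-n1 path)
    d1 = (d2 * w12 + d3 * w13) * deriv(net1)
    return d1 * x

true_grad  = backprop(dsigmoid)          # true chain rule
const_grad = backprop(lambda net: 1.0)   # derivative replaced by a constant

print("true dE/dw     = %+.6f" % true_grad)
print("constant dE/dw = %+.6f" % const_grad)

With these numbers the true gradient comes out small and positive, so
gradient descent decreases w and hence n1, n3, and n4, as it should;
the constant-derivative version comes out large and negative, so a
step along it increases w and drives the output error up.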


  -Ken Miller.


