correction to backprop example

Kenyon Miller miller at picard.ads.com
Thu Feb 18 11:51:18 EST 1993


For those of you who have lost interest in the backprop debate about
replacing the sigmoid derivative with a constant, please disregard
this message.

It was recently pointed out to me that my backprop example was incomplete
(I don't know the name of the sender):

> The error need not be increased although w increased because W1-3 decreased
> and W3-4 decreased. With 2 decreases and 1 increase, one could still expect
> the N4 to decrease and also the error.
> Rgds,
> TH

My original example (with typographical corrections) was:

Consider the following example:

      ----n2-----
     /           \
w--n1             n4
     \           /
      ----n3-----

In other words, there is an output neuron n4 which is connected to two
neurons n2 and n3, each of which is connected to neuron n1, which has
a weight w.  Suppose the weight connecting n2 to n4 is negative and all
other connections in the diagram are positive.  Suppose further that
n2 is saturated and none of the other neurons are saturated.  Now,
suppose that n4 must be decreased in order to reduce the error.
Backpropagating along the n4-n2-n1 path, w receives an error term
which would tend to increase n1, while backpropagating along the
n4-n3-n1 path would result in a term which would tend to decrease n1.
If the true sigmoid derivative were used, the force to increase n1
would be dampened because n2 is saturated, and the net result would
be to decrease w and therefore decrease n1, n3, n4, and the error.
However, replacing the sigmoid derivative with a constant could easily
allow the n4-n2-n1 path to dominate, and the error at the output would
increase. 
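The sign flip in dE/dw can be checked numerically. Below is a minimal sketch of the network above; all the concrete numbers (the weight values, the input x, the constant 0.25) are my own choices, picked so that n2 saturates and the negative n2-n4 weight dominates when the sigmoid derivative is replaced by a constant. It computes only the dE/dw term discussed here, not the perturbations on the other weights.

```python
import math

def sig(a):
    return 1.0 / (1.0 + math.exp(-a))

def dsig(a):
    s = sig(a)
    return s * (1.0 - s)

# Hypothetical numbers (my own) matching the example's conditions.
x   = 1.0    # external input feeding n1 through weight w
w   = 1.0    # the weight under discussion
w12 = 6.0    # n1 -> n2, large and positive, so n2 saturates
w13 = 1.0    # n1 -> n3, positive
w24 = -2.0   # n2 -> n4, the one negative connection
w34 = 1.0    # n3 -> n4, positive
t   = 0.0    # target below n4, so n4 must decrease

# Forward pass
a1 = w * x;        n1 = sig(a1)
a2 = w12 * n1;     n2 = sig(a2)    # close to 1: saturated
a3 = w13 * n1;     n3 = sig(a3)
a4 = w24 * n2 + w34 * n3
n4 = sig(a4)

err = n4 - t       # dE/dn4 for E = (n4 - t)^2 / 2

def grad_w(d):
    """dE/dw by the chain rule, with d() in place of the sigmoid slope."""
    delta4 = err * d(a4)
    delta2 = delta4 * w24 * d(a2)   # the n4-n2-n1 path
    delta3 = delta4 * w34 * d(a3)   # the n4-n3-n1 path
    return (delta2 * w12 + delta3 * w13) * d(a1) * x

g_true  = grad_w(dsig)               # true sigmoid derivative
g_const = grad_w(lambda a: 0.25)     # derivative replaced by a constant

print(g_true, g_const)   # opposite signs: the constant version
                         # pushes w the wrong way
```

With the true derivative the gradient on w is positive (descent decreases w, n1, n3, n4, and the error); with the constant, the saturated n4-n2-n1 path is no longer dampened, the gradient comes out negative, and descent increases w and the error.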

The conclusion was that replacing the sigmoid derivative with a constant
can result in increasing the error, and is therefore undesirable.


CORRECTION TO THE EXAMPLE:


The original example did not take into account the perturbation on
W1-3 and W3-4, but the argument still holds with the following modification.
Whatever the perturbations on W1-3 and W3-4 turn out to be, a situation
can be constructed in which some positive perturbation on w counteracts
them and results in an increase in the output error.  Now replicate the
n1-n2-n4 path as necessary
by adding an n1-n5-n4 path, an n1-n6-n4 path etc.  Each new path 
results in incrementing w by some constant delta, so there must exist
some number of paths which results in a sufficient increase in w
to cause an increase in the output error of the network.  Thus, an
example can be constructed in which the error increases, so the method
cannot be considered theoretically sound.


However, you can get virtually all of the benefit without any of the 
theoretical problems by using the derivative of the piecewise-linear function

               -------------------
              /
            /
          /
---------

which involves using a constant or zero for the derivative, depending 
on a simple range test.
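A minimal sketch of that rule (the breakpoints -2 and 2, and the names, are my own choices for illustration):

```python
def ramp(a, lo=-2.0, hi=2.0):
    """Piecewise-linear squashing function: 0 below lo, 1 above hi,
    a straight line in between."""
    if a <= lo:
        return 0.0
    if a >= hi:
        return 1.0
    return (a - lo) / (hi - lo)

def ramp_deriv(a, lo=-2.0, hi=2.0):
    """Derivative by a simple range test: a constant slope inside the
    linear region, exactly zero in the flat (saturated) regions."""
    return 1.0 / (hi - lo) if lo < a < hi else 0.0
```

Because the derivative is exactly zero once a unit saturates, a saturated path contributes nothing to the backpropagated error, so the dominance problem in the example above cannot arise.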

  -Ken Miller.
