Does backprop need the derivative ??

Javier Movellan movellan at cogsci.UCSD.EDU
Mon Feb 8 20:33:19 EST 1993


My experience with Boltzmann machines and GRAIN/diffusion networks
(the continuous stochastic version of the Boltzmann machine) has been
that replacing the real gradient by its sign times a constant
accelerates learning DRAMATICALLY. I first saw this technique in one
of the original CMU tech reports on the Boltzmann machine. I believe
Peterson and Hartman, and Peterson and Anderson, also used this
technique, which they called "Manhattan updating", with the
deterministic Mean Field learning algorithm. I believe they had an
article in "Complex Systems" comparing Backprop and Mean-Field, each
with both standard gradient descent and Manhattan updating.
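For concreteness, the update rule amounts to something like the
following rough Python sketch (the function name, the toy quadratic
error surface, and the step sizes are illustrative choices of mine,
not the setup from any of the experiments mentioned above):

    import numpy as np

    def manhattan_update(w, grad, step=0.01):
        # Manhattan updating: ignore the gradient's magnitude and move
        # each weight a fixed amount in the downhill direction.
        return w - step * np.sign(grad)

    # Toy example: a badly scaled quadratic, where the true gradient is
    # tiny along one dimension and huge along the other.
    A = np.diag([100.0, 0.01])
    grad_fn = lambda w: A @ w          # gradient of 0.5 * w' A w
    w = np.array([1.0, 1.0])
    for _ in range(500):
        w = manhattan_update(w, grad_fn(w), step=0.005)

On a surface like this, plain gradient descent with a fixed learning
rate either crawls along the flat dimension or becomes unstable along
the steep one, while the sign-based step makes the same fixed progress
in every dimension.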

It is my understanding that the Mean-Field/Boltzmann chip developed at
Bellcore uses "Manhattan Updating" as its default training method.
Josh Alspector is the person to contact about this.

At this point I've tried 4 different learning algorithms with
continuous and discrete stochastic networks, and in all cases Manhattan
updating worked better than straight gradient descent. The question is
why Manhattan updating works so well (at least in stochastic and
Mean-Field networks).

One possible interpretation is that Manhattan updating limits the
influence of outliers and thus performs something similar to robust
regression. Another interpretation is that Manhattan updating avoids
the saturation regions, where the error space becomes almost
flat in some dimensions, slowing down learning.

One of the disadvantages of Manhattan updating is that one sometimes
needs to reduce the weight change constant at the end of learning. But
we sometimes do this in standard gradient descent anyway.
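In the same rough Python notation, reducing the constant near the end
of learning might look like this (the cutoff and decay factor are
arbitrary, just to show the idea):

    import numpy as np

    A = np.diag([100.0, 0.01])
    grad_fn = lambda w: A @ w            # gradient of 0.5 * w' A w
    w = np.array([1.0, 1.0])

    step = 0.01
    for epoch in range(300):
        w = w - step * np.sign(grad_fn(w))   # Manhattan update
        if epoch >= 200:                     # near the end of learning
            step *= 0.98                     # shrink the fixed weight change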




           -Javier