To summarize the responses: the derivative CANNOT, in general, be replaced
with a constant. Below, several people indicate that they had trouble when
the derivative was replaced, while others say that it can be replaced. I
believe that it is problem dependent. I know of two small problems in which
you cannot change the derivative. They are:

 - the XOR-problem (I suppose everyone is familiar with that one) and
 - the so-called sine-problem:

        Try to teach a 1-3-1 network the sine function in the range from
        -pi to pi. The output neuron has a linear transfer function, the
        hidden neurons have a tanh transfer function. Backpropagation
        with individual (per-pattern) update is used to train the network
        (batch update has problems). I used 37 training samples in the
        given range and stopped training when the total error (the sum
        over all patterns of 0.5 times the squared difference between
        target and actual output) was smaller than 0.001 (a rough sketch
        of this setup follows below).
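
A sketch of that setup in Python; the learning rate, weight initialisation
and epoch limit are my own assumptions, not part of the original description:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-np.pi, np.pi, 37)     # 37 training samples
    t = np.sin(x)                          # targets: sine on [-pi, pi]

    # 1-3-1 network: tanh hidden units, linear output unit
    W1 = rng.uniform(-0.5, 0.5, 3); b1 = np.zeros(3)
    W2 = rng.uniform(-0.5, 0.5, 3); b2 = 0.0
    lr = 0.05                              # assumed learning rate

    for epoch in range(100000):            # assumed epoch limit
        total_error = 0.0
        for xi, ti in zip(x, t):           # "individual" (per-pattern) update
            h = np.tanh(W1 * xi + b1)      # hidden activations
            y = W2 @ h + b2                # linear output
            e = ti - y
            total_error += 0.5 * e * e
            delta_out = e                  # linear output: derivative is 1
            delta_hid = (W2 * delta_out) * (1.0 - h * h)   # tanh derivative
            W2 += lr * delta_out * h;  b2 += lr * delta_out
            W1 += lr * delta_hid * xi; b1 += lr * delta_hid
        if total_error < 0.001:
            break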

Furthermore, I would like to say that communicating in such a way is
very efficient and I would like to thank everyone for their responses
(and upcoming responses).

Heini Withagen
Department of Electrical Engineering EH 9.29
Eindhoven Technical University
P.O. Box 513
5600 MB Eindhoven
The Netherlands
Phone: 31-40472366
Fax:   31-40455674
E-mail: heiniw at eeb.ele.tue.nl

------------------------------------------------------------------------

David Bisant from Stanford Univ. wrote:

Those interested in this problem might want to take a look at an
obscure reference by Chen & Mars (Wash., DC IJCNN, Vol 1 pg 601,
1990).  They essentially drop the derivative term altogether from
the weight update equation for the output layer.  They claim that
it helps to avoid saturated units.  A magazine article (AI Expert,
July 1991) empirically compared this method with Fahlman's and
a few others on some toy problems (not a rigorous comparison, but
still informative).
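
As a minimal sketch of what dropping the output-layer derivative means for a
logistic output unit (this is only an illustration; the exact form used by
Chen & Mars may differ):

    def output_delta(t, y, drop_derivative=False):
        # Standard backprop multiplies the error by the sigmoid derivative
        # y*(1-y), which goes to zero when the unit saturates near 0 or 1.
        # Dropping it keeps the update proportional to the raw error.
        if drop_derivative:
            return t - y
        return (t - y) * y * (1.0 - y)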

Here are some other references where an attempt has been made to
simplify the activation and/or differential function:

Samad             IJCNN 90 (Wash DC)
                   & Honeywell Tech Report SSDC-89-14902-3
  
Rezgui            IJCNN 90 (Wash DC)

Tepedelenlioglu   IEEE ICSE 89


------------------------------------------------------------------------

Guido Bugmann from King's College London wrote:

I have developed a model of a formal neuron using
micro-circuits of pRAM neurons. In order to train the
parameters of the pRAMs composing the formal neuron,
I had to rewrite backpropagation for this case.
At some stage, I found that propagating back only
the sign (+1 or -1) of the error was enough. But it
turned out that this technique was restricted to cases
where the weights had to converge toward their maximum
or minimum value. For problems where intermediate weights
were optimal, the more refined information of the size
of the error for each example was required. (By "error" I
mean the whole expression which is backpropagated.)

------------------------------------------------------------------------

Scott E. Fahlman from Carnegie Mellon University wrote:

Interesting.  I just tried this on encoder problems and a couple of other
simple things, and leapt to the conclusion that it was a general
phenomenon.  It seems plausible to me that any "derivative" function that
preserves the sign of the error and doesn't have a "flat spot" (stable
point of 0 derivative) would work OK, but I don't know of anyone who has
made an extensive study of this.
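
One commonly cited way to get such a "derivative" with no flat spot is to add
a small constant to the sigmoid derivative so that it never reaches zero (the
0.1 offset below is an assumption for illustration, not a figure from this
message):

    def sigmoid_prime_no_flat_spot(y, offset=0.1):
        # y is the unit's output; y*(1-y) is the true sigmoid derivative.
        # The offset keeps the result away from zero, so saturated units
        # can still learn, while the sign of the error is preserved.
        return y * (1.0 - y) + offset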

------------------------------------------------------------------------

George Bolt from University of York, U.K. wrote:

I've looked at BP learning in MLP's w.r.t. fault tolerance and found 
that the derivative of the transfer function is used to *stop* learning.
Once a unit's weights for some particular input (to that unit rather than
the network) are sufficiently developed for it to decide whether to output
0 or 1, then weight changes are approximately zero due to this derivative.
I would imagine that by setting it to a constant, an MLP will over-learn
certain patterns and be unable to converge to a state of equilibrium,
i.e. one in which all patterns are matched to some degree.

A better route would be to set the derivative function to a constant
over a range [-r,+r], where f(|r|) -> 1.0. To make individual units
robust with respect to weights, take r = c*a, where f(|a|) -> 1.0 and
c is a small constant multiplicative factor.
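
A sketch of such a clipped pseudo-derivative for a logistic unit (the constant
value and the behaviour outside [-r, +r] are assumptions for illustration):

    import numpy as np

    def clipped_derivative(net, r, const=0.25):
        # Constant pseudo-derivative inside [-r, +r]; the ordinary sigmoid
        # derivative (nearly zero once f(|net|) -> 1) outside that range.
        y = 1.0 / (1.0 + np.exp(-net))
        return np.where(np.abs(net) <= r, const, y * (1.0 - y))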

------------------------------------------------------------------------

Joris van Dam from University of Amsterdam wrote:

At the University of Amsterdam, we have a single layer feed forward network
that computes the probabilities in one occupancy grid given the occupancy
probabilities in another grid that is rotated and translated with respect to 
the former. It turns out that a rather complex activation function needs to be
used, which also involves the computation of a complex derivative. (Note: it
can be easily computed from the activation). It is clear that in this case
the derivative cannot be omitted: LEARNING WOULD BE INCORRECT. The derivative
has a clear interpretation in the context of occupancy grids and the learning
procedure (with derivative !!!!!) can be related to Monte Carlo estimation
procedures. Omission of the derivative can thus be proven to be incorrect
and experiments have underlined this theory.
In my opinion the omission of the derivative is mathematically incorrect,
but it can be useful in some applications and may even speed up learning
(some derivatives have, as Scott Fahlman said, flat spots). However, it
seems that especially with complex networks and activation functions, the
derivative does need to be used.

------------------------------------------------------------------------

Javier Movellan wrote:

My experience with Boltzmann machines and GRAIN/diffusion networks
(the continuous stochastic version of the Boltzmann machine) has been
that replacing the real gradient by its sign times a constant
accelerates learning DRAMATICALLY. I first saw this technique in one
of the original CMU tech reports on the Boltzmann machine. I believe
Peterson and Hartman and Peterson and Anderson also used this
technique, which they called "Manhattan updating", with the
deterministic Mean Field learning algorithm. I believe they had an
article in "Complex Systems" comparing Backprop and Mean-Field with
both with standard gradient descent and with Manhattan updating. 

It is my understanding that the Mean-Field/Boltzmann chip developed at
Bellcore uses "Manhattan Updating" as its default training method.
Josh Alspector is the person to contact about this.

At this point I've tried 4 different learning algorithms with
continuous and discrete stochastic networks, and in all cases Manhattan
updating worked better than straight gradient descent. The question is
why Manhattan updating works so well (at least in stochastic and
Mean-Field networks).

One possible interpretation is that Manhattan updating limits the
influence of outliers and thus performs something similar to robust
regression. Another interpretation is that Manhattan updating avoids
the saturation regions, where the error surface becomes almost
flat in some dimensions, slowing down learning.

One of the disadvantages of Manhattan updating is that sometimes one
needs to reduce the weight change constant at the end of learning. But
sometimes we also do this in standard gradient descent anyway.
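
For reference, a minimal sketch of the Manhattan update rule described above
(the step size is an arbitrary choice):

    import numpy as np

    def manhattan_update(weights, gradient, step=0.01):
        # Step each weight by a fixed amount against the sign of its
        # gradient component; the gradient's magnitude is discarded.
        return weights - step * np.sign(gradient)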

------------------------------------------------------------------------

David G. Stork from Ricoh California Research Center wrote:

In an in-depth study of a particular hardware implementation of backprop,
we investigated the need for the derivative in the learning rule.  We found
that it was often essential to have such a derivative.  For instance, the
XOR problem could not be solved without it.  (Incidentally, this analysis
led to a patent:  "A method employing logical gates for calculating
activation function derivatives on stochastically-encoded signals", granted
to myself and Ron Keesing, US Patent # 5,157,275.)
     Without the derivative, one is not guaranteed to be doing gradient
descent in error space.

------------------------------------------------------------------------

Randy Shimabukuro wrote:

I am not familiar with Fahlman's paper, but I have looked at
approximating the derivative of the transfer function with a step
function. I also looked at other approximations which we made to
simplify the implementation of back-propagation in an integrated
circuit. The results were written up in the following reference.

Shimabukuro, Randy L., Shoemaker, Patrick A., Guest, Clark C., & Carlin,
Michael J.(1991) Effect of Circuit Parameters on Convergence of Trinary
Update Back-Propagation. Proceedings of the 1990 Connectionist
Models Summer School, Touretzky, D.S., Elman, J.L., Sejnowski, T.J., and
Hinton, G.E., Eds., pp. 152-158. Morgan Kaufmann, San Mateo, CA.

------------------------------------------------------------------------

Marwan Jabri from Sydney University wrote:

It is likely, as Scott Fahlman suggested, that any derivative which
"preserves" the error sign may do the job. The question, however, is the
implication in terms of convergence speed, and how this compares with
perturbation-type training methods.

------------------------------------------------------------------------

Radford Neal responded to Marwan Jabri's writing with:

One would expect this to work only for BATCH training. On-line training
approximates the batch result only if the net result of updating the 
weights on many training cases mimics the summing of derivatives in
the batch scheme. This will not be the case if a training case where the
derivative is +0.00001 counts as much as one where it is +10000.
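
A tiny numerical illustration of the point (the numbers are hypothetical):
suppose one weight sees per-case derivatives of +0.00001 and -3.0. The batch
sum is clearly negative, but sign-only per-case updates of +c and -c cancel:

    derivs = [+0.00001, -3.0]                        # hypothetical per-case derivatives
    batch = sum(derivs)                              # -2.99999: batch descent lowers the weight
    signs = sum(1 if d > 0 else -1 for d in derivs)  # 0: sign-only updates cancel out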

This is not to say it might not work in some cases. There's just no reason
to think that it will work generally.

------------------------------------------------------------------------

Jonathan Cohen wrote:

You might take a look at a paper by Nestor Schmajuk in Psychological
Review, 1992. The paper is about the role of the hippocampus, which, he
argues, implements a biologically plausible form of backprop. The
algorithm uses a hidden unit's activation rather than its derivative for
computing the error.  He doesn't give too broad a range of training
examples, but you might contact him to find out what else he has tried.
Hope this information is helpful.

------------------------------------------------------------------------

Jay McClelland wrote:

Some work has been done using the activation rather than the
derivative of the activation by Nestor Schmajuk.  He is interested
in biologically plausible models and tends to keep hidden units in the
bottom half of the sigmoid.  In that case they can be approximated
by exponentials and so the derivative can be approximated by the
activation.
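
A quick numeric check of that approximation, at an arbitrary point in the
lower half of the sigmoid:

    import numpy as np

    x = -4.0                      # an arbitrary point in the lower tail
    s = 1.0 / (1.0 + np.exp(-x))  # activation            ~= 0.0180
    print(s * (1.0 - s))          # derivative s*(1 - s)  ~= 0.0177, close to s
    print(np.exp(x))              # exponential approx.   ~= 0.0183, also close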

------------------------------------------------------------------------

John Kolen wrote:

The quick answer to your question is no, you don't need "the derivative";
you can use anything with the general qualitative shape of the derivative.
I have some empirical results from training feedforward networks with
different learning "functions", i.e. different squashing derivatives,
combination operators, etc.

------------------------------------------------------------------------

Gary Cottrell wrote:

I happen to know it doesn't work for a more complicated encoder
problem: image compression. When Paul Munro & I were first doing
image compression back in '86, the error would go down and then
back up! Rumelhart said: "there's a bug in your code", and indeed
there was: we had left out the derivative on the hidden units.

------------------------------------------------------------------------


