Does backprop need the derivative ??

Marwan Jabri marwan at sedal.su.oz.au
Sun Feb 7 18:13:36 EST 1993


> It is true that several studies show a sudden failure of backprop learning
> when you use fixnum arithmetic and reduce the number of bits per word.  The
> point of failure seems to be problem-specific, but is often around 10-14
> bits (including sign).
> 
> Marcus Hoehfeld and I studied this issue and found that the source of the
> failure was a quantization effect: the learning algorithm needs to
> accumulate lots of small steps, for weight-update or whatever, and since
> these are smaller than half the low-order bit, it ends up accumulating a
> lot of zeros instead.  We showed that if a form of probabilistic rounding
> (dithering) is used to smooth over these quantization steps, learning
> continues on down to 4 bits or fewer, with only a gradual degradation in
> learning time, number of units/weights required, and quality of the result.
> This study used Cascor, but we believe that the results hold for backprop
> as well.
> 
>     Marcus Hoehfeld and Scott E. Fahlman (1992) "Learning with Limited
>     Numerical Precision Using the Cascade-Correlation Learning Algorithm"
>     in IEEE Transactions on Neural Networks, Vol. 3, no. 4, July 1992, pp.
>     602-611.
> 

Yun Xie and I have tried similar experiments on the Sonar and ECG data,
and it is fair to say that standard backprop gives up at about 10 bits [2].
A closer look at the quantisation effects shows that the signal/noise
ratio depends on the number of layers [1]: as you go deeper, you require
less precision. This would be one source of variation between backprop
and Cascor.
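
To make the quantisation point concrete, here is a toy sketch (in Python,
purely for illustration; it is not the code from any of the studies above)
contrasting plain fixed-point rounding with the probabilistic rounding
(dithering) Scott describes. With deterministic rounding, updates smaller
than half the low-order bit are lost forever; with stochastic rounding they
survive in expectation.

    import random

    def round_fixed(x, lsb):
        # Deterministic rounding to the nearest multiple of the low-order
        # bit: any step smaller than lsb/2 is rounded back to the old value.
        return lsb * round(x / lsb)

    def round_stochastic(x, lsb):
        # Probabilistic rounding (dithering): round up with probability equal
        # to the fractional part, so small steps accumulate on average.
        lo = lsb * (x // lsb)
        p_up = (x - lo) / lsb
        return lo + lsb if random.random() < p_up else lo

    lsb = 1.0 / (1 << 8)       # 8 fractional bits (hypothetical precision)
    step = lsb / 10.0          # a weight update well below half an LSB

    w_det = w_sto = 0.0
    for _ in range(10000):
        w_det = round_fixed(w_det + step, lsb)       # stays stuck at 0
        w_sto = round_stochastic(w_sto + step, lsb)  # drifts to ~10000*step

    print(w_det, w_sto, 10000 * step)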

> Of course, a learning system implemented in analog hardware might have only
> a few bits of accuracy due to noise and nonlinearity in the circuits, but
> it wouldn't suffer from this quantization effect, since you get a sort of
> probabilistic dithering for free.
> 

Hmmm... precision also suffers from the number of operations in analog
implementations. The free dithering you get is everywhere, including in
your error signals! Gradient descent turns into a yo-yo. This is well
explained in [2, 3].

The best way of using backprop (or, more efficiently, conjugate gradient)
is to do the training off-chip and then download the (truncated) weights.
Our experience training real analog chips shows that some further in-loop
training is required. Note that our chips were ultra low power; you may
have fewer problems with strong-inversion implementations.
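
As a concrete illustration of what downloading the (truncated) weights
involves, here is a small Python sketch (hypothetical function, assuming a
symmetric weight range; illustrative only) that quantises off-chip trained
weights to the target on-chip precision before they are written to the
chip. In-loop fine-tuning would then proceed on the chip itself.

    def truncate_weights(weights, n_bits, w_max):
        # Quantise weights to n_bits (including sign) over [-w_max, w_max]
        # before downloading them to the chip. Illustrative only.
        levels = (1 << (n_bits - 1)) - 1
        scale = w_max / levels
        return [scale * max(-levels, min(levels, round(w / scale)))
                for w in weights]

    # e.g. weights trained off-chip (backprop or conjugate gradient),
    # truncated here to 6 bits before being loaded onto the chip
    chip_weights = truncate_weights([0.73, -1.12, 0.05, 2.4],
                                    n_bits=6, w_max=2.0)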

Regarding the idea of Simplex that has been suggested: the inquirer was
talking about on-chip learning. Have you, in your experiments, done a
limited-precision Simplex? Have you tried it on a chip in in-loop mode?
Philip Leong here tried a similar idea (I think) a while back. The
problem with this approach is that you need to have a very good guess at
your starting point, as the Simplex will move you from one vertex (feasible
solution) to another while expanding the weight solution space.
Philip's experience is that it does work for small problems when you have
a good guess!

At the last NIPS, there were four posters about learning in or for analog
chips. The inquirer may wish to consult these papers (at least two were
advertised as deposited in the neuroprose archive, one by Gert Cauwenberghs
and one by Barry Flower and me).

So far, for us, the most reliable algorithm for training analog chips has
been the combined search algorithm (modified weight perturbation and
partial random search) [3]. I would be very interested in hearing more
about experiments where analog chips are trained.
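
For readers who have not met perturbation-style in-loop training, the
sketch below (Python, illustrative; it shows plain weight perturbation
only, not the modified update rule or the partial random search of the
combined algorithm in [3]) gives the basic idea: perturb a weight on the
chip, measure the resulting change in error, and update from that
finite-difference estimate, so no explicit derivative or high-precision
backward pass is needed.

    def weight_perturbation_step(weights, measure_error, delta=0.01, lr=0.1):
        # One in-loop sweep of simple weight perturbation. measure_error() is
        # an assumed callback that runs the training set through the (analog)
        # network and returns the total error measured at the chip outputs.
        for i in range(len(weights)):
            base_error = measure_error(weights)   # error with current weights
            weights[i] += delta                   # perturb one weight on-chip
            grad_est = (measure_error(weights) - base_error) / delta
            weights[i] -= delta                   # undo the perturbation
            weights[i] -= lr * grad_est           # descend on the estimate
        return weights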

Marwan

[1] Y. Xie and M. Jabri, Analysis of the Effects of Quantization in
Multi-layer Neural Networks Using a Statistical Model, IEEE Transactions
on Neural Networks, Vol. 3, No. 2, pp. 334-338, March 1992.

[2] M. Jabri, S. Pickard, P. Leong and Y. Xie, Algorithms and
Implementation Issues in Analog Low Power Learning Neural Network Chips,
to appear in the International Journal on VLSI Signal Processing, early
1993, USA.

[3] Y. Xie and M. Jabri, On the Training of Limited Precision Multi-layer
Perceptrons, Proceedings of the International Joint Conference on Neural
Networks, pp. III-942-947, July 1992, Baltimore, USA.

-------------------------------------------------------------------
Marwan Jabri			       Email: marwan at sedal.su.oz.au
Senior Lecturer				      Tel: (+61-2) 692-2240
SEDAL, Electrical Engineering,		      Fax:         660-1228
Sydney University, NSW 2006, Australia     Mobile: (+61-18) 259-086
