forward propagation

BUTUROVIC%BUEF78%yubgef51.bitnet at BITNET.CC.CMU.EDU
Sun Oct 27 14:17:00 EST 1991


I am interested in training multi-layer perceptrons without using
back-propagation (BP) of the error.
 
MLP training by means of the BP algorithm is in fact minimization of
the criterion function using the ordinary gradient-descent algorithm.
For this, the computation of derivatives is necessary. Now, it is of
course possible to optimize a multi-variable function without computing
derivatives. One effective algorithm of this type is the simplex
algorithm of Nelder and Mead [1], so it seems logical to use it for
MLP training. There are two advantages in avoiding derivatives: first,
the transfer functions of the individual neurons may be
non-differentiable. Second, BP requires a criterion function written
in the form of the average squared difference between target and
actual outputs (there are variants of this, but for the purpose of
this discussion they differ insignificantly), and the derivative of
this function with respect to the weights must be computable. Using
the simplex method, i.e. not using derivatives, this limitation is
avoided, as long as the function to be minimized can be measured. This
can be important for applications in control, where we are sometimes
unable to express the criterion function analytically as a function of
the network parameters.
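 
To make the idea concrete, here is a minimal sketch of this kind of
training: the network weights are flattened into a single vector, and
the squared error is handed to a derivative-free simplex minimizer. It
uses Python with NumPy and SciPy's Nelder-Mead routine (a modern
convenience, not the setup of our experiments), and the toy XOR
problem, network size, and tolerances are illustrative assumptions
only.

    import numpy as np
    from scipy.optimize import minimize

    # Toy problem: XOR with a 2-2-1 network (9 weights including biases).
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    T = np.array([0., 1., 1., 0.])

    def unpack(w):
        # Split the flat weight vector into layer matrices and biases.
        W1 = w[0:4].reshape(2, 2)
        b1 = w[4:6]
        W2 = w[6:8]
        b2 = w[8]
        return W1, b1, W2, b2

    def forward(w, x):
        W1, b1, W2, b2 = unpack(w)
        h = np.tanh(x @ W1 + b1)       # hidden layer
        return np.tanh(h @ W2 + b2)    # output

    def criterion(w):
        # Mean squared error; only function *values* are needed here,
        # so the transfer function need not be differentiable.
        y = forward(w, X)
        return np.mean((y - T) ** 2)

    w0 = 0.5 * np.random.randn(9)      # random initial weights
    res = minimize(criterion, w0, method='Nelder-Mead',
                   options={'maxiter': 20000, 'xatol': 1e-6, 'fatol': 1e-8})
    print(res.fun)
    print(forward(res.x, X))

The same pattern applies when the criterion cannot be written
analytically: criterion() may simply run the system and return a
measured cost.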
 
There is one serious limitation of this algorithm: its spatial
complexity. The simplex consists of N+1 vertices, each a point in the
N-dimensional weight space, so it requires roughly N*N memory
locations, where N is the number of variables (network weights). In
practice, this limits the size of the network to a couple of thousand
weights.
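 
As a rough back-of-envelope illustration (double-precision storage is
my assumption, not a figure from our experiments), the simplex storage
grows quickly with the number of weights:

    # (N + 1) vertices, each with N coordinates, stored as 8-byte doubles.
    for n_weights in (100, 1000, 2000, 5000):
        simplex_bytes = (n_weights + 1) * n_weights * 8
        print(n_weights, round(simplex_bytes / 2**20, 1), "MB")
    # 2000 weights -> about 30 MB for the simplex alone.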
 
In order to verify the behavior of the algorithm, I performed
extensive experiments with Ljubomir Citkusev of Boston University.
We trained MLPs to perform classification tasks on three data sets. In
short, the results obtained indicate that training the network using
the simplex algorithm can be done successfully. However, BP is more
effective with regard to both classification accuracy (i.e., function
approximation accuracy) and computational complexity (number of
iterations). We have not yet verified the ability of the algorithm to
train networks with non-differentiable transfer functions or criterion
functions that cannot be computed analytically.
 
It is puzzling that in [2] Minsky and Papert claimed the training of
perceptrons with hidden layers to be impossible, while at that time
(1969) an effective algorithm for precisely that task was already
available. While BP proved superior in our experiments, they could
have done quite satisfactory training of multi-layer networks when
they wrote the book. I tried to discuss this with Minsky, but was
unable to.
 
I would like to hear people's opinion on this idea. Also, it would be
beneficial to know if anyone is aware of similar work.
 
        Thanks, Ljubomir Buturovic, University of Belgrade
 
References
[1] J. A. Nelder and R. Mead, Computer Journal, vol. 7, p. 308, 1965.
[2] M. Minsky and S. Papert, Perceptrons: An Introduction to
Computational Geometry, MIT Press, 1969.
 

