second derivatives and the back propagation network
Aaron Owens
owens at eplrx7.es.duPont.com
Tue Nov 12 12:03:25 EST 1991
RE: Second Derivatives and Stiff ODEs for Back Prop Training
Several threads in this newsgroup recently have mentioned the use
of second derivative information (i.e., the Hessian or Jacobian
matrix) and/or stiff ordinary differential equations [ODEs] in
the training of the back propagation network [BPN].
[-- Aside: Stiff differential equation solvers derive
their speed and accuracy by specifically utilizing
the information contained in the second-derivative
Jacobian matrix. -- ]
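To make the aside concrete (my notation, not the original post's): an
implicit step such as backward Euler for a system dw/dt = f(w) is computed
by solving a linear system built from the Jacobian J = \partial f/\partial w,

    w_{n+1} = w_n + h\, f(w_{n+1})
    \quad\Longrightarrow\quad
    \bigl(I - h\, J(w_n)\bigr)\,\Delta w = h\, f(w_n),

which is why stiff codes ask for, and profit from, an accurate Jacobian.
For the weight-training equations discussed below, f is the negative error
gradient, so J is (minus) the matrix of second derivatives of the error.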
This is to confirm our experience that training the BPN using
second-derivative methods in general, and stiff ODE solvers in
particular, is extremely fast and efficient for problems which
are small enough (i.e., up to about 1000 connection weights) to
allow the Jacobian matrix [size = (number of weights)**2] to be
stored in the computer's real memory. "Stiff" backprop is
particularly well-suited to real-valued function mappings in
which a high degree of accuracy is required.
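As an illustration only -- a minimal sketch in Python/SciPy, assuming a toy
1-5-1 tanh network and synthetic data, not the Du Pont code -- "stiff"
backprop amounts to integrating the gradient-flow equations dw/dt = -dE/dw
with a stiff integrator such as SciPy's BDF method:

import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))   # 50 one-dimensional training inputs
Y = np.sin(np.pi * X)                       # real-valued target function

n_hidden = 5                                # 1-5-1 network: 16 weights in all

def unpack(w):
    # Split the flat weight vector into layer matrices and bias vectors.
    W1 = w[:n_hidden].reshape(1, n_hidden)
    b1 = w[n_hidden:2 * n_hidden]
    W2 = w[2 * n_hidden:3 * n_hidden].reshape(n_hidden, 1)
    b2 = w[3 * n_hidden:]
    return W1, b1, W2, b2

def sse_gradient(w):
    # Analytic gradient of the sum-of-squared-errors, by the chain rule (backprop).
    W1, b1, W2, b2 = unpack(w)
    h = np.tanh(X @ W1 + b1)                # hidden-layer activations
    p = h @ W2 + b2                         # linear output layer
    r = p - Y                               # prediction residuals
    dW2 = h.T @ r
    db2 = r.sum(axis=0)
    dh = (r @ W2.T) * (1.0 - h ** 2)        # back-propagate through tanh
    dW1 = X.T @ dh
    db1 = dh.sum(axis=0)
    return np.concatenate([dW1.ravel(), db1, dW2.ravel(), db2])

def rhs(t, w):
    # Gradient flow: the weights move downhill in continuous "training time".
    return -sse_gradient(w)

w0 = rng.normal(scale=0.3, size=3 * n_hidden + 1)
# method="BDF" is a stiff solver; it builds (here by numerical differencing) the
# Jacobian of the right-hand side -- the second-derivative matrix of this post.
sol = solve_ivp(rhs, (0.0, 1e3), w0, method="BDF", rtol=1e-6, atol=1e-9)
w_final = sol.y[:, -1]

In this form the solver estimates the Jacobian numerically; the point made
below is that computing it analytically is faster and more accurate.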
We have been using this method successfully in most of our production
applications for several years. See the abstracts below of a paper
presented at the 1989 IJCNN in Washington and of a recently-issued
U. S. patent.
It is possible -- and desirable -- to use the back error propagation
methodology (i.e., the chain rule of calculus) to explicitly
compute the second derivative of the sum_of_squared_prediction_error
with respect to the weights (i.e., the Jacobian matrix) analytically.
Using an analytic Jacobian, rather than computing the second
derivatives numerically [or -- an UNVERIFIED personal hypothesis --
stochastically], increases the algorithm's speed and accuracy
significantly.
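Continuing the sketch above (same toy network; this reuses X, w0, unpack and
rhs, and is an illustration rather than the patented algorithm): one simple
way to hand the stiff solver an analytic second-derivative matrix is the
Gauss-Newton form H ~ Jr'Jr, where Jr, the Jacobian of the residuals with
respect to the weights, comes straight from the chain rule. The paper and
patent work with the exact second derivatives; the Gauss-Newton matrix used
here is an approximation that becomes accurate near a least-squares minimum.

def residual_jacobian(w):
    # d(residual_p)/d(weight_k), one row per training pattern, via the chain rule.
    W1, b1, W2, b2 = unpack(w)
    h = np.tanh(X @ W1 + b1)                # (50, 5) hidden activations
    dtanh = 1.0 - h ** 2                    # tanh' at the hidden units
    d_W1 = X * (W2.ravel() * dtanh)         # dp/dW1_j = x * W2_j * tanh'_j
    d_b1 = W2.ravel() * dtanh               # dp/db1_j = W2_j * tanh'_j
    d_W2 = h                                # dp/dW2_j = h_j
    d_b2 = np.ones((X.shape[0], 1))         # dp/db2   = 1
    return np.hstack([d_W1, d_b1, d_W2, d_b2])   # (50, 16), columns ordered as in unpack()

def rhs_jacobian(t, w):
    # Jacobian of dw/dt = -grad E(w): the negative (approximate) Hessian of E.
    Jr = residual_jacobian(w)
    return -(Jr.T @ Jr)

sol = solve_ivp(rhs, (0.0, 1e3), w0, method="BDF",
                jac=rhs_jacobian, rtol=1e-6, atol=1e-9)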
-- Aaron --
Aaron J. Owens
Du Pont Neural Network Technology Center
P. O. B. 80357
Wilmington, DE 19880-0357
Telephone Numbers:
Office (302) 695-7341 (Phone & FAX)
Home " 738-5413
Internet: owens at esvax.dnet.dupont.com
---------- IJCNN '89 paper abstract ------------
EFFICIENT TRAINING OF THE BACK PROPAGATION NETWORK BY SOLVING A
SYSTEM OF STIFF ORDINARY DIFFERENTIAL EQUATIONS
A. J. Owens and D. L. Filkin
Central Research and Development Department
P. O. Box 80320
E. I. du Pont de Nemours and Company (Inc.)
Wilmington, DE 19880-0320
International Joint Conference on Neural Networks
June 19-22, 1989, Washington, DC
Volume II, pp. 381-386
Abstract. The training of back propagation networks involves
adjusting the weights between the computing nodes in the artificial
neural network to minimize the errors between the network's
predictions and the known outputs in the training set. This
least-squares minimization problem is conventionally solved by an
iterative fixed-step technique, using gradient descent, which
occasionally exhibits instabilities and converges slowly. We show that
the training of the back propagation network can be expressed as a
problem of solving coupled ordinary differential equations for the
weights as a (continuous) function of time. These differential
equations are usually mathematically stiff. The use of a stiff
differential equation solver ensures quick convergence to the
nearest least-squares minimum. Training proceeds at a rapidly
accelerating rate as the accuracy of the predictions increases, in
contrast with gradient descent and conjugate gradient methods. The
number of presentations required for accurate training is reduced
by up to several orders of magnitude over the conventional method.
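Written out in the post's terms (my reconstruction, not quoted from the
paper), the training equations and their Jacobian are

    \frac{dw_i}{dt} = -\frac{\partial E}{\partial w_i},
    \qquad
    E(w) = \tfrac{1}{2}\sum_{p}\bigl(y_p - f(x_p; w)\bigr)^2,
    \qquad
    J_{ij} = -\frac{\partial^2 E}{\partial w_i\, \partial w_j},

so the "second-derivative Jacobian matrix" of this post is the negative
Hessian of the sum of squared errors, of size (number of weights)**2.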
---------- U. S. Patent No. 5,046,020 abstract ----------
DISTRIBUTED PARALLEL PROCESSING NETWORK WHEREIN THE CONNECTION
WEIGHTS ARE GENERATED USING STIFF DIFFERENTIAL EQUATIONS
Inventor: David L. Filkin
Assignee: E. I. du Pont de Nemours and Company
U. S. Patent Number 5,046,020
Sep. 3, 1991
Abstract. A parallel distributed processing network of the back
propagation type is disclosed in which the weights of connection
between processing elements in the various layers of the network
are determined in accordance with the set of steady solutions of
the stiff differential equations governing the relationship
between the layers of the network.