Dissertation available: Scaled Stochastic Methods for Training Neural Networks

Mike Lehr lehr at simoon.Stanford.EDU
Mon Feb 5 06:48:18 EST 1996


My PhD dissertation is available for electronic retrieval.  Retrieval
information appears at the bottom of this message.

This thesis deals with the problem of training large nonlinear feedforward
neural networks using practical stochastic descent methods. Four major
topics are explored:

(1) An O(N) statistical method that determines a reasonable estimate
    of the optimal time-varying learning parameter for the stochastic
    backpropagation procedure (last half of Chapter 4 and parts of
    Chapter 7).
(2) An accelerated O(N) learning procedure that performs an optimal
    stochastic update along an arbitrary search direction (Chapters 5
    and 7 and Appendix I). 
(3) An O(N) stochastic method that generates an estimate of the
    Newton direction (Chapters 6 and 7).
(4) Various O(N) methods that generate second-order and other
    essential information about a neural network's Sum Square Error
    and Mean Square Error surfaces (Appendices J and K and parts of
    Chapter 7).

An abstract follows.

---------------------------------------------------
Scaled Stochastic Methods for Training Neural Networks

		Michael A. Lehr
	Department of Electrical Engineering
	      Stanford University

	   Supervisor: Bernard Widrow

The performance surfaces of large neural networks contain ravines,
``flat spots,'' nonconvex regions, and other features that make weight
optimization difficult.  Although a variety of sophisticated
alternatives are available, the simple on-line backpropagation
procedure remains the most popular method for adapting the weights of
these systems.  This approach, which performs stochastic (or
incremental) steepest descent, is significantly hampered by the
character of the performance surface.  Backpropagation's principal
advantage over alternate methods rests in its ability to perform an
update following each pattern presentation, while maintaining time and
space demands that grow only linearly with the number of adaptive
weights.
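
As a point of reference, the per-pattern step that on-line backpropagation
takes can be written in generic notation (a standard form, not quoted from
the thesis) as

        w_{k+1} = w_k - \mu_k \nabla_w e_k^2(w_k),

where w_k is the weight vector after the k-th pattern presentation, e_k is
the instantaneous output error on that pattern, and \mu_k is the learning
parameter.  Backpropagation delivers the instantaneous gradient in time and
storage proportional to the number of weights N, which is the source of the
linear per-update cost noted above.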

In this dissertation, we explore new stochastic methods that improve
on the learning speed of the backpropagation algorithm, while
retaining its linear complexity.  We begin by examining the
convergence properties of two deterministic steepest descent methods.
Corresponding scaled stochastic algorithms are then developed from an
analysis of the neural network's Expected Mean Square Error (EMSE)
sequence in the neighborhood of a local minimum of the performance
surface.  To maintain stable behavior over broad conditions, this
development uses a general statistical model for the neural network's
instantaneous Hessian matrix.  For theoretical performance
comparisons, however, we require a more specialized statistical
framework.  The corresponding analysis helps reveal the complementary
convergence properties of the two updates---a relationship we exploit
by combining the updates to form a family of dual-update procedures.
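
To fix ideas, a scaled stochastic step along a search direction d_k can be
written in one generic form (the exact updates are developed in the thesis
and are not reproduced here) as

        w_{k+1} = w_k + \alpha_k d_k,
        \alpha_k = - E[d_k^T g_k] / E[d_k^T H_k d_k],

where g_k and H_k are the instantaneous gradient and instantaneous Hessian.
Only two scalar expectations enter the step length, and both can be tracked
from quantities available in O(N) time per pattern (d_k^T g_k from the
ordinary gradient, d_k^T H_k d_k from a Hessian-vector product computed with
a second backpropagation-like pass), so the scaling preserves linear
complexity.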

Effective methods are established for generating a slowly varying
sequence of search direction vectors and all required scaling
information.  The result is a practical algorithm which performs
robustly when the weight vector of a large neural network is placed at
arbitrary initial positions.  The two weight updates are scaled by
parameters computed from recursive estimates of five scalar sequences:
the first and second moments of the trace of the instantaneous Hessian
matrix, the first and second moments of the instantaneous gradient
vector's projection along the search direction, and the first moment
of the instantaneous Hessian's ``projection'' along the same
direction.
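
A minimal sketch of how such recursive scalar estimates might be maintained
is given below.  It is illustrative only: the exponential decay constant,
the Hessian-vector product routine (not shown), and the example scale
factor at the end are assumptions of this sketch, not the procedures
defined in the thesis.

    import numpy as np

    def update_moment(old, sample, beta=0.99):
        """Exponentially weighted recursive estimate of a scalar moment."""
        return beta * old + (1.0 - beta) * sample

    class ScalarEstimates:
        """Running estimates of the five scalar sequences named above:
        1st/2nd moments of tr(H_k), 1st/2nd moments of d_k^T g_k, and
        the 1st moment of d_k^T H_k d_k."""
        def __init__(self):
            self.trace_m1 = self.trace_m2 = 0.0
            self.proj_g_m1 = self.proj_g_m2 = 0.0
            self.proj_H_m1 = 1e-8   # small positive start avoids divide-by-zero

        def update(self, g, d, hess_trace, Hd, beta=0.99):
            # g: instantaneous gradient; d: current search direction;
            # hess_trace: trace of the instantaneous Hessian (or an estimate);
            # Hd: instantaneous Hessian-vector product H_k d, obtainable in
            #     O(N) with a second backpropagation-like pass (not shown).
            proj_g = float(np.dot(d, g))
            proj_H = float(np.dot(d, Hd))
            self.trace_m1 = update_moment(self.trace_m1, hess_trace, beta)
            self.trace_m2 = update_moment(self.trace_m2, hess_trace ** 2, beta)
            self.proj_g_m1 = update_moment(self.proj_g_m1, proj_g, beta)
            self.proj_g_m2 = update_moment(self.proj_g_m2, proj_g ** 2, beta)
            self.proj_H_m1 = update_moment(self.proj_H_m1, proj_H, beta)

From estimates like these, a Newton-like step length along d could be
formed as, say, -est.proj_g_m1 / est.proj_H_m1; how the thesis actually
combines the five estimates into its two scale parameters is developed in
the chapters and appendices listed at the top of this message.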

-----------------------------------------------------------
RETRIEVAL INFORMATION: 

The thesis is available at the URL:

http://www-isl.stanford.edu/people/lehr

in four gzipped postscript files: thesis1.ps.gz, thesis2.ps.gz,
thesis3.ps.gz, and thesis4.ps.gz.  The corresponding uncompressed
postscript files will be available at the same location for the time
being. Including front matter, the thesis contains 405 pages formatted
for two-sided printing (size: 2.2M compressed, 10.3M uncompressed).

The files can also be obtained by anonymous ftp from the directory
/pub/lehr/thesis on any of simoon.stanford.edu (36.10.0.209),
boreas.stanford.edu (36.60.0.210), or zephyrus.stanford.edu
(36.60.0.211).

Sorry, hardcopies are not available.

