NIPS workshop summary

Scott A. Markel x2683 sam at sarnoff.com
Wed Dec 16 15:57:30 EST 1992


                            NIPS 92 Workshop Summary
                            ========================

                Computational Issues in Neural Network Training
                ===============================================

Main focus: Optimization algorithms used in training neural networks
----------

Organizers: Scott Markel and Roger Crane
----------

This was a one day workshop exploring the use of optimization algorithms, such
as back-propagation, conjugate gradient, and sequential quadratic programming,
in neural network training.  Approximately 20-25 people participated in the
workshop.  About two thirds of the participants used some flavor of back
propagation as their algorithm of choice, with the other third using conjugate
gradient, sequential quadratic programming, or something else.  I would guess
that participants were split about 60-40 between industry and the academic
community.


The workshop consisted of lots of discussion and the following presentations:

Introduction
------------
Scott Markel (David Sarnoff Research Center - smarkel at sarnoff.com)

I opened by saying that Roger and I are mathematicians and started looking at
neural network training problems when neural net researchers were experiencing
difficulties with back-propagation.  We think there are some wonderfully
advanced and robust implementations of classical algorithms developed by the
mathematical optimization community that are not being exploited by the neural
network community.  This is due largely to a lack of interaction between the two
communities.  This workshop was set up to address that issue.  In July we
organized a similar workshop for applied mathematicians at SIAM '92 in Los
Angeles.


Optimization Overview
---------------------
Roger Crane (David Sarnoff Research Center - rcrane at sarnoff.com)

Roger gave a very brief, but broad, historical overview of optimization
algorithm research and development in the mathematical community.  He showed a
time line starting with gradient descent in the 1950's and progressing to
sequential quadratic programming (SQP) in the 1970's and 1980's.  SQP is the
current state of the art optimization algorithm for constrained optimization.
It's a second order method that solves a sequence of quadratic approximation
problems.  SQP is quite frugal with function evaluations and handles both linear
and nonlinear constraints.  Roger stressed the robustness of algorithms found in
commercial packages (e.g. NAG library) and that reinventing the wheel was
usually not a good thing to do since many subtleties will be missed.  A good
reference for this material is

Practical Optimization
Gill, P. E., Murray, W., and Wright, M. H.
Academic Press: London and New York
1981

Roger's overview generated a lot of discussion.  Most of it centered around the
fact that second order methods involve using the Hessian, or an approximation to
it, and that this is impractical for large problems (> 500-1000 parameters).
Participants also commented that the mathematical optimization community has not
yet fully realized this and that stochastic optimization techniques are needed
for these large problems.  All classical methods are inherently deterministic
and work only for "batch" training.
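
As an aside for readers outside the optimization community, the sketch below
(mine, not something presented at the workshop) contrasts a first order
gradient step with a Newton-type step on a toy quadratic loss.  The point the
discussion kept returning to is visible in the code: the second order step
needs the full n x n Hessian and a linear solve, which is what becomes
impractical once n climbs past a few hundred parameters.

    # Illustrative sketch only: first order vs. second order update on a
    # toy quadratic loss f(w) = 0.5 * w'Aw - b'w.  Storing and factoring
    # the Hessian costs O(n^2) memory and O(n^3) time per step.
    import numpy as np

    def gradient_step(w, grad, lr=0.01):
        """First order update: O(n) work per step."""
        return w - lr * grad(w)

    def newton_step(w, grad, hess):
        """Second order update: solve H d = -g for the step direction."""
        g = grad(w)
        H = hess(w)                    # dense n x n matrix
        d = np.linalg.solve(H, -g)     # cubic-cost factorization
        return w + d

    A = np.array([[3.0, 0.5], [0.5, 1.0]])
    b = np.array([1.0, -2.0])
    grad = lambda w: A @ w - b
    hess = lambda w: A

    w0 = np.zeros(2)
    print(gradient_step(w0, grad))      # small move downhill
    print(newton_step(w0, grad, hess))  # exact minimizer in one step here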


SQP on a Test Problem
---------------------
Scott Markel (David Sarnoff Research Center - smarkel at sarnoff.com)

I followed Roger's presentation with a short set of slides showing actual
convergence of a neural network training problem where SQP was the training
algorithm.  Most of the workshop participants had not seen this kind of
convergence before.  Yann Le Cun noted that with such sharp convergence,
generalization would probably be pretty bad.  I noted that sharp convergence
was necessary if one was trying to do something like count local minima, where
generalization is not an issue.


In Defense of Gradient Descent
------------------------------
Barak Pearlmutter (Oregon Graduate Institute - bap at merlot.cse.ogi.edu)

By this point back propagation and its many flavors had been well defended from
the audience.  Barak's presentation captured the main points in a clarifying
manner.  He gave examples of real application neural networks with thousands,
millions, and billions of connections.  This underscored the need for stochastic
optimization techniques.  Barak also made some general remarks about the
characteristics of error surfaces.  Some earlier work by Barak on gradient
descent and second order momentum can be found in the NIPS-4 proceedings
(p. 887).  A strong plea was made by Barak, and echoed by the other
participants, for fair comparisons between training methods.  Fair comparisons
are rare, but much needed.
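
To make the "stochastic" point concrete, here is a minimal sketch (my own
gloss, not code from Barak's talk) of per-example gradient descent with
momentum.  The weights are updated after every example rather than after a
full pass over the data, which keeps the cost per update independent of the
training set size.

    # Minimal per-example (stochastic) gradient descent with momentum.
    # grad_example(w, example) is assumed to return the gradient of the
    # loss on a single example.
    import numpy as np

    def sgd_momentum(w, grad_example, data, lr=0.01, mu=0.9, epochs=5, seed=0):
        rng = np.random.default_rng(seed)
        v = np.zeros_like(w)
        for _ in range(epochs):
            for i in rng.permutation(len(data)):
                g = grad_example(w, data[i])   # gradient on one example
                v = mu * v - lr * g            # momentum accumulates past steps
                w = w + v
        return w

    # Toy usage: fit y = x . w_true with squared error, one example at a time.
    w_true = np.array([2.0, -1.0])
    X = np.random.default_rng(1).normal(size=(200, 2))
    data = [(x, x @ w_true) for x in X]
    grad_example = lambda w, ex: 2.0 * (ex[0] @ w - ex[1]) * ex[0]
    print(sgd_momentum(np.zeros(2), grad_example, data))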


Very Fast Simulated Reannealing
-------------------------------
Bruce Rosen (University of Texas at San Antonio - rosen at ringer.cs.utsa.edu)

This presentation focused on a new optimization technique called Very Fast
Simulated Reannealing (VFSR), which is faster than Boltzmann Annealing (BA) and
Fast (Cauchy) Annealing (FA).  Unlike back propagation, which Bruce considers
mostly a method for pattern association/classification/generalization, simulated
annealing methods are perhaps best suited to function optimization.  He
presented results comparing Very Fast Simulated Reannealing to genetic
algorithms (GA) on function optimization, along with more recent results
comparing BA, FA, and VFSR.
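
What makes VFSR "very fast" is its cooling schedule.  The sketch below is my
own gloss on the three schedules (it is not Ingber's implementation, which
also uses a special per-dimension generating distribution; the real code is
available from netlib as described next):

    # Rough comparison of annealing temperature schedules, plus a generic
    # accept/reject loop for function optimization.  Illustrative only.
    import numpy as np

    def boltzmann_T(k, T0=1.0):
        return T0 / np.log(k + 2.0)              # logarithmically slow cooling

    def cauchy_T(k, T0=1.0):
        return T0 / (k + 1.0)                    # faster, 1/k cooling

    def vfsr_T(k, T0=1.0, c=1.0, D=2):
        return T0 * np.exp(-c * k ** (1.0 / D))  # exponentially fast cooling

    def anneal(f, x0, T_schedule, steps=5000, seed=0):
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        fx = f(x)
        best_x, best_f = x.copy(), fx
        for k in range(steps):
            T = max(T_schedule(k), 1e-12)
            cand = x + rng.normal(scale=T, size=x.shape)
            fc = f(cand)
            if fc < fx or rng.random() < np.exp(-(fc - fx) / T):
                x, fx = cand, fc
                if fx < best_f:
                    best_x, best_f = x.copy(), fx
        return best_x, best_f

    # Toy function optimization, the use case Bruce emphasized.
    rosenbrock = lambda x: (1 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2
    print(anneal(rosenbrock, [-1.0, 1.0], vfsr_T))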

Bruce's (and Lester Ingber's) code is available from netlib -

Interactive:
        ftp research.att.com
        [login as netlib, your_login_name as password]
        cd opt
        binary
        get vfsr.Z
Email:
        mail netlib at research.att.com
        send vfsr from opt
        
Contact Bruce (rosen at ringer.cs.utsa.edu) or Lester
(ingber at alumni.cco.caltech.edu) for further information.


General Comments
----------------
Yann Le Cun (AT&T Bell Labs - yann at neural.att.com)

I asked Yann to summarize some of the comments he and others had been making
during the morning session.  Even though we didn't give him much time to
prepare, he nicely outlined the main points.  These included

- large problems require stochastic methods
- the mathematical community hasn't yet addressed the needs of the neural
  network community
- neural network researchers are using second order information in a variety of
  ways, but are definitely exploring uncharted territory
- symmetric sigmoids are necessary; [0,1] sigmoids cause scaling problems
  (Roger commented that classical methods would accommodate this)
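
A small numerical illustration of the last point (mine, not from the talk):
with zero-mean inputs, a symmetric sigmoid such as tanh keeps hidden unit
outputs roughly zero-mean, while a [0,1] logistic pushes them all positive,
which skews the scale of the inputs seen by the next layer.

    # Compare mean activation of tanh vs. [0,1] logistic on zero-mean inputs.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(10000, 20))            # zero-mean inputs
    W = rng.normal(scale=0.5, size=(20, 20))    # random first-layer weights
    z = x @ W

    print("tanh mean:    ", np.tanh(z).mean())                  # close to 0
    print("logistic mean:", (1.0 / (1.0 + np.exp(-z))).mean())  # close to 0.5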


Cascade Correlation and Greedy Learning
---------------------------------------
Scott Fahlman (Carnegie Mellon University - scott.fahlman at cs.cmu.edu)

Scott's presentation started with a description of QuickProp.  This algorithm
was developed in an attempt to address the slowness of back propagation.
QuickProp uses second order information, in the spirit of a modified Newton
method.  This was
yet another example of neural network researchers seeing no other alternative
but to do their own algorithm development.  Scott then described Cascade
Correlation.  CasCor and CasCor2 are greedy learning algorithms.  They build the
network, putting each new node in its own layer, in response to the remaining
error.  The newest node is trained to deal with the largest remaining error
component.  Papers on QuickProp, CasCor, and Recurrent CasCor can be found in
the neuroprose archive (see fahlman.quickprop-tr.ps.Z, fahlman.cascor-tr.ps.Z,
and fahlman.rcc.ps.Z).
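
For those who have not seen QuickProp's update, the sketch below shows the
per-weight secant step at its heart as I understand it from Fahlman's tech
report; the real implementation adds further safeguards (weight decay, an
added gradient term when steps stall), so treat it as illustrative only.

    # QuickProp-style per-weight step: fit a parabola through the current
    # and previous gradients and jump toward its minimum, capped by a
    # maximum growth factor mu.  Illustrative sketch, not Fahlman's code.
    import numpy as np

    def quickprop_step(grad_now, grad_prev, dw_prev, lr=0.1, mu=1.75):
        dw = np.empty_like(grad_now)
        for i in range(len(grad_now)):
            denom = grad_prev[i] - grad_now[i]
            if dw_prev[i] != 0.0 and denom != 0.0:
                step = grad_now[i] / denom * dw_prev[i]
                if abs(step) > mu * abs(dw_prev[i]):   # cap the jump
                    step = mu * abs(dw_prev[i]) * np.sign(step)
            else:
                step = -lr * grad_now[i]               # plain gradient step
            dw[i] = step
        return dw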


Comments on Training Issues
---------------------------
Gary Kuhn (Siemens Corporate Research - gmk at learning.siemens.com)

Gary presented

1. a procedure for training with stochastic conjugate gradient.
   (G. Kuhn and N. Herzberg, Some Variations on Training of Recurrent Networks,
    in R. Mammone & Y. Zeevi, eds, Neural Networks: Theory and Applications,
    New York, Academic Press, 1991, p 233-244.)

2. a sensitivity analysis that led to a change in the architecture of a speech
   recognizer and to further, joint optimization of the classifier and its 
   input features.  (G. Kuhn, Joint Optimization of Classifier and Feature
   Space in Speech Recognition, IJCNN '92, IV:709-714.)

He related Scott Fahlman's interest in sensitivity to Yann Le Cun's emphasis on
trainability, by showing how a sensitivity analysis led to improved
trainability.


Active Exemplar Selection
-------------------------
Mark Plutowski (University of California - San Diego - pluto at cs.ucsd.edu)

Mark gave a quick recap of his NIPS poster on choosing a concise subset of
exemplars for training.  Fitting this subset results in the entire training set
being fit as well as desired.  The method has so far only been used on
noise-free problems, but looks
promising.  Scott Fahlman expressed the opinion that exploiting the training
data was the remaining frontier in neural network research.
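
One plausible greedy reading of the idea (not necessarily Mark's exact
selection criterion) is sketched below: train on a growing subset, each time
adding the example the current fit handles worst, and stop once the whole
training set is fit to the desired tolerance.  The helpers "train" and
"errors" are assumed, not part of Mark's work.

    # Greedy exemplar selection sketch.  train(X_sub, y_sub) -> model;
    # errors(model, X, y) -> array of per-example errors on the full set.
    import numpy as np

    def select_exemplars(X, y, train, errors, tol=1e-3):
        subset = [int(np.argmax(np.abs(y - y.mean())))]   # seed with an extreme point
        while True:
            model = train(X[subset], y[subset])
            err = errors(model, X, y)
            if err.max() <= tol:                          # whole set fit well enough
                return subset, model
            worst = int(np.argmax(err))
            if worst in subset or len(subset) == len(X):  # cannot grow further
                return subset, model
            subset.append(worst)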


Final Summary
-------------
Incremental, stochastic methods are required for training large networks.
Robust, readily available implementations of classical algorithms can be used
for training modest sized networks and are especially effective research tools
for investigating mathematical issues, e.g. estimating the number of local
minima.

