Batch Backprop versus Incremental

William Finnoff finnoff at predict.com
Fri Jan 28 14:21:01 EST 1994


H Cribbs writes:

>Dear Connectionists,
>	I attended the 5th International Conference on Genetic Algorithms
>this summer, and in one of the sessions on combinations of genetic 
>algorithms (GAs) and Neural Nets(ANNs) a gentleman from the U.K.
>suggested that Batch mode learning could possibly be unstable in
>the long term for backpropagation.  I did not know the gentleman
>and when I asked for a reference he could not provide one.

>Does anyone have any kind of proof stating that one method is better 
>than another?  Or that possibly batch backprop is unstable in <<Some>>
>sense?


As has been known for quite some time, incremental backprop can be
viewed as a type of stochastic approximation algorithm which, in the
small step size limit, is essentially equivalent to the batch
algorithm.  Since the batch version of backprop is a type of gradient
descent, there are only two stability issues involved.  The first is
the possibility that the weights will become infinitely large, which
can happen if there is no genuine minimum (local or global) that can
be reached by the training process from a given starting point.  An
example where this occurs is when the function which one is trying to
approximate with the network contains a discontinuity (a step
function, say).  This sort of behavior is sometimes called "converging
to infinity".  One should note here that the error will still always
decrease during training (though it won't necessarily converge),
unless one encounters problems of "numerical instability".
   Problems of numerical instability are due to the fact that one 
is using only a discrete version of a "true" gradient descent.  That
is, the training algorithm with a constant step size can be viewed as 
the Euler approximation of the solution to a differential equation.     
For the solution to this differential equation, the error will always
decrease and will either converge to a (perhaps local) minimum of 
the error function, or to infinity as described above.  The error in
the discrete training process, however, may not always decrease, and
the process can "explode" if the step size is chosen too "large".
What counts as "large" depends essentially on the eigenvalues of the
matrix of second derivatives (the Hessian) of the error function: the
largest eigenvalue limits the step size that can be used without
instability, while the smallest determines how slowly the error
decreases along the flattest directions.  If the ratio between the
largest and smallest eigenvalue is "large", the differential equation
is referred to as being "stiff" or poorly conditioned.  The trade-off
that has to be achieved is between stability (achieved by having a
small step size) and speed of convergence (achieved by having a larger
step size).  It should be noted that the conditioning of the problem
will also affect the stability of the incremental version of backprop,
since it too is only an approximation of the solution to the same
differential equation.
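
   To spell out the step size condition (my notation; a sketch under
the assumption that the error is locally quadratic near a minimum):
the constant step size recursion

    W_{k+1} = W_k - \eta \nabla E(W_k)

is the Euler approximation of the gradient flow dW/dt = -\nabla E(W).
If H denotes the Hessian at a minimum W^*, with eigenvalues
0 < \mu_1 \le ... \le \mu_n, then each component of the deviation
W_k - W^* contracts only if |1 - \eta \mu_i| < 1 for all i, i.e.
\eta < 2/\mu_n, while the slowest component shrinks by the factor
(1 - \eta \mu_1); the ratio \mu_n / \mu_1 is the conditioning referred
to above.
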
   The problems of numerical stability can be reduced by using Newton
or quasi-Newton methods (sometimes a problem for neural networks, due
to the dimension of typical problems, where hundreds or thousands of
parameters may be involved) or by regularization, which modifies the
error function to improve the conditioning of the Hessian.  The
simplest form of regularization is simply to add a quadratic term in
the weights to the error function, i.e., if E(W) is the original error
function (viewed as a function of the vector of network weights
W = (w_i)_{i=1,\ldots,n}), then add the penalty term
P_{\lambda}(W) = \lambda \sum_{i=1}^{n} w_i^2, which leads to a
"weight decay" term in the training algorithm.
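
   Concretely (again in my notation), the gradient of the penalized
error E(W) + P_{\lambda}(W) is \nabla E(W) + 2 \lambda W, so the
gradient descent step becomes

    W_{k+1} = W_k - \eta \nabla E(W_k) - 2 \eta \lambda W_k,

i.e. each weight is shrunk ("decays") by the factor (1 - 2 \eta \lambda)
at every step, in addition to the usual error-driven update.
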
     This modification of the error function also has the effect of
preventing the training process from converging to infinity, and it
will often improve the generalization ability of the network trained
using the modified error function.  The disadvantage is that one then
has another parameter to choose (the weighting factor \lambda), and
that the penalty term tends to create additional local minima
(particularly around zero) in which one will frequently get stuck
while searching for a "good" solution to the minimization problem,
which brings us back to the incremental version of backprop.
   The advantages of using the incremental version of backprop
(in my opinion) have nothing to do with stability issues.  First, and
most obvious, is that it can be implemented online.  Second is the
question of efficiency: since weight updates are made more frequently,
the reduction in error can be much faster than with the batch
algorithm, although this will depend on the specific nature of the
data being used.  Finally, due to its stochastic nature, the
incremental version of backprop has a "quasi-annealing" character
which makes it less likely to get stuck in local minima than the batch
training process (this statement can be made fairly rigorous; consult
the references given below).
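
   As an illustration of the online update, here is a Python sketch of
one common way to exploit this stochastic character, using a step size
that decreases over time in the spirit of the learning rate schedules
discussed in reference 4 (the function names and the schedule
constants are my own illustrative choices, not taken from that paper):

    import random

    def train_incremental(W, patterns, grad_E_single,
                          eta0=0.1, tau=100.0, n_sweeps=50):
        # Online backprop with a decreasing step size
        # eta_t = eta0 / (1 + t/tau): the larger early steps give the
        # "quasi-annealing" behavior that helps the weights escape
        # poor local minima, while the later decay lets them settle.
        t = 0
        for _ in range(n_sweeps):
            random.shuffle(patterns)  # present patterns in random order
            for p in patterns:
                eta = eta0 / (1.0 + t / tau)
                W = W - eta * grad_E_single(W, p)
                t += 1
        return W
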

References:

1)  Battiti, R., First- and second-order methods for learning: Between steepest descent and Newton's method, {\it Neural Computation} 4, pp. 141-166, (1992).

2)  Benveniste, A., M\'etivier, M. and Priouret, P., {\it Adaptive Algorithms and Stochastic Approximations}, Springer Verlag, (1987).

3)  Bouton, C., Approximation Gaussienne d'algorithmes stochastiques \`a dynamique Markovienne (in French), Thesis, Paris VI, (1985).

4)  Darken, C. and Moody, J., Note on learning rate schedules for stochastic optimization, in {\it Advances in Neural Information Processing Systems 3}, Lippmann, R., Moody, J. and Touretzky, D. (eds.), Morgan Kaufmann, San Mateo, (1991).

5)  Finnoff, W., Asymptotics for the constant step size backpropagation algorithm and resistance to local minima, {\it Neural Computation} 6, pp. 285-293, (1994).

6)  Finnoff, W., The dynamics of generalization for first and second order parameter adaptation methods, submitted to {\it IEEE Trans.~on Information Theory}.

7)  Hornik, K. and Kuan, C.M., Convergence of learning algorithms with constant learning rates, {\it IEEE Trans. on Neural Networks} 2, pp. 484-489, (1991).

8)  Krasovskii, N., {\it The Stability of Motion}, Stanford University Press, Stanford, (1963).

9)  Le Cun, Y., Kanter, I. and Solla, S., Second order properties of error surfaces: Learning time and generalization, in {\it Advances in Neural Information Processing Systems 3}, Lippmann, R., Moody, J. and Touretzky, D. (eds.), pp. 918-924, Morgan Kaufmann, San Mateo, (1990).

10) White, H., Learning in artificial neural networks: A statistical perspective, {\it Neural Computation} 1, pp. 425-464, (1989).

11) White, H., Some asymptotic results for learning in single hidden-layer feedforward network models, {\it Journal of the American Statistical Association} 84, no. 408, pp. 1003-1013, (1989).



Cheers

William

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
William Finnoff
Prediction Co.
320 Aztec St., Suite B
Santa Fe, NM, 87501, USA

Tel.: (505)-984-3123
Fax:  (505)-983-0571

e-mail: finnoff at predict.com

