Why batch learning is slower

Thomas H. Hildebrandt thildebr at aragorn.csee.lehigh.edu
Sun Mar 29 10:07:02 EST 1992


   Date: Wed, 25 Mar 92 12:09:56 -0500
   From: Barak Pearlmutter <bap at james.psych.yale.edu>
   Reply-To: Pearlmutter-Barak at CS.YALE.EDU

Begin Pearlmutter quote:
   Of course, you are studying the "cyclic online presentation" case,
   which, although it fulfills the conditions of stochastic gradient,
   also has some additional structure.

   However, it is easy to convince yourself that this case does not
   permit a fixed step size.  Consider a 1-d system, with two patterns,
   one of which gives $E_1 = (w-1)^2$ and the other $E_2 = (w+1)^2.$
   Notice that $E = E_1 + E_2$ has a minimum at $w=0$.  But with a step
   size that does not go to zero, $w$ will flip back and forth forever.

				   --Barak Pearlmutter.
End quote
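Pearlmutter's 1-d example is easy to check numerically.  The following
sketch (my own; the starting point and step size are arbitrary) runs
the cyclic per-sample updates:

```python
# Pearlmutter's 1-d example: E1(w) = (w - 1)^2, E2(w) = (w + 1)^2.
# Their sum E = E1 + E2 is minimized at w = 0, but cyclic per-sample
# gradient descent with a fixed step size never settles there.

def online_epoch(w, eta):
    """One cyclic pass: a gradient step on E1, then one on E2."""
    w = w - eta * 2.0 * (w - 1.0)   # dE1/dw = 2(w - 1), pulls w toward +1
    w = w - eta * 2.0 * (w + 1.0)   # dE2/dw = 2(w + 1), pulls w toward -1
    return w

w = 0.7      # arbitrary starting weight
eta = 0.25   # fixed (non-decaying) step size
for _ in range(100):
    w = online_epoch(w, eta)

# w has settled into a limit cycle, flipping between -1/3 (after the
# E2 step) and +1/3 (after the E1 step), never reaching w = 0.
print(w)  # -> approximately -0.3333
```

With this step size the epoch map is w -> 0.25*w - 0.25, whose fixed
point is w = -1/3, so the weight flips back and forth forever exactly
as described in the quote.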

	One of the interesting results of the paper is that the
rapidity of convergence can be judged in terms of the redundancy of
the training set, where (as mentioned in the announcement) redundancy
is measured in terms of the correlation (inner product) between pairs
of samples.  

	A more thorough analysis is needed, but at first glance, it
appears that if the redundancy (collinearity, correlation) measure is
$R \in [-1, 1]$, then the convergence cost $C$ (the number of epochs
needed to approach the minimum) is given by

$C = 1 / (1 - |R|)$.
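A minimal sketch of this relation (assuming, per the announcement,
that $R$ is the normalized inner product between a pair of sample
vectors; the function names here are my own):

```python
import math

def redundancy(x, y):
    """R: normalized inner product (cosine) of two sample vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def C(R):
    """C = 1 / (1 - |R|): equals 1 for orthogonal samples,
    and grows without bound as the pair becomes collinear (|R| -> 1)."""
    return 1.0 / (1.0 - abs(R))

print(C(redundancy([1.0, 0.0], [0.0, 1.0])))   # orthogonal pair -> 1.0
print(C(redundancy([1.0, 1.0], [1.0, 0.9])))   # nearly collinear -> large
```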

One may consider the most redundant pair of samples to dominate the
convergence rate (one place where more analysis is needed).  If a pair
of samples is collinear, then as you have pointed out, the convergence
rate goes to zero.  The above equation gives this directly: the
redundancy $R$ is 1 in that case, so $C$ diverges and progress per
epoch vanishes.  If all of the samples are orthogonal, on the other
hand, the per-sample algorithm will find the absolute minimum in one
epoch ($R = 0$ for every pair, giving $C = 1$).  For intermediate
degrees of collinearity, $C$ gives a measure of the severity of the
"cross-stitching" which the MGD algorithm will suffer, with larger
values of $C$ indicating slower convergence.
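The orthogonal case can likewise be checked with a small sketch (a
single linear unit trained by the per-sample delta rule; the
per-pattern step size $1/\|x\|^2$, which solves each pattern exactly,
is my assumption, not a detail from the paper):

```python
# One per-sample epoch of the delta rule on a linear unit.  Because
# each update moves w only along the current input x, orthogonal
# inputs do not interfere with one another, and one epoch suffices.

def delta_rule_epoch(w, samples):
    """One pass: w <- w + eta * (t - w.x) * x, with eta = 1 / ||x||^2."""
    for x, t in samples:
        err = t - sum(wi * xi for wi, xi in zip(w, x))
        eta = 1.0 / sum(xi * xi for xi in x)
        w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return w

# Mutually orthogonal training inputs in 3-d, with arbitrary targets.
samples = [([1.0, 0.0, 0.0], 2.0),
           ([0.0, 2.0, 0.0], -4.0),
           ([0.0, 0.0, 4.0], 8.0)]

w = delta_rule_epoch([0.0, 0.0, 0.0], samples)
# Every pattern is fit exactly after a single epoch:
print(w)  # -> [2.0, -2.0, 2.0]
```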

All of the above discussion is in terms of the linear network
described in my paper, but it is hard to see how adding nonlinearity
to the problem can improve the situation, except by chance.

				Thomas H. Hildebrandt
				CSEE Department
				Lehigh University



More information about the Connectionists mailing list