Why batch learning is slower

Barak Pearlmutter bap at james.psych.yale.edu
Wed Mar 25 12:09:56 EST 1992


For a quadratic error surface, in the batch case, without momentum, it
is well known that a stepsize $\eta < 2/\lambda_{max}$ (where
$\lambda_{max}$ is the largest eigenvalue of the Hessian) gives
convergence to the minimum.
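
A minimal numerical sketch of this threshold (Python; the particular
Hessian and constants below are arbitrary illustrations, not from the
discussion above):

import numpy as np

# Batch gradient descent on the quadratic E(w) = 1/2 w^T H w,
# i.e. the update w <- w - eta * H w, converges exactly when eta < 2/lmax.
H = np.diag([1.0, 10.0])                       # an arbitrary Hessian; lmax = 10
lmax = np.max(np.linalg.eigvalsh(H))

for eta in (0.95 * 2 / lmax, 1.05 * 2 / lmax): # just below / just above 2/lmax
    w = np.array([1.0, 1.0])
    for _ in range(200):
        w = w - eta * (H @ w)                  # full (batch) gradient step
    print(eta, np.linalg.norm(w))              # ~0 below the threshold, huge above it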

However, this is not true in the online or "stochastic gradient
descent" case.

In that case, a fixed stepsize leads to convergence to a neighborhood
of the minimum, where the size of the neighborhood is determined by
the stepsize, since a larger stepsize amplifies the noise in the
sampled gradients more.  For this reason, the stepsize must go to zero
in the limit in the online case.
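
A small sketch of this effect (Python; the 1-d quadratic and the
unit-variance noise added to each gradient sample are illustrative
assumptions):

import numpy as np

rng = np.random.default_rng(0)

def late_spread(eta, steps=20000):
    # SGD on E(w) = w^2/2, where each gradient sample is the true
    # gradient w plus unit-variance noise.  With a fixed stepsize the
    # iterate never settles; it hovers around the minimum w* = 0.
    w, tail = 1.0, []
    for t in range(steps):
        g = w + rng.normal()          # noisy sample of the gradient
        w -= eta * g
        if t > steps // 2:
            tail.append(w)
    return np.std(tail)               # spread of the late iterates around 0

for eta in (0.1, 0.01, 0.001):
    print(eta, late_spread(eta))      # spread shrinks roughly like sqrt(eta/2)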

In fact, it can be shown that $\sum_t \eta(t)$ must go to infinity, to
prevent premature convergence to a non-minimum, while $\sum_t \eta(t)^2$
must remain finite, to guarantee convergence.  If these two conditions
are satisfied, convergence to a minimum is achieved with probability 1.
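
(The standard example of such a schedule is $\eta(t) = c/t$: the
harmonic series $\sum_t c/t$ diverges, while $\sum_t c^2/t^2$
converges, so both conditions are met.)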

Of course, you are studying the "cyclic online presentation" case,
which, although it satisfies the conditions for stochastic gradient
descent, also has some additional structure.

However, it is easy to convince yourself that a fixed step size does
not give convergence to the minimum in this case either.  Consider a
1-d system with two patterns, one of which gives $E_1 = (w-1)^2$ and
the other $E_2 = (w+1)^2$.  Notice that $E = E_1 + E_2$ has its minimum
at $w=0$.  But with a step size that does not go to zero, $w$ will flip
back and forth forever.
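
A minimal numerical check of this example (Python; the decaying
schedule $\eta(t) = 0.5/t$ below is just one choice satisfying the two
conditions above):

def cyclic_descent(stepsize, sweeps=5000):
    w = 0.5
    for t in range(1, sweeps + 1):
        eta = stepsize(t)
        w -= eta * 2 * (w - 1)        # gradient step on E_1 = (w-1)^2
        w -= eta * 2 * (w + 1)        # gradient step on E_2 = (w+1)^2
    return w

print(cyclic_descent(lambda t: 0.5))      # fixed eta: w flips between +1 and -1; prints -1.0
print(cyclic_descent(lambda t: 0.5 / t))  # eta(t) = 0.5/t: prints -1/sweeps, essentially 0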

				--Barak Pearlmutter.


