Paper in neuroprose
Thomas H. Hildebrandt
thildebr at aragorn.csee.lehigh.edu
Wed Mar 18 14:37:17 EST 1992
From: xiru Zhang <xiru at Think.COM>
Date: Wed, 18 Mar 92 10:33:51 EST
Excerpt from my abstract:
However, the batch algorithm takes only one step per epoch while the
per-sample algorithm takes $P$. Thus, the conclusion is that the batch
algorithm is $P$ times slower in a serial implementation.
Comment by Zhang:
The last argument is not sound: the direction computed from one
training example and the direction computed from all training examples
can be very different, so even if the step size is the same, the
convergence rate can differ. This may not be a serious problem for the
example in your paper, where the network has a linear (identity)
activation function and no hidden units, but in a multi-layer network
with non-linear units, not only is the step size important, the
direction of the step is at least equally important.
- Xiru Zhang
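For concreteness, the point about directions can be seen in a small
sketch for a single linear unit trained by least squares (an
illustration only; the data, learning rate, and names are made up).
The gradient computed from one example and the gradient summed over
all $P$ examples generally point different ways:

    import numpy as np

    rng = np.random.default_rng(0)
    P, n = 8, 3                         # P examples, n weights
    X = rng.normal(size=(P, n))         # inputs
    t = rng.normal(size=P)              # targets
    w = rng.normal(size=n)              # current weights

    def grad(wv, x, tk):
        # gradient of 0.5 * (w.x - t)^2 for a single example
        return (wv @ x - tk) * x

    g_one   = grad(w, X[0], t[0])                           # per-sample direction
    g_batch = sum(grad(w, X[k], t[k]) for k in range(P))    # batch direction

    cos = g_one @ g_batch / (np.linalg.norm(g_one) * np.linalg.norm(g_batch))
    print(f"cosine between the two directions: {cos:.2f}")  # usually well below 1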
My rebuttal:
The direction of the step taken by batch BP and the direction
taken by per-sample BP are BOTH BAD, in the sense that they do not
point directly at the minimum, but toward the bottom of the valley in
the direction of steepest descent -- a different thing entirely. In
the simplified case I examine, the batch algorithm steps daintily to
the bottom of the valley after one epoch. In the meantime, the P-S
algorithm is responding in turn to the individual components of the
error function. In doing so, it often crosses the center of the
valley in the total (batch) error surface. As a result, P-S takes
larger steps on average and ends up closer to the minimum than batch
does. In addition, this overstepping of the minimum makes the P-S
version of the algorithm more robust against local minima. This
subject is dealt with more fully in the text of the paper.
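A toy version of this comparison (a sketch only, not the experiment in
the paper; the data, learning rate, and names are invented) runs one
epoch of each rule on a small linear least-squares problem with the
same per-step learning rate and reports how far each ends up from the
exact minimum:

    import numpy as np

    rng = np.random.default_rng(1)
    P, n = 20, 2
    X = rng.normal(size=(P, n))
    t = rng.normal(size=P)
    eta = 0.02                                        # same step size for both rules
    w_star = np.linalg.lstsq(X, t, rcond=None)[0]     # exact minimum of the batch error

    def grad(wv, x, tk):
        return (wv @ x - tk) * x                      # single-example gradient

    # Batch BP: one step per epoch, using the gradient summed over all P examples.
    w_batch = np.zeros(n)
    w_batch = w_batch - eta * sum(grad(w_batch, X[k], t[k]) for k in range(P))

    # Per-sample BP: P steps per epoch, one per example, in turn.
    w_ps = np.zeros(n)
    for k in range(P):
        w_ps = w_ps - eta * grad(w_ps, X[k], t[k])

    print("batch      distance to minimum:", np.linalg.norm(w_batch - w_star))
    print("per-sample distance to minimum:", np.linalg.norm(w_ps - w_star))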
One thing I did not note in my paper is that, in the linear case,
a batch step will take you to the bottom of the valley by the shortest
route. This reduces the dimensionality of the search by 1, so after a
number of epochs equal to the number of dimensions in your search
space, you know that you are at the minimum.
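One numerical way to see the "bottom of the valley" property (a sketch
with a made-up quadratic error surface, and assuming the batch step
length is chosen by an exact line search, which is one reading of the
statement above): after such a step, no slope remains along the
direction just taken.

    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 2.0]])              # Hessian of a quadratic valley
    b = np.array([1.0, -1.0])
    w = np.array([2.0, 2.0])                # starting weights

    g = A @ w - b                           # gradient of 0.5*w'Aw - b'w
    eta = (g @ g) / (g @ (A @ g))           # exact line-search step length
    w_new = w - eta * g
    g_new = A @ w_new - b

    print(g_new @ g)                        # ~0: no slope left along the old direction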
However pleasant this mathematical fiction is, when you add
nonlinearity to the picture, all bets are off. It MAY BE that the
next step taken by the batch algorithm will place it squarely at the
bottom of the valley, where it will remain. The same can be true for
the P-S algorithm. However possible this may be, to me it does not
appear probable. The paper I have posted begs to be followed by one
which analyzes the likelihood of such a lucky step occurring, or at
least gives a stochastic view of the paths which the two algorithms
follow in approaching the minimum. I will look into it, time
permitting.
Viewed intuitively: In any guessing game, you are likely to do
better if the number of queries you are allowed to make increases.
The batch algorithm allows itself one query (as to the local slope of
the error surface) per epoch, while the per-sample algorithm gets $P$
times as many.
Thomas H. Hildebrandt
CSEE Department
Lehigh University