Paper in neuroprose
Thomas H. Hildebrandt
thildebr at athos.csee.lehigh.edu
Mon Mar 16 14:24:09 EST 1992
The following paper, which has been submitted to IEEE Transactions on
Neural Networks, is now available in PostScript format through the
neuroprose archive:
"Why Batch Learning is Slower Than Per-Sample Learning"
Thomas H. Hildebrandt
ABSTRACT
We compare the convergence properties of the batch and per-sample
versions of the standard backpropagation algorithm. The comparison is
made on the basis of ideal step sizes computed for the two algorithms
with respect to a simplified, linear problem. For either algorithm,
convergence is guaranteed as long as no step exceeds the minimum ideal
step size by more than a factor of 2. By limiting the discussion to a
fixed, safe step size, we can compare the maximum step that can be
taken by each algorithm in the worst case.
It is found that the maximum fixed safe step size is $P$ times smaller
for the batch version than for the per-sample version, where $P$ is
the number of training examples. This is offset somewhat by the fact
that the batch algorithm sums $P$ substeps in order to compute its
step, so the steps taken by the two algorithms are comparable in size.
However, the batch algorithm takes
only one step per epoch while the per-sample algorithm takes $P$.
Thus, the conclusion is that the batch algorithm is $P$ times slower
in a serial implementation.
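To give the flavor of the analysis, here is a sketch on the simplified
linear problem; the notation ($E$, $w$, $x_p$, $t_p$, $\eta$,
$\lambda_{\max}$) is illustrative and not necessarily that of the
paper. For the quadratic error

    $E(w) = \frac{1}{2} \sum_{p=1}^{P} (w \cdot x_p - t_p)^2$

the per-sample update $\Delta w = -\eta (w \cdot x_p - t_p) x_p$ is
stable on sample $p$ whenever $\eta < 2 / \|x_p\|^2$, while the batch
update $\Delta w = -\eta \sum_p (w \cdot x_p - t_p) x_p$ is stable
whenever $\eta < 2 / \lambda_{\max}$, where $\lambda_{\max}$ is the
largest eigenvalue of $H = \sum_p x_p x_p^T$. In the worst case of
fully collinear training vectors of equal length, $\lambda_{\max} =
\sum_p \|x_p\|^2 = P \|x\|^2$, so the safe fixed batch step is $P$
times smaller than the safe per-sample step.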
In response to last Fall's discussion involving Yann LeCun, Kamil
Grajski and others regarding the unexpectedly poor performance of
parallel implementations of the batch backpropagation algorithm, I
performed an analysis of the convergence speed of batch and per-sample
versions of the backpropagation algorithm, based on a calculation of the
ideal step size. The conclusion is that, even if there are as many
processors as training samples, the parallel implementation of a batch
algorithm which does not alter its step size adaptively during an
epoch can never be faster than the serial implementation of the
per-sample algorithm.
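As a concrete, if artificial, illustration of the effect, the toy
script below (mine, not taken from the paper) runs batch and
per-sample gradient descent on a small linear least-squares problem
with nearly collinear, i.e. highly redundant, inputs. Each step size
is fixed at half its worst-case safe limit, and the script counts the
epochs each method needs to drive the residual below a tolerance:

import numpy as np

rng = np.random.default_rng(0)
P, d = 50, 5                                   # P training samples, d inputs
base = rng.normal(size=d)
X = base + 0.1 * rng.normal(size=(P, d))       # nearly collinear input vectors
t = X @ rng.normal(size=d)                     # consistent linear targets

H = X.T @ X                                    # Hessian of the summed error
eta_batch = 1.0 / np.linalg.eigvalsh(H)[-1]    # half the 2/lambda_max limit
eta_per = 1.0 / max(np.sum(x * x) for x in X)  # half the worst-case 2/|x_p|^2 limit

def epochs_to_converge(one_epoch, eta, tol=1e-4, max_epochs=20000):
    w = np.zeros(d)
    for epoch in range(1, max_epochs + 1):
        w = one_epoch(w, eta)
        if np.linalg.norm(X @ w - t) < tol:
            return epoch
    return max_epochs

def batch_epoch(w, eta):                       # one summed step per epoch
    return w - eta * (X.T @ (X @ w - t))

def per_sample_epoch(w, eta):                  # P sequential steps per epoch
    for x, target in zip(X, t):
        w = w - eta * (x @ w - target) * x
    return w

print("batch epochs     :", epochs_to_converge(batch_epoch, eta_batch))
print("per-sample epochs:", epochs_to_converge(per_sample_epoch, eta_per))

On a problem like this the per-sample version typically needs on the
order of $P$ times fewer epochs, which is the point of the comparison:
even a perfectly parallel batch epoch cannot make up that factor.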
Because of the way the problem is approached, the analysis does not
quite go beyond the "multiple-copies" argument, as LeCun had hoped it
would. However, it does succeed in formalizing that argument. In the
process, it also defines a relative measure of "redundancy" in the
training set as correlation (the degree of collinearity) between the
training vectors in the input space. Such a measure can be computed
directly before training is begun.
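The posting does not give the exact formula for this measure, but one
plausible proxy in the same spirit (my sketch, not the paper's
definition) is the mean absolute pairwise cosine between the input
vectors: it is 1 for a fully collinear, maximally redundant training
set and smaller the more the inputs spread out.

import numpy as np

def redundancy(X):
    """Mean absolute cosine between distinct rows of X (input vectors)."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length inputs
    C = np.abs(U @ U.T)                               # |cos| for every pair
    P = len(X)
    return (C.sum() - P) / (P * (P - 1))              # drop the diagonal

rng = np.random.default_rng(0)
print(redundancy(rng.normal(size=(50, 5))))             # spread-out inputs: well below 1
print(redundancy(np.tile(rng.normal(size=5), (50, 1))))  # identical inputs: 1.0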
To obtain copies of this article:
unix> ftp archive.cis.ohio-state.edu (or 128.146.8.52)
Name : anonymous
Password: <your id>
ftp> cd pub/neuroprose
ftp> binary
ftp> get hildebrandt.batch.ps.Z
ftp> quit
unix> uncompress hildebrandt.batch.ps.Z
unix> lpr -Pps hildebrandt.batch.ps (or however you print PostScript)
(Thanks to Jordan Pollack for providing this valuable service to the
NN Research community.)
Thomas H. Hildebrandt
Visiting Research Scientist
CSEE Department
Lehigh University