Paper in neuroprose
Thomas H. Hildebrandt
thildebr at athos.csee.lehigh.edu
Mon Mar 16 14:24:09 EST 1992
The following paper, which has been submitted to IEEE Transactions on
Neural Networks, is now available in PostScript format through the
neuroprose archive:
"Why Batch Learning is Slower Than Per-Sample Learning"
Thomas H. Hildebrandt
ABSTRACT
We compare the convergence properties of the batch and per-sample
versions of the standard backpropagation algorithm. The comparison is
made on the basis of ideal step sizes computed for the two algorithms
with respect to a simplified, linear problem. For either algorithm,
convergence is guaranteed as long as no step exceeds the minimum ideal
step size by more than a factor of 2. By limiting the discussion to a
fixed, safe step size, we can compare the maximum step that can be
taken by each algorithm in the worst case.
It is found that the maximum fixed safe step size is $P$ times smaller
for the batch version than for the per-sample version, where $P$ is
the number of training examples. This is offset somewhat by the fact
that the batch algorithm sums $P$ substeps in order to compute its
step, so the steps taken by the two algorithms are comparable in size.
However, the batch algorithm takes
only one step per epoch while the per-sample algorithm takes $P$.
Thus, the conclusion is that the batch algorithm is $P$ times slower
in a serial implementation.
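To give the flavor of the analysis, here is a sketch on the simplified
linear problem; the notation ($E$, $w$, $x_p$, $t_p$, $\eta$,
$\lambda_{\max}$) is illustrative and not necessarily that of the
paper. For the quadratic error

    $E(w) = \frac{1}{2} \sum_{p=1}^{P} (w \cdot x_p - t_p)^2$

the per-sample update $\Delta w = -\eta (w \cdot x_p - t_p) x_p$ is
stable on sample $p$ whenever $\eta < 2 / \|x_p\|^2$, while the batch
update $\Delta w = -\eta \sum_p (w \cdot x_p - t_p) x_p$ is stable
whenever $\eta < 2 / \lambda_{\max}$, where $\lambda_{\max}$ is the
largest eigenvalue of $H = \sum_p x_p x_p^T$. In the worst case of
fully collinear training vectors of equal length, $\lambda_{\max} =
\sum_p \|x_p\|^2 = P \|x\|^2$, so the safe fixed batch step is $P$
times smaller than the safe per-sample step.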
In response to last Fall's discussion involving Yann LeCun, Kamil
Grajski and others regarding the unexpectedly poor performance of
parallel implementations of the batch backpropagation algorithm, I
performed an analysis of the convergence speed of batch and per-sample
versions of the backpropagation algorithm, based on a calculation of the
ideal step size. The conclusion is that, even if there are as many
processors as training samples, the parallel implementation of a batch
algorithm which does not alter its step size adaptively during an
epoch can never be faster than the serial implementation of the
per-sample algorithm.
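As a concrete, if artificial, illustration of the effect, the toy
script below (mine, not taken from the paper) runs batch and
per-sample gradient descent on a small linear least-squares problem
with nearly collinear, i.e. highly redundant, inputs. Each step size
is fixed at half its worst-case safe limit, and the script counts the
epochs each method needs to drive the residual below a tolerance:

import numpy as np

rng = np.random.default_rng(0)
P, d = 50, 5                                   # P training samples, d inputs
base = rng.normal(size=d)
X = base + 0.1 * rng.normal(size=(P, d))       # nearly collinear input vectors
t = X @ rng.normal(size=d)                     # consistent linear targets

H = X.T @ X                                    # Hessian of the summed error
eta_batch = 1.0 / np.linalg.eigvalsh(H)[-1]    # half the 2/lambda_max limit
eta_per = 1.0 / max(np.sum(x * x) for x in X)  # half the worst-case 2/|x_p|^2 limit

def epochs_to_converge(one_epoch, eta, tol=1e-4, max_epochs=20000):
    w = np.zeros(d)
    for epoch in range(1, max_epochs + 1):
        w = one_epoch(w, eta)
        if np.linalg.norm(X @ w - t) < tol:
            return epoch
    return max_epochs

def batch_epoch(w, eta):                       # one summed step per epoch
    return w - eta * (X.T @ (X @ w - t))

def per_sample_epoch(w, eta):                  # P sequential steps per epoch
    for x, target in zip(X, t):
        w = w - eta * (x @ w - target) * x
    return w

print("batch epochs     :", epochs_to_converge(batch_epoch, eta_batch))
print("per-sample epochs:", epochs_to_converge(per_sample_epoch, eta_per))

On a problem like this the per-sample version typically needs on the
order of $P$ times fewer epochs, which is the point of the comparison:
even a perfectly parallel batch epoch cannot make up that factor.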
Because of the way the problem is approached, the analysis does not
quite go beyond the "multiple-copies" argument, as LeCun had hoped it
would. However, it does succeed in formalizing that argument. In the
process, it also defines a relative measure of "redundancy" in the
training set as correlation (the degree of collinearity) between the
training vectors in the input space. Such a measure can be computed
directly before training is begun.
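The posting does not give the exact formula for this measure, but one
plausible proxy in the same spirit (my sketch, not the paper's
definition) is the mean absolute pairwise cosine between the input
vectors: it is 1 for a fully collinear, maximally redundant training
set and smaller the more the inputs spread out.

import numpy as np

def redundancy(X):
    """Mean absolute cosine between distinct rows of X (input vectors)."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length inputs
    C = np.abs(U @ U.T)                               # |cos| for every pair
    P = len(X)
    return (C.sum() - P) / (P * (P - 1))              # drop the diagonal

rng = np.random.default_rng(0)
print(redundancy(rng.normal(size=(50, 5))))             # spread-out inputs: well below 1
print(redundancy(np.tile(rng.normal(size=5), (50, 1))))  # identical inputs: 1.0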
To obtain copies of this article:
unix> ftp archive.cis.ohio-state.edu (or 128.146.8.52)
Name : anonymous
Password: <your id>
ftp> cd pub/neuroprose
ftp> binary
ftp> get hildebrandt.batch.ps.Z
ftp> quit
unix> uncompress hildebrandt.batch.ps.Z
unix> lpr -Pps hildebrandt.batch.ps (or however you print PostScript)
(Thanks to Jordan Pollack for providing this valuable service to the
NN Research community.)
Thomas H. Hildebrandt
Visiting Research Scientist
CSEE Department
Lehigh University