Paper in neuroprose
Xiru Zhang
xiru at Think.COM
Wed Mar 18 10:33:51 EST 1992
Date: Mon, 16 Mar 92 14:24:09 -0500
From: "Thomas H. Hildebrandt " <thildebr at athos.csee.lehigh.edu>
The following paper, which has been submitted to IEEE Transactions on
Neural Networks, is now available in PostScript format through the
neuroprose archive:
"Why Batch Learning is Slower Than Per-Sample Learning"
Thomas H. Hildebrandt
ABSTRACT
We compare the convergence properties of the batch and per-sample
versions of the standard backpropagation algorithm. The comparison is
made on the basis of ideal step sizes computed for the two algorithms
with respect to a simplified, linear problem. For either algorithm,
convergence is guaranteed as long as no step exceeds the minimum ideal
step size by more than a factor of 2. By limiting the discussion to a
fixed, safe step size, we can compare the maximum step that can be
taken by each algorithm in the worst case.
It is found that the maximum fixed safe step size is $P$ times smaller
for the batch version than for the per-sample version, where $P$ is
the number of training examples. This fact is
balanced somewhat by the fact that the batch algorithm sums $P$ substeps
in order to compute its step, meaning that the steps taken by the two
algorithms are comparable in size. However, the batch algorithm takes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
only one step per epoch while the per-sample algorithm takes $P$.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Thus, the conclusion is that the batch algorithm is $P$ times slower
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
in a serial implementation.
^^^^^^^^^^^^^^^^^^^^^^^^^^^
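
For concreteness, here is a rough sketch of the two update rules on a
simplified linear least-squares problem. The code is illustrative only and
is not taken from the paper; the data, step sizes, and function names are
assumptions.

import numpy as np

rng = np.random.default_rng(0)
P, n_in = 8, 3                       # P training examples, 3 inputs
X = rng.normal(size=(P, n_in))       # inputs (assumption)
T = X @ np.array([1.0, -2.0, 0.5])   # targets from a known linear map

def batch_epoch(w, eta):
    # One epoch of batch backprop: sum the P per-example gradients,
    # then take a single step.
    grad = sum((X[p] @ w - T[p]) * X[p] for p in range(P))
    return w - eta * grad

def per_sample_epoch(w, eta):
    # One epoch of per-sample backprop: take P small steps,
    # recomputing the gradient after each one.
    for p in range(P):
        w = w - eta * (X[p] @ w - T[p]) * X[p]
    return w

w_batch = np.zeros(n_in)
w_sample = np.zeros(n_in)
for _ in range(50):
    # The batch step size is kept P times smaller here, in line with the
    # abstract's claim about the maximum fixed safe step size.
    w_batch = batch_epoch(w_batch, eta=0.1 / P)
    w_sample = per_sample_epoch(w_sample, eta=0.1)
print("batch     :", w_batch)
print("per-sample:", w_sample)
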
The last argument is not sound: the direction computed from one training
example and the direction computed from all training examples can be very
different, so even if the step sizes are the same, the convergence rates can
differ. This may not be a serious problem for the example in your paper,
where the network has a linear (identity) activation function and no hidden
units, but in a multi-layer network with non-linear units, not only is the
step size important, the direction of the step is at least as important.
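
To illustrate this point numerically, here is a small sketch (again not from
the paper; the tiny tanh network and random data are assumed purely for
demonstration) that compares the direction of each per-example gradient with
the direction of the summed, batch gradient.

import numpy as np

rng = np.random.default_rng(1)
P, n_in, n_hid = 6, 4, 3
X = rng.normal(size=(P, n_in))               # random inputs (assumption)
T = rng.normal(size=P)                       # random targets (assumption)
W1 = 0.5 * rng.normal(size=(n_hid, n_in))    # hidden-layer weights
w2 = 0.5 * rng.normal(size=n_hid)            # output weights

def per_example_gradients():
    # Gradient of 0.5*(y_p - t_p)^2 with respect to all weights,
    # one flattened row per training example.
    grads = []
    for p in range(P):
        h = np.tanh(W1 @ X[p])                        # hidden activations
        e = w2 @ h - T[p]                             # output error
        gW1 = np.outer(e * w2 * (1.0 - h**2), X[p])   # backprop through tanh
        gw2 = e * h
        grads.append(np.concatenate([gW1.ravel(), gw2]))
    return np.array(grads)

G = per_example_gradients()
g_batch = G.sum(axis=0)              # direction the batch algorithm would use
cosines = G @ g_batch / (np.linalg.norm(G, axis=1) * np.linalg.norm(g_batch))
print("cosine(per-example gradient, batch gradient):", np.round(cosines, 2))
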
- Xiru Zhang