Paper in neuroprose

Xiru Zhang xiru at Think.COM
Wed Mar 18 10:33:51 EST 1992


   Date: Mon, 16 Mar 92 14:24:09 -0500
   From: "Thomas H. Hildebrandt  " <thildebr at athos.csee.lehigh.edu>


   The following paper, which has been submitted to IEEE Transactions on
   Neural Networks, is now available in PostScript format through the
   neuroprose archive:

	   "Why Batch Learning is Slower Than Per-Sample Learning"
			   Thomas H. Hildebrandt

				 ABSTRACT
      We compare the convergence properties of the batch and per-sample
      versions of the standard backpropagation algorithm.  The comparison is
      made on the basis of ideal step sizes computed for the two algorithms
      with respect to a simplified, linear problem.  For either algorithm,
      convergence is guaranteed as long as no step exceeds the minimum ideal
      step size by more than a factor of 2.  By limiting the discussion to a
      fixed, safe step size, we can compare the maximum step that can be
      taken by each algorithm in the worst case.
      It is found that the maximum fixed safe step size is $P$ times smaller
      for the batch version than for the per-sample version, where $P$ is
      the number of training examples.  This is offset somewhat by the
      fact that the batch algorithm sums $P$ substeps
      in order to compute its step, meaning that the steps taken by the two
      algorithms are comparable in size.  However, the batch algorithm takes
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      only one step per epoch while the per-sample algorithm takes $P$.
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      Thus, the conclusion is that the batch algorithm is $P$ times slower
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      in a serial implementation.
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
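
For concreteness, here is a small sketch of that counting argument (my own
illustration, not code from the paper), using a one-weight linear
least-squares problem; the step-size constants and the "safe" bound below
are illustrative assumptions.  Per epoch, the batch version takes one summed
step with its step size reduced by a factor of P, while the per-sample
version takes P separate steps:

import numpy as np

rng = np.random.default_rng(0)
P = 20                                  # number of training examples
x = rng.normal(size=P)
w_true = 1.5
y = w_true * x                          # linear target, no hidden units

def batch_epoch(w, eta):
    # One step per epoch: sum the P per-example gradients, update once.
    grad = np.sum((w * x - y) * x)
    return w - eta * grad

def per_sample_epoch(w, eta):
    # P steps per epoch: update after each example in turn.
    for xi, yi in zip(x, y):
        w -= eta * (w * xi - yi) * xi
    return w

eta = 0.9 / np.max(x * x)               # illustrative "safe" per-sample step
w_b, w_s = 0.0, 0.0
for epoch in range(5):
    w_b = batch_epoch(w_b, eta / P)     # batch step shrunk by P to stay safe
    w_s = per_sample_epoch(w_s, eta)
    print(epoch, abs(w_b - w_true), abs(w_s - w_true))

With these (assumed) step sizes, the per-sample error shrinks much faster
per epoch, which is the effect the abstract describes.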


The last argument is not sound: the direction computed from a single
training example and the direction computed from all training examples can
be very different, so even when the step sizes are comparable, the
convergence rates can differ. This may not be a serious problem for the
example in your paper, where the network has a linear (identity) activation
function and no hidden units, but in a multi-layer network with non-linear
units, not only is the step size important, the direction of the step is at
least equally important.
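
To make this concrete, here is a small sketch (my own illustration, with
arbitrary random data and a single tanh hidden layer, none of it taken from
the paper): the cosine between each per-sample gradient and the summed batch
gradient is often well below 1 and can even be negative, so the per-sample
steps are not simply rescaled versions of the batch step.

import numpy as np

rng = np.random.default_rng(1)
P, n_in, n_hid = 8, 3, 4
X = rng.normal(size=(P, n_in))
t = rng.normal(size=P)                  # arbitrary targets
W1 = rng.normal(size=(n_in, n_hid))
w2 = rng.normal(size=n_hid)

def grad_one(x, target):
    # Backprop for one example through a single tanh hidden layer,
    # squared-error loss 0.5 * (y - target)**2.
    h = np.tanh(x @ W1)                 # hidden activations
    y = h @ w2                          # scalar output
    e = y - target
    g_w2 = e * h                        # gradient w.r.t. output weights
    g_W1 = np.outer(x, e * w2 * (1.0 - h * h))   # gradient w.r.t. W1
    return np.concatenate([g_W1.ravel(), g_w2])

grads = np.array([grad_one(X[i], t[i]) for i in range(P)])
batch = grads.sum(axis=0)               # direction of the batch step

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for i, g in enumerate(grads):
    print(i, round(cosine(g, batch), 3))   # often well below 1, can be < 0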

- Xiru Zhang


