batch-mode parallel implementations

Scott_Fahlman at SEF-PMAX.SLISP.CS.CMU.EDU
Sun Oct 20 11:08:11 EDT 1991


    I would like to mention a case where, surprisingly, even large batches
    gave instability.  The application was recognition of handwritten
    lower-case letters, and the network was of the LeCun variety.  The
    training set comprised three batches of 1950 letter images (a total of
    5850 images).  This partition was chosen randomly.  Fahlman's quickprop
    behaved poorly, and with some close inspection I found a number of
    weights for which the partial derivative was changing sign from one
    batch to the next.  Further, the magnitudes of those partials were not
    always small.  In short, the performance surfaces for the three batches
    differed considerably.  The moral:  You may have to make a single batch
    of the entire training set, even when working with fairly large training
    sets.
    
    -- Tom English

Note that it is OK to switch from one training set to another when using
Quickprop, but that every time you change the training set you *must* zero
out the prev-slopes and delta vectors.  This prevents the quadratic part of
the algorithm from trying to draw a parabola between two slopes that are
not closely related.  If you don't do this, that one step can badly mess up
the weights you've laboriously accumulated so far.  Of course, if you do
this after every sample, the quadratic acceleration never kicks in and you
end up with nothing more than plain old backprop without momentum.  If you
want to get any benefit from quickprop, you have to run each distinct
training set for at least a few cycles.
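
To make the bookkeeping concrete, here is a rough sketch in Python of how one
might code the update and the reset.  It is not the code from any quickprop
release: the update rule is simplified from the published description, and
epsilon and mu are just illustrative values.

import numpy as np

def quickprop_step(w, slope, prev_slope, prev_delta, epsilon=0.55, mu=1.75):
    # w          -- current weights (flat vector)
    # slope      -- dE/dw summed over the current training set
    # prev_slope -- slopes from the previous step on the *same* training set
    # prev_delta -- weight changes applied at the previous step
    delta = np.empty_like(w)
    for i in range(len(w)):
        if prev_delta[i] != 0.0:
            # Quadratic part: fit a parabola through the previous and
            # current slopes and step toward its minimum, capped at mu
            # times the previous step.
            denom = prev_slope[i] - slope[i]
            if abs(denom) > 1e-12:
                step = (slope[i] / denom) * prev_delta[i]
            else:
                step = mu * prev_delta[i]
            limit = mu * abs(prev_delta[i])
            delta[i] = max(-limit, min(limit, step))
        else:
            # No usable history: plain gradient-descent step.
            delta[i] = -epsilon * slope[i]
    return w + delta, delta

def switch_training_set(prev_slope, prev_delta):
    # Zero the history so the parabola is never fit across slopes
    # taken from two unrelated error surfaces.
    prev_slope[:] = 0.0
    prev_delta[:] = 0.0

The caller is expected to copy slope into prev_slope and delta into prev_delta
after each step, and to call switch_training_set whenever it moves to a
different set of patterns.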

If you were aware of all that (it's unclear from your message) and still
experienced instability, then I would say that the batches, even though
they are fairly large, are not large enough to provide a fair
representation of the underlying distribution.
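
If you want to check that directly, a quick diagnostic in the spirit of Tom's
inspection is to compare the signs of the summed partial derivatives across
the batches.  In the sketch below, compute_slopes is a placeholder for
whatever routine returns the dE/dw vector for one batch with the current
weights.

import numpy as np

def batch_slope_agreement(net, batches, compute_slopes):
    # Fraction of weights whose summed slope dE/dw has the same sign in
    # every batch.  A low value suggests the batches are too small to give
    # a fair picture of the underlying distribution.
    slopes = np.stack([compute_slopes(net, b) for b in batches])
    signs = np.sign(slopes)
    consistent = np.all(signs == signs[0], axis=0)
    return float(consistent.mean())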

-- Scott
