batch-mode parallel implementations

Tom English english at sun1.cs.ttu.edu
Fri Oct 18 14:20:19 EDT 1991


Scott Fahlman remarked,

> As for speed of convergence, continuous updating clearly beats per-epoch
> updating if the training set is highly redundant.

Another important factor is the autocorrelation of the training sequence.
Consider a (highly redundant) training sequence that starts with 1000
examples of A and ends with 1000 examples of B.  With continuous updating,
there is a good chance that learning the B examples will cause the learned
response to A examples to be lost.  The obvious answer, in this contrived
case, is to alternate presentations of A and B examples.
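
To make the contrived case concrete, here is a small sketch (mine, not part
of the original exchange; the single linear unit, learning rate, and targets
are arbitrary illustration choices).  Per-example updating over the ordered
sequence ends up fitting only the B examples, while per-epoch updating
averages the two gradients and cannot unlearn one class in favor of the other.

    import numpy as np

    # Contrived sequence: 1000 presentations of example A followed by
    # 1000 presentations of example B.  A single linear unit w with
    # squared error is enough to show the effect of presentation order.
    A = (1.0, 1.0)    # (input, target) standing in for the A examples
    B = (1.0, -1.0)   # (input, target) standing in for the B examples
    sequence = [A] * 1000 + [B] * 1000
    lr = 0.01

    def error_on(w, examples):
        return np.mean([(t - w * x) ** 2 for x, t in examples])

    # Continuous (per-example) updating over the ordered sequence: by
    # the end, the B examples have pulled w toward their own target,
    # and the fit to A is lost.
    w = 0.0
    for x, t in sequence:
        w += lr * (t - w * x) * x
    print("continuous:", w, "error on A:", error_on(w, [A]))

    # Per-epoch updating: each step uses the gradient averaged over the
    # whole sequence, so neither class overwrites the other.
    w = 0.0
    for _ in range(1000):
        grad = np.mean([(t - w * x) * x for x, t in sequence])
        w += lr * grad
    print("per-epoch: ", w, "error on A:", error_on(w, [A]))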

Now for an uncontrived case:  Suppose we are training a recurrent net for
speaker-independent speech recognition, and that inputs to the net are
power spectra extracted from the speech signal at fixed intervals.  There
are relatively long intervals in which the speech sound (spectrum) does
not change much.  There are even longer intervals in which the speaker
does not change.  Reordering the spectra for an utterance is clearly
not an option, and continuous updating seems imprudent even though the
redundancy of the training set is high.  I'm sure there are plenty of
nonstationary time series, other than speech, that present the same
problems.

In response to Scott's remark on the batch size used with an accelerated
convergence procedure,

> It must be sufficiently large to give a reasonably stable picture of
> the overall gradient, but not so large that the gradient is computed
> many times over before a weight-update cycle occurs.

I would like to mention a case where, surprisingly, even large batches
gave instability.  The application was recognition of handwritten
lower-case letters, and the network was of the LeCun variety.  The
training set comprised three batches of 1950 letter images (a total of
5850 images).  This partition was chosen randomly.  Fahlman's quickprop
behaved poorly, and on closer inspection I found a number of
weights for which the partial derivative was changing sign from one
batch to the next.  Further, the magnitudes of those partials were not
always small.  In short, the performance surfaces for the three batches
differed considerably.  The moral:  You may have to treat the entire
training set as a single batch, even when working with fairly large
training sets.
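
For anyone who wants to check for the same symptom, here is a rough sketch
of the diagnostic (my own illustration; the gradient array, threshold, and
numbers are placeholders, not data from the handwriting experiment): given
the gradient of the error computed separately over each batch, flag weights
whose partial derivative changes sign between consecutive batches while
remaining non-negligible in magnitude.

    import numpy as np

    def sign_flip_report(batch_grads, magnitude_floor=1e-3):
        # batch_grads: array of shape (n_batches, n_weights); row i
        # holds the partial derivatives of the error over batch i with
        # respect to every weight.  Returns the indices of weights
        # whose partial changes sign between consecutive batches while
        # both magnitudes stay above magnitude_floor.
        g = np.asarray(batch_grads)
        flips = np.sign(g[:-1]) * np.sign(g[1:]) < 0
        big = (np.abs(g[:-1]) > magnitude_floor) & \
              (np.abs(g[1:]) > magnitude_floor)
        return np.nonzero(np.any(flips & big, axis=0))[0]

    # Made-up numbers for three batches and three weights: weight 1
    # flips sign with sizable magnitude, weight 0 is stable, weight 2
    # flips but is negligible.  Prints [1].
    g = [[0.20,  0.50,  1e-4],
         [0.18, -0.40, -1e-4],
         [0.21,  0.45,  1e-4]]
    print(sign_flip_report(g))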

-- Tom English
   english at sun1.cs.ttu.edu
