batch-mode parallel implementations

Scott_Fahlman@SEF-PMAX.SLISP.CS.CMU.EDU
Fri Oct 18 12:38:38 EDT 1991


    Original-From: Yann le Cun <yann at lamoon.neural>
    I personally prefer the phrase "stochastic gradient" to all of these.
    
That's a fine term, but it seems to me that it refers to one of the effects
of per-sample updating, and not to the mechanism itself.  You might get a
"stochastic gradient" because you are updating after every randomly chosen
sample, but you might also get it from noise in the samples themselves.  So
if you want to refer to the choice of updating mechanism, and not to the
quality of the gradient, I think it's better to use a term like "per-sample
updating" that is nearly impossible for the reader to misunderstand.

       >I guess you could measure redundancy by seeing if some subset of the
       >training data set produces essentially the same gradient vector as the full
       >set.
    
    Hmmm, I think any dataset for which you expect good generalization is redundant.
    Train your net on 30% of the dataset, and measure how many of the remaining
    70% you get right. If you get a significant portion of them right, then
    accumulating gradients on these examples (without updating the weights) would
    be little more than a waste of time.
    
    This suggests the following (unverified) postulate:
     The better the generalization, the bigger the speed difference between
     on-line (per-sample, stochastic....) and batch.
    
    In other words, any dataset interesting enough to be learned (as opposed to
    stored) has to be redundant.
    There might be no such thing as a large non-redundant dataset that is worth 
    learning.
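
The redundancy measure mentioned in the quoted exchange above (does some
subset of the training set produce essentially the same gradient vector
as the full set?) can be sketched directly. This is purely illustrative,
reusing the toy squared-error gradient from the sketch above; the 30%
default and the use of cosine similarity as the comparison are arbitrary
choices:

import numpy as np

def mean_gradient(w, X, Y):
    # Average per-example gradient of 0.5 * (w @ x - y)**2 over a set.
    return np.mean([(w @ x - y) * x for x, y in zip(X, Y)], axis=0)

def subset_gradient_agreement(w, X, Y, fraction=0.3, seed=0):
    # Compare the gradient computed on a random subset against the
    # gradient computed on the full set.  A cosine similarity near 1.0
    # suggests the remaining samples add little new gradient information
    # at this particular weight vector w.
    rng = np.random.default_rng(seed)
    n = max(1, int(fraction * len(X)))
    idx = rng.choice(len(X), size=n, replace=False)
    g_full = mean_gradient(w, X, Y)
    g_sub = mean_gradient(w, X[idx], Y[idx])
    denom = np.linalg.norm(g_full) * np.linalg.norm(g_sub) + 1e-12
    return float(g_full @ g_sub / denom)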
    
I think we may be talking about two different things here.  Let's assume
that there is some underlying distribution that we are trying to model, and
that we take some number of samples from this distribution to use as a
training set.  It is clearly true that there must be some "redundancy" in
the underlying distribution if it is to be worth modelling.  In this case,
I'm using the term "redundancy" to mean that there's some sort of regular
statistical structure that is stable enough to be of predictive value.  Put
another way, the distribution must not be totally random-looking; it must
carry less than the maximum possible information per sample.

However, given one of these redundant underlying distributions, we want to
choose a training set that is large enough to be representative of the
distribution (and to separate signal from noise), but not so large as to be
redundant itself.  This training set is what I was referring to in my
earlier message.  I think it is quite possible for the training set to be
large, not internally redundant, and interesting in the sense that it
models a predictable (redundant) underlying distribution.  And this is the
kind of case where I think that batch-updating has an advantage.
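
As a purely illustrative way to play with this distinction, reusing the
definitions from the sketches above: draw a training set from a
structured (and in that sense redundant) synthetic distribution, then ask
how much of the full set's gradient a modest subset already accounts for,
and where one pass of each updating scheme lands.  The synthetic
distribution, sizes, noise level, and learning rate are arbitrary
assumptions, and which scheme does better will depend on them:

import numpy as np

rng = np.random.default_rng(0)

# A structured ("redundant") underlying distribution: a fixed linear
# rule plus a little sampling noise.
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3))
Y = X @ w_true + 0.1 * rng.normal(size=200)

# How much of the full set's gradient does a 30% subset account for?
print(subset_gradient_agreement(np.zeros(3), X, Y, fraction=0.3))

# One pass of each updating scheme from the same starting point, just to
# see where each lands on this particular draw of data.
w_batch = batch_update(np.zeros(3), X, Y, lr=0.1)
w_onln = per_sample_update(np.zeros(3), X, Y, lr=0.1)
print(float(np.mean((X @ w_batch - Y) ** 2)))
print(float(np.mean((X @ w_onln - Y) ** 2)))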

-- Scott Fahlman

