batch-mode parallel implementations
Scott_Fahlman@SEF-PMAX.SLISP.CS.CMU.EDU
Fri Oct 18 12:38:38 EDT 1991
Original-From: Yann le Cun <yann at lamoon.neural>
I personally prefer the phrase "stochastic gradient" to all of these.
That's a fine term, but it seems to me that it refers to one of the effects
of per-sample updating, and not to the mechanism itself. You might get a
"stochastic gradient" because you are updating after every randomly chosen
sample, but you might also get it from noise in the samples themselves. So
if you want to refer to the choice of updating mechanism, and not to the
quality of the gradient, I think it's better to use a term like "per-sample
updating" that is nearly impossible for the reader to misunderstand.
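To make the distinction concrete, here is a minimal sketch of the two updating mechanisms on a hypothetical toy least-squares problem (numpy assumed; the data, learning rate, and step counts are all illustrative, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

def grad(w, Xb, yb):
    # Gradient of mean squared error over the given (sub)set.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

lr = 0.05

# Per-sample updating: the weights change after every randomly chosen sample.
w_sgd = np.zeros(3)
for i in rng.permutation(len(y)):
    w_sgd = w_sgd - lr * grad(w_sgd, X[i:i+1], y[i:i+1])

# Batch updating: evaluate the gradient over the full set at a fixed weight
# vector, then update once; repeat.
w_batch = np.zeros(3)
for _ in range(100):
    w_batch = w_batch - lr * grad(w_batch, X, y)
```

In the per-sample loop each gradient is computed at a slightly different point because the weights have already moved; that wandering is where the "stochastic gradient" reading comes from, even before any noise in the samples themselves.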
>I guess you could measure redundancy by seeing if some subset of the
>training data set produces essentially the same gradient vector as the full
>set.
Hmmm, I think any dataset for which you expect good generalization is redundant.
Train your net on 30% of the dataset, and measure how many of the remaining
70% you get right. If you get a significant portion of them right, then
accumulating gradients on these examples (without updating the weights) would
be little more than a waste of time.
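That probe can be sketched directly: train on 30% of a toy dataset and count how much of the remaining 70% already comes out right. This is a hypothetical perceptron on synthetic separable data; every name and number below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
w_true = rng.normal(size=4)
y = np.sign(X @ w_true)              # linearly separable toy labels

n_train = int(0.3 * len(y))          # train on 30% ...
Xtr, ytr = X[:n_train], y[:n_train]
Xte, yte = X[n_train:], y[n_train:]  # ... and probe the remaining 70%

w = np.zeros(4)
for _ in range(50):                  # plain perceptron passes over the 30%
    for xi, yi in zip(Xtr, ytr):
        if yi * (xi @ w) <= 0:
            w = w + yi * xi

held_out_acc = np.mean(np.sign(Xte @ w) == yte)
# A high score means gradients from the held-out 70% would mostly repeat
# what the 30% already said: the dataset is redundant in this sense.
```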
This suggests the following (unverified) postulate:
The better the generalization, the bigger the speed difference between
on-line (per-sample, stochastic, ...) and batch.
In other words, any dataset interesting enough to be learned (as opposed to
stored) has to be redundant.
There might be no such thing as a large non-redundant dataset that is worth
learning.
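The check quoted earlier, seeing whether a subset produces essentially the same gradient vector as the full set, might look like this minimal sketch (toy least-squares gradient again; the subset size and the cosine-similarity measure are my assumptions, not anything proposed in the thread):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.05 * rng.normal(size=1000)

def grad(w, Xb, yb):
    # Mean-squared-error gradient over the given (sub)set.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(5)                      # some current weight vector
g_full = grad(w, X, y)

idx = rng.choice(len(y), size=100, replace=False)   # a 10% subset
g_sub = grad(w, X[idx], y[idx])

# If the subset's gradient points essentially the same way as the full
# set's, the remaining 90% adds little information: the set is redundant.
cos = g_full @ g_sub / (np.linalg.norm(g_full) * np.linalg.norm(g_sub))
```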
I think we may be talking about two different things here. Let's assume
that there is some underlying distribution that we are trying to model, and
that we take some number of samples from this distribution to use as a
training set. It is clearly true that there must be some "redundancy" in
the underlying distribution if it is to be worth modelling. In this case,
I'm using the term "redundancy" to mean that there's some sort of regular
statistical structure that is stable enough to be of predictive value. Put
another way, the distribution must not be totally random-looking; it has
less than the maximum possible information per sample.
However, given one of these redundant underlying distributions, we want to
choose a training set that is large enough to be representative of the
distribution (and to separate signal from noise), but not so large as to be
redundant itself. This training set is what I was referring to in my
earlier message. I think it is quite possible for the training set to be
large, not internally redundant, and interesting in the sense that it
models a predictable (redundant) underlying distribution. And this is the
kind of case where I think that batch-updating has an advantage.
-- Scott Fahlman
More information about the Connectionists mailing list