batch-mode parallel implementations

neural!lamoon.neural!yann@att.att.com
Fri Oct 18 11:08:03 EDT 1991


   Scott Fahlman writes:

    >I avoid using the term "on-line" for what I call "per-sample" or
    >"continuous" updating of weights.

I personally prefer the phrase "stochastic gradient" to all of these.

   >I guess you could measure redundancy by seeing if some subset of the
   >training data set produces essentially the same gradient vector as the full
   >set.
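
(A minimal sketch of that check, in Python/NumPy terms: compare the gradient
computed on a random 10% subset with the gradient on the full set. The linear
least-squares model, the synthetic data, and the cosine-similarity criterion
are illustrative placeholders of mine, not part of Scott's proposal.)

import numpy as np

def gradient(w, X, y):
    # Mean-squared-error gradient for a linear model y ~ X w.
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(20)                      # evaluate both gradients at the same point
g_full = gradient(w, X, y)

idx = rng.choice(len(y), size=len(y) // 10, replace=False)   # a 10% subset
g_sub = gradient(w, X[idx], y[idx])

# A cosine similarity close to 1 means the subset already gives essentially
# the same gradient direction as the full set, i.e. the data are redundant.
cos = g_sub @ g_full / (np.linalg.norm(g_sub) * np.linalg.norm(g_full))
print("cosine(subset gradient, full gradient) =", cos)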

Hmmm, I think any dataset for which you expect good generalization is redundant.
Train your net on 30% of the dataset, and measure how many of the remaining
70% it gets right. If it gets a significant portion of them right, those
examples contribute almost nothing new to the gradient, so accumulating their
gradients (without updating the weights) would be little more than a waste of
time.
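
(Concretely, a minimal sketch of this 30% / 70% check, using a synthetic
linearly separable problem and a plain perceptron; the data, the model, and
the split are illustrative placeholders only.)

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
w_true = rng.normal(size=10)
y = np.sign(X @ w_true)                  # labels in {-1, +1}, linearly separable

n_train = int(0.3 * len(y))              # train on 30% of the data
X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

w = np.zeros(10)
for epoch in range(20):
    for x, t in zip(X_tr, y_tr):         # per-sample (stochastic) updates
        if t * (x @ w) <= 0:             # mistake-driven perceptron rule
            w += t * x

accuracy = np.mean(np.sign(X_te @ w) == y_te)
print("accuracy on the held-out 70%:", accuracy)
# If this accuracy is high, the held-out 70% add little new information:
# their gradients are largely redundant with what the 30% already provide.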

This suggests the following (unverified) postulate:
 The better the generalization, the larger the speed advantage of
 on-line (per-sample, stochastic, ...) updating over batch.

In other words, any dataset interesting enough to be learned (as opposed to
stored) has to be redundant.
There might be no such thing as a large non-redundant dataset that is worth 
learning.
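
(For concreteness, a minimal sketch of the two update schemes on a toy linear
least-squares problem; the problem, step sizes, and number of passes are
illustrative placeholders, not a benchmark.)

import numpy as np

def grad(w, x, t):
    # Squared-error gradient for one example of a linear model.
    return 2.0 * (x @ w - t) * x

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5)

# Batch: accumulate over the whole set, then make one weight update per pass.
w_batch = np.zeros(5)
for epoch in range(10):
    g = np.mean([grad(w_batch, x, t) for x, t in zip(X, y)], axis=0)
    w_batch -= 0.1 * g

# On-line / per-sample / stochastic: one weight update per example.
w_sgd = np.zeros(5)
for epoch in range(10):
    for x, t in zip(X, y):
        w_sgd -= 0.01 * grad(w_sgd, x, t)

print("batch loss after 10 passes:", np.mean((X @ w_batch - y) ** 2))
print("sgd   loss after 10 passes:", np.mean((X @ w_sgd - y) ** 2))

On a redundant dataset, the per-sample loop tends to reach a given loss in
far fewer passes through the data, which is what the postulate is about.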

  -- Yann

