batch-mode parallel implementations

Fri Oct 18 02:10:29 EDT 1991

    Yann LeCun writes:

    Now, the larger, and more redundant the dataset is, the larger the difference
    in convergence speed between on-line and batch.  
    For small (and/or random) datasets, batch might be OK, but who cares.  

I think that it may be misleading to lump together "large" and "redundant"
as if they were the same thing, or as if they were inseparable.  I agree
that for highly redundant datasets, continuous updating has an advantage.
I also agree that for small datasets, we don't care much about speed.  But
it seems to me that it is possible to have a large, not-very-redundant data
set, and that accelerated batch methods should have an advantage for these.

I guess you could measure redundancy by seeing if some subset of the
training data set produces essentially the same gradient vector as the full
set.  Probably statisticians have good ways of talking about this
redundancy business -- unfortunately, I don't know the right vocabulary.
In a data set with noise, you need a big enough training set to raise
relatively rare but real features above the level of the random background
noise.  If you have roughly that much data, I bet fast batch techniques
would win; if you have a training set that is several times this minimal
size, then continuous updating would win.  That's my suspicion, anyway.

    I would love to see a clear demonstration that any of these methods beats a
    carefully tuned on-line gradient on a large pattern classification problem.
    I tried many of these methods several years ago, and failed.

Well, if my hypothesis above is right, we could demonstrate this by finding
a dataset that is large enough to make you happy, but not highly redundant.
I guess that we could create this by taking any large dataset, measuring
its redundancy, and trimming it down to minimal size (assuming that the
result still can be classified as large).  Do you know of any big sets that
would qualify?  It should preferably a relatively "pure" N-input
data-classification problem, without all the additional issues (e.g.
translation invariance) that are present in image-processing and
speech-processing tasks.

    I think there are two interesting challenges here:
    1 - Explain theoretically why on-line is so much faster than batch
        (something that goes beyond the "multiple copies" argument).
    2 - Find more speedup methods that work with on-line training.

I have a hunch that if we work hard enough on speeding up online training,
we'll end up with something whose NET EFFECT is equivalent to the following:

1. Accumulate gradient data for a length of time that is adaptively chosen:
   Large enough for the gradients to be stable and accurate, but not large
   enough to be redundant.

2. Use something equivalent to one of the batch-processing acceleration
   techniques on this smoothed gradient.

That's not to say that the technique will necessary do this in an obvious
way -- it may be twiddling the weights each time a sample goes by -- but I
suspect this kind of accumulation, smoothing, and acceleration will be
present at some level.  As I said, for now this is just a hunch.

-- Scott Fahlman

P.S. I avoid using the term "on-line" for what I call "per-sample" or
"continuous" updating of weights.  For me, "online" means something else.
At this moment, I am sitting at my workstation watching one of my
batch-updating algorithms running "on-line" in front of me.