batch-mode parallel implementations
neural!lamoon.neural!yann@att.att.com
Thu Oct 17 10:46:39 EDT 1991
Several years ago, Steve Nowlan and I implemented a "batch-mode"
vectorized backprop on a Cray. Just as in Gary Cottrell's story, the
raw CUPS (connection updates per second) rate was high, but because batch
mode converges so much more slowly than on-line, the net gain was zero.
I think Patrick Haffner and Alex Waibel had a similar experience
with their implementations of TDNNs on the Alliant.
Now, the larger and more redundant the dataset, the larger the difference
in convergence speed between on-line and batch.
For small (and/or random) datasets, batch might be OK, but who cares.
Also, if you need a very high-accuracy solution (in function approximation,
for example), a second-order batch technique will probably be better than
on-line.
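
To make the distinction concrete, here is a minimal sketch (in Python/numpy,
not the original Cray or Alliant code) contrasting the two update schedules
on a toy least-squares problem. The dataset, model size and learning rates
are illustrative assumptions, not values from any of the experiments above.

import numpy as np

rng = np.random.default_rng(0)

# Redundant dataset: many noisy copies of a few underlying patterns.
base = rng.normal(size=(10, 20))
X = np.tile(base, (50, 1)) + 0.01 * rng.normal(size=(500, 20))
y = X @ rng.normal(size=20)

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def batch_gd(epochs=20, lr=0.05):
    # Batch mode: one update per pass over the whole dataset.
    w = np.zeros(20)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(X)
        w -= lr * grad
    return loss(w)

def online_gd(epochs=20, lr=0.05):
    # On-line mode: one update per pattern, patterns visited in random order.
    w = np.zeros(20)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return loss(w)

print("batch  loss after 20 passes:", batch_gd())
print("online loss after 20 passes:", online_gd())

On redundant data like this, the on-line run typically reaches a much lower
loss for the same number of passes, which is the effect described above.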
Sadly, almost all speedup techniques for backprop apply only to batch (or
semi-batch) mode. That includes conjugate gradient, delta-bar-delta, most
Newton or quasi-Newton methods (BFGS, ...), etc.
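
As an illustration of why these methods are tied to batch mode, here is a
minimal sketch of one of them, delta-bar-delta (Jacobs, 1988): each weight
keeps its own learning rate, adapted by comparing the sign of the current
full-batch gradient with an exponential average of past gradients, so the
adaptation presupposes a stable batch gradient. The toy data and the
hyperparameters (kappa, phi, theta, lr0) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X @ rng.normal(size=20)

def delta_bar_delta(epochs=50, kappa=0.005, phi=0.5, theta=0.7, lr0=0.02):
    w = np.zeros(20)
    lr = np.full(20, lr0)        # one learning rate per weight
    delta_bar = np.zeros(20)     # exponential average of past gradients
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(X)   # requires the full-batch gradient
        agree = grad * delta_bar
        # Same sign as the recent past: grow the rate additively;
        # sign flip: shrink it multiplicatively; otherwise leave it alone.
        lr = np.where(agree > 0, lr + kappa,
             np.where(agree < 0, lr * phi, lr))
        delta_bar = (1 - theta) * grad + theta * delta_bar
        w -= lr * grad
    return 0.5 * np.mean((X @ w - y) ** 2)

print("delta-bar-delta loss:", delta_bar_delta())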
I would love to see a clear demonstration that any of these methods beats a
carefully tuned on-line gradient on a large pattern classification problem.
I tried many of these methods several years ago, and failed.
I think there are two interesting challenges here:
1 - Explain theoretically why on-line is so much faster than batch
(something that goes beyond the "multiple copies" argument).
2 - Find more speedup methods that work with on-line training.
-- Yann Le Cun