batch-mode parallel implementations

neural!lamoon.neural!yann at att.att.com
Thu Oct 17 10:46:39 EDT 1991


Several years ago, Steve Nowlan and I implemented a "batch-mode"
vectorized backprop on a Cray. Just as in Gary Cottrell's story, the
raw CUPS rate was high, but because batch mode converges so much more
slowly than on-line training, the net gain was zero.

I think Patrick Haffner and Alex Waibel had a similar experience
with their implementations of TDNNs on the Alliant. 

Now, the larger and more redundant the dataset, the larger the difference
in convergence speed between on-line and batch.
For small (and/or random) datasets, batch might be OK, but who cares.  
Also, if you need a very high accuracy solution (for function approximation
for example), a second-order batch technique will probably be better than
on-line.
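
To make the distinction concrete, here is a minimal sketch (mine, not part
of the original post) of the two update schemes on a simple least-squares
linear unit. The function names, the toy redundant dataset, and the
learning rates are illustrative assumptions, not anything from the post.

    import numpy as np

    def batch_epoch(w, X, y, lr):
        # Batch mode: accumulate the gradient over ALL patterns,
        # then make a single weight update at the end of the epoch.
        grad = np.zeros_like(w)
        for x_i, y_i in zip(X, y):
            err = np.dot(w, x_i) - y_i       # linear unit, squared error
            grad += err * x_i
        return w - lr * grad / len(X)

    def online_epoch(w, X, y, lr):
        # On-line (stochastic) mode: update the weights after EVERY
        # pattern, so each redundant pattern moves the weights immediately.
        for x_i, y_i in zip(X, y):
            err = np.dot(w, x_i) - y_i
            w = w - lr * err * x_i
        return w

    # Toy, highly redundant dataset: 10 base patterns repeated 100 times
    # with a little noise (purely illustrative).
    rng = np.random.default_rng(0)
    X = np.tile(rng.normal(size=(10, 3)), (100, 1))
    X += 0.01 * rng.normal(size=X.shape)
    y = X @ np.array([1.0, -2.0, 0.5])

    w_batch = np.zeros(3)
    w_online = np.zeros(3)
    for epoch in range(10):
        w_batch = batch_epoch(w_batch, X, y, lr=0.05)
        w_online = online_epoch(w_online, X, y, lr=0.05)

On a dataset like this, one on-line epoch performs a thousand small updates
while the batch epoch performs one, which is the intuition behind the
convergence gap on large, redundant training sets.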

Sadly, almost all speedup techniques for backprop apply only to batch (or
semi-batch) mode. That includes conjugate gradient, delta-bar-delta, and
most Newton or quasi-Newton methods (BFGS, etc.).
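
As one example of why these methods are tied to batch mode, here is a rough
sketch (my own, not from the post) of a delta-bar-delta step in the style of
Jacobs (1988); the parameter names and default constants are assumptions.
The per-weight sign test only makes sense on a gradient accumulated over the
whole training set, which is exactly what on-line training never computes.

    import numpy as np

    def delta_bar_delta_step(w, lrates, delta_bar, batch_grad,
                             kappa=0.01, phi=0.1, theta=0.7):
        # Each weight has its own learning rate: raised additively when the
        # current (batch) gradient agrees in sign with a running average of
        # past gradients, lowered multiplicatively when it disagrees.
        agree    = delta_bar * batch_grad > 0
        disagree = delta_bar * batch_grad < 0
        lrates = np.where(agree, lrates + kappa,
                 np.where(disagree, lrates * (1.0 - phi), lrates))
        w = w - lrates * batch_grad                # per-weight update
        delta_bar = (1.0 - theta) * batch_grad + theta * delta_bar
        return w, lrates, delta_bar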

I would love to see a clear demonstration that any of these methods beats a
carefully tuned on-line gradient on a large pattern classification problem.
I tried many of these methods several years ago, and failed.

I think there are two interesting challenges here:
1 - Explain theoretically why on-line is so much faster than batch
    (something that goes beyond the "multiple copies" argument).
2 - Find more speedup methods that work with on-line training.

  -- Yann  Le Cun

