batch-mode parallel implementations

Kamil A. Grajski kamil at apple.com
Mon Oct 14 19:41:34 EDT 1991


Hi folks,

In reviewing some implementations of back-prop type algorithms on
parallel machines, it is apparent that several such implementations
obtain their high performance because of batch-mode training.
What this means is that one operates on N independent training
patterns simultaneously, collects all the weight-update
information, and updates the weights once per N samples.  Examples
where this has been used (among others) include the GF-11, MasPar,
CM-2, Warp (I think, at least for a self-organizing feature map
implementation), etc.  In many papers, I have read passing references to the fact that
real-time learning is preferred (in practice) over the theoretically
indicated batch-mode (so-called "true gradient") learning.  Some of
the arguments given include "faster" convergence and "better"
generalization.  Are the convergence and generalization arguments
linked at some deeper level of analysis?  (You could have fast
convergence which generalizes poorly, etc.) I have played with this
just a little bit on small speech and other datasets without reaching
any conclusive results.
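
For concreteness, here is a minimal sketch of the two update
schedules for a single linear unit trained by gradient descent on
squared error.  It is written in present-day Python/NumPy purely for
illustration and is not taken from any of the implementations named
above.  The batch-mode version accumulates the gradient over all N
patterns before making one weight update (which is what parallelizes
naturally across patterns), while the real-time version updates after
every sample, so later patterns see the already-modified weights.

import numpy as np

def batch_update(w, X, y, lr):
    """One weight update from the gradient averaged over all N
    patterns (the 'true gradient', batch-mode schedule)."""
    err = X @ w - y            # errors on all N patterns at once
    grad = X.T @ err / len(y)  # average gradient over the batch
    return w - lr * grad

def per_pattern_updates(w, X, y, lr):
    """N weight updates, one after each pattern (real-time,
    on-line schedule)."""
    for x_i, y_i in zip(X, y):
        err = x_i @ w - y_i
        w = w - lr * err * x_i  # update immediately
    return w

# Toy usage: same data and starting weights, different schedules.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
w0 = np.zeros(5)
w_batch = batch_update(w0.copy(), X, y, lr=0.1)
w_online = per_pattern_updates(w0.copy(), X, y, lr=0.1)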

I am wondering whether there have been any definitive studies,
theoretical and/or practical, that really confront this issue.
How big an issue is this for people?  For example, would you NOT
look at a parallel design which assumes batch-mode training?

Kamil
P.S.  If this is a dead issue and I missed the funeral, I apologize.

================
Kamil A. Grajski
Apple Computer
(408) 974-1313
kamil at apple.com
================

