batch-mode parallel implementations
Scott_Fahlman@SEF-PMAX.SLISP.CS.CMU.EDU
Tue Oct 15 01:23:47 EDT 1991
I don't recall seeing any studies that claim better generalization for
per-sample or continuous updating than for per-epoch or batch updating.
Can you supply some citations? The only reason I can think of for better
generalization in the per-sample case would be a weak sort of
simulated-annealing effect, with the random variation among individual
training samples helping to jiggle the system out of small local minima in
the vicinity of the best answer.
As for speed of convergence, continuous updating clearly beats per-epoch
updating if the training set is highly redundant. To see this, imagine
taking a small set of training cases, duplicating that set 1000 times, and
presenting the resulting huge set as the training set. Per-sample updating
would probably have converged on a good set of weights before the first
per-epoch weight adjustment is ever made. Also, in some cases it is simply
not practical to use per-epoch updating. There may be a stream of
ever-changing data going by, and it may be impractical to store a large set
of samples from this data stream for repeated use.
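To make the contrast concrete, here is a minimal sketch in Python of the two
update schedules, assuming a single linear unit trained with squared error
(the functions, toy data, and learning rate below are purely illustrative,
not anything from the original discussion):

    # Illustrative sketch only: one linear unit, squared error.
    import numpy as np

    def per_sample_pass(w, X, y, lr=0.01):
        # Continuous / per-sample updating: adjust the weights after
        # every individual training case.
        for x_i, y_i in zip(X, y):
            err = np.dot(w, x_i) - y_i      # error on this one case
            w = w - lr * err * x_i          # immediate weight adjustment
        return w                            # len(X) updates in one pass

    def per_epoch_pass(w, X, y, lr=0.01):
        # Batch / per-epoch updating: accumulate the gradient over the
        # whole training set, then make a single weight adjustment.
        grad = np.zeros_like(w)
        for x_i, y_i in zip(X, y):
            grad += (np.dot(w, x_i) - y_i) * x_i
        return w - lr * grad / len(X)       # one update per full pass

    # Duplicating a small training set 1000 times multiplies the work
    # behind per_epoch_pass's single adjustment, while per_sample_pass
    # may converge long before that adjustment is ever made.
    X = np.tile(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]), (1000, 1))
    y = np.tile(np.array([1.0, -1.0, 0.0]), 1000)
    w0 = np.zeros(2)
    print(per_sample_pass(w0.copy(), X, y))
    print(per_epoch_pass(w0.copy(), X, y))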
On the other hand, it is rather dangerous to use continuous updating with
high learning rates or with techniques that adjust the learning rate based
on some sort of second-derivative estimate. If you are not very careful, a
few atypical cases in a row can accelerate you right out of the solar
system. Some techniques, such as quickprop and most of the conjugate
gradient methods, depend on the ability to look at the same set of training
examples more than once, so they are inherently per-epoch methods.
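To illustrate the first point above, a toy per-sample run (invented numbers,
not from the post) in which a few atypical cases in a row, combined with a
learning rate that is too high, drive the weights outward instead of inward:

    import numpy as np

    w = np.zeros(2)
    lr = 1.5                                         # deliberately too high
    atypical = [(np.array([3.0, -3.0]), 10.0)] * 5   # a few outliers in a row
    for x_i, y_i in atypical:
        w = w - lr * (np.dot(w, x_i) - y_i) * x_i    # per-sample update
        print(np.linalg.norm(w))                     # weight norm grows each step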
In my opinion, the best solution in most situations is probably to use one
of the accelerated convergence methods and to update the weights after an
"epoch" whose size is chosen by the experimenter. It must be sufficiently
large to give a reasonably stable picture of the overall gradient, but not
so large that the gradient is computed many times over before a
weight-update cycle occurs. However, I am sure that this view is not
universally accepted: some people seem to believe that per-sample updating
is superior in all cases.
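For concreteness, a sketch of such an experimenter-chosen "epoch," under the
same illustrative linear-unit setup as above (the epoch_size value and the
function name are assumptions, not part of any published recipe); the plain
gradient step inside the loop stands in for whichever accelerated convergence
method is actually used:

    import numpy as np

    def chosen_epoch_pass(w, X, y, epoch_size=32, lr=0.01):
        for start in range(0, len(X), epoch_size):
            Xc = X[start:start + epoch_size]
            yc = y[start:start + epoch_size]
            grad = np.zeros_like(w)
            for x_i, y_i in zip(Xc, yc):
                grad += (np.dot(w, x_i) - y_i) * x_i
            # One adjustment per chosen-size epoch: large enough to give
            # a reasonably stable picture of the gradient, small enough
            # that the gradient is not computed many times over between
            # weight-update cycles.
            w = w - lr * grad / len(Xc)
        return w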
-- Scott Fahlman