Batch methods versus stochastic methods...

mmoller@daimi.aau.dk
Mon Oct 21 08:13:06 EDT 1991



--- Concerning the discussion about batch update versus stochastic update.

For about the last six months we have been working on online versus batch problems.
A preprint of a paper that tries to explain why the stochastic methods are in some
instances better than the deterministic batch methods will soon be
available via the neuroprose archive.
The paper also introduces a new algorithm which combines the good properties
of the stochastic methods with those of the batch methods.
Our results so far can be summarized as follows:

The redundancy of the training set plays, as has been mentioned before,
a very important role. It is not clear, however, how to define this
redundancy in a proper way. The usual definition of redundancy taken
from information theory can give a hint about the redundancy, but it
cannot in any obvious way provide a precise definition, because this would
involve the information content of the training set as well as the
internal dynamics (the structure) of the network. So when we discuss
the concept of redundancy we should be aware that redundancy in the
context of learning in feedforward networks is not very well defined.

Another issue, which I think is even more important
than the concept of redundancy, is the structure of the error surface.
The "true" error surface, which is given by the whole training set,
is, as we know, often characterized by a large number of flat regions and
very steep, narrow ravines.
Batch methods operate in this true but very complex error
surface, while stochastic methods operate in partial error surfaces which
are only approximations to the true error surface. So stochastic methods
make a noisy, stochastic search in the true error surface, which can
help them through the flat regions. One can think of the stochastic search
as a kind of "simulated annealing" approach in which increases of the error
are also allowed.
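
To make the distinction concrete, here is a minimal sketch of the two update
rules (illustrative Python-like code; the simple squared-error example and the
names grad_E, batch_step and stochastic_step are only for illustration, they
are not taken from the paper):

    import numpy as np

    # Hypothetical per-pattern gradient; in a feedforward network this would
    # come from backpropagation on a single pattern.
    def grad_E(w, pattern):
        x, t = pattern
        return (w @ x - t) * x          # gradient of 0.5*(w.x - t)^2

    def batch_step(w, patterns, lr):
        # Batch update: one step along the gradient of the *true* error,
        # i.e. the error summed over the whole training set.
        g = sum(grad_E(w, p) for p in patterns)
        return w - lr * g

    def stochastic_step(w, pattern, lr):
        # Stochastic update: a step in a *partial* error surface defined by
        # a single pattern; the noise can carry the search across flat regions.
        return w - lr * grad_E(w, pattern)

    # Example: one batch step versus one stochastic sweep over the patterns.
    patterns = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), 0.0)]
    w = np.zeros(2)
    w_batch = batch_step(w, patterns, lr=0.1)
    for p in patterns:
        w = stochastic_step(w, p, lr=0.1)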

The algorithm we propose is based on a combination of the good
properties of stochastic and batch algorithms. The main idea is to
use a conjugate gradient algorithm on blocks of data (block update or
semi-batch update). Because conjugate gradient algorithms update weights
with variable (and sometimes large) step sizes, a validation scheme is used to
control the updates. Through a simple sampling technique we estimate the
probability that an update will decrease the total error. This probability
is then used to decide whether to update or not.
The number of patterns needed in each block update is variable and is
controlled by an adaptive optimization scheme during training.
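
Very roughly, one validated block update could be sketched as follows
(illustrative Python-like code, not our actual implementation; cg_step,
total_error, the threshold and the sample size are placeholder names and
parameters):

    import numpy as np

    def validated_block_update(w, patterns, total_error, cg_step,
                               block_size=100, sample_size=50, p_threshold=0.5):
        # Sketch of the idea: run a conjugate gradient step on a block of
        # patterns, then use a small random sample to estimate the probability
        # that the proposed update lowers the total error, and only accept the
        # update if that probability is high enough.
        rng = np.random.default_rng()

        idx = rng.choice(len(patterns), size=block_size, replace=False)
        block = [patterns[i] for i in idx]
        w_new = cg_step(w, block)             # CG update on the block's error

        # Estimate P(total error decreases) by comparing the error of the old
        # and new weights on a random sample of patterns.
        sample_idx = rng.choice(len(patterns), size=sample_size, replace=False)
        wins = sum(total_error(w_new, [patterns[i]]) < total_error(w, [patterns[i]])
                   for i in sample_idx)
        p_decrease = wins / sample_size

        if p_decrease > p_threshold:
            return w_new                      # accept the block update
        return w                              # reject the update

When the estimated probability is low, the natural reaction is to enlarge the
block, which is roughly what the adaptive scheme mentioned above does.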

We have done some experiments with this approach on the NETtalk problem.
Our results so far show that the approach decreases the error faster per
epoch than stochastic backpropagation, although more computation is needed
per epoch.
An interesting observation is that the block size needed
to make an update grows during learning, so that after a certain
number of epochs the block size is equal to the number of patterns.
When this happens the algorithm is equivalent to a traditional batch-mode
algorithm and no validation is needed anymore.
In order to be able to draw some definite conclusions we need a few more
experiments on different training sets.
Unfortunately, we do not have any datasets of the proper size,
so I would appreciate it if anyone could tell me where to find big
datasets that are publicly available.

-- Martin M

-----------------------------------------------------------------------
Martin F. Moller	       	email: mmoller at daimi.aau.dk
Computer Science Department	phone: +45 86202711 5223
Aarhus University		fax:    +45 86135725
Ny Munkegade, Building 540
8000 Aarhus C
Denmark
----------------------------------------------------------------------


 

