Increase in Error in SRN

Ray Watrous watrous at cortex.siemens.com
Mon Apr 6 15:42:45 EDT 1992


The following summary and clarification of the further comments on
increased error for recurrent networks might be helpful:

There are three possible sources of an increase in error:

1. The approximation introduced in online learning by considering only
one example at a time.

This method of gradient approximation clearly should be paired with an
infinitesimal step size algorithm, since long range extrapolations (as
in variable step size algorithms) would lead to too large a change in
the model based on insufficient data; this would destabilize learning.
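
For concreteness, a minimal sketch in Python of the two kinds of
update (this is illustration only, not code from the discussion;
grad_fn stands for an assumed per-example gradient routine):

    def batch_gradient(w, X, Y, grad_fn):
        # Exact gradient: average the per-example gradients over the
        # whole training set before taking a step.
        return sum(grad_fn(w, x, y) for x, y in zip(X, Y)) / len(X)

    def online_step(w, x, y, grad_fn, lr=1e-3):
        # Online learning: the gradient of one example stands in for
        # the batch gradient, so the step size must be kept small to
        # avoid large changes based on insufficient data.
        return w - lr * grad_fn(w, x, y)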

2. The approximation in the gradient computation for a recurrent
network by truncating the backward recursion.

Here, the computation of the full gradient by backpropagation-in-time
is no more expensive than the truncated version; it requires only that
the activation history for a token be recorded and the gradient
information accumulated in a right-to-left pass, analogous to the
top-to-bottom pass in normal backprop.  Thus, the approximation is
unnecessary. A forward form of the complete gradient is computationally
more complex (see Barak Pearlmutter, Two New Learning Procedures for
Recurrent Networks, Neural Network Review, v3, pp 99-101, 1990).
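
As an illustration, here is a sketch of the full backward recursion
for a simple recurrent network (Python/NumPy, assuming tanh hidden
units and squared error; this is not code from any of the papers
cited). The forward pass records the activation history; the backward
pass then accumulates the gradient right to left over the whole token,
at the same order of cost per token as the truncated version:

    import numpy as np

    def srn_full_bptt(W_in, W_rec, W_out, inputs, targets):
        # Forward pass: record the activation history h[0..T].
        T = len(inputs)
        h = [np.zeros(W_rec.shape[0])]
        err = []
        for t in range(T):
            h.append(np.tanh(W_in @ inputs[t] + W_rec @ h[-1]))
            err.append(W_out @ h[-1] - targets[t])
        # Backward pass: accumulate gradients right to left.
        gW_in, gW_rec = np.zeros_like(W_in), np.zeros_like(W_rec)
        gW_out = np.zeros_like(W_out)
        dh = np.zeros(W_rec.shape[0])
        for t in range(T - 1, -1, -1):
            gW_out += np.outer(err[t], h[t + 1])
            dh = dh + W_out.T @ err[t]        # gradient w.r.t. state h[t+1]
            dz = dh * (1.0 - h[t + 1] ** 2)   # back through tanh
            gW_in += np.outer(dz, inputs[t])
            gW_rec += np.outer(dz, h[t])
            dh = W_rec.T @ dz                 # carried back to state h[t]
        return gW_in, gW_rec, gW_out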

3. The use of a fixed step-size algorithm, which is known to be
unstable.

This is where a line search, a golden-section search, or some other
such method can be used to control the descent of the error.
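
For example, a generic golden-section search over the step size might
look like the sketch below (Python/NumPy; f is the error as a function
of the weights, and the bracket [0, max_step] is an assumption of this
illustration, not anything from the original discussion):

    import numpy as np

    def golden_section_step(f, w, direction, max_step=1.0, tol=1e-4):
        # Shrink the bracket [a, b] on the step size until it is
        # narrower than tol, then return its midpoint.
        invphi = (np.sqrt(5.0) - 1.0) / 2.0
        a, b = 0.0, max_step
        while b - a > tol:
            c = b - invphi * (b - a)
            d = a + invphi * (b - a)
            if f(w + c * direction) < f(w + d * direction):
                b = d
            else:
                a = c
        return 0.5 * (a + b)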

So, roughly, the situation is this: fixed step-size methods can be
used with approximate-gradient methods, with the possibility that the
error will increase; variable step-size methods can be used with
complete-gradient methods, in which case the error is guaranteed to
decrease monotonically.
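
In code, the second pairing amounts to a loop like the sketch below
(f and grad stand for assumed routines computing the error and its
complete gradient; golden_section_step is the search sketched above):

    def train(w, f, grad, n_iters=100):
        # Complete gradient plus a line search along the descent
        # direction; with an exact gradient and a fine enough search,
        # the error should not increase from one iteration to the next.
        errors = [f(w)]
        for _ in range(n_iters):
            d = -grad(w)
            step = golden_section_step(f, w, d)
            w = w + step * d
            errors.append(f(w))
        return w, errors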

Gary Kuhn has reported good results using a forward-in-time complete
gradient algorithm based on estimates of the gradient over a balanced
subset of the training data that increases during training (Some
Variations on the Training of Recurrent Networks, with Norman
Herzberg, in Neural Networks: Theory and Applications, Mammone and
Zeevi, eds., Academic Press, 1991).
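
A rough sketch of that kind of schedule (the per-class grouping and
the subset sizes below are invented purely for illustration):

    def growing_balanced_subset(examples_by_class, epoch, start=2, growth=2):
        # Draw the same number of examples from each class, and let
        # that number grow as training proceeds.
        n = start + growth * epoch
        return [x for cls in examples_by_class for x in cls[:n]]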

Whether the network instantiates time-delay links is relevant only if
it is restricted to a feedforward architecture; in that case, only
considerations 1 and 3 apply. Recurrent time-delay models have
been successfully trained using the complete gradient and a
line-search (embedded in a quasi-Newton optimization method), with the
result that there has been no increase in the objective function. (R.
Watrous, Phoneme Recognition Using Connectionist Networks, J. Acoust.
Soc. Am. 87(4) pp 1753-1772, 1990).


Raymond Watrous
Siemens Corporate Research
755 College Road East
Princeton, NJ 08540

(609) 734-6596



