Why does the error rise in an SRN?

Tony Robinson ajr at eng.cam.ac.uk
Sun Apr 5 13:15:24 EDT 1992


In response to the lack of a true gradient signal in "simple recurrent"
(Elman-style) back-propagation networks, Scott Fahlman writes:

>It would be possible, but very expensive, to add the missing terms into the
>Elman net.  You end up with something that looks much like the
>Williams-Zipser RTRL model...

Computing a good approximation to the error signal need not be significantly
harder than computing Elman's approximation to it.  The method of expanding
the network in time achieves this by changing the per-pattern update to a
per-buffer update.  The buffer length should be longer than the expected
extent of the context effects, and shorter than the training set size if the
advantages of frequent updating are to be maintained [in practice this is not
a difficult constraint].  The method is:

  Replicate the network N times, where N is the buffer length, and stitch the
  copies together where the activations are to be passed forward, making one
  large network.

  Place N patterns at the N input positions, and do a forward pass.

  Place N targets at the N output positions and (using your favourite error
  measure) perform a standard backward pass through the large network.

  Add up all the partial gradients for every shared weight and use the result
  in your favourite hack of gradient descent.

Of course there are some end effects with a finite length buffer, but these
can be made small by making the buffer large enough, and placing the buffer
boundaries at different positions in the training data on subsequent passes.
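For concreteness, here is a minimal sketch of one per-buffer update in NumPy.
All of the names and choices (buffer_update, the layer sizes, the learning
rate, tanh units, a sum-of-squares error) are illustrative assumptions and not
part of the description above; the point is only the shape of the computation:
one forward pass through the N copies, one backward pass through the large
network, and a single summed gradient for each shared weight.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes and buffer length (N should exceed the expected
# extent of the context effects).
n_in, n_hid, n_out = 4, 8, 2
N = 16

# The shared weights; conceptually these are replicated N times when the
# network is expanded in time.
W_ih = rng.normal(scale=0.1, size=(n_hid, n_in))   # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden -> hidden (recurrent)
W_ho = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output

def buffer_update(xs, ts, h0, lr=0.1):
    """One per-buffer update: forward through the N copies, backward through
    the whole unrolled network, sum the partial gradients for each shared
    weight, then take a plain gradient-descent step."""
    hs, ys = [h0], []
    # Forward pass: place the N input patterns and propagate activations.
    for x in xs:
        h = np.tanh(W_ih @ x + W_hh @ hs[-1])
        hs.append(h)
        ys.append(W_ho @ h)
    # Backward pass: place the N targets and back-propagate through the
    # large network (sum-of-squares error used here).
    gW_ih, gW_hh, gW_ho = (np.zeros_like(W) for W in (W_ih, W_hh, W_ho))
    dh_next = np.zeros(len(h0))        # gradient arriving from later copies
    for t in reversed(range(len(xs))):
        dy = ys[t] - ts[t]
        gW_ho += np.outer(dy, hs[t + 1])
        dpre = (W_ho.T @ dy + dh_next) * (1.0 - hs[t + 1] ** 2)  # through tanh
        gW_ih += np.outer(dpre, xs[t])
        gW_hh += np.outer(dpre, hs[t])
        dh_next = W_hh.T @ dpre
    # Add up all the partial gradients for every shared weight (done above)
    # and apply them with plain gradient descent.
    for W, g in ((W_ih, gW_ih), (W_hh, gW_hh), (W_ho, gW_ho)):
        W -= lr * g
    return hs[-1]   # carry the final state forward into the next buffer

# Example: one buffer of random data.
xs = rng.normal(size=(N, n_in))
ts = rng.normal(size=(N, n_out))
h = buffer_update(xs, ts, np.zeros(n_hid))

To keep the end effects small, one would make N comfortably larger than the
longest context that matters and start the buffers at different offsets into
the training data on each pass, as described above.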

However, adding all those extra nasty non-linearities into the gradient
signal gives a much harder training problem.  I think it is worth it for the
increase in the computational power of the network.


Tony [Robinson]


