Why does the error rise in an SRN?

Scott_Fahlman at SEF-PMAX.SLISP.CS.CMU.EDU
Fri Apr 3 01:43:33 EST 1992


    I have been working with the Simple Recurrent Network (Elman style)
    and variants thereof for some time. Something which seems to happen
    with surprising frequency is that the error will decrease for a period
    and then will start to increase again.

As Ray Watrous suggests, your problem might be due to online updating with
a fixed step size, but I have seen the same kind of problem with batch
updating, in which none of the weights are updated until the error gradient
dE/dw has been computed over the whole set of training sequences.  And this
was with Quickprop, which adjusts the step-size dynamically.  In fact,
I know of several independent attempts to apply Quickprop learning to Elman
nets, most with very disappointing results.

I think that you're running into an approximation that often causes trouble
in Elman-style recurrent nets.  In these nets, in addition to the usual
input units, we have a set of "state" variables that hold the values of
the hidden units from the previous time-step.  These state variables are
treated just like the inputs during network training.  That is, we pretend
that the state variables are independent of the weights being trained,
and we compute dE(t)/dw for all the network's weights based on that
assumption.
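To make the truncation concrete, here is a rough sketch in modern NumPy of one
time step of that training scheme.  The names (elman_step, W_in, W_state,
W_out) are invented for illustration only; the point is just where the
backward pass stops:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elman_step(x, state, target, W_in, W_state, W_out):
    # One time step of Elman-style training.  The "state" argument is the
    # hidden vector copied from time t-1 and is treated as a constant, so
    # the backward pass below never differentiates through it.
    h = sigmoid(W_in @ x + W_state @ state)      # hidden units at time t
    y = sigmoid(W_out @ h)                       # network output at time t
    err = y - target                             # dE/dy for squared error

    delta_y = err * y * (1 - y)                  # error at the output layer
    delta_h = (W_out.T @ delta_y) * h * (1 - h)  # error at the hidden layer

    grad_W_out   = np.outer(delta_y, h)
    grad_W_in    = np.outer(delta_h, x)
    grad_W_state = np.outer(delta_h, state)      # state handled like an input;
                                                 # no dS(t)/dw term is added
    new_state = h                                # copied forward to time t+1
    loss = 0.5 * float(np.sum(err ** 2))
    return (grad_W_in, grad_W_state, grad_W_out), new_state, loss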

However, the state variables are not really independent of the network's
weights, since they are just the hidden-unit values from time t-1.  The
true value of dE(t)/dw will include terms involving dS(t)/dw for the
various weights w that affect the state variables S.  Or, if you prefer,
they will include dH(t-1)/dw, for the hidden units H.  These terms are
dropped in the usual Elman or SRN formulation, but that can be dangerous,
since they are not negligible in general.  In fact, it is these terms that
implement the "back propagation in time", which can alter the network's
weights so that a state bit is set at one point and used many cycles later.
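In code, keeping the dropped terms means carrying the error back through the
state copy into the previous time step, i.e. unrolling the net and
back-propagating through the whole sequence.  A sketch, continuing with the
same invented names (and reusing sigmoid and numpy from the sketch above):

def bptt_gradients(xs, targets, W_in, W_state, W_out, h0):
    # Full back-propagation through time over one sequence.  The only change
    # from the truncated version is the "carry" term, which is exactly the
    # dH(t-1)/dw contribution that the Elman formulation drops.
    T = len(xs)
    hs, ys = [h0], []
    for t in range(T):                           # forward pass, saving every h
        h = sigmoid(W_in @ xs[t] + W_state @ hs[-1])
        hs.append(h)
        ys.append(sigmoid(W_out @ h))

    gW_in, gW_state, gW_out = (np.zeros_like(W) for W in (W_in, W_state, W_out))
    carry = np.zeros_like(h0)                    # dE/dh(t) arriving from t+1, t+2, ...

    for t in reversed(range(T)):                 # backward pass over the sequence
        y, h, h_prev = ys[t], hs[t + 1], hs[t]
        delta_y = (y - targets[t]) * y * (1 - y)
        delta_h = (W_out.T @ delta_y + carry) * h * (1 - h)
        gW_out   += np.outer(delta_y, h)
        gW_in    += np.outer(delta_h, xs[t])
        gW_state += np.outer(delta_h, h_prev)
        carry = W_state.T @ delta_h              # error sent back through the state copy
    return gW_in, gW_state, gW_out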

So in an Elman net, even if you are using batch updating, you are not
following the true error gradient dE/dw, but only a rough approximation to
it.  Often this will get you to the right place, or at least to a very
interesting place, but it causes a lot of trouble for algorithms like
Quickprop that try to follow the (alleged) gradient more aggressively.
Even if you descend the pseudo-gradient slowly and carefully, you will
often see that the true error begins to increase after a while.
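If you want to see how rough the approximation is for a particular net, one
illustrative check (reusing the two sketches above) is to sum the truncated
per-step gradients over a sequence and compare them with the full BPTT
gradient of the same total error.  The output-layer gradient matches exactly,
since the forward pass is identical in both cases; the input-side and
state-side pieces are where the two directions can disagree:

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 5, 2, 20
W_in    = rng.normal(scale=0.5, size=(n_hid, n_in))
W_state = rng.normal(scale=0.5, size=(n_hid, n_hid))
W_out   = rng.normal(scale=0.5, size=(n_out, n_hid))
xs      = [rng.normal(size=n_in)   for _ in range(T)]
targets = [rng.uniform(size=n_out) for _ in range(T)]

# Truncated (Elman-style) gradient: just sum the per-step gradients.
trunc = [np.zeros_like(W) for W in (W_in, W_state, W_out)]
state = np.zeros(n_hid)
for x, tgt in zip(xs, targets):
    grads, state, _ = elman_step(x, state, tgt, W_in, W_state, W_out)
    for acc, g in zip(trunc, grads):
        acc += g

# True gradient of the same total error, via back-propagation through time.
full = bptt_gradients(xs, targets, W_in, W_state, W_out, np.zeros(n_hid))

for name, a, b in zip(("W_in", "W_state", "W_out"), trunc, full):
    cos = float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    print(f"{name:8s} cosine(truncated, true) = {cos:.3f}")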

It would be possible, but very expensive, to add the missing terms into the
Elman net.  You end up with something that looks much like the
Williams-Zipser RTRL model, which basically requires you to keep a matrix
showing the derivative of every state value with respect to every weight.
In a net that allows only self-recurrent connections, you only need to save
one extra value for each input-side weight, so in these networks it is
practical to keep the extra terms.  Such models, including Mike Mozer's
"Focused" recurrent nets and my own Recurrent Cascade-Correlation model,
don't suffer from the approximation described above.
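To see why the self-recurrent case is so cheap, consider a single
self-recurrent unit with h(t) = f(w.x(t) + w_self*h(t-1)).  One stored
sensitivity per input-side weight, updated forward in time, gives the exact
dh(t)/dw with no unrolling.  A sketch (invented names again, reusing sigmoid
and numpy from above):

class SelfRecurrentUnit:
    # Illustrative sketch of the bookkeeping used in "focused" recurrent units:
    # p[i] carries dh(t)/dw[i] exactly, because the only recurrent path is the
    # unit's own self-loop.
    def __init__(self, n_in, rng):
        self.w      = rng.normal(scale=0.5, size=n_in)   # input-side weights
        self.w_self = float(rng.normal(scale=0.5))       # self-recurrent weight
        self.h      = 0.0                                 # previous output h(t-1)
        self.p      = np.zeros(n_in)                      # dh/dw, one value per weight
        self.p_self = 0.0                                 # dh/dw_self

    def step(self, x):
        # Advance one time step and update the exact sensitivities:
        #   dh(t)/dw      = f'(net) * (x(t)   + w_self * dh(t-1)/dw)
        #   dh(t)/dw_self = f'(net) * (h(t-1) + w_self * dh(t-1)/dw_self)
        net = float(self.w @ x) + self.w_self * self.h
        h_new = float(sigmoid(net))
        fprime = h_new * (1.0 - h_new)
        self.p      = fprime * (x + self.w_self * self.p)
        self.p_self = fprime * (self.h + self.w_self * self.p_self)
        self.h = h_new
        return h_new

    def gradients(self, dE_dh):
        # Given the error signal dE/dh(t), the exact weight gradients follow
        # by the chain rule from the stored sensitivities.
        return dE_dh * self.p, dE_dh * self.p_self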

-- Scott
===========================================================================
Scott E. Fahlman
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213

Internet: sef+ at cs.cmu.edu


