Transfer in Recurrent Networks: A preliminary report and request for advice

Jeff Shrager shrager at xerox.com
Mon Jun 29 04:18:20 EDT 1992


Following is an abbreviated version of a preliminary report on our attempts
to produce instructed training and transfer in recurrent networks.  I am
posting it in the hopes of soliciting advice as to how to proceed (or not)
and pointers to related work.  The longer version of the report is actually
not yet available as results are being computed even as I write. (This
version was produced by basically just removing everything that said: "<put
the results of experiment XXX here>.")

We would appreciate any thoughts that you have on how to proceed or where
to look for related work.

Jeff Shrager
David Blumenthal

(Incidentally, I'd be happy to give the Common Lisp (Sun Franz 4.1) bp driver
code (mentioned below) to anyone who has the need to run similar supervised
experiments with the McClelland and Rumelhart programs.  It's slightly
special purpose, but very easy to modify.)

           --- Please Do Not Quote or Redistribute ---

We have been exploring training regimes, labeling, and transfer in
recurrent backpropagation networks.  Our goal in this research is to model
three aspects of human development: First, people learn to associate words
with actions.  Second, given such associations, people can, on command, do
one or a sequence of actions.  Third, by practicing sequential actions put
together as a result of verbal direction (or self-direction), they can
learn new skills, and give new labels to these skills.  Finally, each of
these processes may require (or make use of) physical guidance or tutorial
remediation by an "expert".  For people, this is especially the case for
the first of these phenomena.

The metaphorical model that we use in considering these phenomena is that
of interaction between a parent and child during joint activity, such as
baking muffins.  Shrager & Callanan (1991: Proceedings of the Cognitive
Science Conference) studied the various means by which parents and their
3-, 4-, and 5-year-old children scoop baking soda out of a box
for a muffin recipe.  It was observed, first, that there is a large amount
of non-directive information in the environment, especially in the verbal
context, that a learner such as the child might pick up on in order to
learn this skill.  Furthermore, it was observed that remediation takes
place differently at different ages; through physical guidance in the
earlier years, and through verbal instruction later on.

We set out to model such a collaborative skill acquisition setting using an
algorithmic teacher (`parent') and a Jordan-style recurrent connectionist
sequence learner (`child').  The problem was simplified by reducing it to
that of training a net to produce a sequence of real-valued (x,y)
coordinates corresponding to a simple sequence of positions for the spoon.
We chose interior real-valued coordinates for our points instead of 0's and
1's to avoid possible edge effects.  Figure 1 exemplifies the learning
task.

                     ___________
     (.2,.4)        |           | 
        *<<<<[out]<<<<<<*(.8,.4)|
        *<=============+^       |  
           [scoop]     $^       |
                       $^ [up]  |
        *>=============+^       |
        *(.2,.2)>[in]>>>*(.8,.2)|
                    |___________|

Figure 1: The outline box is meant to represent a box of baking soda.
Stars represent the starting and ending points that were trained. Arrows
indicate the path heads and tails.  Verbal labels (names) are enclosed in
[brackets].  Each step of the outer path: in+up+out (.2,.2)->(.8,.2),
(.8,.2)->(.8,.4), (.8,.4)->(.2,.4), is intended to be individually trained
into a recurrent network, using a different label.  Then the unified path,
called "scoop" (>==...>$$>==...>) is to be either verbally composed using
the previously learned labels, or else guided through along with the
labels.  In either case, scoop also has its own label.  We think of a
parent telling a child something like: to scoop the baking soda you put the
spoon in, then bring it up against the box top and pull it out (to level
the amount), while, in the case of a younger child, physically guiding him
or her through these actions.

Goals 

We wished to show that training of [scoop] is facilitated by pretraining
with combinations of [in], [up], and [out], or, conversely, that the
learning of these sequences is facilitated by pretraining with [scoop].
Secondarily, we wished to explore the function of `label' interactions,
where by `label' we mean the presented inputs at the non-recurrent input
units of the network.  

General Method

For most of the experiments reported here we used a Jordan-style recurrent
network with 3 plan units, 2 context units, 3 hidden units, and 2 output
units, with the context units fed back from the output units.  All of our
nets will be identified by the number of units in each part of the net,
labeled in the aforementioned order.  Thus, the just-described net will be
called: 3232 (Figure 2).

                -----------------------
                 8 9            output

                5 6 7           hidden

                (0 1 2)  (3 4)  (plan)   (context)

Figure 2: The recurrent network, represented in accord with the numbering
scheme used by the bp program.
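For concreteness, here is a minimal sketch (in Python with NumPy; this is
our own illustration, not the bp program, and all class and variable names
are ours) of a forward step through the 3232 Jordan network, with the
context units copying back the previous output activations -- the role
played by bp's negative-index hack:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Jordan3232:
    """Sketch of the 3232 Jordan-style net: 3 plan + 2 context inputs,
    3 hidden units, 2 output units.  The context units hold a copy of
    the previous output activations (zero on the first step, as in the
    '0 0' context entries of the .pat files)."""

    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.W_ih = rng.uniform(-0.5, 0.5, (3, 5))  # 5 inputs -> 3 hidden
        self.b_h = np.zeros(3)
        self.W_ho = rng.uniform(-0.5, 0.5, (2, 3))  # 3 hidden -> 2 outputs
        self.b_o = np.zeros(2)
        self.context = np.zeros(2)                  # starts from rest

    def step(self, plan):
        x = np.concatenate([plan, self.context])    # 3 plan + 2 context
        h = sigmoid(self.W_ih @ x + self.b_h)
        y = sigmoid(self.W_ho @ h + self.b_o)
        self.context = y.copy()                     # feed outputs back
        return y

net = Jordan3232()
for _ in range(4):                                  # e.g. the four [scoop] steps
    y = net.step(np.array([1.0, 1.0, 0.0]))         # label 110 held constant
```

Each call to step produces one (x,y) spoon coordinate; the sequence
emerges because the context changes from step to step while the label
stays fixed.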

The "bp" program (McClelland & Rumelhart) was used to handle the network
learning through backpropagation.  A lisp front-end was written for bp
within which simple algorithmic experiments could be run to train the
network on a sequence of different inputs and to various criteria.

Data: 

Figure 3 shows each training set.  The individual subsequences ([in]/[out])
were given different input codings (010/100), and the entire sequence
([scoop]) was given still another code (generally the combined code: 110).
We considered the input codes as labels.  Thus each action had a different
label.

Figure 3: The training patterns we used in this experiment: (Negative
context numbers refer to the number of the unit the context unit is linked
to.  See M&R, pgs 157-158 for details of this evil hack.)

in3232.pat

in0     0 1 0   0  0    .2 .2
in1     0 1 0  -8 -9    .8 .2
        \   /  \   /    \    /
        label context   target


out3232.pat

scr0    1 0 0   0  0    .8 .4
scr1    1 0 0  -8 -9    .2 .4


InOut3232.pat

InOut0  0 1 0   0  0    .2 .2
InOut2  0 1 0  -8 -9    .8 .2
InOut3  1 0 0   0  0    .8 .4
InOut4  1 0 0  -8 -9    .2 .4


scoop3232.pat

scoop0  1 1 0   0  0    .2 .2
scoop1  1 1 0  -8 -9    .8 .2
scoop2  1 1 0  -8 -9    .8 .4
scoop3  1 1 0  -8 -9    .2 .4
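To make the encoding concrete, the pattern files above can be rendered as
plain data (a hypothetical Python rendering of ours, not part of the M&R
software).  Each pattern is a label triple, a context spec (0 means "start
from rest"; -8 -9 means "copy the previous activations of units 8 and 9,
the outputs"), and an (x,y) target:

```python
# (name, label, context_spec, target) for each training pattern.
PATTERNS = {
    "in3232.pat": [
        ("in0", (0, 1, 0), (0, 0), (0.2, 0.2)),
        ("in1", (0, 1, 0), (-8, -9), (0.8, 0.2)),
    ],
    "out3232.pat": [
        ("scr0", (1, 0, 0), (0, 0), (0.8, 0.4)),
        ("scr1", (1, 0, 0), (-8, -9), (0.2, 0.4)),
    ],
    # (InOut3232.pat is just the concatenation of the in and out patterns.)
    "scoop3232.pat": [
        ("scoop0", (1, 1, 0), (0, 0), (0.2, 0.2)),
        ("scoop1", (1, 1, 0), (-8, -9), (0.8, 0.2)),
        ("scoop2", (1, 1, 0), (-8, -9), (0.8, 0.4)),
        ("scoop3", (1, 1, 0), (-8, -9), (0.2, 0.4)),
    ],
}

in_label = PATTERNS["in3232.pat"][0][1]     # (0, 1, 0)
out_label = PATTERNS["out3232.pat"][0][1]   # (1, 0, 0)
# [scoop]'s label is the "combined code" mentioned in the text:
scoop_label = tuple(a | b for a, b in zip(in_label, out_label))
```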

Parameters: We shall use the phrases "fully trained" and "to criterion" to
mean that the total sum of squares (tss) had fallen to less than or equal
to the error criterion ("ecrit", in the language of bp) of 0.01.  Unless
otherwise specified, weights were updated after each epoch of training
(epoch mode).  The training driver proceeded in steps of no smaller than 10
epochs; therefore, all results are recorded at some increment of 10 epochs,
even if bp had reached criterion before that point.
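The driver's stepping policy can be sketched as follows (a hypothetical
Python stand-in for the lisp front-end; `train_epochs` and
`total_sum_of_squares` are assumed hooks into the underlying learner, not
bp's actual interface):

```python
def train_to_criterion(train_epochs, total_sum_of_squares,
                       ecrit=0.01, step=10, max_epochs=100_000):
    """Train in increments of `step` epochs until tss <= ecrit.
    The returned epoch count is therefore always a multiple of
    `step`, even if criterion was reached mid-step."""
    epochs = 0
    while epochs < max_epochs:
        train_epochs(step)
        epochs += step
        if total_sum_of_squares() <= ecrit:
            return epochs
    raise RuntimeError("never reached criterion "
                       "(cf. the runaway trials noted under Group 2)")

# Toy illustration: a fake learner whose tss halves every epoch.
state = {"tss": 1.0}
def fake_train(n): state["tss"] *= 0.5 ** n
def fake_tss(): return state["tss"]
epochs = train_to_criterion(fake_train, fake_tss)
```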

Experiments 

The value of interest to us in our initial experiments is the number of
training epochs required to learn [scoop] to criterion, given various prior
experience.  That is, we tried to train different parts of the sequence
individually before training the whole.  There ought to be some savings
from training on a simpler task ([in], [out], or combinations) that
transfers to, and speeds up, the training of the whole.

Two general groups of studies were carried out: Group 1 studies pretrained
the network with various combinations of [in], [out], [up], to varying
degree, and then recorded the time to train [scoop] to criterion.  Group 2
studies did the opposite, pretraining with [scoop] and recording the
training time for the subsequences.  In most cases, different labels,
composed from simple binary values (e.g., 010, 101) were assigned to each
subsequence, and then [scoop] was given the unified label (111) or average
label (.5 .5 .5).

Each reported mean and deviation results from 50 repetitions of the
experiment, carried out on a newly started copy of bp (thus guaranteeing
random initial weights).  Deviations will be reported in parentheses
following means.  If no deviation is reported, the value is not a mean.

When a scoop training value is reported, it is a mean (sd) difference
between the end of pretraining (to whatever criterion is indicated, or
0.01), and the point at which the [scoop] pattern reached criterion, on
a per-trial basis.  Thus, unless otherwise specified, the phrase
"pretrained" means "pretrained to criterion (of ecrit 0.01, to the next
increment of 10 epochs)".
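The bookkeeping behind a reported value is then just the following
(hypothetical helper of ours, shown with made-up epoch counts rather than
real data):

```python
from statistics import mean, stdev

def scoop_training_value(pretrain_end, scoop_end):
    """Summarize per-trial differences between the epoch at which
    pretraining ended and the epoch at which [scoop] reached
    criterion, as the reported mean and standard deviation."""
    diffs = [s - p for p, s in zip(pretrain_end, scoop_end)]
    return mean(diffs), stdev(diffs)

# Toy illustration (three fake trials, NOT our actual data):
m, sd = scoop_training_value([100, 120, 90], [420, 400, 430])
```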

Group 1 Studies (on the 3232 network)

The training of [scoop] alone required 316 (87) epochs.  Pretraining with
[in] resulted in [scoop] training time of 326 (236).  This difference was
(pretty obviously!) not significant by a t-test.  However, pretraining
with [in] followed by [out] and then [scoop] resulted in a much longer
training time of 600 (351), which differs from [scoop] alone (t(8)=2.56,
p<.025) and from in+scoop (t(8)=2.05, p<.05).  Similar results were
obtained by pretraining with InOut.
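For readers who want the comparison spelled out, the statistic is an
ordinary two-sample t on epoch counts.  Here is a sketch using Welch's
variant (our choice for illustration; the original analysis may have used
a pooled-variance t), applied to synthetic epoch counts, NOT the real
data:

```python
import random
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / (va / len(a) + vb / len(b)) ** 0.5

random.seed(0)
# Synthetic samples shaped like the two conditions above:
scoop_alone = [random.gauss(316, 87) for _ in range(50)]
inout_first = [random.gauss(600, 351) for _ in range(50)]
t = welch_t(scoop_alone, inout_first)
```

With a difference this large relative to the spread, t comes out well
past conventional significance thresholds.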

We next attempted to parameterize the amount of ill effect that pretraining
with InOut was having, by "nudging" the network: varying the pretraining
error criterion (ecrit).  Figure 4, the graph of exp9, plots the amount of
time that [scoop] takes to train, given pretraining to different critical
tss values, ranging from 0.20 (very little pretraining) through 0.02
(greatest amount of pretraining).  One can see that although the data are
very noisy (r^2=.16) there is a trend towards [scoop] requiring more
training as the network "overlearns" InOut.

[InOut]
nudge
ecrit   [scoop] training epochs to 0.01 ecrit: mean (s.e., s.d.)
-----   --------------------------------------------------------
0.20    554.0   (48.3, 152.7)
0.15    415.0   (57.3, 181.3)
0.10    574.4   (86.6, 259.9)
0.08    366.0   (62.2, 196.7)
0.06    534.0   (84.3, 266.7)
0.04    673.0   (68.0, 215.1)
0.02    644.0   (88.2, 278.9)

Figure 4. `Nudge' criterion and [scoop] training rates for various levels
of nudging.  [This is supposed to be a plot, but it appears in this textual
version as a table.]

[Report of a number of unsuccessful attempts with different network
architectures deleted.]

Group 2 Studies

Since we failed to find consistent pretraining effects from subsequences to
the whole sequence, we investigated transfer in the other direction:
pretraining [scoop] and looking for effects on the training times of
different subsequences.  This is non-trivial because the parts of the
sequence were again given different labels and did not always start at the
first point in the [scoop] sequence.  The Jordan-style recurrent network
tries to replicate a particular sequence in the order in which it was
learned. This is the effect that we are both depending upon and fighting
against.

We found a considerable pretraining effect in most cases (Table 1).


                        with [scoop]    remarks
                alone   pretraining     (label)
                -----   ------------    -------

in              130,19  12,4      *     (0 1 0)
out                     31,7      ?     (1 0 0)
up              83,23   965,99    *     (0 0 0) zero effect?
123point        458,132 250,154   *     (0 1 0)
1234point       288,28  375,158         (0 1 0)
23point         87,25   263,133   *     (0 1 0)
234point        350,95  146,110   *     (0 1 0)
1point          8,1     1,1       *     (0 1 0)
2point          8,1     107,226   *     (0 1 0)
4point                  7,2       ?     (0 1 0)
scpout          77,12             ?     (1 1 0)
scp234point     301,85  294,161         (1 1 0)

Table 1: Transfer from [scoop] to its subsequences.  Numbers refer to
subsequence points from the [scoop] sequence (the first point is 1, the
last is 4):  4 <<<<< 3
                     ^
             1 >>>>> 2 

Patterns that begin with "scp" have the same labels (1 1 0) as [scoop].
Results are mean,deviation from 50 trials.  * indicates a significant
difference.

These results suggest that the network learns the sequence, and its
knowledge of the sequence is not completely linked to the inputs. Thus,
after pretraining with [scoop], the network can learn a similar sequence
with a different label and/or a different starting point relatively easily.
However, the large deviations in this case suggest that the network may be
learning several different versions of [scoop], and that some of these lend
themselves to transfer while others do not.  (In a very few cases the
system appeared to go into an infinite loop, apparently never reaching
criterion, using precisely the same inputs as resulted in reasonably small
training times in most cases.  These were stopped by force at times on the
order of 1E5 to 1E6 epochs, depending upon when we noticed the problem, and
were excluded from the results.  This may have been an infinite, or at
least very deep, hole in the space.  Changes in learning parameters may
have fixed these problems.)



