What have neural networks achieved?

Rafael Brander brander at csee.uq.edu.au
Wed Sep 2 14:28:01 EDT 1998


These look like achievements of neural networks (see detail below):

(1) As suggested by Randall O'Reilly, "...the neural network
approach provides a principled basis for understanding why we have a
hippocampus, and what its functional characteristics should be."

The catastrophic interference literature has also given a plausible
explanation -- sparseness -- for why, given that our brains look very
much like neural networks, we have any memory at all.

(2) I think that the catastrophic interference and generalisation
literature suggests that the human weakness with discrete symbolic
memory may be an inevitable price of generalisation capability -- in
artificial as well as biological computers. This bolsters the common
view that AI will only be achieved with human-comparable computer
hardware. It also explains the typical human complaint "why do I have
such a terrible memory..." (relative to silicon-chip computers).

Apologies for the length of this email; see my Masters abstract at the
bottom.
Rafael Brander.

Randall O'Reilly wrote:

"Another angle on the hippocampal story has to do with the phenomenon
of catastrophic interference (McCloskey & Cohen, 1989), and the notion
that the hippocampus and the cortex are complementary learning systems
that each optimize different functional objectives (McClelland,
McNaughton, & O'Reilly, 1995).  In this case, the neural network
approach provides a principled basis for understanding why we have a
hippocampus, and what its functional characteristics should be.
Interestingly, one of the "successes" of neural networks in this case
was their dramatic failure in the form of the catastrophic
interference phenomenon.  This failure tells us something about the
limitations of the cortical memory system, and thus, why we might need
a hippocampus."


I agree. And further to this, research results involving sparse vectors
show that hippocampally realistic settings of certain parameters of an
otherwise standard MLP are sufficient to equal the intermediate-term
memory performance of human subjects. In other words, current knowledge
of sparse neural networks is actually consistent with human
intermediate-term memory performance, which is associated with the
hippocampus.
See the bottom of this email for my Masters abstract.

Jay McClelland wrote:

[text deleted]
"To allow rapid learning of the contents of a particular experience,
the argument goes, a second learning system, complementary to the
first [neocortex..], is needed; such a system has a higher learning
rate and recodes
inputs using what we call 'sparse, random conjunctive coding' to
minimize interference (while simultaneously reducing the adequacy of
generalization).  These characteristics are just the ones that appear
to characterize the hippocampal system: it is the part of the brain
known to be crucial for the rapid learning of the contents of specific
experiences; it is massively plastic; and neuronal recording studies
indicate that it does indeed use sparse, random conjunctive coding."


My research mentioned above, which found simple parameter settings
allowing a sparse network to equal human intermediate-term memory
performance on the tasks studied, confirms Jay's comments. He also
refers to sparseness "simultaneously reducing the adequacy of
generalization". I also studied this generalisation/memory trade-off in
my thesis. I found that generalisation can indeed be wiped out by
sparseness if the domain is combinatorial. By *combinatorial*, I mean a
domain in which any input features can be combined freely, in any
combination, to form an input vector. I also found, affirming results
of French and Lewandowsky, that generalisation was not affected in
simpler, noncombinatorial (actually classification) domains.
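
To illustrate what I mean by combinatorial (a toy construction of my
own, not one of the domains studied in the thesis): with n binary
features, every one of the 2^n feature combinations is a legal input
vector.

    from itertools import product

    # With 4 binary features, every one of the 2**4 = 16 feature
    # combinations is a legal input vector in a combinatorial domain.
    inputs = [list(bits) for bits in product([0, 1], repeat=4)]
    print(len(inputs))   # 16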

This problem in combinatorial domains -- sparseness damaging
generalisation, a classic utility of neural networks -- is a bit
discouraging for applications. However, any working algorithm must find
a way both to separate memories sufficiently to avoid interference, and
to combine them sensibly to perform tasks requiring recognition of
shared features. In this vein, current knowledge actually suggests that
the human brain has traded off detailed discrete memory against
generalisation capability. Everyone is aware of how frustrating
detailed discrete symbolic memory is for humans relative to
conventional silicon-chip von Neumann computers, despite the colossal
hardware available in the human brain. Note that the "catastrophic
interference" of naive networks referred to in the literature is
catastrophic relative to humans; but humans are also catastrophic
relative to computers. This frustration relative to silicon-chip
computers may stem from *partially* overlapping, semantic-feature-based
representations, which may be necessary for generalisation and content
addressability. According to the literature on catastrophic
interference in artificial networks, overlap the representations too
much and memory disappears; overlap them too little and generalisation
vanishes (at least in a combinatorial domain). Humans can generalise on
tasks spanning periods well under the few months or years that memories
take to become established in the neocortex, so the hippocampus may be
involved. This line of thinking, based on the catastrophic interference
literature, suggests that in order to have generalisation capability,
the human weakness with discrete symbolic memory may be inevitable --
in artificial as well as biological computers. Most AI researchers
believe that they will need human-comparable computer hardware to
achieve human-level performance; I think this is further evidence for
that view.
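
As a rough illustration of the overlap point (a toy numpy sketch of my
own, not taken from any of the studies cited; the dimensions, number of
patterns and activity levels are arbitrary), the average fraction of
co-active units between random binary patterns drops roughly as the
square of the activity level, so sparser codes overlap far less:

    import numpy as np

    rng = np.random.default_rng(0)

    def random_binary_patterns(n_patterns, n_units, p_active):
        # Each unit is active independently with probability p_active.
        return (rng.random((n_patterns, n_units)) < p_active).astype(float)

    def mean_pairwise_overlap(patterns):
        # Average fraction of units simultaneously active in two
        # distinct patterns.
        n, n_units = patterns.shape
        shared = patterns @ patterns.T / n_units
        return (shared.sum() - np.trace(shared)) / (n * (n - 1))

    for p_active in (0.5, 0.1, 0.02):   # dense -> increasingly sparse
        pats = random_binary_patterns(50, 200, p_active)
        print(p_active, mean_pairwise_overlap(pats))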


Bryan Thompson wrote:

    Max writes,

    Think about the structure of this argument for a moment. It runs
    thus:

    1. Neural networks suffer from catastrophic interference.
    2. Therefore the cortical memory system suffers from catastrophic
    interference.  3. That's why we might need a hippocampus.

    Is everyone happy with the idea that (1) implies (2)?

    Max
                  max at currawong.bhs.mq.edu.au

  "I am not happy with the conclusion (1), above.  Catastrophic
  interference is a function of the global quality of the weights
  involved in the network.  More local networks are, of necessity, less
  prone to such interference as less overlapping subsets of the weights
  are used to map the transformation from input to output space.  Modifying
  some weights may have *no* effect on some other predictions. In
  the extreme case of table lookups, it is clear that catastrophic
  interference completely disappears (along with most options for
  generalization, etc.:) In many ways, it seems that this statement is
  true for supervised learning networks in which weights are more global
  than not.  Other, more practical counter examples would include
  (differentiable) CMACs and radial basis function networks.
  (text deleted..)"


This mostly ties in, but a couple of elaborations are worth making.
Although sparser networks in earlier research always reduced
interference, the memory performance of multilayered sparse networks
was generally much below what one might intuitively expect (see Masters
abstract below). I explain the reasons for this below, but if you set
up the network correctly then your comments about progressively more
severe sparseness reducing interference are correct. Regarding radial
basis function
networks, I might expect an RBF net with narrow activation bases to be
analogous to sparse MLPs, although Robins' work in progress (just below)
seems pessimistic about its memory capabilities. Some of the unexpected
but correctable problems which I found with sparse networks might have
their analogues in narrowed RBFs.


Anthony Robins wrote:

[stuff deleted]...
In any case, retaining old items for rehearsal in a
network seems somewhat artificial, as it requires that they be
available on demand from some other source, which would seem to make
the network itself redundant.

[I agree. Retaining items for rehearsal requires memory overhead in a
system which is supposed to be trying to optimise memory...]

It is possible to achieve the benefits of rehearsal, however, even
when there is no access to old items.  This "pseudorehearsal"
mechanism, introduced in Robins (1995), is based on the relearning of
artificially constructed populations of "pseudoitems" instead of the
actual old items.

In MLP / backprop type networks a pseudoitem is constructed by
generating a new input vector at random, and passing it forward
through a network in the standard way.  Whatever output vector this
input generates becomes the associated target output.
[stuff deleted] ....
(Work in progress suggests that simply using a "local"
learning algorithm such as an RBF is not enough).

We have already linked pseudorehearsal in MLP networks to the
consolidation of information during sleep (Robins, 1996).
[stuff deleted]...


The idea of some kind of rehearsal for consolidation of information
during sleep, and over the long term in general, sounds very
interesting. Regarding shorter memory tasks, say over a number of
minutes such as the ones I studied in my Masters, it seems less likely
to me that the brain would have the time and memory capacity to
implement pseudorehearsal. From some of your data, a rather large
pseudopopulation overhead seems to be needed to slow the drift
off-target of the original learned output vectors (as shown by their
dot products with the targets).
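
For concreteness, here is a minimal sketch of the pseudoitem
construction just described (my own illustration, not Robins' code; the
network sizes and weights are arbitrary toy values): random inputs are
passed forward through the current network, and whatever the network
outputs becomes the associated target.

    import numpy as np

    rng = np.random.default_rng(1)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward(x, W1, b1, W2, b2):
        # Standard forward pass of a one-hidden-layer MLP.
        h = sigmoid(x @ W1 + b1)
        return sigmoid(h @ W2 + b2)

    def make_pseudoitems(n_items, n_in, W1, b1, W2, b2):
        # Random binary inputs; the network's own outputs become the
        # associated targets.
        inputs = (rng.random((n_items, n_in)) < 0.5).astype(float)
        targets = forward(inputs, W1, b1, W2, b2)
        return inputs, targets

    # Toy weights just to exercise the function (sizes are arbitrary).
    n_in, n_hid, n_out = 8, 6, 4
    W1 = rng.normal(0, 0.5, (n_in, n_hid)); b1 = np.zeros(n_hid)
    W2 = rng.normal(0, 0.5, (n_hid, n_out)); b2 = np.zeros(n_out)
    pseudo_x, pseudo_t = make_pseudoitems(20, n_in, W1, b1, W2, b2)

The (pseudo_x, pseudo_t) pairs would then be interleaved with the new
items during subsequent training, approximately preserving the function
the network computed before the new learning.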


The abstract of my Masters thesis is appended below, but I mention a
few bits of extra relevant detail here. As mentioned above, it was
successfully demonstrated that hidden-layer sparseness, in combination
with three other hippocampally realistic network conditions, can
eliminate catastrophic interference. Some of the simulations were a
resolution of McCloskey and Cohen's [1989] exposé of catastrophic
interference. The other three necessary anti-interference factors were
context dominance (for context-dependent tasks of course), initial
weightsize and the bias influence.
Context dominance refers to large (i.e. larger than 1) context unit
activation values.  It was set to 4 for the human-equal
context-dependent simulation, which actually matched the whole human
forgetting curve well. Comparing with the hippocampus, there is no easy
way to determine just how much "attention" it pays to list context. Much
larger than normal initial weightsizes -- at values typical of
in-training sizes -- were also found to be necessary for both tasks in
avoiding catastrophic interference. One would expect weight sizes in the
brain to always be at "in-training" sizes. Finally, a small bias
influence relative to the input layer, such as is typically found in
large networks (the hippocampus is a huge network), was required.
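
To make the four conditions concrete, here is an illustrative
configuration sketch. The k-winners-take-all sparsification, layer
sizes, weight range and bias value below are my own assumptions for
illustration; only the context activation of 4 comes from the
simulation described above.

    import numpy as np

    rng = np.random.default_rng(2)

    n_in, n_context, n_hid = 20, 2, 100

    # (1) Sparseness: keep only the k most active hidden units.
    def k_winners(h, k=5):
        out = np.zeros_like(h)
        idx = np.argsort(h)[-k:]
        out[idx] = h[idx]
        return out

    # (2) Context dominance: context units use activation 4 rather
    #     than the traditional 1.
    CONTEXT_ACTIVATION = 4.0

    # (3) Initial weights at "in-training" magnitude rather than the
    #     traditional small random range (the 1.0 scale is illustrative).
    W = rng.uniform(-1.0, 1.0, (n_in + n_context, n_hid))

    # (4) Small bias influence relative to the input fan-in
    #     (0.1 is an illustrative value).
    bias = np.full(n_hid, 0.1)

    def hidden_activation(item, context_id):
        context = np.zeros(n_context)
        context[context_id] = CONTEXT_ACTIVATION
        x = np.concatenate([item, context])
        h = 1.0 / (1.0 + np.exp(-(x @ W + bias)))
        return k_winners(h)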

Here is a summarised explanation for how the four factors
influence interference. For sparseness, it's the obvious one given many
times by other researchers; sparseness reduces overlap, thus reduces
weight-learning interference. However, in my simulations sparseness by
itself only ever got first list retention just off the floor;
interference from the second list was still catastrophic. For the
context-dependent task, the most important other factor was context
dominance, which works as follows. If context activation values are too
small, switching list context is not going to change the total summed
inputs to the hidden layer by much. Consequently, the network will
train mostly the same weights for second list items as it previously
did for the first list, wiping out memory of list 1. Turning to initial
weightsizes: training on list 1 naturally pushed the weights
most involved to the "in-training" sizes -- much greater than traditional
initialisations. During list 2 training, backpropagation of error
through these enlarged weights was far greater than through small
weights, which directly encouraged the new associations to be learned
using those same weights over again. Initialising weights around the
in-training range removed this quite substantial interference effect.
Regarding the bias influence: in a small model network, the bias for each
node, traditionally set to 1, is significant in comparison to the
previous layer's fan-in to the node. When a hidden node is "switched
on" during early training (list 1), its bias is naturally increased to
speed up the training. The result is that the hidden node's activation
threshold is now low, and it is therefore much more likely to be
turned on during the learning of list 2. Thus early and late training
tends strongly to use the same hidden nodes, increasing interference.
This problem vanished for smaller bias settings.
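
The context dominance point can be seen in a one-line calculation with
made-up weights: the shift in a hidden node's summed input when the
list context switches scales directly with the context activation
value.

    import numpy as np

    rng = np.random.default_rng(3)
    w_item = rng.normal(0, 0.5, 10)   # item weights (same for both lists)
    w_ctx1, w_ctx2 = 0.4, -0.3        # weights from the two context units

    item = rng.random(10)
    for context_activation in (1.0, 4.0):
        net_list1 = item @ w_item + context_activation * w_ctx1
        net_list2 = item @ w_item + context_activation * w_ctx2
        # Larger context activations push the node's summed input
        # further apart between lists, recruiting different hidden
        # nodes for the two lists.
        print(context_activation, net_list2 - net_list1)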

It was not obvious that low initial weightsizes and high bias influence
could cause substantial interference, and I point out that these factors
should apply across a wide variety of commonly used networks.

I'll probably submit material from the Masters catastrophic interference
sections for publication in the near future (and probably some SOM
stuff too). If someone asks me about it, I can put the latest version of
the thesis on the web in a few weeks; I'm finishing some
examiner-initiated modifications. The thesis has many more
references on catastrophic interference than I've appended below.

\title{On Sparse Neural Network Representations: Catastrophic Interference,
 Generalisation, and Combinatoriality}
\author{Rafael Antony Brander\\
 B.Sc.(Hons. Pure Math; Hons. Applied Math), G.Dip.(Comp.Sci.)

{\it A thesis submitted for the degree of Master of Science}

School of Information Technology\\
The University of Queensland\\
Australia}
\date{September 30, 1997}

Abstract:

  Memory is fundamental to human psychology, being necessary for any
thought processing. Hence the importance of research involving human
memory modelling, which aims to give us a better understanding of how the
human memory system works. Learning and forgetting are of course the most
important aspects of memory, and any worthwhile model of human memory
must be expected to exhibit these phenomena in a manner qualitatively
similar to that of a human. It has been claimed (see below) that standard
artificial neural networks cannot fulfil this elementary expectation,
suffering from {\it catastrophic interference}, and sparseness of neural
network representations has been employed, directly and indirectly, in
many of the attempts to answer this claim. Part of the motivation for the
employment of sparseness has been the fact that sparse vectors have been
observed in human neurological memory systems [Barnes, McNaughton,
Mizumori, Leonard and Lin, 1990]. In the broader field of neural
networks, sparse representations have recently become a popular tool of
research. This thesis aimed to discover what fundamental properties of
sparseness might justify the counter-claims alluded to above, and in so
doing uncovered a more general characterisation of the effects of sparse
vector representations in neural networks.

As yet, little formal knowledge of the concept of sparseness itself has
been reported in the literature; we developed some formal definitions and
measures, which served as foundational background theory for the thesis.
We also discussed several representative methods for implementing
sparsification.
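
For illustration, two simple sparseness measures of the kind found in
the sparse-coding literature are sketched below (these are common
measures, not necessarily the definitions developed in the thesis):

    import numpy as np

    def binary_sparseness(x):
        # Fraction of units active in a binary pattern
        # (smaller = sparser).
        return float(np.count_nonzero(x)) / x.size

    def activity_sparseness(r):
        # a = (mean rate)^2 / mean squared rate; close to 0 for very
        # sparse patterns and 1 for fully distributed ones (a measure
        # often attributed to Treves and Rolls).
        r = np.asarray(r, dtype=float)
        return (r.mean() ** 2) / (r ** 2).mean()

    print(binary_sparseness(np.array([0, 0, 1, 0, 0, 0, 1, 0])))   # 0.25
    print(activity_sparseness(np.array([0.0, 0.0, 0.9, 0.0, 0.8, 0.0])))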

We initially conjectured that the main problem of sparsification in the
case of a boolean space might be that of finding ``base'' clusters
without being concerned with orientation with respect to an origin. This
pointed us towards a clustering algorithm. We employed simulations and
theory to show that a particular sparse representation, which we derived
from a neural network cluster-based learning algorithm, the Self
Organising Map ({\it SOM}) [Kohonen 1982], is an ineffective basis for
even simple learning and generalisation tasks, {\it in combinatorial
domains}. The SOM is generally regarded as a good performer in
classification tasks, which are noncombinatorial domains.
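
For illustration, a toy 1-D SOM and a winner-take-one sparse code
derived from it are sketched below (this is not the implementation used
in the thesis; learning-rate and neighbourhood decay schedules are
omitted for brevity):

    import numpy as np

    rng = np.random.default_rng(4)

    def train_som(data, n_nodes=10, n_epochs=20, lr=0.3, sigma=2.0):
        # 1-D map: each node has a weight vector; the winning node and
        # its neighbours are pulled toward each input.
        dim = data.shape[1]
        weights = rng.random((n_nodes, dim))
        positions = np.arange(n_nodes)
        for epoch in range(n_epochs):
            for x in data:
                winner = np.argmin(np.linalg.norm(weights - x, axis=1))
                neighbourhood = np.exp(-((positions - winner) ** 2)
                                       / (2 * sigma ** 2))
                weights += lr * neighbourhood[:, None] * (x - weights)
        return weights

    def sparse_code(x, weights):
        # A simple sparse representation: one-hot vector of the winner.
        code = np.zeros(len(weights))
        code[np.argmin(np.linalg.norm(weights - x, axis=1))] = 1.0
        return code

    data = rng.random((100, 5))
    w = train_som(data)
    print(sparse_code(data[0], w))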

We then turned to the well-known problem referred to earlier, where
neural networks are observed to fail to model a fundamental
property of human memory. Since McCloskey and Cohen [1989] and Ratcliff
[1990] first brought it to the attention of neural network researchers,
the problem of {\it catastrophic interference} in standard feedforward,
multilayer, backpropagation neural network models of human memory has
continued to generate research aimed at its resolution. Interference is
termed ``catastrophic'' when the learning of a second list almost
completely removes memory of a list learned earlier, and when forgetting
of this extreme nature does not occur in humans.

In previous research [French, 1991; McRae \& Hetherington, 1993],
sparseness at the hidden layer, either directly or indirectly induced,
has been shown to substantially reduce catastrophic interference in
memory tasks where no context is required to distinguish between the two
lists.

Our interference studies investigated the degree to which sparsification
algorithms can eliminate the serious problem of catastrophic
interference, by comparison with the human performance data
[Barnes \& Underwood, 1959; Barnes-McGovern, 1964; Garskof, 1968], a
match to which had not yet been achieved with standard MLPs in the
literature. These studies investigated both the AB AC and AB CD list
learning paradigms, which represent instances of context-dependent and
context-independent tasks respectively.
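
For illustration, the two paradigms can be set up with random binary
vectors standing in for the stimulus and response terms (these vectors
are arbitrary, not the stimuli used in the simulations): in AB AC the
second list re-pairs the same A stimuli with new C responses, whereas
in AB CD both stimuli and responses are new.

    import numpy as np

    rng = np.random.default_rng(5)

    def random_items(n, dim):
        return (rng.random((n, dim)) < 0.5).astype(float)

    n_pairs, dim = 8, 16
    A, B, C, D = [random_items(n_pairs, dim) for _ in range(4)]

    # AB-AC: list 2 pairs the *same* A stimuli with new C responses.
    list1_abac = list(zip(A, B))
    list2_abac = list(zip(A, C))

    # AB-CD: list 2 uses entirely new stimulus-response pairs.
    list1_abcd = list(zip(A, B))
    list2_abcd = list(zip(C, D))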

It was successfully demonstrated that sparseness, in combination with
three other realistic network conditions, can eliminate catastrophic
interference. The other three necessary anti-interference factors were
context dominance, initial weightsize and the bias influence. Context
dominance was here definitionally implemented as setting the context
units to have large (i.e. larger than 1) activation values. Much larger
than normal initial weightsizes -- at values typical of in-training sizes
 -- were also found to be necessary in avoiding catastrophic
interference. Finally, a small bias influence relative to the input
layer, such as is typically found in large networks, was required. The
explanation for sparseness' removal of catastrophic interference was
argued to be that it reduces relative overlap between unrelated vectors.

However, it is believed [French, 1991; McRae \& Hetherington, 1993;
Lewandowsky, 1994; Sharkey \& Sharkey, 1995] that there is a trade-off
between sparseness and generalisation. We also addressed this trade-off
issue, and showed that the sparsification algorithms used above to
eliminate catastrophic interference concomitantly incur a great loss in
a neural network's generalisation capability {\it in combinatorial
domains}; although there was no such loss (as agreed by French [1992] and
Lewandowsky [1994]) if the domain was noncombinatorial.

Combining the results of the studies of the SOM and effects of sparseness
on generalisation, we suggested that sparseness has no effect on learning
or generalisation in noncombinatorial domains, and that in combinatorial
domains generalisation is removed while learning can only occur by
exhaustively training in a supervised scheme on all exemplars. Further,
it was shown that a more abstract result predicts -- and gives an
intuitive explanation for -- the specific results of all our experiments
discussed above. This more abstract, and intuitively expected result is
the following: {\it sparseness at the hidden layer of a standard network
has the effect, in combinatorial domains in particular, of reducing the
network's general operational similarity dependence on the similarity of
input vectors}.

The results of this thesis clarify understanding of the way the human
memory system works, which is of interest in psychology and neuroscience.
They also provide important insights into the functionality of the SOM
and the MLP.

[Barnes \& Underwood, 1959]{BU} Barnes, J. M. and Underwood, B.
 J. (1959).
``Fate'' of first-list associations in transfer theory.
 {\it Journal of Experimental Psychology}, {\bf 58}(2), 97-105.

[Barnes-McGovern, 1964] Barnes-McGovern, J. M. (1964).
Extinction of associations in four transfer paradigms.
 {\it Psychological Monographs: General and Applied}, Whole No. 593,
 {\bf 78}(16), 1-21.

[Barnes et al., 1990]{Ba} Barnes, C. A., McNaughton, B. L.,
 Mizumori, S. J., Leonard, B. W.
 and Lin, L. H. (1990).
 Comparison of spatial and temporal characteristics of neuronal 
activity in sequential stages of hippocampal processing.
  {\it Progress in Brain Research}, {\bf 83}, 287-300.

[French, 1991]{F91} French, R. (1991).
 Using semi-distributed representations to overcome 
	catastrophic forgetting in connectionist networks.
 {\it Proceedings 
	of the 13th Annual Conference of the Cognitive Science Society},
 173-178. Hillsdale, NJ: Erlbaum.

[French, 1992]{F} French, R. M. (1992).
Semi-distributed representations and catastrophic
 forgetting in connectionist networks.                      
{\it Connection Science}, {\bf 4}, (3/4), 365-378.

[Garskof, 1968] Garskof, B. E. (1968).
Unlearning as a function of degree of interpolated learning and method of
testing in the A-B, A-C and A-B, C-D paradigms.
 {\it Journal of Experimental Psychology}, {\bf 76}(4), 579-583.

[Kohonen, 1982]{K} Kohonen, T. (1982).
 Self-organised formation of topologically correct feature maps.
{\it Biological Cybernetics}, {\bf 43}, 59-69.

[Lewandowsky, 1994]{L94} Lewandowsky, S. (1994).
 On the relation between catastrophic
	interference and generalization in connectionist networks.
{\it Journal of Biological Systems}, {\bf 2}(3), 307-333.

[McCloskey \& Cohen, 1989]{MC} McCloskey, M. and Cohen, N. J.
 (1989).
  Catastrophic interference in connectionist networks: the
sequential learning problem.
In G. H. Bower (Ed.),
{\it The Psychology of Learning and Motivation}, {\bf 24}, 109-165.

[McRae \& Hetherington, 1993]{MH}  McRae, K. and 
Hetherington, P. A. (1993).
 Catastrophic interference is eliminated in pretrained networks.
 {\it Proceedings of the
Fifteenth Annual Conference of the Cognitive Science Society}, 723-728.
Hillsdale, NJ: Erlbaum.

[Ratcliff, 1990]{R} Ratcliff, R. (1990).
  Connectionist models of recognition memory: constraints imposed
 by learning and forgetting functions.
{\it Psychological Review}, {\bf 97}(2), 285-308.

[Sharkey \& Sharkey, 1995] Sharkey, N. E. and Sharkey, A. J. C. (1995).
 An analysis of catastrophic interference.
{\it Connection Science}, {\bf 7}, 301-329.


