A summary on Catastrophic Interference
DeLiang Wang
dwang at cis.ohio-state.edu
Thu Feb 2 15:24:30 EST 1995
Catastrophic Interference and Incremental Training
A brief summary
by DeLiang Wang
Several weeks ago I posted a message on this network asking about the
research status of catastrophic interference. I have since received more
than 50 replies, public and private. I have benefited much from the
discussions, and I hope others have too. I have read or scanned most of
the papers (those easily accessible) prompted by the replies. To thank
all of you who replied, I have compiled a brief summary of my readings.
Since I do not want to post a long review paper on the net, the following
summary is very brief, which inevitably means oversimplifying the models
mentioned here and neglecting some other work.
(1) Catastrophic interference (CI) refers to the phenomenon that later
training disrupts the results of previous training. It was raised as a
criticism of multilayer perceptrons trained with backpropagation (MLPs)
by Grossberg (1987), and systematically demonstrated by McCloskey & Cohen
(1989) and Ratcliff (1990). Why is CI bad? It prevents a single network
from incrementally accumulating knowledge (the alternative would be to
have each network learn just one set of transforms), and it poses severe
problems for MLPs as models of human/animal memory. A minimal
demonstration is sketched below.
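The sketch is my own illustration in Python/NumPy, not taken from any of
the papers cited here; the network size, the random pattern sets, and the
learning rate are arbitrary choices. A small backprop MLP is trained on
one set of associations, then on a second set alone; its error on the
first set typically rises sharply.

  import numpy as np

  rng = np.random.default_rng(0)

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  class MLP:
      """Tiny one-hidden-layer perceptron trained by on-line backprop."""
      def __init__(self, n_in, n_hid, n_out):
          self.W1 = rng.normal(0.0, 0.5, (n_in, n_hid))
          self.W2 = rng.normal(0.0, 0.5, (n_hid, n_out))

      def forward(self, x):
          self.h = sigmoid(x @ self.W1)   # hidden activations, kept for backprop
          return sigmoid(self.h @ self.W2)

      def train(self, X, T, epochs=2000, lr=0.5):
          for _ in range(epochs):
              for x, t in zip(X, T):
                  y = self.forward(x)
                  dy = (y - t) * y * (1.0 - y)                     # output delta
                  dh = (dy @ self.W2.T) * self.h * (1.0 - self.h)  # hidden delta
                  self.W2 -= lr * np.outer(self.h, dy)
                  self.W1 -= lr * np.outer(x, dh)

      def error(self, X, T):
          return np.mean([(self.forward(x) - t) ** 2 for x, t in zip(X, T)])

  # Two disjoint sets of random binary associations: "task A" and "task B".
  XA = rng.integers(0, 2, (4, 8)).astype(float)
  TA = rng.integers(0, 2, (4, 2)).astype(float)
  XB = rng.integers(0, 2, (4, 8)).astype(float)
  TB = rng.integers(0, 2, (4, 2)).astype(float)

  net = MLP(8, 6, 2)
  net.train(XA, TA)
  print("error on A after training on A:", net.error(XA, TA))  # low
  net.train(XB, TB)        # sequential training: B only, no rehearsal of A
  print("error on A after training on B:", net.error(XA, TA))  # typically much higher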
(2) Catastrophic interference is model-specific. So far, the problem has
been demonstrated only in MLPs and their variants. We know that models in
which different items are represented without overlap (e.g. ART) do not
have this problem. Even some models with a certain amount of overlap do
not have it (see, for example, Willshaw, 1981; Wang and Yuwono, 1995).
Unfortunately, many studies of CI carry general titles, such as
"connectionist models" and "neural network models", and leave the
impression that CI is a general problem of all neural network (NN)
models. Such general titles are used by both critics and proponents of
neural networks. They may have been justified in the early days, when
awareness of NNs needed to be raised. At the current stage of the field's
development, results about NNs should be stated specifically, and a title
should properly reflect the scope of the paper.
(3) Tradeoff between distributedness and interference. The major cause of
CI is the "distributedness" of representations: learning new patterns
must use the very weights that participate in representing previously
learned patterns. Much of the work on overcoming CI is directed toward
reducing the extent of distributedness. We can say that there is a
tradeoff between distributedness and interference (as noted earlier: no
overlap, no CI). The small worked example below makes the link explicit.
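Again this is my own illustration, with made-up hidden codes. For a
delta-rule output layer, the weight update driven by one pattern shifts
another pattern's output in proportion to the inner product of their
hidden representations, so orthogonal codes give exactly zero
interference.

  import numpy as np

  h1 = np.array([1.0, 1.0, 0.0, 0.0])          # hidden code of an old pattern
  h2_overlap = np.array([1.0, 0.0, 1.0, 0.0])  # shares one active unit with h1
  h2_orthog = np.array([0.0, 0.0, 1.0, 1.0])   # shares none

  lr, err = 0.1, 1.0
  dw = lr * err * h1                           # delta-rule update driven by h1
  print("output shift, overlapping code:", dw @ h2_overlap)  # 0.1: interference
  print("output shift, orthogonal code: ", dw @ h2_orthog)   # 0.0: none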
(4) Ways of overcoming CI. The problem has been studied by a number of
authors, all of whom work on MLPs or their variants. Here is a list of
proposals to alleviate CI:
* Reduce overlap in hidden-unit representations. French (1991,
1994), Kruschke (1992), Murre (1992), Fahlman (1991).
* Orthogonalization. This idea was proposed long ago for reducing
cross-talk in associative memories. The same idea works here to reduce CI.
See Kortge (1990), Lewandowsky (1991), Sloman and Rumelhart (1992),
Sharkey and Sharkey (1994), McClelland et al. (1994).
* Prior training. Assuming later patterns are drawn from the same
underlying function as earlier ones, prior training on the general task
can reduce CI (McRae and Hetherington, 1993). This proposal will not
work if later patterns have little to do with previously trained ones.
* Modularization. The idea is similar to the earlier ones: a
hierarchy of different networks, with each network selected to learn a
different category of tasks. See Waibel (1989), Brashers-Krug et al.
(1995).
* Retain only a recent history. The idea here is to let past
patterns fade and retain only a limited number of patterns, including the
new one (reminiscent of STM). See Ruiz de Angulo and Torras (1995). A
sketch of this idea follows the list.
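The following sketch of the last proposal is my own rendering of the
general idea, not the algorithm of Ruiz de Angulo and Torras: keep a
small first-in-first-out buffer of recent patterns and always retrain on
the buffer as a whole, so only the retained history is protected. It
assumes the MLP class from the sketch in point (1), and the buffer size
is an arbitrary choice.

  import numpy as np
  from collections import deque

  def learn_incrementally(net, stream, buffer_size=8):
      """Train net on a stream of (input, target) pairs, rehearsing only
      a bounded recent history; older patterns are silently forgotten."""
      buffer = deque(maxlen=buffer_size)   # FIFO: the oldest pattern drops out
      for x, t in stream:
          buffer.append((x, t))
          X = np.array([p[0] for p in buffer])
          T = np.array([p[1] for p in buffer])
          net.train(X, T)                  # new pattern + retained history together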
(5) Transfer studies. Another body of related, but different, work studies
how previous training can facilitate acquiring a new pattern. The effect
of acquiring a new pattern on previous memory, however, is not explored.
See Pratt (1993), Thrun and Mitchell (1994), Murre (1995).
(6) What about associative memory? In my original message, I suspected
that associative memory models that can handle correlated patterns (see
Kanter and Sompolinsky, 1987; Diederich and Opper, 1987) should suffer
from the same problem of catastrophic interference. Unfortunately, no
response has touched on this issue. Are people taking associative memory
models seriously nowadays?
(7) To summarize, the following two ideas, in my opinion, hold the
greatest promise for solving CI. The first is to reduce the receptive
field of each unit, thus reducing the overlap among different feature
detectors; RBF (radial basis function) networks fall into this category
(see the sketch below). After all, limited receptive fields are
characteristic of brain cells, and all-to-all connections are scarce, if
they exist at all. The second is to introduce some form of modularization
so that different underlying functions are handled by different modules
(reducing overlap among differing tasks). This may not only solve the
problem of CI but also facilitate acquiring new knowledge (positive
transfer). Furthermore, the idea is consistent with the general principle
of functional localization in the brain.
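To illustrate the first idea, here is a sketch of generic Gaussian RBF
units (the centers and width are arbitrary choices of mine): each unit
responds only near its own center, so a weight update for a new pattern
in one region leaves units that code distant, previously learned patterns
nearly untouched.

  import numpy as np

  def rbf_features(x, centers, width=0.5):
      # Gaussian activations: near zero unless x lies close to a center.
      d2 = np.sum((centers - x) ** 2, axis=1)
      return np.exp(-d2 / (2.0 * width ** 2))

  centers = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
  print(rbf_features(np.array([0.1, 0.0]), centers))
  # Only the unit at (0,0) is appreciably active; the unit at (3,3) is
  # silent, so an output-weight update here cannot disturb what it encodes.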
References (more detailed references for tech. reports were posted before):
Brashers-Krug, T., Shadmehr, R., & Todorov, E. (1995). In: NIPS-94 Proceedings, to appear.
Diederich, S., & Opper, M. (1987). Phys. Rev. Lett. 58, 949-952.
Fahlman, S. (1991). In: NIPS-91 Proceedings.
French, R.M. (1991). In: Proc. of the 13th Annual Conf. of the Cognitive Science Society.
French, R.M. (1994). In: Proc. of the 16th Annual Conf. of the Cognitive Science Society.
Grossberg, S. (1987). Cognitive Science 11, 23-64.
Kanter, I., & Sompolinsky, H. (1987). Phys. Rev. A 35, 380-392.
Kortge, C.A. (1990). In: Proc. of the 12th Annual Conf. of the Cognitive Science Society.
Kruschke, J.K. (1992). Psychological Review 99, 22-44.
Lewandowsky, S. (1991). In: Relating Theory and Data: Essays on Human Memory in Honor of Bennet B. Murdock (W. Hockley & S. Lewandowsky, Eds.).
McClelland, J., McNaughton, B., & O'Reilly, R. (1994). CMU Tech. Report PNP.CNS.94.1.
McCloskey, M., & Cohen, N. (1989). In: The Psychology of Learning and Motivation 24, 109-165.
McRae, K., & Hetherington, P.A. (1993). In: Proc. of the 15th Annual Conf. of the Cognitive Science Society.
Murre, J.M.J. (1992). In: Proc. of the 14th Annual Conf. of the Cognitive Science Society.
Murre, J.M.J. (1995, in press). In: J. Levy et al. (Eds.), Connectionist Models of Memory and Language. London: UCL Press.
Pratt, L. (1993). In: NIPS-92 Proceedings.
Ratcliff, R. (1990). Psychological Review 97, 285-308.
Ruiz de Angulo, V., & Torras, C. (1995). IEEE Trans. on Neural Networks, in press.
Sharkey, N.E., & Sharkey, A.J.C. (1994). Technical Report, Department of Computer Science, University of Sheffield, U.K.
Sloman, S.A., & Rumelhart, D.E. (1992). In: A. Healy et al. (Eds.), From Learning Theory to Cognitive Processes: Essays in Honor of William K. Estes.
Thrun, S., & Mitchell, T. (1994). CMU Tech. Report.
Waibel, A. (1989). Neural Computation 1, 39-46.
Wang, D.L., & Yuwono, B. (1995). IEEE Trans. on Systems, Man, and Cybernetics 25(4), in press.
Willshaw, D. (1981). In: Parallel Models of Associative Memory (G. Hinton & J. Anderson, Eds.). Erlbaum.