A summary on Catastrophic Interference
DeLiang Wang
dwang at cis.ohio-state.edu
Thu Feb 2 15:24:30 EST 1995
Catastrophic Interference and Incremental Training
A brief summary
by DeLiang Wang
Several weeks ago I posted a message on this network asking about the
research status of catastrophic interference. I have since received more
than 50 replies, public and private. I have benefited much from the
discussions, and I hope others have too. I have read or scanned most of
the papers (those easily accessible) prompted by the replies. To thank
all of you who replied, I have compiled a brief summary of my readings.
Since I do not want to post a long review paper on the net, the following
summary is very brief, which inevitably means oversimplifying the models
mentioned here and neglecting some other work.
(1) Catastrophic interference (CI) refers to the phenomenon that later
training disrupts the results of previous training. It was raised as a
criticism of multilayer perceptrons trained with backpropagation (MLPs)
by Grossberg (1987), and systematically demonstrated by McCloskey & Cohen
(1989) and Ratcliff (1990). Why is CI bad? It prevents a single network
from incrementally accumulating knowledge (the alternative would be to
have each network learn just one set of transforms), and it poses severe
problems for MLPs as models of human/animal memory. A minimal
demonstration is sketched below.
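The sketch is my own illustration in Python/NumPy, not taken from any of
the papers cited here; the network size, the random pattern sets, and the
learning rate are arbitrary choices. A small backprop MLP is trained on
one set of associations, then on a second set alone; its error on the
first set typically rises sharply.

  import numpy as np

  rng = np.random.default_rng(0)

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  class MLP:
      """Tiny one-hidden-layer perceptron trained by on-line backprop."""
      def __init__(self, n_in, n_hid, n_out):
          self.W1 = rng.normal(0.0, 0.5, (n_in, n_hid))
          self.W2 = rng.normal(0.0, 0.5, (n_hid, n_out))

      def forward(self, x):
          self.h = sigmoid(x @ self.W1)   # hidden activations, kept for backprop
          return sigmoid(self.h @ self.W2)

      def train(self, X, T, epochs=2000, lr=0.5):
          for _ in range(epochs):
              for x, t in zip(X, T):
                  y = self.forward(x)
                  dy = (y - t) * y * (1.0 - y)                     # output delta
                  dh = (dy @ self.W2.T) * self.h * (1.0 - self.h)  # hidden delta
                  self.W2 -= lr * np.outer(self.h, dy)
                  self.W1 -= lr * np.outer(x, dh)

      def error(self, X, T):
          return np.mean([(self.forward(x) - t) ** 2 for x, t in zip(X, T)])

  # Two disjoint sets of random binary associations: "task A" and "task B".
  XA = rng.integers(0, 2, (4, 8)).astype(float)
  TA = rng.integers(0, 2, (4, 2)).astype(float)
  XB = rng.integers(0, 2, (4, 8)).astype(float)
  TB = rng.integers(0, 2, (4, 2)).astype(float)

  net = MLP(8, 6, 2)
  net.train(XA, TA)
  print("error on A after training on A:", net.error(XA, TA))  # low
  net.train(XB, TB)        # sequential training: B only, no rehearsal of A
  print("error on A after training on B:", net.error(XA, TA))  # typically much higher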
(2) Catastrophic interference is model-specific. So far, the problem has
been demonstrated only in MLPs and their variants. We know that models in
which different items are represented without overlap (e.g. ART) do not
have this problem. Even some models with a certain amount of overlap do
not have it (see, for example, Willshaw, 1981; Wang and Yuwono, 1995).
Unfortunately, many studies of CI carry general titles, such as
"connectionist models" and "neural network models", and leave the
impression that CI is a general problem of all neural network (NN)
models. Such general titles are used by both critics and proponents of
neural networks. They may have been justified in the early days, when
awareness of NNs needed to be raised. At the current stage of the field's
development, results about NNs should be stated specifically, and a title
should properly reflect the scope of the paper.
(3) Tradeoff between distributedness and interference. The major cause of
CI is the "distributedness" of representations: learning new patterns
must use the very weights that participate in representing previously
learned patterns. Much of the work on overcoming CI is directed toward
reducing the extent of distributedness. We can say that there is a
tradeoff between distributedness and interference (as noted earlier: no
overlap, no CI). The small worked example below makes the link explicit.
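Again this is my own illustration, with made-up hidden codes. For a
delta-rule output layer, the weight update driven by one pattern shifts
another pattern's output in proportion to the inner product of their
hidden representations, so orthogonal codes give exactly zero
interference.

  import numpy as np

  h1 = np.array([1.0, 1.0, 0.0, 0.0])          # hidden code of an old pattern
  h2_overlap = np.array([1.0, 0.0, 1.0, 0.0])  # shares one active unit with h1
  h2_orthog = np.array([0.0, 0.0, 1.0, 1.0])   # shares none

  lr, err = 0.1, 1.0
  dw = lr * err * h1                           # delta-rule update driven by h1
  print("output shift, overlapping code:", dw @ h2_overlap)  # 0.1: interference
  print("output shift, orthogonal code: ", dw @ h2_orthog)   # 0.0: none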
(4) Ways of overcoming CI. The problem has been studied by a number of
authors, all of whom work on MLPs or their variants. Here is a list of
proposals to alleviate CI:
* Reduce overlap in hidden-unit representations. French (1991,
1994), Kruschke (1992), Murre (1992), Fahlman (1991).
* Orthogonalization. This idea was proposed long ago for reducing
cross-talk in associative memories. The same idea works here to reduce CI.
See Kortge (1990), Lewandowsky (1991), Sloman and Rumelhart (1992),
Sharkey and Sharkey (1994), McClelland et al. (1994).
* Prior training. Assuming later patterns are drawn from the same
underlying function as earlier ones, prior training on the general task
can reduce CI (McRae and Hetherington, 1993). This proposal will not
work if later patterns have little to do with previously trained ones.
* Modularization. The idea is similar to the earlier ones: a
hierarchy of different networks, with each network selected to learn a
different category of tasks. See Waibel (1989), Brashers-Krug et al.
(1995).
* Retain only a recent history. The idea here is to let past
patterns fade and retain only a limited number of patterns, including the
new one (reminiscent of STM). See Ruiz de Angulo and Torras (1995). A
sketch of this idea follows the list.
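The following sketch of the last proposal is my own rendering of the
general idea, not the algorithm of Ruiz de Angulo and Torras: keep a
small first-in-first-out buffer of recent patterns and always retrain on
the buffer as a whole, so only the retained history is protected. It
assumes the MLP class from the sketch in point (1), and the buffer size
is an arbitrary choice.

  import numpy as np
  from collections import deque

  def learn_incrementally(net, stream, buffer_size=8):
      """Train net on a stream of (input, target) pairs, rehearsing only
      a bounded recent history; older patterns are silently forgotten."""
      buffer = deque(maxlen=buffer_size)   # FIFO: the oldest pattern drops out
      for x, t in stream:
          buffer.append((x, t))
          X = np.array([p[0] for p in buffer])
          T = np.array([p[1] for p in buffer])
          net.train(X, T)                  # new pattern + retained history together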
(5) Transfer studies. Another body of related, but different, work studies
how previous training can facilitate acquiring a new pattern. The effect
of acquiring a new pattern on previous memory, however, is not explored.
See Pratt (1993), Thrun and Mitchell (1994), Murre (1995).
(6) What about associative memory? In my original message, I suspected
that associative memory models that can handle correlated patterns (see
Kanter and Sompolinsky, 1987; Diederich and Opper, 1987) should suffer
from the same problem of catastrophic interference. Unfortunately, no
response has touched on this issue. Are people taking associative memory
models seriously nowadays?
(7) To summarize, the following two ideas, in my opinion, hold the
greatest promise for solving CI. The first is to reduce the receptive
field of each unit, thus reducing the overlap among different feature
detectors; RBF (radial basis function) networks fall into this category
(see the sketch below). After all, limited receptive fields are
characteristic of brain cells, and all-to-all connections are scarce, if
they exist at all. The second is to introduce some form of modularization
so that different underlying functions are handled by different modules
(reducing overlap among differing tasks). This may not only solve the
problem of CI but also facilitate acquiring new knowledge (positive
transfer). Furthermore, the idea is consistent with the general principle
of functional localization in the brain.
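To illustrate the first idea, here is a sketch of generic Gaussian RBF
units (the centers and width are arbitrary choices of mine): each unit
responds only near its own center, so a weight update for a new pattern
in one region leaves units that code distant, previously learned patterns
nearly untouched.

  import numpy as np

  def rbf_features(x, centers, width=0.5):
      # Gaussian activations: near zero unless x lies close to a center.
      d2 = np.sum((centers - x) ** 2, axis=1)
      return np.exp(-d2 / (2.0 * width ** 2))

  centers = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
  print(rbf_features(np.array([0.1, 0.0]), centers))
  # Only the unit at (0,0) is appreciably active; the unit at (3,3) is
  # silent, so an output-weight update here cannot disturb what it encodes.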
References (more detailed references for tech. reports were posted before):
Brashers-Krug, T., Shadmehr, R., & Todorov, E. (1995). In: NIPS-94 Proceedings, to appear.
Diederich, S., & Opper, M. (1987). Phys. Rev. Lett. 58, 949-952.
Fahlman, S. (1991). In: NIPS-91 Proceedings.
French, R.M. (1991). In: Proc. of the 13th Annual Conf. of the Cognitive Science Society.
French, R.M. (1994). In: Proc. of the 16th Annual Conf. of the Cognitive Science Society.
Grossberg, S. (1987). Cognitive Science 11, 23-64.
Kanter, I., & Sompolinsky, H. (1987). Phys. Rev. A 35, 380-392.
Kortge, C.A. (1990). In: Proc. of the 12th Annual Conf. of the Cognitive Science Society.
Kruschke, J.K. (1992). Psychological Review 99, 22-44.
Lewandowsky, S. (1991). In: Relating Theory and Data: Essays on Human Memory in Honor of Bennet B. Murdock (W. Hockley & S. Lewandowsky, Eds.).
McClelland, J., McNaughton, B., & O'Reilly, R. (1994). CMU Tech. Report PNP.CNS.94.1.
McCloskey, M., & Cohen, N. (1989). In: The Psychology of Learning and Motivation 24, 109-165.
McRae, K., & Hetherington, P.A. (1993). In: Proc. of the 15th Annual Conf. of the Cognitive Science Society.
Murre, J.M.J. (1992). In: Proc. of the 14th Annual Conf. of the Cognitive Science Society.
Murre, J.M.J. (1995, in press). In: J. Levy et al. (Eds.), Connectionist Models of Memory and Language. London: UCL Press.
Pratt, L. (1993). In: NIPS-92 Proceedings.
Ratcliff, R. (1990). Psychological Review 97, 285-308.
Ruiz de Angulo, V., & Torras, C. (1995). IEEE Trans. on Neural Networks, in press.
Sharkey, N.E., & Sharkey, A.J.C. (1994). Technical Report, Department of Computer Science, University of Sheffield, U.K.
Sloman, S.A., & Rumelhart, D.E. (1992). In: A. Healy et al. (Eds.), From Learning Theory to Cognitive Processes: Essays in Honor of William K. Estes.
Thrun, S., & Mitchell, T. (1994). CMU Tech. Report.
Waibel, A. (1989). Neural Computation 1, 39-46.
Wang, D.L., & Yuwono, B. (1995). IEEE Trans. on Systems, Man, and Cybernetics 25(4), in press.
Willshaw, D. (1981). In: Parallel Models of Associative Memory (G. Hinton & J. Anderson, Eds.). Erlbaum.