Connectionists: The Atoms of Neural Computation: A reply to Randy O'Reilly

Stephen Grossberg steve at cns.bu.edu
Fri Nov 7 12:03:59 EST 2014


 
 Dear Randy,
 
Thanks for your comments below in response to my remark that I introduced the core equations used in Leabra in the 1960s and early 1970s.
 
I am personally passionate about trying to provide accurate citations of prior work, and welcome new information about it. This is especially true given that proper citation is not easy in a rapidly developing and highly interdisciplinary field such as ours.
 
Given your comments and the information at my disposal, however, I stand by my remark, and will say why below. If you have additional relevant information, I will welcome it.
 
It is particularly difficult to provide proper citation when the same model name is used even after the model equations are changed. Your comment suggests that the name Leabra is used for all such variations. However, a change of a core model equation is, in fact, a change of model.
 
To deal with the need to develop and refine models, my colleagues and I provide distinct model names for different stages of model development; e.g., ART 1, ART 2, ARTMAP, ARTSCAN, ARTSCENE, etc.
 
My comment was based on your published claims about earlier versions of Leabra. If Leabra is now so changed that these comments are no longer relevant, then perhaps a new model name would help readers to understand this. Using the same name for many different versions of a model makes it hard to ever disconfirm it. Indeed, some authors just correct old mistakes with new equations under the same model name, and never admit that a mistake was made.
 
I will refer mostly to two publications about Leabra: the O’Reilly and Munakata (2000) book (abbreviated O&M below) on Computational Explorations in Cognitive Neuroscience, and the O’Reilly and Frank (2006) article (abbreviated O&F) on Making working memory work: A computational model of learning in the prefrontal cortex and basal ganglia (https://grey.colorado.edu/mediawiki/sites/CompCogNeuro/images/3/30/OReillyFrank06.pdf).
 
The preface of O&M says that the goal of the book, and of Leabra, is a highly worthy one: “to consolidate and integrate advances…into one coherent package…we have found that the process of putting all of these ideas together…led to an emergent phenomenon in which the whole is greater than the sum of its parts…” I was therefore dismayed to see that the core equations of this presumably new synthesis were already pioneered and developed by my colleagues and me long before 2000, and used in a coherent way in many of our previous articles.
 
O&M leaves readers thinking that their process of “putting all of these ideas together” represented a novel synthesis for a unified cognitive architecture. For example, the O&M book review at http://srsc.ulb.ac.be/axcWWW/papers/pdf/03-EJCP.pdf writes that the book’s first five chapters are “dedicated to developing a novel, biologically motivated learning algorithm called Leabra”. The review then lists standard hypotheses from the neural modeling literature as the basic properties of Leabra. The purported advance is that Leabra, “in contrast to the now almost passé back propagation algorithm, takes as a starting point that networks of real neurons exhibit several properties that are incompatible with the assumptions of vanilla back propagation”; notably, that cells can send signals reciprocally to each other; that they experience competition; that their adaptive weights never change sign during learning; and that their connections are never used to back-propagate error information during learning. These claims were not new in 2000.
 
My more specific examples will be drawn from O&F, for definiteness. I will compare the claims of this article with previously published results from our own work, although similar concerns could be expressed using examples from other authors.
 
The devil is in the details. Let me now get specific about the core equations of Leabra. I will break my comments into six parts, one part for each core model equation:
 
1. STM: NETWORK SHUNTING DYNAMICS
Randy, you wrote in your email below that Leabra is “based on the standard equivalent circuit equations for the neuron” and mentioned Hodgkin and Huxley in this regard.
 
It is not clear how “based on” translates into a mathematical model equation. In particular, the Hodgkin-Huxley equations are empirical fits to the dynamics of a squid giant axon. They were not equations for neural networks.
 
It was a big step conceptually to go from individual neurons to neural networks. When I started publishing the Additive and Shunting models for neural networks in 1967-68, they were not considered “standard”, as illustrated by the fact that several of these articles were published in the Proceedings of the National Academy of Sciences.
 
As to the idea that moving away from back propagation was novel in 2000, consider the extensive critique of back propagation in the oft-cited 1988 Grossberg article entitled Nonlinear neural networks: Principles, mechanisms, and architectures (Neural Networks, 1, 17-61; http://www.cns.bu.edu/Profiles/Grossberg/Gro1988NN.pdf). See the comparison of back propagation and adaptive resonance theory in Section 17. The main point of this article was not to criticize back propagation, however. It was to review efforts to develop neurally-based cognitive models that, by 1988, had already been ongoing for at least 20 years.
 
As to the shunting dynamics used in O&F, see p. 316 of their article, where equations (A.1) and (A.2) define shunting cooperative and competitive dynamics. Compare equation (9) on p. 23 and equations (100)-(101) on p. 35 in Grossberg (1988), or equations (A16) and (A18) on pp. 47-48 in the oft-cited 1980 Grossberg article on How does a brain build a cognitive code? (Psychological Review, 87, 1-51; http://cns.bu.edu/Profiles/Grossberg/Gro1980PsychRev.pdf). This article reviewed aspects of the paradigm that I introduced in the 1960s to unify aspects of brain and cognition. Or see equations (1)-(7) in the even earlier, also oft-cited, 1973 Grossberg article on Contour enhancement, short-term memory, and constancies in reverberating neural networks (Studies in Applied Mathematics, 52, 213-257; http://cns.bu.edu/Profiles/Grossberg/Gro1973StudiesAppliedMath.pdf). This breakthrough article showed how to design recurrent shunting cooperative and competitive networks, and their signal functions, to exhibit key properties of contrast enhancement, noise suppression, activity normalization, and short-term memory storage. These three articles illustrate scores of our articles that developed such concepts long before you began to write on this subject.
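For readers who do not have these articles at hand, a generic form of a shunting recurrent on-center off-surround network (notation simplified from the reviews cited above) is:

    \frac{dx_i}{dt} = -A x_i + (B - x_i)\Big[ I_i + f(x_i) \Big] - (x_i + C)\Big[ J_i + \sum_{k \neq i} f(x_k) \Big]

Here x_i is the activity of cell (population) i; A is a passive decay rate; I_i and J_i are excitatory and inhibitory inputs; f is the recurrent signal function; and the multiplicative, or shunting, terms (B - x_i) and (x_i + C) keep each activity between -C and B. These shunting terms are the source of the automatic gain control and activity normalization properties noted above.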
 
2. SIGMOID SIGNALS
O&F introduce a sigmoidal signal function in their equation (A.3). Grossberg (1973) was the first article to mathematically characterize how sigmoidal signal functions transform inputs before storing them in a short-term memory that is defined by a recurrent shunting on-center off-surround network. These results have been reviewed in many places; e.g., Grossberg (1980, pp. 46-49, Appendices C and D) and Grossberg (1988, p. 37).
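To fix ideas, one sigmoid signal function of the kind analyzed there (the exponent and constant are illustrative) is:

    f(w) = \frac{w^n}{K^n + w^n}, \qquad n > 1.

Grossberg (1973) proved how the shape of f determines what such a network stores: a linear f preserves the input pattern but also amplifies noise, a slower-than-linear f uniformizes the pattern, a faster-than-linear f makes a winner-take-all choice and suppresses noise, and a sigmoid f, a hybrid of these cases, creates a quenching threshold: activities that start below it are suppressed as noise, while activities above it are contrast-enhanced and stored.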
 
3. COMPETITION, PARTIAL CONTRAST, AND k-WINNERS-TAKE-ALL
O&F introduce k-Winners-Take-All Inhibition in their equations (A.5)-(A.6). Grossberg (1973) mathematically proved how to realize partial contrast enhancement (i.e., k-Winners-Take-All Inhibition) in a shunting recurrent on-center off-surround network. This result is also reviewed in Grossberg (1980) and Grossberg (1988), and happens automatically when a sigmoid signal function is used in a recurrent shunting on-center off-surround network. It does not require a separate hypothesis.
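To make this concrete, here is a minimal simulation sketch, my own illustration with hand-picked parameters rather than code from any of the articles under discussion, of kWTA-like storage emerging in such a network:

    import numpy as np

    # Recurrent shunting on-center off-surround network (Grossberg, 1973):
    #   dx_i/dt = -A*x_i + (B - x_i)*f(x_i) - x_i * sum_{k != i} f(x_k)
    # with a sigmoid signal function f(w) = w^2 / (K^2 + w^2).
    A, B, K = 1.0, 1.0, 0.2

    def f(w):
        return w**2 / (K**2 + w**2)

    # Activities left behind by a briefly presented input pattern:
    x = np.array([0.50, 0.45, 0.40, 0.08, 0.05, 0.02])

    dt = 0.01
    for _ in range(int(30 / dt)):   # integrate after input offset
        s = f(x)
        x += dt * (-A * x + (B - x) * s - x * (s.sum() - s))

    print(np.round(x, 3))
    # The three largest initial activities are stored at comparable
    # suprathreshold values; the rest are quenched toward zero. The
    # kWTA behavior emerges from the network dynamics themselves.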
 
4. MTM: HABITUATIVE TRANSMITTER GATING AND SYNAPTIC DEPRESSION
O&F describe synaptic depression in their equation (A.18). The term synaptic depression was introduced by Abbott et al. (1997), who derived an equation for it from their visual cortical data. Tsodyks and Markram (1997) derived a similar equation with somatosensory cortical data in mind. I introduced equations for synaptic depression in PNAS in 1968 (e.g., equations (18)-(24) in http://cns.bu.edu/~steve/Gro1968PNAS60.pdf). I called it medium-term memory (MTM), or activity-dependent habituation, or habituative transmitter gates; e.g., see the review in http://www.scholarpedia.org/article/Recurrent_neural_networks.
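For reference, a generic form of such a habituative transmitter gate (notation simplified) is:

    \frac{dz_i}{dt} = \epsilon (1 - z_i) - \lambda S_i z_i, \qquad T_i = S_i z_i,

where the available transmitter z_i slowly accumulates toward its maximum at rate \epsilon, is inactivated at a rate proportional to the signal S_i that it gates, and multiplies that signal to yield the net gated signal T_i. A sustained signal thereby habituates, or depresses, its own pathway.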
 
MTM has multiple functional roles that were all used in our models in the 1960s-1980s, and thereafter to the present. One role is to carry out intracellular adaptation that divides the response to a current input by a time-average of recent input intensity. A related role is to prevent recurrent activation from persistently choosing the same neuron, by reducing the net input to that neuron. MTM traces also enable reset events to occur. For example, in a gated dipole opponent processing network, used a great deal in our modeling of reinforcement learning, they enable an antagonistic rebound in activation to occur in the network's OFF channel in response either to a rapidly decreasing input to the ON channel, or to an arousal burst to both channels that is triggered by an unexpected event (e.g., Grossberg, 1972, http://cns.bu.edu/~steve/Gro1972MathBioSci_II.pdf; Grossberg, 1980, Appendix E, http://cns.bu.edu/Profiles/Grossberg/Gro1980PsychRev.pdf). This rebound property enables a resonance that reads out a predictive error to be quickly reset, thereby triggering a memory search, or hypothesis testing, to discover a recognition category capable of better representing an attended object or event, as in adaptive resonance theory, or ART. MTM reset dynamics also help to explain data about the dynamics of visual perception, cognitive-emotional interactions, decision-making under risk, and sensory-motor control.
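Here is a minimal simulation sketch of the antagonistic rebound property, again my own illustration with hand-picked parameters rather than code from the cited articles:

    # Gated dipole: ON and OFF channels, each gated by a habituative
    # transmitter z obeying dz/dt = eps*(1 - z) - lam*S*z. Tonic arousal
    # I feeds both channels; a phasic input J feeds only the ON channel.
    eps, lam, I, dt = 0.05, 0.1, 1.0, 0.01
    z_on, z_off = 1.0, 1.0

    for step in range(int(60 / dt)):
        t = step * dt
        J = 1.0 if 10 <= t < 30 else 0.0          # phasic ON input
        S_on, S_off = I + J, I
        z_on += dt * (eps * (1 - z_on) - lam * S_on * z_on)
        z_off += dt * (eps * (1 - z_off) - lam * S_off * z_off)
        on_out = max(S_on * z_on - S_off * z_off, 0.0)    # opponent outputs
        off_out = max(S_off * z_off - S_on * z_on, 0.0)
        if step % 500 == 0:
            print(f"t={t:5.1f}  ON={on_out:.3f}  OFF={off_out:.3f}")

    # While J is on, only the ON channel fires, habituating its gate.
    # When J shuts off, the depleted ON gate transiently undergates its
    # channel, so the OFF channel wins: an antagonistic rebound that
    # decays as z_on recovers.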
 
5. LTM: HEBBIAN VS. GATED STEEPEST DESCENT LEARNING
O&F then introduce what they call a Hebbian learning equation (A.7). This equation was introduced in several of my 1969 articles and has been used in many articles since then. It is reviewed in Grossberg (1980, p. 43, equation (A2)) and Grossberg (1988, p. 23, equation (11)). This equation describes gated steepest descent learning, with variants called outstar learning for the learning of spatial patterns (described in the Journal of Statistical Physics in 1969; http://cns.bu.edu/~steve/Gro1969JourStatPhy.pdf) and instar learning for the tuning of adaptive filters (described in Biological Cybernetics in 1976; http://cns.bu.edu/~steve/Gro1976BiolCyb_I.pdf), where I first used it to develop competitive learning and self-organizing map models. It is sometimes called Kohonen learning, after Kohonen’s use of it in self-organizing maps starting in 1984. Significantly, this learning law seems to be the first example of a process that gates learning, a concept that O&F emphasize, and one that I first discovered in 1958 and published in 1969 and thereafter as part of a mathematical analysis of associative learning in recurrent neural networks.
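In generic form (notation simplified), gated steepest descent learning is:

    \frac{dz_{ij}}{dt} = f(x_i)\big[ -z_{ij} + h(x_j) \big],

where the gating signal f(x_i) opens learning only when the sampling cell is active. While the gate is open, the adaptive weight z_{ij} tracks the sampled signal h(x_j) by steepest descent, so it can either increase or decrease, a point to which I return below. Outstar and instar learning differ in whether the gate is controlled by the presynaptic or the postsynaptic cell of the pathway.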
 
I introduced ART in the second part of the 1976 Biological Cybernetics article (http://cns.bu.edu/~steve/Gro1976BiolCyb_II.pdf) in order to show how this kind of learning could dynamically self-stabilize in response to large nonstationary databases by using attentional matching and memory search, or hypothesis testing. MTM plays an important role in this self-regulating search process.
 
It should also be noted that equation (A.7) is not Hebbian. It mixes Hebbian and anti-Hebbian properties. Such an adaptive weight, or long-term memory (LTM) trace, can either increase or decrease to track the signals in its pathway. When an LTM trace increases, it can properly be said to undergo Hebbian learning, after the famous law of Hebb (1949), which said that associative traces always increase during learning. When such an LTM trace decreases, it is said to undergo anti-Hebbian learning. Gated steepest descent was the first learning law to incorporate both Hebbian and anti-Hebbian properties in a single synapse. Since that time, such a law has been used to model neurophysiological data about learning in the hippocampus and cerebellum (also called Long Term Potentiation and Long Term Depression) and about adaptive tuning of cortical feature detectors during early visual development, among many other topics.
 
The Hebb (1949) learning postulate says that: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased". This postulate only allows LTM traces to increase. Thus, after sufficient learning, Hebbian traces would saturate at their maximum values and could not subsequently respond adaptively to changing environmental demands. The Hebb postulate also assumed the wrong processing unit: it assumed that the strength of an individual connection is the unit of learning. My mathematical work in the 1960s showed, instead, that the unit of LTM is a pattern of LTM traces that is distributed across a network. When one needs to match an LTM pattern to an STM pattern, as occurs during category learning, both increases and decreases of LTM strength are needed.
 
6. ERROR-DRIVEN LEARNING
Finally, O&F introduce a form of error-driven learning in equation (A.10). Several biological variants of error-driven learning have been well-known for many years. Indeed, I have proposed a fundamental reason why the brain needs both gated steepest descent and error-driven learning, which I will only mention briefly here: The brain’s global organization seems to embody Complementary Computing. For further discussion of this theme, see Figure 1 and related text in the Grossberg (2012 Neural Networks, 37, 1-47) review article (http://cns.bu.edu/~steve/ART.pdf) or Grossberg (2000, Trends in Cognitive Sciences, 4, 233-246; http://www.cns.bu.edu/Profiles/Grossberg/Gro2000TICS.pdf).
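For orientation, the generic delta-rule form that such error-driven laws elaborate changes each adaptive weight in proportion to the difference between a target signal and the obtained response (notation illustrative):

    \Delta z_{ij} = \eta (d_j - x_j) s_i,

where d_j is the target, x_j the actual response, s_i the presynaptic signal, and \eta a learning rate.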
 
One error-driven learning equation, for opponent learning of adaptive movement gains, was already reviewed in Grossberg (1988, p. 51, equations (118)-(121)). However, the main use of error-driven learning in O&F is in reinforcement learning. Even here, there are error-based reinforcement learning laws that explain more neural data than the O&F equation can, and that were published earlier. I will come back to error-driven reinforcement learning in a moment.
 
First, let me summarize: the evidence supplied above shows that there is precious little that was new in the original Leabra formalism. It is disappointing, given that my work is well known among neural modelers to have pioneered these concepts and mechanisms, that none of these articles was cited as a source for Leabra in O&F. This is all the more regrettable since I exchanged detailed collegial emails with an O’Reilly collaborator more than 10 years ago to try to make this historical background known to him.
 
Every model has its weaknesses, which provide opportunities for further development. Such weaknesses are hard to accept, however, when they have already been overcome in prior published work, and are not noted in articles that espouse a later, weaker, model.
 
In order to prevent this comment from becoming unduly long, I discuss only one aspect of the O&F work on error-driven reinforcement learning, to complete my comments about Leabra.
 
In O&F (2006, p. 284), it was written that “to date, no model has attempted to address the more difficult question of how the BG [basal ganglia] ‘knows’ what information is task relevant (which was hard-wired in prior models). The present model learns this dynamic gating functionality in an adaptive manner via reinforcement learning mechanisms thought to depend on the dopaminergic system and associated areas”. This claim is not correct. For example, two earlier articles by Brown, Bullock, and Grossberg use dopaminergic gating, among other mechanisms, to show how the brain can learn what is task relevant (1999, Journal of Neuroscience, 19, 10502-10511, http://www.cns.bu.edu/Profiles/Grossberg/BroBulGro99.pdf; 2004, Neural Networks, 17, 471-510, http://www.cns.bu.edu/Profiles/Grossberg/BroBulGro2003NN.pdf). The former article simulates neurophysiological data about basal ganglia and related brain regions that temporal difference models, and O&F (2006), could not explain. The latter article shows how the TELOS model can incrementally learn five different tasks that monkeys have been trained to learn. After learning, the model quantitatively simulates the recorded neurophysiological dynamics of 17 established cell types in frontal cortex, basal ganglia, and related brain regions, and predicts explicit functional roles for all of these cells in the learning and performance of these tasks.
 
The O&F (2006) model did not achieve this level of understanding. Instead, the model simulated some relatively simple cognitive tasks and seems to show no quantitative fits to any data. The model also seemed to make what I consider unnecessary errors. For example, O&F (2006) wrote on p. 294 that “…When a conditioned stimulus is activated in advance of a primary reward, the PV system is actually trained to not expect reward at this time, because it is always trained by the current primary reward value. Therefore, we need an additional mechanism to account for the anticipatory DA bursting at CS onset, which in turn is critical for training up the BG gating system…This is the learned value (LV) system, which is trained only when primary rewards are either present or expected by the PV and is free to fire at other times without adapting its weights. Therefore, the LV is protected from having to learn that no primary reward is actually present at CS onset, because it is not trained at that time”. No such convoluted assumptions were needed in Brown et al. (1999) to explain and quantitatively simulate a broad range of anatomical and neurophysiological conditioning data from monkeys that were recorded in the ventral striatum, striosomes, pedunculopontine tegmental nucleus, and lateral hypothalamus.
 
I believe that a core problem in their model is a lack of understanding of an issue that is basic in these data; namely, how adaptively timed conditioning occurs. Their error-driven conditioning laws, based on delta-rule learning, are, to my mind, simply inadequate; see their equations (3.1)-(3.3) on p. 294 and their equations in Section A.5, pp. 318 ff. In contrast, the Bullock and Grossberg family of articles has traced adaptively timed error-driven learning to detailed dynamics of the metabotropic glutamate receptor system, as simulated in Brown et al. (1999), and has also been used to quantitatively simulate adaptively timed cerebellar data. In this regard, O’Reilly and Frank (2006) mention in passing the cerebellum as a source of “timing signals” on p. 294, line 7. However, the timing that goes on in the cerebellum and the timing that goes on in the basal ganglia have different functional roles. A detailed modeling synthesis and simulations of biochemical, biophysical, neurophysiological, anatomical, and behavioral data about adaptively timed conditioning in the cerebellum is provided in the article by Fiala, Grossberg, and Bullock (Journal of Neuroscience, 1996, 16, 3760-3774, http://cns.bu.edu/~steve/FiaGroBul1996JouNeuroscience.pdf).
 
There are many related problems. Not the least of them is the assumption “that time is discretized into steps that correspond to environmental events (e.g., the presentation of a CS or US)” (O&F, 2006, p. 318). One cannot understand adaptively timed learning, or working memory for that matter, if such an assumption is made. Such unrealistic technical assumptions often lead one to unrealistic conceptual assumptions. A framework that uses real-time dynamics is needed to deeply understand how these brain processes work.
 
Best,
 
Steve
 
 
 
 
On Nov 5, 2014, at 3:14 AM, Randall O'Reilly wrote:

> 
>> Given Randy O‘Reilly’s comments about Leabra, it is also of historical interest that I introduced the core equations used in Leabra in the 1960s and early 1970s, and they have proved to be of critical importance in all the developments of ART.
> 
> For future reference, Leabra is based on the standard equivalent circuit equations for the neuron which I believe date at least to the time of Hodgkin and Huxley.  Specifically, we use the “AdEx” model of Gerstner and colleagues, and a rate-code equivalent thereof that we derived.  For learning we use a biologically-plausible version of backpropagation that I analyzed in 1996 and has no provenance in any of your prior work.  Our more recent version of this learning model shares a number of features in common with the BCM algorithm from 1982, while retaining the core error-driven learning component.  I just put all the equations in one place here in case you’re interested: https://grey.colorado.edu/emergent/index.php/Leabra
> 
> None of this is to say that your pioneering work was not important in shaping the field — of course it was, but I hope you agree that it is also important to get one’s facts straight on these things.
> 
> Best,
> - Randy
> 

Stephen Grossberg
Wang Professor of Cognitive and Neural Systems
Professor of Mathematics, Psychology, and Biomedical Engineering
Director, Center for Adaptive Systems http://www.cns.bu.edu/about/cas.html
http://cns.bu.edu/~steve
steve at bu.edu



