Connectionists: The Atoms of Neural Computation: A reply to Randy O'Reilly

Fri Nov 7 13:05:23 EST 2014

There is a rather fundamental levels-of-analysis difference here. As
Sejnowski mentioned, there are 100 to 1000 types of pyramidal neurons, but
Dror Cohen appropriately pointed out Edelman's notion of "degeneracy." This
is where neurons have become differentiated and yet perform the same
function. Ockham's razor suggests that we assume a strong form of the
degeneracy hypothesis, using simple, abstract models, and then add
increased biological fidelity as these models prove insufficient.

Sufficiency is, of course, dependent on your goal. If your goal is to
create a model that solves the problems the brain solves, in the way the
brain solves them, while retaining some but not too much
fidelity, then you want biologically plausible backpropagation (for now, at
least, as it's far from proven that the brain does this). If you just want
to do what the brain does in some way, you abstract away and use Bayes. If
you want to do what the brain does in all the excruciating detail that the
brain does it in, because that's just what you are personally interested
in, or for whatever reason, you concern yourself with the almost literally
infinitely detailed differences in neurons. And there is plenty of room
in-between these three basic levels. Even studying Purkinje neurons in
detail for decades outside the context of the rest of the brain helps
inform everything else in the brain.

There are more general issues with provenance conversations like this one.
They ignore the principle of "he who says it best says it last" - J.
Schmidhuber (2008). *The last inventor of the telephone*. Science. This
point stresses that communication is essential, and pushes against
mountains of papers that are hard to comprehend and would require a legion
of programmers to independently replicate, and thus really understand,
assuming that such replication is possible. But perhaps the most
fundamental problem with such provenance conversations is that they ignore
that we are all working together here, even though we may *think* we are
working at cross-purposes, or with different goals in mind (including just
trying to be more awesome than everyone else).  Everything everyone does
informs everything that everyone else does. This reminds me of
multi-objective optimization, which also seems to summarize how we -
embodied brains - came to be in the first place. This same kind of process,
at the memetic instead of genetic level, will lead to a theory of the
brain. Interacting levels of analysis. Researchers working apart working
together.

That said, I think all would agree that Schmidhuber's recent review is a
very useful endeavor, and provenance is ultimately of historical
importance, and credit assignment is motivationally important. Regarding
Leabra, Grossberg's claim that he invented the core Leabra equations, which
embody the intuition that backpropagation can be implemented in a
biologically plausible way if neurons keep track of their activation in two
different phases and then compute their weight deltas purely locally based
on this difference, seems wrong. Where is this prior actual implementation
and actual description that could be implemented and would work as well as
backpropagation as Leabra does? Regarding the Leabra cognitive
architecture, it borrows heavily from pretty much every single other
researcher, including Grossberg (as his postdoc(s) have worked in
O'Reilly's lab, and we have tried to read some of Grossberg's papers), but
also even ACT-R, which has an independent implementation in *emergent*, and
has been partially integrated with Leabra in a model called SAL (Synthesis
of ACT-R and Leabra). And this architecture is inspired by *so many*
others. As many as possible.
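
To make the two-phase intuition above concrete, here is a minimal sketch. It is
not the Leabra equations, only a single sigmoid unit trained with a
GeneRec-style rule in which the weight change is the sending activation times
the difference between the plus-phase (outcome) and minus-phase (expectation)
receiving activation. The function names, the OR task, the bias input, and the
learning rate are all illustrative assumptions of mine.

```python
import math


def sigmoid(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def two_phase_delta(pre_minus, post_minus, post_plus, lr=0.5):
    """Local weight change from the difference between two activation phases.

    Hypothetical helper: uses only quantities available at the synapse
    (sender activity, receiver activity in each phase).
    """
    return lr * pre_minus * (post_plus - post_minus)


def predict(weights, inputs):
    """Minus-phase (expectation) activation of the single output unit."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))


def train_or(epochs=2000, lr=0.5):
    """Train one unit on logical OR; the third input is a constant bias."""
    data = [((0.0, 0.0, 1.0), 0.0), ((0.0, 1.0, 1.0), 1.0),
            ((1.0, 0.0, 1.0), 1.0), ((1.0, 1.0, 1.0), 1.0)]
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for inputs, target in data:
            # Minus phase: the unit produces its own expectation.
            post_minus = predict(weights, inputs)
            # Plus phase: the outcome is clamped onto the unit.
            post_plus = target
            weights = [w + two_phase_delta(x, post_minus, post_plus, lr)
                       for w, x in zip(weights, inputs)]
    return weights


if __name__ == "__main__":
    w = train_or()
    for pattern in [(0.0, 0.0, 1.0), (0.0, 1.0, 1.0),
                    (1.0, 0.0, 1.0), (1.0, 1.0, 1.0)]:
        print(pattern[:2], round(predict(w, pattern), 3))
```

With only one layer this reduces to the delta rule. The interesting case is a
recurrent network with hidden units, where the same local difference stands in
for a backpropagated error signal.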

Assigning credit properly here is *very* hard, and there is no
point fighting about it. The O'Reilly lab bibliography contains tens of
thousands of papers aggregated by scores of researchers over decades.
Ultimately, these papers are distilled into concepts and memes that are
incorporated into models. These new models stand on the shoulders of
giants. For a giant to come back and say that the new work is in fact his,
is counterproductive. Instead, we should look for something like SALART, or
just let the process unfold. Provenance is to be left to historians, and it
will ultimately be flawed, as the information for accurate record-keeping
is just not around, nor easily expressible in such a complex domain. And
arguably, almost all of the *real* communication in science happens
not via papers, but via researchers migrating from lab to lab and sharing
concepts whose origin they may well be unaware of, which pushes all of our
knowledge forward. Working in isolation and aggressively assigning credit
to oneself is hypercompetitive and borderline narcissistic. It's also
rather ironic considering that we study systems that
use not just competition but also cooperation. This is all explained by
game theory at some level, after all. It's a mix!

Speaking of how mixed up all of our contributions are, please take the time
to help keep my list of neural simulators up-to-date, which I intended to
be a community resource and project.
https://grey.colorado.edu/emergent/index.php/Comparison_of_Neural_Network_Simulators

Sincerely,

Brian

http://linkedin.com/in/brianmingus

On Fri, Nov 7, 2014 at 10:03 AM, Stephen Grossberg <steve at cns.bu.edu> wrote:

>
>  Dear Randy,
>
> Thanks for your comments below in response to my remark that I introduced
> the core equations used in Leabra in the 1960s and early 1970s.
>
> I am personally passionate about trying to provide accurate citations of
> prior work, and welcome new information about it. This is especially true
> given that proper citation is not easy in a rapidly developing and highly
> interdisciplinary field such as ours.
>
> Given your comments and the information at my disposal, however, I stand
> by my remark, and will say why below. If you have additional relevant
> information, I will welcome it.
>
> It is particularly difficult to provide proper citation when the same
> model name is used even after the model equations are changed. Your comment
> suggests that the name Leabra is used for all such variations. However, a
> change of a core model equation is, in fact, a change of model.
>
> To deal with the need to develop and refine models, my colleagues and I
> provide distinct model names for different stages of model development;
> e.g., ART 1, ART 2, ARTMAP, ARTSCAN, ARTSCENE, etc.
>
> My comment was based on your published claims about earlier versions of
> Leabra. If Leabra is now so changed that these comments are no longer
> relevant, then perhaps a new model name would help readers to understand
> this. Using the same name for many different versions of a model makes it
> hard to ever disconfirm it. Indeed, some authors just correct old mistakes
> with new equations under the same model name, and never admit that a
> mistake was made.
>
> I will refer mostly to two publications about Leabra: The O’Reilly and
> Munakata (2000) book (abbreviated O&M below) on *Computational
> Explorations in Cognitive Neuroscience*, and the O’Reilly and Frank
> (2006) article (abbreviated O&F) on *Making working memory work: A
> computational model of learning in the prefrontal cortex and basal ganglia*;
> https://grey.colorado.edu/mediawiki/sites/CompCogNeuro/images/3/30/OReillyFrank06.pdf.
>
>
> The preface of O&M says that the goal of the book, and of Leabra, is a
> highly worthy one: “to consolidate and integrate advances…into one coherent
> package…we have found that the process of putting all of these ideas
> together…led to an emergent phenomenon in which the whole is greater than
> the sum of its parts…” I was therefore dismayed to see that the core
> equations of this presumably new synthesis were already pioneered and
> developed by my colleagues and me long before 2000, and used in a coherent
> way in many of our previous articles.
>
> O&M leaves readers thinking that their process of “putting all of these
> ideas together” represented a novel synthesis for a unified cognitive
> architecture. For example, the O&M book review
> http://srsc.ulb.ac.be/axcWWW/papers/pdf/03-EJCP.pdf writes that the
> book’s first five chapters are “dedicated to developing a novel,
> biologically motivated learning algorithm called Leabra”. The review then
> lists standard hypotheses in the neural modeling literature as the basic
> properties of Leabra. The purported advance of Leabra “in contrast to the
> now almost passé back propagation algorithm, takes as a starting point that
> networks of real neurons exhibit several properties that are incompatible
> with the assumptions of vanilla back propagation”; notably that cells can
> send signals reciprocally to each other; they experience competition; their
> adaptive weights never change sign during learning; and their connections
> are never used to back-propagate error information during learning. These
> claims were not new in 2000.
>
> My more specific examples will be drawn from O&F, for definiteness. I will
> compare the claims of this article with previously published results from
> our own work, although similar concerns could be expressed using examples
> from other authors.
>
> The devil is in the details. Let me now get specific about the core
> equations of Leabra. I will break my comments into six parts, one part for
> each core model equation:
>
> *1. STM: NETWORK SHUNTING DYNAMICS*
> Randy, you wrote in your email below that Leabra is “based on the standard
> equivalent circuit equations for the neuron” and mention Hodgkin and Huxley
> in this regard.
>
> It is not clear how “based on” translates into a mathematical model
> equation. In particular, the Hodgkin-Huxley equations are empirical fits to
> the dynamics of a squid giant axon. They were not equations for neural
> networks.
>
> It was a big step conceptually to go from individual neurons to neural
> networks. When I started publishing the Additive and Shunting models for
> neural networks in 1967-68, they were not considered “standard”, as
> illustrated by the fact that several of these articles were published in
> the *Proceedings of the National Academy of Sciences.*
>
> As to the idea that moving away from back propagation was novel in 2000,
> consider the extensive critique of back propagation in the oft-cited 1988
> Grossberg article entitled *Nonlinear neural networks: Principles,
> mechanisms, and architectures* (*Neural Networks*, 1, 17-61).
> http://www.cns.bu.edu/Profiles/Grossberg/Gro1988NN.pdf. See the
> comparison of back propagation and adaptive resonance theory in Section 17.
> The main point of this article was not to criticize back propagation,
> however. It was to review efforts to develop neurally-based cognitive
> models that had been ongoing already in 1988 for at least 20 years.
>
> As to the shunting dynamics used in O&F, on p. 316 of their article,
> consider equations (A.1) and (A.2), which define shunting cooperative and
> competitive dynamics. Compare equation (9) on p. 23 and equations
> (100)-(101) on p. 35 in Grossberg (1988), or equations (A16) and (A18) on
> pp. 47-48 in the oft-cited 1980 Grossberg article on *How does a brain
> build a cognitive code?* (*Psychological Review*, 87, 1-51;
> http://cns.bu.edu/Profiles/Grossberg/Gro1980PsychRev.pdf). This article
> reviewed aspects of the paradigm that I introduced in the 1960s to unify
> aspects of brain and cognition. Or see equations (1)-(7) in the even
> earlier, also oft-cited, 1973 Grossberg article on *Contour enhancement,
> short-term memory, and constancies in reverberating neural networks*
> (*Studies in Applied Mathematics*, 52, 213-257;
> http://cns.bu.edu/Profiles/Grossberg/Gro1973StudiesAppliedMath.pdf). This
> breakthrough article showed how to design recurrent shunting cooperative
> and competitive networks and their signal functions to exhibit key
> properties of contrast enhancement, noise suppression, activity
> normalization, and short-term memory storage. These three articles
> illustrate scores of our articles that have developed such concepts before
> you began to write on this subject.
>
> *2. SIGMOID SIGNALS*
> O&F introduce a sigmoidal signal function in their equation (A.3).
> Grossberg (1973) was the first article to mathematically characterize how
> sigmoidal signal functions transform inputs before storing them in a
> short-term memory that is defined by a recurrent shunting on-center
> off-surround network. These results have been reviewed in many places;
> e.g., Grossberg (1980, pp. 46-49, Appendices C and D) and Grossberg (1988,
> p. 37).
>
> *3. COMPETITION, PARTIAL CONTRAST, AND k-WINNERS-TAKE-ALL*
> O&F introduce k-Winners-Take-All Inhibition in their equations
> (A.5)-(A.6). Grossberg (1973) mathematically proved how to realize partial
> contrast enhancement (i.e., k-Winners-Take-All Inhibition) in a shunting
> recurrent on-center off-surround network. This result is also reviewed in
> Grossberg (1980) and Grossberg (1988), and happens automatically when a
> sigmoid signal function is used in a recurrent shunting on-center
> off-surround network. It does not require a separate hypothesis.
>
> *4. MTM: HABITUATIVE TRANSMITTER GATING AND SYNAPTIC DEPRESSION*
> O&F describe synaptic depression in their equation (A.18). The term
> synaptic depression was introduced by Abbott et al. (1997), who derived an
> equation for it from their visual cortical data. Tsodyks and Markram (1997)
> derived a similar equation with somatosensory cortical data in mind. I
> introduced equations for synaptic depression in *PNAS* in 1968 (e.g.,
> equations (18)-(24) in http://cns.bu.edu/~steve/Gro1968PNAS60.pdf). I
> called it medium-term memory (MTM), or activity-dependent habituation, or
> habituative transmitter gates; e.g., see the review in
> http://www.scholarpedia.org/article/Recurrent_neural_networks.
>
> MTM has multiple functional roles that were all used in our models in the
> 1960s-1980s, and thereafter to the present. One role is to carry out
> intracellular adaptation that divides the response to a current input with
> a time-average of recent input intensity. A related role is to prevent
> recurrent activation from persistently choosing the same neuron, by
> reducing the net input to this neuron. MTM traces also enable reset events
> to occur. For example, in a gated dipole opponent processing network, used
> a great deal in our modeling of reinforcement learning, they enable an
> antagonistic rebound in activation to occur in the network's OFF channel in
> response to either a rapidly decreasing input to the ON channel, or to an
> arousal burst to both
> channels that is triggered by an unexpected event (e.g., Grossberg, 1972,
> http://cns.bu.edu/~steve/Gro1972MathBioSci_II.pdf; Grossberg, 1980,
> Appendix E, http://cns.bu.edu/~steve/Gro1972MathBioSci_II.pdf). This
> property enables a resonance that reads out a
> predictive error to be quickly reset, thereby triggering a memory search,
> or hypothesis testing, to discover a recognition category capable of better
> representing an attended object or event, as in adaptive resonance theory,
> or ART. MTM reset dynamics also help to explain data about the dynamics of
> visual perception, cognitive-emotional interactions, decision-making under
> risk, and sensory-motor control.
>
> *5. LTM: HEBBIAN VS. GATED STEEPEST DESCENT LEARNING*
> O&F then introduce what they call a Hebbian learning equation (A.7). This
> equation was introduced in several of my 1969 articles and has been used in
> many articles since then. It is reviewed in Grossberg (1980, p. 43,
> equation (A2)) and Grossberg (1988, p. 23, equation (11)). This equation
> describes *gated steepest descent learning*, with variants called *outstar
> learning* for the learning of spatial patterns (that was described in the *Journal
> of Statistical Physics* in 1969;
> http://cns.bu.edu/~steve/Gro1969JourStatPhy.pdf) and *instar learning*
> for the tuning of adaptive filters (that was described in 1976 in *Biological
> Cybernetics*; http://cns.bu.edu/~steve/Gro1976BiolCyb_I.pdf), where I first used it
> to develop competitive learning and self-organizing map models. It is
> sometimes called Kohonen learning after Kohonen’s first use of it in
> self-organizing maps after 1984. Significantly, this learning law seems to
> be the first example of a process that *gates* learning, a concept that
> O&F emphasize, and which I first discovered in 1958 and published in 1969
> and thereafter as parts of a mathematical analysis of associative learning
> in recurrent neural networks.
>
> I introduced ART in the second part of the 1976 *Biological Cybernetics*
> article (http://cns.bu.edu/~steve/Gro1976BiolCyb_II.pdf) in order to show
> how this kind of learning could dynamically self-stabilize in response to
> large non-stationary data bases using attentional matching and memory
> search, or hypothesis testing.  MTM plays an important role in this
> self-regulating search process.
>
> It should also be noted that equation (A.7) is *not* Hebbian. It mixes
> Hebbian and anti-Hebbian properties. Such an adaptive weight, or
> long-term memory (LTM) trace, can either increase or decrease to track the
> signals in its pathway.   When an LTM trace increases, it can properly be
> said to undergo Hebbian learning, after the famous law of Hebb (1949) which
> said that associative traces always increase during learning. When such an
> LTM trace decreases, it is said to undergo anti-Hebbian learning. Gated
> steepest descent was the first learning law to incorporate both Hebbian and
> anti-Hebbian properties in a single synapse. Since that time, such a law
> has been used to model neurophysiological data about learning in the
> hippocampus and cerebellum (also called Long Term Potentiation and Long
> Term Depression) and about adaptive tuning of cortical  feature detectors
> during early visual development, among many other topics.
>
> The Hebb (1949) learning postulate says that: "When an axon of cell A is
> near enough to excite a cell B and repeatedly or persistently takes part in
> firing it, some grown process or metabolic change takes place in one or
> both cells such that A's efficiency, as one of the cells firing B, is
> increased". This postulate only allows LTM traces to increase. Thus, after
> sufficient learning took place, Hebbian traces would saturate at their
> maximum values, and could not subsequently respond adaptively to changing
> environmental demands. The Hebb postulate assumed the wrong processing
> unit: It assumed that the strength of an individual connection is the unit
> of learning. My mathematical work in the 1960s showed, instead, that the
> unit of LTM is a pattern of LTM traces that is distributed across a
> network. When one needs to match an LTM pattern to an STM pattern, as
> occurs during category learning, then both increases and decreases of LTM
> strength are needed.
>
> *6. ERROR-DRIVEN LEARNING*
> Finally, O&F introduce a form of error-driven learning in equation (A.10).
> Several biological variants of error-driven learning have been well-known
> for many years. Indeed, I have proposed a fundamental reason why the brain
> needs both gated steepest descent and error-driven learning, which I will
> only mention briefly here: The brain’s global organization seems to embody
> Complementary Computing. For further discussion of this theme, see Figure 1
> and related text in the Grossberg (2012, *Neural Networks*, 37, 1-47)
> review article (http://cns.bu.edu/~steve/ART.pdf) or Grossberg (2000, *Trends
> in Cognitive Sciences*, 4, 233-246;
> http://www.cns.bu.edu/Profiles/Grossberg/Gro2000TICS.pdf).
>
> One error-driven learning equation, for opponent learning of adaptive
> movement gains, was already reviewed in Grossberg (1988, p. 51, equations
> (118)-(121)). However, the main use of the O&F error-driven learning is in
> reinforcement learning. Even here, there are error-based reinforcement
> learning laws that explain more neural data than the O&F equation can, and
> did it before they did. I will come back to error-driven reinforcement
> learning in a moment.
>
> First, let me summarize: The evidence supplied above shows that there is
> precious little that was new in the original Leabra formalism. It is
> disappointing, given the fact that my work is well-known by many neural
> modelers to have pioneered these concepts and mechanisms, that none of
> these articles was cited as a source for Leabra in O&F. This is all the
> more regrettable since I exchanged detailed collegial emails with an
> O’Reilly collaborator more than 10 years ago to try to make this historical
> background known to him.
>
> Every model has its weaknesses, which provide opportunities for further
> development. Such weaknesses are hard to accept, however, when they have
> already been overcome in prior published work, and are not noted in
> articles that espouse a later, weaker, model.
>
> In order to prevent this comment from becoming unduly long, I discuss only
> one aspect of the O&F work on error-driven reinforcement learning, to
> complete my comments about Leabra.
>
> In O&F (2006, p. 284), it was written that “to date, no model has
> attempted to address the more difficult question of how the BG [basal
> ganglia] ‘knows’ what information is task relevant (which was hard-wired in
> prior models). The present model learns this dynamic gating functionality
> in an adaptive manner via reinforcement learning mechanisms thought to
> depend on the dopaminergic system and associated areas”. This claim is not
> correct. For example, two earlier articles by Brown, Bullock, and Grossberg
> use dopaminergic gating, among other mechanisms, to show how the brain can
> learn what is task relevant (1999, *Journal of Neuroscience*, 19,
> 10502-10511 http://www.cns.bu.edu/Profiles/Grossberg/BroBulGro99.pdf;
> 2004, *Neural Networks*, 271-510
> http://www.cns.bu.edu/Profiles/Grossberg/BroBulGro2003NN.pdf). The former
> article simulates neurophysiological data about basal ganglia and related
> brain regions that temporal difference models and O&F (2006) could not
> explain. The latter article shows how the TELOS model can incrementally
> learn five different tasks that monkeys have been trained to learn. After
> learning, the model quantitatively simulates the recorded
> neurophysiological dynamics of 17 established cell types in frontal cortex,
> basal ganglia, and related brain regions, and predicts explicit functional
> roles for all of these cells in the learning and performance of these tasks.
>
> The O&F (2006) model did not achieve this level of understanding. Instead,
> the model simulated some relatively simple cognitive tasks and seems to
> show no quantitative fits to any data. The model also seemed to make what I
> consider unnecessary errors. For example, O&F (2006) wrote on p. 294 that
> “…When a conditioned stimulus is activated in advance of a primary reward,
> the PV system is actually trained to not expect reward at this time,
> because it is always trained by the current primary reward value.
> Therefore, we need an additional mechanism to account for the anticipatory
> DA bursting at CS onset, which in turn is critical for training up the BG
> gating system…This is the learned value (LV) system, which is trained only
> when primary rewards are either present or expected by the PV and is free
> to fire at other times without adapting its weights. Therefore, the LV is
> protected from having to learn that no primary reward is actually present
> at CS onset, because it is not trained at that time”. No such convoluted
> assumptions were needed in Brown et al (1999) to explain and quantitatively
> simulate a broad range of anatomical and neurophysiological conditioning
> data from monkeys that were recorded in ventral striatum, striosomes,
> pedunculo-pontine tegmental nucleus, and the lateral hypothalamus.
>
> I believe that a core problem in their model is a lack of understanding of
> an issue that is basic in these data; namely, how *adaptively timed*
> conditioning occurs. Their error-driven conditioning laws, based on
> delta-rule learning, are, to my mind, simply inadequate; see their
> equations (3.1)-(3.3) on p. 294 and their equations in Section A.5, pp.
> 318+. In contrast, the Bullock and Grossberg family of articles have traced
> adaptively timed error-driven learning to detailed dynamics of the
> metabotropic glutamate receptor system, as simulated in Brown et al (1999),
> and also used to quantitatively simulate adaptively timed cerebellar data.
> In this regard, O’Reilly and Frank (2006) mention in passing the cerebellum
> as a source of “timing signals” on p. 294, line 7. However, the timing that
> goes on in the cerebellum and the timing that goes on in the basal ganglia
> have different functional roles. A detailed modeling synthesis and
> simulations of biochemical, biophysical, neurophysiological, anatomical,
> and behavioral data about adaptively timed conditioning in the cerebellum
> is provided in the article by Fiala, Grossberg, and Bullock (*Journal of
> Neuroscience*, 1996, 16, 3760-3774,
> http://cns.bu.edu/~steve/FiaGroBul1996JouNeuroscience.pdf).
>
> There are many related problems. Not the least of them is the assumption
> “that time is discretized into steps that correspond to environmental
> events (e.g., the presentation of a CS or US)” (O&F, 2006, p. 318). One
> cannot understand adaptively timed learning, or working memory for that
> matter, if such an assumption is made. Such unrealistic technical
> assumptions often lead one to unrealistic conceptual assumptions. A
> framework that uses real-time dynamics is needed to deeply understand how
> these brain processes work.
>
> Best,
>
> Steve
>
>
>
>
> On Nov 5, 2014, at 3:14 AM, Randall O'Reilly wrote:
>
>
> Given Randy O’Reilly’s comments about Leabra, it is also of historical
> interest that I introduced the core equations used in Leabra in the 1960s
> and early 1970s, and they have proved to be of critical importance in all
> the developments of ART.
>
>
> For future reference, Leabra is based on the standard equivalent circuit
> equations for the neuron which I believe date at least to the time of
> Hodgkin and Huxley.  Specifically, we use the “AdEx” model of Gerstner and
> colleagues, and a rate-code equivalent thereof that we derived.  For
> learning we use a biologically-plausible version of backpropagation that I
> analyzed in 1996 and has no provenance in any of your prior work.  Our more
> recent version of this learning model shares a number of features in common
> with the BCM algorithm from 1982, while retaining the core error-driven
> learning component.  I just put all the equations in one place here in case
> you’re interested: https://grey.colorado.edu/emergent/index.php/Leabra
>
> None of this is to say that your pioneering work was not important in
> shaping the field — of course it was, but I hope you agree that it is also
> important to get one’s facts straight on these things.
>
> Best,
> - Randy
>
>
> Stephen Grossberg
> Wang Professor of Cognitive and Neural Systems
> Professor of Mathematics, Psychology, and Biomedical Engineering
> Director, Center for Adaptive Systems http://www.cns.bu.edu/about/cas.html
> http://cns.bu.edu/~steve
> steve at bu.edu
>
>
>
>
>