Connectionists: Galileo and the priest

Schmidhuber Juergen juergen at idsia.ch
Fri Mar 17 05:27:31 EDT 2023


Dear Risto and Claudius,

I like your discussion on variable binding / attention / soft links. To an extent, this already worked in the early 1990s, although compute was a million times more expensive than today! 

Three decades ago we published what's now called a "Transformer with linearized self-attention" (apart from normalization): "Learning to control fast-weight memories: an alternative to recurrent networks," Neural Computation, 1992, based on TR FKI-147-91, TUM, 1991. Here is a well-known tweet on this: https://twitter.com/SchmidhuberAI/status/1576966129993797632?cxt=HHwWgMDSkeKVweIrAAAA

One of the experiments in Sec. 3.2 was really about what you mentioned: learning to bind "fillers" to "slots" or "keys" to "values" through "soft links." I called that "learning internal spotlights of attention" in a follow-up paper at ICANN 1993.

How does this work? A slow net learns by gradient descent to invent useful, context-dependent pairs of "keys" and "values" (called FROM and TO) whose outer products define the "attention mapping" of a fast net; these "soft links" or "fast weights" are then applied to queries. (The 2017 Transformer combines this with a softmax and a projection operator.)
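To make the mechanism concrete, here is a minimal NumPy sketch of that outer-product idea. It is only an illustration under my own simplifications: the variable names, shapes, and the absence of any training loop are mine, not the exact 1991/92 (or 2017) setup.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8                                # dimensionality of key/value/query patterns

    # Stand-ins for the slow net's projections; in the real system these are
    # trained by gradient descent to emit useful, context-dependent patterns.
    W_k, W_v, W_q = (0.1 * rng.standard_normal((d, d)) for _ in range(3))

    W_fast = np.zeros((d, d))            # fast weights = "soft links", initially empty

    def step(x, W_fast):
        k = W_k @ x                      # key   (FROM pattern)
        v = W_v @ x                      # value (TO pattern)
        q = W_q @ x                      # query
        W_fast = W_fast + np.outer(v, k) # outer product writes a soft key->value link
        y = W_fast @ q                   # readout: linearized (unnormalized) attention
        return y, W_fast

    for t in range(5):                   # toy sequence of random inputs
        y, W_fast = step(rng.standard_normal(d), W_fast)

The additive accumulation of outer products followed by a matrix-vector readout is the "linearized" form; the 2017 Transformer instead recomputes softmax-normalized weights over all stored keys, which is the normalization difference mentioned above.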

The 1991 work separated memory and control as in traditional computers, but in an end-to-end differentiable fashion. I am happy to see that the basic principles have become popular again. Here is an overview in Sec. 13 of the Annotated History of Modern AI and Deep Learning (2022): https://people.idsia.ch/~juergen/deep-learning-history.html#transformer . Longer blog post: https://people.idsia.ch/~juergen/fast-weight-programmer-1991-transformer.html

There is also an ICML 2021 publication on this, with Imanol Schlag and Kazuki Irie: Linear Transformers Are Secretly Fast Weight Programmers. Preprint https://arxiv.org/abs/2102.11174 

Juergen




> On 14 Mar 2023, at 7:49 AM, Risto Miikkulainen <risto at cs.utexas.edu> wrote:
> 
> Back in the 1980s and 1990s we were trying to get neural networks to perform variable binding, and also what Dave Touretzky called "dynamic inferencing", i.e., bringing together two pieces of information that the network knew how to process separately but had never seen together before (like different kinds of grammatical structures). It was very difficult and did not work well. But it seems it now works in GPT: it can, for instance, write a scientific explanation in the style of Shakespeare. The attention mechanism allows it to learn relationships, and the scale-up allows it to form abstractions, and then relationships between abstractions. This effect emerges only at very large scales, scales that are starting to approach the scale of the brain. Perhaps the scale allows it to capture a fundamental processing principle of the brain that we have not been able to identify or model before? It would be interesting to try to characterize it in these terms.
> 
> — Risto
> 
>> On Mar 13, 2023, at 3:38 AM, Claudius Gros <gros at itp.uni-frankfurt.de> wrote:
>> 
>> -- attention as thought processes? --
>> 
>> The discussion here on the list shows that
>> ChatGPT produces intriguing results. I guess
>> everybody agrees. What it means remains open.
>> 
>> Let me throw in a hypothesis. 
>> 
>> With the introduction of the attention framework,
>> deep-learning architectures acquired a kind of
>> 'soft link' by computing weighted superpositions
>> of other states of the network. Possibly, this may
>> be similar to what happens in the brain when we 'think',
>> namely combining states of distinct brain regions
>> into a single processing stream.
>> 
>> If that were true (which remains to be seen), it would
>> imply that the processes performed by transformer
>> architectures bear a certain resemblance to actual
>> thinking.
>> 
>> Any thoughts (by human brains) on this hypothesis?
>> 
>> Claudius
>> 
>> ==============================================================
>> 
>> 
>> On Friday, March 10, 2023 20:29 CET, Geoffrey Hinton <geoffrey.hinton at gmail.com> wrote: 
>> 
>>> In Bertolt Brecht's play about Galileo there is a scene where Galileo asks
>>> a priest to look through a telescope to see the moons of Jupiter. The
>>> priest says there is no point looking because it would be impossible for
>>> things to go round Jupiter (this is from my memory of seeing the play about
>>> 50 years ago).
>>> 
>>> I suspect that Chomsky thinks of himself as more like Galileo than the
>>> priest. But in his recent NYT opinion piece, it appears that the authors
>>> did not actually check what chatGPT would say in answer to their questions
>>> about falling apples or people too stubborn to talk to. Maybe they have
>>> such confidence that chatGPT could not possibly be understanding that there
>>> is no point looking at the data.
>> 
>> 
>> -- 
>> ### 
>> ### Prof. Dr. Claudius Gros
>> ### http://itp.uni-frankfurt.de/~gros
>> ### 
>> ### Complex and Adaptive Dynamical Systems, A Primer   
>> ### A graduate-level textbook, Springer (2008/10/13/15)
>> ### 
>> ### Life for barren exoplanets: The Genesis project
>> ### https://link.springer.com/article/10.1007/s10509-016-2911-0
>> ###
>> 
> 
> 



