Connectionists: Scientific Integrity, the 2021 Turing Lecture, etc.

Schmidhuber Juergen juergen at idsia.ch
Mon Feb 28 03:03:35 EST 2022


Steve, surely you can agree with me that plagiarism cannot win here? As you said, let's "embrace all the facts" and see "the bigger conceptual picture," but always credit those who did things first. You mention the "common false narrative" (promulgated by certain self-aggrandizing psychologists and neuroscientists since the 1980s). Indeed, this narrative is simply incompatible with the historical facts, ignoring the very origins of deep learning in the mid-1960s (and of shallow learning in the 1800s). We have a duty as academics and scientists to ensure that the facts win. Jürgen

https://people.idsia.ch/~juergen/scientific-integrity-turing-award-deep-learning.html


> On 7 Feb 2022, at 16:26, Stephen José Hanson <jose at rubic.rutgers.edu> wrote:
> 
> Juergen,
> 
> Ignoring history by dismissing it as "fancy talk" will leave your exegesis of neural networks forever lagging. You need to embrace all the facts, not just the ones you like or are familiar with. Your whole endeavor is an attempt to destroy what you feel is the common false narrative on the origin of neural networks. I am happy to chat more about this sometime, but I still think your mathematical lens is preventing you from seeing the bigger conceptual picture.
> 
> Best,
> 
> Steve
> 
> On 2/6/22 3:44 AM, Schmidhuber Juergen wrote:
>> Steve, it's simple: the original "shallow learning" (~1800) is much older than your relatively recent "shallow learning" references (mostly from the 1900s). No need to mention all of them in this report, which is really about "deep learning" (see title) with adaptive hidden units, which started to work in the 1960s, first through layer-by-layer training (USSR, 1965), then through stochastic gradient descent (SGD) in relatively deep nets (Japan, 1967). The reverse mode of automatic differentiation (now called backpropagation) appeared 3 years later (Finland, 1970). No fancy talk about syntax vs semantics can justify a revisionist history of deep learning that does not mention these achievements. Cheers, Jürgen
>> 
>> https://people.idsia.ch/~juergen/scientific-integrity-turing-award-deep-learning.html
>> 
>> 
>> 
>> 
>>> On 2 Feb 2022, at 00:48, Stephen José Hanson <jose at rubic.rutgers.edu> wrote:
>>> 
>>> Juergen: Even some of us lowly psychologists know some math. And it's not about the math... it's about the context (is this sounding like an echo?)
>>> 
>>> Let me try again... I think your well-intentioned but misguided reconstruction of history is beginning to look perverse to me.
>>> 
>>> You tip your hand when you talk about "rebranding", and also when you suggest that the PDP books were a "conspiracy". But let's go point by point.
>>> 
>>> (1) We already agreed that the Perceptron was not linear regression -- let's not go backwards. It is closer to logistic regression. If you are talking about Widrow and Hoff, well, that is the Delta rule -- an SSE kind of regression. But where did the Delta rule come from? Let's look at the math. There are some nice papers by Gluck and Thompson (1980s) showing that Pavlovian conditioning follows exactly the Delta rule; even more relevant, the model shown to account for the majority of classical (Pavlovian) conditioning phenomena was the Rescorla-Wagner (1972) model, \Delta V_A = [\alpha_A\beta_1](\lambda_1 - V_{AX}), which of course rests on Ivan Petrovich Pavlov's discovery of classical conditioning (1880s). Why aren't you citing him? What about John Broadus Watson and Burrhus Frederic Skinner? At least they were focused on learning, albeit *just* in biological systems. But these were actual natural-world discoveries. (A short note after point (5) spells out the formal identity between the two update rules.)
>>> 
>>> (2) Function approximation. OK, Juergen, claims that everything is really just X remind me of the man with a hammer, to whom everything looks like a nail! To the point: it's incidental. Yes, neural networks are function approximators, but that is incidental to the original, more general context (PDP) as a way to create "internal representations". The function approximation was a bonus!
>>> 
>>> (3) Branding. OMG. So you seem to believe that everyone is cynical and will put their intellectual finger in the air to find out what to call what they are doing! Jeez, I hope this isn't true. But the narrative you decry is in fact something that Minsky would talk about (I remember this at lunch with him in the 90s at Thinking Machines), and he was quite clear that the perceptron was failing well before the 1969 book (trying to do speech recognition with a perceptron -- yikes). Perceptrons killed the perceptron in a piling-on kind of way, but the real problem was the focus on linearity (as BAP points out) and the lack of depth.
>>> 
>>> (4) Group Method of Data Handling. Frankly, the only one I can find who branded GMDH as a "NeuroNet" (as they call it) is you. There is a 2017 reference, but it cites you again.
>>> 
>>> (5) It's just names, fashion, and preference... as if no actual concepts matter. Really?
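>>> 
>>> To make the formal identity claimed in point (1) concrete (my notation, not Gluck & Thompson's or Rescorla & Wagner's): for a single linear output y = \sum_i w_i x_i with target t and learning rate \eta, the Delta rule is \Delta w_i = \eta (t - y) x_i. With binary cue indicators x_i in {0,1}, the weights read as associative strengths V_i, the target t as the asymptote \lambda, and \eta as \alpha\beta, so that for each cue A present on a trial \Delta V_A = \alpha_A \beta_1 (\lambda_1 - V_{AX}), where V_{AX} = \sum_{i present} V_i. That is exactly the Rescorla-Wagner update.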
>>> 
>>> There was a French mathematician named Fourier in the 19th century who came up with the idea of decomposing periodic functions into weighted trigonometric functions... but he had no math. And Laplace, Legendre, and others said he had no math! So they prevented him from publishing for FIFTEEN YEARS. 150 years later, after Tukey invented the FFT, it is the most common transform used (and misused) in general.
>>> 
>>> Concepts lead to math, and that may lead to further formalism, but don't mistake the math for the concept behind it. The context matters, and you are confusing syntax with semantics!
>>> 
>>> Cheers,
>>> Steve
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 1/31/22 11:38 AM, Schmidhuber Juergen wrote:
>>> 
>>>> Steve, do you really want to erase the very origins of shallow learning (Gauss & Legendre ~1800) and deep learning (DL, Ivakhnenko & Lapa 1965) from the field's history? Why? Because they did not use modern terminology such as "artificial neural nets (NNs)" and "learning internal representations"? Names change all the time like fashions; the only thing that counts is the math. Not only mathematicians but also psychologists like yourself will agree. 
>>>> 
>>>> Again: the linear regressor of Legendre & Gauss is formally identical to what was much later called a linear NN for function approximation (FA), minimizing mean squared error, still widely used today. No history of "shallow learning" (without adaptive hidden layers) is complete without this original shallow learner of 2 centuries ago. Many NN courses actually introduce simple NNs in this mathematically and historically correct way, then proceed to DL NNs with several adaptive hidden layers.
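>>>> 
>>>> In symbols (my notation, not a quote from any reference): both choose weights w to minimize the sum of squared errors \sum_t (y_t - w^T x_t)^2 over training pairs (x_t, y_t). Whether one solves this in closed form via the normal equations, as Legendre and Gauss did, or by iterative error-correcting updates, the model class (a single linear layer) and the objective (mean squared error) are the same.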
>>>> 
>>>> And of course, no DL history is complete without the origins of functional DL in 1965 [DEEP1-2]. Back then, Ivakhnenko and Lapa published the first general, working DL algorithm for supervised deep feedforward multilayer perceptrons (MLPs) with arbitrarily many layers of neuron-like elements, using nonlinear activation functions (actually Kolmogorov-Gabor polynomials) that combine both additions (like in linear NNs) and multiplications (basically they had deep NNs with gates, including higher-order gates). They incrementally trained and pruned their DL networks layer by layer to learn internal representations, using regression and a separate validation set (network depth > 7 by 1971). They had standard justifications of DL such as: "a multilayered structure is a computationally feasible way to implement multinomials of very high degree" [DEEP2] (that cannot be approximated by simple linear NNs). Of course, their DL was automated, and many people have used it up to the 2000s - just follow the numerous citations.
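>>>> 
>>>> For readers who prefer code to prose, here is a minimal sketch of the layer-by-layer idea in Python (my own simplification with invented function names, not Ivakhnenko & Lapa's exact procedure): each layer fits quadratic polynomials of pairs of the previous layer's outputs by least squares, and only the candidate units that do best on a held-out validation set survive to feed the next layer.
>>>> 
>>>> import numpy as np
>>>> from itertools import combinations
>>>> 
>>>> def _design(F, i, j):
>>>>     # quadratic (Kolmogorov-Gabor style) features of one pair of inputs
>>>>     a, b = F[:, i], F[:, j]
>>>>     return np.column_stack([np.ones_like(a), a, b, a * b, a**2, b**2])
>>>> 
>>>> def gmdh_like_fit(X, y, X_val, y_val, n_layers=3, keep=4):
>>>>     """Sketch: grow a polynomial network layer by layer, keeping only the
>>>>     units that generalize best on a separate validation set (illustrative only)."""
>>>>     F, F_val = X, X_val  # assumes X and X_val have at least `keep` columns
>>>>     for _ in range(n_layers):
>>>>         scored = []
>>>>         for i, j in combinations(range(F.shape[1]), 2):
>>>>             w, *_ = np.linalg.lstsq(_design(F, i, j), y, rcond=None)    # regression fit
>>>>             val_mse = np.mean((_design(F_val, i, j) @ w - y_val) ** 2)  # selection criterion
>>>>             scored.append((val_mse, i, j, w))
>>>>         scored.sort(key=lambda s: s[0])
>>>>         best = scored[:keep]  # prune: only the best units feed the next layer
>>>>         F = np.column_stack([_design(F, i, j) @ w for _, i, j, w in best])
>>>>         F_val = np.column_stack([_design(F_val, i, j) @ w for _, i, j, w in best])
>>>>     return F[:, 0], F_val[:, 0]  # output of the best unit in the final layer
>>>> 
>>>> Each surviving unit is a polynomial of polynomials, so the composed network implements multinomials of very high degree, which is exactly the justification quoted above.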
>>>> 
>>>> I don't get your comments about Ivakhnenko's DL and function approximation (FA). FA is for all kinds of functions, including your "cognitive or perceptual or motor functions." NNs are used as FAs all the time. Like other NNs, Ivakhnenko's nets can be used as FAs for your motor control problems. You boldly claim: "This was not in the intellectual space" of Ivakhnenko's method. But obviously it was. 
>>>> 
>>>> Interestingly, 2 years later, Amari (1967-68) [GD1-2] trained his deep MLPs through a different DL method, namely, stochastic gradient descent (1951-52)[STO51-52]. His paper also did not contain the "modern" expression "learning internal representations in NNs." But that's what it was about. Math and algorithms are immune to rebranding. 
>>>> 
>>>> You may not like the fact that neither the original shallow learning (Gauss & Legendre ~1800) nor the original working DL (Ivakhnenko & Lapa 1965; Amari 1967) were biologically inspired. They were motivated through math and problem solving. The NN rebranding came later. Proper scientific credit assignment does not care for changes in terminology.  
>>>> 
>>>> BTW, unfortunately, Minsky & Papert [M69] made some people think that Rosenblatt [R58-62] had only linear NNs plus threshold functions. But actually he had much more interesting MLPs with a non-learning randomized first layer and an adaptive output layer. So Rosenblatt basically had what much later was rebranded as "Extreme Learning Machines (ELMs)." The revisionist narrative of ELMs (see https://elmorigin.wixsite.com/originofelm) is a bit like the revisionist narrative of DL criticized by my report. Some ELM guys apparently thought they could get away with blatant improper credit assignment. After all, the criticized DL guys seemed to get away with it on an even grander scale. They called themselves the "DL conspiracy" [DLC]; the "ELM conspiracy" is similar. What an embarrassing lack of maturity in our field.
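>>>> 
>>>> In code, the architecture in question is tiny (a generic Python sketch with names of my choosing, not Rosenblatt's or the ELM authors' exact formulation): a fixed random hidden layer followed by an output layer fit by least squares.
>>>> 
>>>> import numpy as np
>>>> 
>>>> def random_layer_readout(X, y, X_test, n_hidden=200, seed=0):
>>>>     """MLP with a non-learning randomized first layer and an adaptive
>>>>     (least-squares) output layer -- an illustrative sketch only."""
>>>>     rng = np.random.default_rng(seed)
>>>>     W = rng.normal(size=(X.shape[1], n_hidden))    # fixed random input weights
>>>>     b = rng.normal(size=n_hidden)                  # fixed random biases
>>>>     H = np.tanh(X @ W + b)                         # hidden layer, never trained
>>>>     beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # only the output layer adapts
>>>>     return np.tanh(X_test @ W + b) @ beta          # predictions for new inputs
>>>> 
>>>> Whatever one calls it, only the final linear map is learned; the hidden representation is random and fixed.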
>>>> 
>>>> Fortunately, more and more ML researchers are helping to set things straight. "In science, by definition, the facts will always win in the end. As long as the facts have not yet won it's not yet the end." [T21v1] 
>>>> 
>>>> References as always under 
>>>> https://people.idsia.ch/~juergen/scientific-integrity-turing-award-deep-learning.html
>>>> 
>>>> 
>>>> Jürgen
>>>> 
>>>> 
>>>> 
>>>>> On 27 Jan 2022, at 17:37, Stephen José Hanson <jose at rubic.rutgers.edu> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> Juergen, I have read through the GMDH paper and a 1971 review paper by Ivakhnenko. These are papers about function approximation. The method proposes to use series of polynomial functions that are stacked in filtered sets. The filtered sets are chosen based on best fit and, from what I can tell, are manually grown, so this must have been a tedious and slow process (I assume it could be automated). So, are the GMDH nets "deep"? They are stacked 4 deep in Figure 1 (8 deep in another). Interestingly, they are using (with obvious FA justification) polynomials of various degrees. Has this much to do with neural networks? Yes, there were examples initiated by Rumelhart (and me: https://www.routledge.com/Backpropagation-Theory-Architectures-and-Applications/Chauvin-Rumelhart/p/book/9780805812596), based on poly-synaptic dendrite complexity, but not in the GMDH paper, which was specifically about function approximation. Ivakhnenko lists four reasons for the approach they took: mainly reducing data size and being more efficient with the data one had. No mention of "internal representations".
>>>>> 
>>>>> So when Terry talks about "internal representations" -- does he mean function approximation? Not so much. That is of course part of it, but the actual focus is on cognitive or perceptual or motor functions: representation in the brain. Hidden units (which could be polynomials) cluster and project and model the input features with respect to the function constraints conditioned by the training data. This is more similar to model specification through function-space search. And the original Rumelhart meaning of internal representation in PDP vol. 1 was, in one case, the representation of certain binary functions (XOR), but more generally it was about the need for "neurons" (inter-neurons) explicitly between input (sensory) and output (motor). Consider NETtalk, in which I did the first hierarchical clustering of the hidden units over the input features (letters). What appeared probably wasn't surprising, but without model specification the network (with hidden units) learned VOWEL and CONSONANT distinctions just from training (Hanson & Burr, 1990). This would be a clear example of "internal representations" in the sense of Rumelhart. This was not in the intellectual space of Ivakhnenko's Group Method of Data Handling. (Some of this is discussed in more detail in recent conversations with Terry Sejnowski, and in another to appear shortly with Geoff Hinton; see AIHUB.org, look in Opinions.)
>>>>> 
>>>>> Now I suppose one could be cynical and opportunistic, and even conclude that if you wanted to get more clicks, then rather than title your article GROUP METHOD OF DATA HANDLING you should at least consider NEURAL NETWORKS FOR DATA HANDLING, even if you didn't think neural networks had anything to do with your algorithm -- after all, everyone else is doing it! Might get it published in this time frame, or even read. This is not scholarship. These publication threads are related but not dependent. And although they diverge, they could be informative if one were to try to develop polynomial inductive growth networks (see Fahlman, 1989: Cascade Correlation; Hanson, 1990: Meiosis networks) or to apply them to motor control in the brain. But that's not what happened. I think that, as with the Gauss claim, you need to drop this specific claim as well.
>>>>> 
>>>>> With best regards,
>>>>> 
>>>>> Steve
>>>>> 
>>>> On 25 Jan 2022, at 20:03, Schmidhuber Juergen <juergen at idsia.ch> wrote:
>>>> 
>>>> PS: Terry, you also wrote: "Our precious time is better spent moving the field forward." However, it seems that in recent years much of your own precious time has gone to promulgating a revisionist history of deep learning (and writing the corresponding "amicus curiae" letters to award committees). For a recent example, your 2020 deep learning survey in PNAS [S20] claims that your 1985 Boltzmann machine [BM] was the first NN to learn internal representations. This paper [BM] neither cited the internal representations learnt by Ivakhnenko & Lapa's deep nets in 1965 [DEEP1-2] nor those learnt by Amari's stochastic gradient descent for MLPs in 1967-1968 [GD1-2]. Nor did your recent survey [S20] attempt to correct this, as good science should strive to do. On the other hand, it seems you celebrated your co-author's birthday in a special session while you were head of NeurIPS, instead of correcting these inaccuracies and celebrating the true pioneers of deep learning, such as Ivakhnenko and Amari. Even your recent interview https://blog.paperspace.com/terry-sejnowski-boltzmann-machines/ claims: "Our goal was to try to take a network with multiple layers - an input layer, an output layer and layers in between - and make it learn. It was generally thought, because of early work that was done in AI in the 60s, that no one would ever find such a learning algorithm because it was just too mathematically difficult." You wrote this although you knew exactly that such learning algorithms were first created in the 1960s, and that they worked. You are a well-known scientist, head of NeurIPS, and chief editor of a major journal. You must correct this. We must all be better than this as scientists. We owe it to past, present, and future scientists, as well as to those we ultimately serve.
>>>> 
>>>> The last paragraph of my report https://people.idsia.ch/~juergen/scientific-integrity-turing-award-deep-learning.html quotes Elvis Presley: "Truth is like the sun. You can shut it out for a time, but it ain't goin' away." I wonder how the future will reflect on the choices we make now.
>>>> 
>>>> Jürgen
>>>> 
>>>> 
>>>> 
>>>>> On 3 Jan 2022, at 11:38, Schmidhuber Juergen <juergen at idsia.ch> wrote:
>>>>> 
>>>>> Terry, please don't throw smoke candles like that!  
>>>>> 
>>>>> This is not about basic math such as calculus (actually first published by Leibniz; later Newton was also credited for his unpublished work; Archimedes already had special cases thereof over 2000 years ago; the Indian Kerala school made essential contributions around 1400). In fact, my report addresses such smoke candles in Sec. XII: "Some claim that 'backpropagation' is just the chain rule of Leibniz (1676) & L'Hopital (1696). No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (there are also many inefficient ways of doing this). It was not published until 1970 [BP1]."
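>>>>> 
>>>>> To spell out "the efficient way" (my notation, not a quote from [BP1]): for a composition f = f_L \circ \dots \circ f_1 with layer Jacobians J_k, the chain rule gives J = J_L J_{L-1} \cdots J_1, and what matters is the order of multiplication. For a scalar loss with gradient row vector g at the output, reverse accumulation computes ((g J_L) J_{L-1}) \cdots J_1 using only vector-matrix products, so its cost is of the same order as one forward pass; multiplying the full Jacobian matrices in other orders, or running forward accumulation once per input variable, is vastly more expensive for big networks.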
>>>>> 
>>>>> You write: "All these threads will be sorted out by historians one hundred years from now." To answer that, let me just cut and paste the last sentence of my conclusions: "However, today's scientists won't have to wait for AI historians to establish proper credit assignment. It is easy enough to do the right thing right now."
>>>>> 
>>>>> You write: "let us be good role models and mentors" to the new generation. Then please do what's right! Your recent survey [S20] does not help. It's mentioned in my report as follows: "ACM seems to be influenced by a misleading 'history of deep learning' propagated by LBH & co-authors, e.g., Sejnowski [S20] (see Sec. XIII). It goes more or less like this: 'In 1969, Minsky & Papert [M69] showed that shallow NNs without hidden layers are very limited and the field was abandoned until a new generation of neural network researchers took a fresh look at the problem in the 1980s [S20].' However, as mentioned above, the 1969 book [M69] addressed a 'problem' of Gauss & Legendre's shallow learning (~1800)[DL1-2] that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method [DEEP1-2][DL2] (and then also by Amari's SGD for MLPs [GD1-2]). Minsky was apparently unaware of this and failed to correct it later [HIN](Sec. I).... deep learning research was a!
>>>>> 
>>>>  live and kicking also in the 1970s, especially outside of the Anglosphere."
>>>> 
>>>>> Just follow ACM's Code of Ethics and Professional Conduct [ACM18] which states: "Computing professionals should therefore credit the creators of ideas, inventions, work, and artifacts, and respect copyrights, patents, trade secrets, license agreements, and other methods of protecting authors' works." No need to wait for 100 years. 
>>>>> 
>>>>> Jürgen
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 2 Jan 2022, at 23:29, Terry Sejnowski <terry at snl.salk.edu> wrote:
>>>>>> 
>>>>>> We would be remiss not to acknowledge that backprop would not be possible without the calculus,
>>>>>> so Isaac Newton should also have been given credit, at least as much credit as Gauss.
>>>>>> 
>>>>>> All these threads will be sorted out by historians one hundred years from now.
>>>>>> Our precious time is better spent moving the field forward.  There is much more to discover.
>>>>>> 
>>>>>> A new generation with better computational and mathematical tools than we had back
>>>>>> in the last century has joined us, so let us be good role models and mentors to them.
>>>>>> 
>>>>>> Terry
>>>>>> 
>>> -- 
>>> <signature.png>
>> 
> -- 
> <signature.png>



