Connectionists: Scientific Integrity, the 2021 Turing Lecture, etc.
    Schmidhuber Juergen 
    juergen at idsia.ch
       
    Mon Jan 31 11:38:12 EST 2022
    
    
  
Steve, do you really want to erase the very origins of shallow learning (Gauss & Legendre ~1800) and deep learning (DL, Ivakhnenko & Lapa 1965) from the field's history? Why? Because they did not use modern terminology such as "artificial neural nets (NNs)" and "learning internal representations"? Names change all the time like fashions; the only thing that counts is the math. Not only mathematicians but also psychologists like yourself will agree. 
Again: the linear regressor of Legendre & Gauss is formally identical to what was much later called a linear NN for function approximation (FA), minimizing mean squared error, still widely used today. No history of "shallow learning" (without adaptive hidden layers) is complete without this original shallow learner of 2 centuries ago. Many NN courses actually introduce simple NNs in this mathematically and historically correct way, then proceed to DL NNs with several adaptive hidden layers.
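To make the formal identity concrete, here is a minimal sketch (NumPy is assumed; the synthetic data, learning rate and iteration count are illustrative choices, not taken from any of the cited papers). It fits the same data once with closed-form least squares and once as a single-layer linear NN trained by gradient descent on mean squared error. Both end up with the same weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative assumption): 3 inputs, linear target plus small noise.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 + 0.01 * rng.normal(size=200)

# Append a constant column so the bias is just another weight.
Xb = np.hstack([X, np.ones((200, 1))])

# (a) Least squares in closed form, as in Legendre & Gauss.
w_ls, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# (b) "Linear NN": one layer, no hidden units, gradient descent on MSE.
w_nn = np.zeros(4)
learning_rate = 0.1
for _ in range(2000):
    grad = 2.0 / len(y) * Xb.T @ (Xb @ w_nn - y)  # gradient of the mean squared error
    w_nn -= learning_rate * grad

print(np.max(np.abs(w_ls - w_nn)))  # difference between the two weight vectors
```

The only difference between (a) and (b) is the optimizer; the model and the error function are the same.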
And of course, no DL history is complete without the origins of functional DL in 1965 [DEEP1-2]. Back then, Ivakhnenko and Lapa published the first general, working DL algorithm for supervised deep feedforward multilayer perceptrons (MLPs) with arbitrarily many layers of neuron-like elements, using nonlinear activation functions (actually Kolmogorov-Gabor polynomials) that combine both additions (like in linear NNs) and multiplications (basically they had deep NNs with gates, including higher order gates). They incrementally trained and pruned their DL networks layer by layer to learn internal representations, using regression and a separate validation set (network depth > 7 by 1971). They had standard justifications of DL such as: "a multilayered structure is a computationally feasible way to implement multinomials of very high degree" [DEEP2] (that cannot be approximated by simple linear NNs). Of course, their DL was automated, and many people have used it up to the 2000s - just follow the numerous citations.
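The layer-by-layer procedure described above can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the exact algorithm of [DEEP1-2]: the function names (`quad_terms`, `grow_layer`, `gmdh_sketch`), the two-input quadratic units, the `keep` parameter and the toy data are all hypothetical choices. Each candidate unit is a small polynomial fitted by regression on the training set; units are ranked on a separate validation set, the best ones are kept, and their outputs feed the next layer. Growth stops when the validation error no longer improves.

```python
import itertools
import numpy as np

def quad_terms(a, b):
    # Terms of a two-input quadratic polynomial: additions and multiplications.
    return np.column_stack([np.ones_like(a), a, b, a * b, a * a, b * b])

def grow_layer(Z_tr, y_tr, Z_va, y_va, keep):
    units = []
    for i, j in itertools.combinations(range(Z_tr.shape[1]), 2):
        P_tr = quad_terms(Z_tr[:, i], Z_tr[:, j])
        P_va = quad_terms(Z_va[:, i], Z_va[:, j])
        coef, *_ = np.linalg.lstsq(P_tr, y_tr, rcond=None)  # fit by regression
        val_err = float(np.mean((P_va @ coef - y_va) ** 2))  # score on validation set
        units.append((val_err, P_tr @ coef, P_va @ coef))
    units.sort(key=lambda u: u[0])
    kept = units[:keep]  # prune: keep only the best units
    new_tr = np.column_stack([u[1] for u in kept])
    new_va = np.column_stack([u[2] for u in kept])
    return kept[0][0], new_tr, new_va

def gmdh_sketch(X_tr, y_tr, X_va, y_va, keep=4, max_layers=5):
    Z_tr, Z_va, best_err, depth = X_tr, X_va, np.inf, 0
    for _ in range(max_layers):
        err, Z_tr_new, Z_va_new = grow_layer(Z_tr, y_tr, Z_va, y_va, keep)
        if err >= best_err:  # stop when validation error no longer improves
            break
        best_err, Z_tr, Z_va, depth = err, Z_tr_new, Z_va_new, depth + 1
    return best_err, depth

# Toy data (illustrative assumption): a target no single two-input unit can fit exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = X[:, 0] * X[:, 1] + X[:, 2] ** 2 + 0.5 * X[:, 3]
val_err, depth = gmdh_sketch(X[:200], y[:200], X[200:], y[200:])
print(depth, val_err)
```

Stacking such units raises the polynomial degree with each layer, which is the point of the quoted justification about multinomials of very high degree.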
I don't get your comments about Ivakhnenko's DL and function approximation (FA). FA is for all kinds of functions, including your "cognitive or perceptual or motor functions." NNs are used as FAs all the time. Like other NNs, Ivakhnenko's nets can be used as FAs for your motor control problems. You boldly claim: "This was not in the intellectual space" of Ivakhnenko's method. But obviously it was. 
Interestingly, 2 years later, Amari (1967-68) [GD1-2] trained his deep MLPs through a different DL method, namely, stochastic gradient descent (1951-52)[STO51-52]. His paper also did not contain the "modern" expression "learning internal representations in NNs." But that's what it was about. Math and algorithms are immune to rebranding. 
You may not like the fact that neither the original shallow learning (Gauss & Legendre ~1800) nor the original working DL (Ivakhnenko & Lapa 1965; Amari 1967) were biologically inspired. They were motivated through math and problem solving. The NN rebranding came later. Proper scientific credit assignment does not care for changes in terminology.  
BTW, unfortunately, Minsky & Papert [M69] made some people think that Rosenblatt [R58-62] had only linear NNs plus threshold functions. But actually he had much more interesting MLPs with a non-learning randomized first layer and an adaptive output layer. So Rosenblatt basically had what much later was rebranded as "Extreme Learning Machines (ELMs)." The revisionist narrative of ELMs (see this web site https://elmorigin.wixsite.com/originofelm) is a bit like the revisionist narrative of DL criticized by my report. Some ELM guys apparently thought they could get away with blatant improper credit assignment. After all, the criticized DL guys seemed to get away with it on an even grander scale. They called themselves the "DL conspiracy" [DLC]; the "ELM conspiracy" is similar. What an embarrassing lack of maturity of our field. 
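For readers unfamiliar with this architecture, here is a minimal sketch of a net with a fixed random first layer and an adaptive output layer. The details are my illustrative assumptions, not Rosenblatt's exact setup: tanh hidden units instead of threshold units, 20 hidden units, and a least-squares fit of the output weights instead of a perceptron-style learning rule. On XOR, the purely linear model cannot fit the targets, while the net with the untrained random hidden layer can.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: the classic problem a net without hidden units cannot solve.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# First layer: random weights, never trained (assumption: 20 tanh units).
n_hidden = 20
W = rng.normal(size=(2, n_hidden))
b = rng.normal(size=n_hidden)
H = np.tanh(X @ W + b)

# Output layer: the only adaptive part, fitted here by least squares.
H1 = np.hstack([H, np.ones((4, 1))])
w_out, *_ = np.linalg.lstsq(H1, y, rcond=None)
pred = H1 @ w_out

# Baseline: the same fit directly on the inputs (no hidden layer).
X1 = np.hstack([X, np.ones((4, 1))])
w_lin, *_ = np.linalg.lstsq(X1, y, rcond=None)
pred_lin = X1 @ w_lin

print(np.round(pred, 3))      # random hidden layer + trained output layer
print(np.round(pred_lin, 3))  # linear model on raw inputs
```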
Fortunately, more and more ML researchers are helping to set things straight. "In science, by definition, the facts will always win in the end. As long as the facts have not yet won it's not yet the end." [T21v1] 
References as always under https://people.idsia.ch/~juergen/scientific-integrity-turing-award-deep-learning.html
Jürgen
> On 27 Jan 2022, at 17:37, Stephen José Hanson <jose at rubic.rutgers.edu> wrote:
> 
> 
> 
> Juergen, I have read through the GMDH paper and a 1971 review paper by Ivakhnenko.    These are papers about function approximation.  The method proposes to use a series of polynomial functions that are stacked in filtered sets.   The filtered sets are chosen based on best fit and, from what I can tell, are manually grown, so this must have been a tedious and slow process (I assume it could be automated).     So are the GMDHs "deep", in that they are stacked 4 deep in figure 1 (8 deep in another).     Interestingly, they are using (with obvious FA justification) polynomials of various degree.   Has this much to do with neural networks?  Yes, there were examples initiated by Rumelhart (and me: https://www.routledge.com/Backpropagation-Theory-Architectures-and-Applications/Chauvin-Rumelhart/p/book/9780805812596), based on poly-synaptic dendrite complexity, but not in the GMDH paper, which was specifically about function approximation.  Ivakhnenko lists four reasons for the approach they took: mainly reducing data size and being more efficient with the data that one had.   No mention of "internal representations".
> 
> So when Terry talks about "internal representations", does he mean function approximation?  Not so much.  That of course is part of this, but the actual focus is on cognitive or perceptual or motor functions. Representation in the brain.   Hidden units (which could be polynomials) cluster and project and model the input features with respect to the function constraints conditioned by training data.   This is more similar to model specification through function space search.  And the original Rumelhart meaning of internal representation in PDP vol 1 was in the case of representing certain binary functions (XOR), but more generally about the need for "neurons" (inter-neurons) explicitly between input (sensory) and output (motor).     Consider NETTALK, in which I did the first hierarchical clustering of the hidden units over the input features (letters).  What appeared probably wasn't surprising, but without model specification, the network (with hidden units) learned VOWEL and CONSONANT distinctions just from training (Hanson & Burr, 1990).   This would be a clear example of "internal representations" in the sense of Rumelhart.     This was not in the intellectual space of Ivakhnenko's Group Method of Handling Data.  (Some of this is discussed in more detail in some recent conversations with Terry Sejnowski and another one to appear shortly with Geoff Hinton; see AIHUB.org, look in Opinions.)
> 
> Now I suppose one could be cynical and opportunistic, and even conclude that if you wanted to get more clicks, rather than title your article GROUP METHOD OF HANDLING DATA, you should at least consider:  NEURAL NETWORKS FOR HANDLING DATA, even if you didn't think neural networks had anything to do with your algorithm, after all everyone else is!  Might get it published in this time frame, or even read.     This is not scholarship.  These publication threads are related but not dependent.  And although they diverge, they could be informative if one were to try and develop polynomial inductive growth networks (see Fahlman, 1989: Cascade correlation; and Hanson, 1990: Meiosis nets) for motor control in the brain.     But that's not what happened.    I think, like Gauss, you need to drop this specific claim as well.
> 
> With best regards,
> 
> Steve
On 25 Jan 2022, at 20:03, Schmidhuber Juergen <juergen at idsia.ch> wrote:
PS: Terry, you also wrote: "Our precious time is better spent moving the field forward." However, it seems like in recent years much of your own precious time has gone to promulgating a revisionist history of deep learning (and writing the corresponding "amicus curiae" letters to award committees). For a recent example, your 2020 deep learning survey in PNAS [S20] claims that your 1985 Boltzmann machine [BM] was the first NN to learn internal representations. This paper [BM] neither cited the internal representations learnt by Ivakhnenko & Lapa's deep nets in 1965 [DEEP1-2] nor those learnt by Amari's stochastic gradient descent for MLPs in 1967-1968 [GD1-2]. Nor did your recent survey [S20] attempt to correct this as good science should strive to do. On the other hand, it seems you celebrated your co-author's birthday in a special session while you were head of NeurIPS, instead of correcting these inaccuracies and celebrating the true pioneers of deep learning, such as Ivakhnenko and Amari. Even your recent interview https://blog.paperspace.com/terry-sejnowski-boltzmann-machines/ claims: "Our goal was to try to take a network with multiple layers - an input layer, an output layer and layers in between – and make it learn. It was generally thought, because of early work that was done in AI in the 60s, that no one would ever find such a learning algorithm because it was just too mathematically difficult." You wrote this although you knew exactly that such learning algorithms were first created in the 1960s, and that they worked. You are a well-known scientist, head of NeurIPS, and chief editor of a major journal. You must correct this. We must all be better than this as scientists. We owe it to past, present, and future scientists as well as those we ultimately serve.
The last paragraph of my report https://people.idsia.ch/~juergen/scientific-integrity-turing-award-deep-learning.html quotes Elvis Presley: "Truth is like the sun. You can shut it out for a time, but it ain't goin' away." I wonder how the future will reflect on the choices we make now.
Jürgen
> On 3 Jan 2022, at 11:38, Schmidhuber Juergen <juergen at idsia.ch> wrote:
> 
> Terry, please don't throw smoke candles like that!  
> 
> This is not about basic math such as Calculus (actually first published by Leibniz; later Newton was also credited for his unpublished work; Archimedes already had special cases thereof over 2000 years ago; the Indian Kerala school made essential contributions around 1400). In fact, my report addresses such smoke candles in Sec. XII: "Some claim that 'backpropagation is just the chain rule of Leibniz (1676) & L'Hopital (1696).' No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (there are also many inefficient ways of doing this). It was not published until 1970 [BP1]."
> 
> You write: "All these threads will be sorted out by historians one hundred years from now." To answer that, let me just cut and paste the last sentence of my conclusions: "However, today's scientists won't have to wait for AI historians to establish proper credit assignment. It is easy enough to do the right thing right now."
> 
> You write: "let us be good role models and mentors" to the new generation. Then please do what's right! Your recent survey [S20] does not help. It's mentioned in my report as follows: "ACM seems to be influenced by a misleading 'history of deep learning' propagated by LBH & co-authors, e.g., Sejnowski [S20] (see Sec. XIII). It goes more or less like this: 'In 1969, Minsky & Papert [M69] showed that shallow NNs without hidden layers are very limited and the field was abandoned until a new generation of neural network researchers took a fresh look at the problem in the 1980s [S20].' However, as mentioned above, the 1969 book [M69] addressed a 'problem' of Gauss & Legendre's shallow learning (~1800)[DL1-2] that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method [DEEP1-2][DL2] (and then also by Amari's SGD for MLPs [GD1-2]). Minsky was apparently unaware of this and failed to correct it later [HIN](Sec. I).... deep learning research was alive and kicking also in the 1970s, especially outside of the Anglosphere."
> 
> Just follow ACM's Code of Ethics and Professional Conduct [ACM18] which states: "Computing professionals should therefore credit the creators of ideas, inventions, work, and artifacts, and respect copyrights, patents, trade secrets, license agreements, and other methods of protecting authors' works." No need to wait for 100 years. 
> 
> Jürgen
> 
> 
> 
> 
> 
>> On 2 Jan 2022, at 23:29, Terry Sejnowski <terry at snl.salk.edu> wrote:
>> 
>> We would be remiss not to acknowledge that backprop would not be possible without the calculus,
>> so Isaac Newton should also have been given credit, at least as much credit as Gauss.
>> 
>> All these threads will be sorted out by historians one hundred years from now.
>> Our precious time is better spent moving the field forward.  There is much more to discover.
>> 
>> A new generation with better computational and mathematical tools than we had back
>> in the last century has joined us, so let us be good role models and mentors to them.
>> 
>> Terry