Connectionists: Scientific Integrity, the 2021 Turing Lecture, etc.

Mon Jan 3 09:55:06 EST 2022

Terry:

We can all agree on the importance of mentoring the next generation. 
However, given that:

1) you have been in full and sole control of the NIPS/NeurIPS foundation 
since the 1980s;

2) you have been in full and sole control of Neural Computation since 
the 1980s;

3) you have extensively published in Neural Computation (and now also PNAS);

4) you have made sure, year after year, that you and your BHL/CIFAR 
friends were able to control and subtly manipulate NIPS/NeurIPS 
(misleading the field in wrong directions, preventing new ideas and 
outsiders from flourishing, and distorting credit attribution).

Can you please explain to this mailing list how this serves as being "a 
good role model" (to use your own words) for the next generation?

Or did you mean it in a more cynical way--indeed this is one of the 
possible ways for a scientist to be "successful"?

--Pierre

On 1/2/2022 12:29 PM, Terry Sejnowski wrote:
> We would be remiss not to acknowledge that backprop would not be 
> possible without the calculus,
> so Isaac Newton should also have been given credit, at least as much 
> credit as Gauss.
>
> All these threads will be sorted out by historians one hundred years 
> from now.
> Our precious time is better spent moving the field forward.  There is 
> much more to discover.
>
> A new generation with better computational and mathematical tools than 
> we had back
> in the last century has joined us, so let us be good role models and 
> mentors to them.
>
> Terry
>
> -----
>
> On 1/2/2022 5:43 AM, Schmidhuber Juergen wrote:
>> Asim wrote: "In fairness to Geoffrey Hinton, he did acknowledge the 
>> work of Amari in a debate about connectionism at the ICNN’97 .... He 
>> literally said 'Amari invented back propagation'..." when he sat next 
>> to Amari and Werbos. Later, however, he failed to cite Amari’s 
>> stochastic gradient descent (SGD) for multilayer NNs (1967-68) 
>> [GD1-2a] in his 2015 survey [DL3], his 2021 ACM lecture [DL3a], and 
>> other surveys.  Furthermore, SGD [STO51-52] (Robbins, Monro, Kiefer, 
>> Wolfowitz, 1951-52) is not even backprop. Backprop is just a 
>> particularly efficient way of computing gradients in differentiable 
>> networks, known as the reverse mode of automatic differentiation, due 
>> to Linnainmaa (1970) [BP1] (see also Kelley's precursor of 1960 
>> [BPa]). Hinton did not cite these papers either, and in 2019 
>> embarrassingly did not hesitate to accept an award for having 
>> "created ... the backpropagation algorithm" [HIN]. All references and 
>> more on this can be found in the report, especially in Sec. XII.
>>
>> The deontology of science requires: If one "re-invents" something 
>> that was already known, and only becomes aware of it later, one must 
>> at least clarify it later [DLC], and correctly give credit in all 
>> follow-up papers and presentations. Also, ACM's Code of Ethics and 
>> Professional Conduct [ACM18] states: "Computing professionals should 
>> therefore credit the creators of ideas, inventions, work, and 
>> artifacts, and respect copyrights, patents, trade secrets, license 
>> agreements, and other methods of protecting authors' works." LBH didn't.
>>
>> Steve still doesn't believe that linear regression of 200 years ago 
>> is equivalent to linear NNs. In a mature field such as math we would 
>> not have such a discussion. The math is clear. And even today, many 
>> students are taught NNs like this: let's start with a linear 
>> single-layer NN (activation = sum of weighted inputs). Now minimize 
>> mean squared error on the training set. That's good old linear 
>> regression (method of least squares). Now let's introduce multiple 
>> layers and nonlinear but differentiable activation functions, and 
>> derive backprop for deeper nets in 1960-70 style (still used today, 
>> half a century later).
>>
>> Sure, an important new variation of the 1950s (emphasized by Steve) 
>> was to transform linear NNs into binary classifiers with threshold 
>> functions. Nevertheless, the first adaptive NNs (still widely used 
>> today) are 1.5 centuries older except for the name.
>>
>> Happy New Year!
>>
>> Jürgen
>>
>>
>>> On 2 Jan 2022, at 03:43, Asim Roy <ASIM.ROY at asu.edu> wrote:
>>>
>>> And, by the way, Paul Werbos was also there at the same debate. And 
>>> so was Teuvo Kohonen.
>>>
>>> Asim
>>>
>>> -----Original Message-----
>>> From: Asim Roy
>>> Sent: Saturday, January 1, 2022 3:19 PM
>>> To: Schmidhuber Juergen <juergen at idsia.ch>; connectionists at cs.cmu.edu
>>> Subject: RE: Connectionists: Scientific Integrity, the 2021 Turing 
>>> Lecture, etc.
>>>
>>> In fairness to Geoffrey Hinton, he did acknowledge the work of Amari 
>>> in a debate about connectionism at the ICNN’97 (International 
>>> Conference on Neural Networks) in Houston. He literally said "Amari 
>>> invented back propagation" and Amari was sitting next to him. I 
>>> still have a recording of that debate.
>>>
>>> Asim Roy
>>> Professor, Information Systems
>>> Arizona State University
>>> https://isearch.asu.edu/profile/9973
>>> https://lifeboat.com/ex/bios.asim.roy
>>
>> On 2 Jan 2022, at 02:31, Stephen José Hanson <jose at rubic.rutgers.edu> 
>> wrote:
>>
>> Juergen:  Happy New Year!
>>
>> "are not quite the same"..
>>
>> I understand that it's expedient sometimes to use linear regression to 
>> approximate the Perceptron (I've had other connectionist friends tell 
>> me the same thing), which has its own incremental update rule that is 
>> doing <0,1> classification. So I guess if you don't like the 
>> analogy to logistic regression, maybe Fisher's LDA? This whole 
>> thing still doesn't scan for me.
>>
>> So, again, the point here is context. Do you really believe that 
>> Frank Rosenblatt didn't reference Gauss/Legendre/Laplace because it 
>> slipped his mind? He certainly understood modern statistics (of 
>> the 1940s and 1950s).
>>
>> Certainly you'd agree that FR could have referenced linear regression 
>> as a precursor, or as "pretty similar" to what he was working on; it 
>> seems disingenuous to imply he was plagiarizing Gauss et al.--right? 
>> Why would he?
>>
>> Finally then, in any historical reconstruction I can think of, it 
>> just doesn't make sense. Sorry.
>>
>> Steve
>>
>>
>>> -----Original Message-----
>>> From: Connectionists <connectionists-bounces at mailman.srv.cs.cmu.edu> 
>>> On Behalf Of Schmidhuber Juergen
>>> Sent: Friday, December 31, 2021 11:00 AM
>>> To: connectionists at cs.cmu.edu
>>> Subject: Re: Connectionists: Scientific Integrity, the 2021 Turing 
>>> Lecture, etc.
>>>
>>> Sure, Steve, perceptron/Adaline/other similar methods of the 
>>> 1950s/60s are not quite the same, but the obvious origin and 
>>> ancestor of all those single-layer  “shallow learning” 
>>> architectures/methods is indeed linear regression; today’s simplest 
>>> NNs minimizing mean squared error are exactly what they had 2 
>>> centuries ago. And the first working deep learning methods of the 
>>> 1960s did NOT really require “modern” backprop (published in 1970 by 
>>> Linnainmaa [BP1-5]). For example, Ivakhnenko & Lapa (1965) [DEEP1-2] 
>>> incrementally trained and pruned their deep networks layer by layer 
>>> to learn internal representations, using regression and a separate 
>>> validation set. Amari (1967-68) [GD1] used stochastic gradient 
>>> descent [STO51-52] to learn internal representations WITHOUT 
>>> “modern” backprop in his multilayer perceptrons. Jürgen
>>>
>>>
>>>> On 31 Dec 2021, at 18:24, Stephen José Hanson 
>>>> <jose at rubic.rutgers.edu> wrote:
>>>>
>>>> Well, the perceptron is closer to logistic regression... but the 
>>>> Heaviside function of course is <0,1>, so technically not related 
>>>> to linear regression, which is using covariance to estimate betas...
>>>>
>>>> does that matter?  Yes, if you want to be hyper correct--as this 
>>>> appears to be-- Berkson (1944) coined the logit.. as log odds.. for 
>>>> probabilistic classification.. this was formally developed by Cox 
>>>> in the early 60s, so unlikely even in this case to be a precursor 
>>>> to perceptron.
>>>>
>>>> My point was that DL requires both a learning algorithm (BP) and an
>>>> architecture.. which seems to me much more responsible for the 
>>>> success of DL.
>>>>
>>>> S
>>>>
>>>>
>>>>
>>>> On 12/31/21 4:03 AM, Schmidhuber Juergen wrote:
>>>>> Steve, this is not about machine learning in general, just about deep
>>>>> learning vs shallow learning. However, I added the Pandemonium -
>>>>> thanks for that! You ask: how is a linear regressor of 1800
>>>>> (Gauss/Legendre) related to a linear neural network? It's formally
>>>>> equivalent, of course! (The only difference is that the weights are
>>>>> often called beta_i rather than w_i.) Shallow learning: one adaptive
>>>>> layer. Deep learning: many adaptive layers. Cheers, Jürgen
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On 31 Dec 2021, at 00:28, Stephen José Hanson
>>>>>> <jose at rubic.rutgers.edu>
>>>>>> wrote:
>>>>>>
>>>>>> Despite the comprehensive feel of this it still appears to me to 
>>>>>> be  too focused on Back-propagation per se.. (except for that 
>>>>>> pesky Gauss/Legendre ref--which still baffles me at least how 
>>>>>> this is related to a "neural network"), and at the same time it 
>>>>>> appears to be missing other more general epoch-conceptually 
>>>>>> relevant cases, say:
>>>>>>
>>>>>> Oliver Selfridge  and his Pandemonium model.. which was a 
>>>>>> hierarchical feature analysis system.. which certainly was in the 
>>>>>> air during the Neural network learning heyday...in fact, Minsky 
>>>>>> cites Selfridge as one of his mentors.
>>>>>>
>>>>>> Arthur Samuel: checker-playing system.. which learned an 
>>>>>> evaluation function from a hierarchical search.
>>>>>>
>>>>>> Rosenblatt's advisor was Egon Brunswik.. who was a gestalt 
>>>>>> perceptual psychologist who introduced the concept that the world 
>>>>>> was stochastic and the organism had to adapt to this variance 
>>>>>> somehow.. he called it "probabilistic functionalism", which 
>>>>>> brought attention to learning, perception and decision theory, 
>>>>>> certainly all piece parts of what we call neural networks.
>>>>>>
>>>>>> There are many other such examples that influenced or provided 
>>>>>> context for the yeasty mix that was the 1940s and 1950s, where 
>>>>>> Neural Networks first appeared, partly due to Pitts and McCulloch, 
>>>>>> which entangled the human brain with computation and early 
>>>>>> computers themselves.
>>>>>>
>>>>>> I just don't see this as didactic, in the sense of a conceptual 
>>>>>> view of the multidimensional history of the field, as 
>>>>>> opposed to a 1-dimensional exegesis of mathematical threads 
>>>>>> through various statistical algorithms.
>>>>>>
>>>>>> Steve
>>>>>>
>>>>>> On 12/30/21 1:03 PM, Schmidhuber Juergen wrote:
>>>>>>
>>>>>>> Dear connectionists,
>>>>>>>
>>>>>>> in the wake of massive open online peer review, public comments 
>>>>>>> on the connectionists mailing list [CONN21] and many additional 
>>>>>>> private comments (some by well-known deep learning pioneers) 
>>>>>>> helped to update and improve upon version 1 of the report. The 
>>>>>>> essential statements of the text remain unchanged as their 
>>>>>>> accuracy remains unchallenged. I'd like to thank everyone from 
>>>>>>> the bottom of my heart for their feedback up until this point 
>>>>>>> and hope everyone will be satisfied with the changes. Here is 
>>>>>>> the revised version 2 with over 300 references:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> https://people.idsia.ch/~juergen/scientific-integrity-turing-award-deep-learning.html
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> In particular, Sec. II has become a brief history of deep 
>>>>>>> learning up to the 1970s:
>>>>>>>
>>>>>>> Some of the most powerful NN architectures (i.e., recurrent NNs) 
>>>>>>> were discussed in 1943 by McCulloch and Pitts [MC43] and 
>>>>>>> formally analyzed in 1956 by Kleene [K56] - the closely related 
>>>>>>> prior work in physics by Lenz, Ising, Kramers, and Wannier dates 
>>>>>>> back to the 1920s [L20][I25][K41][W45]. In 1948, Turing wrote up 
>>>>>>> ideas related to artificial evolution [TUR1] and learning NNs. 
>>>>>>> He failed to formally publish his ideas though, which explains 
>>>>>>> the obscurity of his thoughts here. Minsky's simple neural SNARC 
>>>>>>> computer dates back to 1951. Rosenblatt's perceptron with a 
>>>>>>> single adaptive layer learned in 1958 [R58] (Joseph [R61] 
>>>>>>> mentions an earlier perceptron-like device by Farley & Clark); 
>>>>>>> Widrow & Hoff's similar Adaline learned in 1962 [WID62]. Such 
>>>>>>> single-layer "shallow learning" actually started around 1800 
>>>>>>> when Gauss & Legendre introduced linear regression and the 
>>>>>>> method of least squares [DL1-2] - a famous early example of 
>>>>>>> pattern recognition and generalization from training data through 
>>>>>>> a parameterized predictor is Gauss' rediscovery of the asteroid 
>>>>>>> Ceres based on previous astronomical observations. Deeper 
>>>>>>> multilayer perceptrons (MLPs) were discussed by Steinbuch 
>>>>>>> [ST61-95] (1961), Joseph [R61] (1961), and Rosenblatt [R62] 
>>>>>>> (1962), who wrote about "back-propagating errors" in an MLP with 
>>>>>>> a hidden layer [R62], but did not yet have a general deep 
>>>>>>> learning algorithm for deep MLPs (what's now called 
>>>>>>> backpropagation is quite different and was first published by 
>>>>>>> Linnainmaa in 1970 [BP1-BP5][BPA-C]). Successful learning in deep 
>>>>>>> architectures started in 1965 when Ivakhnenko & Lapa published 
>>>>>>> the first general, working learning algorithms for deep MLPs with 
>>>>>>> arbitrarily many hidden layers (already containing the now 
>>>>>>> popular multiplicative gates) [DEEP1-2][DL1-2]. A paper of 1971 
>>>>>>> [DEEP2] already described a deep learning net with 8 layers, 
>>>>>>> trained by their highly cited method which was still popular in 
>>>>>>> the new millennium [DL2], especially in Eastern Europe, where 
>>>>>>> much of Machine Learning was born [MIR](Sec. 1)[R8]. LBH failed 
>>>>>>> to cite this, just like they failed to cite Amari [GD1], who in 
>>>>>>> 1967 proposed stochastic gradient descent [STO51-52] (SGD) for 
>>>>>>> MLPs and whose implementation [GD2,GD2a] (with Saito) learned 
>>>>>>> internal representations at a time when compute was billions of 
>>>>>>> times more expensive than today (see also Tsypkin's work 
>>>>>>> [GDa-b]). (In 1972, Amari also published what was later sometimes 
>>>>>>> called the Hopfield network or Amari-Hopfield Network [AMH1-3].) 
>>>>>>> Fukushima's now widely used deep convolutional NN architecture 
>>>>>>> was first introduced in the 1970s [CNN1].
>>>>>>>
>>>>>>> Jürgen
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ******************************
>>>>>>>
>>>>>>> On 27 Oct 2021, at 10:52, Schmidhuber Juergen <juergen at idsia.ch> wrote:
>>>>>>>
>>>>>>> Hi, fellow artificial neural network enthusiasts!
>>>>>>>
>>>>>>> The connectionists mailing list is perhaps the oldest mailing 
>>>>>>> list on ANNs, and many neural net pioneers are still subscribed 
>>>>>>> to it. I am hoping that some of them - as well as their 
>>>>>>> contemporaries - might be able to provide additional valuable 
>>>>>>> insights into the history of the field.
>>>>>>>
>>>>>>> Following the great success of massive open online peer review
>>>>>>> (MOOR) for my 2015 survey of deep learning (now the most cited
>>>>>>> article ever published in the journal Neural Networks), I've
>>>>>>> decided to put forward another piece for MOOR. I want to thank the
>>>>>>> many experts who have already provided me with comments on it.
>>>>>>> Please send additional relevant references and suggestions for
>>>>>>> improvements for the following draft directly to me at
>>>>>>> juergen at idsia.ch:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> https://people.idsia.ch/~juergen/scientific-integrity-turing-award-deep-learning.html
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The above is a point-for-point critique of factual errors in 
>>>>>>> ACM's justification of the ACM A. M. Turing Award for deep 
>>>>>>> learning and a critique of the Turing Lecture published by ACM 
>>>>>>> in July 2021. This work can also be seen as a short history of 
>>>>>>> deep learning, at least as far as ACM's errors and the Turing 
>>>>>>> Lecture are concerned.
>>>>>>>
>>>>>>> I know that some view this as a controversial topic. However, it 
>>>>>>> is the very nature of science to resolve controversies through 
>>>>>>> facts. Credit assignment is as core to scientific history as it 
>>>>>>> is to machine learning. My aim is to ensure that the true 
>>>>>>> history of our field is preserved for posterity.
>>>>>>>
>>>>>>> Thank you all in advance for your help!
>>>>>>>
>>>>>>> Jürgen Schmidhuber

-- 
Pierre Baldi, Ph.D.
Distinguished Professor, Department of Computer Science
Director, Institute for Genomics and Bioinformatics
Associate Director, Center for Machine Learning and Intelligent Systems
University of California, Irvine
Irvine, CA 92697-3435
(949) 824-5809
(949) 824-9813 [FAX]
Assistant: Janet Ko  jko at uci.edu