Connectionists: Who invented backpropagation?

Schmidhuber Juergen juergen at idsia.ch
Tue Jul 22 08:22:28 EDT 2014


Efficient backpropagation (BP) is central to the ongoing Neural Network (NN) ReNNaissance and "Deep Learning."  Who invented it? 

It is easy to find misleading accounts of BP's history. I had a look at the original papers from the 1960s and 70s, and talked to BP pioneers. Here is a summary derived from my recent survey, which has additional references:

The minimisation of errors through gradient descent (Hadamard, 1908) in the parameter space of complex, nonlinear, differentiable, multi-stage, NN-related systems has been discussed at least since the early 1960s (e.g., Kelley, 1960; Bryson, 1961; Bryson and Denham, 1961; Pontryagin et al., 1961; Dreyfus, 1962; Wilkinson, 1965; Amari, 1967; Bryson and Ho, 1969; Director and Rohrer, 1969), initially within the framework of Euler-LaGrange equations in the Calculus of Variations (e.g., Euler, 1744). 

Steepest descent in the weight space of such systems can be performed (Bryson, 1961; Kelley, 1960; Bryson and Ho, 1969) by iterating the chain rule (Leibniz, 1676; L’Hopital, 1696) à la Dynamic Programming (DP, Bellman, 1957). A simplified derivation of this backpropagation method uses the chain rule only (Dreyfus, 1962).

The systems of the 1960s were already efficient in the DP sense. However, they backpropagated derivative information through standard Jacobian matrix calculations from one “layer” to the previous one, without explicitly addressing either direct links across several layers or potential additional efficiency gains due to network sparsity (but perhaps such enhancements seemed obvious to the authors). 

Explicit, efficient error backpropagation (BP) in arbitrary, discrete, possibly sparsely connected, NN-like networks apparently was first described in a 1970 master’s thesis (Linnainmaa, 1970, 1976), albeit without reference to NNs. BP is also known as the reverse mode of automatic differentiation (e.g., Griewank, 2012), where the costs of forward activation spreading essentially equal the costs of backward derivative calculation. See early BP FORTRAN code (Linnainmaa, 1970) and closely related work (Ostrovskii et al., 1971). 

BP was soon explicitly used to minimize cost functions by adapting control parameters (weights) (Dreyfus, 1973). This was followed by some preliminary, NN-specific discussion (Werbos, 1974, section 5.5.1), and a computer program for automatically deriving and implementing BP for any given differentiable system (Speelpenning, 1980).

To my knowledge, the first NN-specific application of efficient BP as above was described in 1981 (Werbos, 1981). Related work was published several years later (Parker, 1985; LeCun, 1985). A paper of 1986 significantly contributed to the popularisation of BP (Rumelhart et al., 1986). 

Compare also the first adaptive, deep, multilayer perceptrons (Ivakhnenko et al., since 1965), whose layers are incrementally grown and trained by regression analysis, as well as  a more recent method for multilayer threshold NNs (Bobrowski, 1978).

Precise references and more history in:

Deep Learning in Neural Networks: An Overview 
PDF & LATEX source & complete public BIBTEX file under
http://www.idsia.ch/~juergen/deep-learning-overview.html

Juergen Schmidhuber
http://www.idsia.ch/~juergen/whatsnew.html

P.S.: I'll give talks on Deep Learning and other things in the NYC area around 1-5 and 18-19 August, and in the Bay area around 7-15 August; videos of previous talks can be found under http://www.idsia.ch/~juergen/videos.html








More information about the Connectionists mailing list