A New Series of Virtual Textbooks on Neural Networks

Michael A. Arbib arbib at pollux.usc.edu
Fri Oct 6 13:50:05 EDT 1995


October 6, 1995

Yesterday, a visitor to my office, while speaking of his enthusiasm 
for "The Handbook of Brain Theory and Neural Networks", 
mentioned that some of his colleagues had criticized the fact that 
the [266] articles [in Part III] were arranged in alphabetical order,
thus lacking the "logical order" to make the book easy to use for 
teaching.

The purpose of this note is to answer such concerns.

1.  The boring answer is that a Handbook is not a Textbook.  

Indeed, given that the 266 articles provide such a comprehensive 
overview - including detailed models of single neurons; analysis of 
a wide variety of neurobiological systems; connectionist studies; 
mathematical analyses of abstract neural networks; and 
technological applications of adaptive, artificial neural networks 
and related methodologies - it is hard to imagine a course that 
would cover the whole book, no matter in what order the articles 
were presented.

2. The exciting answer is that THE HANDBOOK IS A VIRTUAL 
LIBRARY OF TWENTY-THREE TEXTBOOKS!!

Before the 266 articles of Part III come Part I and Part II.

Part I provides a textbook-level introduction to Neural Networks.

Part II provides 23 "road maps", each of which lists the 
articles on a particular theme, followed by an essay which
offers a "logical order" in which to read these articles.  

Thus, the Handbook can be used to provide a "virtual textbook"
on any one of the following 23 topics:

Applications of Neural Networks
Artificial Intelligence and Neural Networks
Biological Motor Control
Biological Networks
Biological Neurons
Computability and Complexity
Connectionist Linguistics
Connectionist Psychology
Control Theory and Robotics
Cooperative Phenomena
Development and Regeneration of Neural Networks
Dynamic Systems and Optimization
Implementation of Neural Networks
Learning in Artificial Neural Networks, Deterministic
Learning in Artificial Neural Networks, Statistical
Learning in Biological Systems
Mammalian Brain Regions
Mechanisms of Neural Plasticity
Motor Pattern Generators and Neuroethology
Primate Motor Control
Self-Organization in Neural Networks
Other Sensory Systems
Vision

In each case, the instructor can follow the road map to traverse
the articles to provide full coverage of the topic, using the cross-
references to choose supplementary material from within the 
Handbook, and the carefully selected list of readings at the end
of each article to choose supplementary material from the general 
literature.

As an appendix to this message, I include a sample road map, that
on "Learning in Artificial Neural Networks, Deterministic".  All 
the road maps are available on the Web at:
http://www-mitpress.mit.edu/mitp/recent-books/comp/handbook-brain-theo.html

If you have other queries about how best to use the Handbook, or 
suggestions for improving the Handbook, please feel free to contact 
me by email:  arbib at pollux.usc.edu.

With best wishes

Michael Arbib

*****

APPENDIX: 

The Road Map for
"Learning in Artificial Neural Networks, Deterministic"

from Part II of The Handbook of Brain Theory and Neural 
Networks (M.A. Arbib, Ed.), A Bradford Book, copyright 1995, 
The MIT Press.

LEARNING IN ARTIFICIAL NEURAL NETWORKS, 
DETERMINISTIC

[Articles in the Road Map, listed in Alphabetical Order.]

Adaptive Resonance Theory
Associative Networks
Backpropagation: Basics and New Developments
Convolutional Networks for Images, Speech, and Time-Series
Coulomb Potential Learning
Kolmogorov's Theorem
Learning as Adaptive Control of Synaptic Matrices
Learning as Hill-Climbing in Weight Space
Learning by Symbolic and Neural Methods
Modular Neural Net Systems, Training of
Neocognitron: A Model for Visual Pattern Recognition
Neurosmithing: Improving Neural Network Learning
Nonmonotonic Neuron Associative Memory
Pattern Recognition
Perceptrons, Adalines, and Backpropagation
Recurrent Networks: Supervised Learning
Reinforcement Learning
Topology-Modifying Neural Network Algorithms

[Articles in the Road Map, discussed in Logical Order.]


Much of our concern is with supervised learning: getting a network 
to behave in a way that successfully approximates some specified 
pattern of behavior or input-output relationship. In particular, 
much emphasis has been placed on feedforward networks, that is, 
networks which have no loops, so that the output of the net 
depends on its input alone, since there is then no internal state 
defined by reverberating activity. The most direct form of this is a 
synaptic matrix, a one-layer neural network for which input lines 
directly drive the output neurons and a "supervised Hebbian" rule 
sets synapses so that the network will exhibit specified input-
output pairs in its response repertoire. This is addressed in the 
article on ASSOCIATIVE NETWORKS, which notes the problems 
that arise if the input patterns (the "keys" for associations) are not 
orthogonal vectors. Association also extends to recurrent networks 
obtained from one layer networks by feedback connections from the 
output to the input, but in such systems of "dynamic memories" (e.g., 
Hopfield networks) there are no external inputs as such. Rather the 
"input" is the initial state of the network, and the "output" is the 
"attractor" or equilibrium state to which the network then settles. 
Unfortunately, the usual "attractor network" memory model, with 
neurons whose output is a sigmoid function of the linear combination 
of their inputs, has many spurious memories, i.e., equilibria other 
than the memorized patterns, and there is no way to decide whether a 
memorized pattern has been recalled or not. The article on 
NONMONOTONIC NEURON ASSOCIATIVE MEMORY shows that, if 
the output of each neuron is a nonmonotonic function of its input, the 
capacity of the network can be increased, and the network does not 
exhibit spurious memories: when the network fails to recall a 
correct memorized pattern, the state shows a chaotic behavior 
instead of falling into a spurious memory.
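The supervised Hebbian rule for a synaptic matrix can be sketched in a few lines. This toy example is mine, not the Handbook's; it shows that with orthogonal ±1 key vectors, a weight matrix built by summing outer products recalls each stored output exactly, whereas non-orthogonal keys would interfere:

```python
# Minimal sketch of a one-layer "synaptic matrix" trained with the
# supervised Hebbian rule on orthogonal +-1 keys.

def outer_sum(pairs, n_out, n_in):
    """Sum of outer products output x key: W[i][j] = sum_p out_p[i] * key_p[j]."""
    W = [[0] * n_in for _ in range(n_out)]
    for key, out in pairs:
        for i in range(n_out):
            for j in range(n_in):
                W[i][j] += out[i] * key[j]
    return W

def recall(W, key):
    """Output neuron i fires sign(sum_j W[i][j] * key[j])."""
    return [1 if sum(w * x for w, x in zip(row, key)) >= 0 else -1
            for row in W]

# Two orthogonal keys (dot product 0), each paired with a target output.
pairs = [([1, 1, -1, -1], [1, -1]),
         ([1, -1, 1, -1], [-1, 1])]
W = outer_sum(pairs, n_out=2, n_in=4)
```

Presenting either key to `recall` reproduces its stored output; the orthogonality of the keys is what prevents crosstalk between the two associations.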

Historically, the earliest forms of supervised learning involved 
changing synaptic weights to oppose the error in a neuron with a 
binary output (the perceptron error-correction rule), or to minimize 
the sum of squares of errors of output neurons in a network with real-
valued outputs (the Widrow-Hoff rule). This work is charted in 
the article on PERCEPTRONS, ADALINES AND BACKPROPAGATION, 
which also charts the extension of these classic ideas to 
multilayered feedforward networks. Multilayered networks pose 
the structural credit assignment problem: when an error is made at 
the output of a network, how is credit (or blame) to be assigned to 
neurons deep within the network? One of the most popular 
techniques is called backpropagation, whereby the error of output 
units is propagated back to yield estimates of how much a given 
"hidden unit" contributed to the output error. These estimates are 
used in the adjustment of synaptic weights to these units within the 
network. The article on BACKPROPAGATION: BASICS AND NEW 
DEVELOPMENTS places this idea in a broader mathematical and 
historical framework in which backpropagation is seen as a 
general method for calculating derivatives to adjust the weights of 
nonlinear systems, whether or not they are neural networks. The 
underlying theoretical grounding is that, given any function f: X → 
Y for which X and Y are codable as input and output patterns of a 
neural network, then, as shown in the article on KOLMOGOROV'S 
THEOREM, f can be approximated arbitrarily well by a 
feedforward network with one layer of hidden units. The catch, of 
course, is that many, many hidden units may be required for a close 
fit. It is often an empirical question whether there exists a 
sufficiently good approximation achievable in principle by a 
network of a given size, an approximation which a given learning 
rule may or may not find (it may, for example, get stuck in a local 
optimum rather than a global one). The article on 
NEUROSMITHING: IMPROVING NEURAL NETWORK LEARNING 
provides a number of "rules of thumb" to be used in applying 
backpropagation in trying to find effective settings for network size 
and for various coefficients in the learning rules.
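The perceptron error-correction rule mentioned above is simple enough to sketch directly. The following illustration is mine, not the Handbook's code: the weight change opposes the error of a binary-output neuron, and for a linearly separable task (here, logical AND) the rule converges:

```python
# Sketch of the perceptron error-correction rule: on each example,
# nudge the weights in proportion to (target - output).

def step(z):
    return 1 if z >= 0 else 0

def train_perceptron(samples, n_in, lr=0.1, epochs=50):
    w = [0.0] * n_in
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            y = step(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = target - y  # +1, 0, or -1 for a binary neuron
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Logical AND is linearly separable, so the rule is guaranteed to converge.
samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(samples, n_in=2)
```

The Widrow-Hoff rule differs only in applying the same error-driven update to a real-valued output, minimizing squared error rather than counting misclassifications.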

One useful perspective for supervised learning views LEARNING AS 
HILL-CLIMBING IN WEIGHT SPACE, so that each "experience" 
adjusts the synaptic weights of the network to climb (or descend) a 
metaphorical hill for which "height" at a particular point in 
"weight space" corresponds to some measure of the performance of 
the network (or the organism or robot of which it is a part). When 
the aim is to minimize this measure, one of the basic techniques for 
learning is what mathematicians call "gradient descent"; 
optimization theory also provides alternative methods, such as 
conjugate gradients, which are also used in the neural 
network literature. REINFORCEMENT LEARNING describes a form of 
"semi-supervised" learning where the network is not provided 
with an explicit form of error at each time step but rather receives 
only generalized reinforcement ("you're doing well"; "that was 
bad!") which yields little immediate indication of how any neuron 
should change its behavior. Moreover, the reinforcement is 
intermittent, thus raising the temporal credit assignment problem: 
how is an action at one time to be credited for positive 
reinforcement at a later time? One solution is to build an "adaptive 
critic" which learns to evaluate actions of the network on the basis 
of how often they occur on a path leading to positive or negative 
reinforcement.
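The gradient-descent view of hill-climbing in weight space can be made concrete with a toy example of my own (not drawn from the Handbook): a single linear neuron whose squared-error surface over the one-dimensional weight space has its minimum at w = 2, which repeated downhill steps find:

```python
# Descending a squared-error surface in (one-dimensional) weight space.

def descend(data, lr=0.1, steps=200):
    w = 0.0
    for _ in range(steps):
        # Gradient of E = 0.5 * sum (w*x - y)^2 with respect to w.
        grad = sum((w * x - y) * x for x, y in data)
        w -= lr * grad  # step downhill
    return w

# Points lying on y = 2x: the error surface bottoms out at w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = descend(data)
```

Each step moves the weight opposite the local gradient; in a multilayer network the same idea applies, with backpropagation supplying the gradient for the hidden weights.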

Another perspective on supervised learning is presented in 
LEARNING AS ADAPTIVE CONTROL OF SYNAPTIC MATRICES, 
which views learning as a control problem (controlling synaptic 
matrices to yield a given network behavior) and then uses the 
adjoint equations of control theory to derive synaptic adjustment 
rules. Gradient descent methods have also been extended to adapt 
the synaptic weights of recurrent networks, as discussed in 
RECURRENT NETWORKS: SUPERVISED LEARNING, where the aim 
is to match the time course of network activity, rather than the 
(input, output) pairs of some training set.

The task par excellence for supervised learning is pattern 
recognition, the problem of classifying objects, often represented as 
vectors or as strings of symbols, into categories. Historically, the 
field of pattern recognition started with early efforts in neural 
networks (see PERCEPTRONS, ADALINES AND 
BACKPROPAGATION). While neural networks played a less central 
role in pattern recognition for some years, recent progress has made 
them the method of choice for many applications. As PATTERN 
RECOGNITION demonstrates, multilayer networks, when properly 
designed, can learn complex mappings in high-dimensional spaces 
without requiring complicated hand-crafted feature extractors. To 
rely more on learning, and less on detailed engineering of feature 
extractors, it is crucial to tailor the network architecture to the 
task, incorporating prior knowledge to be able to learn complex 
tasks without requiring excessively large networks and training 
sets. 

Many specific architectures have been developed to solve 
particular types of learning problem. ADAPTIVE RESONANCE 
THEORY (ART) bases learning on internal expectations. When the 
external world fails to match an ART network's expectations or 
predictions, a search process selects a new category, representing a 
new hypothesis about what is important in the present 
environment. The neocognitron (see NEOCOGNITRON: A MODEL 
FOR VISUAL PATTERN RECOGNITION) was developed as a neural 
network model for visual pattern recognition which addresses the 
specific question "how can a pattern be recognized despite 
variations in size and position?" by using a multilayer architecture 
in which local features are replicated in many different scales and 
locations. More generally, as shown in CONVOLUTIONAL 
NETWORKS FOR IMAGES, SPEECH, AND TIME SERIES, shift 
invariance in convolutional networks is obtained by forcing the 
replication of weight configurations across space. Moreover, the 
topology of the input is taken into account, enabling such networks 
to force the extraction of local features by restricting the receptive 
fields of hidden units to be local. COULOMB POTENTIAL LEARNING 
derives its name from its functional form's likeness to a Coulomb 
charge potential, replacing the linear separability of a simple 
perceptron with a network that is capable of constructing arbitrary 
nonlinear boundaries for classification tasks. 
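The weight replication that gives convolutional networks their shift invariance can be sketched in one dimension. This example is mine, not taken from the Handbook: a single shared kernel is slid across the input, so the same local feature produces the same response wherever it occurs, merely shifted:

```python
# One shared kernel slid across the input: the essence of weight
# replication in convolutional networks.

def conv1d(signal, kernel):
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [1, -1]         # a tiny "edge" detector, replicated everywhere
a = [0, 0, 1, 1, 0, 0]   # a step early in the signal
b = [0, 0, 0, 0, 1, 1]   # the same step, shifted right

ra = conv1d(a, kernel)   # rising edge -> response -1 at its position
rb = conv1d(b, kernel)   # same response, shifted with the input
```

Because the kernel's weights are shared across positions, shifting the input merely shifts the feature map; restricting each hidden unit's receptive field to the kernel's width is what forces the extraction of local features.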

We have already noted that networks that are too small cannot 
learn the desired input to output mapping. However, networks can 
also be too large. Just as a polynomial of too high a degree is not 
useful for curve-fitting, a network that is too large will fail to 
generalize well, and will require longer training times. Smaller 
networks, with fewer free parameters, enforce a smoothness 
constraint on the function found. For best performance, it is, 
therefore, desirable to find the smallest network that will 
"properly" fit the training data. The article TOPOLOGY-
MODIFYING NEURAL NETWORK ALGORITHMS reviews algorithms 
which adjust network topology (i.e., adding or removing neurons 
during the learning process) to arrive at a network appropriate to a 
given task. 

The last two articles in this road map take a somewhat different 
viewpoint from that of adjusting the synaptic weights in a single 
network. MODULAR NEURAL NET SYSTEMS, TRAINING OF presents 
the idea that, although single neural networks are theoretically 
capable of learning complex functions, many problems are better 
solved by designing systems in which several modules cooperate 
to perform a global task, replacing the complexity of a 
large neural network by the cooperation of neural network modules 
whose size is kept small. The article on LEARNING BY SYMBOLIC 
AND NEURAL METHODS focuses on the distinction between 
symbolic learning based on producing discrete combinations of the 
features used to describe examples and neural approaches which 
adjust continuous, nonlinear weightings of their inputs. The article 
not only compares but also combines the two approaches, showing 
for example how symbolic knowledge may be used to set the initial 
state of an adaptive network. 


[This Road Map is then followed by one on "Learning in Artificial 
Neural Networks, Statistical"]


