Connectionist Learning - Some New Ideas/Questions

Asim Roy ATAXR at asuvm.inre.asu.edu
Thu May 30 18:44:10 EDT 1996


(This is for posting to your mailing list.)
 
This is an attempt to respond to some thoughts on one particular
aspect of our learning theory - the one that requires
connectionist/neural net algorithms to make an explicit "attempt" to
build the smallest possible net (generalize, that is). One school of
thought says that we should not attempt to build the smallest possible
net because some extra neurons in the net (and their extra connections)
provide the benefits of fault tolerance and reliability. And since the
brain has access to billions of neurons, it does not really need to worry
about a real resource constraint - it is practically an unlimited resource.
(It is a fact of life, however, that at some age we do have
difficulty memorizing, remembering things and learning - we perhaps
run out of space (neurons), like a storage device on a computer.
Even though billions of neurons is a large number, we must be using
most of them at some age. So it is indeed a finite resource, and some of it
appears to be reused, just as we reuse space on our storage devices.
For memorization, for example, it is possible that
the brain selectively erases some old memories to store some new ones.
So a finite capacity system is a sensible view of the brain.)
Another argument in favor of not trying to generalize is that by not
attempting to create the smallest possible net, the
connectionist algorithms become easier to develop and less complex. I hope
researchers will come forward with other arguments in favor of not
attempting to create the smallest possible net or to generalize.
 
There is one main problem with the argument that adding lots of extra
neurons to a net buys reliability and fault tolerance: we run the
severe risk of "learning nothing" if we don't attempt to generalize.
With lots of neurons available to a net, we would simply overfit the
net to the problem data. (Try it next time on your back prop net. Add
10 or 100 times the number of hidden nodes you need and observe the
results.) That is all we would achieve. Without good generalization, we
may have a fault tolerant and reliable net, but it may be "useless" for all
practical purposes because it may have "learnt nothing". Generalization
is the fundamental part of learning - it perhaps should be the first
learning criterion for our algorithms. We can't overlook or skip that part.
If an algorithm doesn't attempt to generalize, it doesn't attempt to learn.
It is as simple as that. So generalization needs to be our first priority and
fault tolerance comes later. First we must "learn" something, then make
it fault tolerant and reliable.
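 
To see the overfitting point concretely, the experiment is easy to run on
a toy problem. The sketch below is only an illustration, not a prescribed
procedure: it uses scikit-learn's MLPRegressor as a stand-in for a plain
back-propagation net, and the data set, noise level and layer sizes are
arbitrary choices of mine. A right-sized net and a 100-times-oversized net
are trained on the same noisy sample; typically the oversized net fits the
training points more closely while doing no better, and often worse, on
held-out data.
 
# Toy overfitting experiment in the spirit of "add 10 or 100 times the
# number of hidden nodes you need and observe the results".
# Assumes numpy and scikit-learn are available; all constants are
# illustrative choices, not values from the posting.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=80)   # noisy 1-D target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for hidden in (5, 500):   # a "right-sized" net vs. a 100x oversized net
    net = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=20000,
                       tol=1e-6, random_state=0)
    net.fit(X_tr, y_tr)
    print(hidden,
          "train MSE %.3f" % mean_squared_error(y_tr, net.predict(X_tr)),
          "test MSE %.3f" % mean_squared_error(y_te, net.predict(X_te)))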
 
Here is a practical viewpoint for our algorithms. Even though neurons
are almost "unlimited" and free of cost to our brain, from a practical
engineering standpoint, "silicon" neurons are not so cheap. So our
algorithms definitely need to be cost conscious and try to build the
smallest possible net; they cannot be wasteful in their use of expensive
"silicon" neurons.
 
Once we obtain good generalization on a problem, fault tolerance can
be achieved in many other ways. It would not hurt to examine the
well-established theory of reliability for some neat ideas. A few backup
systems might be a more cost-effective way to buy reliability than
throwing lots of extra silicon into a single system, which may buy us
nothing (it "learns nothing"). From controlling nuclear power plants
with backup computer systems to adding extra tires to our trucks and
buses, the backup idea works quite well. It is possible that "backup" is
also what is used in our brains. We need to find out. "Redundancy"
may be in the form of backup systems. "Repair" is another good idea
used in our everyday lives for not-so-critical systems. Are fault tolerance
and reliability sometimes achieved in the brain through the process of
"repair"? Patients do recover memory and other brain functions after a
stroke. Is that repair work by the biological system? It is a fact that
biological systems are good at repairing things (look at simple things
like cuts and bruises). We perhaps need to look more closely at our biological
systems and the facts and get good clues about how they work. Let us not
jump to conclusions so quickly. Let us argue and debate with the facts.
We will do our science a good service and be able to make real progress.
 
I would welcome more thoughts and debate on this issue. I have
included all of the previous responses on this particular issue for easy
reference by the readers. I have also appended our earlier note on our
learning theory. Perhaps more researchers will come forward with
facts and ideas and enlighten all of us on this crucial question.
********************************************
On May 16 Kevin Cherkauer wrote:
 
"In a recent thought-provoking posting to the connectionist list, Asim
Roy said:
 >E.      Generalization in Learning: The method must be able to
 >generalize reasonably well so that only a small amount of network
 >resources is used. That is, it must try to design the smallest possible
 >net, although it might not be able to do so every time. This must be
 >an explicit part of the algorithm. This property is based on the
 >notion that the brain could not be wasteful of its limited resources,
  >so it must be trying to design the smallest possible net for every
 >task.
 
 I disagree with this point. According to Hertz, Krogh, and Palmer
(1991, p. 2), the human brain contains about 10^11 neurons. (They
also state on p. 3 that "the axon of a typical neuron makes a few
thousand synapses with other neurons," so we're looking at on the
order of 10^14 "connections" in the brain.) Note that a period of 100
years contains only about 3x10^9 seconds. Thus, if you lived 100 years
and learned continuously at a constant rate every second of your life,
your brain would be at liberty to "use up" the capacity of about 30
neurons (and 30,000 connections) per second. I would guess this is a
very conservative bound, because most of us probably spend quite a
bit of time where we aren't learning at such a furious rate. But even
using this conservative bound, I calculate that I'm allowed to use up
about 2.7x10^6 neurons (and 2.7x10^9 connections) today.
 
 I'll try not to spend them all in one place. :-)
 
 Dr. Roy's suggestion that the brain must try "to design the smallest
possible net for every task" because "the brain could not be wasteful of
its limited resources" is unlikely, in my opinion. It seems to me that the
brain has rather an abundance of neurons. On the other hand, finding
optimal solutions to many interesting "real-world" problems is often
very hard computationally. I am not a complexity theorist, but I will
hazard to suggest that a constraint on neural systems to be optimal or
near-optimal in their space usage is probably both impossible to realize
and, in fact, unnecessary.
 
 Wild speculation: the brain may have so many neurons precisely so
that it can afford to be suboptimal in its storage usage in order to
avoid computational time intractability.
 
 References
 
   Hertz, J.; Krogh, A.; & Palmer, R.G. 1991. Introduction to the
Theory of Neural Computation. Redwood City, CA: Addison-Wesley."
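 
For readers who want to check the arithmetic, Cherkauer's back-of-the-envelope
budget can be reproduced in a few lines; the constants below are the round
figures quoted in his posting, not new measurements.
 
# Reproducing Kevin Cherkauer's back-of-the-envelope neuron budget
# (all constants are his round figures).
neurons = 1e11                                   # ~10^11 neurons in the brain
connections = neurons * 1e3                      # "a few thousand" synapses each

seconds_per_century = 100 * 365.25 * 24 * 3600   # ~3.2e9 seconds in 100 years
seconds_per_day = 24 * 3600

print(neurons / seconds_per_century)             # ~30 neurons "usable" per second
print(connections / seconds_per_century)         # ~3e4 connections per second
print(neurons / seconds_per_century * seconds_per_day)      # ~2.7e6 neurons per day
print(connections / seconds_per_century * seconds_per_day)  # ~2.7e9 connections per day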
**************************************************
On May 15 Richard Kenyon wrote on the subject of generalization:
 
" The brain probably accepts some form of redundancy (waste).
 I agree that the brain is one hell of an optimisation machine.
 Intelligence, whatever task it may be applied to, is (again imho) one
long optimisation process.
 
 Generalisation arises (even emerges or is a side effect) as a result of
 ongoing optimisation, conglomeration, reprocessing etc etc. This is
again very important, I agree, but I think (I do anyway) we in the NN
community are aware of this, as with much of the above. I thought
that apart from point A we were doing all of this already, although to
have it explicitly published is very valuable."
*****************************************
On May 16 Lokendra Shastri replied to Kevin Cherkauer:
 
"There is another way to look at the numbers. The retina provides
10^6 inputs to the brain every 200 msec! A simple n^2 algorithm to
process this input would require more neurons than we have in our
brain. We can understand (or at least process) a potentially unbounded
number of sentences --- here is one: "the Grand Canyon walked past the
banana." I could have said any one of a gazillion sentences at this point
and you would probably have understood it. Even if we just count the
overt symbolic knowledge we carry in our heads, we can enumerate
about a million items. A coding scheme that consumed a thousand neurons
per item (which is not much) would soon run out of neurons. Remember
that a large fraction of our neurons are already taken up by
sensorimotor processes (vision itself consumes a fair fraction of the
brain). For an argument on the tight constraints posed by the "limited"
number of neurons vis-a-vis common sense knowledge, you may want
to see:
 
 ``From simple associations to systematic reasoning'', L. Shastri and
 V. Ajjanagadde. In Behavioral and Brain Sciences Vol. 16, No. 3,
 417--494, 1993.
 My home page has a URL to a postscript version.
 
 There was also a nice paper by Tsotsos in Behavioral and Brain
Sciences on this topic from the perspective of visual processing. Also
you might want to see Feldman and Ballard 1982 paper in Cognitive
Science."
***********************************************
On May 17 Steven Small replied to Kevin Cherkauer:
 
 "I agree with this general idea, although I'm not sure that
"computational time intractability" is necessarily the principal reason.
There are a lot of good reasons for redundancy, overlap, and space
"suboptimality", not the least of which is the marvellous ability at
recovery that the brain manifests after both small injuries and larger
ones that give pause even to experienced neurologists."
*************************************************
On May 17 Jonathan Stein replied to Steven Small and Kevin
Cherkauer:
 
 "One needn't draw upon injuries to prove the point. One loses about
100,000 cortical neurons a day (about a percent of the original number
every three years) under normal conditions. This loss is apparently not
significant for brain function. This has been often called the strongest
argument for distributed processing in the brain. Compare this ability
with the fact that a single conductor disconnection causes total system
failure with high probability in conventional computers.
 
 Although certainly acknowledged by the pioneers of artificial neural
 network techniques, very few networks designed and trained by
present techniques are anywhere near that robust. Studies carried out
on the Hopfield model of associative memory DO show graceful
degradation of memory capacity with synapse dilution under certain
conditions (see eg. DJ Amit's book "Attractor Neural Networks").
Synapse pruning has been applied to trained feedforward networks
(eg. LeCun's "Optimal Brain Damage") but requires retraining of the
network."
 ******************************************
On May 18 Raj Rao replied to Kevin Cherkauer and Steven Small:
 
" Does anyone have a concrete citation (a journal article) for this or
 any other similar estimate regarding the daily cell death rate in the
 cortex of a normal brain?  I've read such numbers in a number of
 connectionist papers but none cite any neurophysiological studies that
 substantiate these numbers."
********************************************
On May 19 Richard Long wrote:
 
"There may be another reason for the brain to construct
 networks that are 'minimal', having to do with Chaitin and
 Kolmogorov computational complexity.  If a minimal network
corresponds to a 'minimal algorithm' for implementing a particular
computation, then that particular network must utilize all of the
symmetries and regularities contained in the problem, or else these
symmetries could be used to reduce the network further.  Chaitin has
shown that no algorithm for finding this minimal algorithm in the
general case is possible. However, if an evolutionary programming
method is used in which the fitness function is both 'solves the
problem' and 'smallest size' (i.e. Occam's razor), then it is possible that
the symmetries and regularities in the problem would be extracted as
smaller and smaller networks are found.  I would argue that such
networks would compute the solution less by rote or brute force, and
more from a deep understanding of the problem. I would like to hear
anyone else's thoughts on this."
**************************************************
On May 20 Juergen Schmidhuber replies to Richard Long:
 
"Apparently, Kolmogorov was the first to show the impossibility of
finding the minimal algorithm in the general case (but Solomonoff also
mentions it in his early work). The reason is the halting problem, of
course - you don't know the runtime of the minimal algorithm. For all
practical applications, runtime has to be taken into account.
Interestingly, there is an ``optimal'' way of doing this, namely Levin's
universal search algorithm, which tests solution candidates in order of
their Levin complexities:
 
 L. A. Levin. Universal sequential search problems. Problems of
Information Transmission, 9(3):265-266, 1973.
 
 For finding Occam's razor neural networks with minimal Levin
complexity, see
 J. Schmidhuber: Discovering solutions with low Kolmogorov
complexity and high generalization capability. In A. Prieditis and
S.Russell, editors, Machine Learning: Proceedings of the 12th
International Conference, 488--496. Morgan Kaufmann Publishers,
San Francisco, CA, 1995.
 
 For Occam's razor solutions of non-Markovian reinforcement learning
tasks, see
 M. Wiering and J. Schmidhuber: Solving POMDPs using Levin
search and EIRA. In Machine Learning: Proceedings of the 13th
International Conference. Morgan Kaufmann Publishers, San
Francisco, CA, 1996, to appear."
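 
To make the phase-wise time allocation of Levin's universal search concrete,
here is a toy sketch of my own. It assumes a made-up three-opcode language in
which a program's runtime equals its length, and a simple length-based prior;
it is not code from the papers cited above, only an illustration of testing
candidates in order of (roughly) their Levin complexities.
 
import itertools

# Toy micro-language: a "program" is a tuple of opcodes applied left to right.
OPS = {"inc": lambda x: x + 1, "dec": lambda x: x - 1, "dbl": lambda x: 2 * x}

def run(program, x, budget):
    """Run `program` on x; each opcode costs one step. In this toy language the
    runtime is just the program length, so the budget check is done up front."""
    if len(program) > budget:
        return None
    for op in program:
        x = OPS[op](x)
    return x

def levin_search(examples, max_phase=12):
    """Phase k spends about 2**k total steps; each program of length l gets a
    share roughly proportional to a simple prior 3**(-l), as in Levin's scheme."""
    for k in range(1, max_phase + 1):
        length = 1
        while True:
            budget = (2 ** k) // (3 ** length)   # steps granted per program of this length
            if budget < length:                  # slice too small to finish running
                break
            for program in itertools.product(OPS, repeat=length):
                if all(run(program, x, budget) == y for x, y in examples):
                    return program               # short, fast solutions are reached first
            length += 1
    return None

# Target concept f(x) = 2*x + 1; the shortest solution is ("dbl", "inc").
print(levin_search([(0, 1), (1, 3), (2, 5)]))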
**********************************************
 On May 20 Sydney Lamb replied to Jonathan Stein and others:
 
" There seems to be some differing information coming from different
 sources.  The way I heard it, the typical person has lost only about 3%
 of the original total of cortical neurons after about 70 or 80 years.
 
 As for the argument about distributed processing, two comments: (1)
there are different kinds of distributed processing; one of them also
uses strict localization of points of convergence for distributed
subnetworks of information (cf. A. Damasio 1989 --- several papers
that year).  (2) If the brain is like other biological systems, the neurons
being lost are probably mostly the ones not being used --- ones that have
been remaining latent and available to assume some function, but never
called upon. Hence what you get with old age is not so much loss of
information as loss of ability to learn new things --- varying in amount,
of course, from one individual to the next."
 *****************************************
 On May 20 Mark Johnson replies to Raj Rao:
 
 "From my reading of the recent literature massive postnatal cell loss in
the human cortex is a myth.  There is postnatal cortical cell death in
rodents, but in primates (including humans) there is only (i) a
decreased density of cell packing, and (ii) massive (up to 50%)
synapse loss.  (The decreased density of cell packing was apparently
misinterpreted as cell loss in the past).  Of course, there are
pathological cases, such as Alzheimer's, in which there is cell loss.
 
 I have written a review of human postnatal brain development which I
can send out on request."
**************************************************
***************************************************
APPENDIX
 
We have recently published a set of principles for learning in neural
networks/connectionist models that is different from classical
connectionist learning (Neural Networks, Vol. 8, No. 2; IEEE
Transactions on Neural Networks, to appear; see references
below). Below is a brief summary of the new learning theory and
why we think classical connectionist learning, which is
characterized by pre-defined nets, local learning laws and
memoryless learning (no storing of training examples for learning),
is not brain-like at all. Since vigorous and open debate is very
healthy for a scientific field, we invite comments for and against our
ideas from all sides.
 
 
"A New Theory for Learning in Connectionist Models"
 
We believe that a good rigorous theory for artificial neural
networks/connectionist models should include learning methods
that perform the following tasks or adhere to the following criteria:
 
A. Perform Network Design Task: A neural network/connectionist
learning method must be able to design an appropriate network for
a given problem, since, in general, it is a task performed by the
brain. A pre-designed net should not be provided to the method as
part of its external input, since it never is an external input to the
brain. From a neuroengineering and neuroscience point of view, this
is an essential property for any "stand-alone" learning system - a
system that is expected to learn "on its own" without any external
design assistance.
 
B. 	Robustness in Learning: The method must be robust so as
not to have the local minima problem, the problems of oscillation
and catastrophic forgetting, the problem of recall or lost memories
and similar learning difficulties. Some people might argue that
ordinary brains, and particularly  those with learning disabilities, do
exhibit such problems and that these learning requirements are the
attributes only of a "super" brain. The goal of neuroengineers and
neuroscientists is to design and build learning systems that are
robust, reliable and powerful. They have no interest in creating
weak and problematic learning devices that need constant attention
and intervention.
 
C. 	Quickness in Learning: The method must be quick in its
learning and learn rapidly from only a few examples, much as
humans do. For example, a method which learns from only 10 examples
learns faster than one which requires 100 or 1,000 examples. We
have shown that on-line learning (see references below), when not
allowed to store training examples in memory, can be extremely
slow in learning - that is, it would require many more examples to
learn a given task compared to methods that use memory to
remember training examples. It is not desirable that a neural
network/connectionist learning system be similar in characteristics
to learners characterized by such sayings as "Told him a million
times and he still doesn't understand." On-line learning systems
must learn rapidly from only a few examples.
 
D. 	Efficiency in Learning: The method must be
computationally efficient in its learning when provided with a finite
number of training examples (Minsky and Papert[1988]). It must be
able to both design and train an appropriate net in polynomial time.
That is, given P examples, the learning time (i.e. both design and
training time) should be a polynomial function of P. This, again, is a
critical computational property from a neuroengineering and
neuroscience point of view.  This property has its origins in the
belief that  biological systems (insects, birds for example) could not
be solving NP-hard problems, especially when efficient, polynomial
time learning methods can conceivably be designed and developed.
 
E. 	Generalization in Learning: The method must be able to
generalize reasonably well so that only a small amount of network
resources is used. That is, it must try to design the smallest possible
net, although it might not be able to do so every time. This must be
an explicit part of the algorithm. This property is based on the
notion that the brain could not be wasteful of its limited resources,
so it must be trying to design the smallest possible net for every
task.
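 
As one concrete illustration of criterion E (this is only a generic sketch,
not the RBF or multilayer perceptron algorithms of the references below), a
constructive method can make the search for a small net explicit: grow the
hidden layer one candidate size at a time and keep the first, i.e. smallest,
size that reaches a target fit. The trainer, data and threshold below are
arbitrary stand-ins.
 
# Generic constructive sketch of "try to design the smallest possible net".
# Assumes numpy and scikit-learn; not the algorithms from the references.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def smallest_net(X, y, candidate_sizes=(1, 2, 3, 5, 8, 13, 21), target=0.95):
    """Return the smallest hidden-layer size whose cross-validated accuracy
    reaches `target`, or the best size found otherwise."""
    best = None
    for h in candidate_sizes:                    # smallest candidates first
        net = MLPClassifier(hidden_layer_sizes=(h,), max_iter=5000,
                            random_state=0)
        score = cross_val_score(net, X, y, cv=3).mean()
        if best is None or score > best[1]:
            best = (h, score)
        if score >= target:                      # explicit bias toward small nets
            return h, score
    return best

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # a simple XOR-like task
print(smallest_net(X, y))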
 
 
General Comments
 
This theory defines algorithmic characteristics that are obviously
much more brain-like than those of classical connectionist theory,
which is characterized by pre-defined nets, local learning laws and
memoryless learning (no storing of actual training examples for
learning). Judging by the above characteristics, classical
connectionist learning is not very powerful or robust. First of all, it
does not even address the issue of network design, a task that
should be central to any neural network/connectionist learning
theory. It is also plagued by efficiency (lack of polynomial time
complexity, need for excessive number of teaching examples) and
robustness problems (local minima, oscillation, catastrophic
forgetting, lost memories), problems that are partly acquired from
its attempt to learn without using memory. Classical connectionist
learning, therefore, is not very brain-like at all.
 
As far as I know, there is no biological evidence for any of the
premises of classical connectionist learning. Without having to
reach into biology, simple common sense arguments can show that
the ideas of local learning, memoryless learning and predefined nets
are impractical even for the brain! For example, the idea of local
learning requires a predefined network. Classical connectionist
learning forgot to ask a very fundamental question - who designs
the net for the brain? The answer is very simple: Who else, but the
brain itself! So, who should construct the net for a neural net
algorithm? The answer again is very simple: Who else, but the
algorithm itself! (By the way, this is not a criticism of constructive
algorithms that do design nets.) Under classical connectionist
learning, a net has to be constructed (by someone, somehow - but
not by the algorithm!) prior to having seen a single training
example! I cannot imagine any system, biological or otherwise,
being able to construct a net with zero information about the
problem to be solved and with no knowledge of the complexity of
the problem. (Again, this is not a criticism of constructive
algorithms.)
 
A good test for a so-called "brain-like" algorithm is to imagine it
actually being part of a human brain. Then examine the learning
phenomenon of the algorithm and compare it with that of the
human's. For example, pose the following question: If an algorithm
like back propagation is "planted" in the brain, how will it behave?
Will it be similar to human behavior in every way? Look at the
following simple "model/algorithm" phenomenon when the back-
propagation algorithm is "fitted" to a human brain. You give it a
few learning examples for a simple problem and after a while this
"back prop fitted" brain says: "I am stuck in a local minimum. I
need to relearn this problem. Start over again." And you ask:
"Which examples should I go over again?" And this "back prop
fitted" brain replies: "You need to go over all of them. I don't
remember anything you told me." So you go over the teaching
examples again. And let's say it gets stuck in a local minimum again
and, as usual, does not remember any of the past examples. So you
provide the teaching examples again and this process is repeated a
few times until it learns properly. The obvious questions are as
follows: Is "not remembering" any of the learning examples a brain-
like phenomenon? Are the interactions with this so-called "brain-
like" algorithm similar to what one would actually encounter with a
human in a similar situation? If the interactions are not similar, then
the algorithm is not brain-like. A so-called brain-like algorithm's
interactions with the external world/teacher cannot be different
from that of the human.
 
In the context of this example, it should be noted that
storing/remembering relevant facts and examples is very much a
natural part of the human learning process. Without the ability to
store and recall facts/information and discuss, compare and argue
about them, our ability to learn would be in serious jeopardy.
Information storage facilitates mental comparison of facts and
information and is an integral part of rapid and efficient learning. It
is not biologically justified when "brain-like" algorithms disallow
usage of memory to store relevant information.
 
Another typical phenomenon of classical connectionist learning is
the "external tweaking" of algorithms. How many times do we
"externally tweak" the brain (e.g. adjust the net, try a different
parameter setting) for it to learn? Interactions with a brain-like
algorithm have to be brain-like indeed in all respects.
 
The learning scheme postulated above does not specify how
learning is to take place - that is, whether or not memory is to be used
to store training examples for learning, or whether learning is to
be through local learning at each node in the net or through some
global mechanism. It merely defines broad computational
characteristics and tasks (i.e. fundamental learning principles) that
are brain-like and that all neural network/connectionist algorithms
should follow. But there is complete freedom otherwise in
designing the algorithms themselves. We have shown that robust,
reliable learning algorithms can indeed be developed that satisfy
these learning principles (see references below). Many constructive
algorithms satisfy many of the learning principles defined above.
They can, perhaps, be modified to satisfy all of the learning
principles.
 
The learning theory above defines computational and learning
characteristics that have always been desired by the neural
network/connectionist field. It is difficult to argue that these
characteristics are not "desirable," especially for self-learning, self-
contained systems.  For neuroscientists and neuroengineers, it
should open the door to development of brain-like systems they
have always wanted - those that can learn on their own without any
external intervention or assistance, much like the brain. It essentially
tries to redefine the nature of algorithms considered to be brain-
like. And it defines the foundations for developing truly self-
learning systems - ones that wouldn't require constant intervention
and tweaking by external agents (human experts) for them to learn.
 
It is perhaps time to reexamine the foundations of the neural
network/connectionist field. This mailing list/newsletter provides an
excellent opportunity for participation by all concerned throughout
the world. I am looking forward to a lively debate on these matters.
That is how a scientific field makes real progress.
 
 
Asim Roy
Arizona State University
Tempe, Arizona 85287-3606, USA
Email: ataxr at asuvm.inre.asu.edu
 
 
References
 
1.  Roy, A., Govil, S. & Miranda, R. 1995. A Neural Network
Learning Theory and a Polynomial Time RBF Algorithm. IEEE
Transactions on Neural Networks, to appear.
 
2.  Roy, A., Govil, S. & Miranda, R. 1995. An Algorithm to
Generate Radial Basis Function (RBF)-like Nets for Classification
Problems. Neural Networks, Vol. 8, No. 2, pp. 179-202.
 
3.  Roy, A., Kim, L.S. & Mukhopadhyay, S. 1993. A Polynomial
Time Algorithm for the Construction and Training of a Class of
Multilayer Perceptrons. Neural Networks, Vol. 6, No. 4, pp. 535-
545.
 
4.  Mukhopadhyay, S., Roy, A., Kim, L.S. & Govil, S. 1993. A
Polynomial Time Algorithm for Generating Neural Networks for
Pattern Classification - its Stability Properties and Some Test
Results. Neural Computation, Vol. 5, No. 2, pp. 225-238.

