From joho%sw.MCC.COM at MCC.COM Thu Mar 2 13:18:40 1989 From: joho%sw.MCC.COM at MCC.COM (Josiah Hoskins) Date: Thu, 2 Mar 89 12:18:40 CST Subject: Tech Report Announcement Message-ID: <8903021818.AA22902@jelly.sw.mcc.com> The following tech report is available. Speeding Up Artificial Neural Networks in the "Real" World Josiah C. Hoskins A new heuristic, called focused-attention backpropagation (FAB) learning, is introduced. FAB enhances the backpropagation procedure by focusing attention on the exemplar patterns that are most difficult to learn. Results are reported using FAB learning to train multilayer feed-forward artificial neural networks to represent real-valued elementary functions. Learning using FAB is observed to be 1.5 to 10 times faster than with standard backpropagation. Requests for copies should refer to MCC Technical Report Number STP-049-89 and should be sent to Kintner at mcc.com or to Josiah C. Hoskins MCC - Software Technology Program AT&T: (512) 338-3684 9390 Research Blvd, Kaleido II Bldg. UUCP/USENET: milano!joho Austin, Texas 78759 ARPA/INTERNET: joho at mcc.com From cfields at NMSU.Edu Fri Mar 3 17:16:53 1989 From: cfields at NMSU.Edu (cfields@NMSU.Edu) Date: Fri, 3 Mar 89 15:16:53 MST Subject: No subject Message-ID: <8903032216.AA17939@NMSU.Edu> _________________________________________________________________________ The following are abstracts of papers appearing in the inaugural issue of the Journal of Experimental and Theoretical Artificial Intelligence. JETAI 1, 1 was published 1 January, 1989.
For submission information, please contact either of the editors: Eric Dietrich Chris Fields PACSS - Department of Philosophy Box 30001/3CRL SUNY Binghamton New Mexico State University Binghamton, NY 13901 Las Cruces, NM 88003-0001 dietrich at bingvaxu.cc.binghamton.edu cfields at nmsu.edu JETAI is published by Taylor & Francis, Ltd., London, New York, Philadelphia _________________________________________________________________________ Minds, machines and Searle Stevan Harnad Behavioral & Brain Sciences, 20 Nassau Street, Princeton NJ 08542, USA Searle's celebrated Chinese Room Argument has shaken the foundations of Artificial Intelligence. Many refutations have been attempted, but none seem convincing. This paper is an attempt to sort out explicitly the assumptions and the logical, methodological and empirical points of disagreement. Searle is shown to have underestimated some features of computer modeling, but the heart of the issue turns out to be an empirical question about the scope and limits of the purely symbolic (computational) model of the mind. Nonsymbolic modeling turns out to be immune to the Chinese Room Argument. The issues discussed include the Total Turing Test, modularity, neural modeling, robotics, causality and the symbol-grounding problem. _________________________________________________________________________ Explanation-based learning: its role in problem solving Brent J. Krawchuck and Ian H. Witten Knowledge Sciences Laboratory, Department of Computer Science, University of Calgary, 2500 University Drive, NW, Calgary, Alta, Canada, T2N 1N4. `Explanation-based' learning is a semantically-driven, knowledge-intensive paradigm for machine learning which contrasts sharply with syntactic or `similarity-based' approaches. This paper redevelops the foundations of EBL from the perspective of problem-solving. 
Viewed in this light, the technique is revealed as a simple modification to an inference engine which gives it the ability to generalize the conditions under which the solution to a particular problem holds. We show how to embed generalization invisibly within the problem solver, so that it is accomplished as inference proceeds rather than as a separate step. The approach is also extended to the more complex domain of planning to illustrate that it is applicable to a variety of logic-based problem-solvers and is by no means restricted to only simple ones. We argue against the current trend to isolate learning from other activity and study it separately, preferring instead to integrate it into the very heart of problem solving. ---------------------------------------------------------------------------- The recognition and classification of concepts in understanding scientific texts Fernando Gomez and Carlos Segami Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA. In understanding a novel scientific text, we may distinguish the following processes. First, concepts are built from the logical form of the sentence into the final knowledge structures. This is called concept formation. While these concepts are being formed, they are also being recognized by checking whether they are already in long-term memory (LTM). Then, those concepts which are unrecognized are integrated in LTM. In this paper, algorithms for the recognition and integration of concepts in understanding scientific texts are presented. It is shown that the integration of concepts in scientific texts is essentially a classification task, which determines how and where to integrate them in LTM. In some cases, the integration of concepts results in a reclassification of some of the concepts already stored in LTM. All the algorithms described here have been implemented and are part of SNOWY, a program which reads short scientific paragraphs and answers questions.
--------------------------------------------------------------------------- Exploring the No-Function-In-Structure principle Anne Keuneke and Dean Allemang Laboratory for Artificial Intelligence Research, Department of Computer and Information Science, The Ohio State University, 2036 Neil Avenue Mall, Columbus, OH 43210-1277, USA. Although much of past work in AI has focused on compiled knowledge systems, recent research shows renewed interest and advanced efforts both in model-based reasoning and in the integration of this deep knowledge with compiled problem solving structures. Device-based reasoning can only be as good as the model used; if the needed knowledge, correct detail, or proper theoretical background is not accessible, performance deteriorates. Much of the work on model-based reasoning references the `no-function-in-structure' principle, which was introduced by de Kleer and Brown. Although de Kleer and Brown were well motivated in establishing the guideline, this paper explores the applicability and workability of the concept as a universal principle for model representation. This paper first describes the principle, its intent and the concerns it addresses. It then questions the feasibility and the practicality of the principle as a universal guideline for model representation. ___________________________________________________________________________ From jbower at bek-mc.caltech.edu Sun Mar 5 21:09:10 1989 From: jbower at bek-mc.caltech.edu (Jim Bower) Date: Sun, 5 Mar 89 18:09:10 pst Subject: Summer course in computational neurobiology Message-ID: <8903060209.AA03962@bek-mc.caltech.edu> Course announcement: Methods in Computational Neuroscience The Marine Biological Laboratory Woods Hole, Massachusetts August 6 - September 2, 1989 General Description The Marine Biological Laboratory (MBL) in Woods Hole, Massachusetts is a world-famous marine biological laboratory that has been in existence for over 100 years.
In addition to providing research facilities for a large number of biologists during the summer, the MBL also sponsors a number of outstanding courses on different topics in Biology. This summer will be the second year in which the MBL has offered a course in "Methods in Computational Neuroscience". This course is designed as a survey of the use of computer modeling techniques in studying the information processing capabilities of the nervous system and covers models at all levels from biologically realistic single cells and networks of cells to biologically relevant abstract models. The principal aim of the course is to provide participants with the tools to simulate the functional properties of those neural systems of interest to them as well as to understand the general advantages and pitfalls of this experimental approach. The Specific Structure of the Course The course itself includes both a lecture series and a computer laboratory. The lectures are given by invited faculty whose work represents the state of the art in computational neuroscience (see list below). The course lecture notes have been incorporated into a book published by MIT Press ("Methods in Neuronal Modeling: From Synapses to Networks", C. Koch and I. Segev, editors. MIT Press, Cambridge, MA, 1989). The computer laboratory is designed to give students hands-on experience with the simulation techniques considered in the lectures. It also provides students with the opportunity to actually begin simulations of neural systems of interest to them. The students are guided in this effort by the visiting lecturers and course directors, but also by several students from the Computational Neural Systems (CNS) graduate program at Caltech who serve as Laboratory TAs. The lab itself consists of state-of-the-art graphics workstations running a GEneral NEtwork SImulation System (GENESIS) that Dr. Bower and his colleagues at Caltech have constructed over the last several years.
Students return to their home institutions with the GENESIS system to continue their work. The Students The course is designed for advanced graduate students and postdoctoral fellows in biology, computer science, electrical engineering, physics, or psychology with an interest in computational neuroscience. Because of the heavy computer orientation of the Lab section, a good computer background is required (UNIX, C or PASCAL). In addition, students are expected to have a basic background in neurobiology. Course enrollment is limited to 20 so as to assure the highest quality educational experience. Course Directors James M. Bower and Christof Koch Computation and Neural Systems Program California Institute of Technology The Faculty Paul Adams (Stony Brook) Dan Alkon (NIH) Richard Anderson (MIT) John Hildebrand (Arizona) John Hopfield (Caltech) Rodolfo Llinas (NYU) David Rumelhart (Stanford) Idan Segev (Jerusalem) Terrence Sejnowski (Salk/UCSD) David Van Essen (Caltech) Christoph von der Malsburg (USC) For further information and application materials contact: Admissions Coordinator Marine Biological Laboratory Woods Hole, MA 02543 (508) 548-3705 extension 216 Application Deadline May 15, 1989 Acceptance notification in early June. From mjolsness-eric at YALE.ARPA Tue Mar 7 21:23:16 1989 From: mjolsness-eric at YALE.ARPA (Eric Mjolsness) Date: Tue, 7 Mar 89 21:23:16 EST Subject: "Transformations" tech report Message-ID: <8903080223.AA17992@NEBULA.SUN3.CS.YALE.EDU> A new technical report is available: "Algebraic Transformations of Objective Functions" (YALEU/DCS/RR-686) by Eric Mjolsness and Charles Garrett Yale Department of Computer Science P.O. Box 2158 Yale Station New Haven CT 06520 Abstract: A standard neural network design trick reduces the number of connections in the winner-take-all (WTA) network from O(N^2) to O(N). We explain the trick as a general fixpoint-preserving transformation applied to the particular objective function associated with the WTA network.
The key idea is to introduce new interneurons which act to maximize the objective, so that the network seeks a saddle point rather than a minimum. A number of fixpoint-preserving transformations are derived, allowing the simplification of such algebraic forms as products of expressions, functions of one or two expressions, and sparse matrix products. The transformations may be applied to reduce or simplify the implementation of a great many structured neural networks, as we demonstrate for inexact graph-matching, convolutions and coordinate transformations, and sorting. Simulations show that fixpoint-preserving transformations may be applied repeatedly and elaborately, and the example networks still robustly converge. We discuss implications for circuit design. To request a copy, please send your physical address by e-mail to mjolsness-eric at cs.yale.edu OR mjolsness-eric at yale.arpa (old style) Thank you. ------- From prlb2!vub.vub.ac.be!prog1!wplybaer at uunet.UU.NET Tue Mar 7 19:34:21 1989 From: prlb2!vub.vub.ac.be!prog1!wplybaer at uunet.UU.NET (Wim P. Lybaert) Date: Wed, 8 Mar 89 01:34:21 +0100 Subject: No subject Message-ID: <8903080034.AA10074@prog1.vub.ac.be> Hi, i would like to be placed on the connectionist neural nets mailing list that you distribute. Thanks, Wim Lybaert Brussels Free University Department PROG Oefenplein 2 1040 BRUSSELS BELGIUM email: From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Wed Mar 8 11:36:31 1989 From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias) Date: Wed, 08 Mar 89 11:36:31 EST Subject: information function vs. squared error Message-ID: i am looking for pointers to papers discussing the use of an alternative criterion to squared error, in back propagation algorithms. the alternative function i have in mind is called (in different contexts and/or authors) cross entropy, entropy, information, inf. divergence and so on. 
it is defined something like: G = sum_{i=1}^{N} p_i * log(p_i) i am not quite sure what the index i runs through: units, weights or something else. i know people have been talking about this a lot, i just cannot remember where i read about it ... it seems like Geoff Hinton's group has worked on this. thanks, Thanasis From mdp%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Thu Mar 9 08:16:07 1989 From: mdp%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (Mark Plumbley) Date: Thu, 9 Mar 89 13:16:07 GMT Subject: information function vs. squared error Message-ID: <14398.8903091316@dsl.eng.cam.ac.uk> Thanasis, The "G" function you mentioned, based on an Entropy method, is probably the one developed by Pearlmutter and Hinton as a procedure for unsupervised learning of binary units [1]. More recently, Linsker [2,3] and Plumbley and Fallside [4] considered the principle of maximum information transmission (or minimum information loss) for continuous units, relating this to Principal Component methods for linear units. Unfortunately, these are mainly about unsupervised learning, rather than Backprop specifically, although in [4] we do look at the way the mean-squared error criterion places an *upper-bound* on the information loss through a supervised network. This bound will be tightest when the errors on all the output units are independent and have the same variance (or the same entropy for non-additive-Gaussian errors). *If* you can choose the target representation used by Backprop so that the errors are likely to have these properties, it should perform closer to the (information-theoretic) optimal. Hope this is some help, Mark. References: [1] B. A. Pearlmutter and G. E. Hinton: "G-Maximization: An Unsupervised Learning Procedure for Discovering Regularities". In Proceedings of the Conference on `Neural Networks for Computing'. American Institute of Physics, 1986. [2] R. Linsker: "Towards an Organisational Principle for a Layered Perceptual Network".
In "Neural Information Processing Systems (Denver, CO. 1987)" (Ed. D. Z. Anderson), pp. 485-494. American Institute of Physics, 1988. [3] R. Linsker: "Self-Organization in a Perceptual Network". IEEE Computer, vol. 21 (3), March 1988, pp. 105-117. [4] M. D. Plumbley and F. Fallside: "An Information-Theoretic Approach to Unsupervised Connectionist Models". Tech. Report CUED/F-INFENG/TR.7. Cambridge University Engineering Department, 1988. Also in "Proceedings of the 1988 Connectionist Models Summer School", pp. 239-245. Morgan-Kaufmann, San Mateo, CA. +--------------------------------------------+---------------------------+ | Mark Plumbley | Cambridge University | | JANET: mdp at uk.ac.cam.eng.dsl | Engineering Department, | | ARPANET: | Trumpington Street, | | mdp%dsl.eng.cam.ac.uk at nss.cs.ucl.ac.uk | Cambridge CB2 1PZ | | Tel: +44 223 332754 Fax: +44 223 332662 | UK | +--------------------------------------------+---------------------------+ From becker at ai.toronto.edu Thu Mar 9 13:26:38 1989 From: becker at ai.toronto.edu (becker@ai.toronto.edu) Date: Thu, 9 Mar 89 13:26:38 EST Subject: information function vs. squared error Message-ID: <89Mar9.132645est.10489@ephemeral.ai.toronto.edu> The use of the cross-entropy measure G = p log(p/q) + (1-p) log((1-p)/(1-q)) (Kullback, 1959), where p and q are the probabilities of a binary random variable under two probability distributions, has been described in at least 3 different contexts in the connectionist literature: (i) As an objective function for supervised back-propagation; this is appropriate if the output units are computing real values which are to be interpreted as probability distributions over the space of binary output vectors (Hinton, 1987). Here G-error represents the divergence between the desired and observed distributions. (ii) As an objective function for Boltzmann machine learning (Hinton and Sejnowski, 1986), where p and q are the output distributions in the + and - phases.
(iii) In the Gmax unsupervised learning algorithm (Pearlmutter and Hinton, 1986) as a measure of the difference between the actual output distribution of a unit and the predicted distribution assuming independent input lines. References: Hinton, G. E. 1987. "Connectionist Learning Procedures", Revised version of Technical Report CMU-CS-87-115, to appear (appeared ?) in Artificial Intelligence. Hinton, G. E. and Sejnowski, T. J. 1986. "Learning and relearning in Boltzmann machines", in Parallel distributed processing: Explorations in the microstructure of cognition, Bradford Books. Kullback, S., 1959. "Information Theory and Statistics", New York: Wiley. Pearlmutter, B. A. and Hinton, G. E. 1986. "G-Maximization: An unsupervised learning procedure for discovering regularities.", Neural Networks for Computing: American Institute of Physics Conference Proceedings 151. Sue Becker DCS, University of Toronto From mehra at aquinas.csl.uiuc.edu Fri Mar 10 05:43:16 1989 From: mehra at aquinas.csl.uiuc.edu (Pankaj Mehra) Date: Fri, 10 Mar 89 04:43:16 CST Subject: No subject Message-ID: <8903101043.AA02586@aquinas> I have recently explored several connectionist models for learning under _realistic_ learning scenarios. The class of problems for which we are trying to acquire solutions by learning are decision problems with the following characteristics: (i) large number of continuous-valued PARAMETERS, each of which (ia) takes on values from a finite range with a nonstationary distribution (ib) costs more to measure accurately. {however, accuracy can be controlled by focussed sampling} (ic) is not known to follow any particular parametric distribution (ii) the optimization CRITERION (energy, if you will) is ill-defined {much like the _blackbox_ in David Ackley's thesis} (iii) a set of OPERATORS is available, and these are the _only_ instruments for manipulating the problem state. 
(iiia) the _causal_ relationships between the states before and after the application of the operator are not known (iiib) the _persistence_ model is incomplete - i.e. it is not known a priori as to when the effect of an action will be felt and how long it will persist (iv) the TRAINING ENVIRONMENT is _slow reactive_ : it can be assumed to produce reinforcement (prescriptive feedback) rather than an error (evaluative feedback); however, the delays between an action and subsequent reinforcement follow an _unknown_ distribution. ------- These have been called Dynamic Decision Problems, and shown to be a rich class, in the following publication [available upon request from the first author]: Mehra, P. and B. W. Wah, "Architectures for Strategy Learning," in Computer Architectures for Artificial Intelligence Applications, ed. B. Wah and C. Ramamoorthy, Wiley, New York, NY, 1989 (in press). {send e-mail to: mehra at cs.uiuc.edu} ------- The above publication also examines the applicability of other well-known learning techniques {empirical, probabilistic, decision theoretic, EBL, hybrid techniques, learning to plan, etc} and suggests why ANSs might be preferred over others. As a part of this comparison, several contemporary connectionist models were found lacking in certain respects. I shall summarize the criticisms here, and would like to have feedback from those who have supported the use of these techniques. BACK-PROPAGATION: positive aspects: Simplicity of programming the learning algorithm An effective procedure for tuning of large parameter sets representable as _band matrices_ (layered networks) problematic assumptions: Immediate feedback Corrective {as against prescriptive} feedback [I am aware of Ron Williams' work, though] weakness as a learning approach Requires tweaking of features (normalization biases) to the extent that the degree of generalization varies drastically as the degree of coarse coding changes.
A great part of the success in particular applications could therefore be attributed to the intelligence of the researcher who codes those features {rather than to the _learning_ algorithm} REINFORCEMENT LEARNING positive aspects Can handle prescriptive feedback Has been shown {Rich Sutton, Chuck Anderson} to work with delayed feedback problematic assumptions The implementations known to this author assume : persistence of effects decays _exponentially_ with time : heuristic assumptions such as "recency" (that the more recent an action is, the more it is responsible for the feedback) and frequency (that the more frequently an action occurs preceding the feedback, the more likely it is to have caused the feedback) are _hardwired_ into the learning algorithms All the knowledge needed for learning is implicit, as if the learning critter was born with algorithms assuming exponential decay and as if all actions in the world caused similar delay patterns The nodes of the network compute functions much more complex than in the case of classical back-propagation. weakness as a learning paradigm All actions that occur at the same time and with the same frequency are assumed equally likely to have caused the feedback. (i.e., these algorithms have an implicitly coded causal model) No scope for using the same network to choose between actions having different causal and persistence assumptions. The learning algorithm amounts to a procedural encoding of environmental knowledge. Any success of these algorithms in realistic applications is in large part due to the intelligence of the designer and the effort they put in (for example, to find just the right lambda for the exponential decay factor). ------- See my paper for details of Dynamic Decision Problems and an extensive study of how the basic learning model underlying _most_ of the existing learning algorithms (either in AI or Connectionism) is at odds with the requirements of training in the real world.
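The hardwired exponential-decay ("recency") heuristic criticized above is easy to state concretely. The sketch below (hypothetical names, not code from the paper) shows how such implementations weight candidate actions by their age when a delayed reinforcement finally arrives:

```python
# Minimal credit-assignment sketch illustrating the hardwired "recency"
# heuristic: credit for a delayed reinforcement decays exponentially
# (factor lam) with the age of each action. The function name and lam
# are illustrative assumptions, not taken from the paper.

def assign_credit(action_times, reinforcement_time, lam=0.9):
    """Return a credit weight for each action, decaying with its age."""
    return [lam ** (reinforcement_time - t) for t in action_times]

# Two actions taken at t=0 and t=4; reinforcement arrives at t=5.
credits = assign_credit([0, 4], 5, lam=0.9)
# The later action receives far more credit, regardless of true causality.
assert credits[1] > credits[0]
```

This is exactly the implicit causal model being criticized: the decay constant lam encodes environmental knowledge in the algorithm itself.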
Comments welcome from those who read the paper, as well as from those who just want to discuss the material of this basenote. - Pankaj {Mehra at cs.uiuc.edu} From mike at bucasb.BU.EDU Fri Mar 10 12:22:14 1989 From: mike at bucasb.BU.EDU (Michael Cohen) Date: Fri, 10 Mar 89 12:22:14 EST Subject: network meeting announcement for distribution Message-ID: <8903101722.AA27914@bucasb.bu.edu> NEURAL NETWORK MODELS OF CONDITIONING AND ACTION 12th Symposium on Models of Behavior Friday and Saturday, June 2 and 3, 1989 105 William James Hall, Harvard University 33 Kirkland Street, Cambridge, Massachusetts PROGRAM COMMITTEE: Michael Commons, Harvard Medical School Stephen Grossberg, Boston University John E.R. Staddon, Duke University JUNE 2, 8:30AM--11:45AM ----------------------- Daniel L. Alkon, ``Pattern Recognition and Storage by an Artificial Network Derived from Biological Systems'' John H. Byrne, ``Analysis and Simulation of Cellular and Network Properties Contributing to Learning and Memory in Aplysia'' William B. Levy, ``Synaptic Modification Rules in Hippocampal Learning'' JUNE 2, 1:00PM--5:15PM ---------------------- Gail A. Carpenter, ``Recognition Learning by a Hierarchical ART Network Modulated by Reinforcement Feedback'' Stephen Grossberg, ``Neural Dynamics of Reinforcement Learning, Selective Attention, and Adaptive Timing'' Daniel S. Levine, ``Simulations of Conditioned Perseveration and Novelty Preference from Frontal Lobe Damage'' Nestor A. Schmajuk, ``Neural Dynamics of Hippocampal Modulation of Classical Conditioning'' JUNE 3, 8:30AM--11:45AM ----------------------- John W. Moore, ``Implementing Connectionist Algorithms for Classical Conditioning in the Brain'' Russell M. Church, ``A Connectionist Model of Scalar Timing Theory'' William S. Maki, ``Connectionist Approach to Conditional Discrimination: Learning, Short-Term Memory, and Attention'' JUNE 3, 1:00PM--5:15PM ---------------------- Michael L. 
Commons, ``Models of Acquisition and Preference'' John E.R. Staddon, ``Simple Parallel Model for Operant Learning with Application to a Class of Inference Problems'' Alliston K. Reid, ``Computational Models of Instrumental and Scheduled Performance'' Stephen Jose Hanson, ``Behavioral Diversity, Hypothesis Testing, and the Stochastic Delta Rule'' Richard S. Sutton, ``Time Derivative Models of Pavlovian Reinforcement'' FOR REGISTRATION INFORMATION SEE ATTACHED OR WRITE: Dr. Michael L. Commons Society for Quantitative Analysis of Behavior 234 Huron Avenue Cambridge, MA 02138 ---------------------------------------------------------------------- ---------------------------------------------------------------------- REGISTRATION FEE BY MAIL (Paid by check to Society for Quantitative Analysis of Behavior) (Postmarked by April 30, 1989) Name: ______________________________________________ Title: _____________________________________________ Affiliation: _______________________________________ Address: ___________________________________________ Telephone(s): ______________________________________ E-mail address: ____________________________________ ( ) Regular $35 ( ) Full-time student $25 School ____________________________________________ Graduate Date _____________________________________ Print Faculty Name ________________________________ Faculty Signature _________________________________ PREPAID 10-COURSE CHINESE BANQUET ON JUNE 2 ( ) $20 (add to pre-registration fee check) ----------------------------------------------------------------------------- (cut here and mail with your check to) Dr. Michael L. Commons Society for Quantitative Analysis of Behavior 234 Huron Avenue Cambridge, MA 02138 REGISTRATION FEE AT THE MEETING ( ) Regular $45 ( ) Full-time Student $30 (Students must show active student I.D. 
to receive this rate) ON SITE REGISTRATION 5:00--8:00PM, June 1, at the RECEPTION in Room 1550, William James Hall, 33 Kirkland Street, and 7:30--8:30AM, June 2, in the LOBBY of William James Hall. Registration by mail before April 30, 1989 is recommended as seating is limited. HOUSING INFORMATION Rooms have been reserved in the name of the symposium for the Friday and Saturday nights at: Best Western Homestead Inn 220 Alewife Brook Parkway Cambridge, MA 02138 Single: $72 Double: $80 Reserve your room as soon as possible. The hotel will not hold them past March 31. Because of Harvard and MIT graduation ceremonies, space will fill up rapidly. Other nearby hotels: Howard Johnson's Motor Lodge 777 Memorial Drive Cambridge, MA 02139 (617) 492-7777 (800) 654-2000 Single: $115--$135 Double: $115--$135 Suisse Chalet 211 Concord Turnpike Parkway Cambridge, MA 02140 (617) 661-7800 (800) 258-1980 Single: $48.70 Double: $52.70 --------------------------------------------------------------------------- From homxb!solla at research.att.com Fri Mar 10 13:10:00 1989 From: homxb!solla at research.att.com (homxb!solla@research.att.com) Date: Fri, 10 Mar 89 13:10 EST Subject: Cross-entropy error Message-ID: A detailed discussion of the cross-entropy error measure for back propagation, and a comparative study of its merits relative to the more commonly used quadratic measure, are to be found in "Accelerated Learning in Layered Neural Networks" by S.A. Solla, E. Levin, and M. Fleisher. The paper has appeared in "Complex Systems", Vol. 2, 1988. Two other relevant references to the use of such an error function in the context of supervised learning are: E.B. Baum and F. Wilczek, "Supervised Learning of Probability Distributions by Neural Network" in "Neural Information Processing Systems", ed. by D. Anderson (AIP, New York, 1988) J.J. Hopfield, "Learning Algorithms and Probability Distributions in Feed-forward and Feed-back Networks", Proc. Natl. Acad. Sci. USA, Vol. 84, 1987, pp. 8429-8433.
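The accelerated learning reported in these papers can be seen directly in the gradients: for a sigmoid output unit, the quadratic error's derivative with respect to the unit's net input carries a factor y(1-y) that vanishes when the unit saturates at the wrong answer, while for cross-entropy that factor cancels. A small illustration (my own sketch, not code from any of the cited papers):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def quadratic_grad(a, t):
    """d/da of 0.5*(y - t)^2 for y = sigmoid(a): keeps the y*(1-y) factor."""
    y = sigmoid(a)
    return (y - t) * y * (1.0 - y)

def cross_entropy_grad(a, t):
    """d/da of -[t*log(y) + (1-t)*log(1-y)]: the y*(1-y) factor cancels."""
    y = sigmoid(a)
    return y - t

# A saturated, badly wrong unit: target t=1, net input a=-10.
# The quadratic gradient is nearly zero, so learning stalls;
# the cross-entropy gradient stays near -1, so learning proceeds.
a, t = -10.0, 1.0
```

The stalled-gradient case is precisely where the cross-entropy measure is claimed to accelerate learning.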
Sara A. Solla
AT&T Bell Laboratories
solla at homxb.att.com

From John.Hampshire at SPEECH2.CS.CMU.EDU Sun Mar 12 13:21:21 1989
From: John.Hampshire at SPEECH2.CS.CMU.EDU (John.Hampshire@SPEECH2.CS.CMU.EDU)
Date: Sun, 12 Mar 89 13:21:21 EST
Subject: non-MSE objective function for backprop
Message-ID:

********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD ***********
**************** TO OTHER BBOARDS/ELECTRONIC MEDIA *******************

A NOVEL OBJECTIVE FUNCTION FOR IMPROVED CLASSIFICATION PERFORMANCE
IN TIME-DELAY NEURAL NETS USED FOR PHONEME RECOGNITION

J. B. Hampshire II
A. H. Waibel
Carnegie Mellon University

We have been working on an alternative objective function to the mean-squared-error (MSE) objective function typically used in backpropagation. Our alternative, which we term the classification figure-of-merit (CFM), forms a mathematical assessment of the *relative* activations of all output nodes of a backprop network used as a classifier. The objective function has a number of unique characteristics; chief among these are

1. its formation of internal representations that consistently differ substantially from those of the MSE objective function

2. its immunity to "over-learning" (i.e., the process by which MSE classifiers can be trained so much that they begin to key on "idiosyncratic" features of the training set that are not representative of the ensemble from which the training set was drawn; as a result, over-training actually results in degraded classification performance on a disjoint test set).

While classification performance of the CFM objective function is equivalent to that of the MSE objective function, results from the two classifiers can be combined to reduce by a median 24% the number of misclassifications made by the MSE classifier alone. This equates to single and multi-speaker /b, d, g/ recognition rates that consistently exceed 98%.
A preliminary paper on our results of applying the CFM to phoneme recognition using Time-Delay Neural Nets is available now, but if you want to wait another two weeks, you can get the NEW! IMPROVED! full-fledged technical report. If you absolutely can't wait to get your hands on this stuff, send your mailing address and something to the effect of, "send me the CFM paper." If, on the other hand, you want to see a more thorough analysis, send your mailing address and say, "send me the CFM tech report (CMU-CS-89-118) in two weeks."

In either case, send your request directly to hamps at speech2.cs.cmu.edu

***** DO NOT USE THE REPLY COMMAND IN YOUR MAILER *****

********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD ***********
**************** TO OTHER BBOARDS/ELECTRONIC MEDIA *******************

From netlist at psych.Stanford.EDU Sun Mar 12 17:13:17 1989
From: netlist at psych.Stanford.EDU (Mark Gluck)
Date: Sun, 12 Mar 89 14:13:17 PST
Subject: Tues. 3/14: ALAN LAPEDES, Neural Nets and Signal Processing
Message-ID:

Stanford University Interdisciplinary Colloquium Series:
Adaptive Networks and their Applications

Mar. 14th (Tuesday, 3:30pm):

********************************************************************************
"Nonlinear Signal Processing with Adaptive Networks"

ALAN LAPEDES
Theoretical Division
Los Alamos National Laboratory, MS B213
Los Alamos, New Mexico 87545
********************************************************************************

Abstract

Previous work on using the new generation of nonlinear neural networks for signal processing tasks is reviewed. The concept of a nonlinear system changing its behavior as a parameter is changed (bifurcations) is introduced and investigated for the simple logistic map. In this situation we show that instabilities (limit cycles, chaos) of this system may be predicted as a function of a system parameter purely from observations of the system in its stable regime, where it evolves to a stable fixed point.
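Readers who want to experiment can generate training data from the logistic map easily. A minimal sketch (an editor's illustration, assuming the usual form x' = r*x*(1-x); the parameter values are illustrative, not those used in the talk, and the network itself is omitted):

```python
def logistic(x, r):
    # one step of the logistic map: x' = r * x * (1 - x)
    return r * x * (1.0 - x)

def make_training_pairs(r_values, n_transient=200, n_keep=5):
    # Observe the system only in its stable regime (r < 3, where
    # iterates settle to the fixed point 1 - 1/r).  A network trained
    # on (x, r) -> next x from such data can then be iterated at
    # larger r to predict limit cycles and chaos it has never seen.
    pairs = []
    for r in r_values:
        x = 0.5
        for _ in range(n_transient):   # discard transients
            x = logistic(x, r)
        for _ in range(n_keep):        # record input/target pairs
            x_next = logistic(x, r)
            pairs.append(((x, r), x_next))
            x = x_next
    return pairs

pairs = make_training_pairs([2.5, 2.7, 2.9])
(x, r), _ = pairs[0]
print(abs(x - (1.0 - 1.0 / r)))  # effectively zero: at the fixed point
```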
We consider predicting the bifurcation of a hydrodynamic experiment. Both backpropagation nets and radial basis networks are used on this problem. Agreement with experiment is good, and plenty of pretty three dimensional pictures will be shown. Unnecessary formalism will be kept to a bare minimum.

Additional Information
----------------------

Location: Room 380-380X, which can be reached through the lower level between the Psychology and Mathematical Sciences buildings.

Level: Technically oriented for persons working in related areas.

Mailing lists: To be added to the network mailing list, netmail to netlist at psych.stanford.edu with "addme" as your subject header. For additional information, contact Mark Gluck (gluck at psych.stanford.edu).

From harnad at Princeton.EDU Mon Mar 13 13:57:26 1989
From: harnad at Princeton.EDU (Stevan Harnad)
Date: Mon, 13 Mar 89 13:57:26 EST
Subject: Abstract for CNLS Conference
Message-ID: <8903131857.AA19332@clarity.Princeton.EDU>

Here is the abstract for my contribution to the session on the "Emergence of Symbolic Structures" at the 9th Annual International Conference on Emergent Computation, CNLS, Los Alamos National Laboratory, May 22 - 26, 1989.

Grounding Symbols in a Nonsymbolic Substrate

Stevan Harnad
Behavioral and Brain Sciences
Princeton NJ

There has been much discussion recently about the scope and limits of purely symbolic models of the mind and of the proper role of connectionism in mental modeling.
In this paper the "symbol grounding problem" -- the problem of how the meanings of meaningless symbols, manipulated only on the basis of their shapes, can be grounded in anything but more meaningless symbols in a purely symbolic system -- is described, and then a potential solution is sketched: Symbolic representations must be grounded bottom-up in nonsymbolic representations of two kinds: (1) iconic representations, which are analogs of the sensory projections of objects and events, and (2) categorical representations, which are learned or innate feature-detectors that pick out the invariant features of object and event categories. Elementary symbols are the names of object and event categories, picked out by their (nonsymbolic) categorical representations. Higher-order symbols are then grounded in these elementary symbols. Connectionism is a natural candidate for the mechanism that learns the invariant features. In this way connectionism can be seen as a complementary component in a hybrid nonsymbolic/symbolic model of the mind, rather than a rival to purely symbolic modeling. Such a hybrid model would not have an autonomous symbolic module, however; the symbolic functions would emerge as an intrinsically "dedicated" symbol system as a consequence of the bottom-up grounding of categories and their names.

From niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Tue Mar 14 10:16:44 1989
From: niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (Mahesan Niranjan)
Date: Tue, 14 Mar 89 15:16:44 GMT
Subject: information function vs. squared error
Message-ID: <28888.8903141516@dsl.eng.cam.ac.uk>

I tried sending the following note last weekend but it failed for some reason - apologies if anyone is getting a repeat!

Re:
> Date: Wed, 08 Mar 89 11:36:31 EST
> From: thanasis kehagias
> Subject: information function vs.
squared error
>
> i am looking for pointers to papers discussing the use of an alternative
> criterion to squared error, in back propagation algorithms. the [..]
> G = sum_{i=1}^{N} p_i*log(p_i)
>

Here is a non-causal reference: I have been looking at an error measure based on "approximate distances to class-boundary" instead of the total squared error used in typical supervised learning networks. The idea is motivated by the fact that a large network has an inherent freedom to classify a training set in many ways (and thus generalise poorly!). In my training, an example of a particular class gets a target value depending on where it lies with respect to examples from the other class (in a two-class problem). This implies that the target interpolation function that the network has to construct is a smooth transition from one class to the other (rather than a step-like cross-section as in the total squared error criterion). The important consequence of doing this is that networks are automatically deprived of the ability to form large-weight (= sharp cross-section) solutions (an automatic weight decay!!).

niranjan

PS: A Tech report will be announced soon.

From sven at iuvax.cs.indiana.edu Tue Mar 14 10:12:36 1989
From: sven at iuvax.cs.indiana.edu (Sven Anderson)
Date: Tue, 14 Mar 89 10:12:36 -0500
Subject: Connection between Hidden Markov Models and Connectionist Networks
In-Reply-To: thanasis kehagias's message of Mon, 13 Feb 89 00:47:00 EST
Message-ID:

I'm interested in receiving the paper you described:

OPTIMAL CONTROL FOR TRAINING: THE MISSING LINK BETWEEN
HIDDEN MARKOV MODELS AND CONNECTIONIST NETWORKS

by Athanasios Kehagias
Division of Applied Mathematics
Brown University
Providence, RI 02912

If it's more convenient you might just forward the dvi file.
thanks,
Sven Anderson

From honavar at cs.wisc.edu Tue Mar 14 17:59:39 1989
From: honavar at cs.wisc.edu (A Buggy AI Program)
Date: Tue, 14 Mar 89 16:59:39 -0600
Subject: TR available (** DO NOT FORWARD TO BULLETIN BOARDS **)
Message-ID: <8903142259.AA01452@goat.cs.wisc.edu>

** PLEASE DO NOT FORWARD TO BULLETIN BOARDS **

The following TR is now available:

---------------------------------------

Perceptual Development and Learning: From Behavioral, Neurophysiological, and Morphological Evidence To Computational Models

Vasant Honavar
Computer Sciences Department
University of Wisconsin-Madison

Computer Sciences TR # 818, January 1989

Abstract

An intelligent system has to be capable of adapting to a constantly changing environment. It therefore ought to be capable of learning from its perceptual interactions with its surroundings. This requires a certain amount of plasticity in its structure. Any attempt to model the perceptual capabilities of a living system or, for that matter, to construct a synthetic system of comparable abilities, must therefore account for such plasticity through a variety of developmental and learning mechanisms. This paper examines some results from neuroanatomical, morphological, as well as behavioral studies of the development of visual perception; integrates them into a computational framework; and suggests several interesting experiments with computational models that can yield insights into the development of visual perception.

---------------------------------------

Requests for copies must be addressed to: honavar at cs.wisc.edu

From ash%cs at ucsd.edu Tue Mar 14 19:15:54 1989
From: ash%cs at ucsd.edu (Tim Ash)
Date: Tue, 14 Mar 89 16:15:54 PST
Subject: No subject
Message-ID: <8903150015.AA19834@beowulf.ucsd.edu.UCSD.EDU>

-----------------------------------------------------------------------
The following technical report is now available.
-----------------------------------------------------------------------

DYNAMIC NODE CREATION IN BACKPROPAGATION NETWORKS

Timur Ash
ash at ucsd.edu

Abstract

Large backpropagation (BP) networks are very difficult to train. This fact complicates the process of iteratively testing different sized networks (i.e., networks with different numbers of hidden layer units) to find one that provides a good mapping approximation. This paper introduces a new method called Dynamic Node Creation (DNC) that attacks both of these issues (training large networks and testing networks with different numbers of hidden layer units). DNC sequentially adds nodes one at a time to the hidden layer(s) of the network until the desired approximation accuracy is achieved. Simulation results for parity, symmetry, binary addition, and the encoder problem are presented. The procedure was capable of finding known minimal topologies in many cases, and was always within three nodes of the minimum. Computational expense for finding the solutions was comparable to training normal BP networks with the same final topologies. Starting out with fewer nodes than needed to solve the problem actually seems to help find a solution. The method yielded a solution for every problem tried. BP applied to the same large networks with randomized initial weights was unable, after repeated attempts, to replicate some minimum solutions found by DNC.

-----------------------------------------------------------------------

Requests for reprints (ICS Report 8901) should be directed to:

Claudia Fernety
Institute for Cognitive Science C-015
University of California, San Diego
La Jolla, CA 92093

-----------------------------------------------------------------------

From wine at CS.UCLA.EDU Wed Mar 15 08:49:36 1989
From: wine at CS.UCLA.EDU (wine@CS.UCLA.EDU)
Date: Wed, 15 Mar 89 05:49:36 PST
Subject: TR available (** DO NOT FORWARD TO BULLETIN BOARDS **)
In-Reply-To: Your message of Tue, 14 Mar 89 16:59:39 -0600.
<8903142259.AA01452@goat.cs.wisc.edu>
Message-ID: <8903151349.AA04692@retina.cs.ucla.edu>

Please send me a copy of your technical report #818. Thank you in advance.

--David Wine
University of California at Los Angeles     wine at cs.ucla.edu
Computer Science Department                 (213) 825-6121
3531 Boelter Hall                           ...!(uunet,rutgers,ucbvax,randvax)!cs.ucla.edu!wine
Los Angeles, CA 90024

From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Wed Mar 15 18:24:14 1989
From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias)
Date: Wed, 15 Mar 89 18:24:14 EST
Subject: what is a connectionist network?
Message-ID:

ok, here is my question. i hope it makes sense: very often i want to refer to "these things". i do not want to call them neural networks, since it is far from clear to me they really have a similarity with the human nervous system. so i chose to call them connectionist networks. i guess this means they are networks with (many) connections. but this is very general. so i do not have a clear definition of what i am talking about. i am sure i could come up with several, but they seem to me to be either too restrictive or too general. so would anybody care to give their definition of these objects that this list is about?

the issue is not trivial or vacuously philosophical. i think that even if we do not come up with a generally accepted definition of what a connectionist net is, people will have a chance to present competing opinions. possibly some lurking differences will come to the surface and the foundations of connectionism will become more secure.

here is a case that i think is fraught with issues (that could be cleared up). any dynamical system that evolves in discrete time can be represented (over a finite time interval) by a feedforward connectionist network. is it fair to say that dynamical systems are connectionist networks? conversely, is it fair to say that feedforward nets are dynamical systems? what are the implications for a time-space trade-off?
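a quick way to see the representation claim above is to unroll the time index into layers: layer t of a feedforward net computes the state at time t. a minimal sketch (an editor's illustration; the particular map f is an arbitrary assumption, and a real net would of course implement f with weighted sums and squashing functions):

```python
def unroll(f, depth):
    # Represent T steps of x(t+1) = f(x(t)) as a depth-T feedforward
    # composition: "layer" t computes the state at time t.  This trades
    # time (iteration) for space (layers) -- the time-space trade-off
    # raised above.
    def net(x0):
        activations = [x0]
        x = x0
        for _ in range(depth):
            x = f(x)              # one layer = one time step
            activations.append(x)
        return activations        # activations[t] is the state at time t
    return net

f = lambda x: 0.5 * x + 1.0       # an illustrative discrete-time system
net = unroll(f, 4)
print(net(0.0))                   # [0.0, 1.0, 1.5, 1.75, 1.875]
```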
how much do we have to learn about dynamical systems to do connectionist research?

ok, after all this i guess i have to give my definition of a connectionist network. it is rather involved and it goes like this: "connectionism is not a yes-or-no property. any directed graph (collection of nodes and directed edges) has a connectionism index, defined as the ratio of nr. of edges to nr. of nodes."

PS: has anybody already dealt with the question of defining a CN? references welcome.

Thanasis

From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Wed Mar 15 18:23:24 1989
From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias)
Date: Wed, 15 Mar 89 18:23:24 EST
Subject: cross entropy and training time in Connectionist Nets and HMM's
Message-ID:

these are some random thoughts on the issue of training in HMM's and Connectionist networks. i focus on the cross entropy function and follow a different line of thinking than in my paper which i quote further down. this note is the outcome of an exchange between Tony Robinson and me; i thought some netters might be interested. so i want to thank Tony for posing interesting ideas and questions. also thanks to all the people who replied to my request for information on the cross entropy function.

-----------------------

the starting point for this discussion is the following question: "why is HMM training so much faster than Connectionist Network training?"

to put the question in perspective, let me first remark that, from a certain point of view, HMM's and CN's are very similar objects. specifically, they use similar architectures to optimize appropriate cost functions. for further explanation of this point, see [Kehagias], also [Kung]. the similarity is even more obvious when CN's are used to solve speech recognition problems. the question remains: why, in attempting to solve the same problem, do CN's require so much more training?

1. cost functions
-----------------

it appears that a (partial) explanation is the nature of the cost function used in each case. in CN speech recognizers, the cost function of choice is quadratic error (error being the difference of appropriate vectors). however, in most of what follows i will consider CN's that maximize the cross entropy function. a short discussion of the relationship between cross entropy and square error is included at the end.

in HMM's the function MAXIMIZED is the likelihood (of the observations). however, HMM's are a bit more subtle. using the Markov Model, one can write the likelihood of the observations used for training, call it L(q). here q is a vector that contains the transition and emission probabilities (usually called a_ij, b_kj, respectively). to keep the discussion simple, let us consider the only unknown parameters to be the a_ij's. that is, the elements of q are the a_ij's. now, q is a vector, but a more general view of it is that it is a function (specifically a probability density function). so we will consider q as a vector or a function interchangeably. (of course any vector is a function of its index!)

now, to maximize L is not a trivial task: it is a polynomial of n*T-th order in the elements of q (where n is the order of the Markov model and T the number of observations); furthermore, the elements of q are probabilities and they must satisfy certain positivity and add-up-to-1 conditions.

2. Likelihood maximin, Backward-Forward, EM algorithm
-----------------------------------------------------

so HMM people have found a way to make the optimization problem easier: consider an auxiliary function, call it Q(q,q'), to be presently defined, which can be maximized much more easily. then they prove the remarkable inequality:

(1) L(q)*log(L(q')/L(q)) >= Q(q,q') - Q(q,q).

the consequence of (1) is the following: we can implement an iterative algorithm that goes as follows:

Step 0: choose q(0)
.....
Step k: choose q(k) such that Q(q(k-1),q(k)) is maximized. if Q(q(k-1),q(k)) = Q(q(k-1),q(k-1)), terminate; if Q(q(k-1),q(k)) > Q(q(k-1),q(k-1)), go to step k+1.
.....

REMARKS:

1) observe that no provision is made for the case that Q(q(k-1),q(k)) - Q(q(k-1),q(k-1)) is negative. this is due to the fact that the maximal improvement in Q is always nonnegative (the choice q(k) = q(k-1) already gives zero improvement), as proved in [Baum 1968] or [Dempster].

2) of course, in practice, the termination condition will be replaced by: terminate when the improvement falls below some small threshold. in any case, by the choice of q(k) in step k we have

(2) Q(q(k-1),q(k)) > Q(q(k-1),q(k-1))

whenever the algorithm does not terminate. from (1) and (2) and Remark (1) it follows that

(3) L(q(k)) > L(q(k-1)).

3. Connection of EM with cross entropy and neural networks
----------------------------------------------------------

now we will discuss the function Q and point out the relationship to CN's. the function Q(q,q') can be defined in quite a general setting. q, q' are probability densities. as such they are functions themselves; we write q(x), q'(x). x takes values in an appropriate range. e.g., in the HMM model x ranges over all the state transition pairs (i,j), giving the probability of a certain state transition. now, define Q:

(4) Q(q,q') = sum{over all x} q(x)*log(q'(x)).

then, the difference Q(q,q) - Q(q,q') is:

(5) G(q,q') = Q(q,q) - Q(q,q') = sum{over all x} q(x)*log(q(x)/q'(x)).

G is the cross-entropy between q and q', well known to connectionists (and statisticians), that is, a measure of distance between these two probability densities. now we recognize two things:

I. there have been cases where G minimization has been proposed as a CN training procedure. see [Hinton]. in these cases, a desired probability density was known and what was desired was to minimize the distance between desired and actual probability density of the CN output. in some of these cases, there was concurrent maximization of likelihood. this is noted in [Ackley]. it follows necessarily from (1) that minimizing the cross-entropy maximizes the minimum improvement in likelihood.

II. it is clear that the BF algorithm does a similar thing: likelihood maximization, cross entropy minimization.
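the identity in (5), the nonnegativity of G, and the fact that G vanishes only at q' = q are easy to check numerically. a minimal sketch (an editor's illustration; the two distributions are arbitrary choices, not from any of the models discussed):

```python
import math

def Q(p, p_prime):
    # Q(q, q') = sum over x of q(x) * log(q'(x)), as in (4)
    return sum(px * math.log(ppx) for px, ppx in zip(p, p_prime))

def G(p, p_prime):
    # cross-entropy distance, as in (5):
    # G(q, q') = sum over x of q(x) * log(q(x) / q'(x))
    return sum(px * math.log(px / ppx) for px, ppx in zip(p, p_prime))

q  = [0.7, 0.2, 0.1]   # two illustrative distributions
qp = [0.5, 0.3, 0.2]

# (5): G(q, q') equals Q(q, q) - Q(q, q'); it is nonnegative,
# and it vanishes when q' = q.
print(abs(G(q, qp) - (Q(q, q) - Q(q, qp))))  # effectively zero
print(G(q, qp) > 0.0)                        # True
print(G(q, q))                               # 0.0
```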
as noted in [Baum 1968] and also in [Levinson], the difference q(k)-q(k-1) points in the same direction as grad L(q), evaluated at q(k-1). that is, q(k-1) is changed in the direction of steepest ascent of L. of all the possible steps (choices of q(k)), the one is chosen that minimizes the distance between q(k-1) and q(k) in the cross entropy sense.

4. Comparison in training of HMM's and CN's
-------------------------------------------

now we can make a comparison of the performance of CN's and HMM's. this comparison is between G-optimizing CN's and HMM's; the square-error CN is not discussed here. firstly, we see that the main focus of attention is different in the two cases. in CN's we want to minimize cross entropy. in HMM's we want to maximize likelihood. however, likelihood maximinimization is an automatic consequence of G minimization for CN's, and local G minimization is built into the BF algorithm. in that sense, the two tasks are very similar, and so the question is once again raised: why are HMM's faster to train?

at this point the answers are many and easy. even though HMM's use observations in a nonlinear way, the state vector of the adjoint network (see [Kehagias]) evolves linearly. not so for CN's. the HMM adjoint network is sparsely connected. not necessarily so for the CN (pointed out by [Tony Robinson]). though both cost functions used are nonlinear, the BF is a much more efficient method to optimize the HMM cost function than Back Propagation is for CN's.

the last answer is the really important one. due to the special nature of the Hidden Markov Model, we can use the BF algorithm. this algorithm allows us to take large steps (large changes from q(k-1) to q(k)) in the Euclidean distance, without moving too far away in the cross entropy distance.
of all the probability distributions, we consider only the ones that are "relevant", in that they are close to the current one; and yet, even though we take conservative steps, we are guaranteed to maximize the minimum improvement in likelihood. indeed the max-min is a conservative attitude. the rationale is the following: "you want to maximize L. you know the steepest ascent direction; you want to go in that direction, but you do not know how far to go. BF will tell you how far you can go (and it will not be an infinitesimal step) so that you maximize the minimum improvement."

another way to look at this is that the Euclidean distance imposes a structure (topology) on the space of probability distributions. the cross entropy distance imposes a different structure, which, apparently, is more relevant to the problem. in contrast, in BP we do not have much choice in the change we bring on q. we have control over w, the weights of the connections, and we usually choose them in the steepest descent direction, and small enough that we actually have an improvement. but it is not clear that the cross entropy between distributions imposes a suitable structure on the space of weights. apparently it does not. even a relatively small step in the weight space can change the cost function by much. we have to tread more carefully.

of course BF can be used due to the very special structure of the HMM problem (which is probably a good argument for the usefulness of the HM Model). BF is applicable when the cost function is a homogeneous polynomial with additive constraints on the variables (see [Baum 1968]). the CN problem is characterized by harder nonlinearities (e.g. the sigmoid function) which induce a warped relationship between the weights and the cost function. in short, the CN problem is more general and harder.

5. square error cost function
-----------------------------

first a general observation: the square error cost function can be introduced under two assumptions.
in the one case we assume the error to be deterministic and we want to minimize a deterministic sum of square errors (the sum is over all training patterns; the error is the difference between desired and actual response) by appropriate choice of weights. there is nothing probabilistic here. alternatively, we can assume that the training patterns are selected randomly (according to some probability density), and also that the test patterns will come from the same probability density, and we choose the weights to minimize expected square error. even though the two points of view are distinct, they are not that different, since in both cases we can define inner products, distance functions etc. and so get a Hilbert space structure that is practically the same for both cases. of course this would involve some ergodicity assumption. at any rate, assume here the probabilistic point of view of square error.

what, then, are the connections between the two cost functions: cross entropy and expected (or mean) square error? i have seen some remarks on this problem in the literature, but i do not know enough about it at this point. however, judging from training time, i would say that the nonlinear nature of CN's with sigmoids again maps the weight space to the cost function in a very warped way. it would be interesting to examine the shape of the cost function contours in the weight space. have such studies been made? visualization seems to be a problem for high dimensional networks.

6. cross entropy maximization and some loose ends
-------------------------------------------------

an interesting variation is G maximization. this usually occurs in unsupervised learning. see [Linsker], [Plumbley]. it appears under the name of transinformation maximization, or error information minimization, but these quantities can be interpreted as cross entropy between the joint input-output probability density induced by the CN (for given weights) and the probability density
where input and output have the same marginals, but are independent (so the joint density is a product of the two marginals). i guess a way to explain this in terms of cross entropy is: even though we have no prior information on the best input-output density, there is one density we certainly want to avoid as much as possible, and this is the one where input and output are independent (so the input gives no information as to what the output is). hence we want to maximize the cross entropy distance between this product distribution and the CN-induced distribution. there is also a possible interpretation along the lines of the maximum entropy principle. i must say that these interpretations do not seem (yet) to me as appealing as maximum transinformation. however they are possible, and indeed statisticians have been considering them for many years now.

another interesting connection is between cross entropy and rate of convergence (obviously rate of convergence is connected to training time). [Ellis] gives an excellent analysis of the connection between rate of convergence and cross entropy. application of his results to computational problems is not obvious.

finally, an interesting example (of statistical work that relates to this line of connectionist research) is [Rissanen]; there the linear regression model is considered, which of course can be interpreted as a linear perceptron. in [Rissanen] selection of the optimal model is based on a minmax entropy criterion.

References:
-----------

D.H. Ackley et al.: "A Learning Algorithm for Boltzmann Machines", Cognitive Science 9 (1985).

L.E. Baum & G.R. Sell: "Growth Transformations for Functions on Manifolds", Pacific Journal of Mathematics, Vol. 27, No. 2, 1968.

L.E. Baum et al.: "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains", The Annals of Math. Stat., Vol. 41, No. 1, 1970.

A.P. Dempster et al.: "Maximum Likelihood from Incomplete Data via the EM Algorithm", J. Roy. Stat.
Soc., No. 1, 1977.

R. Ellis: "Entropy, Large Deviations and Statistical Mechanics", Springer, New York, 1985.

G. Hinton: "Connectionist Learning Procedures", Technical Report CMU-CS-87-115 (Carnegie Mellon University), June 1987.

A. Kehagias: "Optimal Control for Training: The Missing Link between HMM and Connectionist Networks", submitted to 7th Int. Conf. on Math. and Computer Modelling, Chicago, Illinois, August 1989.

S.Y. Kung & J.N. Hwang: "A Unifying Viewpoint of Multilayer Perceptrons and HMM Models", IEEE Int. Symposium on Circuits and Systems, Portland, Oregon, 1989.

S.E. Levinson et al.: "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition", The Bell Sys. Tech. J., Vol. 62, No. 4, April 1983.

R. Linsker: "Self-Organization in a Perceptual Network", IEEE Computer, Vol. 21, No. 3, March 1988.

M. Plumbley & F. Fallside: "An Information Theoretic Approach to Unsupervised Connectionist Models", Proceedings of the 1988 Connectionist Models Summer School, Pittsburgh, 1988.

J. Rissanen: "Minmax Entropy Estimation of Models for Vector Processes", in Lainiotis-Mehra (eds.), System Advances and Case Studies, Academic, New York, 1976.

T. Robinson: personal communication.

From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Thu Mar 16 09:54:52 1989
From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias)
Date: Thu, 16 Mar 89 09:54:52 EST
Subject: HMM?
Message-ID:

with respect to my cross entropy posting, i guess i never said it explicitly: HMM stands for Hidden Markov Model. it is a model widely used in speech research.

Thanasis

From sankar at caip.rutgers.edu Thu Mar 16 09:42:44 1989
From: sankar at caip.rutgers.edu (ananth sankar)
Date: Thu, 16 Mar 89 09:42:44 EST
Subject: questions on kohonen's maps
Message-ID: <8903161442.AA14983@caip.rutgers.edu>

I am interested in the subject of Self Organization and have some questions with regard to Kohonen's algorithm for Self Organizing Feature Maps.
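For concreteness, the adaptation rule under discussion, with the commonly used shrinking gain and neighbourhood, can be sketched as follows (an editor's minimal sketch; the exponential decay schedules and all parameter values are illustrative assumptions, not prescriptions from Kohonen's book):

```python
import math, random

def train_som(data, grid=10, iters=2000, a0=0.5, s0=5.0):
    # Kohonen feature map: a grid x grid sheet of units with 2-d
    # weight vectors.  Gain a(t) and neighbourhood width s(t) both
    # decay exponentially; these schedules are common heuristic
    # choices, since no canonical analytical form is agreed upon.
    random.seed(0)
    w = [[[random.random(), random.random()] for _ in range(grid)]
         for _ in range(grid)]
    for t in range(iters):
        x = random.choice(data)
        a = a0 * math.exp(-3.0 * t / iters)             # gain schedule
        s = max(s0 * math.exp(-3.0 * t / iters), 0.5)   # width schedule
        # winner = unit whose weight vector is nearest; the weights are
        # not normalised, so a distance rule (not a dot product) is used
        bi, bj = min(((i, j) for i in range(grid) for j in range(grid)),
                     key=lambda ij: (w[ij[0]][ij[1]][0] - x[0]) ** 2
                                  + (w[ij[0]][ij[1]][1] - x[1]) ** 2)
        # move the winner and its grid neighbours toward the input
        for i in range(grid):
            for j in range(grid):
                h = math.exp(-((i - bi) ** 2 + (j - bj) ** 2)
                             / (2.0 * s * s))
                for k in range(2):
                    w[i][j][k] += a * h * (x[k] - w[i][j][k])
    return w

# uniform 2-d input, as in the experiment described below
data = [[random.random(), random.random()] for _ in range(500)]
w = train_som(data)
```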
I have tried to duplicate the results of Kohonen for the two dimensional uniform input case, i.e. two inputs. I used a 10 x 10 output grid. The maps that resulted were not as good as reported in the papers. Questions:

1. Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach.

2. Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features.

3. In Kohonen's book "Self Organization and Associative Memory", Ch. 5, the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation.

4. I have not yet seen in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited.

5. Can the net become disordered after ordering is achieved at any particular iteration?

I would appreciate any comments, suggestions etc. on the above. Also, so that net mail clutter may be reduced, please respond to sankar at caip.rutgers.edu

Thank you.

Ananth Sankar
Department of Electrical Engineering
Rutgers University, NJ

From KELLY%BROWNCOG.BITNET at mitvma.mit.edu Thu Mar 16 12:12:00 1989
From: KELLY%BROWNCOG.BITNET at mitvma.mit.edu (KELLY%BROWNCOG.BITNET@mitvma.mit.edu)
Date: Thu, 16 Mar 89 12:12 EST
Subject: What is a connectionist net? Here's what it's not.
Message-ID:

What is a connectionist model, you ask? Well, I don't think I can answer that specifically, but I can tell you what it's not. In the first place it *is* a member of a larger class of models called complex systems.
But that doesn't help us either, because nobody really knows what a complex system is. The generally conceived definition has something to do with large numbers of simple, interconnected units which can perform some type of "cooperative computation". That is, individually the units are so dumb that they can't do anything, but together they can do a lot. Well, then, my claim (I'm really out on a limb here) is that systems with large numbers of very complex, interconnected units really aren't connectionist models (or even complex systems) at all, no matter how many connections there are or what type of amazing results they achieve. In particular I am referring to the result that Hecht-Nielsen reports in his paper on "Kolmogorov's Mapping Neural Network Theorem" [1987 INNS proceedings?]. There he describes a way of proving that a 2-layered net (one hidden layer) is capable of solving any mapping problem. However, the units in the network are incredibly complex. No longer are we dealing with units that compute threshold functions. The hidden layer units must be able to compute any real, continuous, monotonically increasing function, and the output layer units must be able to compute any *arbitrary* real continuous function. While the fact that a system like this can do some serious computation is interesting (neat, even), it really tells us nothing about connectionist networks. From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Thu Mar 16 22:19:54 1989 From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias) Date: Thu, 16 Mar 89 22:19:54 EST Subject: credits Message-ID: recently i posted a note about training of HMM and Connectionist Networks, where i was not careful enough in giving credit to people who deserved it. let me try to make up for it: i had a very interesting exchange of messages with Tony Robinson, which formed the basis for my note. i received messages with ideas and references from Mark Plumbley, Steven Nowlan, Sue Becker and Sara Solla.
Sara Solla referred me to a paper written by Solla, Esther Levin and Michael Fleisher that deals with the question of cross entropy. i received a copy of this paper today. it is: "Accelerated Learning in Layered Neural Networks", by S. Solla, E. Levin and M. Fleisher, Complex Systems, Vol. 2, 1988. the paper compares cross entropy and square error and includes a numerical study and a study of the shape of the contours of these cost functions. therefore, the similar question i posed at the end of my note is at least partly answered. i also received the revised copy of G. Hinton's report on connectionist learning procedures, referred to in my note. in this report (Dec. 1987) Hinton has already made a remark directly related to my point about maximizing likelihood in the BF algorithm. specifically, he says that (in the context of CN training with a cross entropy cost function) likelihood is maximized when cross entropy is minimized. i think this is all. if i have missed something, let me know about it. Thanasis From ROB%BGERUG51.BITNET at VMA.CC.CMU.EDU Fri Mar 17 09:24:00 1989 From: ROB%BGERUG51.BITNET at VMA.CC.CMU.EDU (Rob A. Vingerhoeds / Ghent State University) Date: Fri, 17 Mar 89 09:24 N Subject: Neural Networks Seminar Ghent, 25 April 1989, FINAL ANNOUNCEMENT Message-ID: BIRA SEMINAR ON NEURAL NETWORKS "APPLICATION OF NEURAL NETWORKS IN INDUSTRY, WHEN AND HOW" 25 APRIL 1989 INTERNATIONAL CONGRESS CENTRE GHENT BELGIUM FINAL ANNOUNCEMENT BIRA (the Belgian Institute for Control Engineering and Automation) is organising a seminar on the state of the art in Neural Networks. The central theme will be "Application of Neural Networks in Industry, when and how". To arrive at a sound and reliable verdict on this theme, some of the most important and leading scientists in this fascinating area have been invited to present a lecture at the seminar and take part in a panel discussion.
The following program is foreseen:

 8.30 -  9.00  Registration
 9.00 -  9.15  Opening on behalf of BIRA (Prof. L. Boullart, Ghent State University)
 9.15 - 10.00  Learning Algorithms and Applications in A.I. (Prof. Fogelman Soulie, Universite de Paris V)
10.00 - 10.30  Coffee
10.30 - 11.30  The Neural Network Framework (Prof. B. Kosko, University of Southern California)
11.30 - 12.00  Presentation of ANZA+ products, hardware and software (Patrick Dumont, Digilog, France)
12.00 - 14.00  Lunch / exhibition
14.00 - 15.00  Integration of knowledge-based system and neural network techniques for robotic control (Dr. David Handelman, Princeton, USA)
15.00 - 16.00  Application in Image Processing and Pattern Recognition (Neocognitron) (Dr. S. Miyake, ATR, Japan)
16.00 - 16.30  Tea
16.30 - 17.15  Panel discussion on the central theme
17.15 - 17.30  Closing and conclusions

The seminar will be held during the same period as the famous Flanders Technology International (F.T.I.) exhibition. The exhibition is of great interest both to representatives from industry and to other visitors, so combining the seminar with a visit to the exhibition is doubly worthwhile.

VENUE: International Congress Centre Ghent, Orange Room, Citadelpark, B-9000 Ghent
DATE: Tuesday 25 April 1989
LANGUAGE: The seminar language is English. No translation will be provided.
REGISTRATION FEES: members BIRA/IBRA 12.500 BEF; non-members 15.000 BEF; Teachers/Assistants 7.500 BEF; including coffee/tea, lunch and proceedings. Students can get a special price of 1.500 BEF, which does NOT include lunch. Tickets for FLANDERS TECHNOLOGY INTERNATIONAL can be obtained at the registration desk. Payments in Belgian Francs only, to be made on receipt of an invoice from the BIRA office. Registration will close on 18 April 1989. Confirmations will NOT be sent. For further information or a printed announcement with a registration form please contact either the BIRA coordinator (address below) or one of us (using e-mail).
You can also use the registration form printed below and send it back to us via e-mail. We will then make sure it reaches BIRA in time.

----------------------------------------------------------------------
REGISTRATION FORM
Tuesday 25 April 1989, I.C.C.-Ghent
BIRA Seminar on NEURAL NETWORKS

NAME: ..................................................
FIRST NAME: ..................................................
ADDRESS: ..................................................
POSITION: ..................................................
CONCERN OR INSTITUTE: ..................................................
TEL: ..................................................
FAX: ..................................................
-------------------------
Member BIRA/IBRA : ........ BEF
Non-members : ........ BEF
Teachers/Assistants : ........ BEF
-------------------------
Please only settle payment upon receipt of an invoice from the BIRA office. Please indicate whether the invoice should be addressed to the company or to your personal address.
Date:
Please send back before 17 April 1989. Do NOT use 'REPLY', because that way everyone on the list will be informed about your plans to come to the seminar, and they just might not be interested.
----------------------------------------------------------------------
Seminar Coordinators: Rob Vingerhoeds, Leo Vercauteren
BIRA COORDINATOR: L. Pauwels, BIRA-Office, Het Ingenieurshuis, Desguinlei 214, 2018 Antwerpen, Belgium
tel: +32-3-216-09-96 fax: +32-3-216-06-89 (attn. BIRA L. Pauwels)

From alexis%yummy at gateway.mitre.org Fri Mar 17 09:46:27 1989 From: alexis%yummy at gateway.mitre.org (alexis%yummy@gateway.mitre.org) Date: Fri, 17 Mar 89 09:46:27 EST Subject: What is a connectionist net? Here's what it's not.
In-Reply-To: KELLY%BROWNCOG.BITNET@mitvma.mit.edu's message of Thu, 16 Mar 89 12:12 EST <8903170151.AA26943@gateway.mitre.org> Message-ID: <8903171446.AA02093@marzipan.mitre.org> ************ Do Not Forward To Any Other BBoards, Etc ************ Just an aside to KELLY%BROWNCOG's note: rather than worry about whether Hecht-Nielsen's neural net (and I use the term intentionally -- I mean, "artificial intelligence" is neither, so ...) is really a connectionist model, let me point out a paper/result worth being aware of. G. Cybenko wrote a very interesting paper which proves that a neural network with *one* hidden layer of nodes (i.e., one more than a perceptron) with a sigmoid transfer function can "uniformly approximate any continuous function with support in the unit hypercube". That is to say, you actually can do any mapping with *ONE* hidden layer (albeit often a very, very large one). Cybenko sent the paper to me because of a tirade I went on a while ago on this bboard, so I don't actually know if it has been published anywhere yet. I'm writing this without his knowledge -- I'm pretty sure he's on this list. G. Cybenko, are you out there, and are you willing to say where your paper "Approximation by Superpositions of a Sigmoidal Function" can be found by the hungary masses? alexis wieland. ************ Do Not Forward To Any Other BBoards, Etc ************ From sontag at fermat.rutgers.edu Sat Mar 18 18:27:29 1989 From: sontag at fermat.rutgers.edu (sontag@fermat.rutgers.edu) Date: Sat, 18 Mar 89 18:27:29 EST Subject: ONE HIDDEN LAYER IS ENOUGH -- re "what is a net?" discussion Message-ID: <8903182327.AA06225@control.rutgers.edu> This is in response to Alexis Wieland's request: "G. Cybenko are you out there, and are you willing to say where your paper "Approximation by Superpositions of a Sigmoidal Function" can be found by the hungary (sic) masses?"
(Presumably non-Hungarian masses are interested too, so:) The paper by George Cybenko that proves this theorem (a neural network with one hidden layer of nodes with a fixed sigmoid transfer function can uniformly approximate any continuous function) is scheduled to appear in MATHEMATICS OF CONTROL, SIGNALS, AND SYSTEMS, Vol.2 (1989), Number 4. Your library should have this journal, which specializes in the formal mathematical analysis of problems related to signal processing and systems. (The journal has published many other papers that should be relevant to theoretical connectionist research, such as papers on iterated projection methods, estimation, interpolation techniques, identification, and adaptive control.) If your library doesn't yet subscribe, you might as well provide them with the following info: MATHEMATICS OF CONTROL, SIGNALS, AND SYSTEMS Springer-Verlag New York, Inc ISSN 0932-4194, Title # 498 In North America, order from: Springer-Verlag New York, Inc Journal Fulfillment Services 44 Hartz Way, Secaucus, NJ 07094 (Volume 2, 1989 ... $179.00 incl. p&h) Outside NA, order from: Springer-Verlag Heidelberger Platz 3 D-1000 Berlin 33, FRG (Volume 2, 1989 ... DM 348.- incl. p&h) -bradley dickinson and eduardo d. sontag, co-Managing eds. From terry%sdbio2 at ucsd.edu Sat Mar 18 21:11:09 1989 From: terry%sdbio2 at ucsd.edu (Terry Sejnowski) Date: Sat, 18 Mar 89 18:11:09 PST Subject: ONE HIDDEN LAYER IS ENOUGH -- re "what is a net?" discussion Message-ID: <8903190211.AA17912@sdbio2.UCSD.EDU> Hal White in the Economics Department at UCSD has also proved that one hidden layer can uniformly approximate smooth mappings. He has gone on to prove the even more interesting theorem that it is possible to learn the mapping. Write to him for a preprint: Hal White Department of Economics UCSD San Diego, CA 92093 Two related papers that are in press in Neural Computation: What size net gives valid generalization? 
by Eric Baum and David Haussler A proposal for more powerful learning algorithms. Eric Baum. For preprints write to: Eric Baum Department of Physics Princeton University Princeton, NJ 08540 Terry Sejnowski ----- From chrisley.pa at Xerox.COM Mon Mar 20 14:25:00 1989 From: chrisley.pa at Xerox.COM (chrisley.pa@Xerox.COM) Date: 20 Mar 89 11:25 PST Subject: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST Message-ID: <890320-112612-6136@Xerox> Ananth Sankar recently asked some questions about Kohonen's feature maps. As I have worked on these issues with Kohonen, I feel like I might be able to give some answers, but standard disclaimers apply: I cannot be certain that Kohonen would agree with all of the following. Also, I do not have my copy of his book with me, so I cannot be more specific about references. Questions: 1 Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach. Although there is probably more than one correct, task-independent gain or neighborhood function, Kohonen does mention constraints that all of them should meet. For example, both functions should decrease to zero over time. I do not know of any tweaking; Kohonen usually determines a number of iterations and then decreases the gain linearly. If you call this tweaking, then your idea of domain-independent parameters might be a sort of holy grail, since it does not seem likely that we are going to find a completely parameter-free learning algorithm that will work in every domain. 2 Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features.
As far as I know, Kohonen has used the same type of gain and neighborhood functions for all of his map demonstrations. These demonstrations, which have been shown via an animated film at several major conferences, demonstrate maps learning the distribution in cases where 1) the dimensionality of the network topology and the input space mismatch, e.g., where the network is 2d and the distribution is a 3d 'cactus'; and 2) the distribution is not uniform. The algorithm was developed with these two cases in mind, so it is no surprise that the results are good for them as well. 3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation. That's right. And Kohonen usually uses the Euclidean distance metric, although other ones can be used (which he discusses in the book). Furthermore, there have been independent efforts to normalize weights in Kohonen maps so that the dot product measure can be used. If you have any doubts about the suitability of the Euclidean metric, as your question seems to imply, express them. It is an interesting issue. 4 I have not seen as yet in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited. The primary interest in maps, I believe, came from a desire to display high-dimensional information in low-dimensional spaces, which are more easily apprehended.
But there is evidence that there are other uses as well: 1) Kohonen has published results on using maps for phoneme recognition, where the topology-preservation plays a significant role (such maps are used in the Otaniemi Phonetic Typewriter featured in, I think, Computer magazine a year or two ago); 2) work has been done on using the topology to store sequential information, which seems to be a good idea if you are dealing with natural signals that can only temporally shift from a state to similar states; 3) several people have followed Kohonen's suggestion of using maps for adaptive kinematic representations for robot control (the work on Murphy, mentioned on this net a month or so ago, and the work being done at Carleton University by Darryl Graf are two good examples). In short, just look at some ICNN or INNS proceedings, and you'll find many papers where researchers found Kohonen maps to be a good place from which to launch their own studies. 5 Can the net become disordered after ordering is achieved at any particular iteration? Of course, this is theoretically possible, and is almost certain if at some point the distribution of the mapped function changes. But this brings up the difficult question: what is the proper ordering in such a case? Should a net try to integrate both past and present distributions, or should it throw away the past and concentrate on the present? I think most nn researchers would want a little of both, with maybe some kind of exponential decay in the weights. But in many applications of maps, there is no chance of the distribution changing: it is fixed, and iterations are over the same test data each time. In this case, I would guess that the ordering could not become disrupted (at least for simple distributions and a net of adequate size), but I realise that there is no proof of this, and the terms 'simple' and 'adequate' are lacking definition. But that's life in nnets for you! If anyone has any more questions, feel free.
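To make the Kohonen map discussion above concrete, here is a minimal sketch in Python. It is not from any of the postings: the linearly decaying gain and shrinking neighbourhood radius are illustrative assumptions in the spirit of the schedules described above, not Kohonen's published parameters, and the one-dimensional input is chosen only to keep the example short.

```python
import random

def train_som(n_nodes=10, n_iters=2000, seed=0):
    """Minimal 1-D Kohonen feature map learning a 1-D uniform distribution.

    Both the gain a(t) and the neighbourhood radius decrease (roughly
    linearly) to zero over the run, as discussed above. The exact
    schedules here are illustrative assumptions only.
    """
    rng = random.Random(seed)
    w = [rng.random() for _ in range(n_nodes)]  # one weight per node
    for t in range(n_iters):
        x = rng.random()                         # sample from U[0, 1]
        a = 0.5 * (1.0 - t / n_iters)            # linearly decaying gain
        radius = int((n_nodes // 2) * (1.0 - t / n_iters))
        # Winner: node whose weight is closest to the input
        # (a distance computation, not a dot product -- see question 3).
        c = min(range(n_nodes), key=lambda i: abs(x - w[i]))
        # Move the winner and its topological neighbours toward the input.
        for i in range(max(0, c - radius), min(n_nodes, c + radius + 1)):
            w[i] += a * (x - w[i])
    return w

w = train_som()
# The trained weights often end up (approximately) monotonically ordered
# along the chain, i.e. topologically ordered -- though, as noted in the
# thread, there is no general guarantee against disorder.
```

With higher-dimensional inputs the scalar weight per node becomes a weight vector and `abs(x - w[i])` becomes a Euclidean distance, but the structure of the loop is the same.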
Ron Chrisley Xerox PARC System Sciences Lab 3333 Coyote Hill Road Palo Alto, CA 94304 USA chrisley.pa at xerox.com tel: (415) 494-4728 OR New College Oxford OX1 3BN UK chrisley at vax.oxford.ac.uk tel: (865) 279-492 From moody-john at YALE.ARPA Tue Mar 21 16:11:08 1989 From: moody-john at YALE.ARPA (john moody) Date: Tue, 21 Mar 89 16:11:08 EST Subject: two research reports available Message-ID: <8903212107.AA03190@NEBULA.SUN3.CS.YALE.EDU> ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* FAST LEARNING IN MULTI-RESOLUTION HIERARCHIES John Moody Research Report YALEU/DCS/RR-681, February 1989 ABSTRACT A class of fast, supervised learning algorithms is presented. They use local representations, hashing, and multiple scales of resolution to approximate functions which are piecewise continuous. Inspired by Albus's CMAC model, the algorithms learn orders of magnitude more rapidly than typical implementations of back propagation, while often achieving comparable qualities of generalization. Furthermore, unlike most traditional function approximation methods, the algorithms are well suited for use in real-time adaptive signal processing. Unlike simpler adaptive systems, such as linear predictive coding, the adaptive linear combiner, and the Kalman filter, the new algorithms are capable of efficiently capturing the structure of complicated non-linear systems. As an illustration, the algorithm is applied to the prediction of a chaotic time series. NOTE: This research report will appear in Advances in Neural Information Processing Systems, edited by David Touretzky, to be published in April 1989 by Morgan Kaufmann Publishers, Inc. The author gratefully acknowledges financial support under ONR grant N00014-89-J-1228, ONR grant N00014-86-K-0310, AFOSR grant F49620-88-C0025, and a Purdue Army subcontract.
*********************************************************** FAST LEARNING IN NETWORKS OF LOCALLY-TUNED PROCESSING UNITS John Moody and Christian J. Darken Research Report YALEU/DCS/RR-654, October 1988, Revised March 1989 ABSTRACT We propose a network architecture which uses a single internal layer of locally-tuned processing units to learn both classification tasks and real-valued function approximations. We consider training such networks in a completely supervised manner, but abandon this approach in favor of a more computationally efficient hybrid learning method which combines self-organized and supervised learning. Our networks learn faster than back propagation for two reasons: the local representations ensure that only a few units respond to any given input, thus reducing computational overhead, and the hybrid learning rules are linear rather than nonlinear, thus leading to faster convergence. Unlike many existing methods for data analysis, our network architecture and learning rules are truly adaptive and are thus appropriate for real-time use. NOTE: This research report will appear in Neural Computation, a new journal edited by Terry Sejnowski and published by MIT Press. The work was supported by ONR grant N00014-86-K-0310, AFOSR grant F49620-88-C0025, and a Purdue Army subcontract.
*********************************************************** Copies of both reports can be obtained by sending a request to: Judy Terrell Yale Computer Science PO Box 2158 Yale Station New Haven, CT 06520 (203)432-1200 e-mail: terrell at cs.yale.edu terrell at yale.arpa terrell at yalecs.bitnet ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* ------- From chrisley.pa at Xerox.COM Thu Mar 23 14:35:00 1989 From: chrisley.pa at Xerox.COM (chrisley.pa@Xerox.COM) Date: 23 Mar 89 11:35 PST Subject: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST Message-ID: <890323-113527-4949@Xerox> One further note about Ananth Sankar's questions about Kohonen maps: A friend of mine, Tony Bell, tells me (and Ananth) that Helge Ritter has a "neat set of expressions for the learning rate and neighbourhood size parameters... and he also proves something about convergence elsewhere." Unfortunately, I do not as yet have a reference for the papers, but I have liked Ritter's work in the past, so I thought people on the net might be interested. From jose at tractatus.bellcore.com Wed Mar 22 10:44:09 1989 From: jose at tractatus.bellcore.com (Stephen J Hanson) Date: Wed, 22 Mar 89 10:44:09 EST Subject: technical report available Message-ID: <8903221544.AA14583@tractatus.bellcore.com> Princeton Cognitive Science Lab Technical Report: CSL36, February, 1989. COMPARING BIASES FOR MINIMAL NETWORK CONSTRUCTION WITH BACK-PROPAGATION Stephen Jos'e Hanson Bellcore and Princeton Cognitive Science Laboratory and Lorien Y. Pratt Rutgers University ABSTRACT Rumelhart (1987) has proposed a method for choosing minimal or "simple" representations during learning in Back-propagation networks.
This approach can be used to (a) dynamically select the number of hidden units, (b) construct a representation that is appropriate for the problem and (c) thus improve the generalization ability of Back-propagation networks. The method Rumelhart suggests involves adding penalty terms to the usual error function. In this paper we introduce Rumelhart's minimal networks idea and compare two possible biases on the weight search space. These biases are compared in both simple counting problems and a speech recognition problem. In general, the constrained search does seem to minimize the number of hidden units required, with an expected increase in local minima. To appear in Advances in Neural Information Processing Systems, D. Touretzky, Ed., 1989. Research was jointly sponsored by Princeton CSL and Bellcore. REQUESTS FOR THIS TECHNICAL REPORT SHOULD BE SENT TO laura at clarity.princeton.edu Please do not reply to this message or forward. Thank you. From lwyse at bucasb.BU.EDU Tue Mar 21 13:59:02 1989 From: lwyse at bucasb.BU.EDU (lwyse@bucasb.BU.EDU) Date: Tue, 21 Mar 89 13:59:02 EST Subject: questions on kohonen's maps In-Reply-To: connectionists@c.cs.cmu.edu's message of 20 Mar 89 23:47:09 GMT Message-ID: <8903211859.AA04927@cochlea.bu.edu> What does "ordering" mean when you're projecting inputs to a lower dimensional space? For example, for the "Peano"-type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain. In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes.
-lonce From gblee at CS.UCLA.EDU Fri Mar 24 13:25:07 1989 From: gblee at CS.UCLA.EDU (Geunbae Lee) Date: Fri, 24 Mar 89 10:25:07 PST Subject: questions on kohonen's maps Message-ID: <8903241825.AA25252@maui.cs.ucla.edu> >What does "ordering" mean when you're projecting inputs to a lower dimensional >space?
It means topological ordering. >For example, for the "Peano"-type curves that result from a one-D >neighborhood learning a 2-D input distribution, it is obviously NOT >true that nearby points in the input space maximally activate nearby >points on the neighborhood chain. It depends on what you mean by "nearby". If it is nearby in a relative sense (a topological relation), not an absolute sense, then nearby points in the input space DO maximally activate nearby points on the neighborhood chain. --Geunbae Lee AI Lab, UCLA From LIN2 at ibm.com Fri Mar 24 15:02:32 1989 From: LIN2 at ibm.com (Ralph Linsker) Date: 24 Mar 89 15:02:32 EST Subject: Technical report available Message-ID: <032489.150233.lin2@ibm.com> ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* The following report (IBM Research Report RC 14195, Nov. 1988) is available upon request to: lin2 @ ibm.com It will appear in: Advances in Neural Information Processing Systems 1, ed. D. S. Touretzky (San Mateo, CA: Morgan Kaufmann), April 1989. "An Application of the Principle of Maximum Information Preservation to Linear Systems," Ralph Linsker This paper addresses the problem of determining the weights for a set of linear filters (model "cells") so as to maximize the ensemble-averaged information that the cells' output values jointly convey about their input values, given the statistical properties of the ensemble of input vectors. The quantity that is maximized is the Shannon information rate, or equivalently the average mutual information between input and output.* Several models for the role of processing noise are analyzed, and the biological motivation for considering them is described.
For simple models in which nearby input signal values (in space or time) are correlated, the cells resulting from this optimization process include center-surround cells and cells sensitive to temporal variations in input signal. *The possible relation between this optimization principle and the organization of a sensory processing system is discussed in: R. Linsker, Computer 21(3):105-117 (March 1988). If you would like a reprint of the Computer article, please so note. From chrisley.pa at Xerox.COM Fri Mar 24 17:53:00 1989 From: chrisley.pa at Xerox.COM (chrisley.pa@Xerox.COM) Date: 24 Mar 89 14:53 PST Subject: questions on kohonen's maps In-Reply-To: lwyse@bucasb.BU.EDU's message of Tue, 21 Mar 89 13:59:02 EST Message-ID: <890324-145332-8519@Xerox> Lonce (lwyse at bucasb.BU.EDU) writes: "What does "ordering" mean when your projecting inputs to a lower dimensional space? For example, the "Peano" type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain." It is not true that nearby points in input space are always mapped to nearby points in the output space when the mapping is dimensionality reducing, agreed. But 'ordering' still makes sense. The map is topology-preserving if the dependency is in the other direction, i.e., if nearby points in output space are always activated by nearby points in input space. Lonce goes on to say: "In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes." I agree that topology preservation is not necessarily of utmost importance, but it may be useful in some applications, such as the ones I mentioned a few messages back (phoneme recognition, inverse kinematics, etc.).
Also, there is 1) the interest in properties of self-organizing systems in themselves, even though an application can't be immediately found; and 2) the observation that for some reason the brain seems to use topology preserving maps (with the one-way dependency I mentioned above), which, although they *could* be computationally unnecessary or even disadvantageous, are probably in fact, nature being what she is, good solutions to tough real time problems. Ron Chrisley After April 14th, please send personal email to Chrisley at vax.ox.ac.uk From ken at phyb.ucsf.EDU Sun Mar 26 01:17:59 1989 From: ken at phyb.ucsf.EDU (Ken Miller) Date: Sat, 25 Mar 89 22:17:59 pst Subject: Normalization of weights in Kohonen algorithm Message-ID: <8903260617.AA08352@phyb> re point 3 of recent posting about Kohonen algorithm: "3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights." the algorithm

du_{ij}/dt = a(t)[e_j(t) - u_{ij}(t)], i in N_c

where u = weights, e is the input pattern, and N_c is the topological neighborhood of the maximally responding cell, should, I believe, be written

du_{ij}/dt = a(t)[ e_j(t)/\sum_k(e_k(t)) - u_{ij}(t)/\sum_k(u_{ik}(t)) ], i in N_c.

That is, the change should be such as to move the jth synaptic weight on the ith cell, as a PROPORTION of all the synaptic weights on the ith cell, in the direction of matching the PROPORTION of input which was incoming on the jth line. Note that in this case \sum_j du_{ij}(t)/dt = 0, so weights remain normalized in the sense that the sum over each cell remains constant. If inputs are normalized to sum to 1 (\sum_k(e_k(t)) = 1) then the first denominator can be omitted. If weights begin normalized to sum to 1 on each cell ( \sum_k(u_{ik}(t)) = 1 for all i) then weights will remain normalized to sum to 1, hence the second denominator can be omitted. Perhaps Kohonen was assuming these normalizations and hence dispensing with the denominators?
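The normalized update Ken Miller suggests above can be written out directly. The sketch below is my own rendering (discrete time steps, list-of-lists weights; the function name is invented for illustration); it exists only to check the conservation property he points out, namely that for any cell in the neighborhood the proportional changes sum to zero over j, so each cell's total weight is unchanged:

```python
def normalized_som_step(u, e, a, neighborhood):
    """One discrete step of the update suggested above:

        du_ij/dt = a(t) [ e_j / sum_k e_k  -  u_ij / sum_k u_ik ],  i in N_c

    u: list of weight rows (one row per cell), e: input pattern,
    a: gain, neighborhood: set of cell indices in N_c.
    Returns the updated weights; cells outside N_c are unchanged.
    """
    e_sum = sum(e)
    new_u = []
    for i, row in enumerate(u):
        if i in neighborhood:
            r_sum = sum(row)
            # sum_j of the bracketed term is (1 - 1) = 0, so the
            # per-cell weight sum is conserved exactly.
            new_u.append([u_ij + a * (e_j / e_sum - u_ij / r_sum)
                          for u_ij, e_j in zip(row, e)])
        else:
            new_u.append(list(row))
    return new_u
```

If both the inputs and each cell's weights start normalized to sum to 1, the two denominators reduce to 1 and the rule collapses to the form in Kohonen's book, which is presumably the assumption Miller conjectures.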
ken miller (ken at phyb.ucsf.edu) From nowlan at ai.toronto.edu Tue Mar 28 09:41:36 1989 From: nowlan at ai.toronto.edu (Steven J. Nowlan) Date: Tue, 28 Mar 89 09:41:36 EST Subject: training time in HMM and CN Message-ID: <89Mar28.094139est.10529@ephemeral.ai.toronto.edu> Two comments on Thanasis' post on the relative training speed of HMM vs CN for sequential problems such as speech recognition: 1. The BF algorithm is quite highly optimized, while vanilla BP doesn't implement anything that a numerical analyst would consider a real descent procedure (not even steepest descent). If you were to use a reasonably powerful numerical optimization technique, such as one of the Broyden methods, you may find CN convergence extremely fast. Ray Watrous has in fact shown this sort of speedup for speech problems [1]. 2. A more subtle, but probably more important, difference is the issue of how targets are specified over an input sequence. The BF algorithm specifies targets for intermediate steps in an input sequence based on expectations of the final outcome of that sequence collected from many similar sequences. It is not clear how to specify output targets for intermediate points of an input sequence in a CN, although Watrous has shown that intelligent choice of such targets can markedly improve CN convergence and performance. Of interest in this regard is the work by Sutton on Temporal Difference methods [2]. One can view this work as specifying a target function over a sequence in a dynamical way, so that the target function reflects the experience of the system to date in a clever way. Sutton [2] has shown an equivalence between one form of linear TD method and the maximum likelihood estimates of the parameters for an absorbing Markov chain model of the same process. This seems much closer in flavour to what the BF algorithm is doing, and when applied to a non-linear system may in fact be an interesting generalization of BF.
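As a footnote to the cross-entropy thread running through these messages (the Solla/Levin/Fleisher comparison of cross entropy and square error, and Hinton's remark that likelihood is maximized when cross entropy is minimized): for a single sigmoid output unit, a standard derivation shows one reason cross entropy can accelerate learning. The sketch below is illustrative and not taken from any of the cited reports:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_squared_error(z, t):
    """d/dz of 0.5*(y - t)^2 for y = sigmoid(z).

    The chain rule introduces a y*(1 - y) factor, which vanishes when
    the unit saturates -- even if the unit is completely wrong.
    """
    y = sigmoid(z)
    return (y - t) * y * (1.0 - y)

def grad_cross_entropy(z, t):
    """d/dz of -[t*log(y) + (1-t)*log(1-y)] for y = sigmoid(z).

    The y*(1 - y) factor cancels, leaving simply (y - t), so the
    gradient stays large whenever the output is far from the target.
    """
    y = sigmoid(z)
    return y - t

# A saturated, badly wrong unit (net input z = -5, target t = 1):
# the squared-error gradient is tiny, the cross-entropy gradient is not.
g_se = grad_squared_error(-5.0, 1.0)
g_ce = grad_cross_entropy(-5.0, 1.0)
```

This is one concrete way of seeing the "accelerated learning" effect studied numerically in the Solla, Levin and Fleisher paper mentioned earlier in the thread.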
Comments and requests for clarifications should be directed to me, not to Connectionists, please. - Steve Nowlan nowlan at ai.toronto.edu References: [1] Watrous, Raymond L. "Speech Recognition Using Connectionist Networks", TR MS-CIS-88-96, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, 1988. [2] Sutton, Richard S. "Learning to Predict by the Methods of Temporal Differences", GTE Technical Report TR87-509.1, GTE Laboratories Inc., Waltham, Mass., 1987. From cfields at NMSU.Edu Tue Mar 28 19:56:24 1989 From: cfields at NMSU.Edu (cfields@NMSU.Edu) Date: Tue, 28 Mar 89 17:56:24 MST Subject: No subject Message-ID: <8903290056.AA14581@NMSU.Edu> Call for Participants / Call for Abstracts Symbolic Problem Solving in Noisy, Novel, and Uncertain Task Environments 20-21 August, 1989 (tentative), Detroit, MI, USA An IJCAI-89 Workshop, Sponsored by AAAI Goals. Brittleness in the face of noise, novelty, and uncertainty is a well-known failing of symbolic problem solvers. The goals of this Workshop are to characterize the features of task environments that cause brittleness, to investigate mechanisms for decreasing the brittleness of symbolic problem solvers, and to review case histories of implemented systems that function in task environments high in noise, novelty, and data of uncertain relevance. Topics of interest for the Workshop include the following. Analysis of task environments: Definitions of noise, novelty, and uncertain relevance; exploration of related concepts in general systems theory or logic; parameters for characterizing task environments; knowledge engineering strategies. Mechanisms for addressing noise and novelty: Plasticity and learning; constructive problem solving; fragmentation of knowledge structures; dynamic modification of rules, schemata, or cases; coherence maintenance; adaptive control mechanisms.
Representations: Data structures allowing dynamic abstraction and modification; representation of ``unstructured'' knowledge; knowledge implicit in control or learning procedures; ordering of knowledge structures; tradeoffs between explicit and implicit knowledge representation. Implementation issues: Implementing symbolic problem solvers on parallel machines; concurrency control strategies; integrating symbolic systems with artificial neural networks; general systems integration. Researchers interested in participating in the Workshop are invited to submit abstracts describing work in any of these topic areas. Format. All participants will present their current work, either as a brief oral report or as a poster. Most presentations will be posters, as these provide the greatest opportunity for presentation and discussion of technical details. Presentations will be on the first day of the Workshop, followed by discussions in working groups organized by application domain and a panel discussion on the second day. Attendance at IJCAI Workshops is limited to fifty participants. Participants not registered for IJCAI must pay a $50/day fee. Abstract Submission. Please submit a 1 page abstract of the work to be presented, together with a cover letter summarizing previous work in relevant areas and expected contribution to the Workshop, to Mike Coombs, Box 30001/3CRL, New Mexico State University, Las Cruces, NM 88003-0001 USA, by 15 May 1989. Authors will be notified as to acceptance by 1 June 1989. Accepted abstracts will be distributed at the Workshop. A volume collecting selected papers from the Workshop is planned; papers for this volume will be solicited at the Workshop. Organizers. Mike Coombs and Chris Fields (NMSU), Russ Frew (GE), David Goldberg (Alabama), Jim Reggia (Maryland). Points of contact: Mike Coombs, 505-646-5757, mcoombs at nmsu.edu; Chris Fields, 505-646-2848, cfields at nmsu.edu. 
From elman%amos at ucsd.edu Wed Mar 29 00:30:44 1989 From: elman%amos at ucsd.edu (Jeff Elman) Date: Tue, 28 Mar 89 21:30:44 PST Subject: 1990 Connectionist Summer School announcement Message-ID: <8903290530.AA23241@amos.UCSD.EDU> March 28, 1989 PRELIMINARY ANNOUNCEMENT CONNECTIONIST SUMMER SCHOOL / SUMMER 1990 UCSD La Jolla, California The next Connectionist Summer School will be held at the University of California, San Diego in June 1990. This will be the third session in the series, which was held at Carnegie-Mellon in the summers of 1986 and 1988. The summer school will offer courses in a variety of areas of connectionist modelling, with emphasis on computational neuroscience, cognitive models, and hardware implementation. In addition to full courses, there will be a series of shorter tutorials, colloquia, and public lectures. Proceedings of the summer school will be published the following fall. As in the past, participation will be limited to graduate students enrolled in PhD programs (full- or part-time). Admission will be on a competitive basis. We hope to have sufficient funding to subsidize tuition and housing. THIS IS A PRELIMINARY ANNOUNCEMENT. Further details will be announced over the next several months. Terry Sejnowski Jeff Elman UCSD/Salk UCSD Geoff Hinton Dave Touretzky Toronto CMU hinton at ai.toronto.edu touretzky at cs.cmu.edu From niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Wed Mar 29 09:17:49 1989 From: niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (Mahesan Niranjan) Date: Wed, 29 Mar 89 09:17:49 BST Subject: Missing link etc... Message-ID: <23751.8903290817@dsl.eng.cam.ac.uk> Some recent papers and postings on this network compare HMMs and multi-layer neural networks. Here is something I find missing in these discussions. In speech pattern processing, HMMs make an inherent assumption about the time series: that it can be chopped up into a sequence of piecewise stationary regions.
Thus, an HMM places break-points in the transition regions of the signal and models the steady regions by the statistical parameters of individual states. For speech signals, this is a bad assumption (human speech production is not at all like this) - but the recognisers somehow seem to work!! In neural networks (with or without feedback), what is the equivalent assumption about the time evolution of the signal? niranjan From ersoy at ee.ecn.purdue.edu Wed Mar 29 12:22:20 1989 From: ersoy at ee.ecn.purdue.edu (Okan K Ersoy) Date: Wed, 29 Mar 89 12:22:20 EST Subject: No subject Message-ID: <8903291722.AA07623@ee.ecn.purdue.edu> CALL FOR PAPERS AND REFEREES HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES - 23 NEURAL NETWORKS AND RELATED EMERGING TECHNOLOGIES KAILUA-KONA, HAWAII - JANUARY 3-6, 1990 The Neural Networks Track of HICSS-23 will contain a special set of papers focusing on a broad selection of topics in the area of Neural Networks and Related Emerging Technologies. The presentations will provide a forum to discuss new advances in learning theory, associative memory, self-organization, architectures, implementations and applications. Papers are invited that may be theoretical, conceptual, tutorial or descriptive in nature. Those papers selected for presentation will appear in the Conference Proceedings, which is published by the Computer Society of the IEEE. HICSS-23 is sponsored by the University of Hawaii in cooperation with the ACM, the Computer Society, and the Pacific Research Institute for Information Systems and Management (PRIISM). Submissions are solicited in: Supervised and Unsupervised Learning Associative Memory Self-Organization Architectures Optical, Electronic and Other Novel Implementations Optimization Signal/Image Processing and Understanding Novel Applications INSTRUCTIONS FOR SUBMITTING PAPERS Manuscripts should be 22-26 typewritten, double-spaced pages in length. Do not send submissions that are significantly shorter or longer than this.
Papers must not have been previously presented or published, nor currently submitted for journal publication. Each manuscript will be put through a rigorous refereeing process. Manuscripts should have a title page that includes the title of the paper, full name of its author(s), affiliation(s), complete physical and electronic address(es), telephone number(s) and a 300-word abstract of the paper. DEADLINES Six copies of the manuscript are due by June 10, 1989. Notification of accepted papers by September 1, 1989. Accepted manuscripts, camera-ready, are due by October 3, 1989. SEND SUBMISSIONS AND QUESTIONS TO O. K. Ersoy H. H. Szu Purdue University Naval Research Laboratories School of Electrical Engineering Code 5709 W. Lafayette, IN 47907 4555 Overlook Ave., SE (317) 494-6162 Washington, DC 20375 E-Mail: ersoy at ee.ecn.purdue (202) 767-2407 From lina at wheaties.ai.mit.edu Wed Mar 29 13:23:33 1989 From: lina at wheaties.ai.mit.edu (Lina Massone) Date: Wed, 29 Mar 89 13:23:33 EST Subject: No subject Message-ID: <8903291823.AA09549@gelatinosa.ai.mit.edu> ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* TECHNICAL REPORT AVAILABLE A NEURAL NETWORK MODEL FOR LIMB TRAJECTORY FORMATION Lina Massone and Emilio Bizzi Dept. of Brain and Cognitive Sciences Massachusetts Institute of Technology This paper deals with the problem of representing and generating unconstrained aiming movements of a limb by means of a neural network architecture. The network produced a time trajectory of a limb from a starting posture toward a target specified by a sensory stimulus. Thus the network performed a sensory-motor transformation. The experimenters imposed a bell-shaped velocity profile on the trajectory. This type of profile is characteristic of most movements performed by biological systems. We investigated the generalization capabilities of the network as well as its internal organization.
Experiments performed during learning and on the trained network showed that: (i) the task could be learned by a three-layer sequential network; (ii) the network successfully generalized in trajectory space and adjusted the velocity profiles properly; (iii) the same task could not be learned by a linear network; (iv) after learning, the internal connections became organized into inhibitory and excitatory zones and encoded the main features of the training set; (v) the model was robust to noise on the input signals; (vi) the network exhibited attractor-dynamics properties; (vii) the network was able to solve the motor-equivalence problem. A key feature of this work is the fact that the neural network was coupled to a mechanical model of a limb in which muscles are represented as springs. With this representation the model solved the problem of motor redundancy. A short version of this paper covering only part of the described research was mailed in February to IJCNN. The full report has been submitted to Biological Cybernetics. All requests should be addressed to: lina at wheaties.ai.mit.edu From marchman%amos at ucsd.edu Wed Mar 29 19:20:36 1989 From: marchman%amos at ucsd.edu (Virginia Marchman) Date: Wed, 29 Mar 89 16:20:36 PST Subject: Technical Report Available Message-ID: <8903300020.AA01129@amos.UCSD.EDU> The following Technical Report (#8902) is available from the Center for Research in Language. (Please do not forward.) ******************************************************************* Pattern Association in a Back Propagation Network: Implications for Child Language Acquisition Kim Plunkett Virginia Marchman University of Aarhus, Denmark University of California, San Diego Abstract A 3-layer back propagation network is used to implement a pattern association task which learns mappings that are analogous to the present and past tense forms of English verbs, i.e., arbitrary, identity, vowel change, and suffixation mappings. 
The degree of correspondence between connectionist models of tasks of this type (Rumelhart & McClelland, 1986; 1987) and children's acquisition of inflectional morphology has recently been highlighted in discussions of the general applicability of PDP to the study of human cognition and language (Pinker & Mehler, 1988). In this paper, we attempt to eliminate many of the shortcomings of the R&M work and adopt an empirical, comparative approach to the analysis of learning (i.e., hit rate and error type) in these networks. In all of our simulations, the network is given a constant 'diet' of input stems -- that is, discontinuities are not introduced into the learning set at any point. Four sets of simulations are described in which input conditions (class size and token frequency) and the presence/absence of phonological subregularities are manipulated. First, baseline simulations chart the initial computational constraints of the system and reveal complex "competition effects" when the four verb classes must be learned simultaneously. Next, we explore the nature of these competitions given different type (class sizes) and token frequencies (# of repetitions). Several hypotheses about input to children are tested, from dictionary counts and production corpora. Results suggest that relative class size determines which "default" transformation is employed by the network, as well as the frequency of overgeneralization errors (both "pure" and "blended" overgeneralizations). A third series of simulations manipulates token frequency within a constant class size, searching for the set of token frequencies which results in "adult-like competence" and "child-like" errors across learning. A final series investigates the addition of phonological sub-regularities into the identity and vowel change classes. Phonological cues are clearly exploited by the system, leading to overall improved performance. 
However, overgeneralizations, U-shaped learning and competition effects continue to be observed in similar conditions. These models establish that input configuration plays a role in determining the types of errors produced by the network - including the conditions under which "rule-like" behavior and "U-shaped" development will and will not emerge. The results are discussed with reference to behavioral data on children's acquisition of the past tense and the validity of drawing conclusions about the acquisition of language from models of this sort. ***************************************************************** Please send requests for hard copy to: yvonne at amos.ucsd.edu or Center for Research in Language C-008 University of California, San Diego La Jolla, CA 92093 Attn: Yvonne -- Virginia Marchman (marchman at amos.ucsd.edu) Kim Plunkett (psykimp at dkarh02.bitnet) From sankar at caip.rutgers.edu Fri Mar 31 15:14:12 1989 From: sankar at caip.rutgers.edu (ananth sankar) Date: Fri, 31 Mar 89 15:14:12 EST Subject: KOHONEN MAPS Message-ID: <8903312014.AA03080@caip.rutgers.edu> I had initiated a discussion on Kohonen's maps two weeks ago, and apart from the many replies I (and many others??) received, there were requests that I post the responses. It would be a good idea to go through this material and then discuss again.
>From pastor at prc.unisys.com Thu Mar 16 16:58:47 1989 Received: from PRC-GW.PRC.UNISYS.COM by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA03401; Thu, 16 Mar 89 16:58:40 EST Received: from bigburd.PRC.Unisys.COM by burdvax.PRC.Unisys.COM (5.61/Domain/jpb/2.9) id AA11739; Thu, 16 Mar 89 16:58:28 -0500 Received: by bigburd.PRC.Unisys.COM (5.61/Domain/jpb/2.9) id AA24449; Thu, 16 Mar 89 16:58:23 -0500 From: pastor at prc.unisys.com (Jon Pastor) Message-Id: <8903162158.AA24449 at bigburd.PRC.Unisys.COM> Received: from Xerox143 by bigburd.PRC.Unisys.COM with PUP; Thu, 16 Mar 89 16:58 EST To: ananth sankar Date: 16 Mar 89 16:56 EST (Thursday) Subject: Re: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST To: ananth sankar Cc: pastor at bigburd.prc.unisys.com Status: R I am in the process of implementing a Kohonen-style system, and if I actually get it running and obtain any results I'll let you know. If you get any responses, please let me know. Thanks. >From Connectionists-Request at q.cs.cmu.edu Thu Mar 16 16:59:58 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA03426; Thu, 16 Mar 89 16:59:52 EST Received: from cs.cmu.edu by Q.CS.CMU.EDU id aa11454; 16 Mar 89 9:44:34 EST Received: from CAIP.RUTGERS.EDU by CS.CMU.EDU; 16 Mar 89 09:42:55 EST Received: by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA14983; Thu, 16 Mar 89 09:42:44 EST Date: Thu, 16 Mar 89 09:42:44 EST From: ananth sankar Message-Id: <8903161442.AA14983 at caip.rutgers.edu> To: connectionists at cs.cmu.edu Subject: questions on kohonen's maps Status: R I am interested in the subject of Self Organization and have some questions with regard to Kohonen's algorithm for Self Organizing Feature Maps. I have tried to duplicate the results of Kohonen for the two dimensional uniform input case i.e. two inputs. I used a 10 X 10 output grid. The maps that resulted were not as good as reported in the papers. 
Questions: 1 Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach. 2 Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features. 3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation. 4 I have not yet seen in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited. 5 Can the net become disordered after ordering is achieved at any particular iteration? I would appreciate any comments, suggestions etc. on the above. Also, so that net mail clutter may be reduced, please respond to sankar at caip.rutgers.edu Thank you. Ananth Sankar Department of Electrical Engineering Rutgers University, NJ >From regier at cogsci.berkeley.edu Thu Mar 16 17:07:20 1989 Received: from cogsci.Berkeley.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA03562; Thu, 16 Mar 89 17:07:16 EST Received: by cogsci.berkeley.edu (5.61/1.29) id AA13666; Thu, 16 Mar 89 14:07:18 -0800 Date: Thu, 16 Mar 89 14:07:18 -0800 From: regier at cogsci.berkeley.edu (Terry Regier) Message-Id: <8903162207.AA13666 at cogsci.berkeley.edu> To: sankar at caip.rutgers.edu Subject: Kohonen request Status: R Hi, I'm interested in the responses to your recent Kohonen posting on Connectionists. Do you suppose you could post the results once all the replies are in?
Thanks, -- Terry >From ken at phyb.ucsf.edu Thu Mar 16 20:11:35 1989 Received: from cgl.ucsf.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA09101; Thu, 16 Mar 89 20:11:32 EST Received: from phyb.ucsf.EDU by cgl.ucsf.EDU (5.59/GSC4.15) id AA01036; Thu, 16 Mar 89 17:11:23 PST Received: by phyb (1.2/GSC4.15) id AA11601; Thu, 16 Mar 89 17:11:17 pst Date: Thu, 16 Mar 89 17:11:17 pst From: ken at phyb.ucsf.edu (Ken Miller) Message-Id: <8903170111.AA11601 at phyb> To: sankar at caip.rutgers.edu Subject: kohonen Status: R re your point 3: the algorithm du_{ij}/dt = a(t)[e_j(t) - u_{ij}(t)], i in N_c where u = weights, e is input pattern, N_c is topological neighborhood of the maximally responding unit, should actually be written du_{ij}/dt = a(t)[ e_j(t)/\sum_k(e_k(t)) - u_{ij}(t)/\sum_k(u_{ik}(t)) ], i in N_c. That is, the change should be such as to move the jth synaptic weight on the ith cell, as a PROPORTION of all the synaptic weights on the ith cell, in the direction of matching the PROPORTION of input which was incoming on the jth line. Note that in this case \sum_j du_{ij}(t)/dt = 0, so weights remain normalized in the sense that the sum over each cell remains constant. If you normalize your inputs to sum to 1 (\sum_k(e_k(t)) = 1) and start with weights normalized to sum to 1 on each cell (\sum_k(u_{ik}(t)) = 1 for all i) then weights will remain normalized to sum to 1, hence the two sums in the denominators are both just = 1 and can be left out. Kohonen was, I believe, assuming these normalizations and hence dispensing with the sums. ken miller (ken at phyb.ucsf.edu) ucsf dept.
of physiology >From tds at wheaties.ai.mit.edu Thu Mar 16 23:26:42 1989 Received: from life.ai.mit.edu by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA12489; Thu, 16 Mar 89 23:26:39 EST Received: from mauriac.ai.mit.edu by life.ai.mit.edu; Thu, 16 Mar 89 22:48:15 EST Received: from localhost by mauriac.ai.mit.edu; Thu, 16 Mar 89 22:48:06 est Date: Thu, 16 Mar 89 22:48:06 est From: tds at wheaties.ai.mit.edu Message-Id: <8903170348.AA19015 at mauriac.ai.mit.edu> To: sankar at caip.rutgers.edu Subject: Kohonen maps Status: R I share some of your confusion about Kohonen maps. My main question is #4: are they really doing anything useful? The mapping demonstrated in Kohonen's 1982 paper (Biol. Cyb.) only shows mappings from a 2D manifold in 3-space onto a two-dimensionally arranged set of units. The book talks about dimensionality issues in more detail, but so far as I can tell what the network does (after training) is to map three numbers into about 100 numbers. Since the mapping is linear, I don't see how anything at all is gained. If the network is unable to generate an ordering, it may be one way to tell if the data does not lie on a 2D manifold. But there are many other ways to do this that are more efficient! Also, this is not robust if the manifold folds back on itself (so that two distinct points on the surface are in the same direction from the origin). 
Let me know if you find out the true significance of this widely-known work, Terry >From lwyse at bucasb.bu.edu Fri Mar 17 17:42:18 1989 Received: from BU-IT.BU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA05821; Fri, 17 Mar 89 17:42:12 EST Received: from COCHLEA.BU.EDU by bu-it.BU.EDU (5.58/4.7) id AA17739; Fri, 17 Mar 89 17:38:02 EST Received: by cochlea.bu.edu (4.0/4.7) id AA02692; Fri, 17 Mar 89 17:38:21 EST Date: Fri, 17 Mar 89 17:38:21 EST From: lwyse at bucasb.bu.edu Message-Id: <8903172238.AA02692 at cochlea.bu.edu> To: sankar at caip.rutgers.edu Subject: re:questions on Kohonen maps Status: R I would be surprised if there were some analytical expression for the neighborhood and gain functions that was useful in practical applications. I have found different "best functions" for different input vector distributions, initial weight distributions, etc. A related question to yours: What does "ordering" mean when mapping across different dimensional spaces? An excerpt from a report on my experiences with Kohonen maps: When the input space and the neighborhood space of the weight vectors are of different dimension, however, what "ordered" means becomes a sticky wicket. For example, in Fig. 5.17, Kohonen shows a one-dimensional neighborhood of weight vectors approximating a triangular distribution of inputs with what he terms a "Peano-like" curve. But this type of curve folds in on itself in an attempt to fill the space, and thus moves points that may be far from each other in their one-D neighborhood to be maximally responsive to very close input points. Is this "ordered"? He doesn't seem to address this point directly. A point I would like to bring out is that in these situations where the dimension of the input space and the dimension of the neighborhood differ, whether or not the weight-vector chain crosses itself is {\em not} necessarily the important metric for measuring the ability of the weights to approximate the input space.
That is, there is not necessarily a correlation between neighborhood-chain crossings and the mean squared error of the weight vector approximations of the input points. It is true, however, that if the neighborhood chain crosses itself, then {\em there exists} a better approximation to the input space. -lonce >From risto at cs.ucla.edu Sat Mar 18 02:59:46 1989 Received: from Oahu.CS.UCLA.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA14191; Sat, 18 Mar 89 02:59:35 EST Return-Path: Received: by oahu.cs.ucla.edu (Sendmail 5.59/2.16) id AA02486; Fri, 17 Mar 89 23:14:45 PST Date: Fri, 17 Mar 89 23:14:45 PST From: risto at cs.ucla.edu (Risto Miikkulainen) Message-Id: <8903180714.AA02486 at oahu.cs.ucla.edu> To: sankar at caip.rutgers.edu In-Reply-To: ananth sankar's message of Thu, 16 Mar 89 09:42:44 EST <8903161442.AA14983 at caip.rutgers.edu> Subject: questions on kohonen's maps Reply-To: risto at cs.ucla.edu Organization: UCLA Computer Science Department Physical-Address: 3677 Boelter Hall Status: R Date: Thu, 16 Mar 89 09:42:44 EST From: ananth sankar 1 Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach. The trick is to start with a neighborhood large enough. For 10x10, a radius of 8 units might be appropriate. Then reduce the radius gradually (e.g. over a few thousand inputs) to 1 or even to 0. 3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation. True. The original idea was to form the "activity bubble" with lateral inhibition and change the weights by "redistribution of synaptic resources".
This neurologically plausible algorithm gave way to an abstraction which uses distance, global selection and difference. (I did some work comparing these two algorithms; I can send you the tech report if you want to look at it. At least it has the parameters that work.) 5 Can the net become disordered after ordering is achieved at any particular iteration? Kohonen proved (in ch 5) that this cannot happen (in the 1-d case) for the abstract algorithm. This is a big problem for the biologically plausible algorithm though. >From djb at flash.bellcore.com Sat Mar 18 23:38:41 1989 Received: from flash.bellcore.com by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA27190; Sat, 18 Mar 89 23:38:32 EST Received: by flash.bellcore.com (5.58/1.1) id AA06742; Sat, 18 Mar 89 23:38:10 EST Date: Sat, 18 Mar 89 23:38:10 EST From: djb at flash.bellcore.com (David J Burr) Message-Id: <8903190438.AA06742 at flash.bellcore.com> To: sankar at caip.rutgers.edu Subject: Feature Map Learning Status: R Your questions regarding the feature map algorithm are ones that have also concerned me. I have been experimenting with a form of this elastic mapping algorithm since about 1979. My early experiments were focussed on using such an adaptive process to map handwritten characters onto reference characters in an attempt to automate a form of elastic template matching. The algorithm I came up with was one which used nearest neighbor "attractors" to "pull" an elastic map into shape by an iterative process. I defined a window or smoothing kernel which had a Gaussian shape, as opposed to the box shape commonly used in self-organized mapping. My algorithm resembled the Kohonen feature map classifier that you referred to in your email. The Gaussian kernel has advantages over the box kernel in that aliasing distortion can be reduced. This is similar to the use of Hamming windows in the design of fast Fourier transforms.
With regard to your first and second questions, we have found that the actual window size and gain parameters can take on a number of different schedule shapes and give similar results. It is important that window size decrease very gradually to avoid too early commitment to a particular vector. This is particularly important in the mapping of highly distorted characters, where a rapid schedule could cause a feature in one character to map to the "wrong" feature in the reference character. Gaussian windows were the choice for that problem, since they guaranteed very smooth maps. You are right that a parameter schedule that works for one problem may be poorly suited to a different problem. We have recently applied the feature map model to the traveling salesman problem and reported some of our results at ICNN-88. A one-dimensional version of the elastic map (a rubber band) seems best suited to this problem. We found that there was a particular analytic form of the gain schedule which worked well for this problem. Window size, on the other hand, seemed to benefit best from a feedback schedule in which the degree of progress toward the solution served as input to set an appropriate window size. I have results studying some 700 different learning trials on 30-100 city problems using this method. Performance is considerably better than the Hopfield-Tank solution. Yes, it seems as though one needs distance calculation as the input for this model, rather than the dot product as used in back-propagation nets. I would be happy to mail you some papers describing my implementation of the feature map learning model. The first article appeared in Computer Graphics and Image Processing Journal, 1981, entitled "A Dynamic Model for Image Registration". The recent work on traveling salesman was also reported at last year's Snowbird meeting in addition to ICNN-88. Please feel free to correspond with me as I consider this a very interesting topic. Best Wishes, D. J.
Burr djb at bellcore.com >From @relay.cs.net:tony at ifi.unizh.ch Mon Mar 20 03:12:51 1989 Received: from RELAY.CS.NET by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA02795; Mon, 20 Mar 89 03:12:46 EST Received: from relay2.cs.net by RELAY.CS.NET id ab08738; 20 Mar 89 4:55 EST Received: from switzerland by RELAY.CS.NET id ae29120; 20 Mar 89 4:48 EST Received: from ean by scsult.SWITZERLAND.CSNET id a011717; 20 Mar 89 9:45 WET Date: 19 Mar 89 21:45 +0100 From: tony bell To: sankar at caip.rutgers.edu Mmdf-Warning: Parse error in original version of preceding line at SWITZERLAND.CSNET Message-Id: <342:tony at ifi.unizh.ch> Subject: Top Maps Status: R You should see Ritter & Schulten's paper in the IEEE ICNN proceedings 1988 (San Diego) for expressions answering question 1. Another paper from Helge Ritter deals with the convergence properties. This was submitted to Biol. Cybernetics, but maybe you should write to him at the University of Illinois, where he is now. Tony Bell, Univ of Zurich >From djb at flash.bellcore.com Mon Mar 20 17:51:22 1989 Received: from flash.bellcore.com by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA18086; Mon, 20 Mar 89 17:51:14 EST Received: by flash.bellcore.com (5.58/1.1) id AA25760; Mon, 20 Mar 89 17:51:18 EST Date: Mon, 20 Mar 89 17:51:18 EST From: djb at flash.bellcore.com (David J Burr) Message-Id: <8903202251.AA25760 at flash.bellcore.com> To: sankar at caip.rutgers.edu Subject: Self-Organized Mapping Status: R There has been interest on the net recently in some of the questions that you posed in your recent mail. I have personally received comments regarding the neighborhood functions and whether there is an appropriate analytic form. My comments were summarized in my recent mailing to you. If you get additional responses, I would certainly appreciate hearing about peoples' experiences. Would you consider posting a summary to the net? I did not comment on your questions 4 and 5.
It seems that the neighbors-matching-to-neighbors observation comes about as a result rather than an input constraint. In my 1981 paper on elastic matching of images I used a more extended pattern matcher (area template instead of a point-to-point nearest neighbor) for gray scale images. This tended to enforce the constraint that you observed at the input level. Unfortunately, I am not sure what its generalization would be for non-image patterns (N-D instead of 2-D). I have done all my experiments on elastic mapping of fixed patterns as opposed to point distributions. There was no problem of a map being undone after it converged. Have you had such problems with your speech data? I have been told that when the distributions are stochastic or sampled, there is an even stronger need to proceed slowly. Apparently one sampled point can pull the map in one direction, and this must be counterbalanced by opposing samples pulling the other way to maintain stability of the map. This unfortunately takes lots of computer cycles. Hoping to hear from you. Dave Burr >From Connectionists-Request at q.cs.cmu.edu Mon Mar 20 18:01:41 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA18228; Mon, 20 Mar 89 18:01:34 EST Received: from cs.cmu.edu by Q.CS.CMU.EDU id aa23263; 20 Mar 89 14:41:25 EST Received: from XEROX.COM by CS.CMU.EDU; 20 Mar 89 14:39:19 EST Received: from Semillon.ms by ArpaGateway.ms ; 20 MAR 89 11:26:12 PST Date: 20 Mar 89 11:25 PST From: chrisley.pa at xerox.com Subject: Re: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST To: ananth sankar Cc: connectionists at cs.cmu.edu, chrisley.pa at xerox.com Message-Id: <890320-112612-6136 at Xerox> Status: R Ananth Sankar recently asked some questions about Kohonen's feature maps.
As I have worked on these issues with Kohonen, I feel like I might be able to give some answers, but standard disclaimers apply: I cannot be certain that Kohonen would agree with all of the following. Also, I do not have my copy of his book with me, so I cannot be more specific about references. Questions: 1 Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach. Although there is probably more than one correct, task-independent gain or neighborhood function, Kohonen does mention constraints that all of them should meet. For example, both functions should decrease to zero over time. I do not know of any tweaking; Kohonen usually determines a number of iterations and then decreases the gain linearly. If you call this tweaking, then your idea of domain-independent parameters might be a sort of holy grail, since it does not seem likely that we are going to find a completely parameter-free learning algorithm that will work in every domain. 2 Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features. As far as I know, Kohonen has used the same type of gain and neighborhood functions for all of his map demonstrations. These demonstrations, which have been shown via an animated film at several major conferences, demonstrate maps learning the distribution in cases where 1) the dimensionality of the network topology and the input space mismatch, e.g., where the network is 2-D and the distribution is a 3-D 'cactus'; and 2) the distribution is not uniform. The algorithm was developed with these two cases in mind, so it is no surprise that the results are good for them as well. 
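[Editor's note: the linear-decay schedule described above can be sketched concretely. Below is a minimal 1-D Kohonen map in Python; all constants (20 units, 2000 iterations, initial gain 0.5) are illustrative, not Kohonen's own.]

```python
import numpy as np

rng = np.random.default_rng(0)

n_units, n_iters = 20, 2000
w = rng.random((n_units, 2))  # 1-D chain of units, 2-D inputs

for t in range(n_iters):
    x = rng.random(2)  # sample from a uniform 2-D distribution
    c = int(np.argmin(((w - x) ** 2).sum(axis=1)))  # best-matching unit (Euclidean)
    gain = 0.5 * (1 - t / n_iters)  # gain decreases linearly to zero
    radius = int((n_units // 2) * (1 - t / n_iters))  # neighborhood shrinks to zero
    lo, hi = max(0, c - radius), min(n_units, c + radius + 1)
    w[lo:hi] += gain * (x - w[lo:hi])  # move the whole neighborhood toward x

# After training, units adjacent on the chain should hold nearby weight vectors.
```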
3 In Kohonen's book "Self Organization and Associative Memory", Ch. 5, the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation. That's right. And Kohonen usually uses the Euclidean distance metric, although other ones can be used (which he discusses in the book). Furthermore, there have been independent efforts to normalize weights in Kohonen maps so that the dot product measure can be used. If you have any doubts about the suitability of the Euclidean metric, as your question seems to imply, express them. It is an interesting issue. 4 I have not seen as yet in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited. The primary interest in maps, I believe, came from a desire to display high-dimensional information in low-dimensional spaces, which are more easily apprehended. But there is evidence that there are other uses as well: 1) Kohonen has published results on using maps for phoneme recognition, where the topology-preservation plays a significant role (such maps are used in the Otaniemi Phonetic Typewriter featured in, I think, Computer magazine a year or two ago); 2) work has been done on using the topology to store sequential information, which seems to be a good idea if you are dealing with natural signals that can only temporally shift from a state to similar states; 3) several people have followed Kohonen's suggestion of using maps for adaptive kinematic representations for robot control (the work on Murphy, mentioned on this net a month or so ago, and the work being done at Carleton (sp?) University by Darryl Graf are two good examples). In short, just look at some ICNN or INNS proceedings, and you'll find many papers where researchers found Kohonen maps to be a good place from which to launch their own studies. 
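[Editor's note: the distance-versus-dot-product point above can be made concrete. A small sketch (hypothetical sizes and values, not from Kohonen's book): when both the weight vectors and the input are normalized to unit length, ||w - x||^2 = 2 - 2 w.x, so the minimum-distance winner and the maximum-dot-product winner coincide.]

```python
import numpy as np

rng = np.random.default_rng(1)

w = rng.random((10, 4))  # 10 output units, 4 input lines
w /= np.linalg.norm(w, axis=1, keepdims=True)  # normalize each unit's weight vector
x = rng.random(4)
x /= np.linalg.norm(x)  # normalized input pattern

winner_dist = int(np.argmin(np.linalg.norm(w - x, axis=1)))  # Euclidean winner
winner_dot = int(np.argmax(w @ x))  # dot-product winner
# With unit vectors the two selection rules pick the same unit.
```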
5 Can the net become disordered after ordering is achieved at any particular iteration? Of course, this is theoretically possible, and is almost certain if at some point the distribution of the mapped function changes. But this brings up a difficult question: what is the proper ordering in such a case? Should a net try to integrate both past and present distributions, or should it throw away the past and concentrate on the present? I think most nn researchers would want a little of both, with maybe some kind of exponential decay in the weights. But in many applications of maps, there is no chance of the distribution changing: it is fixed, and iterations are over the same test data each time. In this case, I would guess that the ordering could not become disrupted (at least for simple distributions and a net of adequate size), but I realise that there is no proof of this, and the terms 'simple' and 'adequate' are lacking definition. But that's life in nnets for you! If anyone has any more questions, feel free. Ron Chrisley Xerox PARC System Sciences Lab 3333 Coyote Hill Road Palo Alto, CA 94304 USA chrisley.pa at xerox.com tel: (415) 494-4728 OR New College Oxford OX1 3BN UK chrisley at vax.oxford.ac.uk tel: (865) 279-492 >From chrisley.pa at xerox.com Thu Mar 23 15:00:13 1989 Received: from Xerox.COM by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA22224; Thu, 23 Mar 89 15:00:04 EST Received: from Semillon.ms by ArpaGateway.ms ; 23 MAR 89 11:35:27 PST Date: 23 Mar 89 11:35 PST From: chrisley.pa at xerox.com Subject: Re: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST To: ananth sankar Cc: connectionists at cs.cmu.edu Message-Id: <890323-113527-4949 at Xerox> Status: R One further note about Ananth Sankar's questions about Kohonen maps: A friend of mine, Tony Bell, tells me (and Ananth) that Helge Ritter has a "neat set of expressions for the learning rate and neighbourhood size parameters... 
and he also proves something about convergence elsewhere." Unfortunately, I do not as yet have a reference for the papers, but I have liked Ritter's work in the past, so I thought people on the net might be interested. >From Connectionists-Request at q.cs.cmu.edu Fri Mar 24 11:52:18 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA20326; Fri, 24 Mar 89 11:52:13 EST Received: from ri.cmu.edu by Q.CS.CMU.EDU id aa17597; 24 Mar 89 8:48:01 EST Received: from BU-IT.BU.EDU by RI.CMU.EDU; 24 Mar 89 08:41:54 EST Received: from COCHLEA.BU.EDU by bu-it.BU.EDU (5.58/4.7) id AA06449; Tue, 21 Mar 89 13:58:32 EST Received: by cochlea.bu.edu (4.0/4.7) id AA04927; Tue, 21 Mar 89 13:59:02 EST Date: Tue, 21 Mar 89 13:59:02 EST From: lwyse at bucasb.bu.edu Message-Id: <8903211859.AA04927 at cochlea.bu.edu> To: connectionists at ri.cmu.edu In-Reply-To: connectionists at c.cs.cmu.edu's message of 20 Mar 89 23:47:09 GMT Subject: Re: questions on kohonen's maps Organization: Center for Adaptive Systems, B.U. Status: R What does "ordering" mean when you're projecting inputs to a lower-dimensional space? For example, with the "Peano"-type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain. In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes. 
-lonce >From @relay.cs.net:tony at ifi.unizh.ch Fri Mar 24 13:30:26 1989 Received: from RELAY.CS.NET by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA23163; Fri, 24 Mar 89 13:30:12 EST Received: from relay2.cs.net by RELAY.CS.NET id ab09426; 24 Mar 89 12:01 EST Received: from switzerland by RELAY.CS.NET id aa01417; 24 Mar 89 11:55 EST Received: from ean by scsult.SWITZERLAND.CSNET id a011335; 24 Mar 89 17:53 WET Date: 24 Mar 89 17:51 +0100 From: tony bell To: sankar at caip.rutgers.edu Mmdf-Warning: Parse error in original version of preceding line at SWITZERLAND.CSNET Message-Id: <352:tony at ifi.unizh.ch> Status: R In case anyone else asks (or Ron sends any more vague messages to the net), here are all the refs I have on Helge Ritter's work on topological maps: [1] "Kohonen's Self-Organizing Maps: exploring their computational capabilities" in Proc. IEEE ICNN 1988, San Diego. [2] "Convergence Properties of Kohonen's Topology Conserving Maps: fluctuations, stability and dimension selection" submitted to Biol. Cybernetics. [3] "Extending Kohonen's self-organising mapping algorithm to learn Ballistic Movements" in the book "Neural Computers", Eckmiller & von der Malsburg (eds). [4] "Topology conserving mappings for learning motor tasks" in the book "Neural Networks for Computing", Denker (ed), AIP Conf. proceedings, Snowbird, 1986. The second one in particular uses some heavy statistical techniques (the inputs are seen as a Markov process and a Fokker-Planck equation describes the learning) in order to prove that the map will reach equilibrium when the learning rate is time-dependent (i.e., it decays). Ritter's PhD thesis covers all his work, but it's in German. Now, Ritter is at the University of Illinois. I hope this helps you, and I don't mind if you post this to the net if you think people are interested enough. yours, Tony Bell. 
>From Connectionists-Request at q.cs.cmu.edu Fri Mar 24 22:07:14 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA23834; Fri, 24 Mar 89 22:07:06 EST Received: from ri.cmu.edu by Q.CS.CMU.EDU id aa22170; 24 Mar 89 13:28:20 EST Received: from MAUI.CS.UCLA.EDU by RI.CMU.EDU; 24 Mar 89 13:26:10 EST Return-Path: Received: by maui.cs.ucla.edu (Sendmail 5.59/2.16) id AA25252; Fri, 24 Mar 89 10:25:07 PST Date: Fri, 24 Mar 89 10:25:07 PST From: Geunbae Lee Message-Id: <8903241825.AA25252 at maui.cs.ucla.edu> To: lwyse at bucasb.bu.edu Subject: Re: questions on kohonen's map Cc: connectionists at ri.cmu.edu Status: R >What does "ordering" mean when you're projecting inputs to a lower dimensional >space? It means topological ordering. >For example, the "Peano" type curves that result from a one-D >neighborhood learning a 2-D input distribution, it is obviously NOT >true that nearby points in the input space maximally activate nearby >points on the neighborhood chain. It depends on what you mean by "nearby". If it is nearby in a relative sense (in topological relation), not an absolute sense, then the nearby points in the input space DO maximally activate nearby points on the neighborhood chain. 
--Geunbae Lee AI Lab, UCLA >From Connectionists-Request at q.cs.cmu.edu Sat Mar 25 02:26:12 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA26264; Sat, 25 Mar 89 02:26:06 EST Received: from ri.cmu.edu by Q.CS.CMU.EDU id aa25584; 24 Mar 89 17:55:35 EST Received: from XEROX.COM by RI.CMU.EDU; 24 Mar 89 17:53:44 EST Received: from Semillon.ms by ArpaGateway.ms ; 24 MAR 89 14:53:32 PST Date: 24 Mar 89 14:53 PST From: chrisley.pa at xerox.com Subject: Re: questions on kohonen's maps In-Reply-To: lwyse at bucasb.BU.EDU's message of Tue, 21 Mar 89 13:59:02 EST To: lwyse at bucasb.bu.edu Cc: connectionists at ri.cmu.edu Message-Id: <890324-145332-8519 at Xerox> Status: R Lonce (lwyse at bucasb.BU.EDU) writes: "What does "ordering" mean when you're projecting inputs to a lower dimensional space? For example, the "Peano" type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain." It is not true that nearby points in input space are always mapped to nearby points in the output space when the mapping is dimensionality-reducing, agreed. But 'ordering' still makes sense. The map is topology-preserving if the dependency is in the other direction, i.e., if nearby points in output space are always activated by nearby points in input space. Lonce goes on to say: "In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes." I agree that topology preservation is not necessarily of utmost importance, but it may be useful in some applications, such as the ones I mentioned a few messages back (phoneme recognition, inverse kinematics, etc.). 
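[Editor's note: the one-way dependency described above suggests a simple operational check: units adjacent on the chain should hold weight vectors that are close in input space, even though close inputs need not activate adjacent units. A sketch with illustrative points (not from any of the experiments cited in the thread), comparing an ordered chain with a tangled one over the same inputs:]

```python
import numpy as np

def adjacent_spread(w):
    # Mean input-space distance between units that are adjacent on the chain.
    return float(np.linalg.norm(np.diff(w, axis=0), axis=1).mean())

points = np.linspace([0.0, 0.0], [1.0, 1.0], 10)  # 10 input points on a diagonal
ordered = points                                   # chain order follows the diagonal
tangled = points[[0, 5, 1, 6, 2, 7, 3, 8, 4, 9]]   # same points, scrambled chain order

# The ordered (topology-preserving) chain keeps chain-neighbors close in input space;
# the tangled chain covers the same points but with a much larger adjacent spread.
```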
Also, there is 1) the interest in properties of self-organizing systems in themselves, even though an application can't be immediately found; and 2) the observation that for some reason the brain seems to use topology-preserving maps (with the one-way dependency I mentioned above), which, although they *could* be computationally unnecessary or even disadvantageous, are probably in fact, nature being what she is, good solutions to tough real-time problems. Ron Chrisley After April 14th, please send personal email to Chrisley at vax.ox.ac.uk >From Connectionists-Request at q.cs.cmu.edu Sun Mar 26 03:40:59 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA19433; Sun, 26 Mar 89 03:40:47 EST Received: from cs.cmu.edu by Q.CS.CMU.EDU id aa07032; 26 Mar 89 1:22:01 EST Received: from CGL.UCSF.EDU by CS.CMU.EDU; 26 Mar 89 01:18:16 EST Received: from phyb.ucsf.EDU by cgl.ucsf.EDU (5.59/GSC4.16) id AA07448; Sat, 25 Mar 89 22:18:01 PST Received: by phyb (1.2/GSC4.15) id AA08352; Sat, 25 Mar 89 22:17:59 pst Date: Sat, 25 Mar 89 22:17:59 pst From: Ken Miller Message-Id: <8903260617.AA08352 at phyb> To: Connectionists at cs.cmu.edu Subject: Normalization of weights in Kohonen algorithm Status: R re point 3 of a recent posting about the Kohonen algorithm: "3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights." The algorithm du_{ij}/dt = a(t)[e_j(t) - u_{ij}(t)], i in N_c, where u = weights, e is the input pattern, and N_c is the topological neighborhood of the maximally responding unit, should, I believe, be written du_{ij}/dt = a(t)[ e_j(t)/\sum_k(e_k(t)) - u_{ij}(t)/\sum_k(u_{ik}(t)) ], i in N_c. That is, the change should be such as to move the jth synaptic weight on the ith cell, as a PROPORTION of all the synaptic weights on the ith cell, in the direction of matching the PROPORTION of input which was incoming on the jth line. 
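[Editor's note: the conservation property of this normalized rule is easy to verify numerically. A minimal sketch of one Euler step, with illustrative sizes, applying the update to every cell rather than only to the neighborhood N_c:]

```python
import numpy as np

rng = np.random.default_rng(3)

u = rng.random((5, 8))  # weights u_{ij}: 5 cells, 8 input lines
e = rng.random(8)       # input pattern e_j
a, dt = 0.1, 1.0        # gain a(t) and Euler step size

# du_{ij}/dt = a * ( e_j / sum_k e_k  -  u_{ij} / sum_k u_{ik} )
du = a * (e / e.sum() - u / u.sum(axis=1, keepdims=True))
u_new = u + dt * du

# Summing du over j gives a * (1 - 1) = 0 for every cell, so each cell's
# total synaptic weight is conserved by the step.
```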
Note that in this case \sum_j du_{ij}(t)/dt = 0, so weights remain normalized in the sense that the sum over each cell remains constant. If inputs are normalized to sum to 1 (\sum_k(e_k(t)) = 1), then the first denominator can be omitted. If weights begin normalized to sum to 1 on each cell (\sum_k(u_{ik}(t)) = 1 for all i), then weights will remain normalized to sum to 1, hence the second denominator can be omitted. Perhaps Kohonen was assuming these normalizations and hence dispensing with the denominators? ken miller (ken at phyb.ucsf.edu) From mcvax!fib.upc.es!millan at uunet.UU.NET Fri Mar 31 04:09:00 1989 From: mcvax!fib.upc.es!millan at uunet.UU.NET (Jose del R. MILLAN) Date: 31 Mar 89 17:09 +0800 Subject: TR available Message-ID: <92*millan@fib.upc.es> The following Tech. Report is available. Requests should be sent to MILLAN at FIB.UPC.ES ________________________________________________________________________ Learning by Back-Propagation: a Systolic Algorithm and its Transputer Implementation Technical Report LSI-89-15 Jose del R. MILLAN Dept. de Llenguatges i Sistemes Informatics Universitat Politecnica de Catalunya Pau BOFILL Dept. d'Arquitectura de Computadors Universitat Politecnica de Catalunya ABSTRACT In this paper we present a systolic algorithm for back-propagation, a supervised, iterative, gradient-descent, connectionist learning rule. The algorithm works on feedforward networks where connections can skip layers, and it fully exploits the spatial and training parallelisms that are inherent to back-propagation. Spatial parallelism arises during the propagation of activity ---forward--- and error ---backward--- for a particular input-output pair. On the other hand, when this computation is carried out simultaneously for all input-output pairs, training parallelism is obtained. In the spatial dimension, a single systolic ring carries out sequentially the three main steps of the learning rule ---forward, backward and weight-increment update. 
Furthermore, the same pattern of matrix delivery is used in both the forward and the backward passes. In this manner, the algorithm preserves the similarity of the forward and backward passes in the original model. The resulting systolic algorithm is dual with respect to the pattern of matrix delivery ---either columns or rows. Finally, an implementation of the systolic algorithm for the spatial dimension is derived that uses a linear ring of Transputer processors. From joho%sw.MCC.COM at MCC.COM Thu Mar 2 13:18:40 1989 From: joho%sw.MCC.COM at MCC.COM (Josiah Hoskins) Date: Thu, 2 Mar 89 12:18:40 CST Subject: Tech Report Announcement Message-ID: <8903021818.AA22902@jelly.sw.mcc.com> The following tech report is available. Speeding Up Artificial Neural Networks in the "Real" World Josiah C. Hoskins A new heuristic, called focused-attention backpropagation (FAB) learning, is introduced. FAB enhances the backpropagation procedure by focusing attention on the exemplar patterns that are most difficult to learn. Results are reported using FAB learning to train multilayer feed-forward artificial neural networks to represent real-valued elementary functions. The rate of learning observed using FAB is 1.5 to 10 times faster than backpropagation. Requests for copies should refer to MCC Technical Report Number STP-049-89 and should be sent to Kintner at mcc.com or to Josiah C. Hoskins, MCC - Software Technology Program, 9390 Research Blvd, Kaleido II Bldg., Austin, Texas 78759. AT&T: (512) 338-3684; UUCP/USENET: milano!joho; ARPA/INTERNET: joho at mcc.com From cfields at NMSU.Edu Fri Mar 3 17:16:53 1989 From: cfields at NMSU.Edu (cfields@NMSU.Edu) Date: Fri, 3 Mar 89 15:16:53 MST Subject: No subject Message-ID: <8903032216.AA17939@NMSU.Edu> _________________________________________________________________________ The following are abstracts of papers appearing in the inaugural issue of the Journal of Experimental and Theoretical Artificial Intelligence. 
JETAI 1, 1 was published 1 January, 1989. For submission information, please contact either of the editors: Eric Dietrich, PACSS - Department of Philosophy, SUNY Binghamton, Binghamton, NY 13901, dietrich at bingvaxu.cc.binghamton.edu; or Chris Fields, Box 30001/3CRL, New Mexico State University, Las Cruces, NM 88003-0001, cfields at nmsu.edu. JETAI is published by Taylor & Francis, Ltd., London, New York, Philadelphia. _________________________________________________________________________ Minds, machines and Searle Stevan Harnad Behavioral & Brain Sciences, 20 Nassau Street, Princeton NJ 08542, USA Searle's celebrated Chinese Room Argument has shaken the foundations of Artificial Intelligence. Many refutations have been attempted, but none seem convincing. This paper is an attempt to sort out explicitly the assumptions and the logical, methodological and empirical points of disagreement. Searle is shown to have underestimated some features of computer modeling, but the heart of the issue turns out to be an empirical question about the scope and limits of the purely symbolic (computational) model of the mind. Nonsymbolic modeling turns out to be immune to the Chinese Room Argument. The issues discussed include the Total Turing Test, modularity, neural modeling, robotics, causality and the symbol-grounding problem. _________________________________________________________________________ Explanation-based learning: its role in problem solving Brent J. Krawchuck and Ian H. Witten Knowledge Sciences Laboratory, Department of Computer Science, University of Calgary, 2500 University Drive, NW, Calgary, Alta, Canada, T2N 1N4. `Explanation-based' learning is a semantically-driven, knowledge-intensive paradigm for machine learning which contrasts sharply with syntactic or `similarity-based' approaches. This paper redevelops the foundations of EBL from the perspective of problem-solving. 
Viewed in this light, the technique is revealed as a simple modification to an inference engine which gives it the ability to generalize the conditions under which the solution to a particular problem holds. We show how to embed generalization invisibly within the problem solver, so that it is accomplished as inference proceeds rather than as a separate step. The approach is also extended to the more complex domain of planning to illustrate that it is applicable to a variety of logic-based problem-solvers and is by no means restricted to only simple ones. We argue against the current trend to isolate learning from other activity and study it separately, preferring instead to integrate it into the very heart of problem solving. ---------------------------------------------------------------------------- The recognition and classification of concepts in understanding scientific texts Fernando Gomez and Carlos Segami Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA. In understanding a novel scientific text, we may distinguish the following processes. First, concepts are built from the logical form of the sentence into the final knowledge structures. This is called concept formation. While these concepts are being formed, they are also being recognized by checking whether they are already in long-term memory (LTM). Then, those concepts which are unrecognized are integrated in LTM. In this paper, algorithms for the recognition and integration of concepts in understanding scientific texts are presented. It is shown that the integration of concepts in scientific texts is essentially a classification task, which determines how and where to integrate them in LTM. In some cases, the integration of concepts results in a reclassification of some of the concepts already stored in LTM. All the algorithms described here have been implemented and are part of SNOWY, a program which reads short scientific paragraphs and answers questions. 
--------------------------------------------------------------------------- Exploring the No-Function-In-Structure principle Anne Keuneke and Dean Allemang Laboratory for Artificial Intelligence Research, Department of Computer and Information Science, The Ohio State University, 2036 Neil Avenue Mall, Columbus, OH 43210-1277, USA. Although much of past work in AI has focused on compiled knowledge systems, recent research shows renewed interest and advanced efforts both in model-based reasoning and in the integration of this deep knowledge with compiled problem solving structures. Device-based reasoning can only be as good as the model used; if the needed knowledge, correct detail, or proper theoretical background is not accessible, performance deteriorates. Much of the work on model-based reasoning references the `no-function-in-structure' principle, which was introduced by de Kleer and Brown. Although they were well motivated in establishing the guideline, this paper explores the applicability and workability of the concept as a universal principle for model representation. This paper first describes the principle, its intent and the concerns it addresses. It then questions the feasibility and the practicality of the principle as a universal guideline for model representation. ___________________________________________________________________________ From jbower at bek-mc.caltech.edu Sun Mar 5 21:09:10 1989 From: jbower at bek-mc.caltech.edu (Jim Bower) Date: Sun, 5 Mar 89 18:09:10 pst Subject: Summer course in computational neurobiology Message-ID: <8903060209.AA03962@bek-mc.caltech.edu> Course announcement: Methods in Computational Neuroscience The Marine Biological Laboratory Woods Hole, Massachusetts August 6 - September 2, 1989 General Description The Marine Biological Laboratory (MBL) in Woods Hole, Massachusetts is a world-famous marine biological laboratory that has been in existence for over 100 years. 
In addition to providing research facilities for a large number of biologists during the summer, the MBL also sponsors a number of outstanding courses on different topics in Biology. This summer will be the second year in which the MBL has offered a course in "Methods in Computational Neuroscience". This course is designed as a survey of the use of computer modeling techniques in studying the information processing capabilities of the nervous system, and covers models at all levels, from biologically realistic single cells and networks of cells to biologically relevant abstract models. The principal aim of the course is to provide participants with the tools to simulate the functional properties of those neural systems of interest to them, as well as to understand the general advantages and pitfalls of this experimental approach. The Specific Structure of the Course The course itself includes both a lecture series and a computer laboratory. The lectures are given by invited faculty whose work represents the state of the art in computational neuroscience (see list below). The course lecture notes have been incorporated into a book published by MIT Press ("Methods in Neuronal Modeling: From Synapses to Networks", C. Koch and I. Segev, editors. MIT Press, Cambridge, MA, 1989). The computer laboratory is designed to give students hands-on experience with the simulation techniques considered in the lectures. It also provides students with the opportunity to actually begin simulations of neural systems of interest to them. The students are guided in this effort by the visiting lecturers and course directors, but also by several students from the Computational Neural Systems (CNS) graduate program at Caltech who serve as laboratory TAs. The lab itself consists of state-of-the-art graphics workstations running a GEneral NEtwork SImulation System (GENESIS) that Dr. Bower and his colleagues at Caltech have constructed over the last several years. 
Students return to their home institutions with the GENESIS system to continue their work. The Students The course is designed for advanced graduate students and postdoctoral fellows in biology, computer science, electrical engineering, physics, or psychology with an interest in computational neuroscience. Because of the heavy computer orientation of the lab section, a good computer background is required (UNIX, C or PASCAL). In addition, students are expected to have a basic background in neurobiology. Course enrollment is limited to 20 so as to assure the highest quality educational experience. Course Directors James M. Bower and Christof Koch Computation and Neural Systems Program California Institute of Technology The Faculty Paul Adams (Stony Brook) Dan Alkon (NIH) Richard Anderson (MIT) John Hildebrand (Arizona) John Hopfield (Caltech) Rodolfo Llinas (NYU) David Rumelhart (Stanford) Idan Segev (Jerusalem) Terrence Sejnowski (Salk/UCSD) David Van Essen (Caltech) Christoph von der Malsburg (USC) For further information and application materials contact: Admissions Coordinator Marine Biological Laboratory Woods Hole, MA 02543 (508) 548-3705, extension 216 Application Deadline May 15, 1989 Acceptance notification in early June. From mjolsness-eric at YALE.ARPA Tue Mar 7 21:23:16 1989 From: mjolsness-eric at YALE.ARPA (Eric Mjolsness) Date: Tue, 7 Mar 89 21:23:16 EST Subject: "Transformations" tech report Message-ID: <8903080223.AA17992@NEBULA.SUN3.CS.YALE.EDU> A new technical report is available: "Algebraic Transformations of Objective Functions" (YALEU/DCS/RR-686) by Eric Mjolsness and Charles Garrett Yale Department of Computer Science P.O. Box 2158, Yale Station New Haven, CT 06520 Abstract: A standard neural network design trick reduces the number of connections in the winner-take-all (WTA) network from O(N^2) to O(N). We explain the trick as a general fixpoint-preserving transformation applied to the particular objective function associated with the WTA network. 
The key idea is to introduce new interneurons which act to maximize the objective, so that the network seeks a saddle point rather than a minimum. A number of fixpoint-preserving transformations are derived, allowing the simplification of such algebraic forms as products of expressions, functions of one or two expressions, and sparse matrix products. The transformations may be applied to reduce or simplify the implementation of a great many structured neural networks, as we demonstrate for inexact graph-matching, convolutions and coordinate transformations, and sorting. Simulations show that fixpoint-preserving transformations may be applied repeatedly and elaborately, and the example networks still robustly converge. We discuss implications for circuit design. To request a copy, please send your physical address by e-mail to mjolsness-eric at cs.yale.edu OR mjolsness-eric at yale.arpa (old style) Thank you. ------- From prlb2!vub.vub.ac.be!prog1!wplybaer at uunet.UU.NET Tue Mar 7 19:34:21 1989 From: prlb2!vub.vub.ac.be!prog1!wplybaer at uunet.UU.NET (Wim P. Lybaert) Date: Wed, 8 Mar 89 01:34:21 +0100 Subject: No subject Message-ID: <8903080034.AA10074@prog1.vub.ac.be> Hi, I would like to be placed on the connectionist neural nets mailing list that you distribute. Thanks, Wim Lybaert Brussels Free University Department PROG Oefenplein 2 1040 BRUSSELS BELGIUM email: From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Wed Mar 8 11:36:31 1989 From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias) Date: Wed, 08 Mar 89 11:36:31 EST Subject: information function vs. squared error Message-ID: I am looking for pointers to papers discussing the use of an alternative criterion to squared error in back-propagation algorithms. The alternative function I have in mind is called (in different contexts and/or by different authors) cross entropy, entropy, information, inf. divergence, and so on. 
It is defined something like: G = sum_{i=1}^{N} p_i * log(p_i). I am not quite sure what the index i runs through: units, weights or something else. I know people have been talking about this a lot, I just cannot remember where I read about it... it seems like Geoff Hinton's group has worked on this. Thanks, Thanasis From mdp%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Thu Mar 9 08:16:07 1989 From: mdp%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (Mark Plumbley) Date: Thu, 9 Mar 89 13:16:07 GMT Subject: information function vs. squared error Message-ID: <14398.8903091316@dsl.eng.cam.ac.uk> Thanasis, The "G" function you mentioned, based on an entropy method, is probably the one developed by Pearlmutter and Hinton as a procedure for unsupervised learning of binary units [1]. More recently, Linsker [2,3] and Plumbley and Fallside [4] considered the principle of maximum information transmission (or minimum information loss) for continuous units, relating this to Principal Component methods for linear units. Unfortunately, these are mainly about unsupervised learning, rather than Backprop specifically, although in [4] we do look at the way the mean-squared error criterion places an *upper bound* on the information loss through a supervised network. This bound will be tightest when the errors on all the output units are independent and have the same variance (or the same entropy for non-additive-Gaussian errors). *If* you can choose the target representation used by Backprop so that the errors are likely to have these properties, it should perform closer to the (information-theoretic) optimal. Hope this is some help, Mark. References: [1] B. A. Pearlmutter and G. E. Hinton: "G-Maximization: An Unsupervised Learning Procedure for Discovering Regularities". In Proceedings of the Conference on `Neural Networks for Computing'. American Institute of Physics, 1986. [2] R. Linsker: "Towards an Organisational Principle for a Layered Perceptual Network". 
In "Neural Information Processing Systems (Denver, CO, 1987)" (Ed. D. Z. Anderson), pp. 485-494. American Institute of Physics, 1988.
[3] R. Linsker: "Self-Organization in a Perceptual Network". IEEE Computer, vol. 21 (3), March 1988, pp. 105-117.
[4] M. D. Plumbley and F. Fallside: "An Information-Theoretic Approach to Unsupervised Connectionist Models". Tech. Report CUED/F-INFENG/TR.7, Cambridge University Engineering Department, 1988. Also in "Proceedings of the 1988 Connectionist Models Summer School", pp. 239-245. Morgan-Kaufmann, San Mateo, CA.

+--------------------------------------------+---------------------------+
| Mark Plumbley                              | Cambridge University      |
| JANET: mdp at uk.ac.cam.eng.dsl            | Engineering Department,   |
| ARPANET:                                   | Trumpington Street,       |
| mdp%dsl.eng.cam.ac.uk at nss.cs.ucl.ac.uk  | Cambridge CB2 1PZ         |
| Tel: +44 223 332754  Fax: +44 223 332662   | UK                        |
+--------------------------------------------+---------------------------+

From becker at ai.toronto.edu Thu Mar 9 13:26:38 1989
From: becker at ai.toronto.edu (becker@ai.toronto.edu)
Date: Thu, 9 Mar 89 13:26:38 EST
Subject: information function vs. squared error
Message-ID: <89Mar9.132645est.10489@ephemeral.ai.toronto.edu>

The use of the cross-entropy measure

G = p log(p/q) + (1-p) log((1-p)/(1-q))

where p and q are the probabilities of a binary random variable under two probability distributions (Kullback, 1959), has been described in at least 3 different contexts in the connectionist literature:

(i) As an objective function for supervised back-propagation; this is appropriate if the output units are computing real values which are to be interpreted as probability distributions over the space of binary output vectors (Hinton, 1987). Here G-error represents the divergence between the desired and observed distributions.
(ii) As an objective function for Boltzmann machine learning (Hinton and Sejnowski, 1986), where p and q are the output distributions in the + and - phases.
(iii) In the Gmax unsupervised learning algorithm (Pearlmutter and Hinton, 1986), as a measure of the difference between the actual output distribution of a unit and the predicted distribution assuming independent input lines.

References:

Hinton, G. E. 1987. "Connectionist Learning Procedures", revised version of Technical Report CMU-CS-87-115, to appear (appeared?) in Artificial Intelligence.
Hinton, G. E. and Sejnowski, T. J. 1986. "Learning and relearning in Boltzmann machines", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Bradford Books.
Kullback, S. 1959. "Information Theory and Statistics", New York: Wiley.
Pearlmutter, B. A. and Hinton, G. E. 1986. "G-Maximization: An unsupervised learning procedure for discovering regularities", Neural Networks for Computing: American Institute of Physics Conference Proceedings 151.

Sue Becker
DCS, University of Toronto

From mehra at aquinas.csl.uiuc.edu Fri Mar 10 05:43:16 1989
From: mehra at aquinas.csl.uiuc.edu (Pankaj Mehra)
Date: Fri, 10 Mar 89 04:43:16 CST
Subject: No subject
Message-ID: <8903101043.AA02586@aquinas>

I have recently explored several connectionist models for learning under _realistic_ learning scenarios. The class of problems for which we are trying to acquire solutions by learning consists of decision problems with the following characteristics:
(i) large number of continuous-valued PARAMETERS, each of which
    (ia) takes on values from a finite range with a nonstationary distribution
    (ib) costs more to measure accurately {however, accuracy can be controlled by focussed sampling}
    (ic) is not known to follow any particular parametric distribution
(ii) the optimization CRITERION (energy, if you will) is ill-defined {much like the _blackbox_ in David Ackley's thesis}
(iii) a set of OPERATORS is available, and these are the _only_ instruments for manipulating the problem state.
    (iiia) the _causal_ relationships between the states before and after the application of the operator are not known
    (iiib) the _persistence_ model is incomplete, i.e. it is not known a priori when the effect of an action will be felt and how long it will persist
(iv) the TRAINING ENVIRONMENT is _slow reactive_: it can be assumed to produce reinforcement (prescriptive feedback) rather than an error (evaluative feedback); however, the delays between an action and subsequent reinforcement follow an _unknown_ distribution.
-------
These have been called Dynamic Decision Problems, and shown to be a rich class, in the following publication [available upon request from the first author]:

Mehra, P. and B. W. Wah, "Architectures for Strategy Learning," in Computer Architectures for Artificial Intelligence Applications, ed. B. Wah and C. Ramamoorthy, Wiley, New York, NY, 1989 (in press). {send e-mail to: mehra at cs.uiuc.edu}
-------
The above publication also examines the applicability of other well-known learning techniques {empirical, probabilistic, decision theoretic, EBL, hybrid techniques, learning to plan, etc.} and suggests why ANSs might be preferred over others. As a part of this comparison, several contemporary connectionist models were found lacking in certain respects. I shall summarize the criticisms here, and would like to have feedback from those who have supported the use of these techniques.

BACK-PROPAGATION:
positive aspects:
    Simplicity of programming the learning algorithm
    An effective procedure for tuning of large parameter sets representable as _band matrices_ (layered networks)
problematic assumptions:
    Immediate feedback
    Corrective {as against prescriptive} feedback [I am aware of Ron Williams' work, though]
weakness as a learning approach:
    Requires tweaking of features (normalization biases) to the extent that the degree of generalization varies drastically as the degree of coarse coding changes.
    A great part of the success in particular applications could therefore be attributed to the intelligence of the researcher who codes those features {rather than to the _learning_ algorithm}.

REINFORCEMENT LEARNING
positive aspects:
    Can handle prescriptive feedback
    Has been shown {Rich Sutton, Chuck Anderson} to work with delayed feedback
problematic assumptions:
    The implementations known to this author assume:
    : persistence of effects decays _exponentially_ with time
    : heuristic assumptions such as "recency" (that the more recent an action is, the more responsible it is for the feedback) and "frequency" (that the more frequently an action occurs preceding the feedback, the more likely it is to have caused the feedback) are _hardwired_ into the learning algorithms
    All the knowledge needed for learning is implicit, as if the learning critter was born with algorithms assuming exponential decay and as if all actions in the world caused similar delay patterns
    The nodes of the network compute functions much more complex than in the case of classical back-propagation.
weakness as a learning paradigm:
    All actions that occur at the same time and with the same frequency are assumed equally likely to have caused the feedback (i.e. these algorithms have an implicitly coded causal model).
    No scope for using the same network to choose between actions having different causal and persistence assumptions.
    The learning algorithm amounts to a procedural encoding of environmental knowledge. Any success of these algorithms in realistic applications is in large part due to the intelligence of the designer and the effort they put in (for example, to find just the right lambda for the exponential decay factor).
-------
See my paper for details of Dynamic Decision Problems and an extensive study of how the basic learning model underlying _most_ of the existing learning algorithms (either in AI or Connectionism) is at odds with the requirements of training in the real world.
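To make the criticized assumption concrete, here is a minimal Python sketch of recency-based credit assignment with a hardwired exponential decay factor lambda. The function name, action times, and lambda value are my own illustrative choices, not taken from the paper:

```python
# Illustrative sketch of "hardwired" exponential-decay credit assignment:
# credit for a delayed reinforcement decays geometrically with the age
# of each action, regardless of which action actually caused the reward.

def assign_credit(action_times, reward_time, lam=0.9):
    """Weight each past action by lam**(delay) -- the recency heuristic."""
    return [lam ** (reward_time - t) for t in action_times]

# Actions taken at times 0, 3 and 5; reinforcement arrives at time 6.
credits = assign_credit([0, 3, 5], reward_time=6, lam=0.5)
print(credits)  # [0.015625, 0.125, 0.5]
```

The earliest action receives almost no credit no matter what actually caused the reinforcement, which is precisely the implicitly coded causal model being objected to above.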
Comments welcome from those who read the paper, as well as from those who just want to discuss the material of this basenote.

- Pankaj {Mehra at cs.uiuc.edu}

From mike at bucasb.BU.EDU Fri Mar 10 12:22:14 1989
From: mike at bucasb.BU.EDU (Michael Cohen)
Date: Fri, 10 Mar 89 12:22:14 EST
Subject: network meeting announcement for distribution
Message-ID: <8903101722.AA27914@bucasb.bu.edu>

NEURAL NETWORK MODELS OF CONDITIONING AND ACTION
12th Symposium on Models of Behavior
Friday and Saturday, June 2 and 3, 1989
105 William James Hall, Harvard University
33 Kirkland Street, Cambridge, Massachusetts

PROGRAM COMMITTEE: Michael Commons, Harvard Medical School; Stephen Grossberg, Boston University; John E.R. Staddon, Duke University

JUNE 2, 8:30AM--11:45AM
-----------------------
Daniel L. Alkon, ``Pattern Recognition and Storage by an Artificial Network Derived from Biological Systems''
John H. Byrne, ``Analysis and Simulation of Cellular and Network Properties Contributing to Learning and Memory in Aplysia''
William B. Levy, ``Synaptic Modification Rules in Hippocampal Learning''

JUNE 2, 1:00PM--5:15PM
----------------------
Gail A. Carpenter, ``Recognition Learning by a Hierarchical ART Network Modulated by Reinforcement Feedback''
Stephen Grossberg, ``Neural Dynamics of Reinforcement Learning, Selective Attention, and Adaptive Timing''
Daniel S. Levine, ``Simulations of Conditioned Perseveration and Novelty Preference from Frontal Lobe Damage''
Nestor A. Schmajuk, ``Neural Dynamics of Hippocampal Modulation of Classical Conditioning''

JUNE 3, 8:30AM--11:45AM
-----------------------
John W. Moore, ``Implementing Connectionist Algorithms for Classical Conditioning in the Brain''
Russell M. Church, ``A Connectionist Model of Scalar Timing Theory''
William S. Maki, ``Connectionist Approach to Conditional Discrimination: Learning, Short-Term Memory, and Attention''

JUNE 3, 1:00PM--5:15PM
----------------------
Michael L.
Commons, ``Models of Acquisition and Preference'' John E.R. Staddon, ``Simple Parallel Model for Operant Learning with Application to a Class of Inference Problems'' Alliston K. Reid, ``Computational Models of Instrumental and Scheduled Performance'' Stephen Jose Hanson, ``Behavioral Diversity, Hypothesis Testing, and the Stochastic Delta Rule'' Richard S. Sutton, ``Time Derivative Models of Pavlovian Reinforcement'' FOR REGISTRATION INFORMATION SEE ATTACHED OR WRITE: Dr. Michael L. Commons Society for Quantitative Analysis of Behavior 234 Huron Avenue Cambridge, MA 02138 ---------------------------------------------------------------------- ---------------------------------------------------------------------- REGISTRATION FEE BY MAIL (Paid by check to Society for Quantitative Analysis of Behavior) (Postmarked by April 30, 1989) Name: ______________________________________________ Title: _____________________________________________ Affiliation: _______________________________________ Address: ___________________________________________ Telephone(s): ______________________________________ E-mail address: ____________________________________ ( ) Regular $35 ( ) Full-time student $25 School ____________________________________________ Graduate Date _____________________________________ Print Faculty Name ________________________________ Faculty Signature _________________________________ PREPAID 10-COURSE CHINESE BANQUET ON JUNE 2 ( ) $20 (add to pre-registration fee check) ----------------------------------------------------------------------------- (cut here and mail with your check to) Dr. Michael L. Commons Society for Quantitative Analysis of Behavior 234 Huron Avenue Cambridge, MA 02138 REGISTRATION FEE AT THE MEETING ( ) Regular $45 ( ) Full-time Student $30 (Students must show active student I.D. 
to receive this rate)

ON SITE REGISTRATION
5:00--8:00PM, June 1, at the RECEPTION in Room 1550, William James Hall, 33 Kirkland Street, and 7:30--8:30AM, June 2, in the LOBBY of William James Hall. Registration by mail before April 30, 1989 is recommended as seating is limited.

HOUSING INFORMATION
Rooms have been reserved in the name of the symposium for the Friday and Saturday nights at:

Best Western Homestead Inn, 220 Alewife Brook Parkway, Cambridge, MA 02138
Single: $72  Double: $80

Reserve your room as soon as possible. The hotel will not hold rooms past March 31. Because of Harvard and MIT graduation ceremonies, space will fill up rapidly.

Other nearby hotels:

Howard Johnson's Motor Lodge, 777 Memorial Drive, Cambridge, MA 02139, (617) 492-7777, (800) 654-2000
Single: $115--$135  Double: $115--$135

Suisse Chalet, 211 Concord Turnpike Parkway, Cambridge, MA 02140, (617) 661-7800, (800) 258-1980
Single: $48.70  Double: $52.70
---------------------------------------------------------------------------

From homxb!solla at research.att.com Fri Mar 10 13:10:00 1989
From: homxb!solla at research.att.com (homxb!solla@research.att.com)
Date: Fri, 10 Mar 89 13:10 EST
Subject: Cross-entropy error
Message-ID:

A detailed discussion of the cross-entropy error measure for back propagation, and a comparative study of its merits relative to the more commonly used quadratic measure, are to be found in "Accelerated Learning in Layered Neural Networks" by S. A. Solla, E. Levin, and M. Fleisher. The paper has appeared in "Complex Systems", Vol. 2, 1988. Two other relevant references to the use of such an error function in the context of supervised learning are:

E. B. Baum and F. Wilczek, "Supervised Learning of Probability Distributions by Neural Networks", in "Neural Information Processing Systems", ed. by D. Anderson (AIP, New York, 1988).

J. J. Hopfield, "Learning Algorithms and Probability Distributions in Feed-forward and Feed-back Networks", Proc. Natl. Acad. Sci. USA, Vol. 84, 1987, pp. 8429-8433.
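For concreteness, the two error measures compared in these references can be written out in a few lines. The following Python sketch is illustrative only (the target and output vectors are invented); it computes the usual quadratic error alongside the cross-entropy measure G for output units read as probabilities of binary target variables:

```python
import math

def quadratic_error(targets, outputs):
    """E = 1/2 * sum_i (t_i - y_i)^2, the common squared-error measure."""
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, outputs))

def cross_entropy_error(targets, outputs):
    """G = sum_i [t_i log(t_i/y_i) + (1-t_i) log((1-t_i)/(1-y_i))].
    The conditionals implement the limit t log t -> 0 for binary
    targets t in {0, 1}."""
    g = 0.0
    for t, y in zip(targets, outputs):
        if t > 0.0:
            g += t * math.log(t / y)
        if t < 1.0:
            g += (1 - t) * math.log((1 - t) / (1 - y))
    return g

targets = [1.0, 0.0]
good, bad = [0.9, 0.1], [0.5, 0.5]
# Both measures shrink as outputs approach the targets.
q_good, g_good = quadratic_error(targets, good), cross_entropy_error(targets, good)
q_bad, g_bad = quadratic_error(targets, bad), cross_entropy_error(targets, bad)
```

Note that G grows without bound as an output unit approaches the wrong extreme, while the quadratic error stays bounded; this sharper penalty on confidently wrong outputs is one intuition behind the accelerated learning these references report.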
Sara A. Solla AT&T Bell Laboratories solla at homxb.att.com From John.Hampshire at SPEECH2.CS.CMU.EDU Sun Mar 12 13:21:21 1989 From: John.Hampshire at SPEECH2.CS.CMU.EDU (John.Hampshire@SPEECH2.CS.CMU.EDU) Date: Sun, 12 Mar 89 13:21:21 EST Subject: non-MSE objective function for backprop Message-ID: ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* A NOVEL OBJECTIVE FUNCTION FOR IMPROVED CLASSIFICATION PERFORMANCE IN TIME-DELAY NEURAL NETS USED FOR PHONEME RECOGNITION J. B. Hampshire II A. H. Waibel Carnegie Mellon University We have been working on an alternative objective function to the mean-squared-error (MSE) objective function typically used in backpropagation. Our alternative, which we term the classification figure-of-merit (CFM), forms a mathematical assessment of the *relative* activations of all output nodes of a backprop network used as a classifier. The objective function has a number of unique characteristics; chief among these are 1. its formation of internal representations that consistently differ substantially from those of the MSE objective function 2. its immunity to "over-learning" (i.e., the process by which MSE classifiers can be trained so much that they begin to key on "idiosyncratic" features of the training set that are not representative of the ensemble from which the training set was drawn. As a result, over training actually results in degraded classification performance on a disjoint test set.) While classification performance of the CFM objective function is equivalent to that of the MSE objective function, results from the two classifiers can be combined to reduce by a median 24% the number of misclassifications made by the MSE classifier alone. This equates to single and multi-speaker /b, d, g/ recognition rates that consistently exceed 98%. 
A preliminary paper on our results of applying the CFM to phoneme recognition using Time-Delay Neural Nets is available now, but if you want to wait another two weeks, you can get the NEW! IMPROVED! full-fledged technical report. If you absolutely can't wait to get your hands on this stuff, send your mailing address and something to the effect of, "send me the CFM paper." If, on the other hand, you want to see a more thorough analysis, send your mailing address and say, "send me the CFM tech report (CMU-CS-89-118) in two weeks." In either case, send your request directly to hamps at speech2.cs.cmu.edu

***** DO NOT USE THE REPLY COMMAND IN YOUR MAILER *****

********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD ***********
**************** TO OTHER BBOARDS/ELECTRONIC MEDIA *******************

From netlist at psych.Stanford.EDU Sun Mar 12 17:13:17 1989
From: netlist at psych.Stanford.EDU (Mark Gluck)
Date: Sun, 12 Mar 89 14:13:17 PST
Subject: Tues. 3/14: ALAN LAPEDES, Neural Nets and Signal Processing
Message-ID:

Stanford University Interdisciplinary Colloquium Series:
Adaptive Networks and their Applications

Mar. 14th (Tuesday, 3:30pm):
********************************************************************************
"Nonlinear Signal Processing with Adaptive Networks"

ALAN LAPEDES
Theoretical Division
Los Alamos National Laboratory, MS B213
Los Alamos, New Mexico 87545
********************************************************************************

Abstract

Previous work on using the new generation of nonlinear neural networks for signal processing tasks is reviewed. The concept of a nonlinear system changing its behavior as a parameter is changed (bifurcations) is introduced and investigated for the simple logistic map. In this situation we show that instabilities (limit cycles, chaos) of this system may be predicted as a function of a system parameter purely from observations of the system in its stable regime where it evolves to a stable fixed point.
We consider predicting the bifurcation of a hydrodynamic experiment. Both backpropagation nets and radial basis networks are used on this problem. Agreement with experiment is good, and plenty of pretty three dimensional pictures will be shown. Unnecessary formalism will be kept to a bare minimum. Additional Information ---------------------- Location: Room 380-380X, which can be reached through the lower level between the Psychology and Mathematical Sciences buildings. Level: Technically oriented for persons working in related areas. Mailing lists: To be added to the network mailing list, netmail to netlist at psych.stanford.edu with "addme" as your subject header. For additional information, contact Mark Gluck (gluck at psych.stanford.edu). From harnad at Princeton.EDU Mon Mar 13 13:57:26 1989 From: harnad at Princeton.EDU (Stevan Harnad) Date: Mon, 13 Mar 89 13:57:26 EST Subject: Abstract for CNLS Conference Message-ID: <8903131857.AA19332@clarity.Princeton.EDU> Here is the abstract for my contribution to the session on the "Emergence of Symbolic Structures" at the 9th Annual International Conference on Emergent Computation, CNLS, Los Alamos National Laboratory, May 22 - 26 1989 Grounding Symbols in a Nonsymbolic Substrate Stevan Harnad Behavioral and Brain Sciences Princeton NJ There has been much discussion recently about the scope and limits of purely symbolic models of the mind and of the proper role of connectionism in mental modeling. 
In this paper the "symbol grounding problem" -- the problem of how the meanings of meaningless symbols, manipulated only on the basis of their shapes, can be grounded in anything but more meaningless symbols in a purely symbolic system -- is described, and then a potential solution is sketched: Symbolic representations must be grounded bottom-up in nonsymbolic representations of two kinds: (1) iconic representations, which are analogs of the sensory projections of objects and events, and (2) categorical representations, which are learned or innate feature-detectors that pick out the invariant features of object and event categories. Elementary symbols are the names of object and event categories, picked out by their (nonsymbolic) categorical representations. Higher-order symbols are then grounded in these elementary symbols. Connectionism is a natural candidate for the mechanism that learns the invariant features. In this way connectionism can be seen as a complementary component in a hybrid nonsymbolic/symbolic model of the mind, rather than a rival to purely symbolic modeling. Such a hybrid model would not have an autonomous symbolic module, however; the symbolic functions would emerge as an intrinsically "dedicated" symbol system as a consequence of the bottom-up grounding of categories and their names.

From niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Tue Mar 14 10:16:44 1989
From: niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (Mahesan Niranjan)
Date: Tue, 14 Mar 89 15:16:44 GMT
Subject: information function vs. squared error
Message-ID: <28888.8903141516@dsl.eng.cam.ac.uk>

I tried sending the following note last weekend but it failed for some reason - apologies if anyone is getting a repeat!

Re:
> Date: Wed, 08 Mar 89 11:36:31 EST
> From: thanasis kehagias
> Subject: information function vs.
squared error > > i am looking for pointers to papers discussing the use of an alternative > criterion to squared error, in back propagation algorithms. the [..] > G=sum{i=1}{N} p_i*log(p_i) > Here is a non-causal reference: I have been looking at an error measure based on "approximate distances to class-boundary" instead of the total squared error used in typical supervised learning networks. The idea is motivated by the fact that a large network has an inherent freedom to classify a training set in many ways (and thus poor generalisation!). In my training, an example of a particular class gets a target value depending on where it lies with respect to examples from the other class (in a two class problem). This implies, that the target interpolation function that the network has to construct is a smooth transition from one class to the other (rather than a step-like cross section in the total squared error criterion). The important consequence of doing this is that networks are automatically deprived of the ability to form large weight (- sharp cross section) solutions (an auto weight decay!!). niranjan PS: A Tech report will be announced soon. From sven at iuvax.cs.indiana.edu Tue Mar 14 10:12:36 1989 From: sven at iuvax.cs.indiana.edu (Sven Anderson) Date: Tue, 14 Mar 89 10:12:36 -0500 Subject: Connection between Hidden Markov Models and Connectionist Networks In-Reply-To: thanasis kehagias's message of Mon, 13 Feb 89 00:47:00 EST Message-ID: I'm interested in receiving the paper you described: OPTIMAL CONTROL FOR TRAINING THE MISSING LINK BETWEEN HIDDEN MARKOV MODELS AND CONNECTIONIST NETWORKS by Athanasios Kehagias Division of Applied Mathematics Brown University Providence, RI 02912 If it's more convenient you might just forward the div file. 
thanks, Sven Anderson

From honavar at cs.wisc.edu Tue Mar 14 17:59:39 1989
From: honavar at cs.wisc.edu (A Buggy AI Program)
Date: Tue, 14 Mar 89 16:59:39 -0600
Subject: TR available (** DO NOT FORWARD TO BULLETIN BOARDS **)
Message-ID: <8903142259.AA01452@goat.cs.wisc.edu>

** PLEASE DO NOT FORWARD TO BULLETIN BOARDS **

The following TR is now available:
---------------------------------------
Perceptual Development and Learning: From Behavioral, Neurophysiological, and Morphological Evidence To Computational Models

Vasant Honavar
Computer Sciences Department
University of Wisconsin-Madison
Computer Sciences TR # 818, January 1989

Abstract

An intelligent system has to be capable of adapting to a constantly changing environment. It therefore ought to be capable of learning from its perceptual interactions with its surroundings. This requires a certain amount of plasticity in its structure. Any attempt to model the perceptual capabilities of a living system or, for that matter, to construct a synthetic system of comparable abilities, must therefore account for such plasticity through a variety of developmental and learning mechanisms. This paper examines some results from neuroanatomical, morphological, as well as behavioral studies of the development of visual perception; integrates them into a computational framework; and suggests several interesting experiments with computational models that can yield insights into the development of visual perception.
---------------------------------------
Requests for copies must be addressed to: honavar at cs.wisc.edu

From ash%cs at ucsd.edu Tue Mar 14 19:15:54 1989
From: ash%cs at ucsd.edu (Tim Ash)
Date: Tue, 14 Mar 89 16:15:54 PST
Subject: No subject
Message-ID: <8903150015.AA19834@beowulf.ucsd.edu.UCSD.EDU>
-----------------------------------------------------------------------
The following technical report is now available.
-----------------------------------------------------------------------
DYNAMIC NODE CREATION IN BACKPROPAGATION NETWORKS

Timur Ash
ash at ucsd.edu

Abstract

Large backpropagation (BP) networks are very difficult to train. This fact complicates the process of iteratively testing different sized networks (i.e., networks with different numbers of hidden layer units) to find one that provides a good mapping approximation. This paper introduces a new method called Dynamic Node Creation (DNC) that attacks both of these issues (training large networks and testing networks with different numbers of hidden layer units). DNC sequentially adds nodes one at a time to the hidden layer(s) of the network until the desired approximation accuracy is achieved. Simulation results for parity, symmetry, binary addition, and the encoder problem are presented. The procedure was capable of finding known minimal topologies in many cases, and was always within three nodes of the minimum. Computational expense for finding the solutions was comparable to training normal BP networks with the same final topologies. Starting out with fewer nodes than needed to solve the problem actually seems to help find a solution. The method yielded a solution for every problem tried. BP applied to the same large networks with randomized initial weights was unable, after repeated attempts, to replicate some minimum solutions found by DNC.
-----------------------------------------------------------------------
Requests for reprints (ICS Report 8901) should be directed to:

Claudia Fernety
Institute for Cognitive Science C-015
University of California, San Diego
La Jolla, CA 92093.
-----------------------------------------------------------------------

From wine at CS.UCLA.EDU Wed Mar 15 08:49:36 1989
From: wine at CS.UCLA.EDU (wine@CS.UCLA.EDU)
Date: Wed, 15 Mar 89 05:49:36 PST
Subject: TR available (** DO NOT FORWARD TO BULLETIN BOARDS **)
In-Reply-To: Your message of Tue, 14 Mar 89 16:59:39 -0600.
<8903142259.AA01452@goat.cs.wisc.edu>
Message-ID: <8903151349.AA04692@retina.cs.ucla.edu>

Please send me a copy of your technical report #818. Thank you in advance.

--David Wine
University of California at Los Angeles     wine at cs.ucla.edu
Computer Science Department                 (213) 825-6121
3531 Boelter Hall     ...!(uunet,rutgers,ucbvax,randvax)!cs.ucla.edu!wine
Los Angeles, CA 90024

From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Wed Mar 15 18:24:14 1989
From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias)
Date: Wed, 15 Mar 89 18:24:14 EST
Subject: what is a connectionist network?
Message-ID:

ok, here is my question. i hope it makes sense: very often i want to refer to "these things". i do not want to call them neural networks, since it is far from clear to me they really have a similarity with the human nervous system. so i chose to call them connectionist networks. i guess this means they are networks with (many) connections. but this is very general. so i do not have a clear definition of what i am talking about. i am sure i could come up with several, but they seem to me to be either too restrictive or too general. so would anybody care to give their definition of these objects that this list is about? the issue is not trivial or vacuously philosophical. i think that even if we do not come up with a generally accepted definition of what a connectionist net is, people will have a chance to present competing opinions. possibly some lurking differences will come to the surface and the foundations of connectionism will become more secure. here is a case that i think is fraught with issues (that could be cleared up): any dynamical system that evolves in discrete time can be represented (over a finite time interval) by a feedforward connectionist network. is it fair to say that dynamical systems are connectionist networks? conversely, is it fair to say that feedforward nets are dynamical systems? what are the implications for a time-space trade-off?
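the finite-interval unrolling mentioned in the question above can be made concrete in a few lines of python. this is only an illustrative sketch (the logistic map is an arbitrary choice of update function f, and the names are mine):

```python
# Sketch of the correspondence: a discrete-time dynamical system
# x(t+1) = f(x(t)) unrolled over a finite interval of T steps is a
# T-layer feedforward net that applies the same "layer" f at every depth.

def f(x, r=3.2):
    """One step of the logistic map, standing in for one network layer."""
    return r * x * (1.0 - x)

def unrolled(x0, depth):
    """Forward pass through `depth` identical layers = iterating f."""
    trajectory = [x0]
    for _ in range(depth):
        trajectory.append(f(trajectory[-1]))
    return trajectory

traj = unrolled(0.5, 4)  # recurrent time traded for feedforward depth
```

the trade-off is visible directly: T steps of time become T layers of space, at the cost of duplicating the same connections once per step.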
how much do we have to learn about dyn. systems to do connectionist research?

ok, after all this i guess i have to give my definition of a connectionist network. it is rather involved and it goes like this: "connectionism is not a yes-or-no property. any directed graph (collection of nodes and directed edges) has a connectionism index, defined as the ratio of nr. of edges to nr. of nodes."

PS: has anybody already dealt with the question of defining a CN? references welcome.

Thanasis

From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Wed Mar 15 18:23:24 1989
From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias)
Date: Wed, 15 Mar 89 18:23:24 EST
Subject: cross entropy and training time in Connectionist Nets and HMM's
Message-ID:

these are some random thoughts on the issue of training in HMM and Connectionist networks. i focus on the cross entropy function and follow a different line of thinking than in my paper which i quote further down. this note is the outcome of an exchange between Tony Robinson and me; i thought some netters might be interested. so i want to thank Tony for posing interesting ideas and questions. also thanks to all the people who replied to my request for information on the cross entropy function.

-----------------------

the starting point for this discussion is the following question: "why is HMM training so much faster than Connectionist Networks?" to put the question in perspective, let me first remark that, from a certain point of view, HMM and CN are very similar objects. specifically they use similar architectures to optimize appropriate cost functions. for further explanation of this point, see [Kehagias], also [Kung]. the similarity is even more obvious when CN are used to solve speech recognition problems. the question remains: why, in attempting to solve the same problem, do CN require so much more training?

1. cost functions
-----------------

it appears that a (partial) explanation is the nature of the cost function used in each case. in CN speech recognizers, the cost function of choice is quadratic error (error being the difference of appropriate vectors). however in most of what follows i will consider CN that optimize the cross entropy function. a short discussion of the relationship between cross entropy and square error is included at the end.

in HMM the function MAXIMIZED is likelihood (of the observations). however HMM are a bit more subtle. using the Markov Model, one can write the likelihood of the observations used for training, call it L(q). here q is a vector that contains the transition and emission probabilities (usually called a_ij, b_kj, respectively). to keep the discussion simple, let us consider the only unknown parameters to be the a_ij's. that is, the elements of q are the a_ij's. now, q is a vector, but a more general view of it is that it is a function (specifically a probability density function). so we will consider q as a vector or a function interchangeably. (of course any vector is a function of its index!)

Now, to maximize L is not a trivial task: it is a polynomial of n*T-th order in the elements of q (where n is the order of the Markov model, T the number of observations); furthermore, the elements of q are probabilities and they must satisfy certain positivity and add-up-to-1 conditions.

2. Likelihood maximin, Backward-Forward, EM algorithm
-----------------------------------------------------

so HMM people have found a way to make the optimization problem easier: consider an auxiliary function, call it Q(q,q'), to be presently defined, which can be maximized much more easily. then they prove the remarkable inequality:

(1)   L(q)*log(L(q')/L(q)) >= Q(q,q') - Q(q,q).

the consequence of (1) is the following: we can implement an iterative algorithm that goes as follows:

Step 0: choose q(0)
.....
Step k: choose q(k) such that Q(q(k-1),q(k)) is maximized. if Q(q(k-1),q(k)) - Q(q(k-1),q(k-1)) = 0, terminate. if Q(q(k-1),q(k)) - Q(q(k-1),q(k-1)) > 0, go to step k+1.
.....

REMARKS:

1) observe that no provision is made for the case that Q(q(k-1),q(k)) - Q(q(k-1),q(k-1)) is negative. this is due to the fact that the maximized difference is always nonnegative (q(k) = q(k-1) already achieves zero), as proved in [Baum 1968] or [Dempster].

2) of course, in practice, the termination condition will be replaced by: if Q(q(k-1),q(k)) - Q(q(k-1),q(k-1)) < epsilon, terminate. at every non-terminating step we therefore have

(2)   Q(q(k-1),q(k)) > Q(q(k-1),q(k-1)).

From (1) and (2) and Remark (1) it follows that

(3)   L(q(k)) > L(q(k-1)).

3. Connection of EM with cross entropy and neural networks
----------------------------------------------------------

Now we will discuss the function G and point out the relationship to CN. The function Q(q,q') can be defined in quite a general setting. q, q' are probability densities. as such they are functions themselves; we write q(x), q'(x). x takes values in an appropriate range. e.g., in the HMM model x ranges over all the state transition pairs (i,j), giving the probability of a certain state transition. now, define Q:

(4)   Q(q,q') = sum{over all x} q(x)log(q'(x)).

Then, the difference Q(q,q) - Q(q,q') is:

(5)   Q(q,q) - Q(q,q') = G(q,q') = sum{all x} q(x)log(q(x)/q'(x)).

G is the cross-entropy between q and q', well known to connectionists (and statisticians); that is, a measure of distance between these two probability densities. now we recognize two things:

I. there have been cases where G minimization has been proposed as a CN training procedure; see [Hinton]. In these cases, a desired probability density was known and what was desired was to minimize the distance between desired and actual probability density of the CN output. in some of these cases, there was concurrent maximization of likelihood. this is noted in [Ackley]. it follows necessarily from (1) that minimizing the cross-entropy maximizes the minimum improvement in likelihood.

II. it is clear that the BF algorithm does a similar thing: likelihood maximization, cross entropy minimization.
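Equations (4) and (5) are easy to check numerically. A small sketch (the distributions below are made-up examples, not from the posting):

```python
import math

def Q(q, qp):
    # eq. (4): Q(q,q') = sum over x of q(x) log q'(x)
    return sum(qi * math.log(qpi) for qi, qpi in zip(q, qp))

def G(q, qp):
    # eq. (5): G(q,q') = Q(q,q) - Q(q,q') = sum q(x) log(q(x)/q'(x))
    return Q(q, q) - Q(q, qp)

q  = [0.5, 0.3, 0.2]   # "desired" density (illustrative)
qp = [0.4, 0.4, 0.2]   # "actual" density (illustrative)

assert abs(G(q, q)) < 1e-12   # distance of q from itself is zero
assert G(q, qp) > 0.0         # nonnegative (Gibbs' inequality)

# the two forms in eq. (5) agree term by term
direct = sum(a * math.log(a / b) for a, b in zip(q, qp))
assert abs(G(q, qp) - direct) < 1e-12
```

Note that G is not symmetric in its arguments, so "distance" is meant loosely here, as in the posting.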
as noted in [Baum 1968] and also in [Levinson], the difference q(k) - q(k-1) points in the same direction as grad L(q), evaluated at q(k-1). That is, q(k-1) is changed in the direction of steepest ascent of L. of all the possible steps (choices of q(k)), the one is chosen that minimizes the distance between q(k-1) and q(k) in the cross entropy sense.

4. Comparison in training of HMM and CN:
---------------------------------------

now we can make a comparison of the performance of CN and HMM's. this comparison is between G-optimizing CN's and HMM's; the square-error CN is not discussed here. firstly, we see that the main focus of attention is different in the two cases. in CN we want to minimize cross entropy. in HMM we want to maximize likelihood. however, likelihood maximinimization is an automatic consequence of G minimization for CN's, and local G minimization is built into the BF algorithm. in that sense, the two tasks are very similar and so the question is once again raised: why are HMM's faster to train?

at this point the answers are many and easy. even though HMM's use observations in a nonlinear way, the state vector of the adjoint network (see [Kehagias]) evolves linearly. not so for CN's. the HMM adjoint network is sparsely connected. not necessarily so for the CN (pointed out by [Tony Robinson]). though both cost functions used are nonlinear, the BF is a much more efficient method to optimize the HMM cost function than Back Propagation is for CN's.

the last answer is the really important one. due to the special nature of the Hidden Markov Model, we can use the BF algorithm. this algorithm allows us to take large steps (large changes from q(k-1) to q(k)) in the Euclidean distance, without moving too far away in the cross entropy distance.
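For reference, the likelihood L(q) of section 1 is itself cheap to evaluate by the standard forward recursion, even though it is a sum over exponentially many state paths. A toy sketch (the 2-state model and all its parameters are invented for illustration, not taken from the posting):

```python
# Minimal HMM forward recursion for the likelihood L(q) discussed above.
# a_ij are transition probabilities, b_jk emission probabilities.
a  = [[0.7, 0.3],
      [0.4, 0.6]]          # a[i][j] = P(next state j | state i)
b  = [[0.9, 0.1],
      [0.2, 0.8]]          # b[j][k] = P(symbol k | state j)
pi = [0.5, 0.5]            # initial state distribution
obs = [0, 1, 0]            # observation sequence, T = 3

# forward pass: alpha[j] = P(observations so far, current state = j)
alpha = [pi[j] * b[j][obs[0]] for j in range(2)]
for o in obs[1:]:
    alpha = [sum(alpha[i] * a[i][j] for i in range(2)) * b[j][o]
             for j in range(2)]
likelihood = sum(alpha)

# brute force over all state paths agrees, showing that L(q) really is
# a polynomial in the a_ij and b_jk whose degree grows with T
from itertools import product
brute = 0.0
for path in product(range(2), repeat=len(obs)):
    p = pi[path[0]] * b[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= a[path[t-1]][path[t]] * b[path[t]][obs[t]]
    brute += p
assert abs(likelihood - brute) < 1e-12
```

The recursion costs O(n^2 T) instead of O(n^T); this special structure is part of what the BF algorithm exploits.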
of all the probability distributions, we consider only the ones that are "relevant", in that they are close to the current one; and yet, even though we take conservative steps, we are guaranteed to maximize the minimum improvement in likelihood. indeed the maximin is a conservative attitude. the rationale is the following: "you want to maximize L. you know the steepest ascent direction; you want to go in that direction, but you do not know how far to go. BF will tell you how far you can go (and it will not be an infinitesimal step) so that you maximize the minimum improvement."

another way to look at this is that the Euclidean distance imposes a structure (topology) on the space of probability distributions. the cross entropy distance imposes a different structure, which, apparently, is more relevant to the problem. in contrast, in BP we have not much choice in the change we bring on q. we have control over w, the weights of the connections, and we usually choose them in the steepest descent direction, and small enough that we actually have an improvement. but it is not clear that the cross entropy between distributions imposes a suitable structure on the space of weights. apparently it does not. even a relatively small step in the weight space can change the cost function by a lot. we have to tread more carefully.

of course BF can be used due to the very special structure of the HMM problem (which is probably a good argument for the usefulness of the HM Model). BF is applicable when the cost function is a homogeneous polynomial with additive constraints on the variables (see [Baum 1968]). the CN problem is characterized by harder nonlinearities (e.g. the sigmoid function) which induce a warped relationship between the weights and the cost function. in short, the CN problem is more general and harder.

5. square error cost function
-----------------------------

first a general observation: the square error cost function can be introduced under two assumptions.
in the one case we assume the error to be deterministic and we want to minimize a deterministic sum of square errors (the sum is over all training patterns; the error is the difference between desired and actual response) by appropriate choice of weights. there is nothing probabilistic here. alternatively, we can assume that the training patterns are selected randomly (according to some prob. density) and also that the test patterns will come from the same prob. density, and we choose the weights to minimize expected square error. even though the two points of view are distinct, they are not that different, since in both cases we can define inner products, distance functions etc. and so get a Hilbert space structure that is practically the same for both cases. of course this would involve some ergodicity assumption. at any rate, assume here the probabilistic point of view of square error.

what then are the connections between the two cost functions: cross entropy and expected (or mean) square error? i have seen some remarks on this problem in the literature, but i do not know enough about it at this point. however, judging from training time, i would say that the nonlinear nature of CN with sigmoids again maps the weight space to the cost function in a very warped way. it would be interesting to examine the shape of the cost function contours in the weight space. have such studies been made? visualization seems to be a problem for high dimensional networks.

6. cross entropy maximization and some loose ends
------------------------------------------------

an interesting variation is G maximization. this usually occurs in unsupervised learning. See [Linsker], [Plumbley]. it appears under the name of transinformation maximization, or error information minimization, but these quantities can be interpreted as cross entropy between the joint input-output probability density induced by the CN (for given weights) and the probability density
where input and output have the same marginals, but are independent (so the joint density is a product of the two marginals). i guess a way to explain this in terms of cross entropy is: even though we have no prior information on the best input-output density, there is one density we certainly want to avoid as much as possible, and this is the one where input and output are independent (so the input gives no information as to what the output is). hence we want to maximize the cross entropy distance between this product distribution and the CN induced distribution. there is also a possible interpretation along the lines of the maximum entropy principle. i must say that these interpretations do not seem (yet) to me as appealing as maximum transinformation. however they are possible and indeed statisticians have been considering them for many years now.

another interesting connection is between cross entropy and rate of convergence (obviously rate of convergence is connected to training time). [Ellis] gives an excellent analysis of the connection between rate of convergence and cross entropy. application of his results to computational problems is not obvious. finally, an interesting example (of statistical work that relates to this line of connectionist research) is [Rissanen]; there the linear regression model is considered, which of course can be interpreted as a linear perceptron. in [Rissanen] selection of the optimal model is based on a minmax entropy criterion.

References:
-----------

D. H. Ackley et al.: "A Learning Algorithm for Boltzmann Machines", Cognitive Science 9 (1985).

L. E. Baum & G. R. Sell: "Growth Transformations for Functions on Manifolds", Pacific Journal of Mathematics, Vol. 27, No. 2, 1968.

L. E. Baum et al.: "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains", The Annals of Mathematical Statistics, Vol. 41, No. 1, 1970.

A. P. Dempster et al.: "Maximum Likelihood from Incomplete Data via the EM Algorithm", J. Roy. Stat. Soc., Series B, No. 1, 1977.

R. Ellis: "Entropy, Large Deviations and Statistical Mechanics", Springer, New York, 1985.

G. Hinton: "Connectionist Learning Procedures", Technical Report CMU-CS-87-115 (Carnegie Mellon University), June 1987.

A. Kehagias: "Optimal Control for Training: The Missing Link between HMM and Connectionist Networks", submitted to the 7th Int. Conf. on Math. and Computer Modelling, Chicago, Illinois, August 1989.

S. Y. Kung & J. N. Hwang: "A Unifying Viewpoint of Multilayer Perceptrons and HMM Models", IEEE Int. Symposium on Circuits and Systems, Portland, Oregon, 1989.

S. E. Levinson et al.: "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition", The Bell System Technical Journal, Vol. 62, No. 4, April 1983.

R. Linsker: "Self-Organization in a Perceptual Network", IEEE Computer, Vol. 21, No. 3, March 1988.

M. Plumbley & F. Fallside: "An Information Theoretic Approach to Unsupervised Connectionist Models", Proceedings of the 1988 Connectionist Models Summer School, Pittsburgh, 1988.

J. Rissanen: "Minmax Entropy Estimation of Models for Vector Processes", in Lainiotis & Mehra (eds.), System Advances and Case Studies, Academic Press, New York, 1976.

T. Robinson: personal communication.

From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Thu Mar 16 09:54:52 1989
From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias)
Date: Thu, 16 Mar 89 09:54:52 EST
Subject: HMM?
Message-ID:

with respect to my cross entropy posting, i guess i never said it explicitly: HMM stands for Hidden Markov Model. it is a model widely used in speech research.

Thanasis

From sankar at caip.rutgers.edu Thu Mar 16 09:42:44 1989
From: sankar at caip.rutgers.edu (ananth sankar)
Date: Thu, 16 Mar 89 09:42:44 EST
Subject: questions on kohonen's maps
Message-ID: <8903161442.AA14983@caip.rutgers.edu>

I am interested in the subject of Self Organization and have some questions with regard to Kohonen's algorithm for Self Organizing Feature Maps.
I have tried to duplicate the results of Kohonen for the two dimensional uniform input case, i.e. two inputs. I used a 10 X 10 output grid. The maps that resulted were not as good as reported in the papers.

Questions:

1. Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach.

2. Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features.

3. In Kohonen's book "Self Organization and Associative Memory", Ch. 5, the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation.

4. I have not seen as yet in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited.

5. Can the net become disordered after ordering is achieved at any particular iteration?

I would appreciate any comments, suggestions etc. on the above. Also, so that net mail clutter may be reduced, please respond to sankar at caip.rutgers.edu

Thank you.

Ananth Sankar
Department of Electrical Engineering
Rutgers University, NJ

From KELLY%BROWNCOG.BITNET at mitvma.mit.edu Thu Mar 16 12:12:00 1989
From: KELLY%BROWNCOG.BITNET at mitvma.mit.edu (KELLY%BROWNCOG.BITNET@mitvma.mit.edu)
Date: Thu, 16 Mar 89 12:12 EST
Subject: What is a connectionist net? Here's what it's not.
Message-ID:

What is a connectionist model, you ask? Well, I don't think I can answer that specifically, but I can tell you what it's not. In the first place it *is* a member of a larger class of models called complex systems.
But that doesn't help us either, because nobody really knows what a complex system is. The generally conceived definition has something to do with large numbers of simple, interconnecting units which can perform some type of "cooperative computation". That is, individually the units are so dumb that they can't do anything, but together they can do a lot.

Well, then my claim (I'm really out on a limb here) is that systems with large numbers of very complex, interconnecting units really aren't connectionist models (or even complex systems) at all, no matter how many connections there are or what type of amazing results they achieve. In particular I am referring to the result that Hecht-Nielsen reports in his paper on "Kolmogorov's Mapping Neural Network Theorem" [1987 INNS proceedings?]. There he describes a way of proving that a 2-layered net (one hidden layer) is capable of solving any mapping problem. However, the units in the network are incredibly complex. No longer are we dealing with units that compute threshold functions. The hidden layer units must be able to compute any real, continuous, monotonically increasing function, and the output layer units must be able to compute any *arbitrary* real continuous function. While the fact that a system like this can do some serious computation is interesting (neat, even), it really tells us nothing about connectionist networks.

From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Thu Mar 16 22:19:54 1989
From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias)
Date: Thu, 16 Mar 89 22:19:54 EST
Subject: credits
Message-ID:

recently i posted a note about training of HMM and Connectionist Networks, where i was not careful enough in giving credit to people that deserved it. let me try to make up for it: i had a very interesting exchange of messages with Tony Robinson, that formed the basis for my note. i received messages with ideas and references from Mark Plumbley, Steven Nowlan, Sue Becker and Sara Solla.
Sara Solla referred me to a paper written by Solla, Esther Levin and Michael Fleisher, that deals with the question of cross entropy. i received a copy of this paper today. it is: "Accelerated Learning in Layered Neural Networks", by S. Solla, E. Levin and M. Fleisher, Complex Systems, Vol. 2, 1988. the paper compares cross entropy and square error and includes a numerical study and a study of the shape of the contours of these cost functions. therefore, a similar question that i posed at the end of my note is at least partly answered.

i also received the revised copy of G. Hinton's report on Connectionist learning procedures, referred to in my note. in this report (Dec. 1987) Hinton has already made a remark directly related to my point of maximinimizing likelihood in the BF algorithm. specifically, he says that (in the context of CN training with a cross entropy cost function) likelihood is maximized when cross entropy is minimized.

i think this is all. if i have missed something, let me know about it.

Thanasis

From ROB%BGERUG51.BITNET at VMA.CC.CMU.EDU Fri Mar 17 09:24:00 1989
From: ROB%BGERUG51.BITNET at VMA.CC.CMU.EDU (Rob A. Vingerhoeds / Ghent State University)
Date: Fri, 17 Mar 89 09:24 N
Subject: Neural Networks Seminar Ghent, 25 april 1989, FINAL ANNOUNCEMENT
Message-ID:

BIRA SEMINAR ON NEURAL NETWORKS
"APPLICATION OF NEURAL NETWORKS IN INDUSTRY, WHEN AND HOW"
25 APRIL 1989
INTERNATIONAL CONGRESS CENTRE GHENT, BELGIUM
FINAL ANNOUNCEMENT

BIRA (Belgian Institute for Control Engineering and Automation) is organising a seminar on the state of the art in Neural Networks. The central theme will be "Application of Neural Networks in Industry, when and how". To give a good and reliable verdict on this theme, some of the most important and leading scientists in this fascinating area have been invited to present a lecture at the seminar and take part in a panel discussion.
The following program is foreseen:

 8.30 -  9.00  Registration
 9.00 -  9.15  Opening on behalf of BIRA - Prof. L. Boullart, Ghent State University
 9.15 - 10.00  Learning Algorithms and applications in A.I. - Prof. Fogelman Soulie, Universite de Paris V
10.00 - 10.30  coffee
10.30 - 11.30  The Neural Network Framework - Prof. B. Kosko, University of Southern California
11.30 - 12.00  Presentation of ANZA+ products, hardware and software - Patrick Dumont, Digilog, France
12.00 - 14.00  lunch / exhibition
14.00 - 15.00  Integration of knowledge-based system and neural network techniques for robotic control - Dr. David Handelman, Princeton, USA
15.00 - 16.00  Application in Image Processing and Pattern Recognition (Neocognitron) - Dr. S. Miyake, ATR, Japan
16.00 - 16.30  tea
16.30 - 17.15  panel discussion over the central theme
17.15 - 17.30  closing and conclusions

The seminar will be held in the same period as the famous Flanders Technology International (F.T.I.) exhibition. The exhibition is well worth a visit both for representatives from industry and for other interested people, so attending the seminar and the exhibition together is doubly worthwhile.

VENUE: International Congress Centre Ghent - Orange Room - Citadelpark, B-9000 Ghent
DATE: Tuesday 25 april 1989
LANGUAGE: The seminar language is English. No translation will be provided.
REGISTRATION FEES: members BIRA/IBRA 12.500 BEF; non-members 15.000 BEF; Teachers/Assistants 7.500 BEF; including coffee/tea, lunch and proceedings. Students can get a special price of 1.500 BEF, which does NOT include lunch.

Tickets for FLANDERS TECHNOLOGY INTERNATIONAL can be obtained at the registration desk. Payments in Belgian Francs only, to be made on receipt of an invoice from the BIRA office. Registration will close on 18 april 1989. Confirmations will NOT be sent. For further information or a printed announcement with a registration form please contact either the BIRA coordinator (address below) or one of us (using e-mail).
You can also use the registration form printed below and send it via e-mail back to us. We will then make sure it reaches BIRA in time.

----------------------------------------------------------------------
REGISTRATION FORM

Tuesday 25 april 1989, I.C.C.-Ghent
BIRA Seminar on NEURAL NETWORKS

NAME:                 ..................................................
FIRST NAME:           ..................................................
ADDRESS:              ..................................................
                      ..................................................
POSITION:             ..................................................
CONCERN OR INSTITUTE: ..................................................
                      ..................................................
TEL:                  ..................................................
FAX:                  ..................................................
-------------------------
Member BIRA/IBRA    : ........ BEF
Non-members         : ........ BEF
Teachers/Assistants : ........ BEF
-------------------------

Please only settle payment upon receipt of an invoice from the BIRA-Office. Please indicate whether the invoice should be addressed to the company or to your personal address.

Date:

Please send back before 17 april 1989. Do NOT use 'REPLY', because in that way everyone on the list will be informed about your plans to come to the seminar and they just might not be interested in it.
----------------------------------------------------------------------

Seminar Coordinators: Rob Vingerhoeds, Leo Vercauteren

BIRA COORDINATOR
L. Pauwels, BIRA-Office
Het Ingenieurshuis
Desguinlei 214
2018 Antwerpen
Belgium
tel: +32-3-216-09-96
fax: +32-3-216-06-89 (attn. BIRA L. Pauwels)

From alexis%yummy at gateway.mitre.org Fri Mar 17 09:46:27 1989
From: alexis%yummy at gateway.mitre.org (alexis%yummy@gateway.mitre.org)
Date: Fri, 17 Mar 89 09:46:27 EST
Subject: What is a connectionist net? Here's what it's not.
In-Reply-To: KELLY%BROWNCOG.BITNET@mitvma.mit.edu's message of Thu, 16 Mar 89 12:12 EST <8903170151.AA26943@gateway.mitre.org>
Message-ID: <8903171446.AA02093@marzipan.mitre.org>

************ Do Not Forward To Any Other BBoards, Etc ************

Just an aside to KELLY%BROWNCOG's note: rather than worry if Hecht-Nielsen's neural net (and I use the term intentionally -- I mean "artificial intelligence" is neither so ...) is really a connectionist model, let me point out a paper/result worth being aware of.

G. Cybenko wrote a very interesting paper which proves that a neural network with *one* hidden layer of nodes (i.e., one more than a perceptron) with a sigmoid transfer function can "uniformly approximate any continuous function with support in the unit hypercube". That is to say you actually can do any mapping with *ONE* hidden layer (albeit often a very very large one).

Cybenko sent the paper to me because of a tirade I went on awhile ago on this bboard, so I don't actually know if it has been published anywhere yet. I'm writing this without his knowledge -- I'm pretty sure he's on this list. G. Cybenko are you out there, and are you willing to say where your paper "Approximation by Superpositions of a Sigmoidal Function" can be found by the hungary masses?

alexis wieland.

************ Do Not Forward To Any Other BBoards, Etc ************

From sontag at fermat.rutgers.edu Sat Mar 18 18:27:29 1989
From: sontag at fermat.rutgers.edu (sontag@fermat.rutgers.edu)
Date: Sat, 18 Mar 89 18:27:29 EST
Subject: ONE HIDDEN LAYER IS ENOUGH -- re "what is a net?" discussion
Message-ID: <8903182327.AA06225@control.rutgers.edu>

This is in response to Alexis Wieland's request: "G. Cybenko are you out there, and are you willing to say where your paper "Approximation by Superpositions of a Sigmoidal Function" can be found by the hungary (sic) masses?"
(Presumably non-Hungarian masses are interested too, so:) The paper by George Cybenko that proves this theorem (a neural network with one hidden layer of nodes with a fixed sigmoid transfer function can uniformly approximate any continuous function) is scheduled to appear in MATHEMATICS OF CONTROL, SIGNALS, AND SYSTEMS, Vol.2 (1989), Number 4. Your library should have this journal, which specializes in the formal mathematical analysis of problems related to signal processing and systems. (The journal has published many other papers that should be relevant to theoretical connectionist research, such as papers on iterated projection methods, estimation, interpolation techniques, identification, and adaptive control.) If your library doesn't yet subscribe, you might as well provide them with the following info: MATHEMATICS OF CONTROL, SIGNALS, AND SYSTEMS Springer-Verlag New York, Inc ISSN 0932-4194, Title # 498 In North America, order from: Springer-Verlag New York, Inc Journal Fulfillment Services 44 Hartz Way, Secaucus, NJ 07094 (Volume 2, 1989 ... $179.00 incl. p&h) Outside NA, order from: Springer-Verlag Heidelberger Platz 3 D-1000 Berlin 33, FRG (Volume 2, 1989 ... DM 348.- incl. p&h) -bradley dickinson and eduardo d. sontag, co-Managing eds. From terry%sdbio2 at ucsd.edu Sat Mar 18 21:11:09 1989 From: terry%sdbio2 at ucsd.edu (Terry Sejnowski) Date: Sat, 18 Mar 89 18:11:09 PST Subject: ONE HIDDEN LAYER IS ENOUGH -- re "what is a net?" discussion Message-ID: <8903190211.AA17912@sdbio2.UCSD.EDU> Hal White in the Economics Department at UCSD has also proved that one hidden layer can uniformly approximate smooth mappings. He has gone on to prove the even more interesting theorem that it is possible to learn the mapping. Write to him for a preprint: Hal White Department of Economics UCSD San Diego, CA 92093 Two related papers that are in press in Neural Computation: What size net gives valid generalization? 
by Eric Baum and David Haussler

A proposal for more powerful learning algorithms, by Eric Baum.

For preprints write to:
Eric Baum
Department of Physics
Princeton University
Princeton, NJ 08540

Terry Sejnowski

-----

From chrisley.pa at Xerox.COM Mon Mar 20 14:25:00 1989
From: chrisley.pa at Xerox.COM (chrisley.pa@Xerox.COM)
Date: 20 Mar 89 11:25 PST
Subject: questions on kohonen's maps
In-Reply-To: ananth sankar's message of Thu, 16 Mar 89 09:42:44 EST
Message-ID: <890320-112612-6136@Xerox>

Ananth Sankar recently asked some questions about Kohonen's feature maps. As I have worked on these issues with Kohonen, I feel like I might be able to give some answers, but standard disclaimers apply: I cannot be certain that Kohonen would agree with all of the following. Also, I do not have my copy of his book with me, so I cannot be more specific about references.

Questions:

1. Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach.

Although there is probably more than one correct, task-independent gain or neighborhood function, Kohonen does mention constraints that all of them should meet. For example, both functions should decrease to zero over time. I do not know of any tweaking; Kohonen usually determines a number of iterations and then decreases the gain linearly. If you call this tweaking, then your idea of domain-independent parameters might be a sort of holy grail, since it does not seem likely that we are going to find a completely parameter-free learning algorithm that will work in every domain.

2. Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features.
As far as I know, Kohonen has used the same type of gain and neighborhood functions for all of his map demonstrations. These demonstrations, which have been shown via an animated film at several major conferences, demonstrate maps learning the distribution in cases where 1) the dimensionality of the network topology and the input space mismatch, e.g., where the network is 2d and the distribution is a 3d 'cactus'; and 2) the distribution is not uniform. The algorithm was developed with these 2 cases in mind, so it is no surprise that the results are good for them as well.

3. In Kohonen's book "Self Organization and Associative Memory", Ch. 5, the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation.

That's right. And Kohonen usually uses the Euclidean distance metric, although other ones can be used (which he discusses in the book). Furthermore, there have been independent efforts to normalize weights in Kohonen maps so that the dot product measure can be used. If you have any doubts about the suitability of the Euclidean metric, as your question seems to imply, express them. It is an interesting issue.

4. I have not seen as yet in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited.

The primary interest in maps, I believe, came from a desire to display high-dimensional information in low dimensional spaces, which are more easily apprehended.
But there is evidence that there are other uses as well: 1) Kohonen has published results on using maps for phoneme recognition, where the topology-preservation plays a significant role (such maps are used in the Otaniemi Phonetic Typewriter featured in, I think, Computer magazine a year or two ago); 2) work has been done on using the topology to store sequential information, which seems to be a good idea if you are dealing with natural signals that can only temporally shift from a state to similar states; 3) several people have followed Kohonen's suggestion of using maps for adaptive kinematic representations for robot control (the work on Murphy, mentioned on this net a month or so ago, and the work being done at Carlton (sp) University by Darryl Graf are two good examples). In short, just look at some ICNN or INNS proceedings, and you'll find many papers where researchers found Kohonen maps to be a good place from which to launch their own studies. 5. Can the net become disordered after ordering is achieved at any particular iteration? Of course, this is theoretically possible, and is almost certain if at some point the distribution of the mapped function changes. But this brings up the difficult question: what is the proper ordering in such a case? Should a net try to integrate both past and present distributions, or should it throw away the past and concentrate on the present? I think most nn researchers would want a little of both, with maybe some kind of exponential decay in the weights. But in many applications of maps, there is no chance of the distribution changing: it is fixed, and iterations are over the same test data each time. In this case, I would guess that the ordering could not become disrupted (at least for simple distributions and a net of adequate size), but I realise that there is no proof of this, and the terms 'simple' and 'adequate' are lacking definition. But that's life in nnets for you! If anyone has any more questions, feel free.
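Since several of the answers above turn on the same few ingredients (a Euclidean winner search, a gain that decreases linearly to zero, a neighbourhood that shrinks over time), a minimal sketch may help. This is my own toy rendering, not Kohonen's code; the grid size, iteration count, and both schedules are illustrative assumptions:

```python
# Toy Kohonen feature map: a 10 x 10 grid learning a 2-D uniform distribution.
# All parameter choices below are illustrative assumptions, not Kohonen's.
import random

random.seed(0)
GRID, DIM, T = 10, 2, 2000

# one weight vector per output unit, initialised at random in [0, 1)
w = [[[random.random() for _ in range(DIM)] for _ in range(GRID)]
     for _ in range(GRID)]

for t in range(T):
    x = [random.random() for _ in range(DIM)]          # uniform 2-D input
    a = 0.5 * (1.0 - t / T)                            # gain decreasing linearly to zero
    radius = max(1, int((GRID // 2) * (1.0 - t / T)))  # shrinking neighbourhood
    # winner: unit whose weight vector is closest to x (Euclidean, not dot product)
    wi, wj = min(((i, j) for i in range(GRID) for j in range(GRID)),
                 key=lambda ij: sum((c - xc) ** 2
                                    for c, xc in zip(w[ij[0]][ij[1]], x)))
    # move the winner and its topological neighbours toward the input
    for i in range(GRID):
        for j in range(GRID):
            if abs(i - wi) <= radius and abs(j - wj) <= radius:
                w[i][j] = [c + a * (xc - c) for c, xc in zip(w[i][j], x)]
```

With a schedule like this there is no per-iteration hand tweaking: the only free choices are the total number of iterations and the initial gain and radius, which is essentially the "fix the iteration count, then decrease linearly" recipe described above.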
Ron Chrisley Xerox PARC System Sciences Lab 3333 Coyote Hill Road Palo Alto, CA 94304 USA chrisley.pa at xerox.com tel: (415) 494-4728 OR New College Oxford OX1 3BN UK chrisley at vax.oxford.ac.uk tel: (865) 279-492 From moody-john at YALE.ARPA Tue Mar 21 16:11:08 1989 From: moody-john at YALE.ARPA (john moody) Date: Tue, 21 Mar 89 16:11:08 EST Subject: two research reports available Message-ID: <8903212107.AA03190@NEBULA.SUN3.CS.YALE.EDU> ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* FAST LEARNING IN MULTI-RESOLUTION HIERARCHIES John Moody Research Report YALEU/DCS/RR-681, February 1989 ABSTRACT A class of fast, supervised learning algorithms is presented. They use local representations, hashing, and multiple scales of resolution to approximate functions which are piece-wise continuous. Inspired by Albus's CMAC model, the algorithms learn orders of magnitude more rapidly than typical implementations of back propagation, while often achieving comparable qualities of generalization. Furthermore, unlike most traditional function approximation methods, the algorithms are well suited for use in real-time adaptive signal processing. Unlike simpler adaptive systems, such as linear predictive coding, the adaptive linear combiner, and the Kalman filter, the new algorithms are capable of efficiently capturing the structure of complicated non-linear systems. As an illustration, the algorithm is applied to the prediction of a chaotic time series. NOTE: This research report will appear in Advances in Neural Information Processing Systems, edited by David Touretzky, to be published in April 1989 by Morgan Kaufmann Publishers, Inc. The author gratefully acknowledges financial support under ONR grant N00014-89-J-1228, ONR grant N00014-86-K-0310, AFOSR grant F49620-88-C0025, and a Purdue Army subcontract.
*********************************************************** FAST LEARNING IN NETWORKS OF LOCALLY-TUNED PROCESSING UNITS John Moody and Christian J. Darken Research Report YALEU/DCS/RR-654, October 1988, Revised March 1989 ABSTRACT We propose a network architecture which uses a single internal layer of locally-tuned processing units to learn both classification tasks and real-valued function approximations. We consider training such networks in a completely supervised manner, but abandon this approach in favor of a more computationally efficient hybrid learning method which combines self-organized and supervised learning. Our networks learn faster than back propagation for two reasons: the local representations ensure that only a few units respond to any given input, thus reducing computational overhead, and the hybrid learning rules are linear rather than nonlinear, thus leading to faster convergence. Unlike many existing methods for data analysis, our network architecture and learning rules are truly adaptive and are thus appropriate for real-time use. NOTE: This research report will appear in Neural Computation, a new journal edited by Terry Sejnowski and published by MIT Press. The work was supported by ONR grant N00014-86-K-0310, AFOSR grant F49620-88-C0025, and a Purdue Army subcontract.
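The two-stage idea in this abstract (self-organized placement of the locally-tuned units, then a purely linear supervised rule for the output layer) can be caricatured in a few lines. This is only my reading of the general approach, not the authors' algorithm; the task, unit count, Gaussian width, and learning rates are invented for illustration:

```python
# Hybrid-learning caricature: competitive placement of Gaussian centers
# (self-organized), then LMS on the linear output layer (supervised).
# All parameters here are invented, not from the Moody/Darken report.
import math
import random

random.seed(2)

def target(x):                      # function to approximate
    return math.sin(2 * math.pi * x)

xs = [random.random() for _ in range(200)]

# Stage 1 (self-organized): competitive learning moves K centers onto the data.
K, width = 10, 0.1
centers = [random.random() for _ in range(K)]
for x in xs * 5:
    c = min(range(K), key=lambda k: abs(centers[k] - x))
    centers[c] += 0.05 * (x - centers[c])

def phi(x):                         # locally-tuned responses: few units fire per input
    return [math.exp(-((x - c) / width) ** 2) for c in centers]

# Stage 2 (supervised, linear): LMS fit of the output weights only.
w = [0.0] * K
for _ in range(100):
    for x in xs:
        h = phi(x)
        err = target(x) - sum(wi * hi for wi, hi in zip(w, h))
        w = [wi + 0.1 * err * hi for wi, hi in zip(w, h)]

mse = sum((target(x) - sum(wi * hi for wi, hi in zip(w, phi(x)))) ** 2
          for x in xs) / len(xs)
```

Because only the output layer is trained in a supervised way, and its rule is linear, each step is cheap and convergence is fast; that is the speedup the abstract attributes to the hybrid method.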
*********************************************************** Copies of both reports can be obtained by sending a request to: Judy Terrell Yale Computer Science PO Box 2158 Yale Station New Haven, CT 06520 (203)432-1200 e-mail: terrell at cs.yale.edu terrell at yale.arpa terrell at yalecs.bitnet ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* ------- From chrisley.pa at Xerox.COM Thu Mar 23 14:35:00 1989 From: chrisley.pa at Xerox.COM (chrisley.pa@Xerox.COM) Date: 23 Mar 89 11:35 PST Subject: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST Message-ID: <890323-113527-4949@Xerox> One further note about Ananth Sankar's questions about Kohonen maps: A friend of mine, Tony Bell, tells me (and Ananth) that Helge Ritter has a "neat set of expressions for the learning rate and neighbourhood size parameters... and he also proves something about convergence elsewhere." Unfortunately, I do not as yet have a reference for the papers, but I have liked Ritter's work in the past, so I thought people on the net might be interested. From jose at tractatus.bellcore.com Wed Mar 22 10:44:09 1989 From: jose at tractatus.bellcore.com (Stephen J Hanson) Date: Wed, 22 Mar 89 10:44:09 EST Subject: technical report available Message-ID: <8903221544.AA14583@tractatus.bellcore.com> Princeton Cognitive Science Lab Technical Report: CSL36, February, 1989. COMPARING BIASES FOR MINIMAL NETWORK CONSTRUCTION WITH BACK-PROPAGATION Stephen Jos'e Hanson Bellcore and Princeton Cognitive Science Laboratory and Lorien Y. Pratt Rutgers University ABSTRACT Rumelhart (1987) has proposed a method for choosing minimal or "simple" representations during learning in Back-propagation networks.
This approach can be used to (a) dynamically select the number of hidden units, (b) construct a representation that is appropriate for the problem, and (c) thus improve the generalization ability of Back-propagation networks. The method Rumelhart suggests involves adding penalty terms to the usual error function. In this paper we introduce Rumelhart's minimal networks idea and compare two possible biases on the weight search space. These biases are compared in both simple counting problems and a speech recognition problem. In general, the constrained search does seem to minimize the number of hidden units required, with an expected increase in local minima. To appear in Advances in Neural Information Processing Systems, D. Touretzky, Ed., 1989. Research was jointly sponsored by Princeton CSL and Bellcore. REQUESTS FOR THIS TECHNICAL REPORT SHOULD BE SENT TO laura at clarity.princeton.edu Please do not reply to this message or forward it. Thank you. From lwyse at bucasb.BU.EDU Tue Mar 21 13:59:02 1989 From: lwyse at bucasb.BU.EDU (lwyse@bucasb.BU.EDU) Date: Tue, 21 Mar 89 13:59:02 EST Subject: questions on kohonen's maps In-Reply-To: connectionists@c.cs.cmu.edu's message of 20 Mar 89 23:47:09 GMT Message-ID: <8903211859.AA04927@cochlea.bu.edu> What does "ordering" mean when you're projecting inputs to a lower-dimensional space? For example, with the "Peano"-type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain. In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes.
-lonce From gblee at CS.UCLA.EDU Fri Mar 24 13:25:07 1989 From: gblee at CS.UCLA.EDU (Geunbae Lee) Date: Fri, 24 Mar 89 10:25:07 PST Subject: questions on konhonen's map Message-ID: <8903241825.AA25252@maui.cs.ucla.edu> >What does "ordering" mean when you're projecting inputs to a lower dimensional >space?
It means topological ordering. >For example, the "Peano" type curves that result from a one-D >neighborhood learning a 2-D input distribution, it is obviously NOT >true that nearby points in the input space maximally activate nearby >points on the neighborhood chain. It depends on what you mean by "nearby". If it is nearby in a relative sense (in topological relation), not an absolute sense, then the nearby points in the input space DO maximally activate nearby points on the neighborhood chain. --Geunbae Lee AI Lab, UCLA From LIN2 at ibm.com Fri Mar 24 15:02:32 1989 From: LIN2 at ibm.com (Ralph Linsker) Date: 24 Mar 89 15:02:32 EST Subject: Technical report available Message-ID: <032489.150233.lin2@ibm.com> ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* The following report (IBM Research Report RC 14195, Nov. 1988) is available upon request to: lin2 @ ibm.com It will appear in: Advances in Neural Information Processing Systems 1, ed. D. S. Touretzky (San Mateo, CA: Morgan Kaufmann), April 1989. "An Application of the Principle of Maximum Information Preservation to Linear Systems," Ralph Linsker This paper addresses the problem of determining the weights for a set of linear filters (model "cells") so as to maximize the ensemble-averaged information that the cells' output values jointly convey about their input values, given the statistical properties of the ensemble of input vectors. The quantity that is maximized is the Shannon information rate, or equivalently the average mutual information between input and output.* Several models for the role of processing noise are analyzed, and the biological motivation for considering them is described. For simple models in which nearby input signal values (in space or time) are correlated, the cells resulting from this optimization process include center-surround cells and cells sensitive to temporal variations in the input signal. *The possible relation between this optimization principle and the organization of a sensory processing system is discussed in: R. Linsker, Computer 21(3), 105-117 (March 1988). If you would like a reprint of the Computer article, please so note. From chrisley.pa at Xerox.COM Fri Mar 24 17:53:00 1989 From: chrisley.pa at Xerox.COM (chrisley.pa@Xerox.COM) Date: 24 Mar 89 14:53 PST Subject: questions on kohonen's maps In-Reply-To: lwyse@bucasb.BU.EDU's message of Tue, 21 Mar 89 13:59:02 EST Message-ID: <890324-145332-8519@Xerox> Lonce (lwyse at bucasb.BU.EDU) writes: "What does "ordering" mean when you're projecting inputs to a lower dimensional space? For example, the "Peano" type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain." It is not true that nearby points in input space are always mapped to nearby points in the output space when the mapping is dimensionality-reducing, agreed. But 'ordering' still makes sense. The map is topology-preserving if the dependency is in the other direction, i.e., if nearby points in output space are always activated by nearby points in input space. Lonce goes on to say: "In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes." I agree that topology preservation is not necessarily of utmost importance, but it may be useful in some applications, such as the ones I mentioned a few messages back (phoneme recognition, inverse kinematics, etc.).
Also, there is 1) the interest in properties of self-organizing systems in themselves, even though an application can't be immediately found; and 2) the observation that, for some reason, the brain seems to use topology-preserving maps (with the one-way dependency I mentioned above), which, although they *could* be computationally unnecessary or even disadvantageous, are probably in fact, nature being what she is, good solutions to tough real-time problems. Ron Chrisley After April 14th, please send personal email to Chrisley at vax.ox.ac.uk From ken at phyb.ucsf.EDU Sun Mar 26 01:17:59 1989 From: ken at phyb.ucsf.EDU (Ken Miller) Date: Sat, 25 Mar 89 22:17:59 pst Subject: Normalization of weights in Kohonen algorithm Message-ID: <8903260617.AA08352@phyb> re point 3 of recent posting about Kohonen algorithm: "3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights." the algorithm du_{ij}/dt = a(t)[e_j(t) - u_{ij}(t)], i in N_c, where u = weights, e is the input pattern, and N_c is the topological neighborhood of the maximally responding unit, should I believe be written du_{ij}/dt = a(t)[ e_j(t)/\sum_k(e_k(t)) - u_{ij}(t)/\sum_k(u_{ik}(t)) ], i in N_c. That is, the change should be such as to move the jth synaptic weight on the ith cell, as a PROPORTION of all the synaptic weights on the ith cell, in the direction of matching the PROPORTION of input which was incoming on the jth line. Note that in this case \sum_j du_{ij}(t)/dt = 0, so weights remain normalized in the sense that the sum over each cell remains constant. If inputs are normalized to sum to 1 (\sum_k(e_k(t)) = 1) then the first denominator can be omitted. If weights begin normalized to sum to 1 on each cell (\sum_k(u_{ik}(t)) = 1 for all i) then weights will remain normalized to sum to 1, hence the second denominator can be omitted. Perhaps Kohonen was assuming these normalizations and hence dispensing with the denominators?
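The conservation property of the proportion-matching rule is easy to check numerically. Here is a discrete-time sketch for a single cell (the variable names and step size are mine, purely for illustration):

```python
# One cell's weights under the proportion-matching rule above, discretised:
#   u_ij <- u_ij + a * [ e_j / sum_k e_k  -  u_ij / sum_k u_ik ]
# Names and the step size a = 0.1 are illustrative, not from Kohonen's book.
def normalized_update(u, e, a):
    se, su = sum(e), sum(u)
    return [uj + a * (ej / se - uj / su) for uj, ej in zip(u, e)]

u = [0.2, 0.3, 0.5]     # weights on cell i, summing to 1
e = [1.0, 3.0, 6.0]     # input pattern (proportions 0.1, 0.3, 0.6)
for _ in range(100):
    u = normalized_update(u, e, 0.1)
# Since sum_j du_ij/dt = a * (1 - 1) = 0, sum(u) stays 1.0 throughout,
# while the weight proportions relax toward the input proportions.
```

So with both denominators present the summed weight on each cell is exactly conserved, which is the sense of "normalized" Ken describes; drop a denominator only under the corresponding normalization assumption.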
ken miller (ken at phyb.ucsf.edu) From nowlan at ai.toronto.edu Tue Mar 28 09:41:36 1989 From: nowlan at ai.toronto.edu (Steven J. Nowlan) Date: Tue, 28 Mar 89 09:41:36 EST Subject: training time in HMM and CN Message-ID: <89Mar28.094139est.10529@ephemeral.ai.toronto.edu> Two comments on Thansis' post on the relative training speed of HMM vs CN for sequential problems such as speech recognition: 1. The BF algorithm is quite highly optimized, while vanilla BP doesn't implement anything that a numerical analyst would consider a real descent procedure (not even steepest descent). If you were to use a reasonably powerful numerical optimization technique, such as one of the Broyden methods, you might find CN convergence extremely fast. Ray Watrous has in fact shown this sort of speedup for speech problems [1]. 2. A more subtle, but probably more important, difference is the issue of how targets are specified over an input sequence. The BF algorithm specifies targets for intermediate steps in an input sequence based on expectations of the final outcome of that sequence, collected from many similar sequences. It is not clear how to specify output targets for intermediate points of an input sequence in a CN, although Watrous has shown that intelligent choice of such targets can markedly improve CN convergence and performance. Of interest in this regard is the work by Sutton on Temporal Difference methods [2]. One can view this work as specifying a target function over a sequence in a dynamical way, so that the target function reflects the experience of the system to date in a clever way. Sutton [2] has shown an equivalence between one form of linear TD method and the maximum likelihood estimates of the parameters for an absorbing Markov chain model of the same process. This seems much closer in flavour to what the BF algorithm is doing, and when applied to a non-linear system may in fact be an interesting generalization of BF.
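For readers who have not seen the TD methods mentioned in point 2, here is a toy sketch of tabular TD(0) on an absorbing random walk. The task, step size, and episode count are my own illustrative choices, not Sutton's code; the point is only that the target for each intermediate state is built from the prediction at the following state, i.e., the target function is specified dynamically from experience:

```python
# TD(0) on a 5-state absorbing random walk: learn to predict the probability
# of absorbing on the right (outcome 1) rather than the left (outcome 0).
# All parameter choices are illustrative.
import random

random.seed(1)
N, alpha = 5, 0.1
V = [0.5] * N                      # initial predictions for states 0..4

for episode in range(2000):
    s = N // 2                     # start in the middle state
    while True:
        s2 = s + random.choice((-1, 1))
        if s2 < 0:                 # absorbed left: final outcome 0
            V[s] += alpha * (0.0 - V[s])
            break
        if s2 >= N:                # absorbed right: final outcome 1
            V[s] += alpha * (1.0 - V[s])
            break
        # intermediate step: the *next* prediction serves as the target
        V[s] += alpha * (V[s2] - V[s])
        s = s2
```

No intermediate targets are ever supplied, yet every state acquires one from the statistics of where its successors lead; V comes to approximate the absorption probabilities of the underlying Markov chain, which is the flavour of equivalence Sutton proves for the linear case.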
Comments and requests for clarification should be directed to me, not to Connectionists, please. - Steve Nowlan nowlan at ai.toronto.edu References: [1] Watrous, Raymond L. "Speech Recognition Using Connectionist Networks", TR MS-CIS-88-96, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, 1988. [2] Sutton, Richard S. "Learning to Predict by the Methods of Temporal Differences", GTE Technical Report TR87-509.1, GTE Laboratories Inc., Waltham, Mass., 1987. From cfields at NMSU.Edu Tue Mar 28 19:56:24 1989 From: cfields at NMSU.Edu (cfields@NMSU.Edu) Date: Tue, 28 Mar 89 17:56:24 MST Subject: No subject Message-ID: <8903290056.AA14581@NMSU.Edu> Call for Participants / Call for Abstracts Symbolic Problem Solving in Noisy, Novel, and Uncertain Task Environments 20-21 August, 1989 (tentative), Detroit, MI, USA An IJCAI-89 Workshop, Sponsored by AAAI Goals. Brittleness in the face of noise, novelty, and uncertainty is a well-known failing of symbolic problem solvers. The goals of this Workshop are to characterize the features of task environments that cause brittleness, to investigate mechanisms for decreasing the brittleness of symbolic problem solvers, and to review case histories of implemented systems that function in task environments high in noise, novelty, and data of uncertain relevance. Topics of interest for the Workshop include the following. Analysis of task environments: Definitions of noise, novelty, and uncertain relevance; exploration of related concepts in general systems theory or logic; parameters for characterizing task environments; knowledge engineering strategies. Mechanisms for addressing noise and novelty: Plasticity and learning; constructive problem solving; fragmentation of knowledge structures; dynamic modification of rules, schemata, or cases; coherence maintenance; adaptive control mechanisms.
Representations: Data structures allowing dynamic abstraction and modification; representation of ``unstructured'' knowledge; knowledge implicit in control or learning procedures; ordering of knowledge structures; tradeoffs between explicit and implicit knowledge representation. Implementation issues: Implementing symbolic problem solvers on parallel machines; concurrency control strategies; integrating symbolic systems with artificial neural networks; general systems integration. Researchers interested in participating in the Workshop are invited to submit abstracts describing work in any of these topic areas. Format. All participants will present their current work, either as a brief oral report or as a poster. Most presentations will be posters, as these provide the greatest opportunity for presentation and discussion of technical details. Presentations will be on the first day of the Workshop, followed by discussions in working groups organized by application domain and a panel discussion on the second day. Attendance at IJCAI Workshops is limited to fifty participants. Participants not registered for IJCAI must pay a $50/day fee. Abstract Submission. Please submit a 1 page abstract of the work to be presented, together with a cover letter summarizing previous work in relevant areas and expected contribution to the Workshop, to Mike Coombs, Box 30001/3CRL, New Mexico State University, Las Cruces, NM 88003-0001 USA, by 15 May 1989. Authors will be notified as to acceptance by 1 June 1989. Accepted abstracts will be distributed at the Workshop. A volume collecting selected papers from the Workshop is planned; papers for this volume will be solicited at the Workshop. Organizers. Mike Coombs and Chris Fields (NMSU), Russ Frew (GE), David Goldberg (Alabama), Jim Reggia (Maryland). Points of contact: Mike Coombs, 505-646-5757, mcoombs at nmsu.edu; Chris Fields, 505-646-2848, cfields at nmsu.edu. 
From elman%amos at ucsd.edu Wed Mar 29 00:30:44 1989 From: elman%amos at ucsd.edu (Jeff Elman) Date: Tue, 28 Mar 89 21:30:44 PST Subject: 1990 Connectionist Summer School announcement Message-ID: <8903290530.AA23241@amos.UCSD.EDU> March 28, 1989 PRELIMINARY ANNOUNCEMENT CONNECTIONIST SUMMER SCHOOL / SUMMER 1990 UCSD La Jolla, California The next Connectionist Summer School will be held at the University of California, San Diego in June 1990. This will be the third session in the series, which was held at Carnegie-Mellon in the summers of 1986 and 1988. The summer school will offer courses in a variety of areas of connectionist modelling, with emphasis on computational neuroscience, cognitive models, and hardware implementation. In addition to full courses, there will be a series of shorter tutorials, colloquia, and public lectures. Proceedings of the summer school will be published the following fall. As in the past, participation will be limited to graduate students enrolled in Ph.D. programs (full- or part-time). Admission will be on a competitive basis. We hope to have sufficient funding to subsidize tuition and housing. THIS IS A PRELIMINARY ANNOUNCEMENT. Further details will be announced over the next several months. Terry Sejnowski Jeff Elman UCSD/Salk UCSD Geoff Hinton Dave Touretzky Toronto CMU hinton at ai.toronto.edu touretzky at cs.cmu.edu From niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Wed Mar 29 09:17:49 1989 From: niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (Mahesan Niranjan) Date: Wed, 29 Mar 89 09:17:49 BST Subject: Missing link etc... Message-ID: <23751.8903290817@dsl.eng.cam.ac.uk> Some recent papers and postings on this network compare HMMs and Multi-layer neural networks. Here is something I find missing in these discussions. In speech pattern processing, HMMs make an inherent assumption about the time series: that it can be chopped up into a sequence of piecewise stationary regions.
Thus, an HMM places break-points in the transition regions of the signal and models the steady regions by the statistical parameters of individual states. For speech signals, this is a bad assumption (human speech production is not at all like this) - but the recognisers somehow seem to work!! In neural networks (with or without feedback), what is the equivalent assumption about the time evolution of the signal? niranjan From ersoy at ee.ecn.purdue.edu Wed Mar 29 12:22:20 1989 From: ersoy at ee.ecn.purdue.edu (Okan K Ersoy) Date: Wed, 29 Mar 89 12:22:20 EST Subject: No subject Message-ID: <8903291722.AA07623@ee.ecn.purdue.edu> CALL FOR PAPERS AND REFEREES HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES - 23 NEURAL NETWORKS AND RELATED EMERGING TECHNOLOGIES KAILUA-KONA, HAWAII - JANUARY 3-6, 1990 The Neural Networks Track of HICSS-23 will contain a special set of papers focusing on a broad selection of topics in the area of Neural Networks and Related Emerging Technologies. The presentations will provide a forum to discuss new advances in learning theory, associative memory, self-organization, architectures, implementations and applications. Papers are invited that may be theoretical, conceptual, tutorial or descriptive in nature. Those papers selected for presentation will appear in the Conference Proceedings, which is published by the Computer Society of the IEEE. HICSS-23 is sponsored by the University of Hawaii in cooperation with the ACM, the Computer Society, and the Pacific Research Institute for Information Systems and Management (PRIISM). Submissions are solicited in: Supervised and Unsupervised Learning Associative Memory Self-Organization Architectures Optical, Electronic and Other Novel Implementations Optimization Signal/Image Processing and Understanding Novel Applications INSTRUCTIONS FOR SUBMITTING PAPERS Manuscripts should be 22-26 typewritten, double-spaced pages in length. Do not send submissions that are significantly shorter or longer than this.
Papers must not have been previously presented or published, nor currently submitted for journal publication. Each manuscript will be put through a rigorous refereeing process. Manuscripts should have a title page that includes the title of the paper, full name of its author(s), affiliation(s), complete physical and electronic address(es), telephone number(s) and a 300-word abstract of the paper. DEADLINES Six copies of the manuscript are due by June 10, 1989. Notification of accepted papers by September 1, 1989. Accepted manuscripts, camera-ready, are due by October 3, 1989. SEND SUBMISSIONS AND QUESTIONS TO O. K. Ersoy H. H. Szu Purdue University Naval Research Laboratories School of Electrical Engineering Code 5709 W. Lafayette, IN 47907 4555 Overlook Ave., SE (317) 494-6162 Washington, DC 20375 E-Mail: ersoy at ee.ecn.purdue (202) 767-2407 From lina at wheaties.ai.mit.edu Wed Mar 29 13:23:33 1989 From: lina at wheaties.ai.mit.edu (Lina Massone) Date: Wed, 29 Mar 89 13:23:33 EST Subject: No subject Message-ID: <8903291823.AA09549@gelatinosa.ai.mit.edu> ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* TECHNICAL REPORT AVAILABLE A NEURAL NETWORK MODEL FOR LIMB TRAJECTORY FORMATION Lina Massone and Emilio Bizzi Dept. of Brain and Cognitive Sciences Massachusetts Institute of Technology This paper deals with the problem of representing and generating unconstrained aiming movements of a limb by means of a neural network architecture. The network produced a time trajectory of a limb from a starting posture toward a target specified by a sensory stimulus. Thus the network performed a sensory-motor transformation. The experimenters imposed a bell-shaped velocity profile on the trajectory. This type of profile is characteristic of most movements performed by biological systems. We investigated the generalization capabilities of the network as well as its internal organization.
Experiments performed during learning and on the trained network showed that: (i) the task could be learned by a three-layer sequential network; (ii) the network successfully generalized in trajectory space and adjusted the velocity profiles properly; (iii) the same task could not be learned by a linear network; (iv) after learning, the internal connections became organized into inhibitory and excitatory zones and encoded the main features of the training set; (v) the model was robust to noise on the input signals; (vi) the network exhibited attractor-dynamics properties; (vii) the network was able to solve the motor-equivalence problem. A key feature of this work is the fact that the neural network was coupled to a mechanical model of a limb in which muscles are represented as springs. With this representation the model solved the problem of motor redundancy. A short version of this paper covering only part of the described research was mailed in February to IJCNN. The full report has been submitted to Biological Cybernetics. All requests should be addressed to: lina at wheaties.ai.mit.edu From marchman%amos at ucsd.edu Wed Mar 29 19:20:36 1989 From: marchman%amos at ucsd.edu (Virginia Marchman) Date: Wed, 29 Mar 89 16:20:36 PST Subject: Technical Report Available Message-ID: <8903300020.AA01129@amos.UCSD.EDU> The following Technical Report (#8902) is available from the Center for Research in Language. (Please do not forward.) ******************************************************************* Pattern Association in a Back Propagation Network: Implications for Child Language Acquisition Kim Plunkett Virginia Marchman University of Aarhus, Denmark University of California, San Diego Abstract A 3-layer back propagation network is used to implement a pattern association task which learns mappings that are analogous to the present and past tense forms of English verbs, i.e., arbitrary, identity, vowel change, and suffixation mappings. 
The degree of correspondence between connectionist models of tasks of this type (Rumelhart & McClelland, 1986; 1987) and children's acquisition of inflectional morphology has recently been highlighted in discussions of the general applicability of PDP to the study of human cognition and language (Pinker & Mehler, 1988). In this paper, we attempt to eliminate many of the shortcomings of the R&M work and adopt an empirical, comparative approach to the analysis of learning (i.e., hit rate and error type) in these networks. In all of our simulations, the network is given a constant 'diet' of input stems -- that is, discontinuities are not introduced into the learning set at any point. Four sets of simulations are described in which input conditions (class size and token frequency) and the presence/absence of phonological subregularities are manipulated. First, baseline simulations chart the initial computational constraints of the system and reveal complex "competition effects" when the four verb classes must be learned simultaneously. Next, we explore the nature of these competitions given different type (class sizes) and token frequencies (# of repetitions). Several hypotheses about input to children are tested, from dictionary counts and production corpora. Results suggest that relative class size determines which "default" transformation is employed by the network, as well as the frequency of overgeneralization errors (both "pure" and "blended" overgeneralizations). A third series of simulations manipulates token frequency within a constant class size, searching for the set of token frequencies which results in "adult-like competence" and "child-like" errors across learning. A final series investigates the addition of phonological sub-regularities into the identity and vowel change classes. Phonological cues are clearly exploited by the system, leading to overall improved performance. 
However, overgeneralizations, U-shaped learning and competition effects continue to be observed in similar conditions. These models establish that input configuration plays a role in determining the types of errors produced by the network - including the conditions under which "rule-like" behavior and "U-shaped" development will and will not emerge. The results are discussed with reference to behavioral data on children's acquisition of the past tense and the validity of drawing conclusions about the acquisition of language from models of this sort. ***************************************************************** Please send requests for hard copy to: yvonne at amos.ucsd.edu or Center for Research in Language C-008 University of California, San Diego La Jolla, CA 92093 Attn: Yvonne -- Virginia Marchman (marchman at amos.ucsd.edu) Kim Plunkett (psykimp at dkarh02.bitnet) From sankar at caip.rutgers.edu Fri Mar 31 15:14:12 1989 From: sankar at caip.rutgers.edu (ananth sankar) Date: Fri, 31 Mar 89 15:14:12 EST Subject: KOHONEN MAPS Message-ID: <8903312014.AA03080@caip.rutgers.edu> I had initiated a discussion on Kohonen's maps two weeks ago and, apart from the many replies I (and many others??) received, there were requests that I post the responses. It would be a good idea to go through this material and then discuss again.
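For readers who want to experiment alongside the discussion that follows, here is a minimal sketch of the Kohonen self-organizing feature map algorithm, in Python/NumPy. The 10 x 10 grid and uniform 2-D inputs match the setup mentioned in the thread; the radius and gain schedules are illustrative assumptions of mine, not Kohonen's published settings.

```python
# Minimal sketch of Kohonen's self-organizing feature map for the
# 2-D uniform-input, 10 x 10 grid case discussed in this thread.
# The radius and gain schedules below are illustrative assumptions,
# not Kohonen's published settings.
import numpy as np

rng = np.random.default_rng(0)
GRID = 10
W = rng.random((GRID, GRID, 2))        # one 2-D weight vector per output node

def train(W, steps=2000):
    for t in range(steps):
        frac = t / steps
        radius = 8.0 * (1.0 - frac)            # neighbourhood shrinks toward 0
        gain = 0.5 * (1.0 - frac) + 0.01       # learning rate decays over time
        x = rng.random(2)                      # uniform input in the unit square
        # winner: node whose weight vector is closest in Euclidean distance
        d = np.linalg.norm(W - x, axis=2)
        wi, wj = np.unravel_index(np.argmin(d), d.shape)
        # move every node in the winner's (square) neighbourhood toward x
        for i in range(GRID):
            for j in range(GRID):
                if max(abs(i - wi), abs(j - wj)) <= radius:
                    W[i, j] += gain * (x - W[i, j])
    return W

W = train(W)
```

Note that matching is done with a distance calculation rather than a dot product, which is one of the points raised below; each update is a convex combination of the old weight and the input, so weight vectors stay inside the unit square.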
>From pastor at prc.unisys.com Thu Mar 16 16:58:47 1989 Received: from PRC-GW.PRC.UNISYS.COM by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA03401; Thu, 16 Mar 89 16:58:40 EST Received: from bigburd.PRC.Unisys.COM by burdvax.PRC.Unisys.COM (5.61/Domain/jpb/2.9) id AA11739; Thu, 16 Mar 89 16:58:28 -0500 Received: by bigburd.PRC.Unisys.COM (5.61/Domain/jpb/2.9) id AA24449; Thu, 16 Mar 89 16:58:23 -0500 From: pastor at prc.unisys.com (Jon Pastor) Message-Id: <8903162158.AA24449 at bigburd.PRC.Unisys.COM> Received: from Xerox143 by bigburd.PRC.Unisys.COM with PUP; Thu, 16 Mar 89 16:58 EST To: ananth sankar Date: 16 Mar 89 16:56 EST (Thursday) Subject: Re: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST To: ananth sankar Cc: pastor at bigburd.prc.unisys.com Status: R I am in the process of implementing a Kohonen-style system, and if I actually get it running and obtain any results I'll let you know. If you get any responses, please let me know. Thanks. >From Connectionists-Request at q.cs.cmu.edu Thu Mar 16 16:59:58 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA03426; Thu, 16 Mar 89 16:59:52 EST Received: from cs.cmu.edu by Q.CS.CMU.EDU id aa11454; 16 Mar 89 9:44:34 EST Received: from CAIP.RUTGERS.EDU by CS.CMU.EDU; 16 Mar 89 09:42:55 EST Received: by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA14983; Thu, 16 Mar 89 09:42:44 EST Date: Thu, 16 Mar 89 09:42:44 EST From: ananth sankar Message-Id: <8903161442.AA14983 at caip.rutgers.edu> To: connectionists at cs.cmu.edu Subject: questions on kohonen's maps Status: R I am interested in the subject of Self Organization and have some questions with regard to Kohonen's algorithm for Self Organizing Feature Maps. I have tried to duplicate the results of Kohonen for the two dimensional uniform input case i.e. two inputs. I used a 10 X 10 output grid. The maps that resulted were not as good as reported in the papers. 
Questions:

1. Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach.

2. Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features.

3. In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation.

4. I have not seen as yet in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited.

5. Can the net become disordered after ordering is achieved at any particular iteration?

I would appreciate any comments, suggestions etc. on the above. Also, so that net mail clutter may be reduced, please respond to sankar at caip.rutgers.edu Thank you. Ananth Sankar Department of Electrical Engineering Rutgers University, NJ >From regier at cogsci.berkeley.edu Thu Mar 16 17:07:20 1989 Received: from cogsci.Berkeley.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA03562; Thu, 16 Mar 89 17:07:16 EST Received: by cogsci.berkeley.edu (5.61/1.29) id AA13666; Thu, 16 Mar 89 14:07:18 -0800 Date: Thu, 16 Mar 89 14:07:18 -0800 From: regier at cogsci.berkeley.edu (Terry Regier) Message-Id: <8903162207.AA13666 at cogsci.berkeley.edu> To: sankar at caip.rutgers.edu Subject: Kohonen request Status: R Hi, I'm interested in the responses to your recent Kohonen posting on Connectionists. Do you suppose you could post the results once all the replies are in?
Thanks, -- Terry >From ken at phyb.ucsf.edu Thu Mar 16 20:11:35 1989 Received: from cgl.ucsf.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA09101; Thu, 16 Mar 89 20:11:32 EST Received: from phyb.ucsf.EDU by cgl.ucsf.EDU (5.59/GSC4.15) id AA01036; Thu, 16 Mar 89 17:11:23 PST Received: by phyb (1.2/GSC4.15) id AA11601; Thu, 16 Mar 89 17:11:17 pst Date: Thu, 16 Mar 89 17:11:17 pst From: ken at phyb.ucsf.edu (Ken Miller) Message-Id: <8903170111.AA11601 at phyb> To: sankar at caip.rutgers.edu Subject: kohonen Status: R re your point 3: the algorithm du_{ij}/dt = a(t)[e_j(t) - u_{ij}(t)], i in N_c where u = weights, e is input pattern, N_c is topological neighborhood of the maximally responding cell, should actually be written du_{ij}/dt = a(t)[ e_j(t)/\sum_k(e_k(t)) - u_{ij}(t)/\sum_k(u_{ik}(t)) ], i in N_c. That is, the change should be such as to move the jth synaptic weight on the ith cell, as a PROPORTION of all the synaptic weights on the ith cell, in the direction of matching the PROPORTION of input which was incoming on the jth line. Note that in this case \sum_j du_{ij}(t)/dt = 0, so weights remain normalized in the sense that the sum over each cell remains constant. If you normalize your inputs to sum to 1 (\sum_k(e_k(t)) = 1) and start with weights normalized to sum to 1 on each cell (\sum_k(u_{ik}(t)) = 1 for all i) then weights will remain normalized to sum to 1, hence the two sums in the denominators are both just = 1 and can be left out. Kohonen was I believe assuming these normalizations and hence dispensing with the sums. ken miller (ken at phyb.ucsf.edu) ucsf dept.
of physiology >From tds at wheaties.ai.mit.edu Thu Mar 16 23:26:42 1989 Received: from life.ai.mit.edu by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA12489; Thu, 16 Mar 89 23:26:39 EST Received: from mauriac.ai.mit.edu by life.ai.mit.edu; Thu, 16 Mar 89 22:48:15 EST Received: from localhost by mauriac.ai.mit.edu; Thu, 16 Mar 89 22:48:06 est Date: Thu, 16 Mar 89 22:48:06 est From: tds at wheaties.ai.mit.edu Message-Id: <8903170348.AA19015 at mauriac.ai.mit.edu> To: sankar at caip.rutgers.edu Subject: Kohonen maps Status: R I share some of your confusion about Kohonen maps. My main question is #4: are they really doing anything useful? The mapping demonstrated in Kohonen's 1982 paper (Biol. Cyb.) only shows mappings from a 2D manifold in 3-space onto a two-dimensionally arranged set of units. The book talks about dimensionality issues in more detail, but so far as I can tell what the network does (after training) is to map three numbers into about 100 numbers. Since the mapping is linear, I don't see how anything at all is gained. A failure to generate an ordering may be one way to tell that the data does not lie on a 2D manifold. But there are many other ways to do this that are more efficient! Also, this is not robust if the manifold folds back on itself (so that two distinct points on the surface are in the same direction from the origin).
Let me know if you find out the true significance of this widely-known work, Terry >From lwyse at bucasb.bu.edu Fri Mar 17 17:42:18 1989 Received: from BU-IT.BU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA05821; Fri, 17 Mar 89 17:42:12 EST Received: from COCHLEA.BU.EDU by bu-it.BU.EDU (5.58/4.7) id AA17739; Fri, 17 Mar 89 17:38:02 EST Received: by cochlea.bu.edu (4.0/4.7) id AA02692; Fri, 17 Mar 89 17:38:21 EST Date: Fri, 17 Mar 89 17:38:21 EST From: lwyse at bucasb.bu.edu Message-Id: <8903172238.AA02692 at cochlea.bu.edu> To: sankar at caip.rutgers.edu Subject: re:questions on Kohonen maps Status: R I would be surprised if there were some analytical expression for the neighborhood and gain functions that was useful in practical applications. I have found different "best functions" for different input vector distributions, initial weight distributions, etc. A related question to yours: What does "ordering" mean when mapping across different dimensional spaces? An excerpt from a report on my experiences with Kohonen maps: When the input space and the neighborhood space of the weight vectors are of different dimension, however, what "ordered" means becomes a sticky wicket. For example, in Fig. 5.17, Kohonen shows a one-dimensional neighborhood of weight vectors approximating a triangular distribution of inputs with what he terms a "Peano-like" curve. But this type of curve folds in on itself in an attempt to fill the space, so points that may be far from each other in their one-D neighborhood can be maximally responsive to very close input points. Is this "ordered"? He doesn't seem to address this point directly. A point I would like to bring out is that in these situations where the dimension of the input space and the dimension of the neighborhood differ, whether or not the weight-vector chain crosses itself is {\em not} necessarily the important metric for measuring the ability of the weights to approximate the input space.
That is, there is not necessarily a correlation between neighborhood-chain crossings and the mean squared error of the weight vector approximations of the input points. It is true, however, that if the neighborhood chain crosses itself, then {\em there exists} a better approximation to the input space. -lonce >From risto at cs.ucla.edu Sat Mar 18 02:59:46 1989 Received: from Oahu.CS.UCLA.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA14191; Sat, 18 Mar 89 02:59:35 EST Return-Path: Received: by oahu.cs.ucla.edu (Sendmail 5.59/2.16) id AA02486; Fri, 17 Mar 89 23:14:45 PST Date: Fri, 17 Mar 89 23:14:45 PST From: risto at cs.ucla.edu (Risto Miikkulainen) Message-Id: <8903180714.AA02486 at oahu.cs.ucla.edu> To: sankar at caip.rutgers.edu In-Reply-To: ananth sankar's message of Thu, 16 Mar 89 09:42:44 EST <8903161442.AA14983 at caip.rutgers.edu> Subject: questions on kohonen's maps Reply-To: risto at cs.ucla.edu Organization: UCLA Computer Science Department Physical-Address: 3677 Boelter Hall Status: R Date: Thu, 16 Mar 89 09:42:44 EST From: ananth sankar 1 Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach. The trick is to start with a neighborhood large enough. For 10x10, a radius of 8 units might be appropriate. Then reduce the radius gradually (e.g. over a few thousand inputs) to 1 or even to 0. 3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation. True. The original idea was to form the "activity bubble" with lateral inhibition and change the weights by "redistribution of synaptic resources".
This neurologically plausible algorithm gave way to an abstraction which uses distance, global selection and difference. (I did some work comparing these two algorithms; I can send you the tech report if you want to look at it. At least it has the parameters that work.) 5 Can the net become disordered after ordering is achieved at any particular iteration? Kohonen proved (in ch 5) that this cannot happen (in the 1-d case) for the abstract algorithm. This is a big problem for the biologically plausible algorithm though. >From djb at flash.bellcore.com Sat Mar 18 23:38:41 1989 Received: from flash.bellcore.com by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA27190; Sat, 18 Mar 89 23:38:32 EST Received: by flash.bellcore.com (5.58/1.1) id AA06742; Sat, 18 Mar 89 23:38:10 EST Date: Sat, 18 Mar 89 23:38:10 EST From: djb at flash.bellcore.com (David J Burr) Message-Id: <8903190438.AA06742 at flash.bellcore.com> To: sankar at caip.rutgers.edu Subject: Feature Map Learning Status: R Your questions regarding the feature map algorithm are ones that have also concerned me. I have been experimenting with a form of this elastic mapping algorithm since about 1979. My early experiments were focussed on using such an adaptive process to map handwritten characters onto reference characters in an attempt to automate a form of elastic template matching. The algorithm I came up with was one which used nearest neighbor "attractors" to "pull" an elastic map into shape by an iterative process. I defined a window or smoothing kernel which had a Gaussian shape as opposed to the box shape commonly used in self-organized mapping. My algorithm resembled the Kohonen feature map classifier that you referred to in your email. The Gaussian kernel has advantages over the box kernel in that aliasing distortion can be reduced. This is similar to the use of Hamming windows in the design of fast Fourier transforms.
With regard to your first and second questions, we have found that the actual window size and gain parameters can take on a number of different schedule shapes and give similar results. It is important that window size decrease very gradually to avoid too early a commitment to a particular vector. This is particularly important in the mapping of highly distorted characters, where a rapid schedule could cause a feature in one character to map to the "wrong" feature in the reference character. Gaussian windows were the choice for that problem, since they guaranteed very smooth maps. You are right that a parameter schedule that works for one problem may be poorly suited to a different problem. We have recently applied the feature map model to the traveling salesman problem and reported some of our results at ICNN-88. A one-dimensional version of the elastic map (a rubber band) seems best suited to this problem. We found that there was a particular analytic form of the gain schedule which worked well for this problem. Window size, on the other hand, seemed to benefit most from a feedback schedule in which the degree of progress toward the solution served as input to set an appropriate window size. I have results studying some 700 different learning trials on 30-100 city problems using this method. Performance is considerably better than the Hopfield-Tank solution. Yes, it seems as though one needs distance calculation as the input for this model, rather than dot product as used in back-propagation nets. I would be happy to mail you some papers describing my implementation of the feature map learning model. The first article appeared in Computer Graphics and Image Processing Journal, 1981, entitled "A Dynamic Model for Image Registration". The recent work on traveling salesman was also reported at last year's Snowbird meeting in addition to ICNN-88. Please feel free to correspond with me as I consider this a very interesting topic. Best Wishes, D. J.
Burr djb at bellcore.com >From @relay.cs.net:tony at ifi.unizh.ch Mon Mar 20 03:12:51 1989 Received: from RELAY.CS.NET by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA02795; Mon, 20 Mar 89 03:12:46 EST Received: from relay2.cs.net by RELAY.CS.NET id ab08738; 20 Mar 89 4:55 EST Received: from switzerland by RELAY.CS.NET id ae29120; 20 Mar 89 4:48 EST Received: from ean by scsult.SWITZERLAND.CSNET id a011717; 20 Mar 89 9:45 WET Date: 19 Mar 89 21:45 +0100 From: tony bell To: sankar at caip.rutgers.edu Mmdf-Warning: Parse error in original version of preceding line at SWITZERLAND.CSNET Message-Id: <342:tony at ifi.unizh.ch> Subject: Top Maps Status: R You should see Ritter & Schulten's paper in IEEE ICNN proceedings 1988 (San Diego) for expressions answering question 1. Another paper from Helge Ritter deals with the convergence properties. This was submitted to Biol. Cybernetics but maybe you should write to him at the University of Illinois where he is now. Tony Bell, Univ of Zurich >From djb at flash.bellcore.com Mon Mar 20 17:51:22 1989 Received: from flash.bellcore.com by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA18086; Mon, 20 Mar 89 17:51:14 EST Received: by flash.bellcore.com (5.58/1.1) id AA25760; Mon, 20 Mar 89 17:51:18 EST Date: Mon, 20 Mar 89 17:51:18 EST From: djb at flash.bellcore.com (David J Burr) Message-Id: <8903202251.AA25760 at flash.bellcore.com> To: sankar at caip.rutgers.edu Subject: Self-Organized Mapping Status: R There has been interest on the net recently in some of the questions that you posed in your recent mail. I have personally received comments regarding the neighborhood functions and whether there is an appropriate analytic form. My comments were summarized in my recent mailing to you. If you get additional responses, I would certainly appreciate hearing about peoples' experiences. Would you consider posting a summary to the net? I did not comment on your questions 4 and 5.
It seems that the neighbors-matching-to-neighbors observation comes about as a result rather than an input constraint. In my 1981 paper on elastic matching of images I used a more extended pattern matcher (area template instead of a point-to-point nearest neighbor) for gray scale images. This tended to enforce the constraint that you observed at the input level. Unfortunately, I am not sure what its generalization would be for non-image patterns (N-D instead of 2-D). I have done all my experiments on elastic mapping of fixed patterns as opposed to point distributions. There was no problem of a map being undone after it converged. Have you had such problems with your speech data? I have been told that when the distributions are stochastic or sampled, there is an even stronger need to proceed slowly. Apparently one sampled point can pull the map in one direction, and this must be counterbalanced by opposing samples pulling the other way to maintain stability of the map. This unfortunately takes lots of computer cycles. Hoping to hear from you. Dave Burr >From Connectionists-Request at q.cs.cmu.edu Mon Mar 20 18:01:41 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA18228; Mon, 20 Mar 89 18:01:34 EST Received: from cs.cmu.edu by Q.CS.CMU.EDU id aa23263; 20 Mar 89 14:41:25 EST Received: from XEROX.COM by CS.CMU.EDU; 20 Mar 89 14:39:19 EST Received: from Semillon.ms by ArpaGateway.ms ; 20 MAR 89 11:26:12 PST Date: 20 Mar 89 11:25 PST From: chrisley.pa at xerox.com Subject: Re: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST To: ananth sankar Cc: connectionists at cs.cmu.edu, chrisley.pa at xerox.com Message-Id: <890320-112612-6136 at Xerox> Status: R Ananth Sankar recently asked some questions about Kohonen's feature maps.
As I have worked on these issues with Kohonen, I feel like I might be able to give some answers, but standard disclaimers apply: I cannot be certain that Kohonen would agree with all of the following. Also, I do not have my copy of his book with me, so I cannot be more specific about references. Questions: 1 Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach. Although there is probably more than one correct, task-independent gain or neighborhood function, Kohonen does mention constraints that all of them should meet. For example, both functions should decrease to zero over time. I do not know of any tweaking; Kohonen usually determines a number of iterations and then decreases the gain linearly. If you call this tweaking, then your idea of domain-independent parameters might be a sort of holy grail, since it does not seem likely that we are going to find a completely parameter-free learning algorithm that will work in every domain. 2 Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features. As far as I know, Kohonen has used the same type of gain and neighborhood functions for all of his map demonstrations. These demonstrations, which have been shown via an animated film at several major conferences, demonstrate maps learning the distribution in cases where 1) the dimensionality of the network topology and the input space mismatch, e.g., where the network is 2d and the distribution is a 3d 'cactus'; 2) the distribution is not uniform. The algorithm was developed with these 2 cases in mind, so it is no surprise that the results are good for them as well.
3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation. That's right. And Kohonen usually uses the Euclidean distance metric, although other ones can be used (which he discusses in the book). Furthermore, there have been independent efforts to normalize weights in Kohonen maps so that the dot product measure can be used. If you have any doubts about the suitability of the Euclidean metric, as your question seems to imply, express them. It is an interesting issue. 4 I have not seen as yet in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited. The primary interest in maps, I believe, came from a desire to display high-dimensional information in low-dimensional spaces, which are more easily apprehended. But there is evidence that there are other uses as well: 1) Kohonen has published results on using maps for phoneme recognition, where the topology-preservation plays a significant role (such maps are used in the Otaniemi Phonetic Typewriter featured in, I think, Computer magazine a year or two ago); 2) work has been done on using the topology to store sequential information, which seems to be a good idea if you are dealing with natural signals that can only temporally shift from a state to similar states; 3) several people have followed Kohonen's suggestion of using maps for adaptive kinematic representations for robot control (the work on Murphy, mentioned on this net a month or so ago, and the work being done at Carleton University by Darryl Graf are two good examples). In short, just look at some ICNN or INNS proceedings, and you'll find many papers where researchers found Kohonen maps to be a good place from which to launch their own studies.
5 Can the net become disordered after ordering is achieved at any particular iteration? Of course, this is theoretically possible, and is almost certain if at some point the distribution of the mapped function changes. But this brings up a difficult question: what is the proper ordering in such a case? Should a net try to integrate both past and present distributions, or should it throw away the past and concentrate on the present? I think most nn researchers would want a little of both, with maybe some kind of exponential decay in the weights. But in many applications of maps, there is no chance of the distribution changing: it is fixed, and iterations are over the same test data each time. In this case, I would guess that the ordering could not become disrupted (at least for simple distributions and a net of adequate size), but I realise that there is no proof of this, and the terms 'simple' and 'adequate' are lacking definition. But that's life in nnets for you! If anyone has any more questions, feel free. Ron Chrisley Xerox PARC System Sciences Lab 3333 Coyote Hill Road Palo Alto, CA 94304 USA chrisley.pa at xerox.com tel: (415) 494-4728 OR New College Oxford OX1 3BN UK chrisley at vax.oxford.ac.uk tel: (865) 279-492 >From chrisley.pa at xerox.com Thu Mar 23 15:00:13 1989 Received: from Xerox.COM by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA22224; Thu, 23 Mar 89 15:00:04 EST Received: from Semillon.ms by ArpaGateway.ms ; 23 MAR 89 11:35:27 PST Date: 23 Mar 89 11:35 PST From: chrisley.pa at xerox.com Subject: Re: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST To: ananth sankar Cc: connectionists at cs.cmu.edu Message-Id: <890323-113527-4949 at Xerox> Status: R One further note about Ananth Sankar's questions about Kohonen maps: A friend of mine, Tony Bell, tells me (and Ananth) that Helge Ritter has a "neat set of expressions for the learning rate and neighbourhood size parameters...
and he also proves something about convergence elsewhere." Unfortunately, I do not as yet have a reference for the papers, but I have liked Ritter's work in the past, so I thought people on the net might be interested. >From Connectionists-Request at q.cs.cmu.edu Fri Mar 24 11:52:18 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA20326; Fri, 24 Mar 89 11:52:13 EST Received: from ri.cmu.edu by Q.CS.CMU.EDU id aa17597; 24 Mar 89 8:48:01 EST Received: from BU-IT.BU.EDU by RI.CMU.EDU; 24 Mar 89 08:41:54 EST Received: from COCHLEA.BU.EDU by bu-it.BU.EDU (5.58/4.7) id AA06449; Tue, 21 Mar 89 13:58:32 EST Received: by cochlea.bu.edu (4.0/4.7) id AA04927; Tue, 21 Mar 89 13:59:02 EST Date: Tue, 21 Mar 89 13:59:02 EST From: lwyse at bucasb.bu.edu Message-Id: <8903211859.AA04927 at cochlea.bu.edu> To: connectionists at ri.cmu.edu In-Reply-To: connectionists at c.cs.cmu.edu's message of 20 Mar 89 23:47:09 GMT Subject: Re: questions on kohonen's maps Organization: Center for Adaptive Systems, B.U. Status: R What does "ordering" mean when you're projecting inputs to a lower dimensional space? For example, with the "Peano" type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain. In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes.
-lonce >From @relay.cs.net:tony at ifi.unizh.ch Fri Mar 24 13:30:26 1989 Received: from RELAY.CS.NET by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA23163; Fri, 24 Mar 89 13:30:12 EST Received: from relay2.cs.net by RELAY.CS.NET id ab09426; 24 Mar 89 12:01 EST Received: from switzerland by RELAY.CS.NET id aa01417; 24 Mar 89 11:55 EST Received: from ean by scsult.SWITZERLAND.CSNET id a011335; 24 Mar 89 17:53 WET Date: 24 Mar 89 17:51 +0100 From: tony bell To: sankar at caip.rutgers.edu Mmdf-Warning: Parse error in original version of preceding line at SWITZERLAND.CSNET Message-Id: <352:tony at ifi.unizh.ch> Status: R In case anyone else asks (or Ron sends any more vague messages to the net), here are all the refs I have on Helge Ritter's work on topological maps: [1] "Kohonen's Self-Organizing Maps: exploring their computational capabilities" in Proc. IEEE ICNN 1988, San Diego. [2] "Convergence Properties of Kohonen's Topology Conserving Maps: fluctuations, stability and dimension selection" submitted to Biol. Cybernetics. [3] "Extending Kohonen's self-organising mapping algorithm to learn Ballistic Movements" in the book "Neural Computers", Eckmiller & von der Malsburg (eds). [4] "Topology conserving mappings for learning motor tasks" in the book "Neural Networks for Computing", Denker (ed), AIP Conf. proceedings, Snowbird, 1986. The second one in particular uses some heavy statistical techniques (the inputs are seen as a Markov process and a Fokker-Planck equation describes the learning) in order to prove that the map will reach equilibrium when the learning rate is time dependent (i.e. it decays). Ritter's PhD thesis covers all his work, but it's in German. Now, Ritter is at the University of Illinois. I hope this helps you and I don't mind if you post this to the net if you think people are interested enough. yours, Tony Bell.
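The conservation property behind Ken Miller's proportional form of the update rule, quoted earlier in this thread, is easy to check numerically. The following Python/NumPy fragment is my own illustration with made-up sizes and values; it is not code from any of the papers cited above.

```python
# Numerical check of the proportional (normalized) update rule discussed
# in this thread:
#     du_ij/dt = a * ( e_j / sum_k e_k  -  u_ij / sum_k u_ik ),  i in N_c
# Summing over j makes the bracket vanish, so the total weight on each
# cell is conserved. Sizes and values here are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(1)
U = rng.random((5, 8))                    # 5 cells, 8 input lines
U /= U.sum(axis=1, keepdims=True)         # start with per-cell sums of 1
e = rng.random(8)                         # an arbitrary input pattern
a = 0.1                                   # gain

dU = a * (e / e.sum() - U / U.sum(axis=1, keepdims=True))
U_new = U + dU

conserved = np.allclose(dU.sum(axis=1), 0.0)       # per-cell change sums to 0
still_normalized = np.allclose(U_new.sum(axis=1), 1.0)
```

Both checks come out true for any input pattern, which is exactly the point: with inputs and weights normalized to sum to 1, the denominators equal 1 and the simpler textbook form of the rule preserves normalization automatically.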
>From Connectionists-Request at q.cs.cmu.edu Fri Mar 24 22:07:14 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA23834; Fri, 24 Mar 89 22:07:06 EST Received: from ri.cmu.edu by Q.CS.CMU.EDU id aa22170; 24 Mar 89 13:28:20 EST Received: from MAUI.CS.UCLA.EDU by RI.CMU.EDU; 24 Mar 89 13:26:10 EST Return-Path: Received: by maui.cs.ucla.edu (Sendmail 5.59/2.16) id AA25252; Fri, 24 Mar 89 10:25:07 PST Date: Fri, 24 Mar 89 10:25:07 PST From: Geunbae Lee Message-Id: <8903241825.AA25252 at maui.cs.ucla.edu> To: lwyse at bucasb.bu.edu Subject: Re: questions on kohonen's map Cc: connectionists at ri.cmu.edu Status: R >What does "ordering" mean when you're projecting inputs to a lower dimensional >space? It means topological ordering. >For example, with the "Peano" type curves that result from a one-D >neighborhood learning a 2-D input distribution, it is obviously NOT >true that nearby points in the input space maximally activate nearby >points on the neighborhood chain. It depends on what you mean by "nearby". If it is nearby in a relative sense (in topological relation), not an absolute sense, then the nearby points in the input space DO maximally activate nearby points on the neighborhood chain.
--Geunbae Lee AI Lab, UCLA >From Connectionists-Request at q.cs.cmu.edu Sat Mar 25 02:26:12 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA26264; Sat, 25 Mar 89 02:26:06 EST Received: from ri.cmu.edu by Q.CS.CMU.EDU id aa25584; 24 Mar 89 17:55:35 EST Received: from XEROX.COM by RI.CMU.EDU; 24 Mar 89 17:53:44 EST Received: from Semillon.ms by ArpaGateway.ms ; 24 MAR 89 14:53:32 PST Date: 24 Mar 89 14:53 PST From: chrisley.pa at xerox.com Subject: Re: questions on kohonen's maps In-Reply-To: lwyse at bucasb.BU.EDU's message of Tue, 21 Mar 89 13:59:02 EST To: lwyse at bucasb.bu.edu Cc: connectionists at ri.cmu.edu Message-Id: <890324-145332-8519 at Xerox> Status: R Lonce (lwyse at bucasb.BU.EDU) writes: "What does "ordering" mean when you're projecting inputs to a lower dimensional space? For example, the "Peano" type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain." It is not true that nearby points in input space are always mapped to nearby points in the output space when the mapping is dimensionality-reducing, agreed. But 'ordering' still makes sense. The map is topology-preserving if the dependency is in the other direction, i.e., if nearby points in output space are always activated by nearby points in input space. Lonce goes on to say: "In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes." I agree that topology preservation is not necessarily of utmost importance, but it may be useful in some applications, such as the ones I mentioned a few messages back (phoneme recognition, inverse kinematics, etc.).
Also, there is 1) the interest in the properties of self-organizing systems in themselves, even when an application cannot immediately be found; and 2) the observation that for some reason the brain seems to use topology-preserving maps (with the one-way dependency I mentioned above), which, although they *could* be computationally unnecessary or even disadvantageous, are probably in fact, nature being what she is, good solutions to tough real-time problems.

Ron Chrisley

After April 14th, please send personal email to Chrisley at vax.ox.ac.uk

From Connectionists-Request at q.cs.cmu.edu Sun Mar 26 03:40:59 1989
Date: Sat, 25 Mar 89 22:17:59 pst
From: Ken Miller
Message-Id: <8903260617.AA08352 at phyb>
To: Connectionists at cs.cmu.edu
Subject: Normalization of weights in Kohonen algorithm

Re point 3 of a recent posting about the Kohonen algorithm:

"3. In Kohonen's book "Self-Organization and Associative Memory", Ch. 5, the algorithm for weight adaptation does not produce normalized weights."

The algorithm

  du_{ij}/dt = a(t) [ e_j(t) - u_{ij}(t) ],   i in N_c

where u are the weights, e is the input pattern, and N_c is the topological neighborhood of the maximally responding cell, should I believe be written

  du_{ij}/dt = a(t) [ e_j(t) / \sum_k e_k(t)  -  u_{ij}(t) / \sum_k u_{ik}(t) ],   i in N_c.

That is, the change should move the jth synaptic weight on the ith cell, as a PROPORTION of all the synaptic weights on the ith cell, in the direction of matching the PROPORTION of the input that was incoming on the jth line.
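The conservation property of this proportional rule is easy to check numerically. A minimal sketch, with arbitrary illustrative values for a(t), e, and u on a single cell i in N_c:

```python
# Numerical check of the proportional update rule above
# (a, e, and u are arbitrary illustrative values, not from any real run).
a = 0.1                      # learning rate a(t)
e = [0.2, 0.5, 0.3]          # input pattern e_j(t)
u = [0.7, 0.1, 0.2]          # weights u_{ij}(t) on one cell i

se = sum(e)                  # \sum_k e_k(t)
su = sum(u)                  # \sum_k u_{ik}(t)
du = [a * (e[j] / se - u[j] / su) for j in range(3)]

print(sum(du))               # the increments sum to (essentially) zero ...
u_new = [u[j] + du[j] for j in range(3)]
print(sum(u_new))            # ... so the per-cell weight sum is unchanged
```

Since the bracketed term is a difference of two quantities that each sum to 1 over j, the increments cancel in the sum regardless of the particular e and u chosen.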
Note that in this case \sum_j du_{ij}(t)/dt = 0, so the weights remain normalized in the sense that the sum over each cell remains constant. If the inputs are normalized to sum to 1 (\sum_k e_k(t) = 1), the first denominator can be omitted. If the weights begin normalized to sum to 1 on each cell (\sum_k u_{ik}(t) = 1 for all i), then they will remain normalized to sum to 1, and hence the second denominator can be omitted as well. Perhaps Kohonen was assuming these normalizations and hence dispensing with the denominators?

ken miller (ken at phyb.ucsf.edu)

From mcvax!fib.upc.es!millan at uunet.UU.NET Fri Mar 31 04:09:00 1989
From: mcvax!fib.upc.es!millan at uunet.UU.NET (Jose del R. MILLAN)
Date: 31 Mar 89 17:09 +0800
Subject: TR available
Message-ID: <92*millan@fib.upc.es>

The following Tech. Report is available. Requests should be sent to MILLAN at FIB.UPC.ES

________________________________________________________________________

Learning by Back-Propagation: a Systolic Algorithm and its Transputer Implementation

Technical Report LSI-89-15

Jose del R. MILLAN
Dept. de Llenguatges i Sistemes Informatics
Universitat Politecnica de Catalunya

Pau BOFILL
Dept. d'Arquitectura de Computadors
Universitat Politecnica de Catalunya

ABSTRACT

In this paper we present a systolic algorithm for back-propagation, a supervised, iterative, gradient-descent connectionist learning rule. The algorithm works on feedforward networks in which connections may skip layers, and it fully exploits the spatial and training parallelisms inherent to back-propagation. Spatial parallelism arises during the propagation of activity ---forward--- and of error ---backward--- for a particular input-output pair; when this computation is carried out simultaneously for all input-output pairs, training parallelism is obtained. In the spatial dimension, a single systolic ring carries out sequentially the three main steps of the learning rule ---forward, backward, and weight-increment update.
Furthermore, the same pattern of matrix delivery is used in both the forward and the backward passes; in this manner, the algorithm preserves the similarity of the forward and backward passes in the original model. The resulting systolic algorithm is dual with respect to the pattern of matrix delivery ---either columns or rows. Finally, an implementation of the systolic algorithm for the spatial dimension is derived that uses a linear ring of Transputer processors.
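For readers who want the three steps named in the abstract in concrete form, here is a sequential sketch of one back-propagation cycle ---forward pass, backward pass, weight-increment update--- for a tiny fully connected net. The network sizes, training data, and learning rate are illustrative choices of mine; this shows only the computation, not the systolic ring or its Transputer implementation.

```python
import math
import random

random.seed(1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# 2 inputs -> 2 hidden units -> 1 output, with biases (sizes are illustrative)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b1 = [0.0, 0.0]
W2 = [random.uniform(-1, 1) for _ in range(2)]
b2 = 0.0

data = [([0.0, 0.0], 0.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

def forward(x):
    h = [sigmoid(sum(W1[i][j] * x[j] for j in range(2)) + b1[i]) for i in range(2)]
    y = sigmoid(sum(W2[i] * h[i] for i in range(2)) + b2)
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

before = loss()
lr = 0.5
for _ in range(500):
    for x, t in data:
        h, y = forward(x)                   # step 1: forward, propagate activity
        dy = (y - t) * y * (1 - y)          # step 2: backward, propagate error
        dh = [dy * W2[i] * h[i] * (1 - h[i]) for i in range(2)]
        b2 -= lr * dy                       # step 3: weight-increment update
        for i in range(2):
            W2[i] -= lr * dy * h[i]
            b1[i] -= lr * dh[i]
            for j in range(2):
                W1[i][j] -= lr * dh[i] * x[j]
after = loss()
print(before, after)   # squared error drops after training
```

In the systolic formulation, step 1 and step 2 traverse the same weight matrices in opposite directions, which is what makes a single ring and a single pattern of matrix delivery sufficient for both passes.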