From joho%sw.MCC.COM at MCC.COM Thu Mar 2 13:18:40 1989 From: joho%sw.MCC.COM at MCC.COM (Josiah Hoskins) Date: Thu, 2 Mar 89 12:18:40 CST Subject: Tech Report Announcement Message-ID: <8903021818.AA22902@jelly.sw.mcc.com> The following tech report is available. Speeding Up Artificial Neural Networks in the "Real" World Josiah C. Hoskins A new heuristic, called focused-attention backpropagation (FAB) learning, is introduced. FAB enhances the backpropagation procedure by focusing attention on the exemplar patterns that are most difficult to learn. Results are reported using FAB learning to train multilayer feed-forward artificial neural networks to represent real-valued elementary functions. Learning using FAB is observed to be 1.5 to 10 times faster than with standard backpropagation. Requests for copies should refer to MCC Technical Report Number STP-049-89 and should be sent to Kintner at mcc.com or to Josiah C. Hoskins MCC - Software Technology Program AT&T: (512) 338-3684 9390 Research Blvd, Kaleido II Bldg. UUCP/USENET: milano!joho Austin, Texas 78759 ARPA/INTERNET: joho at mcc.com From cfields at NMSU.Edu Fri Mar 3 17:16:53 1989 From: cfields at NMSU.Edu (cfields@NMSU.Edu) Date: Fri, 3 Mar 89 15:16:53 MST Subject: No subject Message-ID: <8903032216.AA17939@NMSU.Edu> _________________________________________________________________________ The following are abstracts of papers appearing in the inaugural issue of the Journal of Experimental and Theoretical Artificial Intelligence. JETAI 1, 1 was published 1 January, 1989.
For submission information, please contact either of the editors: Eric Dietrich Chris Fields PACSS - Department of Philosophy Box 30001/3CRL SUNY Binghamton New Mexico State University Binghamton, NY 13901 Las Cruces, NM 88003-0001 dietrich at bingvaxu.cc.binghamton.edu cfields at nmsu.edu JETAI is published by Taylor & Francis, Ltd., London, New York, Philadelphia _________________________________________________________________________ Minds, machines and Searle Stevan Harnad Behavioral & Brain Sciences, 20 Nassau Street, Princeton NJ 08542, USA Searle's celebrated Chinese Room Argument has shaken the foundations of Artificial Intelligence. Many refutations have been attempted, but none seem convincing. This paper is an attempt to sort out explicitly the assumptions and the logical, methodological and empirical points of disagreement. Searle is shown to have underestimated some features of computer modeling, but the heart of the issue turns out to be an empirical question about the scope and limits of the purely symbolic (computational) model of the mind. Nonsymbolic modeling turns out to be immune to the Chinese Room Argument. The issues discussed include the Total Turing Test, modularity, neural modeling, robotics, causality and the symbol-grounding problem. _________________________________________________________________________ Explanation-based learning: its role in problem solving Brent J. Krawchuck and Ian H. Witten Knowledge Sciences Laboratory, Department of Computer Science, University of Calgary, 2500 University Drive, NW, Calgary, Alta, Canada, T2N 1N4. `Explanation-based' learning is a semantically-driven, knowledge-intensive paradigm for machine learning which contrasts sharply with syntactic or `similarity-based' approaches. This paper redevelops the foundations of EBL from the perspective of problem-solving. 
Viewed in this light, the technique is revealed as a simple modification to an inference engine which gives it the ability to generalize the conditions under which the solution to a particular problem holds. We show how to embed generalization invisibly within the problem solver, so that it is accomplished as inference proceeds rather than as a separate step. The approach is also extended to the more complex domain of planning to illustrate that it is applicable to a variety of logic-based problem-solvers and is by no means restricted to only simple ones. We argue against the current trend to isolate learning from other activity and study it separately, preferring instead to integrate it into the very heart of problem solving. ---------------------------------------------------------------------------- The recognition and classification of concepts in understanding scientific texts Fernando Gomez and Carlos Segami Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA. In understanding a novel scientific text, we may distinguish the following processes. First, concepts are built from the logical form of the sentence into the final knowledge structures. This is called concept formation. While these concepts are being formed, they are also being recognized by checking whether they are already in long-term memory (LTM). Then, those concepts which are unrecognized are integrated in LTM. In this paper, algorithms for the recognition and integration of concepts in understanding scientific texts are presented. It is shown that the integration of concepts in scientific texts is essentially a classification task, which determines how and where to integrate them in LTM. In some cases, the integration of concepts results in a reclassification of some of the concepts already stored in LTM. All the algorithms described here have been implemented and are part of SNOWY, a program which reads short scientific paragraphs and answers questions.
--------------------------------------------------------------------------- Exploring the No-Function-In-Structure principle Anne Keuneke and Dean Allemang Laboratory for Artificial Intelligence Research, Department of Computer and Information Science, The Ohio State University, 2036 Neil Avenue Mall, Columbus, OH 43210-1277, USA. Although much of past work in AI has focused on compiled knowledge systems, recent research shows renewed interest and advanced efforts both in model-based reasoning and in the integration of this deep knowledge with compiled problem solving structures. Device-based reasoning can only be as good as the model used; if the needed knowledge, correct detail, or proper theoretical background is not accessible, performance deteriorates. Much of the work on model-based reasoning references the `no-function-in-structure' principle, which was introduced by de Kleer and Brown. Although de Kleer and Brown were well motivated in establishing the guideline, this paper explores the applicability and workability of the concept as a universal principle for model representation. This paper first describes the principle, its intent and the concerns it addresses. It then questions the feasibility and the practicality of the principle as a universal guideline for model representation. ___________________________________________________________________________ From jbower at bek-mc.caltech.edu Sun Mar 5 21:09:10 1989 From: jbower at bek-mc.caltech.edu (Jim Bower) Date: Sun, 5 Mar 89 18:09:10 pst Subject: Summer course in computational neurobiology Message-ID: <8903060209.AA03962@bek-mc.caltech.edu> Course announcement: Methods in Computational Neuroscience The Marine Biological Laboratory Woods Hole, Massachusetts August 6 - September 2, 1989 General Description The Marine Biological Laboratory (MBL) in Woods Hole, Massachusetts is a world-famous marine biological laboratory that has been in existence for over 100 years.
In addition to providing research facilities for a large number of biologists during the summer, the MBL also sponsors a number of outstanding courses on different topics in Biology. This summer will be the second year in which the MBL has offered a course in "Methods in Computational Neuroscience". This course is designed as a survey of the use of computer modeling techniques in studying the information processing capabilities of the nervous system and covers models at all levels from biologically realistic single cells and networks of cells to biologically relevant abstract models. The principal aim of the course is to provide participants with the tools to simulate the functional properties of those neural systems of interest to them as well as to understand the general advantages and pitfalls of this experimental approach. The Specific Structure of the Course The course itself includes both a lecture series and a computer laboratory. The lectures are given by invited faculty whose work represents the state of the art in computational neuroscience (see list below). The course lecture notes have been incorporated into a book published by MIT Press ("Methods in Neuronal Modeling: From Synapses to Networks", C. Koch and I. Segev, editors. MIT Press, Cambridge, MA, 1989). The computer laboratory is designed to give students hands-on experience with the simulation techniques considered in the lectures. It also provides students with the opportunity to actually begin simulations of neural systems of interest to them. The students are guided in this effort by the visiting lecturers and course directors, but also by several students from the Computational Neural Systems (CNS) graduate program at Caltech who serve as Laboratory TAs. The lab itself consists of state-of-the-art graphics workstations running a GEneral NEtwork SImulation System (GENESIS) that Dr. Bower and his colleagues at Caltech have constructed over the last several years.
Students return to their home institutions with the GENESIS system to continue their work. The Students The course is designed for advanced graduate students and postdoctoral fellows in biology, computer science, electrical engineering, physics, or psychology with an interest in computational neuroscience. Because of the heavy computer orientation of the Lab section, a good computer background is required (UNIX, C or PASCAL). In addition, students are expected to have a basic background in neurobiology. Course enrollment is limited to 20 so as to assure the highest quality educational experience. Course Directors James M. Bower and Christof Koch Computation and Neural Systems Program California Institute of Technology The Faculty Paul Adams (Stony Brook) Dan Alkon (NIH) Richard Anderson (MIT) John Hildebrand (Arizona) John Hopfield (Caltech) Rodolfo Llinas (NYU) David Rumelhart (Stanford) Idan Segev (Jerusalem) Terrence Sejnowski (Salk/UCSD) David Van Essen (Caltech) Christoph von der Malsburg (USC) For further information and application materials contact: Admissions Coordinator Marine Biological Laboratory Woods Hole, MA 02543 (508) 548-3705 extension 216 Application Deadline May 15, 1989 Acceptance notification in early June. From mjolsness-eric at YALE.ARPA Tue Mar 7 21:23:16 1989 From: mjolsness-eric at YALE.ARPA (Eric Mjolsness) Date: Tue, 7 Mar 89 21:23:16 EST Subject: "Transformations" tech report Message-ID: <8903080223.AA17992@NEBULA.SUN3.CS.YALE.EDU> A new technical report is available: "Algebraic Transformations of Objective Functions" (YALEU/DCS/RR-686) by Eric Mjolsness and Charles Garrett Yale Department of Computer Science P.O. Box 2158 Yale Station New Haven CT 06520 Abstract: A standard neural network design trick reduces the number of connections in the winner-take-all (WTA) network from O(N^2) to O(N). We explain the trick as a general fixpoint-preserving transformation applied to the particular objective function associated with the WTA network.
The key idea is to introduce new interneurons which act to maximize the objective, so that the network seeks a saddle point rather than a minimum. A number of fixpoint-preserving transformations are derived, allowing the simplification of such algebraic forms as products of expressions, functions of one or two expressions, and sparse matrix products. The transformations may be applied to reduce or simplify the implementation of a great many structured neural networks, as we demonstrate for inexact graph-matching, convolutions and coordinate transformations, and sorting. Simulations show that fixpoint-preserving transformations may be applied repeatedly and elaborately, and the example networks still robustly converge. We discuss implications for circuit design. To request a copy, please send your physical address by e-mail to mjolsness-eric at cs.yale.edu OR mjolsness-eric at yale.arpa (old style) Thank you. ------- From prlb2!vub.vub.ac.be!prog1!wplybaer at uunet.UU.NET Tue Mar 7 19:34:21 1989 From: prlb2!vub.vub.ac.be!prog1!wplybaer at uunet.UU.NET (Wim P. Lybaert) Date: Wed, 8 Mar 89 01:34:21 +0100 Subject: No subject Message-ID: <8903080034.AA10074@prog1.vub.ac.be> Hi, i would like to be placed on the connectionist neural nets mailing list that you distribute. Thanks, Wim Lybaert Brussels Free University Department PROG Oefenplein 2 1040 BRUSSELS BELGIUM email: From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Wed Mar 8 11:36:31 1989 From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias) Date: Wed, 08 Mar 89 11:36:31 EST Subject: information function vs. squared error Message-ID: i am looking for pointers to papers discussing the use of an alternative criterion to squared error, in back propagation algorithms. the alternative function i have in mind is called (in different contexts and/or authors) cross entropy, entropy, information, inf. divergence and so on. 
it is defined something like: G = sum_{i=1}^{N} p_i * log(p_i) i am not quite sure what the index i runs through: units, weights or something else. i know people have been talking about this a lot, i just cannot remember where i read about it ... it seems like Geoff Hinton's group has worked on this. thanks, Thanasis From mdp%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Thu Mar 9 08:16:07 1989 From: mdp%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (Mark Plumbley) Date: Thu, 9 Mar 89 13:16:07 GMT Subject: information function vs. squared error Message-ID: <14398.8903091316@dsl.eng.cam.ac.uk> Thanasis, The "G" function you mentioned, based on an Entropy method, is probably the one developed by Pearlmutter and Hinton as a procedure for unsupervised learning of binary units [1]. More recently, Linsker [2,3] and Plumbley and Fallside [4] considered the principle of maximum information transmission (or minimum information loss) for continuous units, relating this to Principal Component methods for linear units. Unfortunately, these are mainly about unsupervised learning, rather than Backprop specifically, although in [4] we do look at the way the mean-squared error criterion places an *upper-bound* on the information loss through a supervised network. This bound will be tightest when the errors on all the output units are independent and have the same variance (or the same entropy for non-additive-Gaussian errors). *If* you can choose the target representation used by Backprop so that the errors are likely to have these properties, it should perform closer to the (information-theoretic) optimal. Hope this is some help, Mark. References: [1] B. A. Pearlmutter and G. E. Hinton: "G-Maximization: An Unsupervised Learning Procedure for Discovering Regularities". In Proceedings of the Conference on `Neural Networks for Computing'. American Institute of Physics, 1986. [2] R. Linsker: "Towards an Organisational Principle for a Layered Perceptual Network".
In "Neural Information Processing Systems (Denver, CO. 1987)" (Ed. D. Z. Anderson), pp. 485-494. American Institute of Physics, 1988. [3] R. Linsker: "Self-Organization in a Perceptual Network". IEEE Computer, vol. 21 (3), March 1988, pp. 105-117. [4] M. D. Plumbley and F. Fallside: "An Information-Theoretic Approach to Unsupervised Connectionist Models". Tech. Report CUED/F-INFENG/TR.7. Cambridge University Engineering Department, 1988. Also in "Proceedings of the 1988 Connectionist Models Summer School", pp. 239-245. Morgan-Kaufmann, San Mateo, CA. +--------------------------------------------+---------------------------+ | Mark Plumbley | Cambridge University | | JANET: mdp at uk.ac.cam.eng.dsl | Engineering Department, | | ARPANET: | Trumpington Street, | | mdp%dsl.eng.cam.ac.uk at nss.cs.ucl.ac.uk | Cambridge CB2 1PZ | | Tel: +44 223 332754 Fax: +44 223 332662 | UK | +--------------------------------------------+---------------------------+ From becker at ai.toronto.edu Thu Mar 9 13:26:38 1989 From: becker at ai.toronto.edu (becker@ai.toronto.edu) Date: Thu, 9 Mar 89 13:26:38 EST Subject: information function vs. squared error Message-ID: <89Mar9.132645est.10489@ephemeral.ai.toronto.edu> The use of the cross-entropy measure G = p log(p/q) + (1-p) log((1-p)/(1-q)) (Kullback, 1959), where p and q are the probabilities of a binary random variable under two probability distributions, has been described in at least 3 different contexts in the connectionist literature: (i) As an objective function for supervised back-propagation; this is appropriate if the output units are computing real values which are to be interpreted as probability distributions over the space of binary output vectors (Hinton, 1987). Here G-error represents the divergence between the desired and observed distributions. (ii) As an objective function for Boltzmann machine learning (Hinton and Sejnowski, 1986), where p and q are the output distributions in the + and - phases.
(iii) In the Gmax unsupervised learning algorithm (Pearlmutter and Hinton, 1986) as a measure of the difference between the actual output distribution of a unit and the predicted distribution assuming independent input lines. References: Hinton, G. E. 1987. "Connectionist Learning Procedures", Revised version of Technical Report CMU-CS-87-115, to appear (appeared ?) in Artificial Intelligence. Hinton, G. E. and Sejnowski, T. J. 1986. "Learning and relearning in Boltzmann machines", in Parallel distributed processing: Explorations in the microstructure of cognition, Bradford Books. Kullback, S., 1959. "Information Theory and Statistics", New York: Wiley. Pearlmutter, B. A. and Hinton, G. E. 1986. "G-Maximization: An unsupervised learning procedure for discovering regularities.", Neural Networks for Computing: American Institute of Physics Conference Proceedings 151. Sue Becker DCS, University of Toronto From mehra at aquinas.csl.uiuc.edu Fri Mar 10 05:43:16 1989 From: mehra at aquinas.csl.uiuc.edu (Pankaj Mehra) Date: Fri, 10 Mar 89 04:43:16 CST Subject: No subject Message-ID: <8903101043.AA02586@aquinas> I have recently explored several connectionist models for learning under _realistic_ learning scenarios. The class of problems for which we are trying to acquire solutions by learning are decision problems with the following characteristics: (i) large number of continuous-valued PARAMETERS, each of which (ia) takes on values from a finite range with a nonstationary distribution (ib) costs more to measure accurately. {however, accuracy can be controlled by focussed sampling} (ic) is not known to follow any particular parametric distribution (ii) the optimization CRITERION (energy, if you will) is ill-defined {much like the _blackbox_ in David Ackley's thesis} (iii) a set of OPERATORS is available, and these are the _only_ instruments for manipulating the problem state. 
(iiia) the _causal_ relationships between the states before and after the application of the operator are not known (iiib) the _persistence_ model is incomplete - i.e. it is not known a priori as to when the effect of an action will be felt and how long it will persist (iv) the TRAINING ENVIRONMENT is _slow reactive_ : it can be assumed to produce reinforcement (prescriptive feedback) rather than an error (evaluative feedback); however, the delays between an action and subsequent reinforcement follow an _unknown_ distribution. ------- These have been called Dynamic Decision Problems, and shown to be a rich class, in the following publication [available upon request from the first author]: Mehra, P. and B. W. Wah, "Architectures for Strategy Learning," in Computer Architectures for Artificial Intelligence Applications, ed. B. Wah and C. Ramamoorthy, Wiley, New York, NY, 1989 (in press). {send e-mail to: mehra at cs.uiuc.edu} ------- The above publication also examines the applicability of other well-known learning techniques {empirical, probabilistic, decision theoretic, EBL, hybrid techniques, learning to plan, etc} and suggests why ANSs might be preferred over others. As a part of this comparison, several contemporary connectionist models were found lacking in certain respects. I shall summarize the criticisms here, and would like to have feedback from those who have supported the use of these techniques. BACK-PROPAGATION: positive aspects: Simplicity of programming the learning algorithm An effective procedure for tuning of large parameter sets representable as _band matrices_ (layered networks) problematic assumptions: Immediate feedback Corrective {as against prescriptive} feedback [I am aware of Ron Williams' work, though] weakness as a learning approach Requires tweaking of features (normalization biases) to the extent that the degree of generalization varies drastically as the degree of coarse coding changes.
A great part of the success in particular applications could therefore be attributed to the intelligence of the researcher who codes those features {rather than to the _learning_ algorithm} REINFORCEMENT LEARNING positive aspects Can handle prescriptive feedback Has been shown {Rich Sutton, Chuck Anderson} to work with delayed feedback problematic assumptions The implementations known to this author assume : persistence of effects decays _exponentially_ with time : heuristic assumptions such as "recency" (that the more recent an action is, the more it is responsible for the feedback) and frequency (that the more frequently an action occurs preceding the feedback, the more likely it is to have caused the feedback) are _hardwired_ into the learning algorithms All the knowledge needed for learning is implicit, as if the learning critter was born with algorithms assuming exponential decay and as if all actions in the world caused similar delay patterns The nodes of the network compute functions much more complex than in the case of classical back-propagation. weakness as a learning paradigm All actions that occur at the same time and with the same frequency are assumed equally likely to have caused the feedback. (i.e., these algorithms have an implicitly coded causal model) No scope for using the same network to choose between actions having different causal and persistence assumptions. The learning algorithm amounts to a procedural encoding of environmental knowledge. Any success of these algorithms in realistic applications is in large part due to the intelligence of the designer and the effort they put in (for example, to find just the right lambda for the exponential decay factor). ------- See my paper for details of Dynamic Decision Problems and an extensive study of how the basic learning model underlying _most_ of the existing learning algorithms (either in AI or Connectionism) is at odds with the requirements of training in the real world.
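The hardwired exponential-decay ("recency") heuristic criticized above is easy to state concretely. The sketch below (hypothetical names, not code from the paper) shows how such implementations weight candidate actions by their age when a delayed reinforcement finally arrives:

```python
# Minimal credit-assignment sketch illustrating the hardwired "recency"
# heuristic: credit for a delayed reinforcement decays exponentially
# (factor lam) with the age of each action. The function name and lam
# are illustrative assumptions, not taken from the paper.

def assign_credit(action_times, reinforcement_time, lam=0.9):
    """Return a credit weight for each action, decaying with its age."""
    return [lam ** (reinforcement_time - t) for t in action_times]

# Two actions taken at t=0 and t=4; reinforcement arrives at t=5.
credits = assign_credit([0, 4], 5, lam=0.9)
# The later action receives far more credit, regardless of true causality.
assert credits[1] > credits[0]
```

This is exactly the implicit causal model being criticized: the decay constant lam encodes environmental knowledge in the algorithm itself.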
Comments welcome from those who read the paper, as well as from those who just want to discuss the material of this basenote. - Pankaj {Mehra at cs.uiuc.edu} From mike at bucasb.BU.EDU Fri Mar 10 12:22:14 1989 From: mike at bucasb.BU.EDU (Michael Cohen) Date: Fri, 10 Mar 89 12:22:14 EST Subject: network meeting announcement for distribution Message-ID: <8903101722.AA27914@bucasb.bu.edu> NEURAL NETWORK MODELS OF CONDITIONING AND ACTION 12th Symposium on Models of Behavior Friday and Saturday, June 2 and 3, 1989 105 William James Hall, Harvard University 33 Kirkland Street, Cambridge, Massachusetts PROGRAM COMMITTEE: Michael Commons, Harvard Medical School Stephen Grossberg, Boston University John E.R. Staddon, Duke University JUNE 2, 8:30AM--11:45AM ----------------------- Daniel L. Alkon, ``Pattern Recognition and Storage by an Artificial Network Derived from Biological Systems'' John H. Byrne, ``Analysis and Simulation of Cellular and Network Properties Contributing to Learning and Memory in Aplysia'' William B. Levy, ``Synaptic Modification Rules in Hippocampal Learning'' JUNE 2, 1:00PM--5:15PM ---------------------- Gail A. Carpenter, ``Recognition Learning by a Hierarchical ART Network Modulated by Reinforcement Feedback'' Stephen Grossberg, ``Neural Dynamics of Reinforcement Learning, Selective Attention, and Adaptive Timing'' Daniel S. Levine, ``Simulations of Conditioned Perseveration and Novelty Preference from Frontal Lobe Damage'' Nestor A. Schmajuk, ``Neural Dynamics of Hippocampal Modulation of Classical Conditioning'' JUNE 3, 8:30AM--11:45AM ----------------------- John W. Moore, ``Implementing Connectionist Algorithms for Classical Conditioning in the Brain'' Russell M. Church, ``A Connectionist Model of Scalar Timing Theory'' William S. Maki, ``Connectionist Approach to Conditional Discrimination: Learning, Short-Term Memory, and Attention'' JUNE 3, 1:00PM--5:15PM ---------------------- Michael L. 
Commons, ``Models of Acquisition and Preference'' John E.R. Staddon, ``Simple Parallel Model for Operant Learning with Application to a Class of Inference Problems'' Alliston K. Reid, ``Computational Models of Instrumental and Scheduled Performance'' Stephen Jose Hanson, ``Behavioral Diversity, Hypothesis Testing, and the Stochastic Delta Rule'' Richard S. Sutton, ``Time Derivative Models of Pavlovian Reinforcement'' FOR REGISTRATION INFORMATION SEE ATTACHED OR WRITE: Dr. Michael L. Commons Society for Quantitative Analysis of Behavior 234 Huron Avenue Cambridge, MA 02138 ---------------------------------------------------------------------- ---------------------------------------------------------------------- REGISTRATION FEE BY MAIL (Paid by check to Society for Quantitative Analysis of Behavior) (Postmarked by April 30, 1989) Name: ______________________________________________ Title: _____________________________________________ Affiliation: _______________________________________ Address: ___________________________________________ Telephone(s): ______________________________________ E-mail address: ____________________________________ ( ) Regular $35 ( ) Full-time student $25 School ____________________________________________ Graduate Date _____________________________________ Print Faculty Name ________________________________ Faculty Signature _________________________________ PREPAID 10-COURSE CHINESE BANQUET ON JUNE 2 ( ) $20 (add to pre-registration fee check) ----------------------------------------------------------------------------- (cut here and mail with your check to) Dr. Michael L. Commons Society for Quantitative Analysis of Behavior 234 Huron Avenue Cambridge, MA 02138 REGISTRATION FEE AT THE MEETING ( ) Regular $45 ( ) Full-time Student $30 (Students must show active student I.D. 
to receive this rate) ON SITE REGISTRATION 5:00--8:00PM, June 1, at the RECEPTION in Room 1550, William James Hall, 33 Kirkland Street, and 7:30--8:30AM, June 2, in the LOBBY of William James Hall. Registration by mail before April 30, 1989 is recommended as seating is limited. HOUSING INFORMATION Rooms have been reserved in the name of the symposium for the Friday and Saturday nights at: Best Western Homestead Inn 220 Alewife Brook Parkway Cambridge, MA 02138 Single: $72 Double: $80 Reserve your room as soon as possible. The hotel will not hold them past March 31. Because of Harvard and MIT graduation ceremonies, space will fill up rapidly. Other nearby hotels: Howard Johnson's Motor Lodge 777 Memorial Drive Cambridge, MA 02139 (617) 492-7777 (800) 654-2000 Single: $115--$135 Double: $115--$135 Suisse Chalet 211 Concord Turnpike Parkway Cambridge, MA 02140 (617) 661-7800 (800) 258-1980 Single: $48.70 Double: $52.70 --------------------------------------------------------------------------- From homxb!solla at research.att.com Fri Mar 10 13:10:00 1989 From: homxb!solla at research.att.com (homxb!solla@research.att.com) Date: Fri, 10 Mar 89 13:10 EST Subject: Cross-entropy error Message-ID: A detailed discussion of the cross-entropy error measure for back propagation, and a comparative study of its merits relative to the more commonly used quadratic measure, are to be found in "Accelerated Learning in Layered Neural Networks" by S.A. Solla, E. Levin, and M. Fleisher. The paper has appeared in "Complex Systems", Vol. 2, 1988. Two other relevant references to the use of such an error function in the context of supervised learning are: E.B. Baum and F. Wilczek, "Supervised Learning of Probability Distributions by Neural Network" in "Neural Information Processing Systems", ed. by D. Anderson (AIP, New York, 1988) J.J. Hopfield, "Learning Algorithms and Probability Distributions in Feed-forward and Feed-back Networks", Proc. Natl. Acad. Sci. USA, Vol. 84, 1987, pp. 8429-8433.
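The accelerated learning reported in these papers can be seen directly in the gradients: for a sigmoid output unit, the quadratic error's derivative with respect to the unit's net input carries a factor y(1-y) that vanishes when the unit saturates at the wrong answer, while for cross-entropy that factor cancels. A small illustration (my own sketch, not code from any of the cited papers):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def quadratic_grad(a, t):
    """d/da of 0.5*(y - t)^2 for y = sigmoid(a): keeps the y*(1-y) factor."""
    y = sigmoid(a)
    return (y - t) * y * (1.0 - y)

def cross_entropy_grad(a, t):
    """d/da of -[t*log(y) + (1-t)*log(1-y)]: the y*(1-y) factor cancels."""
    y = sigmoid(a)
    return y - t

# A saturated, badly wrong unit: target t=1, net input a=-10.
# The quadratic gradient is nearly zero, so learning stalls;
# the cross-entropy gradient stays near -1, so learning proceeds.
a, t = -10.0, 1.0
```

The stalled-gradient case is precisely where the cross-entropy measure is claimed to accelerate learning.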
Sara A. Solla
AT&T Bell Laboratories
solla at homxb.att.com

From John.Hampshire at SPEECH2.CS.CMU.EDU Sun Mar 12 13:21:21 1989
From: John.Hampshire at SPEECH2.CS.CMU.EDU (John.Hampshire@SPEECH2.CS.CMU.EDU)
Date: Sun, 12 Mar 89 13:21:21 EST
Subject: non-MSE objective function for backprop
Message-ID:

********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD ***********
**************** TO OTHER BBOARDS/ELECTRONIC MEDIA *******************

A NOVEL OBJECTIVE FUNCTION FOR IMPROVED CLASSIFICATION PERFORMANCE
IN TIME-DELAY NEURAL NETS USED FOR PHONEME RECOGNITION

J. B. Hampshire II
A. H. Waibel
Carnegie Mellon University

We have been working on an alternative objective function to the mean-squared-error (MSE) objective function typically used in backpropagation. Our alternative, which we term the classification figure-of-merit (CFM), forms a mathematical assessment of the *relative* activations of all output nodes of a backprop network used as a classifier. The objective function has a number of unique characteristics; chief among these are

1. its formation of internal representations that consistently differ substantially from those of the MSE objective function

2. its immunity to "over-learning" (i.e., the process by which MSE classifiers can be trained so much that they begin to key on "idiosyncratic" features of the training set that are not representative of the ensemble from which the training set was drawn; as a result, over-training actually results in degraded classification performance on a disjoint test set).

While classification performance of the CFM objective function is equivalent to that of the MSE objective function, results from the two classifiers can be combined to reduce by a median 24% the number of misclassifications made by the MSE classifier alone. This equates to single and multi-speaker /b, d, g/ recognition rates that consistently exceed 98%.
A preliminary paper on our results of applying the CFM to phoneme recognition using Time-Delay Neural Nets is available now, but if you want to wait another two weeks, you can get the NEW! IMPROVED! full-fledged technical report. If you absolutely can't wait to get your hands on this stuff, send your mailing address and something to the effect of, "send me the CFM paper." If, on the other hand, you want to see a more thorough analysis, send your mailing address and say, "send me the CFM tech report (CMU-CS-89-118) in two weeks."

In either case, send your request directly to hamps at speech2.cs.cmu.edu

***** DO NOT USE THE REPLY COMMAND IN YOUR MAILER *****

********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD ***********
**************** TO OTHER BBOARDS/ELECTRONIC MEDIA *******************

From netlist at psych.Stanford.EDU Sun Mar 12 17:13:17 1989
From: netlist at psych.Stanford.EDU (Mark Gluck)
Date: Sun, 12 Mar 89 14:13:17 PST
Subject: Tues. 3/14: ALAN LAPEDES, Neural Nets and Signal Processing
Message-ID:

Stanford University Interdisciplinary Colloquium Series:
Adaptive Networks and their Applications

Mar. 14th (Tuesday, 3:30pm):

********************************************************************************
"Nonlinear Signal Processing with Adaptive Networks"

ALAN LAPEDES
Theoretical Division
Los Alamos National Laboratory, MS B213
Los Alamos, New Mexico 87545
********************************************************************************

Abstract

Previous work on using the new generation of nonlinear neural networks for signal processing tasks is reviewed. The concept of a nonlinear system changing its behavior as a parameter is changed (bifurcations) is introduced and investigated for the simple logistic map. In this situation we show that instabilities (limit cycles, chaos) of this system may be predicted as a function of a system parameter purely from observations of the system in its stable regime, where it evolves to a stable fixed point.
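Readers who want to experiment can generate training data from the logistic map easily. A minimal sketch (an editor's illustration, assuming the usual form x' = r*x*(1-x); the parameter values are illustrative, not those used in the talk, and the network itself is omitted):

```python
def logistic(x, r):
    # one step of the logistic map: x' = r * x * (1 - x)
    return r * x * (1.0 - x)

def make_training_pairs(r_values, n_transient=200, n_keep=5):
    # Observe the system only in its stable regime (r < 3, where
    # iterates settle to the fixed point 1 - 1/r).  A network trained
    # on (x, r) -> next x from such data can then be iterated at
    # larger r to predict limit cycles and chaos it has never seen.
    pairs = []
    for r in r_values:
        x = 0.5
        for _ in range(n_transient):   # discard transients
            x = logistic(x, r)
        for _ in range(n_keep):        # record input/target pairs
            x_next = logistic(x, r)
            pairs.append(((x, r), x_next))
            x = x_next
    return pairs

pairs = make_training_pairs([2.5, 2.7, 2.9])
(x, r), _ = pairs[0]
print(abs(x - (1.0 - 1.0 / r)))  # effectively zero: at the fixed point
```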
We consider predicting the bifurcation of a hydrodynamic experiment. Both backpropagation nets and radial basis networks are used on this problem. Agreement with experiment is good, and plenty of pretty three dimensional pictures will be shown. Unnecessary formalism will be kept to a bare minimum.

Additional Information
----------------------

Location: Room 380-380X, which can be reached through the lower level between the Psychology and Mathematical Sciences buildings.

Level: Technically oriented for persons working in related areas.

Mailing lists: To be added to the network mailing list, netmail to netlist at psych.stanford.edu with "addme" as your subject header. For additional information, contact Mark Gluck (gluck at psych.stanford.edu).

From harnad at Princeton.EDU Mon Mar 13 13:57:26 1989
From: harnad at Princeton.EDU (Stevan Harnad)
Date: Mon, 13 Mar 89 13:57:26 EST
Subject: Abstract for CNLS Conference
Message-ID: <8903131857.AA19332@clarity.Princeton.EDU>

Here is the abstract for my contribution to the session on the "Emergence of Symbolic Structures" at the 9th Annual International Conference on Emergent Computation, CNLS, Los Alamos National Laboratory, May 22 - 26, 1989.

Grounding Symbols in a Nonsymbolic Substrate

Stevan Harnad
Behavioral and Brain Sciences
Princeton NJ

There has been much discussion recently about the scope and limits of purely symbolic models of the mind and of the proper role of connectionism in mental modeling.
In this paper the "symbol grounding problem" -- the problem of how the meanings of meaningless symbols, manipulated only on the basis of their shapes, can be grounded in anything but more meaningless symbols in a purely symbolic system -- is described, and then a potential solution is sketched: Symbolic representations must be grounded bottom-up in nonsymbolic representations of two kinds: (1) iconic representations, which are analogs of the sensory projections of objects and events, and (2) categorical representations, which are learned or innate feature-detectors that pick out the invariant features of object and event categories. Elementary symbols are the names of object and event categories, picked out by their (nonsymbolic) categorical representations. Higher-order symbols are then grounded in these elementary symbols. Connectionism is a natural candidate for the mechanism that learns the invariant features. In this way connectionism can be seen as a complementary component in a hybrid nonsymbolic/symbolic model of the mind, rather than a rival to purely symbolic modeling. Such a hybrid model would not have an autonomous symbolic module, however; the symbolic functions would emerge as an intrinsically "dedicated" symbol system as a consequence of the bottom-up grounding of categories and their names.

From niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Tue Mar 14 10:16:44 1989
From: niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (Mahesan Niranjan)
Date: Tue, 14 Mar 89 15:16:44 GMT
Subject: information function vs. squared error
Message-ID: <28888.8903141516@dsl.eng.cam.ac.uk>

I tried sending the following note last weekend but it failed for some reason - apologies if anyone is getting a repeat!

Re:
> Date: Wed, 08 Mar 89 11:36:31 EST
> From: thanasis kehagias
> Subject: information function vs.
squared error
>
> i am looking for pointers to papers discussing the use of an alternative
> criterion to squared error, in back propagation algorithms. the [..]
> G = sum_{i=1}^{N} p_i*log(p_i)
>

Here is a non-causal reference: I have been looking at an error measure based on "approximate distances to class-boundary" instead of the total squared error used in typical supervised learning networks. The idea is motivated by the fact that a large network has an inherent freedom to classify a training set in many ways (and thus generalise poorly!). In my training, an example of a particular class gets a target value depending on where it lies with respect to examples from the other class (in a two-class problem). This implies that the target interpolation function that the network has to construct is a smooth transition from one class to the other (rather than a step-like cross-section as in the total squared error criterion). The important consequence of doing this is that networks are automatically deprived of the ability to form large-weight (= sharp cross-section) solutions (an automatic weight decay!!).

niranjan

PS: A Tech report will be announced soon.

From sven at iuvax.cs.indiana.edu Tue Mar 14 10:12:36 1989
From: sven at iuvax.cs.indiana.edu (Sven Anderson)
Date: Tue, 14 Mar 89 10:12:36 -0500
Subject: Connection between Hidden Markov Models and Connectionist Networks
In-Reply-To: thanasis kehagias's message of Mon, 13 Feb 89 00:47:00 EST
Message-ID:

I'm interested in receiving the paper you described:

OPTIMAL CONTROL FOR TRAINING: THE MISSING LINK BETWEEN
HIDDEN MARKOV MODELS AND CONNECTIONIST NETWORKS

by Athanasios Kehagias
Division of Applied Mathematics
Brown University
Providence, RI 02912

If it's more convenient you might just forward the dvi file.
thanks,
Sven Anderson

From honavar at cs.wisc.edu Tue Mar 14 17:59:39 1989
From: honavar at cs.wisc.edu (A Buggy AI Program)
Date: Tue, 14 Mar 89 16:59:39 -0600
Subject: TR available (** DO NOT FORWARD TO BULLETIN BOARDS **)
Message-ID: <8903142259.AA01452@goat.cs.wisc.edu>

** PLEASE DO NOT FORWARD TO BULLETIN BOARDS **

The following TR is now available:

---------------------------------------

Perceptual Development and Learning: From Behavioral, Neurophysiological, and Morphological Evidence To Computational Models

Vasant Honavar
Computer Sciences Department
University of Wisconsin-Madison

Computer Sciences TR # 818, January 1989

Abstract

An intelligent system has to be capable of adapting to a constantly changing environment. It therefore ought to be capable of learning from its perceptual interactions with its surroundings. This requires a certain amount of plasticity in its structure. Any attempt to model the perceptual capabilities of a living system or, for that matter, to construct a synthetic system of comparable abilities, must therefore account for such plasticity through a variety of developmental and learning mechanisms. This paper examines some results from neuroanatomical, morphological, as well as behavioral studies of the development of visual perception; integrates them into a computational framework; and suggests several interesting experiments with computational models that can yield insights into the development of visual perception.

---------------------------------------

Requests for copies must be addressed to: honavar at cs.wisc.edu

From ash%cs at ucsd.edu Tue Mar 14 19:15:54 1989
From: ash%cs at ucsd.edu (Tim Ash)
Date: Tue, 14 Mar 89 16:15:54 PST
Subject: No subject
Message-ID: <8903150015.AA19834@beowulf.ucsd.edu.UCSD.EDU>

-----------------------------------------------------------------------
The following technical report is now available.
-----------------------------------------------------------------------

DYNAMIC NODE CREATION IN BACKPROPAGATION NETWORKS

Timur Ash
ash at ucsd.edu

Abstract

Large backpropagation (BP) networks are very difficult to train. This fact complicates the process of iteratively testing different sized networks (i.e., networks with different numbers of hidden layer units) to find one that provides a good mapping approximation. This paper introduces a new method called Dynamic Node Creation (DNC) that attacks both of these issues (training large networks and testing networks with different numbers of hidden layer units). DNC sequentially adds nodes one at a time to the hidden layer(s) of the network until the desired approximation accuracy is achieved. Simulation results for parity, symmetry, binary addition, and the encoder problem are presented. The procedure was capable of finding known minimal topologies in many cases, and was always within three nodes of the minimum. Computational expense for finding the solutions was comparable to training normal BP networks with the same final topologies. Starting out with fewer nodes than needed to solve the problem actually seems to help find a solution. The method yielded a solution for every problem tried. BP applied to the same large networks with randomized initial weights was unable, after repeated attempts, to replicate some minimum solutions found by DNC.

-----------------------------------------------------------------------

Requests for reprints (ICS Report 8901) should be directed to:

Claudia Fernety
Institute for Cognitive Science C-015
University of California, San Diego
La Jolla, CA 92093

-----------------------------------------------------------------------

From wine at CS.UCLA.EDU Wed Mar 15 08:49:36 1989
From: wine at CS.UCLA.EDU (wine@CS.UCLA.EDU)
Date: Wed, 15 Mar 89 05:49:36 PST
Subject: TR available (** DO NOT FORWARD TO BULLETIN BOARDS **)
In-Reply-To: Your message of Tue, 14 Mar 89 16:59:39 -0600.
<8903142259.AA01452@goat.cs.wisc.edu>
Message-ID: <8903151349.AA04692@retina.cs.ucla.edu>

Please send me a copy of your technical report #818. Thank you in advance.

--David Wine
University of California at Los Angeles     wine at cs.ucla.edu
Computer Science Department                 (213) 825-6121
3531 Boelter Hall                           ...!(uunet,rutgers,ucbvax,randvax)!cs.ucla.edu!wine
Los Angeles, CA 90024

From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Wed Mar 15 18:24:14 1989
From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias)
Date: Wed, 15 Mar 89 18:24:14 EST
Subject: what is a connectionist network?
Message-ID:

ok, here is my question. i hope it makes sense: very often i want to refer to "these things". i do not want to call them neural networks, since it is far from clear to me they really have a similarity with the human nervous system. so i chose to call them connectionist networks. i guess this means they are networks with (many) connections. but this is very general. so i do not have a clear definition of what i am talking about. i am sure i could come up with several, but they seem to me to be either too restrictive or too general. so would anybody care to give their definition of these objects that this list is about?

the issue is not trivial or vacuously philosophical. i think that even if we do not come up with a generally accepted definition of what a connectionist net is, people will have a chance to present competing opinions. possibly some lurking differences will come to the surface and the foundations of connectionism will become more secure.

here is a case that i think is fraught with issues (that could be cleared up). any dynamical system that evolves in discrete time can be represented (over a finite time interval) by a feedforward connectionist network. is it fair to say that dynamical systems are connectionist networks? conversely, is it fair to say that feedforward nets are dynamical systems? what are the implications for a time-space trade-off?
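a quick way to see the representation claim above is to unroll the time index into layers: layer t of a feedforward net computes the state at time t. a minimal sketch (an editor's illustration; the particular map f is an arbitrary assumption, and a real net would of course implement f with weighted sums and squashing functions):

```python
def unroll(f, depth):
    # Represent T steps of x(t+1) = f(x(t)) as a depth-T feedforward
    # composition: "layer" t computes the state at time t.  This trades
    # time (iteration) for space (layers) -- the time-space trade-off
    # raised above.
    def net(x0):
        activations = [x0]
        x = x0
        for _ in range(depth):
            x = f(x)              # one layer = one time step
            activations.append(x)
        return activations        # activations[t] is the state at time t
    return net

f = lambda x: 0.5 * x + 1.0       # an illustrative discrete-time system
net = unroll(f, 4)
print(net(0.0))                   # [0.0, 1.0, 1.5, 1.75, 1.875]
```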
how much do we have to learn about dynamical systems to do connectionist research?

ok, after all this i guess i have to give my definition of a connectionist network. it is rather involved and it goes like this: "connectionism is not a yes-or-no property. any directed graph (collection of nodes and directed edges) has a connectionism index, defined as the ratio of nr. of edges to nr. of nodes."

PS: has anybody already dealt with the question of defining a CN? references welcome.

Thanasis

From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Wed Mar 15 18:23:24 1989
From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias)
Date: Wed, 15 Mar 89 18:23:24 EST
Subject: cross entropy and training time in Connectionist Nets and HMM's
Message-ID:

these are some random thoughts on the issue of training in HMM's and Connectionist networks. i focus on the cross entropy function and follow a different line of thinking than in my paper which i quote further down. this note is the outcome of an exchange between Tony Robinson and me; i thought some netters might be interested. so i want to thank Tony for posing interesting ideas and questions. also thanks to all the people who replied to my request for information on the cross entropy function.

-----------------------

the starting point for this discussion is the following question: "why is HMM training so much faster than Connectionist Network training?"

to put the question in perspective, let me first remark that, from a certain point of view, HMM's and CN's are very similar objects. specifically, they use similar architectures to optimize appropriate cost functions. for further explanation of this point, see [Kehagias], also [Kung]. the similarity is even more obvious when CN's are used to solve speech recognition problems. the question remains: why, in attempting to solve the same problem, do CN's require so much more training?

1. cost functions
-----------------

it appears that a (partial) explanation is the nature of the cost function used in each case. in CN speech recognizers, the cost function of choice is quadratic error (error being the difference of appropriate vectors). however, in most of what follows i will consider CN's that maximize the cross entropy function. a short discussion of the relationship between cross entropy and square error is included at the end.

in HMM's the function MAXIMIZED is the likelihood (of the observations). however, HMM's are a bit more subtle. using the Markov Model, one can write the likelihood of the observations used for training, call it L(q). here q is a vector that contains the transition and emission probabilities (usually called a_ij, b_kj, respectively). to keep the discussion simple, let us consider the only unknown parameters to be the a_ij's. that is, the elements of q are the a_ij's. now, q is a vector, but a more general view of it is that it is a function (specifically a probability density function). so we will consider q as a vector or a function interchangeably. (of course any vector is a function of its index!)

now, to maximize L is not a trivial task: it is a polynomial of n*T-th order in the elements of q (where n is the order of the Markov model and T the number of observations); furthermore, the elements of q are probabilities and they must satisfy certain positivity and add-up-to-1 conditions.

2. Likelihood maximin, Backward-Forward, EM algorithm
-----------------------------------------------------

so HMM people have found a way to make the optimization problem easier: consider an auxiliary function, call it Q(q,q'), to be presently defined, which can be maximized much more easily. then they prove the remarkable inequality:

(1) L(q)*log(L(q')/L(q)) >= Q(q,q') - Q(q,q).

the consequence of (1) is the following: we can implement an iterative algorithm that goes as follows:

Step 0: choose q(0)
.....
Step k: choose q(k) such that Q(q(k-1),q(k)) is maximized. if Q(q(k-1),q(k)) = Q(q(k-1),q(k-1)), terminate; if Q(q(k-1),q(k)) > Q(q(k-1),q(k-1)), go to step k+1.
.....

REMARKS:

1) observe that no provision is made for the case that Q(q(k-1),q(k)) - Q(q(k-1),q(k-1)) is negative. this is due to the fact that the maximal improvement in Q is always nonnegative (the choice q(k) = q(k-1) already gives zero improvement), as proved in [Baum 1968] or [Dempster].

2) of course, in practice, the termination condition will be replaced by: terminate when the improvement falls below some small threshold. in any case, by the choice of q(k) in step k we have

(2) Q(q(k-1),q(k)) > Q(q(k-1),q(k-1))

whenever the algorithm does not terminate. from (1) and (2) and Remark (1) it follows that

(3) L(q(k)) > L(q(k-1)).

3. Connection of EM with cross entropy and neural networks
----------------------------------------------------------

now we will discuss the function Q and point out the relationship to CN's. the function Q(q,q') can be defined in quite a general setting. q, q' are probability densities. as such they are functions themselves; we write q(x), q'(x). x takes values in an appropriate range. e.g., in the HMM model x ranges over all the state transition pairs (i,j), giving the probability of a certain state transition. now, define Q:

(4) Q(q,q') = sum{over all x} q(x)*log(q'(x)).

then, the difference Q(q,q) - Q(q,q') is:

(5) G(q,q') = Q(q,q) - Q(q,q') = sum{over all x} q(x)*log(q(x)/q'(x)).

G is the cross-entropy between q and q', well known to connectionists (and statisticians), that is, a measure of distance between these two probability densities. now we recognize two things:

I. there have been cases where G minimization has been proposed as a CN training procedure. see [Hinton]. in these cases, a desired probability density was known and what was desired was to minimize the distance between desired and actual probability density of the CN output. in some of these cases, there was concurrent maximization of likelihood. this is noted in [Ackley]. it follows necessarily from (1) that minimizing the cross-entropy maximizes the minimum improvement in likelihood.

II. it is clear that the BF algorithm does a similar thing: likelihood maximization, cross entropy minimization.
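the identity in (5), the nonnegativity of G, and the fact that G vanishes only at q' = q are easy to check numerically. a minimal sketch (an editor's illustration; the two distributions are arbitrary choices, not from any of the models discussed):

```python
import math

def Q(p, p_prime):
    # Q(q, q') = sum over x of q(x) * log(q'(x)), as in (4)
    return sum(px * math.log(ppx) for px, ppx in zip(p, p_prime))

def G(p, p_prime):
    # cross-entropy distance, as in (5):
    # G(q, q') = sum over x of q(x) * log(q(x) / q'(x))
    return sum(px * math.log(px / ppx) for px, ppx in zip(p, p_prime))

q  = [0.7, 0.2, 0.1]   # two illustrative distributions
qp = [0.5, 0.3, 0.2]

# (5): G(q, q') equals Q(q, q) - Q(q, q'); it is nonnegative,
# and it vanishes when q' = q.
print(abs(G(q, qp) - (Q(q, q) - Q(q, qp))))  # effectively zero
print(G(q, qp) > 0.0)                        # True
print(G(q, q))                               # 0.0
```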
as noted in [Baum 1968] and also in [Levinson], the difference q(k)-q(k-1) points in the same direction as grad L(q), evaluated at q(k-1). that is, q(k-1) is changed in the direction of steepest ascent of L. of all the possible steps (choices of q(k)), the one is chosen that minimizes the distance between q(k-1) and q(k) in the cross entropy sense.

4. Comparison in training of HMM's and CN's
-------------------------------------------

now we can make a comparison of the performance of CN's and HMM's. this comparison is between G-optimizing CN's and HMM's; the square-error CN is not discussed here. firstly, we see that the main focus of attention is different in the two cases. in CN's we want to minimize cross entropy. in HMM's we want to maximize likelihood. however, likelihood maximinimization is an automatic consequence of G minimization for CN's, and local G minimization is built into the BF algorithm. in that sense, the two tasks are very similar, and so the question is once again raised: why are HMM's faster to train?

at this point the answers are many and easy. even though HMM's use observations in a nonlinear way, the state vector of the adjoint network (see [Kehagias]) evolves linearly. not so for CN's. the HMM adjoint network is sparsely connected. not necessarily so for the CN (pointed out by [Tony Robinson]). though both cost functions used are nonlinear, the BF is a much more efficient method to optimize the HMM cost function than Back Propagation is for CN's.

the last answer is the really important one. due to the special nature of the Hidden Markov Model, we can use the BF algorithm. this algorithm allows us to take large steps (large changes from q(k-1) to q(k)) in the Euclidean distance, without moving too far away in the cross entropy distance.
of all the probability distributions, we consider only the ones that are "relevant", in that they are close to the current one; and yet, even though we take conservative steps, we are guaranteed to maximize the minimum improvement in likelihood. indeed the max-min is a conservative attitude. the rationale is the following: "you want to maximize L. you know the steepest ascent direction; you want to go in that direction, but you do not know how far to go. BF will tell you how far you can go (and it will not be an infinitesimal step) so that you maximize the minimum improvement."

another way to look at this is that the Euclidean distance imposes a structure (topology) on the space of probability distributions. the cross entropy distance imposes a different structure, which, apparently, is more relevant to the problem. in contrast, in BP we do not have much choice in the change we bring on q. we have control over w, the weights of the connections, and we usually choose them in the steepest descent direction, and small enough that we actually have an improvement. but it is not clear that the cross entropy between distributions imposes a suitable structure on the space of weights. apparently it does not. even a relatively small step in the weight space can change the cost function by much. we have to tread more carefully.

of course BF can be used due to the very special structure of the HMM problem (which is probably a good argument for the usefulness of the HM Model). BF is applicable when the cost function is a homogeneous polynomial with additive constraints on the variables (see [Baum 1968]). the CN problem is characterized by harder nonlinearities (e.g. the sigmoid function) which induce a warped relationship between the weights and the cost function. in short, the CN problem is more general and harder.

5. square error cost function
-----------------------------

first a general observation: the square error cost function can be introduced under two assumptions.
in the one case we assume the error to be deterministic and we want to minimize a deterministic sum of square errors (the sum is over all training patterns; the error is the difference between desired and actual response) by appropriate choice of weights. there is nothing probabilistic here. alternatively, we can assume that the training patterns are selected randomly (according to some probability density), and also that the test patterns will come from the same probability density, and we choose the weights to minimize expected square error. even though the two points of view are distinct, they are not that different, since in both cases we can define inner products, distance functions etc. and so get a Hilbert space structure that is practically the same for both cases. of course this would involve some ergodicity assumption. at any rate, assume here the probabilistic point of view of square error.

what, then, are the connections between the two cost functions: cross entropy and expected (or mean) square error? i have seen some remarks on this problem in the literature, but i do not know enough about it at this point. however, judging from training time, i would say that the nonlinear nature of CN's with sigmoids again maps the weight space to the cost function in a very warped way. it would be interesting to examine the shape of the cost function contours in the weight space. have such studies been made? visualization seems to be a problem for high dimensional networks.

6. cross entropy maximization and some loose ends
-------------------------------------------------

an interesting variation is G maximization. this usually occurs in unsupervised learning. see [Linsker], [Plumbley]. it appears under the name of transinformation maximization, or error information minimization, but these quantities can be interpreted as cross entropy between the joint input-output probability density induced by the CN (for given weights) and the probability density
where input and output have the same marginals, but are independent (so the joint density is a product of the two marginals). i guess a way to explain this in terms of cross entropy is: even though we have no prior information on the best input-output density, there is one density we certainly want to avoid as much as possible, and this is the one where input and output are independent (so the input gives no information as to what the output is). hence we want to maximize the cross entropy distance between this product distribution and the CN-induced distribution. there is also a possible interpretation along the lines of the maximum entropy principle. i must say that these interpretations do not seem (yet) to me as appealing as maximum transinformation. however they are possible, and indeed statisticians have been considering them for many years now.

another interesting connection is between cross entropy and rate of convergence (obviously rate of convergence is connected to training time). [Ellis] gives an excellent analysis of the connection between rate of convergence and cross entropy. application of his results to computational problems is not obvious.

finally, an interesting example (of statistical work that relates to this line of connectionist research) is [Rissanen]; there the linear regression model is considered, which of course can be interpreted as a linear perceptron. in [Rissanen] selection of the optimal model is based on a minmax entropy criterion.

References:
-----------

D.H. Ackley et al.: "A Learning Algorithm for Boltzmann Machines", Cognitive Science 9 (1985).

L.E. Baum & G.R. Sell: "Growth Transformations for Functions on Manifolds", Pacific Journal of Mathematics, Vol. 27, No. 2, 1968.

L.E. Baum et al.: "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains", The Annals of Math. Stat., Vol. 41, No. 1, 1970.

A.P. Dempster et al.: "Maximum Likelihood from Incomplete Data via the EM Algorithm", J. Roy. Stat.
Soc., No. 1, 1977.

R. Ellis: "Entropy, Large Deviations and Statistical Mechanics", Springer, New York, 1985.

G. Hinton: "Connectionist Learning Procedures", Technical Report CMU-CS-87-115 (Carnegie Mellon University), June 1987.

A. Kehagias: "Optimal Control for Training: The Missing Link between HMM and Connectionist Networks", submitted to 7th Int. Conf. on Math. and Computer Modelling, Chicago, Illinois, August 1989.

S.Y. Kung & J.N. Hwang: "A Unifying Viewpoint of Multilayer Perceptrons and HMM Models", IEEE Int. Symposium on Circuits and Systems, Portland, Oregon, 1989.

S.E. Levinson et al.: "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition", The Bell Sys. Tech. J., Vol. 62, No. 4, April 1983.

R. Linsker: "Self-Organization in a Perceptual Network", IEEE Computer, Vol. 21, No. 3, March 1988.

M. Plumbley & F. Fallside: "An Information Theoretic Approach to Unsupervised Connectionist Models", Proceedings of the 1988 Connectionist Models Summer School, Pittsburgh, 1988.

J. Rissanen: "Minmax Entropy Estimation of Models for Vector Processes", in Lainiotis-Mehra (eds.), System Advances and Case Studies, Academic, New York, 1976.

T. Robinson: personal communication.

From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Thu Mar 16 09:54:52 1989
From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias)
Date: Thu, 16 Mar 89 09:54:52 EST
Subject: HMM?
Message-ID:

with respect to my cross entropy posting, i guess i never said it explicitly: HMM stands for Hidden Markov Model. it is a model widely used in speech research.

Thanasis

From sankar at caip.rutgers.edu Thu Mar 16 09:42:44 1989
From: sankar at caip.rutgers.edu (ananth sankar)
Date: Thu, 16 Mar 89 09:42:44 EST
Subject: questions on kohonen's maps
Message-ID: <8903161442.AA14983@caip.rutgers.edu>

I am interested in the subject of Self Organization and have some questions with regard to Kohonen's algorithm for Self Organizing Feature Maps.
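For concreteness, the adaptation rule under discussion, with the commonly used shrinking gain and neighbourhood, can be sketched as follows (an editor's minimal sketch; the exponential decay schedules and all parameter values are illustrative assumptions, not prescriptions from Kohonen's book):

```python
import math, random

def train_som(data, grid=10, iters=2000, a0=0.5, s0=5.0):
    # Kohonen feature map: a grid x grid sheet of units with 2-d
    # weight vectors.  Gain a(t) and neighbourhood width s(t) both
    # decay exponentially; these schedules are common heuristic
    # choices, since no canonical analytical form is agreed upon.
    random.seed(0)
    w = [[[random.random(), random.random()] for _ in range(grid)]
         for _ in range(grid)]
    for t in range(iters):
        x = random.choice(data)
        a = a0 * math.exp(-3.0 * t / iters)             # gain schedule
        s = max(s0 * math.exp(-3.0 * t / iters), 0.5)   # width schedule
        # winner = unit whose weight vector is nearest; the weights are
        # not normalised, so a distance rule (not a dot product) is used
        bi, bj = min(((i, j) for i in range(grid) for j in range(grid)),
                     key=lambda ij: (w[ij[0]][ij[1]][0] - x[0]) ** 2
                                  + (w[ij[0]][ij[1]][1] - x[1]) ** 2)
        # move the winner and its grid neighbours toward the input
        for i in range(grid):
            for j in range(grid):
                h = math.exp(-((i - bi) ** 2 + (j - bj) ** 2)
                             / (2.0 * s * s))
                for k in range(2):
                    w[i][j][k] += a * h * (x[k] - w[i][j][k])
    return w

# uniform 2-d input, as in the experiment described below
data = [[random.random(), random.random()] for _ in range(500)]
w = train_som(data)
```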
I have tried to duplicate the results of Kohonen for the two dimensional uniform input case, i.e. two inputs. I used a 10 x 10 output grid. The maps that resulted were not as good as reported in the papers. Questions:

1. Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach.

2. Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features.

3. In Kohonen's book "Self Organization and Associative Memory", Ch. 5, the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation.

4. I have not yet seen in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited.

5. Can the net become disordered after ordering is achieved at any particular iteration?

I would appreciate any comments, suggestions etc. on the above. Also, so that net mail clutter may be reduced, please respond to sankar at caip.rutgers.edu

Thank you.

Ananth Sankar
Department of Electrical Engineering
Rutgers University, NJ

From KELLY%BROWNCOG.BITNET at mitvma.mit.edu Thu Mar 16 12:12:00 1989
From: KELLY%BROWNCOG.BITNET at mitvma.mit.edu (KELLY%BROWNCOG.BITNET@mitvma.mit.edu)
Date: Thu, 16 Mar 89 12:12 EST
Subject: What is a connectionist net? Here's what it's not.
Message-ID:

What is a connectionist model, you ask? Well, I don't think I can answer that specifically, but I can tell you what it's not. In the first place it *is* a member of a larger class of models called complex systems.
But that doesn't help us either, because nobody really knows what a complex system is. The generally conceived definition has something to do with large numbers of simple, interconnected units which can perform some type of "cooperative computation". That is, individually the units are so dumb that they can't do anything, but together they can do a lot. Well, then, my claim (I'm really out on a limb here) is that systems with large numbers of very complex, interconnected units really aren't connectionist models (or even complex systems) at all, no matter how many connections there are or what type of amazing results they achieve. In particular I am referring to the result that Hecht-Nielsen reports in his paper on "Kolmogorov's Mapping Neural Network Theorem" [1987 INNS proceedings?]. There he describes a way of proving that a 2-layered net (one hidden layer) is capable of solving any mapping problem. However, the units in the network are incredibly complex. No longer are we dealing with units that compute threshold functions. The hidden layer units must be able to compute any real, continuous, monotonically increasing function, and the output layer units must be able to compute any *arbitrary* real continuous function. While the fact that a system like this can do some serious computation is interesting (neat, even), it really tells us nothing about connectionist networks. From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Thu Mar 16 22:19:54 1989 From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias) Date: Thu, 16 Mar 89 22:19:54 EST Subject: credits Message-ID: recently i posted a note about training of HMM and Connectionist Networks, where i was not careful enough in giving credit to people who deserved it. let me try to make up for it: i had a very interesting exchange of messages with Tony Robinson, which formed the basis for my note. i received messages with ideas and references from Mark Plumbley, Steven Nowlan, Sue Becker and Sara Solla.
Sara Solla referred me to a paper written by Solla, Esther Levin and Michael Fleisher that deals with the question of cross entropy. i received a copy of this paper today. it is: "Accelerated Learning in Layered Neural Networks", by S. Solla, E. Levin and M. Fleisher, Complex Systems, Vol. 2, 1988. the paper compares cross entropy and square error and includes a numerical study and a study of the shape of the contours of these cost functions. therefore, the similar question i posed at the end of my note is at least partly answered. i also received the revised copy of G. Hinton's report on connectionist learning procedures, referred to in my note. in this report (Dec. 1987) Hinton has already made a remark directly related to my point about maximizing likelihood in the BF algorithm. specifically, he says that (in the context of CN training with a cross entropy cost function) likelihood is maximized when cross entropy is minimized. i think this is all. if i have missed something, let me know about it. Thanasis From ROB%BGERUG51.BITNET at VMA.CC.CMU.EDU Fri Mar 17 09:24:00 1989 From: ROB%BGERUG51.BITNET at VMA.CC.CMU.EDU (Rob A. Vingerhoeds / Ghent State University) Date: Fri, 17 Mar 89 09:24 N Subject: Neural Networks Seminar Ghent, 25 April 1989, FINAL ANNOUNCEMENT Message-ID: BIRA SEMINAR ON NEURAL NETWORKS "APPLICATION OF NEURAL NETWORKS IN INDUSTRY, WHEN AND HOW" 25 APRIL 1989 INTERNATIONAL CONGRESS CENTRE GHENT BELGIUM FINAL ANNOUNCEMENT BIRA (the Belgian Institute for Control Engineering and Automation) is organising a seminar on the state of the art in Neural Networks. The central theme will be "Application of Neural Networks in Industry, when and how". To arrive at a sound and reliable verdict on this theme, some of the most important and leading scientists in this fascinating area have been invited to present a lecture at the seminar and take part in a panel discussion.
The following program is foreseen:

 8.30 -  9.00  Registration
 9.00 -  9.15  Opening on behalf of BIRA (Prof. L. Boullart, Ghent State University)
 9.15 - 10.00  Learning Algorithms and Applications in A.I. (Prof. Fogelman Soulie, Universite de Paris V)
10.00 - 10.30  Coffee
10.30 - 11.30  The Neural Network Framework (Prof. B. Kosko, University of Southern California)
11.30 - 12.00  Presentation of ANZA+ products, hardware and software (Patrick Dumont, Digilog, France)
12.00 - 14.00  Lunch / exhibition
14.00 - 15.00  Integration of knowledge-based system and neural network techniques for robotic control (Dr. David Handelman, Princeton, USA)
15.00 - 16.00  Application in Image Processing and Pattern Recognition (Neocognitron) (Dr. S. Miyake, ATR, Japan)
16.00 - 16.30  Tea
16.30 - 17.15  Panel discussion on the central theme
17.15 - 17.30  Closing and conclusions

The seminar will be held during the same period as the famous Flanders Technology International (F.T.I.) exhibition. The exhibition is of great interest both to representatives from industry and to other visitors, so combining the seminar with a visit to the exhibition is doubly worthwhile.

VENUE: International Congress Centre Ghent, Orange Room, Citadelpark, B-9000 Ghent
DATE: Tuesday 25 April 1989
LANGUAGE: The seminar language is English. No translation will be provided.
REGISTRATION FEES: members BIRA/IBRA 12.500 BEF; non-members 15.000 BEF; Teachers/Assistants 7.500 BEF; including coffee/tea, lunch and proceedings. Students can get a special price of 1.500 BEF, which does NOT include lunch. Tickets for FLANDERS TECHNOLOGY INTERNATIONAL can be obtained at the registration desk. Payments in Belgian Francs only, to be made on receipt of an invoice from the BIRA office. Registration will close on 18 April 1989. Confirmations will NOT be sent. For further information or a printed announcement with a registration form please contact either the BIRA coordinator (address below) or one of us (using e-mail).
You can also use the registration form printed below and send it back to us via e-mail. We will then make sure it reaches BIRA in time.

----------------------------------------------------------------------
REGISTRATION FORM
Tuesday 25 April 1989, I.C.C.-Ghent
BIRA Seminar on NEURAL NETWORKS

NAME: ..................................................
FIRST NAME: ..................................................
ADDRESS: ..................................................
POSITION: ..................................................
CONCERN OR INSTITUTE: ..................................................
TEL: ..................................................
FAX: ..................................................
-------------------------
Member BIRA/IBRA : ........ BEF
Non-members : ........ BEF
Teachers/Assistants : ........ BEF
-------------------------
Please only settle payment upon receipt of an invoice from the BIRA office. Please indicate whether the invoice should be addressed to the company or to your personal address.
Date:
Please send back before 17 April 1989. Do NOT use 'REPLY', because that way everyone on the list will be informed about your plans to come to the seminar, and they just might not be interested.
----------------------------------------------------------------------
Seminar Coordinators: Rob Vingerhoeds, Leo Vercauteren
BIRA COORDINATOR: L. Pauwels, BIRA-Office, Het Ingenieurshuis, Desguinlei 214, 2018 Antwerpen, Belgium
tel: +32-3-216-09-96 fax: +32-3-216-06-89 (attn. BIRA L. Pauwels)

From alexis%yummy at gateway.mitre.org Fri Mar 17 09:46:27 1989 From: alexis%yummy at gateway.mitre.org (alexis%yummy@gateway.mitre.org) Date: Fri, 17 Mar 89 09:46:27 EST Subject: What is a connectionist net? Here's what it's not.
In-Reply-To: KELLY%BROWNCOG.BITNET@mitvma.mit.edu's message of Thu, 16 Mar 89 12:12 EST <8903170151.AA26943@gateway.mitre.org> Message-ID: <8903171446.AA02093@marzipan.mitre.org> ************ Do Not Forward To Any Other BBoards, Etc ************ Just an aside to KELLY%BROWNCOG's note: rather than worry about whether Hecht-Nielsen's neural net (and I use the term intentionally -- I mean, "artificial intelligence" is neither, so ...) is really a connectionist model, let me point out a paper/result worth being aware of. G. Cybenko wrote a very interesting paper which proves that a neural network with *one* hidden layer of nodes (i.e., one more than a perceptron) with a sigmoid transfer function can "uniformly approximate any continuous function with support in the unit hypercube". That is to say, you actually can do any mapping with *ONE* hidden layer (albeit often a very, very large one). Cybenko sent the paper to me because of a tirade I went on a while ago on this bboard, so I don't actually know if it has been published anywhere yet. I'm writing this without his knowledge -- I'm pretty sure he's on this list. G. Cybenko, are you out there, and are you willing to say where your paper "Approximation by Superpositions of a Sigmoidal Function" can be found by the hungary masses? alexis wieland. ************ Do Not Forward To Any Other BBoards, Etc ************ From sontag at fermat.rutgers.edu Sat Mar 18 18:27:29 1989 From: sontag at fermat.rutgers.edu (sontag@fermat.rutgers.edu) Date: Sat, 18 Mar 89 18:27:29 EST Subject: ONE HIDDEN LAYER IS ENOUGH -- re "what is a net?" discussion Message-ID: <8903182327.AA06225@control.rutgers.edu> This is in response to Alexis Wieland's request: "G. Cybenko are you out there, and are you willing to say where your paper "Approximation by Superpositions of a Sigmoidal Function" can be found by the hungary (sic) masses?"
(Presumably non-Hungarian masses are interested too, so:) The paper by George Cybenko that proves this theorem (a neural network with one hidden layer of nodes with a fixed sigmoid transfer function can uniformly approximate any continuous function) is scheduled to appear in MATHEMATICS OF CONTROL, SIGNALS, AND SYSTEMS, Vol.2 (1989), Number 4. Your library should have this journal, which specializes in the formal mathematical analysis of problems related to signal processing and systems. (The journal has published many other papers that should be relevant to theoretical connectionist research, such as papers on iterated projection methods, estimation, interpolation techniques, identification, and adaptive control.) If your library doesn't yet subscribe, you might as well provide them with the following info: MATHEMATICS OF CONTROL, SIGNALS, AND SYSTEMS Springer-Verlag New York, Inc ISSN 0932-4194, Title # 498 In North America, order from: Springer-Verlag New York, Inc Journal Fulfillment Services 44 Hartz Way, Secaucus, NJ 07094 (Volume 2, 1989 ... $179.00 incl. p&h) Outside NA, order from: Springer-Verlag Heidelberger Platz 3 D-1000 Berlin 33, FRG (Volume 2, 1989 ... DM 348.- incl. p&h) -bradley dickinson and eduardo d. sontag, co-Managing eds. From terry%sdbio2 at ucsd.edu Sat Mar 18 21:11:09 1989 From: terry%sdbio2 at ucsd.edu (Terry Sejnowski) Date: Sat, 18 Mar 89 18:11:09 PST Subject: ONE HIDDEN LAYER IS ENOUGH -- re "what is a net?" discussion Message-ID: <8903190211.AA17912@sdbio2.UCSD.EDU> Hal White in the Economics Department at UCSD has also proved that one hidden layer can uniformly approximate smooth mappings. He has gone on to prove the even more interesting theorem that it is possible to learn the mapping. Write to him for a preprint: Hal White Department of Economics UCSD San Diego, CA 92093 Two related papers that are in press in Neural Computation: What size net gives valid generalization? 
by Eric Baum and David Haussler A proposal for more powerful learning algorithms. Eric Baum. For preprints write to: Eric Baum Department of Physics Princeton University Princeton, NJ 08540 Terry Sejnowski ----- From chrisley.pa at Xerox.COM Mon Mar 20 14:25:00 1989 From: chrisley.pa at Xerox.COM (chrisley.pa@Xerox.COM) Date: 20 Mar 89 11:25 PST Subject: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST Message-ID: <890320-112612-6136@Xerox> Ananth Sankar recently asked some questions about Kohonen's feature maps. As I have worked on these issues with Kohonen, I feel like I might be able to give some answers, but standard disclaimers apply: I cannot be certain that Kohonen would agree with all of the following. Also, I do not have my copy of his book with me, so I cannot be more specific about references. Questions: 1 Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach. Although there is probably more than one correct, task-independent gain or neighborhood function, Kohonen does mention constraints that all of them should meet. For example, both functions should decrease to zero over time. I do not know of any tweaking; Kohonen usually determines a number of iterations and then decreases the gain linearly. If you call this tweaking, then your idea of domain-independent parameters might be a sort of holy grail, since it does not seem likely that we are going to find a completely parameter-free learning algorithm that will work in every domain. 2 Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features.
As far as I know, Kohonen has used the same type of gain and neighborhood functions for all of his map demonstrations. These demonstrations, which have been shown via an animated film at several major conferences, demonstrate maps learning the distribution in cases where 1) the dimensionality of the network topology and the input space mismatch, e.g., where the network is 2d and the distribution is a 3d 'cactus'; and 2) the distribution is not uniform. The algorithm was developed with these two cases in mind, so it is no surprise that the results are good for them as well. 3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation. That's right. And Kohonen usually uses the Euclidean distance metric, although other ones can be used (which he discusses in the book). Furthermore, there have been independent efforts to normalize weights in Kohonen maps so that the dot product measure can be used. If you have any doubts about the suitability of the Euclidean metric, as your question seems to imply, express them. It is an interesting issue. 4 I have not seen as yet in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited. The primary interest in maps, I believe, came from a desire to display high-dimensional information in low-dimensional spaces, which are more easily apprehended.
But there is evidence that there are other uses as well: 1) Kohonen has published results on using maps for phoneme recognition, where the topology-preservation plays a significant role (such maps are used in the Otaniemi Phonetic Typewriter featured in, I think, Computer magazine a year or two ago); 2) work has been done on using the topology to store sequential information, which seems to be a good idea if you are dealing with natural signals that can only temporally shift from a state to similar states; 3) several people have followed Kohonen's suggestion of using maps for adaptive kinematic representations for robot control (the work on Murphy, mentioned on this net a month or so ago, and the work being done at Carleton University by Darryl Graf are two good examples). In short, just look at some ICNN or INNS proceedings, and you'll find many papers where researchers found Kohonen maps to be a good place from which to launch their own studies. 5 Can the net become disordered after ordering is achieved at any particular iteration? Of course, this is theoretically possible, and is almost certain if at some point the distribution of the mapped function changes. But this brings up the difficult question: what is the proper ordering in such a case? Should a net try to integrate both past and present distributions, or should it throw away the past and concentrate on the present? I think most nn researchers would want a little of both, with maybe some kind of exponential decay in the weights. But in many applications of maps, there is no chance of the distribution changing: it is fixed, and iterations are over the same test data each time. In this case, I would guess that the ordering could not become disrupted (at least for simple distributions and a net of adequate size), but I realise that there is no proof of this, and the terms 'simple' and 'adequate' are lacking definition. But that's life in nnets for you! If anyone has any more questions, feel free.
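To make the Kohonen map discussion above concrete, here is a minimal sketch in Python. It is not from any of the postings: the linearly decaying gain and shrinking neighbourhood radius are illustrative assumptions in the spirit of the schedules described above, not Kohonen's published parameters, and the one-dimensional input is chosen only to keep the example short.

```python
import random

def train_som(n_nodes=10, n_iters=2000, seed=0):
    """Minimal 1-D Kohonen feature map learning a 1-D uniform distribution.

    Both the gain a(t) and the neighbourhood radius decrease (roughly
    linearly) to zero over the run, as discussed above. The exact
    schedules here are illustrative assumptions only.
    """
    rng = random.Random(seed)
    w = [rng.random() for _ in range(n_nodes)]  # one weight per node
    for t in range(n_iters):
        x = rng.random()                         # sample from U[0, 1]
        a = 0.5 * (1.0 - t / n_iters)            # linearly decaying gain
        radius = int((n_nodes // 2) * (1.0 - t / n_iters))
        # Winner: node whose weight is closest to the input
        # (a distance computation, not a dot product -- see question 3).
        c = min(range(n_nodes), key=lambda i: abs(x - w[i]))
        # Move the winner and its topological neighbours toward the input.
        for i in range(max(0, c - radius), min(n_nodes, c + radius + 1)):
            w[i] += a * (x - w[i])
    return w

w = train_som()
# The trained weights often end up (approximately) monotonically ordered
# along the chain, i.e. topologically ordered -- though, as noted in the
# thread, there is no general guarantee against disorder.
```

With higher-dimensional inputs the scalar weight per node becomes a weight vector and `abs(x - w[i])` becomes a Euclidean distance, but the structure of the loop is the same.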
Ron Chrisley Xerox PARC System Sciences Lab 3333 Coyote Hill Road Palo Alto, CA 94304 USA chrisley.pa at xerox.com tel: (415) 494-4728 OR New College Oxford OX1 3BN UK chrisley at vax.oxford.ac.uk tel: (865) 279-492 From moody-john at YALE.ARPA Tue Mar 21 16:11:08 1989 From: moody-john at YALE.ARPA (john moody) Date: Tue, 21 Mar 89 16:11:08 EST Subject: two research reports available Message-ID: <8903212107.AA03190@NEBULA.SUN3.CS.YALE.EDU> ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* FAST LEARNING IN MULTI-RESOLUTION HIERARCHIES John Moody Research Report YALEU/DCS/RR-681, February 1989 ABSTRACT A class of fast, supervised learning algorithms is presented. They use local representations, hashing, and multiple scales of resolution to approximate functions which are piecewise continuous. Inspired by Albus's CMAC model, the algorithms learn orders of magnitude more rapidly than typical implementations of back propagation, while often achieving comparable qualities of generalization. Furthermore, unlike most traditional function approximation methods, the algorithms are well suited for use in real-time adaptive signal processing. Unlike simpler adaptive systems, such as linear predictive coding, the adaptive linear combiner, and the Kalman filter, the new algorithms are capable of efficiently capturing the structure of complicated non-linear systems. As an illustration, the algorithm is applied to the prediction of a chaotic time series. NOTE: This research report will appear in Advances in Neural Information Processing Systems, edited by David Touretzky, to be published in April 1989 by Morgan Kaufmann Publishers, Inc. The author gratefully acknowledges financial support under ONR grant N00014-89-J-1228, ONR grant N00014-86-K-0310, AFOSR grant F49620-88-C0025, and a Purdue Army subcontract.
*********************************************************** FAST LEARNING IN NETWORKS OF LOCALLY-TUNED PROCESSING UNITS John Moody and Christian J. Darken Research Report YALEU/DCS/RR-654, October 1988, Revised March 1989 ABSTRACT We propose a network architecture which uses a single internal layer of locally-tuned processing units to learn both classification tasks and real-valued function approximations. We consider training such networks in a completely supervised manner, but abandon this approach in favor of a more computationally efficient hybrid learning method which combines self-organized and supervised learning. Our networks learn faster than back propagation for two reasons: the local representations ensure that only a few units respond to any given input, thus reducing computational overhead, and the hybrid learning rules are linear rather than nonlinear, thus leading to faster convergence. Unlike many existing methods for data analysis, our network architecture and learning rules are truly adaptive and are thus appropriate for real-time use. NOTE: This research report will appear in Neural Computation, a new journal edited by Terry Sejnowski and published by MIT Press. The work was supported by ONR grant N00014-86-K-0310, AFOSR grant F49620-88-C0025, and a Purdue Army subcontract.
*********************************************************** Copies of both reports can be obtained by sending a request to: Judy Terrell Yale Computer Science PO Box 2158 Yale Station New Haven, CT 06520 (203)432-1200 e-mail: terrell at cs.yale.edu terrell at yale.arpa terrell at yalecs.bitnet ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* ------- From chrisley.pa at Xerox.COM Thu Mar 23 14:35:00 1989 From: chrisley.pa at Xerox.COM (chrisley.pa@Xerox.COM) Date: 23 Mar 89 11:35 PST Subject: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST Message-ID: <890323-113527-4949@Xerox> One further note about Ananth Sankar's questions about Kohonen maps: A friend of mine, Tony Bell, tells me (and Ananth) that Helge Ritter has a "neat set of expressions for the learning rate and neighbourhood size parameters... and he also proves something about convergence elsewhere." Unfortunately, I do not as yet have a reference for the papers, but I have liked Ritter's work in the past, so I thought people on the net might be interested. From jose at tractatus.bellcore.com Wed Mar 22 10:44:09 1989 From: jose at tractatus.bellcore.com (Stephen J Hanson) Date: Wed, 22 Mar 89 10:44:09 EST Subject: technical report available Message-ID: <8903221544.AA14583@tractatus.bellcore.com> Princeton Cognitive Science Lab Technical Report: CSL36, February, 1989. COMPARING BIASES FOR MINIMAL NETWORK CONSTRUCTION WITH BACK-PROPAGATION Stephen Jos'e Hanson Bellcore and Princeton Cognitive Science Laboratory and Lorien Y. Pratt Rutgers University ABSTRACT Rumelhart (1987) has proposed a method for choosing minimal or "simple" representations during learning in Back-propagation networks.
This approach can be used to (a) dynamically select the number of hidden units, (b) construct a representation that is appropriate for the problem and (c) thus improve the generalization ability of Back-propagation networks. The method Rumelhart suggests involves adding penalty terms to the usual error function. In this paper we introduce Rumelhart's minimal networks idea and compare two possible biases on the weight search space. These biases are compared in both simple counting problems and a speech recognition problem. In general, the constrained search does seem to minimize the number of hidden units required, with an expected increase in local minima. To appear in Advances in Neural Information Processing Systems, D. Touretzky, Ed., 1989. Research was jointly sponsored by Princeton CSL and Bellcore. REQUESTS FOR THIS TECHNICAL REPORT SHOULD BE SENT TO laura at clarity.princeton.edu Please do not reply to this message or forward. Thank you. From lwyse at bucasb.BU.EDU Tue Mar 21 13:59:02 1989 From: lwyse at bucasb.BU.EDU (lwyse@bucasb.BU.EDU) Date: Tue, 21 Mar 89 13:59:02 EST Subject: questions on kohonen's maps In-Reply-To: connectionists@c.cs.cmu.edu's message of 20 Mar 89 23:47:09 GMT Message-ID: <8903211859.AA04927@cochlea.bu.edu> What does "ordering" mean when you're projecting inputs to a lower dimensional space? For example, for the "Peano"-type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain. In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes.
-lonce From gblee at CS.UCLA.EDU Fri Mar 24 13:25:07 1989 From: gblee at CS.UCLA.EDU (Geunbae Lee) Date: Fri, 24 Mar 89 10:25:07 PST Subject: questions on kohonen's maps Message-ID: <8903241825.AA25252@maui.cs.ucla.edu> >What does "ordering" mean when you're projecting inputs to a lower dimensional >space?
It means topological ordering. >For example, for the "Peano"-type curves that result from a one-D >neighborhood learning a 2-D input distribution, it is obviously NOT >true that nearby points in the input space maximally activate nearby >points on the neighborhood chain. It depends on what you mean by "nearby". If it is nearby in a relative sense (a topological relation), not an absolute sense, then nearby points in the input space DO maximally activate nearby points on the neighborhood chain. --Geunbae Lee AI Lab, UCLA From LIN2 at ibm.com Fri Mar 24 15:02:32 1989 From: LIN2 at ibm.com (Ralph Linsker) Date: 24 Mar 89 15:02:32 EST Subject: Technical report available Message-ID: <032489.150233.lin2@ibm.com> ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* The following report (IBM Research Report RC 14195, Nov. 1988) is available upon request to: lin2 @ ibm.com It will appear in: Advances in Neural Information Processing Systems 1, ed. D. S. Touretzky (San Mateo, CA: Morgan Kaufmann), April 1989. "An Application of the Principle of Maximum Information Preservation to Linear Systems," Ralph Linsker This paper addresses the problem of determining the weights for a set of linear filters (model "cells") so as to maximize the ensemble-averaged information that the cells' output values jointly convey about their input values, given the statistical properties of the ensemble of input vectors. The quantity that is maximized is the Shannon information rate, or equivalently the average mutual information between input and output.* Several models for the role of processing noise are analyzed, and the biological motivation for considering them is described.
For simple models in which nearby input signal values (in space or time) are correlated, the cells resulting from this optimization process include center-surround cells and cells sensitive to temporal variations in input signal. *The possible relation between this optimization principle and the organization of a sensory processing system is discussed in: R. Linsker, Computer 21(3):105-117 (March 1988). If you would like a reprint of the Computer article, please so note. From chrisley.pa at Xerox.COM Fri Mar 24 17:53:00 1989 From: chrisley.pa at Xerox.COM (chrisley.pa@Xerox.COM) Date: 24 Mar 89 14:53 PST Subject: questions on kohonen's maps In-Reply-To: lwyse@bucasb.BU.EDU's message of Tue, 21 Mar 89 13:59:02 EST Message-ID: <890324-145332-8519@Xerox> Lonce (lwyse at bucasb.BU.EDU) writes: "What does "ordering" mean when your projecting inputs to a lower dimensional space? For example, the "Peano" type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain." It is not true that nearby points in input space are always mapped to nearby points in the output space when the mapping is dimensionality reducing, agreed. But 'ordering' still makes sense. The map is topology-preserving if the dependency is in the other direction, i.e., if nearby points in output space are always activated by nearby points in input space. Lonce goes on to say: "In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes." I agree that topology preservation is not necessarily of utmost importance, but it may be useful in some applications, such as the ones I mentioned a few messages back (phoneme recognition, inverse kinematics, etc.).
Also, there is 1) the interest in properties of self-organizing systems in themselves, even though an application can't be immediately found; and 2) the observation that for some reason the brain seems to use topology preserving maps (with the one-way dependency I mentioned above), which, although they *could* be computationally unnecessary or even disadvantageous, are probably in fact, nature being what she is, good solutions to tough real time problems. Ron Chrisley After April 14th, please send personal email to Chrisley at vax.ox.ac.uk From ken at phyb.ucsf.EDU Sun Mar 26 01:17:59 1989 From: ken at phyb.ucsf.EDU (Ken Miller) Date: Sat, 25 Mar 89 22:17:59 pst Subject: Normalization of weights in Kohonen algorithm Message-ID: <8903260617.AA08352@phyb> re point 3 of recent posting about Kohonen algorithm: "3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights." the algorithm

du_{ij}/dt = a(t)[e_j(t) - u_{ij}(t)], i in N_c

where u = weights, e is the input pattern, and N_c is the topological neighborhood of the maximally responding cell, should, I believe, be written

du_{ij}/dt = a(t)[ e_j(t)/\sum_k(e_k(t)) - u_{ij}(t)/\sum_k(u_{ik}(t)) ], i in N_c.

That is, the change should be such as to move the jth synaptic weight on the ith cell, as a PROPORTION of all the synaptic weights on the ith cell, in the direction of matching the PROPORTION of input which was incoming on the jth line. Note that in this case \sum_j du_{ij}(t)/dt = 0, so weights remain normalized in the sense that the sum over each cell remains constant. If inputs are normalized to sum to 1 (\sum_k(e_k(t)) = 1) then the first denominator can be omitted. If weights begin normalized to sum to 1 on each cell ( \sum_k(u_{ik}(t)) = 1 for all i) then weights will remain normalized to sum to 1, hence the second denominator can be omitted. Perhaps Kohonen was assuming these normalizations and hence dispensing with the denominators?
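The normalized update Ken Miller suggests above can be written out directly. The sketch below is my own rendering (discrete time steps, list-of-lists weights; the function name is invented for illustration); it exists only to check the conservation property he points out, namely that for any cell in the neighborhood the proportional changes sum to zero over j, so each cell's total weight is unchanged:

```python
def normalized_som_step(u, e, a, neighborhood):
    """One discrete step of the update suggested above:

        du_ij/dt = a(t) [ e_j / sum_k e_k  -  u_ij / sum_k u_ik ],  i in N_c

    u: list of weight rows (one row per cell), e: input pattern,
    a: gain, neighborhood: set of cell indices in N_c.
    Returns the updated weights; cells outside N_c are unchanged.
    """
    e_sum = sum(e)
    new_u = []
    for i, row in enumerate(u):
        if i in neighborhood:
            r_sum = sum(row)
            # sum_j of the bracketed term is (1 - 1) = 0, so the
            # per-cell weight sum is conserved exactly.
            new_u.append([u_ij + a * (e_j / e_sum - u_ij / r_sum)
                          for u_ij, e_j in zip(row, e)])
        else:
            new_u.append(list(row))
    return new_u
```

If both the inputs and each cell's weights start normalized to sum to 1, the two denominators reduce to 1 and the rule collapses to the form in Kohonen's book, which is presumably the assumption Miller conjectures.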
ken miller (ken at phyb.ucsf.edu) From nowlan at ai.toronto.edu Tue Mar 28 09:41:36 1989 From: nowlan at ai.toronto.edu (Steven J. Nowlan) Date: Tue, 28 Mar 89 09:41:36 EST Subject: training time in HMM and CN Message-ID: <89Mar28.094139est.10529@ephemeral.ai.toronto.edu> Two comments on Thanasis' post on the relative training speed of HMM vs CN for sequential problems such as speech recognition: 1. The BF algorithm is quite highly optimized, while vanilla BP doesn't implement anything that a numerical analyst would consider a real descent procedure (not even steepest descent). If you were to use a reasonably powerful numerical optimization technique, such as one of the Broyden methods, you may find CN convergence extremely fast. Ray Watrous has in fact shown this sort of speedup for speech problems [1]. 2. A more subtle, but probably more important, difference is the issue of how targets are specified over an input sequence. The BF algorithm specifies targets for intermediate steps in an input sequence based on expectations of the final outcome of that sequence collected from many similar sequences. It is not clear how to specify output targets for intermediate points of an input sequence in a CN, although Watrous has shown that intelligent choice of such targets can markedly improve CN convergence and performance. Of interest in this regard is the work by Sutton on Temporal Difference methods [2]. One can view this work as specifying a target function over a sequence in a dynamical way, so that the target function reflects the experience of the system to date in a clever way. Sutton [2] has shown an equivalence between one form of linear TD method and the maximum likelihood estimates of the parameters for an absorbing Markov chain model of the same process. This seems much closer in flavour to what the BF algorithm is doing, and when applied to a non-linear system may in fact be an interesting generalization of BF.
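As a footnote to the cross-entropy thread running through these messages (the Solla/Levin/Fleisher comparison of cross entropy and square error, and Hinton's remark that likelihood is maximized when cross entropy is minimized): for a single sigmoid output unit, a standard derivation shows one reason cross entropy can accelerate learning. The sketch below is illustrative and not taken from any of the cited reports:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_squared_error(z, t):
    """d/dz of 0.5*(y - t)^2 for y = sigmoid(z).

    The chain rule introduces a y*(1 - y) factor, which vanishes when
    the unit saturates -- even if the unit is completely wrong.
    """
    y = sigmoid(z)
    return (y - t) * y * (1.0 - y)

def grad_cross_entropy(z, t):
    """d/dz of -[t*log(y) + (1-t)*log(1-y)] for y = sigmoid(z).

    The y*(1 - y) factor cancels, leaving simply (y - t), so the
    gradient stays large whenever the output is far from the target.
    """
    y = sigmoid(z)
    return y - t

# A saturated, badly wrong unit (net input z = -5, target t = 1):
# the squared-error gradient is tiny, the cross-entropy gradient is not.
g_se = grad_squared_error(-5.0, 1.0)
g_ce = grad_cross_entropy(-5.0, 1.0)
```

This is one concrete way of seeing the "accelerated learning" effect studied numerically in the Solla, Levin and Fleisher paper mentioned earlier in the thread.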
Comments and requests for clarifications should be directed to me, not to Connectionists, please. - Steve Nowlan nowlan at ai.toronto.edu References: [1] Watrous, Raymond L. "Speech Recognition Using Connectionist Networks", TR MS-CIS-88-96, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, 1988. [2] Sutton, Richard S. "Learning to Predict by the Methods of Temporal Differences", GTE Technical Report TR87-509.1, GTE Laboratories Inc., Waltham, Mass., 1987. From cfields at NMSU.Edu Tue Mar 28 19:56:24 1989 From: cfields at NMSU.Edu (cfields@NMSU.Edu) Date: Tue, 28 Mar 89 17:56:24 MST Subject: No subject Message-ID: <8903290056.AA14581@NMSU.Edu> Call for Participants / Call for Abstracts Symbolic Problem Solving in Noisy, Novel, and Uncertain Task Environments 20-21 August, 1989 (tentative), Detroit, MI, USA An IJCAI-89 Workshop, Sponsored by AAAI Goals. Brittleness in the face of noise, novelty, and uncertainty is a well-known failing of symbolic problem solvers. The goals of this Workshop are to characterize the features of task environments that cause brittleness, to investigate mechanisms for decreasing the brittleness of symbolic problem solvers, and to review case histories of implemented systems that function in task environments high in noise, novelty, and data of uncertain relevance. Topics of interest for the Workshop include the following. Analysis of task environments: Definitions of noise, novelty, and uncertain relevance; exploration of related concepts in general systems theory or logic; parameters for characterizing task environments; knowledge engineering strategies. Mechanisms for addressing noise and novelty: Plasticity and learning; constructive problem solving; fragmentation of knowledge structures; dynamic modification of rules, schemata, or cases; coherence maintenance; adaptive control mechanisms.
Representations: Data structures allowing dynamic abstraction and modification; representation of ``unstructured'' knowledge; knowledge implicit in control or learning procedures; ordering of knowledge structures; tradeoffs between explicit and implicit knowledge representation. Implementation issues: Implementing symbolic problem solvers on parallel machines; concurrency control strategies; integrating symbolic systems with artificial neural networks; general systems integration. Researchers interested in participating in the Workshop are invited to submit abstracts describing work in any of these topic areas. Format. All participants will present their current work, either as a brief oral report or as a poster. Most presentations will be posters, as these provide the greatest opportunity for presentation and discussion of technical details. Presentations will be on the first day of the Workshop, followed by discussions in working groups organized by application domain and a panel discussion on the second day. Attendance at IJCAI Workshops is limited to fifty participants. Participants not registered for IJCAI must pay a $50/day fee. Abstract Submission. Please submit a 1 page abstract of the work to be presented, together with a cover letter summarizing previous work in relevant areas and expected contribution to the Workshop, to Mike Coombs, Box 30001/3CRL, New Mexico State University, Las Cruces, NM 88003-0001 USA, by 15 May 1989. Authors will be notified as to acceptance by 1 June 1989. Accepted abstracts will be distributed at the Workshop. A volume collecting selected papers from the Workshop is planned; papers for this volume will be solicited at the Workshop. Organizers. Mike Coombs and Chris Fields (NMSU), Russ Frew (GE), David Goldberg (Alabama), Jim Reggia (Maryland). Points of contact: Mike Coombs, 505-646-5757, mcoombs at nmsu.edu; Chris Fields, 505-646-2848, cfields at nmsu.edu. 
From elman%amos at ucsd.edu Wed Mar 29 00:30:44 1989 From: elman%amos at ucsd.edu (Jeff Elman) Date: Tue, 28 Mar 89 21:30:44 PST Subject: 1990 Connectionist Summer School announcement Message-ID: <8903290530.AA23241@amos.UCSD.EDU> March 28, 1989 PRELIMINARY ANNOUNCEMENT CONNECTIONIST SUMMER SCHOOL / SUMMER 1990 UCSD La Jolla, California The next Connectionist Summer School will be held at the University of California, San Diego in June 1990. This will be the third session in the series, which was held at Carnegie-Mellon in the summers of 1986 and 1988. The summer school will offer courses in a variety of areas of connectionist modelling, with emphasis on computational neuroscience, cognitive models, and hardware implementation. In addition to full courses, there will be a series of shorter tutorials, colloquia, and public lectures. Proceedings of the summer school will be published the following fall. As in the past, participation will be limited to graduate students enrolled in PhD programs (full- or part-time). Admission will be on a competitive basis. We hope to have sufficient funding to subsidize tuition and housing. THIS IS A PRELIMINARY ANNOUNCEMENT. Further details will be announced over the next several months. Terry Sejnowski Jeff Elman UCSD/Salk UCSD Geoff Hinton Dave Touretzky Toronto CMU hinton at ai.toronto.edu touretzky at cs.cmu.edu From niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Wed Mar 29 09:17:49 1989 From: niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (Mahesan Niranjan) Date: Wed, 29 Mar 89 09:17:49 BST Subject: Missing link etc... Message-ID: <23751.8903290817@dsl.eng.cam.ac.uk> Some recent papers and postings on this network compare HMMs and multi-layer neural networks. Here is something I find missing in these discussions. In speech pattern processing, HMMs make an inherent assumption about the time series: that it can be chopped up into a sequence of piecewise stationary regions.
Thus, an HMM places break-points in the transition regions of the signal and models the steady regions by the statistical parameters of individual states. For speech signals, this is a bad assumption (human speech production is not at all like this) - but the recognisers somehow seem to work!! In neural networks (with or without feedback), what is the equivalent assumption about the time evolution of the signal? niranjan From ersoy at ee.ecn.purdue.edu Wed Mar 29 12:22:20 1989 From: ersoy at ee.ecn.purdue.edu (Okan K Ersoy) Date: Wed, 29 Mar 89 12:22:20 EST Subject: No subject Message-ID: <8903291722.AA07623@ee.ecn.purdue.edu> CALL FOR PAPERS AND REFEREES HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES - 23 NEURAL NETWORKS AND RELATED EMERGING TECHNOLOGIES KAILUA-KONA, HAWAII - JANUARY 3-6, 1990 The Neural Networks Track of HICSS-23 will contain a special set of papers focusing on a broad selection of topics in the area of Neural Networks and Related Emerging Technologies. The presentations will provide a forum to discuss new advances in learning theory, associative memory, self-organization, architectures, implementations and applications. Papers are invited that may be theoretical, conceptual, tutorial or descriptive in nature. Those papers selected for presentation will appear in the Conference Proceedings, which is published by the Computer Society of the IEEE. HICSS-23 is sponsored by the University of Hawaii in cooperation with the ACM, the Computer Society, and the Pacific Research Institute for Information Systems and Management (PRIISM). Submissions are solicited in: Supervised and Unsupervised Learning Associative Memory Self-Organization Architectures Optical, Electronic and Other Novel Implementations Optimization Signal/Image Processing and Understanding Novel Applications INSTRUCTIONS FOR SUBMITTING PAPERS Manuscripts should be 22-26 typewritten, double-spaced pages in length. Do not send submissions that are significantly shorter or longer than this.
Papers must not have been previously presented or published, nor currently submitted for journal publication. Each manuscript will be put through a rigorous refereeing process. Manuscripts should have a title page that includes the title of the paper, full name of its author(s), affiliation(s), complete physical and electronic address(es), telephone number(s) and a 300-word abstract of the paper. DEADLINES Six copies of the manuscript are due by June 10, 1989. Notification of accepted papers by September 1, 1989. Accepted manuscripts, camera-ready, are due by October 3, 1989. SEND SUBMISSIONS AND QUESTIONS TO O. K. Ersoy H. H. Szu Purdue University Naval Research Laboratories School of Electrical Engineering Code 5709 W. Lafayette, IN 47907 4555 Overlook Ave., SE (317) 494-6162 Washington, DC 20375 E-Mail: ersoy at ee.ecn.purdue (202) 767-2407 From lina at wheaties.ai.mit.edu Wed Mar 29 13:23:33 1989 From: lina at wheaties.ai.mit.edu (Lina Massone) Date: Wed, 29 Mar 89 13:23:33 EST Subject: No subject Message-ID: <8903291823.AA09549@gelatinosa.ai.mit.edu> ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* TECHNICAL REPORT AVAILABLE A NEURAL NETWORK MODEL FOR LIMB TRAJECTORY FORMATION Lina Massone and Emilio Bizzi Dept. of Brain and Cognitive Sciences Massachusetts Institute of Technology This paper deals with the problem of representing and generating unconstrained aiming movements of a limb by means of a neural network architecture. The network produced a time trajectory of a limb from a starting posture toward a target specified by a sensory stimulus. Thus the network performed a sensory-motor transformation. The experimenters imposed a bell-shaped velocity profile on the trajectory. This type of profile is characteristic of most movements performed by biological systems. We investigated the generalization capabilities of the network as well as its internal organization.
Experiments performed during learning and on the trained network showed that: (i) the task could be learned by a three-layer sequential network; (ii) the network successfully generalized in trajectory space and adjusted the velocity profiles properly; (iii) the same task could not be learned by a linear network; (iv) after learning, the internal connections became organized into inhibitory and excitatory zones and encoded the main features of the training set; (v) the model was robust to noise on the input signals; (vi) the network exhibited attractor-dynamics properties; (vii) the network was able to solve the motor-equivalence problem. A key feature of this work is the fact that the neural network was coupled to a mechanical model of a limb in which muscles are represented as springs. With this representation the model solved the problem of motor redundancy. A short version of this paper covering only part of the described research was mailed in February to IJCNN. The full report has been submitted to Biological Cybernetics. All requests should be addressed to: lina at wheaties.ai.mit.edu From marchman%amos at ucsd.edu Wed Mar 29 19:20:36 1989 From: marchman%amos at ucsd.edu (Virginia Marchman) Date: Wed, 29 Mar 89 16:20:36 PST Subject: Technical Report Available Message-ID: <8903300020.AA01129@amos.UCSD.EDU> The following Technical Report (#8902) is available from the Center for Research in Language. (Please do not forward.) ******************************************************************* Pattern Association in a Back Propagation Network: Implications for Child Language Acquisition Kim Plunkett Virginia Marchman University of Aarhus, Denmark University of California, San Diego Abstract A 3-layer back propagation network is used to implement a pattern association task which learns mappings that are analogous to the present and past tense forms of English verbs, i.e., arbitrary, identity, vowel change, and suffixation mappings. 
The degree of correspondence between connectionist models of tasks of this type (Rumelhart & McClelland, 1986; 1987) and children's acquisition of inflectional morphology has recently been highlighted in discussions of the general applicability of PDP to the study of human cognition and language (Pinker & Mehler, 1988). In this paper, we attempt to eliminate many of the shortcomings of the R&M work and adopt an empirical, comparative approach to the analysis of learning (i.e., hit rate and error type) in these networks. In all of our simulations, the network is given a constant 'diet' of input stems -- that is, discontinuities are not introduced into the learning set at any point. Four sets of simulations are described in which input conditions (class size and token frequency) and the presence/absence of phonological subregularities are manipulated. First, baseline simulations chart the initial computational constraints of the system and reveal complex "competition effects" when the four verb classes must be learned simultaneously. Next, we explore the nature of these competitions given different type (class sizes) and token frequencies (# of repetitions). Several hypotheses about input to children are tested, from dictionary counts and production corpora. Results suggest that relative class size determines which "default" transformation is employed by the network, as well as the frequency of overgeneralization errors (both "pure" and "blended" overgeneralizations). A third series of simulations manipulates token frequency within a constant class size, searching for the set of token frequencies which results in "adult-like competence" and "child-like" errors across learning. A final series investigates the addition of phonological sub-regularities into the identity and vowel change classes. Phonological cues are clearly exploited by the system, leading to overall improved performance. 
However, overgeneralizations, U-shaped learning and competition effects continue to be observed in similar conditions. These models establish that input configuration plays a role in determining the types of errors produced by the network - including the conditions under which "rule-like" behavior and "U-shaped" development will and will not emerge. The results are discussed with reference to behavioral data on children's acquisition of the past tense and the validity of drawing conclusions about the acquisition of language from models of this sort. ***************************************************************** Please send requests for hard copy to: yvonne at amos.ucsd.edu or Center for Research in Language C-008 University of California, San Diego La Jolla, CA 92093 Attn: Yvonne -- Virginia Marchman (marchman at amos.ucsd.edu) Kim Plunkett (psykimp at dkarh02.bitnet) From sankar at caip.rutgers.edu Fri Mar 31 15:14:12 1989 From: sankar at caip.rutgers.edu (ananth sankar) Date: Fri, 31 Mar 89 15:14:12 EST Subject: KOHONEN MAPS Message-ID: <8903312014.AA03080@caip.rutgers.edu> I had initiated a discussion on Kohonen's maps two weeks ago, and apart from the many replies I (and many others??) received, there were requests that I post the responses. It would be a good idea to go through this material and then discuss again.
>From pastor at prc.unisys.com Thu Mar 16 16:58:47 1989 Received: from PRC-GW.PRC.UNISYS.COM by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA03401; Thu, 16 Mar 89 16:58:40 EST Received: from bigburd.PRC.Unisys.COM by burdvax.PRC.Unisys.COM (5.61/Domain/jpb/2.9) id AA11739; Thu, 16 Mar 89 16:58:28 -0500 Received: by bigburd.PRC.Unisys.COM (5.61/Domain/jpb/2.9) id AA24449; Thu, 16 Mar 89 16:58:23 -0500 From: pastor at prc.unisys.com (Jon Pastor) Message-Id: <8903162158.AA24449 at bigburd.PRC.Unisys.COM> Received: from Xerox143 by bigburd.PRC.Unisys.COM with PUP; Thu, 16 Mar 89 16:58 EST To: ananth sankar Date: 16 Mar 89 16:56 EST (Thursday) Subject: Re: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST To: ananth sankar Cc: pastor at bigburd.prc.unisys.com Status: R I am in the process of implementing a Kohonen-style system, and if I actually get it running and obtain any results I'll let you know. If you get any responses, please let me know. Thanks. >From Connectionists-Request at q.cs.cmu.edu Thu Mar 16 16:59:58 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA03426; Thu, 16 Mar 89 16:59:52 EST Received: from cs.cmu.edu by Q.CS.CMU.EDU id aa11454; 16 Mar 89 9:44:34 EST Received: from CAIP.RUTGERS.EDU by CS.CMU.EDU; 16 Mar 89 09:42:55 EST Received: by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA14983; Thu, 16 Mar 89 09:42:44 EST Date: Thu, 16 Mar 89 09:42:44 EST From: ananth sankar Message-Id: <8903161442.AA14983 at caip.rutgers.edu> To: connectionists at cs.cmu.edu Subject: questions on kohonen's maps Status: R I am interested in the subject of Self Organization and have some questions with regard to Kohonen's algorithm for Self Organizing Feature Maps. I have tried to duplicate the results of Kohonen for the two dimensional uniform input case i.e. two inputs. I used a 10 X 10 output grid. The maps that resulted were not as good as reported in the papers. 
Questions: 1 Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach. 2 Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features. 3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation. 4 I have not yet seen in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited. 5 Can the net become disordered after ordering is achieved at any particular iteration? I would appreciate any comments, suggestions etc. on the above. Also, so that net mail clutter may be reduced, please respond to sankar at caip.rutgers.edu Thank you. Ananth Sankar Department of Electrical Engineering Rutgers University, NJ >From regier at cogsci.berkeley.edu Thu Mar 16 17:07:20 1989 Received: from cogsci.Berkeley.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA03562; Thu, 16 Mar 89 17:07:16 EST Received: by cogsci.berkeley.edu (5.61/1.29) id AA13666; Thu, 16 Mar 89 14:07:18 -0800 Date: Thu, 16 Mar 89 14:07:18 -0800 From: regier at cogsci.berkeley.edu (Terry Regier) Message-Id: <8903162207.AA13666 at cogsci.berkeley.edu> To: sankar at caip.rutgers.edu Subject: Kohonen request Status: R Hi, I'm interested in the responses to your recent Kohonen posting on Connectionists. Do you suppose you could post the results once all the replies are in?
Thanks, -- Terry >From ken at phyb.ucsf.edu Thu Mar 16 20:11:35 1989 Received: from cgl.ucsf.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA09101; Thu, 16 Mar 89 20:11:32 EST Received: from phyb.ucsf.EDU by cgl.ucsf.EDU (5.59/GSC4.15) id AA01036; Thu, 16 Mar 89 17:11:23 PST Received: by phyb (1.2/GSC4.15) id AA11601; Thu, 16 Mar 89 17:11:17 pst Date: Thu, 16 Mar 89 17:11:17 pst From: ken at phyb.ucsf.edu (Ken Miller) Message-Id: <8903170111.AA11601 at phyb> To: sankar at caip.rutgers.edu Subject: kohonen Status: R re your point 3: the algorithm du_{ij}/dt = a(t)[e_j(t) - u_{ij}(t)], i in N_c where u = weights, e is input pattern, N_c is topological neighborhood of the maximally responding unit, should actually be written du_{ij}/dt = a(t)[ e_j(t)/\sum_k(e_k(t)) - u_{ij}(t)/\sum_k(u_{ik}(t)) ], i in N_c. That is, the change should be such as to move the jth synaptic weight on the ith cell, as a PROPORTION of all the synaptic weights on the ith cell, in the direction of matching the PROPORTION of input which was incoming on the jth line. Note that in this case \sum_j du_{ij}(t)/dt = 0, so weights remain normalized in the sense that the sum over each cell remains constant. If you normalize your inputs to sum to 1 (\sum_k(e_k(t)) = 1) and start with weights normalized to sum to 1 on each cell (\sum_k(u_{ik}(t)) = 1 for all i) then weights will remain normalized to sum to 1, hence the two sums in the denominators are both just = 1 and can be left out. Kohonen was, I believe, assuming these normalizations and hence dispensing with the sums. ken miller (ken at phyb.ucsf.edu) ucsf dept.
of physiology >From tds at wheaties.ai.mit.edu Thu Mar 16 23:26:42 1989 Received: from life.ai.mit.edu by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA12489; Thu, 16 Mar 89 23:26:39 EST Received: from mauriac.ai.mit.edu by life.ai.mit.edu; Thu, 16 Mar 89 22:48:15 EST Received: from localhost by mauriac.ai.mit.edu; Thu, 16 Mar 89 22:48:06 est Date: Thu, 16 Mar 89 22:48:06 est From: tds at wheaties.ai.mit.edu Message-Id: <8903170348.AA19015 at mauriac.ai.mit.edu> To: sankar at caip.rutgers.edu Subject: Kohonen maps Status: R I share some of your confusion about Kohonen maps. My main question is #4: are they really doing anything useful? The mapping demonstrated in Kohonen's 1982 paper (Biol. Cyb.) only shows mappings from a 2D manifold in 3-space onto a two-dimensionally arranged set of units. The book talks about dimensionality issues in more detail, but so far as I can tell what the network does (after training) is to map three numbers into about 100 numbers. Since the mapping is linear, I don't see how anything at all is gained. If the network is unable to generate an ordering, it may be one way to tell if the data does not lie on a 2D manifold. But there are many other ways to do this that are more efficient! Also, this is not robust if the manifold folds back on itself (so that two distinct points on the surface are in the same direction from the origin). 
Let me know if you find out the true significance of this widely-known work, Terry >From lwyse at bucasb.bu.edu Fri Mar 17 17:42:18 1989 Received: from BU-IT.BU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA05821; Fri, 17 Mar 89 17:42:12 EST Received: from COCHLEA.BU.EDU by bu-it.BU.EDU (5.58/4.7) id AA17739; Fri, 17 Mar 89 17:38:02 EST Received: by cochlea.bu.edu (4.0/4.7) id AA02692; Fri, 17 Mar 89 17:38:21 EST Date: Fri, 17 Mar 89 17:38:21 EST From: lwyse at bucasb.bu.edu Message-Id: <8903172238.AA02692 at cochlea.bu.edu> To: sankar at caip.rutgers.edu Subject: re:questions on Kohonen maps Status: R I would be surprised if there were some analytical expression for the neighborhood and gain functions that was useful in practical applications. I have found different "best functions" for different input vector distributions, initial weight distributions, etc. A related question to yours: What does "ordering" mean when mapping across different dimensional spaces? An excerpt from a report on my experiences with Kohonen maps: When the input space and the neighborhood space of the weight vectors are of different dimension, however, what "ordered" means becomes a sticky wicket. For example, in Fig. 5.17, Kohonen shows a one-dimensional neighborhood of weight vectors approximating a triangular distribution of inputs with what he terms a "Peano-like" curve. But this type of curve folds in on itself in an attempt to fill the space, and thus moves points that may be far from each other in their one-D neighborhood to be maximally responsive to very close input points. Is this "ordered"? He doesn't seem to address this point directly. A point I would like to bring out is that in these situations where the dimension of the input space and the dimension of the neighborhood differ, whether or not the weight-vector chain crosses itself is {\em not} necessarily the important metric for measuring the ability of the weights to approximate the input space.
That is, there is not necessarily a correlation between neighborhood-chain crossings and the mean squared error of the weight vector approximations of the input points. It is true, however, that if the neighborhood chain crosses itself, then {\em there exists} a better approximation to the input space. -lonce >From risto at cs.ucla.edu Sat Mar 18 02:59:46 1989 Received: from Oahu.CS.UCLA.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA14191; Sat, 18 Mar 89 02:59:35 EST Return-Path: Received: by oahu.cs.ucla.edu (Sendmail 5.59/2.16) id AA02486; Fri, 17 Mar 89 23:14:45 PST Date: Fri, 17 Mar 89 23:14:45 PST From: risto at cs.ucla.edu (Risto Miikkulainen) Message-Id: <8903180714.AA02486 at oahu.cs.ucla.edu> To: sankar at caip.rutgers.edu In-Reply-To: ananth sankar's message of Thu, 16 Mar 89 09:42:44 EST <8903161442.AA14983 at caip.rutgers.edu> Subject: questions on kohonen's maps Reply-To: risto at cs.ucla.edu Organization: UCLA Computer Science Department Physical-Address: 3677 Boelter Hall Status: R Date: Thu, 16 Mar 89 09:42:44 EST From: ananth sankar 1 Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach. The trick is to start with a neighborhood large enough. For 10x10, a radius of 8 units might be appropriate. Then reduce the radius gradually (e.g. over a few thousand inputs) to 1 or even to 0. 3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation. True. The original idea was to form the "activity bubble" with lateral inhibition and change the weights by "redistribution of synaptic resources".
This neurologically plausible algorithm gave way to an abstraction which uses distance, global selection and difference. (I did some work comparing these two algorithms; I can send you the tech report if you want to look at it. At least it has the parameters that work.) 5 Can the net become disordered after ordering is achieved at any particular iteration? Kohonen proved (in ch 5) that this cannot happen (in the 1-d case) for the abstract algorithm. This is a big problem for the biologically plausible algorithm though. >From djb at flash.bellcore.com Sat Mar 18 23:38:41 1989 Received: from flash.bellcore.com by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA27190; Sat, 18 Mar 89 23:38:32 EST Received: by flash.bellcore.com (5.58/1.1) id AA06742; Sat, 18 Mar 89 23:38:10 EST Date: Sat, 18 Mar 89 23:38:10 EST From: djb at flash.bellcore.com (David J Burr) Message-Id: <8903190438.AA06742 at flash.bellcore.com> To: sankar at caip.rutgers.edu Subject: Feature Map Learning Status: R Your questions regarding the feature map algorithm are ones that have also concerned me. I have been experimenting with a form of this elastic mapping algorithm since about 1979. My early experiments were focussed on using such an adaptive process to map handwritten characters onto reference characters in an attempt to automate a form of elastic template matching. The algorithm I came up with was one which used nearest neighbor "attractors" to "pull" an elastic map into shape by an iterative process. I defined a window or smoothing kernel which had a Gaussian shape, as opposed to the box shape commonly used in self-organized mapping. My algorithm resembled the Kohonen feature map classifier that you referred to in your email. The Gaussian kernel has advantages over the box kernel in that aliasing distortion can be reduced. This is similar to the use of Hamming windows in the design of fast Fourier transforms.
With regard to your first and second questions, we have found that the actual window size and gain parameters can take on a number of different schedule shapes and give similar results. It is important that window size decrease very gradually to avoid too early commitment to a particular vector. This is particularly important in the mapping of highly distorted characters, where a rapid schedule could cause a feature in one character to map to the "wrong" feature in the reference character. Gaussian windows were the choice for that problem, since they guaranteed very smooth maps. You are right that a parameter schedule that works for one problem may be poorly suited to a different problem. We have recently applied the feature map model to the traveling salesman problem and reported some of our results at ICNN-88. A one-dimensional version of the elastic map (a rubber band) seems best suited to this problem. We found that there was a particular analytic form of the gain schedule which worked well for this problem. Window size, on the other hand, seemed to benefit best from a feedback schedule in which the degree of progress toward the solution served as input to set an appropriate window size. I have results studying some 700 different learning trials on 30-100 city problems using this method. Performance is considerably better than the Hopfield-Tank solution. Yes, it seems as though one needs distance calculation as the input for this model, rather than the dot product as used in back-propagation nets. I would be happy to mail you some papers describing my implementation of the feature map learning model. The first article appeared in Computer Graphics and Image Processing Journal, 1981, entitled "A Dynamic Model for Image Registration". The recent work on traveling salesman was also reported at last year's Snowbird meeting in addition to ICNN-88. Please feel free to correspond with me as I consider this a very interesting topic. Best Wishes, D. J.
Burr djb at bellcore.com >From @relay.cs.net:tony at ifi.unizh.ch Mon Mar 20 03:12:51 1989 Received: from RELAY.CS.NET by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA02795; Mon, 20 Mar 89 03:12:46 EST Received: from relay2.cs.net by RELAY.CS.NET id ab08738; 20 Mar 89 4:55 EST Received: from switzerland by RELAY.CS.NET id ae29120; 20 Mar 89 4:48 EST Received: from ean by scsult.SWITZERLAND.CSNET id a011717; 20 Mar 89 9:45 WET Date: 19 Mar 89 21:45 +0100 From: tony bell To: sankar at caip.rutgers.edu Mmdf-Warning: Parse error in original version of preceding line at SWITZERLAND.CSNET Message-Id: <342:tony at ifi.unizh.ch> Subject: Top Maps Status: R You should see Ritter & Schulten's paper in the IEEE ICNN proceedings 1988 (San Diego) for expressions answering question 1. Another paper from Helge Ritter deals with the convergence properties. This was submitted to Biol. Cybernetics, but maybe you should write to him at the University of Illinois, where he is now. Tony Bell, Univ of Zurich >From djb at flash.bellcore.com Mon Mar 20 17:51:22 1989 Received: from flash.bellcore.com by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA18086; Mon, 20 Mar 89 17:51:14 EST Received: by flash.bellcore.com (5.58/1.1) id AA25760; Mon, 20 Mar 89 17:51:18 EST Date: Mon, 20 Mar 89 17:51:18 EST From: djb at flash.bellcore.com (David J Burr) Message-Id: <8903202251.AA25760 at flash.bellcore.com> To: sankar at caip.rutgers.edu Subject: Self-Organized Mapping Status: R There has been interest on the net recently in some of the questions that you posed in your recent mail. I have personally received comments regarding the neighborhood functions and whether there is an appropriate analytic form. My comments were summarized in my recent mailing to you. If you get additional responses, I would certainly appreciate hearing about peoples' experiences. Would you consider posting a summary to the net? I did not comment on your questions 4 and 5.
It seems that the neighbors-matching-to-neighbors observation comes about as a result rather than an input constraint. In my 1981 paper on elastic matching of images I used a more extended pattern matcher (area template instead of a point-to-point nearest neighbor) for gray scale images. This tended to enforce the constraint that you observed at the input level. Unfortunately, I am not sure what its generalization would be for non-image patterns (N-D instead of 2-D). I have done all my experiments on elastic mapping of fixed patterns as opposed to point distributions. There was no problem of a map being undone after it converged. Have you had such problems with your speech data? I have been told that when the distributions are stochastic or sampled, there is an even stronger need to proceed slowly. Apparently one sampled point can pull the map in one direction, and this must be counterbalanced by opposing samples pulling the other way to maintain stability of the map. This unfortunately takes lots of computer cycles. Hoping to hear from you. Dave Burr >From Connectionists-Request at q.cs.cmu.edu Mon Mar 20 18:01:41 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA18228; Mon, 20 Mar 89 18:01:34 EST Received: from cs.cmu.edu by Q.CS.CMU.EDU id aa23263; 20 Mar 89 14:41:25 EST Received: from XEROX.COM by CS.CMU.EDU; 20 Mar 89 14:39:19 EST Received: from Semillon.ms by ArpaGateway.ms ; 20 MAR 89 11:26:12 PST Date: 20 Mar 89 11:25 PST From: chrisley.pa at xerox.com Subject: Re: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST To: ananth sankar Cc: connectionists at cs.cmu.edu, chrisley.pa at xerox.com Message-Id: <890320-112612-6136 at Xerox> Status: R Ananth Sankar recently asked some questions about Kohonen's feature maps.
As I have worked on these issues with Kohonen, I feel like I might be able to give some answers, but standard disclaimers apply: I cannot be certain that Kohonen would agree with all of the following. Also, I do not have my copy of his book with me, so I cannot be more specific about references. Questions: 1 Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach. Although there is probably more than one correct, task-independent gain or neighborhood function, Kohonen does mention constraints that all of them should meet. For example, both functions should decrease to zero over time. I do not know of any tweaking; Kohonen usually determines a number of iterations and then decreases the gain linearly. If you call this tweaking, then your idea of domain-independent parameters might be a sort of holy grail, since it does not seem likely that we are going to find a completely parameter-free learning algorithm that will work in every domain. 2 Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features. As far as I know, Kohonen has used the same type of gain and neighborhood functions for all of his map demonstrations. These demonstrations, which have been shown via an animated film at several major conferences, demonstrate maps learning the distribution in cases where 1) the dimensionality of the network topology and the input space mismatch, e.g., where the network is 2-D and the distribution is a 3-D 'cactus'; and 2) the distribution is not uniform. The algorithm was developed with these two cases in mind, so it is no surprise that the results are good for them as well. 
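[Editor's note: the linear-decay schedule described above can be sketched concretely. Below is a minimal 1-D Kohonen map in Python; all constants (20 units, 2000 iterations, initial gain 0.5) are illustrative, not Kohonen's own.]

```python
import numpy as np

rng = np.random.default_rng(0)

n_units, n_iters = 20, 2000
w = rng.random((n_units, 2))  # 1-D chain of units, 2-D inputs

for t in range(n_iters):
    x = rng.random(2)  # sample from a uniform 2-D distribution
    c = int(np.argmin(((w - x) ** 2).sum(axis=1)))  # best-matching unit (Euclidean)
    gain = 0.5 * (1 - t / n_iters)  # gain decreases linearly to zero
    radius = int((n_units // 2) * (1 - t / n_iters))  # neighborhood shrinks to zero
    lo, hi = max(0, c - radius), min(n_units, c + radius + 1)
    w[lo:hi] += gain * (x - w[lo:hi])  # move the whole neighborhood toward x

# After training, units adjacent on the chain should hold nearby weight vectors.
```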
3 In Kohonen's book "Self Organization and Associative Memory", Ch. 5, the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation. That's right. And Kohonen usually uses the Euclidean distance metric, although other ones can be used (which he discusses in the book). Furthermore, there have been independent efforts to normalize weights in Kohonen maps so that the dot product measure can be used. If you have any doubts about the suitability of the Euclidean metric, as your question seems to imply, express them. It is an interesting issue. 4 I have not seen as yet in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited. The primary interest in maps, I believe, came from a desire to display high-dimensional information in low-dimensional spaces, which are more easily apprehended. But there is evidence that there are other uses as well: 1) Kohonen has published results on using maps for phoneme recognition, where the topology-preservation plays a significant role (such maps are used in the Otaniemi Phonetic Typewriter featured in, I think, Computer magazine a year or two ago); 2) work has been done on using the topology to store sequential information, which seems to be a good idea if you are dealing with natural signals that can only temporally shift from a state to similar states; 3) several people have followed Kohonen's suggestion of using maps for adaptive kinematic representations for robot control (the work on Murphy, mentioned on this net a month or so ago, and the work being done at Carleton (sp?) University by Darryl Graf are two good examples). In short, just look at some ICNN or INNS proceedings, and you'll find many papers where researchers found Kohonen maps to be a good place from which to launch their own studies. 
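[Editor's note: the distance-versus-dot-product point above can be made concrete. A small sketch (hypothetical sizes and values, not from Kohonen's book): when both the weight vectors and the input are normalized to unit length, ||w - x||^2 = 2 - 2 w.x, so the minimum-distance winner and the maximum-dot-product winner coincide.]

```python
import numpy as np

rng = np.random.default_rng(1)

w = rng.random((10, 4))  # 10 output units, 4 input lines
w /= np.linalg.norm(w, axis=1, keepdims=True)  # normalize each unit's weight vector
x = rng.random(4)
x /= np.linalg.norm(x)  # normalized input pattern

winner_dist = int(np.argmin(np.linalg.norm(w - x, axis=1)))  # Euclidean winner
winner_dot = int(np.argmax(w @ x))  # dot-product winner
# With unit vectors the two selection rules pick the same unit.
```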
5 Can the net become disordered after ordering is achieved at any particular iteration? Of course, this is theoretically possible, and is almost certain if at some point the distribution of the mapped function changes. But this brings up a difficult question: what is the proper ordering in such a case? Should a net try to integrate both past and present distributions, or should it throw away the past and concentrate on the present? I think most nn researchers would want a little of both, with maybe some kind of exponential decay in the weights. But in many applications of maps, there is no chance of the distribution changing: it is fixed, and iterations are over the same test data each time. In this case, I would guess that the ordering could not become disrupted (at least for simple distributions and a net of adequate size), but I realise that there is no proof of this, and the terms 'simple' and 'adequate' are lacking definition. But that's life in nnets for you! If anyone has any more questions, feel free. Ron Chrisley Xerox PARC System Sciences Lab 3333 Coyote Hill Road Palo Alto, CA 94304 USA chrisley.pa at xerox.com tel: (415) 494-4728 OR New College Oxford OX1 3BN UK chrisley at vax.oxford.ac.uk tel: (865) 279-492 >From chrisley.pa at xerox.com Thu Mar 23 15:00:13 1989 Received: from Xerox.COM by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA22224; Thu, 23 Mar 89 15:00:04 EST Received: from Semillon.ms by ArpaGateway.ms ; 23 MAR 89 11:35:27 PST Date: 23 Mar 89 11:35 PST From: chrisley.pa at xerox.com Subject: Re: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST To: ananth sankar Cc: connectionists at cs.cmu.edu Message-Id: <890323-113527-4949 at Xerox> Status: R One further note about Ananth Sankar's questions about Kohonen maps: A friend of mine, Tony Bell, tells me (and Ananth) that Helge Ritter has a "neat set of expressions for the learning rate and neighbourhood size parameters... 
and he also proves something about convergence elsewhere." Unfortunately, I do not as yet have a reference for the papers, but I have liked Ritter's work in the past, so I thought people on the net might be interested. >From Connectionists-Request at q.cs.cmu.edu Fri Mar 24 11:52:18 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA20326; Fri, 24 Mar 89 11:52:13 EST Received: from ri.cmu.edu by Q.CS.CMU.EDU id aa17597; 24 Mar 89 8:48:01 EST Received: from BU-IT.BU.EDU by RI.CMU.EDU; 24 Mar 89 08:41:54 EST Received: from COCHLEA.BU.EDU by bu-it.BU.EDU (5.58/4.7) id AA06449; Tue, 21 Mar 89 13:58:32 EST Received: by cochlea.bu.edu (4.0/4.7) id AA04927; Tue, 21 Mar 89 13:59:02 EST Date: Tue, 21 Mar 89 13:59:02 EST From: lwyse at bucasb.bu.edu Message-Id: <8903211859.AA04927 at cochlea.bu.edu> To: connectionists at ri.cmu.edu In-Reply-To: connectionists at c.cs.cmu.edu's message of 20 Mar 89 23:47:09 GMT Subject: Re: questions on kohonen's maps Organization: Center for Adaptive Systems, B.U. Status: R What does "ordering" mean when you're projecting inputs to a lower-dimensional space? For example, with the "Peano"-type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain. In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes. 
-lonce >From @relay.cs.net:tony at ifi.unizh.ch Fri Mar 24 13:30:26 1989 Received: from RELAY.CS.NET by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA23163; Fri, 24 Mar 89 13:30:12 EST Received: from relay2.cs.net by RELAY.CS.NET id ab09426; 24 Mar 89 12:01 EST Received: from switzerland by RELAY.CS.NET id aa01417; 24 Mar 89 11:55 EST Received: from ean by scsult.SWITZERLAND.CSNET id a011335; 24 Mar 89 17:53 WET Date: 24 Mar 89 17:51 +0100 From: tony bell To: sankar at caip.rutgers.edu Mmdf-Warning: Parse error in original version of preceding line at SWITZERLAND.CSNET Message-Id: <352:tony at ifi.unizh.ch> Status: R In case anyone else asks (or Ron sends any more vague messages to the net), here are all the refs I have on Helge Ritter's work on topological maps: [1] "Kohonen's Self-Organizing Maps: exploring their computational capabilities" in Proc. IEEE ICNN 1988, San Diego. [2] "Convergence Properties of Kohonen's Topology Conserving Maps: fluctuations, stability and dimension selection" submitted to Biol. Cybernetics. [3] "Extending Kohonen's self-organising mapping algorithm to learn Ballistic Movements" in the book "Neural Computers", Eckmiller & von der Malsburg (eds). [4] "Topology conserving mappings for learning motor tasks" in the book "Neural Networks for Computing", Denker (ed), AIP Conf. proceedings, Snowbird, 1986. The second one in particular uses some heavy statistical techniques (the inputs are seen as a Markov process and a Fokker-Planck equation describes the learning) in order to prove that the map will reach equilibrium when the learning rate is time-dependent (i.e., it decays). Ritter's PhD thesis covers all his work, but it's in German. Now, Ritter is at the University of Illinois. I hope this helps you, and I don't mind if you post this to the net if you think people are interested enough. yours, Tony Bell. 
>From Connectionists-Request at q.cs.cmu.edu Fri Mar 24 22:07:14 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA23834; Fri, 24 Mar 89 22:07:06 EST Received: from ri.cmu.edu by Q.CS.CMU.EDU id aa22170; 24 Mar 89 13:28:20 EST Received: from MAUI.CS.UCLA.EDU by RI.CMU.EDU; 24 Mar 89 13:26:10 EST Return-Path: Received: by maui.cs.ucla.edu (Sendmail 5.59/2.16) id AA25252; Fri, 24 Mar 89 10:25:07 PST Date: Fri, 24 Mar 89 10:25:07 PST From: Geunbae Lee Message-Id: <8903241825.AA25252 at maui.cs.ucla.edu> To: lwyse at bucasb.bu.edu Subject: Re: questions on kohonen's map Cc: connectionists at ri.cmu.edu Status: R >What does "ordering" mean when you're projecting inputs to a lower dimensional >space? It means topological ordering. >For example, the "Peano" type curves that result from a one-D >neighborhood learning a 2-D input distribution, it is obviously NOT >true that nearby points in the input space maximally activate nearby >points on the neighborhood chain. It depends on what you mean by "nearby". If it is nearby in a relative sense (in topological relation), not an absolute sense, then the nearby points in the input space DO maximally activate nearby points on the neighborhood chain. 
--Geunbae Lee AI Lab, UCLA >From Connectionists-Request at q.cs.cmu.edu Sat Mar 25 02:26:12 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA26264; Sat, 25 Mar 89 02:26:06 EST Received: from ri.cmu.edu by Q.CS.CMU.EDU id aa25584; 24 Mar 89 17:55:35 EST Received: from XEROX.COM by RI.CMU.EDU; 24 Mar 89 17:53:44 EST Received: from Semillon.ms by ArpaGateway.ms ; 24 MAR 89 14:53:32 PST Date: 24 Mar 89 14:53 PST From: chrisley.pa at xerox.com Subject: Re: questions on kohonen's maps In-Reply-To: lwyse at bucasb.BU.EDU's message of Tue, 21 Mar 89 13:59:02 EST To: lwyse at bucasb.bu.edu Cc: connectionists at ri.cmu.edu Message-Id: <890324-145332-8519 at Xerox> Status: R Lonce (lwyse at bucasb.BU.EDU) writes: "What does "ordering" mean when you're projecting inputs to a lower dimensional space? For example, the "Peano" type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain." It is not true that nearby points in input space are always mapped to nearby points in the output space when the mapping is dimensionality-reducing, agreed. But 'ordering' still makes sense. The map is topology-preserving if the dependency is in the other direction, i.e., if nearby points in output space are always activated by nearby points in input space. Lonce goes on to say: "In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes." I agree that topology preservation is not necessarily of utmost importance, but it may be useful in some applications, such as the ones I mentioned a few messages back (phoneme recognition, inverse kinematics, etc.). 
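[Editor's note: the one-way dependency described above suggests a simple operational check: units adjacent on the chain should hold weight vectors that are close in input space, even though close inputs need not activate adjacent units. A sketch with illustrative points (not from any of the experiments cited in the thread), comparing an ordered chain with a tangled one over the same inputs:]

```python
import numpy as np

def adjacent_spread(w):
    # Mean input-space distance between units that are adjacent on the chain.
    return float(np.linalg.norm(np.diff(w, axis=0), axis=1).mean())

points = np.linspace([0.0, 0.0], [1.0, 1.0], 10)  # 10 input points on a diagonal
ordered = points                                   # chain order follows the diagonal
tangled = points[[0, 5, 1, 6, 2, 7, 3, 8, 4, 9]]   # same points, scrambled chain order

# The ordered (topology-preserving) chain keeps chain-neighbors close in input space;
# the tangled chain covers the same points but with a much larger adjacent spread.
```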
Also, there is 1) the interest in properties of self-organizing systems in themselves, even though an application can't be immediately found; and 2) the observation that for some reason the brain seems to use topology-preserving maps (with the one-way dependency I mentioned above), which, although they *could* be computationally unnecessary or even disadvantageous, are probably in fact, nature being what she is, good solutions to tough real-time problems. Ron Chrisley After April 14th, please send personal email to Chrisley at vax.ox.ac.uk >From Connectionists-Request at q.cs.cmu.edu Sun Mar 26 03:40:59 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA19433; Sun, 26 Mar 89 03:40:47 EST Received: from cs.cmu.edu by Q.CS.CMU.EDU id aa07032; 26 Mar 89 1:22:01 EST Received: from CGL.UCSF.EDU by CS.CMU.EDU; 26 Mar 89 01:18:16 EST Received: from phyb.ucsf.EDU by cgl.ucsf.EDU (5.59/GSC4.16) id AA07448; Sat, 25 Mar 89 22:18:01 PST Received: by phyb (1.2/GSC4.15) id AA08352; Sat, 25 Mar 89 22:17:59 pst Date: Sat, 25 Mar 89 22:17:59 pst From: Ken Miller Message-Id: <8903260617.AA08352 at phyb> To: Connectionists at cs.cmu.edu Subject: Normalization of weights in Kohonen algorithm Status: R re point 3 of a recent posting about the Kohonen algorithm: "3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights." The algorithm du_{ij}/dt = a(t)[e_j(t) - u_{ij}(t)], i in N_c, where u = weights, e is the input pattern, and N_c is the topological neighborhood of the maximally responding unit, should, I believe, be written du_{ij}/dt = a(t)[ e_j(t)/\sum_k(e_k(t)) - u_{ij}(t)/\sum_k(u_{ik}(t)) ], i in N_c. That is, the change should be such as to move the jth synaptic weight on the ith cell, as a PROPORTION of all the synaptic weights on the ith cell, in the direction of matching the PROPORTION of input which was incoming on the jth line. 
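[Editor's note: the conservation property of this normalized rule is easy to verify numerically. A minimal sketch of one Euler step, with illustrative sizes, applying the update to every cell rather than only to the neighborhood N_c:]

```python
import numpy as np

rng = np.random.default_rng(3)

u = rng.random((5, 8))  # weights u_{ij}: 5 cells, 8 input lines
e = rng.random(8)       # input pattern e_j
a, dt = 0.1, 1.0        # gain a(t) and Euler step size

# du_{ij}/dt = a * ( e_j / sum_k e_k  -  u_{ij} / sum_k u_{ik} )
du = a * (e / e.sum() - u / u.sum(axis=1, keepdims=True))
u_new = u + dt * du

# Summing du over j gives a * (1 - 1) = 0 for every cell, so each cell's
# total synaptic weight is conserved by the step.
```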
Note that in this case \sum_j du_{ij}(t)/dt = 0, so weights remain normalized in the sense that the sum over each cell remains constant. If inputs are normalized to sum to 1 (\sum_k(e_k(t)) = 1), then the first denominator can be omitted. If weights begin normalized to sum to 1 on each cell (\sum_k(u_{ik}(t)) = 1 for all i), then weights will remain normalized to sum to 1, hence the second denominator can be omitted. Perhaps Kohonen was assuming these normalizations and hence dispensing with the denominators? ken miller (ken at phyb.ucsf.edu) From mcvax!fib.upc.es!millan at uunet.UU.NET Fri Mar 31 04:09:00 1989 From: mcvax!fib.upc.es!millan at uunet.UU.NET (Jose del R. MILLAN) Date: 31 Mar 89 17:09 +0800 Subject: TR available Message-ID: <92*millan@fib.upc.es> The following Tech. Report is available. Requests should be sent to MILLAN at FIB.UPC.ES ________________________________________________________________________ Learning by Back-Propagation: a Systolic Algorithm and its Transputer Implementation Technical Report LSI-89-15 Jose del R. MILLAN Dept. de Llenguatges i Sistemes Informatics Universitat Politecnica de Catalunya Pau BOFILL Dept. d'Arquitectura de Computadors Universitat Politecnica de Catalunya ABSTRACT In this paper we present a systolic algorithm for back-propagation, a supervised, iterative, gradient-descent, connectionist learning rule. The algorithm works on feedforward networks where connections can skip layers, and it fully exploits the spatial and training parallelisms that are inherent to back-propagation. Spatial parallelism arises during the propagation of activity ---forward--- and error ---backward--- for a particular input-output pair. On the other hand, when this computation is carried out simultaneously for all input-output pairs, training parallelism is obtained. In the spatial dimension, a single systolic ring carries out sequentially the three main steps of the learning rule ---forward, backward and weight-increment update. 
Furthermore, the same pattern of matrix delivery is used in both the forward and the backward passes. In this manner, the algorithm preserves the similarity of the forward and backward passes in the original model. The resulting systolic algorithm is dual with respect to the pattern of matrix delivery ---either columns or rows. Finally, an implementation of the systolic algorithm for the spatial dimension is derived that uses a linear ring of Transputer processors. From joho%sw.MCC.COM at MCC.COM Thu Mar 2 13:18:40 1989 From: joho%sw.MCC.COM at MCC.COM (Josiah Hoskins) Date: Thu, 2 Mar 89 12:18:40 CST Subject: Tech Report Announcement Message-ID: <8903021818.AA22902@jelly.sw.mcc.com> The following tech report is available. Speeding Up Artificial Neural Networks in the "Real" World Josiah C. Hoskins A new heuristic, called focused-attention backpropagation (FAB) learning, is introduced. FAB enhances the backpropagation procedure by focusing attention on the exemplar patterns that are most difficult to learn. Results are reported using FAB learning to train multilayer feed-forward artificial neural networks to represent real-valued elementary functions. The rate of learning observed using FAB is 1.5 to 10 times faster than backpropagation. Requests for copies should refer to MCC Technical Report Number STP-049-89 and should be sent to Kintner at mcc.com or to Josiah C. Hoskins, MCC - Software Technology Program, 9390 Research Blvd, Kaleido II Bldg., Austin, Texas 78759. AT&T: (512) 338-3684; UUCP/USENET: milano!joho; ARPA/INTERNET: joho at mcc.com From cfields at NMSU.Edu Fri Mar 3 17:16:53 1989 From: cfields at NMSU.Edu (cfields@NMSU.Edu) Date: Fri, 3 Mar 89 15:16:53 MST Subject: No subject Message-ID: <8903032216.AA17939@NMSU.Edu> _________________________________________________________________________ The following are abstracts of papers appearing in the inaugural issue of the Journal of Experimental and Theoretical Artificial Intelligence. 
JETAI 1, 1 was published 1 January, 1989. For submission information, please contact either of the editors: Eric Dietrich, PACSS - Department of Philosophy, SUNY Binghamton, Binghamton, NY 13901, dietrich at bingvaxu.cc.binghamton.edu; or Chris Fields, Box 30001/3CRL, New Mexico State University, Las Cruces, NM 88003-0001, cfields at nmsu.edu. JETAI is published by Taylor & Francis, Ltd., London, New York, Philadelphia. _________________________________________________________________________ Minds, machines and Searle Stevan Harnad Behavioral & Brain Sciences, 20 Nassau Street, Princeton NJ 08542, USA Searle's celebrated Chinese Room Argument has shaken the foundations of Artificial Intelligence. Many refutations have been attempted, but none seem convincing. This paper is an attempt to sort out explicitly the assumptions and the logical, methodological and empirical points of disagreement. Searle is shown to have underestimated some features of computer modeling, but the heart of the issue turns out to be an empirical question about the scope and limits of the purely symbolic (computational) model of the mind. Nonsymbolic modeling turns out to be immune to the Chinese Room Argument. The issues discussed include the Total Turing Test, modularity, neural modeling, robotics, causality and the symbol-grounding problem. _________________________________________________________________________ Explanation-based learning: its role in problem solving Brent J. Krawchuck and Ian H. Witten Knowledge Sciences Laboratory, Department of Computer Science, University of Calgary, 2500 University Drive, NW, Calgary, Alta, Canada, T2N 1N4. `Explanation-based' learning is a semantically-driven, knowledge-intensive paradigm for machine learning which contrasts sharply with syntactic or `similarity-based' approaches. This paper redevelops the foundations of EBL from the perspective of problem-solving. 
Viewed in this light, the technique is revealed as a simple modification to an inference engine which gives it the ability to generalize the conditions under which the solution to a particular problem holds. We show how to embed generalization invisibly within the problem solver, so that it is accomplished as inference proceeds rather than as a separate step. The approach is also extended to the more complex domain of planning to illustrate that it is applicable to a variety of logic-based problem-solvers and is by no means restricted to only simple ones. We argue against the current trend to isolate learning from other activity and study it separately, preferring instead to integrate it into the very heart of problem solving. ---------------------------------------------------------------------------- The recognition and classification of concepts in understanding scientific texts Fernando Gomez and Carlos Segami Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA. In understanding a novel scientific text, we may distinguish the following processes. First, concepts are built from the logical form of the sentence into the final knowledge structures. This is called concept formation. While these concepts are being formed, they are also being recognized by checking whether they are already in long-term memory (LTM). Then, those concepts which are unrecognized are integrated in LTM. In this paper, algorithms for the recognition and integration of concepts in understanding scientific texts are presented. It is shown that the integration of concepts in scientific texts is essentially a classification task, which determines how and where to integrate them in LTM. In some cases, the integration of concepts results in a reclassification of some of the concepts already stored in LTM. All the algorithms described here have been implemented and are part of SNOWY, a program which reads short scientific paragraphs and answers questions. 
--------------------------------------------------------------------------- Exploring the No-Function-In-Structure principle Anne Keuneke and Dean Allemang Laboratory for Artificial Intelligence Research, Department of Computer and Information Science, The Ohio State University, 2036 Neil Avenue Mall, Columbus, OH 43210-1277, USA. Although much of past work in AI has focused on compiled knowledge systems, recent research shows renewed interest and advanced efforts both in model-based reasoning and in the integration of this deep knowledge with compiled problem solving structures. Device-based reasoning can only be as good as the model used; if the needed knowledge, correct detail, or proper theoretical background is not accessible, performance deteriorates. Much of the work on model-based reasoning references the `no-function-in-structure' principle, which was introduced by de Kleer and Brown. Although they were well motivated in establishing the guideline, this paper explores the applicability and workability of the concept as a universal principle for model representation. This paper first describes the principle, its intent and the concerns it addresses. It then questions the feasibility and the practicality of the principle as a universal guideline for model representation. ___________________________________________________________________________ From jbower at bek-mc.caltech.edu Sun Mar 5 21:09:10 1989 From: jbower at bek-mc.caltech.edu (Jim Bower) Date: Sun, 5 Mar 89 18:09:10 pst Subject: Summer course in computational neurobiology Message-ID: <8903060209.AA03962@bek-mc.caltech.edu> Course announcement: Methods in Computational Neuroscience The Marine Biological Laboratory Woods Hole, Massachusetts August 6 - September 2, 1989 General Description The Marine Biological Laboratory (MBL) in Woods Hole, Massachusetts is a world-famous marine biological laboratory that has been in existence for over 100 years. 
In addition to providing research facilities for a large number of biologists during the summer, the MBL also sponsors a number of outstanding courses on different topics in Biology. This summer will be the second year in which the MBL has offered a course in "Methods in Computational Neuroscience". This course is designed as a survey of the use of computer modeling techniques in studying the information processing capabilities of the nervous system, and covers models at all levels, from biologically realistic single cells and networks of cells to biologically relevant abstract models. The principal aim of the course is to provide participants with the tools to simulate the functional properties of those neural systems of interest to them, as well as to understand the general advantages and pitfalls of this experimental approach. The Specific Structure of the Course The course itself includes both a lecture series and a computer laboratory. The lectures are given by invited faculty whose work represents the state of the art in computational neuroscience (see list below). The course lecture notes have been incorporated into a book published by MIT Press ("Methods in Neuronal Modeling: From Synapses to Networks", C. Koch and I. Segev, editors. MIT Press, Cambridge, MA, 1989). The computer laboratory is designed to give students hands-on experience with the simulation techniques considered in the lectures. It also provides students with the opportunity to actually begin simulations of neural systems of interest to them. The students are guided in this effort by the visiting lecturers and course directors, but also by several students from the Computational Neural Systems (CNS) graduate program at Caltech who serve as laboratory TAs. The lab itself consists of state-of-the-art graphics workstations running a GEneral NEtwork SImulation System (GENESIS) that Dr. Bower and his colleagues at Caltech have constructed over the last several years. 
Students return to their home institutions with the GENESIS system to continue their work. The Students The course is designed for advanced graduate students and postdoctoral fellows in biology, computer science, electrical engineering, physics, or psychology with an interest in computational neuroscience. Because of the heavy computer orientation of the lab section, a good computer background is required (UNIX, C or PASCAL). In addition, students are expected to have a basic background in neurobiology. Course enrollment is limited to 20 so as to assure the highest quality educational experience. Course Directors James M. Bower and Christof Koch Computation and Neural Systems Program California Institute of Technology The Faculty Paul Adams (Stony Brook) Dan Alkon (NIH) Richard Anderson (MIT) John Hildebrand (Arizona) John Hopfield (Caltech) Rodolfo Llinas (NYU) David Rumelhart (Stanford) Idan Segev (Jerusalem) Terrence Sejnowski (Salk/UCSD) David Van Essen (Caltech) Christoph von der Malsburg (USC) For further information and application materials contact: Admissions Coordinator Marine Biological Laboratory Woods Hole, MA 02543 (508) 548-3705, extension 216 Application Deadline May 15, 1989 Acceptance notification in early June. From mjolsness-eric at YALE.ARPA Tue Mar 7 21:23:16 1989 From: mjolsness-eric at YALE.ARPA (Eric Mjolsness) Date: Tue, 7 Mar 89 21:23:16 EST Subject: "Transformations" tech report Message-ID: <8903080223.AA17992@NEBULA.SUN3.CS.YALE.EDU> A new technical report is available: "Algebraic Transformations of Objective Functions" (YALEU/DCS/RR-686) by Eric Mjolsness and Charles Garrett Yale Department of Computer Science P.O. Box 2158, Yale Station New Haven, CT 06520 Abstract: A standard neural network design trick reduces the number of connections in the winner-take-all (WTA) network from O(N^2) to O(N). We explain the trick as a general fixpoint-preserving transformation applied to the particular objective function associated with the WTA network. 
The key idea is to introduce new interneurons which act to maximize the objective, so that the network seeks a saddle point rather than a minimum. A number of fixpoint-preserving transformations are derived, allowing the simplification of such algebraic forms as products of expressions, functions of one or two expressions, and sparse matrix products. The transformations may be applied to reduce or simplify the implementation of a great many structured neural networks, as we demonstrate for inexact graph-matching, convolutions and coordinate transformations, and sorting. Simulations show that fixpoint-preserving transformations may be applied repeatedly and elaborately, and the example networks still robustly converge. We discuss implications for circuit design. To request a copy, please send your physical address by e-mail to mjolsness-eric at cs.yale.edu OR mjolsness-eric at yale.arpa (old style) Thank you. ------- From prlb2!vub.vub.ac.be!prog1!wplybaer at uunet.UU.NET Tue Mar 7 19:34:21 1989 From: prlb2!vub.vub.ac.be!prog1!wplybaer at uunet.UU.NET (Wim P. Lybaert) Date: Wed, 8 Mar 89 01:34:21 +0100 Subject: No subject Message-ID: <8903080034.AA10074@prog1.vub.ac.be> Hi, I would like to be placed on the connectionist neural nets mailing list that you distribute. Thanks, Wim Lybaert Brussels Free University Department PROG Oefenplein 2 1040 BRUSSELS BELGIUM email: From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Wed Mar 8 11:36:31 1989 From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias) Date: Wed, 08 Mar 89 11:36:31 EST Subject: information function vs. squared error Message-ID: I am looking for pointers to papers discussing the use of an alternative criterion to squared error in back-propagation algorithms. The alternative function I have in mind is called (in different contexts and/or by different authors) cross entropy, entropy, information, inf. divergence, and so on. 
It is defined something like: G = sum_{i=1}^{N} p_i * log(p_i). I am not quite sure what the index i runs through: units, weights or something else. I know people have been talking about this a lot, I just cannot remember where I read about it... it seems like Geoff Hinton's group has worked on this. Thanks, Thanasis From mdp%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Thu Mar 9 08:16:07 1989 From: mdp%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (Mark Plumbley) Date: Thu, 9 Mar 89 13:16:07 GMT Subject: information function vs. squared error Message-ID: <14398.8903091316@dsl.eng.cam.ac.uk> Thanasis, The "G" function you mentioned, based on an entropy method, is probably the one developed by Pearlmutter and Hinton as a procedure for unsupervised learning of binary units [1]. More recently, Linsker [2,3] and Plumbley and Fallside [4] considered the principle of maximum information transmission (or minimum information loss) for continuous units, relating this to Principal Component methods for linear units. Unfortunately, these are mainly about unsupervised learning, rather than Backprop specifically, although in [4] we do look at the way the mean-squared error criterion places an *upper bound* on the information loss through a supervised network. This bound will be tightest when the errors on all the output units are independent and have the same variance (or the same entropy for non-additive-Gaussian errors). *If* you can choose the target representation used by Backprop so that the errors are likely to have these properties, it should perform closer to the (information-theoretic) optimal. Hope this is some help, Mark. References: [1] B. A. Pearlmutter and G. E. Hinton: "G-Maximization: An Unsupervised Learning Procedure for Discovering Regularities". In Proceedings of the Conference on `Neural Networks for Computing'. American Institute of Physics, 1986. [2] R. Linsker: "Towards an Organisational Principle for a Layered Perceptual Network". 
In "Neural Information Processing Systems (Denver, CO, 1987)" (Ed. D. Z. Anderson), pp. 485-494. American Institute of Physics, 1988.
[3] R. Linsker: "Self-Organization in a Perceptual Network". IEEE Computer, vol. 21 (3), March 1988, pp. 105-117.
[4] M. D. Plumbley and F. Fallside: "An Information-Theoretic Approach to Unsupervised Connectionist Models". Tech. Report CUED/F-INFENG/TR.7, Cambridge University Engineering Department, 1988. Also in "Proceedings of the 1988 Connectionist Models Summer School", pp. 239-245. Morgan-Kaufmann, San Mateo, CA.

+--------------------------------------------+---------------------------+
| Mark Plumbley                              | Cambridge University      |
| JANET: mdp at uk.ac.cam.eng.dsl            | Engineering Department,   |
| ARPANET:                                   | Trumpington Street,       |
| mdp%dsl.eng.cam.ac.uk at nss.cs.ucl.ac.uk  | Cambridge CB2 1PZ         |
| Tel: +44 223 332754  Fax: +44 223 332662   | UK                        |
+--------------------------------------------+---------------------------+

From becker at ai.toronto.edu Thu Mar 9 13:26:38 1989
From: becker at ai.toronto.edu (becker@ai.toronto.edu)
Date: Thu, 9 Mar 89 13:26:38 EST
Subject: information function vs. squared error
Message-ID: <89Mar9.132645est.10489@ephemeral.ai.toronto.edu>

The use of the cross-entropy measure

G = p log(p/q) + (1-p) log((1-p)/(1-q))

where p and q are the probabilities of a binary random variable under two probability distributions (Kullback, 1959), has been described in at least 3 different contexts in the connectionist literature:

(i) As an objective function for supervised back-propagation; this is appropriate if the output units are computing real values which are to be interpreted as probability distributions over the space of binary output vectors (Hinton, 1987). Here G-error represents the divergence between the desired and observed distributions.
(ii) As an objective function for Boltzmann machine learning (Hinton and Sejnowski, 1986), where p and q are the output distributions in the + and - phases.
(iii) In the Gmax unsupervised learning algorithm (Pearlmutter and Hinton, 1986), as a measure of the difference between the actual output distribution of a unit and the predicted distribution assuming independent input lines.

References:

Hinton, G. E. 1987. "Connectionist Learning Procedures", revised version of Technical Report CMU-CS-87-115, to appear (appeared?) in Artificial Intelligence.
Hinton, G. E. and Sejnowski, T. J. 1986. "Learning and relearning in Boltzmann machines", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Bradford Books.
Kullback, S. 1959. "Information Theory and Statistics", New York: Wiley.
Pearlmutter, B. A. and Hinton, G. E. 1986. "G-Maximization: An unsupervised learning procedure for discovering regularities", Neural Networks for Computing: American Institute of Physics Conference Proceedings 151.

Sue Becker
DCS, University of Toronto

From mehra at aquinas.csl.uiuc.edu Fri Mar 10 05:43:16 1989
From: mehra at aquinas.csl.uiuc.edu (Pankaj Mehra)
Date: Fri, 10 Mar 89 04:43:16 CST
Subject: No subject
Message-ID: <8903101043.AA02586@aquinas>

I have recently explored several connectionist models for learning under _realistic_ learning scenarios. The class of problems for which we are trying to acquire solutions by learning consists of decision problems with the following characteristics:
(i) large number of continuous-valued PARAMETERS, each of which
    (ia) takes on values from a finite range with a nonstationary distribution
    (ib) costs more to measure accurately {however, accuracy can be controlled by focussed sampling}
    (ic) is not known to follow any particular parametric distribution
(ii) the optimization CRITERION (energy, if you will) is ill-defined {much like the _blackbox_ in David Ackley's thesis}
(iii) a set of OPERATORS is available, and these are the _only_ instruments for manipulating the problem state.
    (iiia) the _causal_ relationships between the states before and after the application of the operator are not known
    (iiib) the _persistence_ model is incomplete, i.e. it is not known a priori when the effect of an action will be felt and how long it will persist
(iv) the TRAINING ENVIRONMENT is _slow reactive_: it can be assumed to produce reinforcement (prescriptive feedback) rather than an error (evaluative feedback); however, the delays between an action and subsequent reinforcement follow an _unknown_ distribution.
-------
These have been called Dynamic Decision Problems, and shown to be a rich class, in the following publication [available upon request from the first author]:

Mehra, P. and B. W. Wah, "Architectures for Strategy Learning," in Computer Architectures for Artificial Intelligence Applications, ed. B. Wah and C. Ramamoorthy, Wiley, New York, NY, 1989 (in press). {send e-mail to: mehra at cs.uiuc.edu}
-------
The above publication also examines the applicability of other well-known learning techniques {empirical, probabilistic, decision theoretic, EBL, hybrid techniques, learning to plan, etc.} and suggests why ANSs might be preferred over others. As a part of this comparison, several contemporary connectionist models were found lacking in certain respects. I shall summarize the criticisms here, and would like to have feedback from those who have supported the use of these techniques.

BACK-PROPAGATION:
positive aspects:
    Simplicity of programming the learning algorithm
    An effective procedure for tuning of large parameter sets representable as _band matrices_ (layered networks)
problematic assumptions:
    Immediate feedback
    Corrective {as against prescriptive} feedback [I am aware of Ron Williams' work, though]
weakness as a learning approach:
    Requires tweaking of features (normalization biases) to the extent that the degree of generalization varies drastically as the degree of coarse coding changes.
    A great part of the success in particular applications could therefore be attributed to the intelligence of the researcher who codes those features {rather than to the _learning_ algorithm}.

REINFORCEMENT LEARNING
positive aspects:
    Can handle prescriptive feedback
    Has been shown {Rich Sutton, Chuck Anderson} to work with delayed feedback
problematic assumptions:
    The implementations known to this author assume:
    : persistence of effects decays _exponentially_ with time
    : heuristic assumptions such as "recency" (that the more recent an action is, the more responsible it is for the feedback) and "frequency" (that the more frequently an action occurs preceding the feedback, the more likely it is to have caused the feedback) are _hardwired_ into the learning algorithms
    All the knowledge needed for learning is implicit, as if the learning critter was born with algorithms assuming exponential decay and as if all actions in the world caused similar delay patterns
    The nodes of the network compute functions much more complex than in the case of classical back-propagation.
weakness as a learning paradigm:
    All actions that occur at the same time and with the same frequency are assumed equally likely to have caused the feedback (i.e. these algorithms have an implicitly coded causal model).
    No scope for using the same network to choose between actions having different causal and persistence assumptions.
    The learning algorithm amounts to a procedural encoding of environmental knowledge. Any success of these algorithms in realistic applications is in large part due to the intelligence of the designer and the effort they put in (for example, to find just the right lambda for the exponential decay factor).
-------
See my paper for details of Dynamic Decision Problems and an extensive study of how the basic learning model underlying _most_ of the existing learning algorithms (either in AI or Connectionism) is at odds with the requirements of training in the real world.
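To make the criticized assumption concrete, here is a minimal Python sketch of recency-based credit assignment with a hardwired exponential decay factor lambda. The function name, action times, and lambda value are my own illustrative choices, not taken from the paper:

```python
# Illustrative sketch of "hardwired" exponential-decay credit assignment:
# credit for a delayed reinforcement decays geometrically with the age
# of each action, regardless of which action actually caused the reward.

def assign_credit(action_times, reward_time, lam=0.9):
    """Weight each past action by lam**(delay) -- the recency heuristic."""
    return [lam ** (reward_time - t) for t in action_times]

# Actions taken at times 0, 3 and 5; reinforcement arrives at time 6.
credits = assign_credit([0, 3, 5], reward_time=6, lam=0.5)
print(credits)  # [0.015625, 0.125, 0.5]
```

The earliest action receives almost no credit no matter what actually caused the reinforcement, which is precisely the implicitly coded causal model being objected to above.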
Comments welcome from those who read the paper, as well as from those who just want to discuss the material of this basenote.

- Pankaj {Mehra at cs.uiuc.edu}

From mike at bucasb.BU.EDU Fri Mar 10 12:22:14 1989
From: mike at bucasb.BU.EDU (Michael Cohen)
Date: Fri, 10 Mar 89 12:22:14 EST
Subject: network meeting announcement for distribution
Message-ID: <8903101722.AA27914@bucasb.bu.edu>

NEURAL NETWORK MODELS OF CONDITIONING AND ACTION
12th Symposium on Models of Behavior
Friday and Saturday, June 2 and 3, 1989
105 William James Hall, Harvard University
33 Kirkland Street, Cambridge, Massachusetts

PROGRAM COMMITTEE: Michael Commons, Harvard Medical School; Stephen Grossberg, Boston University; John E.R. Staddon, Duke University

JUNE 2, 8:30AM--11:45AM
-----------------------
Daniel L. Alkon, ``Pattern Recognition and Storage by an Artificial Network Derived from Biological Systems''
John H. Byrne, ``Analysis and Simulation of Cellular and Network Properties Contributing to Learning and Memory in Aplysia''
William B. Levy, ``Synaptic Modification Rules in Hippocampal Learning''

JUNE 2, 1:00PM--5:15PM
----------------------
Gail A. Carpenter, ``Recognition Learning by a Hierarchical ART Network Modulated by Reinforcement Feedback''
Stephen Grossberg, ``Neural Dynamics of Reinforcement Learning, Selective Attention, and Adaptive Timing''
Daniel S. Levine, ``Simulations of Conditioned Perseveration and Novelty Preference from Frontal Lobe Damage''
Nestor A. Schmajuk, ``Neural Dynamics of Hippocampal Modulation of Classical Conditioning''

JUNE 3, 8:30AM--11:45AM
-----------------------
John W. Moore, ``Implementing Connectionist Algorithms for Classical Conditioning in the Brain''
Russell M. Church, ``A Connectionist Model of Scalar Timing Theory''
William S. Maki, ``Connectionist Approach to Conditional Discrimination: Learning, Short-Term Memory, and Attention''

JUNE 3, 1:00PM--5:15PM
----------------------
Michael L.
Commons, ``Models of Acquisition and Preference'' John E.R. Staddon, ``Simple Parallel Model for Operant Learning with Application to a Class of Inference Problems'' Alliston K. Reid, ``Computational Models of Instrumental and Scheduled Performance'' Stephen Jose Hanson, ``Behavioral Diversity, Hypothesis Testing, and the Stochastic Delta Rule'' Richard S. Sutton, ``Time Derivative Models of Pavlovian Reinforcement'' FOR REGISTRATION INFORMATION SEE ATTACHED OR WRITE: Dr. Michael L. Commons Society for Quantitative Analysis of Behavior 234 Huron Avenue Cambridge, MA 02138 ---------------------------------------------------------------------- ---------------------------------------------------------------------- REGISTRATION FEE BY MAIL (Paid by check to Society for Quantitative Analysis of Behavior) (Postmarked by April 30, 1989) Name: ______________________________________________ Title: _____________________________________________ Affiliation: _______________________________________ Address: ___________________________________________ Telephone(s): ______________________________________ E-mail address: ____________________________________ ( ) Regular $35 ( ) Full-time student $25 School ____________________________________________ Graduate Date _____________________________________ Print Faculty Name ________________________________ Faculty Signature _________________________________ PREPAID 10-COURSE CHINESE BANQUET ON JUNE 2 ( ) $20 (add to pre-registration fee check) ----------------------------------------------------------------------------- (cut here and mail with your check to) Dr. Michael L. Commons Society for Quantitative Analysis of Behavior 234 Huron Avenue Cambridge, MA 02138 REGISTRATION FEE AT THE MEETING ( ) Regular $45 ( ) Full-time Student $30 (Students must show active student I.D. 
to receive this rate)

ON SITE REGISTRATION
5:00--8:00PM, June 1, at the RECEPTION in Room 1550, William James Hall, 33 Kirkland Street, and 7:30--8:30AM, June 2, in the LOBBY of William James Hall. Registration by mail before April 30, 1989 is recommended as seating is limited.

HOUSING INFORMATION
Rooms have been reserved in the name of the symposium for the Friday and Saturday nights at:

Best Western Homestead Inn, 220 Alewife Brook Parkway, Cambridge, MA 02138
Single: $72  Double: $80

Reserve your room as soon as possible. The hotel will not hold rooms past March 31. Because of Harvard and MIT graduation ceremonies, space will fill up rapidly.

Other nearby hotels:

Howard Johnson's Motor Lodge, 777 Memorial Drive, Cambridge, MA 02139, (617) 492-7777, (800) 654-2000
Single: $115--$135  Double: $115--$135

Suisse Chalet, 211 Concord Turnpike Parkway, Cambridge, MA 02140, (617) 661-7800, (800) 258-1980
Single: $48.70  Double: $52.70
---------------------------------------------------------------------------

From homxb!solla at research.att.com Fri Mar 10 13:10:00 1989
From: homxb!solla at research.att.com (homxb!solla@research.att.com)
Date: Fri, 10 Mar 89 13:10 EST
Subject: Cross-entropy error
Message-ID:

A detailed discussion of the cross-entropy error measure for back propagation, and a comparative study of its merits relative to the more commonly used quadratic measure, are to be found in "Accelerated Learning in Layered Neural Networks" by S. A. Solla, E. Levin, and M. Fleisher. The paper has appeared in "Complex Systems", Vol. 2, 1988. Two other relevant references to the use of such an error function in the context of supervised learning are:

E. B. Baum and F. Wilczek, "Supervised Learning of Probability Distributions by Neural Networks", in "Neural Information Processing Systems", ed. by D. Anderson (AIP, New York, 1988).

J. J. Hopfield, "Learning Algorithms and Probability Distributions in Feed-forward and Feed-back Networks", Proc. Natl. Acad. Sci. USA, Vol. 84, 1987, pp. 8429-8433.
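For concreteness, the two error measures compared in these references can be written out in a few lines. The following Python sketch is illustrative only (the target and output vectors are invented); it computes the usual quadratic error alongside the cross-entropy measure G for output units read as probabilities of binary target variables:

```python
import math

def quadratic_error(targets, outputs):
    """E = 1/2 * sum_i (t_i - y_i)^2, the common squared-error measure."""
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, outputs))

def cross_entropy_error(targets, outputs):
    """G = sum_i [t_i log(t_i/y_i) + (1-t_i) log((1-t_i)/(1-y_i))].
    The conditionals implement the limit t log t -> 0 for binary
    targets t in {0, 1}."""
    g = 0.0
    for t, y in zip(targets, outputs):
        if t > 0.0:
            g += t * math.log(t / y)
        if t < 1.0:
            g += (1 - t) * math.log((1 - t) / (1 - y))
    return g

targets = [1.0, 0.0]
good, bad = [0.9, 0.1], [0.5, 0.5]
# Both measures shrink as outputs approach the targets.
q_good, g_good = quadratic_error(targets, good), cross_entropy_error(targets, good)
q_bad, g_bad = quadratic_error(targets, bad), cross_entropy_error(targets, bad)
```

Note that G grows without bound as an output unit approaches the wrong extreme, while the quadratic error stays bounded; this sharper penalty on confidently wrong outputs is one intuition behind the accelerated learning these references report.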
Sara A. Solla AT&T Bell Laboratories solla at homxb.att.com From John.Hampshire at SPEECH2.CS.CMU.EDU Sun Mar 12 13:21:21 1989 From: John.Hampshire at SPEECH2.CS.CMU.EDU (John.Hampshire@SPEECH2.CS.CMU.EDU) Date: Sun, 12 Mar 89 13:21:21 EST Subject: non-MSE objective function for backprop Message-ID: ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* A NOVEL OBJECTIVE FUNCTION FOR IMPROVED CLASSIFICATION PERFORMANCE IN TIME-DELAY NEURAL NETS USED FOR PHONEME RECOGNITION J. B. Hampshire II A. H. Waibel Carnegie Mellon University We have been working on an alternative objective function to the mean-squared-error (MSE) objective function typically used in backpropagation. Our alternative, which we term the classification figure-of-merit (CFM), forms a mathematical assessment of the *relative* activations of all output nodes of a backprop network used as a classifier. The objective function has a number of unique characteristics; chief among these are 1. its formation of internal representations that consistently differ substantially from those of the MSE objective function 2. its immunity to "over-learning" (i.e., the process by which MSE classifiers can be trained so much that they begin to key on "idiosyncratic" features of the training set that are not representative of the ensemble from which the training set was drawn. As a result, over training actually results in degraded classification performance on a disjoint test set.) While classification performance of the CFM objective function is equivalent to that of the MSE objective function, results from the two classifiers can be combined to reduce by a median 24% the number of misclassifications made by the MSE classifier alone. This equates to single and multi-speaker /b, d, g/ recognition rates that consistently exceed 98%. 
A preliminary paper on our results of applying the CFM to phoneme recognition using Time-Delay Neural Nets is available now, but if you want to wait another two weeks, you can get the NEW! IMPROVED! full-fledged technical report. If you absolutely can't wait to get your hands on this stuff, send your mailing address and something to the effect of, "send me the CFM paper." If, on the other hand, you want to see a more thorough analysis, send your mailing address and say, "send me the CFM tech report (CMU-CS-89-118) in two weeks." In either case, send your request directly to hamps at speech2.cs.cmu.edu

***** DO NOT USE THE REPLY COMMAND IN YOUR MAILER *****

********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD ***********
**************** TO OTHER BBOARDS/ELECTRONIC MEDIA *******************

From netlist at psych.Stanford.EDU Sun Mar 12 17:13:17 1989
From: netlist at psych.Stanford.EDU (Mark Gluck)
Date: Sun, 12 Mar 89 14:13:17 PST
Subject: Tues. 3/14: ALAN LAPEDES, Neural Nets and Signal Processing
Message-ID:

Stanford University Interdisciplinary Colloquium Series:
Adaptive Networks and their Applications

Mar. 14th (Tuesday, 3:30pm):
********************************************************************************
"Nonlinear Signal Processing with Adaptive Networks"

ALAN LAPEDES
Theoretical Division
Los Alamos National Laboratory, MS B213
Los Alamos, New Mexico 87545
********************************************************************************

Abstract

Previous work on using the new generation of nonlinear neural networks for signal processing tasks is reviewed. The concept of a nonlinear system changing its behavior as a parameter is changed (bifurcations) is introduced and investigated for the simple logistic map. In this situation we show that instabilities (limit cycles, chaos) of this system may be predicted as a function of a system parameter purely from observations of the system in its stable regime where it evolves to a stable fixed point.
We consider predicting the bifurcation of a hydrodynamic experiment. Both backpropagation nets and radial basis networks are used on this problem. Agreement with experiment is good, and plenty of pretty three dimensional pictures will be shown. Unnecessary formalism will be kept to a bare minimum. Additional Information ---------------------- Location: Room 380-380X, which can be reached through the lower level between the Psychology and Mathematical Sciences buildings. Level: Technically oriented for persons working in related areas. Mailing lists: To be added to the network mailing list, netmail to netlist at psych.stanford.edu with "addme" as your subject header. For additional information, contact Mark Gluck (gluck at psych.stanford.edu). From harnad at Princeton.EDU Mon Mar 13 13:57:26 1989 From: harnad at Princeton.EDU (Stevan Harnad) Date: Mon, 13 Mar 89 13:57:26 EST Subject: Abstract for CNLS Conference Message-ID: <8903131857.AA19332@clarity.Princeton.EDU> Here is the abstract for my contribution to the session on the "Emergence of Symbolic Structures" at the 9th Annual International Conference on Emergent Computation, CNLS, Los Alamos National Laboratory, May 22 - 26 1989 Grounding Symbols in a Nonsymbolic Substrate Stevan Harnad Behavioral and Brain Sciences Princeton NJ There has been much discussion recently about the scope and limits of purely symbolic models of the mind and of the proper role of connectionism in mental modeling. 
In this paper the "symbol grounding problem" -- the problem of how the meanings of meaningless symbols, manipulated only on the basis of their shapes, can be grounded in anything but more meaningless symbols in a purely symbolic system -- is described, and then a potential solution is sketched: Symbolic representations must be grounded bottom-up in nonsymbolic representations of two kinds: (1) iconic representations, which are analogs of the sensory projections of objects and events, and (2) categorical representations, which are learned or innate feature-detectors that pick out the invariant features of object and event categories. Elementary symbols are the names of object and event categories, picked out by their (nonsymbolic) categorical representations. Higher-order symbols are then grounded in these elementary symbols. Connectionism is a natural candidate for the mechanism that learns the invariant features. In this way connectionism can be seen as a complementary component in a hybrid nonsymbolic/symbolic model of the mind, rather than a rival to purely symbolic modeling. Such a hybrid model would not have an autonomous symbolic module, however; the symbolic functions would emerge as an intrinsically "dedicated" symbol system as a consequence of the bottom-up grounding of categories and their names.

From niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Tue Mar 14 10:16:44 1989
From: niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (Mahesan Niranjan)
Date: Tue, 14 Mar 89 15:16:44 GMT
Subject: information function vs. squared error
Message-ID: <28888.8903141516@dsl.eng.cam.ac.uk>

I tried sending the following note last weekend but it failed for some reason - apologies if anyone is getting a repeat!

Re:
> Date: Wed, 08 Mar 89 11:36:31 EST
> From: thanasis kehagias
> Subject: information function vs.
squared error > > i am looking for pointers to papers discussing the use of an alternative > criterion to squared error, in back propagation algorithms. the [..] > G=sum{i=1}{N} p_i*log(p_i) > Here is a non-causal reference: I have been looking at an error measure based on "approximate distances to class-boundary" instead of the total squared error used in typical supervised learning networks. The idea is motivated by the fact that a large network has an inherent freedom to classify a training set in many ways (and thus poor generalisation!). In my training, an example of a particular class gets a target value depending on where it lies with respect to examples from the other class (in a two class problem). This implies, that the target interpolation function that the network has to construct is a smooth transition from one class to the other (rather than a step-like cross section in the total squared error criterion). The important consequence of doing this is that networks are automatically deprived of the ability to form large weight (- sharp cross section) solutions (an auto weight decay!!). niranjan PS: A Tech report will be announced soon. From sven at iuvax.cs.indiana.edu Tue Mar 14 10:12:36 1989 From: sven at iuvax.cs.indiana.edu (Sven Anderson) Date: Tue, 14 Mar 89 10:12:36 -0500 Subject: Connection between Hidden Markov Models and Connectionist Networks In-Reply-To: thanasis kehagias's message of Mon, 13 Feb 89 00:47:00 EST Message-ID: I'm interested in receiving the paper you described: OPTIMAL CONTROL FOR TRAINING THE MISSING LINK BETWEEN HIDDEN MARKOV MODELS AND CONNECTIONIST NETWORKS by Athanasios Kehagias Division of Applied Mathematics Brown University Providence, RI 02912 If it's more convenient you might just forward the div file. 
thanks, Sven Anderson

From honavar at cs.wisc.edu Tue Mar 14 17:59:39 1989
From: honavar at cs.wisc.edu (A Buggy AI Program)
Date: Tue, 14 Mar 89 16:59:39 -0600
Subject: TR available (** DO NOT FORWARD TO BULLETIN BOARDS **)
Message-ID: <8903142259.AA01452@goat.cs.wisc.edu>

** PLEASE DO NOT FORWARD TO BULLETIN BOARDS **

The following TR is now available:
---------------------------------------
Perceptual Development and Learning: From Behavioral, Neurophysiological, and Morphological Evidence To Computational Models

Vasant Honavar
Computer Sciences Department
University of Wisconsin-Madison
Computer Sciences TR # 818, January 1989

Abstract

An intelligent system has to be capable of adapting to a constantly changing environment. It therefore ought to be capable of learning from its perceptual interactions with its surroundings. This requires a certain amount of plasticity in its structure. Any attempt to model the perceptual capabilities of a living system or, for that matter, to construct a synthetic system of comparable abilities, must therefore account for such plasticity through a variety of developmental and learning mechanisms. This paper examines some results from neuroanatomical, morphological, as well as behavioral studies of the development of visual perception; integrates them into a computational framework; and suggests several interesting experiments with computational models that can yield insights into the development of visual perception.
---------------------------------------
Requests for copies must be addressed to: honavar at cs.wisc.edu

From ash%cs at ucsd.edu Tue Mar 14 19:15:54 1989
From: ash%cs at ucsd.edu (Tim Ash)
Date: Tue, 14 Mar 89 16:15:54 PST
Subject: No subject
Message-ID: <8903150015.AA19834@beowulf.ucsd.edu.UCSD.EDU>
-----------------------------------------------------------------------
The following technical report is now available.
-----------------------------------------------------------------------
DYNAMIC NODE CREATION IN BACKPROPAGATION NETWORKS

Timur Ash
ash at ucsd.edu

Abstract

Large backpropagation (BP) networks are very difficult to train. This fact complicates the process of iteratively testing different sized networks (i.e., networks with different numbers of hidden layer units) to find one that provides a good mapping approximation. This paper introduces a new method called Dynamic Node Creation (DNC) that attacks both of these issues (training large networks and testing networks with different numbers of hidden layer units). DNC sequentially adds nodes one at a time to the hidden layer(s) of the network until the desired approximation accuracy is achieved. Simulation results for parity, symmetry, binary addition, and the encoder problem are presented. The procedure was capable of finding known minimal topologies in many cases, and was always within three nodes of the minimum. Computational expense for finding the solutions was comparable to training normal BP networks with the same final topologies. Starting out with fewer nodes than needed to solve the problem actually seems to help find a solution. The method yielded a solution for every problem tried. BP applied to the same large networks with randomized initial weights was unable, after repeated attempts, to replicate some minimum solutions found by DNC.
-----------------------------------------------------------------------
Requests for reprints (ICS Report 8901) should be directed to:

Claudia Fernety
Institute for Cognitive Science C-015
University of California, San Diego
La Jolla, CA 92093.
-----------------------------------------------------------------------

From wine at CS.UCLA.EDU Wed Mar 15 08:49:36 1989
From: wine at CS.UCLA.EDU (wine@CS.UCLA.EDU)
Date: Wed, 15 Mar 89 05:49:36 PST
Subject: TR available (** DO NOT FORWARD TO BULLETIN BOARDS **)
In-Reply-To: Your message of Tue, 14 Mar 89 16:59:39 -0600.
<8903142259.AA01452@goat.cs.wisc.edu>
Message-ID: <8903151349.AA04692@retina.cs.ucla.edu>

Please send me a copy of your technical report #818. Thank you in advance.

--David Wine
University of California at Los Angeles     wine at cs.ucla.edu
Computer Science Department                 (213) 825-6121
3531 Boelter Hall     ...!(uunet,rutgers,ucbvax,randvax)!cs.ucla.edu!wine
Los Angeles, CA 90024

From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Wed Mar 15 18:24:14 1989
From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias)
Date: Wed, 15 Mar 89 18:24:14 EST
Subject: what is a connectionist network?
Message-ID:

ok, here is my question. i hope it makes sense: very often i want to refer to "these things". i do not want to call them neural networks, since it is far from clear to me they really have a similarity with the human nervous system. so i chose to call them connectionist networks. i guess this means they are networks with (many) connections. but this is very general. so i do not have a clear definition of what i am talking about. i am sure i could come up with several, but they seem to me to be either too restrictive or too general. so would anybody care to give their definition of these objects that this list is about? the issue is not trivial or vacuously philosophical. i think that even if we do not come up with a generally accepted definition of what a connectionist net is, people will have a chance to present competing opinions. possibly some lurking differences will come to the surface and the foundations of connectionism will become more secure. here is a case that i think is fraught with issues (that could be cleared up): any dynamical system that evolves in discrete time can be represented (over a finite time interval) by a feedforward connectionist network. is it fair to say that dynamical systems are connectionist networks? conversely, is it fair to say that feedforward nets are dynamical systems? what are the implications for a time-space trade-off?
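the finite-interval unrolling mentioned in the question above can be made concrete in a few lines of python. this is only an illustrative sketch (the logistic map is an arbitrary choice of update function f, and the names are mine):

```python
# Sketch of the correspondence: a discrete-time dynamical system
# x(t+1) = f(x(t)) unrolled over a finite interval of T steps is a
# T-layer feedforward net that applies the same "layer" f at every depth.

def f(x, r=3.2):
    """One step of the logistic map, standing in for one network layer."""
    return r * x * (1.0 - x)

def unrolled(x0, depth):
    """Forward pass through `depth` identical layers = iterating f."""
    trajectory = [x0]
    for _ in range(depth):
        trajectory.append(f(trajectory[-1]))
    return trajectory

traj = unrolled(0.5, 4)  # recurrent time traded for feedforward depth
```

the trade-off is visible directly: T steps of time become T layers of space, at the cost of duplicating the same connections once per step.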
how much do we have to learn about dyn. systems to do connectionist research?

ok, after all this i guess i have to give my definition of a connectionist network. it is rather involved and it goes like this: "connectionism is not a yes-or-no property. any directed graph (collection of nodes and directed edges) has a connectionism index, defined as the ratio of nr. of edges to nr. of nodes."

PS: has anybody already dealt with the question of defining a CN? references welcome.

Thanasis

From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Wed Mar 15 18:23:24 1989
From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias)
Date: Wed, 15 Mar 89 18:23:24 EST
Subject: cross entropy and training time in Connectionist Nets and HMM's
Message-ID:

these are some random thoughts on the issue of training in HMM and Connectionist networks. i focus on the cross entropy function and follow a different line of thinking than in my paper which i quote further down. this note is the outcome of an exchange between Tony Robinson and me; i thought some netters might be interested. so i want to thank Tony for posing interesting ideas and questions. also thanks to all the people who replied to my request for information on the cross entropy function.

-----------------------

the starting point for this discussion is the following question: "why is HMM training so much faster than Connectionist Networks?" to put the question in perspective, let me first remark that, from a certain point of view, HMM and CN are very similar objects. specifically they use similar architectures to optimize appropriate cost functions. for further explanation of this point, see [Kehagias], also [Kung]. the similarity is even more obvious when CN are used to solve speech recognition problems. the question remains: why, in attempting to solve the same problem, do CN require so much more training?

1. cost functions
-----------------

it appears that a (partial) explanation is the nature of the cost function used in each case. in CN speech recognizers, the cost function of choice is quadratic error (error being the difference of appropriate vectors). however in most of what follows i will consider CN that optimize the cross entropy function. a short discussion of the relationship between cross entropy and square error is included at the end.

in HMM the function MAXIMIZED is likelihood (of the observations). however HMM are a bit more subtle. using the Markov Model, one can write the likelihood of the observations used for training, call it L(q). here q is a vector that contains the transition and emission probabilities (usually called a_ij, b_kj, respectively). to keep the discussion simple, let us consider the only unknown parameters to be the a_ij's. that is, the elements of q are the a_ij's. now, q is a vector, but a more general view of it is that it is a function (specifically a probability density function). so we will consider q as a vector or a function interchangeably. (of course any vector is a function of its index!)

Now, to maximize L is not a trivial task: it is a polynomial of n*T-th order in the elements of q (where n is the order of the Markov model, T the number of observations); furthermore, the elements of q are probabilities and they must satisfy certain positivity and add-up-to-1 conditions.

2. Likelihood maximin, Backward-Forward, EM algorithm
-----------------------------------------------------

so HMM people have found a way to make the optimization problem easier: consider an auxiliary function, call it Q(q,q'), to be presently defined, which can be maximized much more easily. then they prove the remarkable inequality:

(1)   L(q)*log(L(q')/L(q)) >= Q(q,q') - Q(q,q).

the consequence of (1) is the following: we can implement an iterative algorithm that goes as follows:

Step 0: choose q(0)
.....
Step k: choose q(k) such that Q(q(k-1),q(k)) is maximized. if Q(q(k-1),q(k)) - Q(q(k-1),q(k-1)) = 0, terminate. if Q(q(k-1),q(k)) - Q(q(k-1),q(k-1)) > 0, go to step k+1.
.....

REMARKS:

1) observe that no provision is made for the case that Q(q(k-1),q(k)) - Q(q(k-1),q(k-1)) is negative. this is due to the fact that the maximized difference is always nonnegative (q(k) = q(k-1) already achieves zero), as proved in [Baum 1968] or [Dempster].

2) of course, in practice, the termination condition will be replaced by: if Q(q(k-1),q(k)) - Q(q(k-1),q(k-1)) < epsilon, terminate. at every non-terminating step we therefore have

(2)   Q(q(k-1),q(k)) > Q(q(k-1),q(k-1)).

From (1) and (2) and Remark (1) it follows that

(3)   L(q(k)) > L(q(k-1)).

3. Connection of EM with cross entropy and neural networks
----------------------------------------------------------

Now we will discuss the function G and point out the relationship to CN. The function Q(q,q') can be defined in quite a general setting. q, q' are probability densities. as such they are functions themselves; we write q(x), q'(x). x takes values in an appropriate range. e.g., in the HMM model x ranges over all the state transition pairs (i,j), giving the probability of a certain state transition. now, define Q:

(4)   Q(q,q') = sum{over all x} q(x)log(q'(x)).

Then, the difference Q(q,q) - Q(q,q') is:

(5)   Q(q,q) - Q(q,q') = G(q,q') = sum{all x} q(x)log(q(x)/q'(x)).

G is the cross-entropy between q and q', well known to connectionists (and statisticians); that is, a measure of distance between these two probability densities. now we recognize two things:

I. there have been cases where G minimization has been proposed as a CN training procedure; see [Hinton]. In these cases, a desired probability density was known and what was desired was to minimize the distance between desired and actual probability density of the CN output. in some of these cases, there was concurrent maximization of likelihood. this is noted in [Ackley]. it follows necessarily from (1) that minimizing the cross-entropy maximizes the minimum improvement in likelihood.

II. it is clear that the BF algorithm does a similar thing: likelihood maximization, cross entropy minimization.
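Equations (4) and (5) are easy to check numerically. A small sketch (the distributions below are made-up examples, not from the posting):

```python
import math

def Q(q, qp):
    # eq. (4): Q(q,q') = sum over x of q(x) log q'(x)
    return sum(qi * math.log(qpi) for qi, qpi in zip(q, qp))

def G(q, qp):
    # eq. (5): G(q,q') = Q(q,q) - Q(q,q') = sum q(x) log(q(x)/q'(x))
    return Q(q, q) - Q(q, qp)

q  = [0.5, 0.3, 0.2]   # "desired" density (illustrative)
qp = [0.4, 0.4, 0.2]   # "actual" density (illustrative)

assert abs(G(q, q)) < 1e-12   # distance of q from itself is zero
assert G(q, qp) > 0.0         # nonnegative (Gibbs' inequality)

# the two forms in eq. (5) agree term by term
direct = sum(a * math.log(a / b) for a, b in zip(q, qp))
assert abs(G(q, qp) - direct) < 1e-12
```

Note that G is not symmetric in its arguments, so "distance" is meant loosely here, as in the posting.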
as noted in [Baum 1968] and also in [Levinson], the difference q(k) - q(k-1) points in the same direction as grad L(q), evaluated at q(k-1). That is, q(k-1) is changed in the direction of steepest ascent of L. of all the possible steps (choices of q(k)), the one is chosen that minimizes the distance between q(k-1) and q(k) in the cross entropy sense.

4. Comparison in training of HMM and CN:
---------------------------------------

now we can make a comparison of the performance of CN and HMM's. this comparison is between G-optimizing CN's and HMM's; the square-error CN is not discussed here. firstly, we see that the main focus of attention is different in the two cases. in CN we want to minimize cross entropy. in HMM we want to maximize likelihood. however, likelihood maximinimization is an automatic consequence of G minimization for CN's, and local G minimization is built into the BF algorithm. in that sense, the two tasks are very similar and so the question is once again raised: why are HMM's faster to train?

at this point the answers are many and easy. even though HMM's use observations in a nonlinear way, the state vector of the adjoint network (see [Kehagias]) evolves linearly. not so for CN's. the HMM adjoint network is sparsely connected. not necessarily so for the CN (pointed out by [Tony Robinson]). though both cost functions used are nonlinear, the BF is a much more efficient method to optimize the HMM cost function than Back Propagation is for CN's.

the last answer is the really important one. due to the special nature of the Hidden Markov Model, we can use the BF algorithm. this algorithm allows us to take large steps (large changes from q(k-1) to q(k)) in the Euclidean distance, without moving too far away in the cross entropy distance.
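For reference, the likelihood L(q) of section 1 is itself cheap to evaluate by the standard forward recursion, even though it is a sum over exponentially many state paths. A toy sketch (the 2-state model and all its parameters are invented for illustration, not taken from the posting):

```python
# Minimal HMM forward recursion for the likelihood L(q) discussed above.
# a_ij are transition probabilities, b_jk emission probabilities.
a  = [[0.7, 0.3],
      [0.4, 0.6]]          # a[i][j] = P(next state j | state i)
b  = [[0.9, 0.1],
      [0.2, 0.8]]          # b[j][k] = P(symbol k | state j)
pi = [0.5, 0.5]            # initial state distribution
obs = [0, 1, 0]            # observation sequence, T = 3

# forward pass: alpha[j] = P(observations so far, current state = j)
alpha = [pi[j] * b[j][obs[0]] for j in range(2)]
for o in obs[1:]:
    alpha = [sum(alpha[i] * a[i][j] for i in range(2)) * b[j][o]
             for j in range(2)]
likelihood = sum(alpha)

# brute force over all state paths agrees, showing that L(q) really is
# a polynomial in the a_ij and b_jk whose degree grows with T
from itertools import product
brute = 0.0
for path in product(range(2), repeat=len(obs)):
    p = pi[path[0]] * b[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= a[path[t-1]][path[t]] * b[path[t]][obs[t]]
    brute += p
assert abs(likelihood - brute) < 1e-12
```

The recursion costs O(n^2 T) instead of O(n^T); this special structure is part of what the BF algorithm exploits.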
of all the probability distributions, we consider only the ones that are "relevant", in that they are close to the current one; and yet, even though we take conservative steps, we are guaranteed to maximize the minimum improvement in likelihood. indeed the maximin is a conservative attitude. the rationale is the following: "you want to maximize L. you know the steepest ascent direction; you want to go in that direction, but you do not know how far to go. BF will tell you how far you can go (and it will not be an infinitesimal step) so that you maximize the minimum improvement."

another way to look at this is that the Euclidean distance imposes a structure (topology) on the space of probability distributions. the cross entropy distance imposes a different structure, which, apparently, is more relevant to the problem. in contrast, in BP we have not much choice in the change we bring on q. we have control over w, the weights of the connections, and we usually choose them in the steepest descent direction, and small enough that we actually have an improvement. but it is not clear that the cross entropy between distributions imposes a suitable structure on the space of weights. apparently it does not. even a relatively small step in the weight space can change the cost function by a lot. we have to tread more carefully.

of course BF can be used due to the very special structure of the HMM problem (which is probably a good argument for the usefulness of the HM Model). BF is applicable when the cost function is a homogeneous polynomial with additive constraints on the variables (see [Baum 1968]). the CN problem is characterized by harder nonlinearities (e.g. the sigmoid function) which induce a warped relationship between the weights and the cost function. in short, the CN problem is more general and harder.

5. square error cost function
-----------------------------

first a general observation: the square error cost function can be introduced under two assumptions.
in the one case we assume the error to be deterministic and we want to minimize a deterministic sum of square errors (the sum is over all training patterns; the error is the difference between desired and actual response) by appropriate choice of weights. there is nothing probabilistic here. alternatively, we can assume that the training patterns are selected randomly (according to some prob. density) and also that the test patterns will come from the same prob. density, and we choose the weights to minimize expected square error. even though the two points of view are distinct, they are not that different, since in both cases we can define inner products, distance functions etc. and so get a Hilbert space structure that is practically the same for both cases. of course this would involve some ergodicity assumption. at any rate, assume here the probabilistic point of view of square error.

what then are the connections between the two cost functions: cross entropy and expected (or mean) square error? i have seen some remarks on this problem in the literature, but i do not know enough about it at this point. however, judging from training time, i would say that the nonlinear nature of CN with sigmoids again maps the weight space to the cost function in a very warped way. it would be interesting to examine the shape of the cost function contours in the weight space. have such studies been made? visualization seems to be a problem for high dimensional networks.

6. cross entropy maximization and some loose ends
------------------------------------------------

an interesting variation is G maximization. this usually occurs in unsupervised learning. See [Linsker], [Plumbley]. it appears under the name of transinformation maximization, or error information minimization, but these quantities can be interpreted as cross entropy between the joint input-output probability density induced by the CN (for given weights) and the probability density
where input and output have the same marginals, but are independent (so the joint density is a product of the two marginals). i guess a way to explain this in terms of cross entropy is: even though we have no prior information on the best input-output density, there is one density we certainly want to avoid as much as possible, and this is the one where input and output are independent (so the input gives no information as to what the output is). hence we want to maximize the cross entropy distance between this product distribution and the CN induced distribution. there is also a possible interpretation along the lines of the maximum entropy principle. i must say that these interpretations do not seem (yet) to me as appealing as maximum transinformation. however they are possible and indeed statisticians have been considering them for many years now.

another interesting connection is between cross entropy and rate of convergence (obviously rate of convergence is connected to training time). [Ellis] gives an excellent analysis of the connection between rate of convergence and cross entropy. application of his results to computational problems is not obvious. finally, an interesting example (of statistical work that relates to this line of connectionist research) is [Rissanen]; there the linear regression model is considered, which of course can be interpreted as a linear perceptron. in [Rissanen] selection of the optimal model is based on a minmax entropy criterion.

References:
-----------

D. H. Ackley et al.: "A Learning Algorithm for Boltzmann Machines", Cognitive Science 9 (1985).

L. E. Baum & G. R. Sell: "Growth Transformations for Functions on Manifolds", Pacific Journal of Mathematics, Vol. 27, No. 2, 1968.

L. E. Baum et al.: "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains", The Annals of Mathematical Statistics, Vol. 41, No. 1, 1970.

A. P. Dempster et al.: "Maximum Likelihood from Incomplete Data via the EM Algorithm", J. Roy. Stat. Soc., Series B, No. 1, 1977.

R. Ellis: "Entropy, Large Deviations and Statistical Mechanics", Springer, New York, 1985.

G. Hinton: "Connectionist Learning Procedures", Technical Report CMU-CS-87-115 (Carnegie Mellon University), June 1987.

A. Kehagias: "Optimal Control for Training: The Missing Link between HMM and Connectionist Networks", submitted to the 7th Int. Conf. on Math. and Computer Modelling, Chicago, Illinois, August 1989.

S. Y. Kung & J. N. Hwang: "A Unifying Viewpoint of Multilayer Perceptrons and HMM Models", IEEE Int. Symposium on Circuits and Systems, Portland, Oregon, 1989.

S. E. Levinson et al.: "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition", The Bell System Technical Journal, Vol. 62, No. 4, April 1983.

R. Linsker: "Self-Organization in a Perceptual Network", IEEE Computer, Vol. 21, No. 3, March 1988.

M. Plumbley & F. Fallside: "An Information Theoretic Approach to Unsupervised Connectionist Models", Proceedings of the 1988 Connectionist Models Summer School, Pittsburgh, 1988.

J. Rissanen: "Minmax Entropy Estimation of Models for Vector Processes", in Lainiotis & Mehra (eds.), System Advances and Case Studies, Academic Press, New York, 1976.

T. Robinson: personal communication.

From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Thu Mar 16 09:54:52 1989
From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias)
Date: Thu, 16 Mar 89 09:54:52 EST
Subject: HMM?
Message-ID:

with respect to my cross entropy posting, i guess i never said it explicitly: HMM stands for Hidden Markov Model. it is a model widely used in speech research.

Thanasis

From sankar at caip.rutgers.edu Thu Mar 16 09:42:44 1989
From: sankar at caip.rutgers.edu (ananth sankar)
Date: Thu, 16 Mar 89 09:42:44 EST
Subject: questions on kohonen's maps
Message-ID: <8903161442.AA14983@caip.rutgers.edu>

I am interested in the subject of Self Organization and have some questions with regard to Kohonen's algorithm for Self Organizing Feature Maps.
I have tried to duplicate the results of Kohonen for the two dimensional uniform input case, i.e. two inputs. I used a 10 X 10 output grid. The maps that resulted were not as good as reported in the papers.

Questions:

1. Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach.

2. Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features.

3. In Kohonen's book "Self Organization and Associative Memory", Ch. 5, the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation.

4. I have not seen as yet in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited.

5. Can the net become disordered after ordering is achieved at any particular iteration?

I would appreciate any comments, suggestions etc. on the above. Also, so that net mail clutter may be reduced, please respond to sankar at caip.rutgers.edu

Thank you.

Ananth Sankar
Department of Electrical Engineering
Rutgers University, NJ

From KELLY%BROWNCOG.BITNET at mitvma.mit.edu Thu Mar 16 12:12:00 1989
From: KELLY%BROWNCOG.BITNET at mitvma.mit.edu (KELLY%BROWNCOG.BITNET@mitvma.mit.edu)
Date: Thu, 16 Mar 89 12:12 EST
Subject: What is a connectionist net? Here's what it's not.
Message-ID:

What is a connectionist model, you ask? Well, I don't think I can answer that specifically, but I can tell you what it's not. In the first place it *is* a member of a larger class of models called complex systems.
But that doesn't help us either, because nobody really knows what a complex system is. The generally conceived definition has something to do with large numbers of simple, interconnecting units which can perform some type of "cooperative computation". That is, individually the units are so dumb that they can't do anything, but together they can do a lot.

Well, then my claim (I'm really out on a limb here) is that systems with large numbers of very complex, interconnecting units really aren't connectionist models (or even complex systems) at all, no matter how many connections there are or what type of amazing results they achieve. In particular I am referring to the result that Hecht-Nielsen reports in his paper on "Kolmogorov's Mapping Neural Network Theorem" [1987 INNS proceedings?]. There he describes a way of proving that a 2-layered net (one hidden layer) is capable of solving any mapping problem. However, the units in the network are incredibly complex. No longer are we dealing with units that compute threshold functions. The hidden layer units must be able to compute any real, continuous, monotonically increasing function, and the output layer units must be able to compute any *arbitrary* real continuous function. While the fact that a system like this can do some serious computation is interesting (neat, even), it really tells us nothing about connectionist networks.

From ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU Thu Mar 16 22:19:54 1989
From: ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU (thanasis kehagias)
Date: Thu, 16 Mar 89 22:19:54 EST
Subject: credits
Message-ID:

recently i posted a note about training of HMM and Connectionist Networks, where i was not careful enough in giving credit to people that deserved it. let me try to make up for it: i had a very interesting exchange of messages with Tony Robinson, that formed the basis for my note. i received messages with ideas and references from Mark Plumbley, Steven Nowlan, Sue Becker and Sara Solla.
Sara Solla referred me to a paper written by Solla, Esther Levin and Michael Fleisher, that deals with the question of cross entropy. i received a copy of this paper today. it is: "Accelerated Learning in Layered Neural Networks", by S. Solla, E. Levin and M. Fleisher, Complex Systems, Vol. 2, 1988. the paper compares cross entropy and square error and includes a numerical study and a study of the shape of the contours of these cost functions. therefore, a similar question that i posed at the end of my note is at least partly answered.

i also received the revised copy of G. Hinton's report on Connectionist learning procedures, referred to in my note. in this report (Dec. 1987) Hinton has already made a remark directly related to my point of maximinimizing likelihood in the BF algorithm. specifically, he says that (in the context of CN training with a cross entropy cost function) likelihood is maximized when cross entropy is minimized.

i think this is all. if i have missed something, let me know about it.

Thanasis

From ROB%BGERUG51.BITNET at VMA.CC.CMU.EDU Fri Mar 17 09:24:00 1989
From: ROB%BGERUG51.BITNET at VMA.CC.CMU.EDU (Rob A. Vingerhoeds / Ghent State University)
Date: Fri, 17 Mar 89 09:24 N
Subject: Neural Networks Seminar Ghent, 25 april 1989, FINAL ANNOUNCEMENT
Message-ID:

BIRA SEMINAR ON NEURAL NETWORKS
"APPLICATION OF NEURAL NETWORKS IN INDUSTRY, WHEN AND HOW"
25 APRIL 1989
INTERNATIONAL CONGRESS CENTRE GHENT, BELGIUM
FINAL ANNOUNCEMENT

BIRA (Belgian Institute for Control Engineering and Automation) is organising a seminar on the state of the art in Neural Networks. The central theme will be "Application of Neural Networks in Industry, when and how". To give a good and reliable verdict on this theme, some of the most important and leading scientists in this fascinating area have been invited to present a lecture at the seminar and take part in a panel discussion.
The following program is foreseen:

 8.30 -  9.00  Registration
 9.00 -  9.15  Opening on behalf of BIRA - Prof. L. Boullart, Ghent State University
 9.15 - 10.00  Learning Algorithms and applications in A.I. - Prof. Fogelman Soulie, Universite de Paris V
10.00 - 10.30  coffee
10.30 - 11.30  The Neural Network Framework - Prof. B. Kosko, University of Southern California
11.30 - 12.00  Presentation of ANZA+ products, hardware and software - Patrick Dumont, Digilog, France
12.00 - 14.00  lunch / exhibition
14.00 - 15.00  Integration of knowledge-based system and neural network techniques for robotic control - Dr. David Handelman, Princeton, USA
15.00 - 16.00  Application in Image Processing and Pattern Recognition (Neocognitron) - Dr. S. Miyake, ATR, Japan
16.00 - 16.30  tea
16.30 - 17.15  panel discussion over the central theme
17.15 - 17.30  closing and conclusions

The seminar will be held in the same period as the famous Flanders Technology International (F.T.I.) exhibition. The exhibition is well worth a visit both for representatives from industry and for other interested people, so attending the seminar and the exhibition together is doubly worthwhile.

VENUE: International Congress Centre Ghent - Orange Room - Citadelpark, B-9000 Ghent
DATE: Tuesday 25 april 1989
LANGUAGE: The seminar language is English. No translation will be provided.
REGISTRATION FEES: members BIRA/IBRA 12.500 BEF; non-members 15.000 BEF; Teachers/Assistants 7.500 BEF; including coffee/tea, lunch and proceedings. Students can get a special price of 1.500 BEF, which does NOT include lunch.

Tickets for FLANDERS TECHNOLOGY INTERNATIONAL can be obtained at the registration desk. Payments in Belgian Francs only, to be made on receipt of an invoice from the BIRA office. Registration will close on 18 april 1989. Confirmations will NOT be sent. For further information or a printed announcement with a registration form please contact either the BIRA coordinator (address below) or one of us (using e-mail).
You can also use the registration form printed below and send it via e-mail back to us. We will then make sure it reaches BIRA in time.

----------------------------------------------------------------------
REGISTRATION FORM

Tuesday 25 april 1989, I.C.C.-Ghent
BIRA Seminar on NEURAL NETWORKS

NAME:                 ..................................................
FIRST NAME:           ..................................................
ADDRESS:              ..................................................
                      ..................................................
POSITION:             ..................................................
CONCERN OR INSTITUTE: ..................................................
                      ..................................................
TEL:                  ..................................................
FAX:                  ..................................................
-------------------------
Member BIRA/IBRA    : ........ BEF
Non-members         : ........ BEF
Teachers/Assistants : ........ BEF
-------------------------

Please only settle payment upon receipt of an invoice from the BIRA-Office. Please indicate whether the invoice should be addressed to the company or to your personal address.

Date:

Please send back before 17 april 1989. Do NOT use 'REPLY', because in that way everyone on the list will be informed about your plans to come to the seminar and they just might not be interested in it.
----------------------------------------------------------------------

Seminar Coordinators: Rob Vingerhoeds, Leo Vercauteren

BIRA COORDINATOR
L. Pauwels, BIRA-Office
Het Ingenieurshuis
Desguinlei 214
2018 Antwerpen
Belgium
tel: +32-3-216-09-96
fax: +32-3-216-06-89 (attn. BIRA L. Pauwels)

From alexis%yummy at gateway.mitre.org Fri Mar 17 09:46:27 1989
From: alexis%yummy at gateway.mitre.org (alexis%yummy@gateway.mitre.org)
Date: Fri, 17 Mar 89 09:46:27 EST
Subject: What is a connectionist net? Here's what it's not.
In-Reply-To: KELLY%BROWNCOG.BITNET@mitvma.mit.edu's message of Thu, 16 Mar 89 12:12 EST <8903170151.AA26943@gateway.mitre.org>
Message-ID: <8903171446.AA02093@marzipan.mitre.org>

************ Do Not Forward To Any Other BBoards, Etc ************

Just an aside to KELLY%BROWNCOG's note: rather than worry if Hecht-Nielsen's neural net (and I use the term intentionally -- I mean "artificial intelligence" is neither so ...) is really a connectionist model, let me point out a paper/result worth being aware of.

G. Cybenko wrote a very interesting paper which proves that a neural network with *one* hidden layer of nodes (i.e., one more than a perceptron) with a sigmoid transfer function can "uniformly approximate any continuous function with support in the unit hypercube". That is to say you actually can do any mapping with *ONE* hidden layer (albeit often a very very large one).

Cybenko sent the paper to me because of a tirade I went on awhile ago on this bboard, so I don't actually know if it has been published anywhere yet. I'm writing this without his knowledge -- I'm pretty sure he's on this list. G. Cybenko are you out there, and are you willing to say where your paper "Approximation by Superpositions of a Sigmoidal Function" can be found by the hungary masses?

alexis wieland.

************ Do Not Forward To Any Other BBoards, Etc ************

From sontag at fermat.rutgers.edu Sat Mar 18 18:27:29 1989
From: sontag at fermat.rutgers.edu (sontag@fermat.rutgers.edu)
Date: Sat, 18 Mar 89 18:27:29 EST
Subject: ONE HIDDEN LAYER IS ENOUGH -- re "what is a net?" discussion
Message-ID: <8903182327.AA06225@control.rutgers.edu>

This is in response to Alexis Wieland's request: "G. Cybenko are you out there, and are you willing to say where your paper "Approximation by Superpositions of a Sigmoidal Function" can be found by the hungary (sic) masses?"
(Presumably non-Hungarian masses are interested too, so:) The paper by George Cybenko that proves this theorem (a neural network with one hidden layer of nodes with a fixed sigmoid transfer function can uniformly approximate any continuous function) is scheduled to appear in MATHEMATICS OF CONTROL, SIGNALS, AND SYSTEMS, Vol.2 (1989), Number 4. Your library should have this journal, which specializes in the formal mathematical analysis of problems related to signal processing and systems. (The journal has published many other papers that should be relevant to theoretical connectionist research, such as papers on iterated projection methods, estimation, interpolation techniques, identification, and adaptive control.) If your library doesn't yet subscribe, you might as well provide them with the following info: MATHEMATICS OF CONTROL, SIGNALS, AND SYSTEMS Springer-Verlag New York, Inc ISSN 0932-4194, Title # 498 In North America, order from: Springer-Verlag New York, Inc Journal Fulfillment Services 44 Hartz Way, Secaucus, NJ 07094 (Volume 2, 1989 ... $179.00 incl. p&h) Outside NA, order from: Springer-Verlag Heidelberger Platz 3 D-1000 Berlin 33, FRG (Volume 2, 1989 ... DM 348.- incl. p&h) -bradley dickinson and eduardo d. sontag, co-Managing eds. From terry%sdbio2 at ucsd.edu Sat Mar 18 21:11:09 1989 From: terry%sdbio2 at ucsd.edu (Terry Sejnowski) Date: Sat, 18 Mar 89 18:11:09 PST Subject: ONE HIDDEN LAYER IS ENOUGH -- re "what is a net?" discussion Message-ID: <8903190211.AA17912@sdbio2.UCSD.EDU> Hal White in the Economics Department at UCSD has also proved that one hidden layer can uniformly approximate smooth mappings. He has gone on to prove the even more interesting theorem that it is possible to learn the mapping. Write to him for a preprint: Hal White Department of Economics UCSD San Diego, CA 92093 Two related papers that are in press in Neural Computation: What size net gives valid generalization? 
by Eric Baum and David Haussler

A proposal for more powerful learning algorithms, by Eric Baum.

For preprints write to:
Eric Baum
Department of Physics
Princeton University
Princeton, NJ 08540

Terry Sejnowski

-----

From chrisley.pa at Xerox.COM Mon Mar 20 14:25:00 1989
From: chrisley.pa at Xerox.COM (chrisley.pa@Xerox.COM)
Date: 20 Mar 89 11:25 PST
Subject: questions on kohonen's maps
In-Reply-To: ananth sankar's message of Thu, 16 Mar 89 09:42:44 EST
Message-ID: <890320-112612-6136@Xerox>

Ananth Sankar recently asked some questions about Kohonen's feature maps. As I have worked on these issues with Kohonen, I feel like I might be able to give some answers, but standard disclaimers apply: I cannot be certain that Kohonen would agree with all of the following. Also, I do not have my copy of his book with me, so I cannot be more specific about references.

Questions:

1. Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach.

Although there is probably more than one correct, task-independent gain or neighborhood function, Kohonen does mention constraints that all of them should meet. For example, both functions should decrease to zero over time. I do not know of any tweaking; Kohonen usually determines a number of iterations and then decreases the gain linearly. If you call this tweaking, then your idea of domain-independent parameters might be a sort of holy grail, since it does not seem likely that we are going to find a completely parameter-free learning algorithm that will work in every domain.

2. Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features.
As far as I know, Kohonen has used the same type of gain and neighborhood functions for all of his map demonstrations. These demonstrations, which have been shown via an animated film at several major conferences, demonstrate maps learning the distribution in cases where 1) the dimensionality of the network topology and the input space mismatch, e.g., where the network is 2d and the distribution is a 3d 'cactus'; and 2) the distribution is not uniform. The algorithm was developed with these 2 cases in mind, so it is no surprise that the results are good for them as well.

3. In Kohonen's book "Self Organization and Associative Memory", Ch. 5, the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation.

That's right. And Kohonen usually uses the Euclidean distance metric, although other ones can be used (which he discusses in the book). Furthermore, there have been independent efforts to normalize weights in Kohonen maps so that the dot product measure can be used. If you have any doubts about the suitability of the Euclidean metric, as your question seems to imply, express them. It is an interesting issue.

4. I have not seen as yet in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited.

The primary interest in maps, I believe, came from a desire to display high-dimensional information in low dimensional spaces, which are more easily apprehended.
But there is evidence that there are other uses as well: 1) Kohonen has published results on using maps for phoneme recognition, where the topology-preservation plays a significant role (such maps are used in the Otaniemi Phonetic Typewriter featured in, I think, Computer magazine a year or two ago); 2) work has been done on using the topology to store sequential information, which seems to be a good idea if you are dealing with natural signals that can only temporally shift from a state to similar states; 3) several people have followed Kohonen's suggestion of using maps for adaptive kinematic representations for robot control (the work on Murphy, mentioned on this net a month or so ago, and the work being done at Carlton (sp) University by Darryl Graf are two good examples). In short, just look at some ICNN or INNS proceedings, and you'll find many papers where researchers found Kohonen maps to be a good place from which to launch their own studies. 5. Can the net become disordered after ordering is achieved at any particular iteration? Of course, this is theoretically possible, and is almost certain if at some point the distribution of the mapped function changes. But this brings up the difficult question: what is the proper ordering in such a case? Should a net try to integrate both past and present distributions, or should it throw away the past and concentrate on the present? I think most nn researchers would want a little of both, with maybe some kind of exponential decay in the weights. But in many applications of maps, there is no chance of the distribution changing: it is fixed, and iterations are over the same test data each time. In this case, I would guess that the ordering could not become disrupted (at least for simple distributions and a net of adequate size), but I realise that there is no proof of this, and the terms 'simple' and 'adequate' are lacking definition. But that's life in nnets for you! If anyone has any more questions, feel free.
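Since several of the answers above turn on the same few ingredients (a Euclidean winner search, a gain that decreases linearly to zero, a neighbourhood that shrinks over time), a minimal sketch may help. This is my own toy rendering, not Kohonen's code; the grid size, iteration count, and both schedules are illustrative assumptions:

```python
# Toy Kohonen feature map: a 10 x 10 grid learning a 2-D uniform distribution.
# All parameter choices below are illustrative assumptions, not Kohonen's.
import random

random.seed(0)
GRID, DIM, T = 10, 2, 2000

# one weight vector per output unit, initialised at random in [0, 1)
w = [[[random.random() for _ in range(DIM)] for _ in range(GRID)]
     for _ in range(GRID)]

for t in range(T):
    x = [random.random() for _ in range(DIM)]          # uniform 2-D input
    a = 0.5 * (1.0 - t / T)                            # gain decreasing linearly to zero
    radius = max(1, int((GRID // 2) * (1.0 - t / T)))  # shrinking neighbourhood
    # winner: unit whose weight vector is closest to x (Euclidean, not dot product)
    wi, wj = min(((i, j) for i in range(GRID) for j in range(GRID)),
                 key=lambda ij: sum((c - xc) ** 2
                                    for c, xc in zip(w[ij[0]][ij[1]], x)))
    # move the winner and its topological neighbours toward the input
    for i in range(GRID):
        for j in range(GRID):
            if abs(i - wi) <= radius and abs(j - wj) <= radius:
                w[i][j] = [c + a * (xc - c) for c, xc in zip(w[i][j], x)]
```

With a schedule like this there is no per-iteration hand tweaking: the only free choices are the total number of iterations and the initial gain and radius, which is essentially the "fix the iteration count, then decrease linearly" recipe described above.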
Ron Chrisley Xerox PARC System Sciences Lab 3333 Coyote Hill Road Palo Alto, CA 94304 USA chrisley.pa at xerox.com tel: (415) 494-4728 OR New College Oxford OX1 3BN UK chrisley at vax.oxford.ac.uk tel: (865) 279-492 From moody-john at YALE.ARPA Tue Mar 21 16:11:08 1989 From: moody-john at YALE.ARPA (john moody) Date: Tue, 21 Mar 89 16:11:08 EST Subject: two research reports available Message-ID: <8903212107.AA03190@NEBULA.SUN3.CS.YALE.EDU> ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* FAST LEARNING IN MULTI-RESOLUTION HIERARCHIES John Moody Research Report YALEU/DCS/RR-681, February 1989 ABSTRACT A class of fast, supervised learning algorithms is presented. They use local representations, hashing, and multiple scales of resolution to approximate functions which are piece-wise continuous. Inspired by Albus's CMAC model, the algorithms learn orders of magnitude more rapidly than typical implementations of back propagation, while often achieving comparable qualities of generalization. Furthermore, unlike most traditional function approximation methods, the algorithms are well suited for use in real-time adaptive signal processing. Unlike simpler adaptive systems, such as linear predictive coding, the adaptive linear combiner, and the Kalman filter, the new algorithms are capable of efficiently capturing the structure of complicated non-linear systems. As an illustration, the algorithm is applied to the prediction of a chaotic time series. NOTE: This research report will appear in Advances in Neural Information Processing Systems, edited by David Touretzky, to be published in April 1989 by Morgan Kaufmann Publishers, Inc. The author gratefully acknowledges financial support under ONR grant N00014-89-J-1228, ONR grant N00014-86-K-0310, AFOSR grant F49620-88-C0025, and a Purdue Army subcontract.
*********************************************************** FAST LEARNING IN NETWORKS OF LOCALLY-TUNED PROCESSING UNITS John Moody and Christian J. Darken Research Report YALEU/DCS/RR-654, October 1988, Revised March 1989 ABSTRACT We propose a network architecture which uses a single internal layer of locally-tuned processing units to learn both classification tasks and real-valued function approximations. We consider training such networks in a completely supervised manner, but abandon this approach in favor of a more computationally efficient hybrid learning method which combines self-organized and supervised learning. Our networks learn faster than back propagation for two reasons: the local representations ensure that only a few units respond to any given input, thus reducing computational overhead, and the hybrid learning rules are linear rather than nonlinear, thus leading to faster convergence. Unlike many existing methods for data analysis, our network architecture and learning rules are truly adaptive and are thus appropriate for real-time use. NOTE: This research report will appear in Neural Computation, a new journal edited by Terry Sejnowski and published by MIT Press. The work was supported by ONR grant N00014-86-K-0310, AFOSR grant F49620-88-C0025, and a Purdue Army subcontract.
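The two-stage idea in this abstract (self-organized placement of the locally-tuned units, then a purely linear supervised rule for the output layer) can be caricatured in a few lines. This is only my reading of the general approach, not the authors' algorithm; the task, unit count, Gaussian width, and learning rates are invented for illustration:

```python
# Hybrid-learning caricature: competitive placement of Gaussian centers
# (self-organized), then LMS on the linear output layer (supervised).
# All parameters here are invented, not from the Moody/Darken report.
import math
import random

random.seed(2)

def target(x):                      # function to approximate
    return math.sin(2 * math.pi * x)

xs = [random.random() for _ in range(200)]

# Stage 1 (self-organized): competitive learning moves K centers onto the data.
K, width = 10, 0.1
centers = [random.random() for _ in range(K)]
for x in xs * 5:
    c = min(range(K), key=lambda k: abs(centers[k] - x))
    centers[c] += 0.05 * (x - centers[c])

def phi(x):                         # locally-tuned responses: few units fire per input
    return [math.exp(-((x - c) / width) ** 2) for c in centers]

# Stage 2 (supervised, linear): LMS fit of the output weights only.
w = [0.0] * K
for _ in range(100):
    for x in xs:
        h = phi(x)
        err = target(x) - sum(wi * hi for wi, hi in zip(w, h))
        w = [wi + 0.1 * err * hi for wi, hi in zip(w, h)]

mse = sum((target(x) - sum(wi * hi for wi, hi in zip(w, phi(x)))) ** 2
          for x in xs) / len(xs)
```

Because only the output layer is trained in a supervised way, and its rule is linear, each step is cheap and convergence is fast; that is the speedup the abstract attributes to the hybrid method.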
*********************************************************** Copies of both reports can be obtained by sending a request to: Judy Terrell Yale Computer Science PO Box 2158 Yale Station New Haven, CT 06520 (203)432-1200 e-mail: terrell at cs.yale.edu terrell at yale.arpa terrell at yalecs.bitnet ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* ------- From chrisley.pa at Xerox.COM Thu Mar 23 14:35:00 1989 From: chrisley.pa at Xerox.COM (chrisley.pa@Xerox.COM) Date: 23 Mar 89 11:35 PST Subject: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST Message-ID: <890323-113527-4949@Xerox> One further note about Ananth Sankar's questions about Kohonen maps: A friend of mine, Tony Bell, tells me (and Ananth) that Helge Ritter has a "neat set of expressions for the learning rate and neighbourhood size parameters... and he also proves something about convergence elsewhere." Unfortunately, I do not as yet have a reference for the papers, but I have liked Ritter's work in the past, so I thought people on the net might be interested. From jose at tractatus.bellcore.com Wed Mar 22 10:44:09 1989 From: jose at tractatus.bellcore.com (Stephen J Hanson) Date: Wed, 22 Mar 89 10:44:09 EST Subject: technical report available Message-ID: <8903221544.AA14583@tractatus.bellcore.com> Princeton Cognitive Science Lab Technical Report: CSL36, February, 1989. COMPARING BIASES FOR MINIMAL NETWORK CONSTRUCTION WITH BACK-PROPAGATION Stephen Jos'e Hanson Bellcore and Princeton Cognitive Science Laboratory and Lorien Y. Pratt Rutgers University ABSTRACT Rumelhart (1987) has proposed a method for choosing minimal or "simple" representations during learning in Back-propagation networks.
This approach can be used to (a) dynamically select the number of hidden units, (b) construct a representation that is appropriate for the problem, and (c) thus improve the generalization ability of Back-propagation networks. The method Rumelhart suggests involves adding penalty terms to the usual error function. In this paper we introduce Rumelhart's minimal networks idea and compare two possible biases on the weight search space. These biases are compared in both simple counting problems and a speech recognition problem. In general, the constrained search does seem to minimize the number of hidden units required, with an expected increase in local minima. To appear in Advances in Neural Information Processing Systems, D. Touretzky, Ed., 1989. Research was jointly sponsored by Princeton CSL and Bellcore. REQUESTS FOR THIS TECHNICAL REPORT SHOULD BE SENT TO laura at clarity.princeton.edu Please do not reply to this message or forward it. Thank you. From lwyse at bucasb.BU.EDU Tue Mar 21 13:59:02 1989 From: lwyse at bucasb.BU.EDU (lwyse@bucasb.BU.EDU) Date: Tue, 21 Mar 89 13:59:02 EST Subject: questions on kohonen's maps In-Reply-To: connectionists@c.cs.cmu.edu's message of 20 Mar 89 23:47:09 GMT Message-ID: <8903211859.AA04927@cochlea.bu.edu> What does "ordering" mean when you're projecting inputs to a lower-dimensional space? For example, with the "Peano"-type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain. In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes.
-lonce From gblee at CS.UCLA.EDU Fri Mar 24 13:25:07 1989 From: gblee at CS.UCLA.EDU (Geunbae Lee) Date: Fri, 24 Mar 89 10:25:07 PST Subject: questions on konhonen's map Message-ID: <8903241825.AA25252@maui.cs.ucla.edu> >What does "ordering" mean when you're projecting inputs to a lower dimensional >space?
It means topological ordering. >For example, the "Peano" type curves that result from a one-D >neighborhood learning a 2-D input distribution, it is obviously NOT >true that nearby points in the input space maximally activate nearby >points on the neighborhood chain. It depends on what you mean by "nearby". If it is nearby in a relative sense (in topological relation), not an absolute sense, then the nearby points in the input space DO maximally activate nearby points on the neighborhood chain. --Geunbae Lee AI Lab, UCLA From LIN2 at ibm.com Fri Mar 24 15:02:32 1989 From: LIN2 at ibm.com (Ralph Linsker) Date: 24 Mar 89 15:02:32 EST Subject: Technical report available Message-ID: <032489.150233.lin2@ibm.com> ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* The following report (IBM Research Report RC 14195, Nov. 1988) is available upon request to: lin2 @ ibm.com It will appear in: Advances in Neural Information Processing Systems 1, ed. D. S. Touretzky (San Mateo, CA: Morgan Kaufmann), April 1989. "An Application of the Principle of Maximum Information Preservation to Linear Systems," Ralph Linsker This paper addresses the problem of determining the weights for a set of linear filters (model "cells") so as to maximize the ensemble-averaged information that the cells' output values jointly convey about their input values, given the statistical properties of the ensemble of input vectors. The quantity that is maximized is the Shannon information rate, or equivalently the average mutual information between input and output.* Several models for the role of processing noise are analyzed, and the biological motivation for considering them is described. For simple models in which nearby input signal values (in space or time) are correlated, the cells resulting from this optimization process include center-surround cells and cells sensitive to temporal variations in the input signal. *The possible relation between this optimization principle and the organization of a sensory processing system is discussed in: R. Linsker, Computer 21(3), 105-117 (March 1988). If you would like a reprint of the Computer article, please so note. From chrisley.pa at Xerox.COM Fri Mar 24 17:53:00 1989 From: chrisley.pa at Xerox.COM (chrisley.pa@Xerox.COM) Date: 24 Mar 89 14:53 PST Subject: questions on kohonen's maps In-Reply-To: lwyse@bucasb.BU.EDU's message of Tue, 21 Mar 89 13:59:02 EST Message-ID: <890324-145332-8519@Xerox> Lonce (lwyse at bucasb.BU.EDU) writes: "What does "ordering" mean when you're projecting inputs to a lower dimensional space? For example, the "Peano" type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain." It is not true that nearby points in input space are always mapped to nearby points in the output space when the mapping is dimensionality-reducing, agreed. But 'ordering' still makes sense. The map is topology-preserving if the dependency is in the other direction, i.e., if nearby points in output space are always activated by nearby points in input space. Lonce goes on to say: "In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes." I agree that topology preservation is not necessarily of utmost importance, but it may be useful in some applications, such as the ones I mentioned a few messages back (phoneme recognition, inverse kinematics, etc.).
Also, there is 1) the interest in properties of self-organizing systems in themselves, even though an application can't be immediately found; and 2) the observation that, for some reason, the brain seems to use topology-preserving maps (with the one-way dependency I mentioned above), which, although they *could* be computationally unnecessary or even disadvantageous, are probably in fact, nature being what she is, good solutions to tough real-time problems. Ron Chrisley After April 14th, please send personal email to Chrisley at vax.ox.ac.uk From ken at phyb.ucsf.EDU Sun Mar 26 01:17:59 1989 From: ken at phyb.ucsf.EDU (Ken Miller) Date: Sat, 25 Mar 89 22:17:59 pst Subject: Normalization of weights in Kohonen algorithm Message-ID: <8903260617.AA08352@phyb> re point 3 of recent posting about Kohonen algorithm: "3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights." the algorithm du_{ij}/dt = a(t)[e_j(t) - u_{ij}(t)], i in N_c, where u = weights, e is the input pattern, and N_c is the topological neighborhood of the maximally responding unit, should I believe be written du_{ij}/dt = a(t)[ e_j(t)/\sum_k(e_k(t)) - u_{ij}(t)/\sum_k(u_{ik}(t)) ], i in N_c. That is, the change should be such as to move the jth synaptic weight on the ith cell, as a PROPORTION of all the synaptic weights on the ith cell, in the direction of matching the PROPORTION of input which was incoming on the jth line. Note that in this case \sum_j du_{ij}(t)/dt = 0, so weights remain normalized in the sense that the sum over each cell remains constant. If inputs are normalized to sum to 1 (\sum_k(e_k(t)) = 1) then the first denominator can be omitted. If weights begin normalized to sum to 1 on each cell (\sum_k(u_{ik}(t)) = 1 for all i) then weights will remain normalized to sum to 1, hence the second denominator can be omitted. Perhaps Kohonen was assuming these normalizations and hence dispensing with the denominators?
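The conservation property of the proportion-matching rule is easy to check numerically. Here is a discrete-time sketch for a single cell (the variable names and step size are mine, purely for illustration):

```python
# One cell's weights under the proportion-matching rule above, discretised:
#   u_ij <- u_ij + a * [ e_j / sum_k e_k  -  u_ij / sum_k u_ik ]
# Names and the step size a = 0.1 are illustrative, not from Kohonen's book.
def normalized_update(u, e, a):
    se, su = sum(e), sum(u)
    return [uj + a * (ej / se - uj / su) for uj, ej in zip(u, e)]

u = [0.2, 0.3, 0.5]     # weights on cell i, summing to 1
e = [1.0, 3.0, 6.0]     # input pattern (proportions 0.1, 0.3, 0.6)
for _ in range(100):
    u = normalized_update(u, e, 0.1)
# Since sum_j du_ij/dt = a * (1 - 1) = 0, sum(u) stays 1.0 throughout,
# while the weight proportions relax toward the input proportions.
```

So with both denominators present the summed weight on each cell is exactly conserved, which is the sense of "normalized" Ken describes; drop a denominator only under the corresponding normalization assumption.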
ken miller (ken at phyb.ucsf.edu) From nowlan at ai.toronto.edu Tue Mar 28 09:41:36 1989 From: nowlan at ai.toronto.edu (Steven J. Nowlan) Date: Tue, 28 Mar 89 09:41:36 EST Subject: training time in HMM and CN Message-ID: <89Mar28.094139est.10529@ephemeral.ai.toronto.edu> Two comments on Thansis' post on the relative training speed of HMM vs CN for sequential problems such as speech recognition: 1. The BF algorithm is quite highly optimized, while vanilla BP doesn't implement anything that a numerical analyst would consider a real descent procedure (not even steepest descent). If you were to use a reasonably powerful numerical optimization technique, such as one of the Broyden methods, you might find CN convergence extremely fast. Ray Watrous has in fact shown this sort of speedup for speech problems [1]. 2. A more subtle, but probably more important, difference is the issue of how targets are specified over an input sequence. The BF algorithm specifies targets for intermediate steps in an input sequence based on expectations of the final outcome of that sequence, collected from many similar sequences. It is not clear how to specify output targets for intermediate points of an input sequence in a CN, although Watrous has shown that intelligent choice of such targets can markedly improve CN convergence and performance. Of interest in this regard is the work by Sutton on Temporal Difference methods [2]. One can view this work as specifying a target function over a sequence in a dynamical way, so that the target function reflects the experience of the system to date in a clever way. Sutton [2] has shown an equivalence between one form of linear TD method and the maximum likelihood estimates of the parameters for an absorbing Markov chain model of the same process. This seems much closer in flavour to what the BF algorithm is doing, and when applied to a non-linear system may in fact be an interesting generalization of BF.
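For readers who have not seen the TD methods mentioned in point 2, here is a toy sketch of tabular TD(0) on an absorbing random walk. The task, step size, and episode count are my own illustrative choices, not Sutton's code; the point is only that the target for each intermediate state is built from the prediction at the following state, i.e., the target function is specified dynamically from experience:

```python
# TD(0) on a 5-state absorbing random walk: learn to predict the probability
# of absorbing on the right (outcome 1) rather than the left (outcome 0).
# All parameter choices are illustrative.
import random

random.seed(1)
N, alpha = 5, 0.1
V = [0.5] * N                      # initial predictions for states 0..4

for episode in range(2000):
    s = N // 2                     # start in the middle state
    while True:
        s2 = s + random.choice((-1, 1))
        if s2 < 0:                 # absorbed left: final outcome 0
            V[s] += alpha * (0.0 - V[s])
            break
        if s2 >= N:                # absorbed right: final outcome 1
            V[s] += alpha * (1.0 - V[s])
            break
        # intermediate step: the *next* prediction serves as the target
        V[s] += alpha * (V[s2] - V[s])
        s = s2
```

No intermediate targets are ever supplied, yet every state acquires one from the statistics of where its successors lead; V comes to approximate the absorption probabilities of the underlying Markov chain, which is the flavour of equivalence Sutton proves for the linear case.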
Comments and requests for clarification should be directed to me, not to Connectionists, please. - Steve Nowlan nowlan at ai.toronto.edu References: [1] Watrous, Raymond L. "Speech Recognition Using Connectionist Networks", TR MS-CIS-88-96, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, 1988. [2] Sutton, Richard S. "Learning to Predict by the Methods of Temporal Differences", GTE Technical Report TR87-509.1, GTE Laboratories Inc., Waltham, Mass., 1987. From cfields at NMSU.Edu Tue Mar 28 19:56:24 1989 From: cfields at NMSU.Edu (cfields@NMSU.Edu) Date: Tue, 28 Mar 89 17:56:24 MST Subject: No subject Message-ID: <8903290056.AA14581@NMSU.Edu> Call for Participants / Call for Abstracts Symbolic Problem Solving in Noisy, Novel, and Uncertain Task Environments 20-21 August, 1989 (tentative), Detroit, MI, USA An IJCAI-89 Workshop, Sponsored by AAAI Goals. Brittleness in the face of noise, novelty, and uncertainty is a well-known failing of symbolic problem solvers. The goals of this Workshop are to characterize the features of task environments that cause brittleness, to investigate mechanisms for decreasing the brittleness of symbolic problem solvers, and to review case histories of implemented systems that function in task environments high in noise, novelty, and data of uncertain relevance. Topics of interest for the Workshop include the following. Analysis of task environments: Definitions of noise, novelty, and uncertain relevance; exploration of related concepts in general systems theory or logic; parameters for characterizing task environments; knowledge engineering strategies. Mechanisms for addressing noise and novelty: Plasticity and learning; constructive problem solving; fragmentation of knowledge structures; dynamic modification of rules, schemata, or cases; coherence maintenance; adaptive control mechanisms.
Representations: Data structures allowing dynamic abstraction and modification; representation of ``unstructured'' knowledge; knowledge implicit in control or learning procedures; ordering of knowledge structures; tradeoffs between explicit and implicit knowledge representation. Implementation issues: Implementing symbolic problem solvers on parallel machines; concurrency control strategies; integrating symbolic systems with artificial neural networks; general systems integration. Researchers interested in participating in the Workshop are invited to submit abstracts describing work in any of these topic areas. Format. All participants will present their current work, either as a brief oral report or as a poster. Most presentations will be posters, as these provide the greatest opportunity for presentation and discussion of technical details. Presentations will be on the first day of the Workshop, followed by discussions in working groups organized by application domain and a panel discussion on the second day. Attendance at IJCAI Workshops is limited to fifty participants. Participants not registered for IJCAI must pay a $50/day fee. Abstract Submission. Please submit a 1 page abstract of the work to be presented, together with a cover letter summarizing previous work in relevant areas and expected contribution to the Workshop, to Mike Coombs, Box 30001/3CRL, New Mexico State University, Las Cruces, NM 88003-0001 USA, by 15 May 1989. Authors will be notified as to acceptance by 1 June 1989. Accepted abstracts will be distributed at the Workshop. A volume collecting selected papers from the Workshop is planned; papers for this volume will be solicited at the Workshop. Organizers. Mike Coombs and Chris Fields (NMSU), Russ Frew (GE), David Goldberg (Alabama), Jim Reggia (Maryland). Points of contact: Mike Coombs, 505-646-5757, mcoombs at nmsu.edu; Chris Fields, 505-646-2848, cfields at nmsu.edu. 
From elman%amos at ucsd.edu Wed Mar 29 00:30:44 1989 From: elman%amos at ucsd.edu (Jeff Elman) Date: Tue, 28 Mar 89 21:30:44 PST Subject: 1990 Connectionist Summer School announcement Message-ID: <8903290530.AA23241@amos.UCSD.EDU> March 28, 1989 PRELIMINARY ANNOUNCEMENT CONNECTIONIST SUMMER SCHOOL / SUMMER 1990 UCSD La Jolla, California The next Connectionist Summer School will be held at the University of California, San Diego in June 1990. This will be the third session in the series, which was held at Carnegie-Mellon in the summers of 1986 and 1988. The summer school will offer courses in a variety of areas of connectionist modelling, with emphasis on computational neuroscience, cognitive models, and hardware implementation. In addition to full courses, there will be a series of shorter tutorials, colloquia, and public lectures. Proceedings of the summer school will be published the following fall. As in the past, participation will be limited to graduate students enrolled in Ph.D. programs (full- or part-time). Admission will be on a competitive basis. We hope to have sufficient funding to subsidize tuition and housing. THIS IS A PRELIMINARY ANNOUNCEMENT. Further details will be announced over the next several months. Terry Sejnowski Jeff Elman UCSD/Salk UCSD Geoff Hinton Dave Touretzky Toronto CMU hinton at ai.toronto.edu touretzky at cs.cmu.edu From niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Wed Mar 29 09:17:49 1989 From: niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (Mahesan Niranjan) Date: Wed, 29 Mar 89 09:17:49 BST Subject: Missing link etc... Message-ID: <23751.8903290817@dsl.eng.cam.ac.uk> Some recent papers and postings on this network compare HMMs and Multi-layer neural networks. Here is something I find missing in these discussions. In speech pattern processing, HMMs make an inherent assumption about the time series: that it can be chopped up into a sequence of piecewise stationary regions.
Thus, an HMM places break-points in the transition regions of the signal and models the steady regions by the statistical parameters of individual states. For speech signals, this is a bad assumption (human speech production is not at all like this) - but the recognisers somehow seem to work!! In neural networks (with or without feedback), what is the equivalent assumption about the time evolution of the signal? niranjan From ersoy at ee.ecn.purdue.edu Wed Mar 29 12:22:20 1989 From: ersoy at ee.ecn.purdue.edu (Okan K Ersoy) Date: Wed, 29 Mar 89 12:22:20 EST Subject: No subject Message-ID: <8903291722.AA07623@ee.ecn.purdue.edu> CALL FOR PAPERS AND REFEREES HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES - 23 NEURAL NETWORKS AND RELATED EMERGING TECHNOLOGIES KAILUA-KONA, HAWAII - JANUARY 3-6, 1990 The Neural Networks Track of HICSS-23 will contain a special set of papers focusing on a broad selection of topics in the area of Neural Networks and Related Emerging Technologies. The presentations will provide a forum to discuss new advances in learning theory, associative memory, self-organization, architectures, implementations and applications. Papers are invited that may be theoretical, conceptual, tutorial or descriptive in nature. Those papers selected for presentation will appear in the Conference Proceedings, which is published by the Computer Society of the IEEE. HICSS-23 is sponsored by the University of Hawaii in cooperation with the ACM, the Computer Society, and the Pacific Research Institute for Information Systems and Management (PRIISM). Submissions are solicited in: Supervised and Unsupervised Learning Associative Memory Self-Organization Architectures Optical, Electronic and Other Novel Implementations Optimization Signal/Image Processing and Understanding Novel Applications INSTRUCTIONS FOR SUBMITTING PAPERS Manuscripts should be 22-26 typewritten, double-spaced pages in length. Do not send submissions that are significantly shorter or longer than this.
Papers must not have been previously presented or published, nor currently submitted for journal publication. Each manuscript will be put through a rigorous refereeing process. Manuscripts should have a title page that includes the title of the paper, full name of its author(s), affiliation(s), complete physical and electronic address(es), telephone number(s) and a 300-word abstract of the paper. DEADLINES Six copies of the manuscript are due by June 10, 1989. Notification of accepted papers by September 1, 1989. Accepted manuscripts, camera-ready, are due by October 3, 1989. SEND SUBMISSIONS AND QUESTIONS TO O. K. Ersoy H. H. Szu Purdue University Naval Research Laboratories School of Electrical Engineering Code 5709 W. Lafayette, IN 47907 4555 Overlook Ave., SE (317) 494-6162 Washington, DC 20375 E-Mail: ersoy at ee.ecn.purdue (202) 767-2407 From lina at wheaties.ai.mit.edu Wed Mar 29 13:23:33 1989 From: lina at wheaties.ai.mit.edu (Lina Massone) Date: Wed, 29 Mar 89 13:23:33 EST Subject: No subject Message-ID: <8903291823.AA09549@gelatinosa.ai.mit.edu> ********* FOR CONNECTIONISTS ONLY - PLEASE DO NOT FORWARD *********** **************** TO OTHER BBOARDS/ELECTRONIC MEDIA ******************* TECHNICAL REPORT AVAILABLE A NEURAL NETWORK MODEL FOR LIMB TRAJECTORY FORMATION Lina Massone and Emilio Bizzi Dept. of Brain and Cognitive Sciences Massachusetts Institute of Technology This paper deals with the problem of representing and generating unconstrained aiming movements of a limb by means of a neural network architecture. The network produced a time trajectory of a limb from a starting posture toward a target specified by a sensory stimulus. Thus the network performed a sensory-motor transformation. The experimenters imposed a bell-shaped velocity profile on the trajectory. This type of profile is characteristic of most movements performed by biological systems. We investigated the generalization capabilities of the network as well as its internal organization.
Experiments performed during learning and on the trained network showed that: (i) the task could be learned by a three-layer sequential network; (ii) the network successfully generalized in trajectory space and adjusted the velocity profiles properly; (iii) the same task could not be learned by a linear network; (iv) after learning, the internal connections became organized into inhibitory and excitatory zones and encoded the main features of the training set; (v) the model was robust to noise on the input signals; (vi) the network exhibited attractor-dynamics properties; (vii) the network was able to solve the motor-equivalence problem. A key feature of this work is the fact that the neural network was coupled to a mechanical model of a limb in which muscles are represented as springs. With this representation the model solved the problem of motor redundancy. A short version of this paper covering only part of the described research was mailed in February to IJCNN. The full report has been submitted to Biological Cybernetics. All requests should be addressed to: lina at wheaties.ai.mit.edu From marchman%amos at ucsd.edu Wed Mar 29 19:20:36 1989 From: marchman%amos at ucsd.edu (Virginia Marchman) Date: Wed, 29 Mar 89 16:20:36 PST Subject: Technical Report Available Message-ID: <8903300020.AA01129@amos.UCSD.EDU> The following Technical Report (#8902) is available from the Center for Research in Language. (Please do not forward.) ******************************************************************* Pattern Association in a Back Propagation Network: Implications for Child Language Acquisition Kim Plunkett Virginia Marchman University of Aarhus, Denmark University of California, San Diego Abstract A 3-layer back propagation network is used to implement a pattern association task which learns mappings that are analogous to the present and past tense forms of English verbs, i.e., arbitrary, identity, vowel change, and suffixation mappings. 
The degree of correspondence between connectionist models of tasks of this type (Rumelhart & McClelland, 1986; 1987) and children's acquisition of inflectional morphology has recently been highlighted in discussions of the general applicability of PDP to the study of human cognition and language (Pinker & Mehler, 1988). In this paper, we attempt to eliminate many of the shortcomings of the R&M work and adopt an empirical, comparative approach to the analysis of learning (i.e., hit rate and error type) in these networks. In all of our simulations, the network is given a constant 'diet' of input stems -- that is, discontinuities are not introduced into the learning set at any point. Four sets of simulations are described in which input conditions (class size and token frequency) and the presence/absence of phonological subregularities are manipulated. First, baseline simulations chart the initial computational constraints of the system and reveal complex "competition effects" when the four verb classes must be learned simultaneously. Next, we explore the nature of these competitions given different type (class sizes) and token frequencies (# of repetitions). Several hypotheses about input to children are tested, from dictionary counts and production corpora. Results suggest that relative class size determines which "default" transformation is employed by the network, as well as the frequency of overgeneralization errors (both "pure" and "blended" overgeneralizations). A third series of simulations manipulates token frequency within a constant class size, searching for the set of token frequencies which results in "adult-like competence" and "child-like" errors across learning. A final series investigates the addition of phonological sub-regularities into the identity and vowel change classes. Phonological cues are clearly exploited by the system, leading to overall improved performance. 
However, overgeneralizations, U-shaped learning and competition effects continue to be observed in similar conditions. These models establish that input configuration plays a role in determining the types of errors produced by the network - including the conditions under which "rule-like" behavior and "U-shaped" development will and will not emerge. The results are discussed with reference to behavioral data on children's acquisition of the past tense and the validity of drawing conclusions about the acquisition of language from models of this sort. ***************************************************************** Please send requests for hard copy to: yvonne at amos.ucsd.edu or Center for Research in Language C-008 University of California, San Diego La Jolla, CA 92093 Attn: Yvonne -- Virginia Marchman (marchman at amos.ucsd.edu) Kim Plunkett (psykimp at dkarh02.bitnet) From sankar at caip.rutgers.edu Fri Mar 31 15:14:12 1989 From: sankar at caip.rutgers.edu (ananth sankar) Date: Fri, 31 Mar 89 15:14:12 EST Subject: KOHONEN MAPS Message-ID: <8903312014.AA03080@caip.rutgers.edu> I had initiated a discussion on Kohonen's maps two weeks ago and, apart from the many replies I (and many others??) received, there were requests that I post the responses. It would be a good idea to go through this material and then discuss again.
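For readers who want to experiment alongside the discussion that follows, here is a minimal sketch of the Kohonen self-organizing feature map algorithm, in Python/NumPy. The 10 x 10 grid and uniform 2-D inputs match the setup mentioned in the thread; the radius and gain schedules are illustrative assumptions of mine, not Kohonen's published settings.

```python
# Minimal sketch of Kohonen's self-organizing feature map for the
# 2-D uniform-input, 10 x 10 grid case discussed in this thread.
# The radius and gain schedules below are illustrative assumptions,
# not Kohonen's published settings.
import numpy as np

rng = np.random.default_rng(0)
GRID = 10
W = rng.random((GRID, GRID, 2))        # one 2-D weight vector per output node

def train(W, steps=2000):
    for t in range(steps):
        frac = t / steps
        radius = 8.0 * (1.0 - frac)            # neighbourhood shrinks toward 0
        gain = 0.5 * (1.0 - frac) + 0.01       # learning rate decays over time
        x = rng.random(2)                      # uniform input in the unit square
        # winner: node whose weight vector is closest in Euclidean distance
        d = np.linalg.norm(W - x, axis=2)
        wi, wj = np.unravel_index(np.argmin(d), d.shape)
        # move every node in the winner's (square) neighbourhood toward x
        for i in range(GRID):
            for j in range(GRID):
                if max(abs(i - wi), abs(j - wj)) <= radius:
                    W[i, j] += gain * (x - W[i, j])
    return W

W = train(W)
```

Note that matching is done with a distance calculation rather than a dot product, which is one of the points raised below; each update is a convex combination of the old weight and the input, so weight vectors stay inside the unit square.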
>From pastor at prc.unisys.com Thu Mar 16 16:58:47 1989 Received: from PRC-GW.PRC.UNISYS.COM by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA03401; Thu, 16 Mar 89 16:58:40 EST Received: from bigburd.PRC.Unisys.COM by burdvax.PRC.Unisys.COM (5.61/Domain/jpb/2.9) id AA11739; Thu, 16 Mar 89 16:58:28 -0500 Received: by bigburd.PRC.Unisys.COM (5.61/Domain/jpb/2.9) id AA24449; Thu, 16 Mar 89 16:58:23 -0500 From: pastor at prc.unisys.com (Jon Pastor) Message-Id: <8903162158.AA24449 at bigburd.PRC.Unisys.COM> Received: from Xerox143 by bigburd.PRC.Unisys.COM with PUP; Thu, 16 Mar 89 16:58 EST To: ananth sankar Date: 16 Mar 89 16:56 EST (Thursday) Subject: Re: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST To: ananth sankar Cc: pastor at bigburd.prc.unisys.com Status: R I am in the process of implementing a Kohonen-style system, and if I actually get it running and obtain any results I'll let you know. If you get any responses, please let me know. Thanks. >From Connectionists-Request at q.cs.cmu.edu Thu Mar 16 16:59:58 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA03426; Thu, 16 Mar 89 16:59:52 EST Received: from cs.cmu.edu by Q.CS.CMU.EDU id aa11454; 16 Mar 89 9:44:34 EST Received: from CAIP.RUTGERS.EDU by CS.CMU.EDU; 16 Mar 89 09:42:55 EST Received: by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA14983; Thu, 16 Mar 89 09:42:44 EST Date: Thu, 16 Mar 89 09:42:44 EST From: ananth sankar Message-Id: <8903161442.AA14983 at caip.rutgers.edu> To: connectionists at cs.cmu.edu Subject: questions on kohonen's maps Status: R I am interested in the subject of Self Organization and have some questions with regard to Kohonen's algorithm for Self Organizing Feature Maps. I have tried to duplicate the results of Kohonen for the two dimensional uniform input case i.e. two inputs. I used a 10 X 10 output grid. The maps that resulted were not as good as reported in the papers. 
Questions:

1. Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach.

2. Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features.

3. In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation.

4. I have not seen as yet in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited.

5. Can the net become disordered after ordering is achieved at any particular iteration?

I would appreciate any comments, suggestions etc. on the above. Also, so that net mail clutter may be reduced, please respond to sankar at caip.rutgers.edu Thank you. Ananth Sankar Department of Electrical Engineering Rutgers University, NJ >From regier at cogsci.berkeley.edu Thu Mar 16 17:07:20 1989 Received: from cogsci.Berkeley.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA03562; Thu, 16 Mar 89 17:07:16 EST Received: by cogsci.berkeley.edu (5.61/1.29) id AA13666; Thu, 16 Mar 89 14:07:18 -0800 Date: Thu, 16 Mar 89 14:07:18 -0800 From: regier at cogsci.berkeley.edu (Terry Regier) Message-Id: <8903162207.AA13666 at cogsci.berkeley.edu> To: sankar at caip.rutgers.edu Subject: Kohonen request Status: R Hi, I'm interested in the responses to your recent Kohonen posting on Connectionists. Do you suppose you could post the results once all the replies are in?
Thanks, -- Terry >From ken at phyb.ucsf.edu Thu Mar 16 20:11:35 1989 Received: from cgl.ucsf.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA09101; Thu, 16 Mar 89 20:11:32 EST Received: from phyb.ucsf.EDU by cgl.ucsf.EDU (5.59/GSC4.15) id AA01036; Thu, 16 Mar 89 17:11:23 PST Received: by phyb (1.2/GSC4.15) id AA11601; Thu, 16 Mar 89 17:11:17 pst Date: Thu, 16 Mar 89 17:11:17 pst From: ken at phyb.ucsf.edu (Ken Miller) Message-Id: <8903170111.AA11601 at phyb> To: sankar at caip.rutgers.edu Subject: kohonen Status: R re your point 3: the algorithm du_{ij}/dt = a(t)[e_j(t) - u_{ij}(t)], i in N_c where u = weights, e is input pattern, N_c is topological neighborhood of the maximally responding cell, should actually be written du_{ij}/dt = a(t)[ e_j(t)/\sum_k(e_k(t)) - u_{ij}(t)/\sum_k(u_{ik}(t)) ], i in N_c. That is, the change should be such as to move the jth synaptic weight on the ith cell, as a PROPORTION of all the synaptic weights on the ith cell, in the direction of matching the PROPORTION of input which was incoming on the jth line. Note that in this case \sum_j du_{ij}(t)/dt = 0, so weights remain normalized in the sense that the sum over each cell remains constant. If you normalize your inputs to sum to 1 (\sum_k(e_k(t)) = 1) and start with weights normalized to sum to 1 on each cell (\sum_k(u_{ik}(t)) = 1 for all i) then weights will remain normalized to sum to 1, hence the two sums in the denominators are both just = 1 and can be left out. Kohonen was I believe assuming these normalizations and hence dispensing with the sums. ken miller (ken at phyb.ucsf.edu) ucsf dept.
of physiology >From tds at wheaties.ai.mit.edu Thu Mar 16 23:26:42 1989 Received: from life.ai.mit.edu by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA12489; Thu, 16 Mar 89 23:26:39 EST Received: from mauriac.ai.mit.edu by life.ai.mit.edu; Thu, 16 Mar 89 22:48:15 EST Received: from localhost by mauriac.ai.mit.edu; Thu, 16 Mar 89 22:48:06 est Date: Thu, 16 Mar 89 22:48:06 est From: tds at wheaties.ai.mit.edu Message-Id: <8903170348.AA19015 at mauriac.ai.mit.edu> To: sankar at caip.rutgers.edu Subject: Kohonen maps Status: R I share some of your confusion about Kohonen maps. My main question is #4: are they really doing anything useful? The mapping demonstrated in Kohonen's 1982 paper (Biol. Cyb.) only shows mappings from a 2D manifold in 3-space onto a two-dimensionally arranged set of units. The book talks about dimensionality issues in more detail, but so far as I can tell what the network does (after training) is to map three numbers into about 100 numbers. Since the mapping is linear, I don't see how anything at all is gained. A failure to generate an ordering may be one way to tell that the data does not lie on a 2D manifold. But there are many other ways to do this that are more efficient! Also, this is not robust if the manifold folds back on itself (so that two distinct points on the surface are in the same direction from the origin).
Let me know if you find out the true significance of this widely-known work, Terry >From lwyse at bucasb.bu.edu Fri Mar 17 17:42:18 1989 Received: from BU-IT.BU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA05821; Fri, 17 Mar 89 17:42:12 EST Received: from COCHLEA.BU.EDU by bu-it.BU.EDU (5.58/4.7) id AA17739; Fri, 17 Mar 89 17:38:02 EST Received: by cochlea.bu.edu (4.0/4.7) id AA02692; Fri, 17 Mar 89 17:38:21 EST Date: Fri, 17 Mar 89 17:38:21 EST From: lwyse at bucasb.bu.edu Message-Id: <8903172238.AA02692 at cochlea.bu.edu> To: sankar at caip.rutgers.edu Subject: re:questions on Kohonen maps Status: R I would be surprised if there were some analytical expression for the neighborhood and gain functions that was useful in practical applications. I have found different "best functions" for different input vector distributions, initial weight distributions, etc. A related question to yours: What does "ordering" mean when mapping across different dimensional spaces? An excerpt from a report on my experiences with Kohonen maps: When the input space and the neighborhood space of the weight vectors are of different dimension, however, what "ordered" means becomes a sticky wicket. For example, in Fig. 5.17, Kohonen shows a one-dimensional neighborhood of weight vectors approximating a triangular distribution of inputs with what he terms a "Peano-like" curve. But this type of curve folds in on itself in an attempt to fill the space, so points that may be far from each other in their one-D neighborhood can be maximally responsive to very close input points. Is this "ordered"? He doesn't seem to address this point directly. A point I would like to bring out is that in these situations where the dimension of the input space and the dimension of the neighborhood differ, whether or not the weight-vector chain crosses itself is {\em not} necessarily the important metric for measuring the ability of the weights to approximate the input space.
That is, there is not necessarily a correlation between neighborhood-chain crossings and the mean squared error of the weight vector approximations of the input points. It is true, however, that if the neighborhood chain crosses itself, then {\em there exists} a better approximation to the input space. -lonce >From risto at cs.ucla.edu Sat Mar 18 02:59:46 1989 Received: from Oahu.CS.UCLA.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA14191; Sat, 18 Mar 89 02:59:35 EST Return-Path: Received: by oahu.cs.ucla.edu (Sendmail 5.59/2.16) id AA02486; Fri, 17 Mar 89 23:14:45 PST Date: Fri, 17 Mar 89 23:14:45 PST From: risto at cs.ucla.edu (Risto Miikkulainen) Message-Id: <8903180714.AA02486 at oahu.cs.ucla.edu> To: sankar at caip.rutgers.edu In-Reply-To: ananth sankar's message of Thu, 16 Mar 89 09:42:44 EST <8903161442.AA14983 at caip.rutgers.edu> Subject: questions on kohonen's maps Reply-To: risto at cs.ucla.edu Organization: UCLA Computer Science Department Physical-Address: 3677 Boelter Hall Status: R Date: Thu, 16 Mar 89 09:42:44 EST From: ananth sankar 1 Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach. The trick is to start with a neighborhood large enough. For 10x10, a radius of 8 units might be appropriate. Then reduce the radius gradually (e.g. over a few thousand inputs) to 1 or even to 0. 3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation. True. The original idea was to form the "activity bubble" with lateral inhibition and change the weights by "redistribution of synaptic resources".
This neurologically plausible algorithm gave way to an abstraction which uses distance, global selection and difference. (I did some work comparing these two algorithms; I can send you the tech report if you want to look at it. At least it has the parameters that work.) 5 Can the net become disordered after ordering is achieved at any particular iteration? Kohonen proved (in ch 5) that this cannot happen (in the 1-d case) for the abstract algorithm. This is a big problem for the biologically plausible algorithm though. >From djb at flash.bellcore.com Sat Mar 18 23:38:41 1989 Received: from flash.bellcore.com by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA27190; Sat, 18 Mar 89 23:38:32 EST Received: by flash.bellcore.com (5.58/1.1) id AA06742; Sat, 18 Mar 89 23:38:10 EST Date: Sat, 18 Mar 89 23:38:10 EST From: djb at flash.bellcore.com (David J Burr) Message-Id: <8903190438.AA06742 at flash.bellcore.com> To: sankar at caip.rutgers.edu Subject: Feature Map Learning Status: R Your questions regarding the feature map algorithm are ones that have also concerned me. I have been experimenting with a form of this elastic mapping algorithm since about 1979. My early experiments were focussed on using such an adaptive process to map handwritten characters onto reference characters in an attempt to automate a form of elastic template matching. The algorithm I came up with was one which used nearest neighbor "attractors" to "pull" an elastic map into shape by an iterative process. I defined a window or smoothing kernel which had a Gaussian shape as opposed to the box shape commonly used in self-organized mapping. My algorithm resembled the Kohonen feature map classifier that you referred to in your email. The Gaussian kernel has advantages over the box kernel in that aliasing distortion can be reduced. This is similar to the use of Hamming windows in the design of fast Fourier transforms.
With regard to your first and second questions, we have found that the actual window size and gain parameters can take on a number of different schedule shapes and give similar results. It is important that window size decrease very gradually to avoid too early a commitment to a particular vector. This is particularly important in the mapping of highly distorted characters, where a rapid schedule could cause a feature in one character to map to the "wrong" feature in the reference character. Gaussian windows were the choice for that problem, since they guaranteed very smooth maps. You are right that a parameter schedule that works for one problem may be poorly suited to a different problem. We have recently applied the feature map model to the traveling salesman problem and reported some of our results at ICNN-88. A one-dimensional version of the elastic map (a rubber band) seems best suited to this problem. We found that there was a particular analytic form of the gain schedule which worked well for this problem. Window size, on the other hand, seemed to benefit most from a feedback schedule in which the degree of progress toward the solution served as input to set an appropriate window size. I have results studying some 700 different learning trials on 30-100 city problems using this method. Performance is considerably better than the Hopfield-Tank solution. Yes, it seems as though one needs distance calculation as the input for this model, rather than dot product as used in back-propagation nets. I would be happy to mail you some papers describing my implementation of the feature map learning model. The first article appeared in Computer Graphics and Image Processing Journal, 1981, entitled "A Dynamic Model for Image Registration". The recent work on traveling salesman was also reported at last year's Snowbird meeting in addition to ICNN-88. Please feel free to correspond with me as I consider this a very interesting topic. Best Wishes, D. J.
Burr djb at bellcore.com >From @relay.cs.net:tony at ifi.unizh.ch Mon Mar 20 03:12:51 1989 Received: from RELAY.CS.NET by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA02795; Mon, 20 Mar 89 03:12:46 EST Received: from relay2.cs.net by RELAY.CS.NET id ab08738; 20 Mar 89 4:55 EST Received: from switzerland by RELAY.CS.NET id ae29120; 20 Mar 89 4:48 EST Received: from ean by scsult.SWITZERLAND.CSNET id a011717; 20 Mar 89 9:45 WET Date: 19 Mar 89 21:45 +0100 From: tony bell To: sankar at caip.rutgers.edu Mmdf-Warning: Parse error in original version of preceding line at SWITZERLAND.CSNET Message-Id: <342:tony at ifi.unizh.ch> Subject: Top Maps Status: R You should see Ritter & Schulten's paper in IEEE ICNN proceedings 1988 (San Diego) for expressions answering question 1. Another paper from Helge Ritter deals with the convergence properties. This was submitted to Biol. Cybernetics but maybe you should write to him at the University of Illinois where he is now. Tony Bell, Univ of Zurich >From djb at flash.bellcore.com Mon Mar 20 17:51:22 1989 Received: from flash.bellcore.com by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA18086; Mon, 20 Mar 89 17:51:14 EST Received: by flash.bellcore.com (5.58/1.1) id AA25760; Mon, 20 Mar 89 17:51:18 EST Date: Mon, 20 Mar 89 17:51:18 EST From: djb at flash.bellcore.com (David J Burr) Message-Id: <8903202251.AA25760 at flash.bellcore.com> To: sankar at caip.rutgers.edu Subject: Self-Organized Mapping Status: R There has been interest on the net recently in some of the questions that you posed in your recent mail. I have personally received comments regarding the neighborhood functions and whether there is an appropriate analytic form. My comments were summarized in my recent mailing to you. If you get additional responses, I would certainly appreciate hearing about peoples' experiences. Would you consider posting a summary to the net? I did not comment on your questions 4 and 5.
It seems that the neighbors-matching-to-neighbors observation comes about as a result rather than an input constraint. In my 1981 paper on elastic matching of images I used a more extended pattern matcher (area template instead of a point-to-point nearest neighbor) for gray scale images. This tended to enforce the constraint that you observed at the input level. Unfortunately, I am not sure what its generalization would be for non-image patterns (N-D instead of 2-D). I have done all my experiments on elastic mapping of fixed patterns as opposed to point distributions. There was no problem of a map being undone after it converged. Have you had such problems with your speech data? I have been told that when the distributions are stochastic or sampled, there is an even stronger need to proceed slowly. Apparently one sampled point can pull the map in one direction, and this must be counterbalanced by opposing samples pulling the other way to maintain stability of the map. This unfortunately takes lots of computer cycles. Hoping to hear from you. Dave Burr >From Connectionists-Request at q.cs.cmu.edu Mon Mar 20 18:01:41 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA18228; Mon, 20 Mar 89 18:01:34 EST Received: from cs.cmu.edu by Q.CS.CMU.EDU id aa23263; 20 Mar 89 14:41:25 EST Received: from XEROX.COM by CS.CMU.EDU; 20 Mar 89 14:39:19 EST Received: from Semillon.ms by ArpaGateway.ms ; 20 MAR 89 11:26:12 PST Date: 20 Mar 89 11:25 PST From: chrisley.pa at xerox.com Subject: Re: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST To: ananth sankar Cc: connectionists at cs.cmu.edu, chrisley.pa at xerox.com Message-Id: <890320-112612-6136 at Xerox> Status: R Ananth Sankar recently asked some questions about Kohonen's feature maps.
As I have worked on these issues with Kohonen, I feel like I might be able to give some answers, but standard disclaimers apply: I cannot be certain that Kohonen would agree with all of the following. Also, I do not have my copy of his book with me, so I cannot be more specific about references. Questions: 1 Is there any analytical expression for the neighbourhood and gain functions? I have seen a simulation where careful tweaking after every so many iterations produces a correctly evolving map. This is obviously not a proper approach. Although there is probably more than one correct, task-independent gain or neighborhood function, Kohonen does mention constraints that all of them should meet. For example, both functions should decrease to zero over time. I do not know of any tweaking; Kohonen usually determines a number of iterations and then decreases the gain linearly. If you call this tweaking, then your idea of domain-independent parameters might be a sort of holy grail, since it does not seem likely that we are going to find a completely parameter-free learning algorithm that will work in every domain. 2 Even if good results are available for particular functions for the uniform distribution input case, it is not clear to me that these same functions would result in good classification for some other problem. I have attempted to use these maps for word classification using LPC coeffs as features. As far as I know, Kohonen has used the same type of gain and neighborhood functions for all of his map demonstrations. These demonstrations, which have been shown via an animated film at several major conferences, demonstrate maps learning the distribution in cases where 1) the dimensionality of the network topology and the input space mismatch, e.g., where the network is 2d and the distribution is a 3d 'cactus'; 2) the distribution is not uniform. The algorithm was developed with these 2 cases in mind, so it is no surprise that the results are good for them as well.
3 In Kohonen's book "Self Organization and Associative Memory", Ch 5 the algorithm for weight adaptation does not produce normalized weights. Thus the output nodes cannot function as simply as taking a dot product of inputs and weights. They have to execute a distance calculation. That's right. And Kohonen usually uses the Euclidean distance metric, although other ones can be used (which he discusses in the book). Furthermore, there have been independent efforts to normalize weights in Kohonen maps so that the dot product measure can be used. If you have any doubts about the suitability of the Euclidean metric, as your question seems to imply, express them. It is an interesting issue. 4 I have not seen as yet in the literature any reports on how the fact that neighbouring nodes respond to similar patterns from a feature space can be exploited. The primary interest in maps, I believe, came from a desire to display high-dimensional information in low-dimensional spaces, which are more easily apprehended. But there is evidence that there are other uses as well: 1) Kohonen has published results on using maps for phoneme recognition, where the topology-preservation plays a significant role (such maps are used in the Otaniemi Phonetic Typewriter featured in, I think, Computer magazine a year or two ago); 2) work has been done on using the topology to store sequential information, which seems to be a good idea if you are dealing with natural signals that can only temporally shift from a state to similar states; 3) several people have followed Kohonen's suggestion of using maps for adaptive kinematic representations for robot control (the work on Murphy, mentioned on this net a month or so ago, and the work being done at Carleton University by Darryl Graf are two good examples). In short, just look at some ICNN or INNS proceedings, and you'll find many papers where researchers found Kohonen maps to be a good place from which to launch their own studies.
5 Can the net become disordered after ordering is achieved at any particular iteration? Of course, this is theoretically possible, and is almost certain if at some point the distribution of the mapped function changes. But this brings up a difficult question: what is the proper ordering in such a case? Should a net try to integrate both past and present distributions, or should it throw away the past and concentrate on the present? I think most nn researchers would want a little of both, with maybe some kind of exponential decay in the weights. But in many applications of maps, there is no chance of the distribution changing: it is fixed, and iterations are over the same test data each time. In this case, I would guess that the ordering could not become disrupted (at least for simple distributions and a net of adequate size), but I realise that there is no proof of this, and the terms 'simple' and 'adequate' are lacking definition. But that's life in nnets for you! If anyone has any more questions, feel free. Ron Chrisley Xerox PARC System Sciences Lab 3333 Coyote Hill Road Palo Alto, CA 94304 USA chrisley.pa at xerox.com tel: (415) 494-4728 OR New College Oxford OX1 3BN UK chrisley at vax.oxford.ac.uk tel: (865) 279-492 >From chrisley.pa at xerox.com Thu Mar 23 15:00:13 1989 Received: from Xerox.COM by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA22224; Thu, 23 Mar 89 15:00:04 EST Received: from Semillon.ms by ArpaGateway.ms ; 23 MAR 89 11:35:27 PST Date: 23 Mar 89 11:35 PST From: chrisley.pa at xerox.com Subject: Re: questions on kohonen's maps In-Reply-To: ananth sankar 's message of Thu, 16 Mar 89 09:42:44 EST To: ananth sankar Cc: connectionists at cs.cmu.edu Message-Id: <890323-113527-4949 at Xerox> Status: R One further note about Ananth Sankar's questions about Kohonen maps: A friend of mine, Tony Bell, tells me (and Ananth) that Helge Ritter has a "neat set of expressions for the learning rate and neighbourhood size parameters...
and he also proves something about convergence elsewhere." Unfortunately, I do not as yet have a reference for the papers, but I have liked Ritter's work in the past, so I thought people on the net might be interested. >From Connectionists-Request at q.cs.cmu.edu Fri Mar 24 11:52:18 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA20326; Fri, 24 Mar 89 11:52:13 EST Received: from ri.cmu.edu by Q.CS.CMU.EDU id aa17597; 24 Mar 89 8:48:01 EST Received: from BU-IT.BU.EDU by RI.CMU.EDU; 24 Mar 89 08:41:54 EST Received: from COCHLEA.BU.EDU by bu-it.BU.EDU (5.58/4.7) id AA06449; Tue, 21 Mar 89 13:58:32 EST Received: by cochlea.bu.edu (4.0/4.7) id AA04927; Tue, 21 Mar 89 13:59:02 EST Date: Tue, 21 Mar 89 13:59:02 EST From: lwyse at bucasb.bu.edu Message-Id: <8903211859.AA04927 at cochlea.bu.edu> To: connectionists at ri.cmu.edu In-Reply-To: connectionists at c.cs.cmu.edu's message of 20 Mar 89 23:47:09 GMT Subject: Re: questions on kohonen's maps Organization: Center for Adaptive Systems, B.U. Status: R What does "ordering" mean when you're projecting inputs to a lower dimensional space? For example, with the "Peano" type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain. In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes.
-lonce >From @relay.cs.net:tony at ifi.unizh.ch Fri Mar 24 13:30:26 1989 Received: from RELAY.CS.NET by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA23163; Fri, 24 Mar 89 13:30:12 EST Received: from relay2.cs.net by RELAY.CS.NET id ab09426; 24 Mar 89 12:01 EST Received: from switzerland by RELAY.CS.NET id aa01417; 24 Mar 89 11:55 EST Received: from ean by scsult.SWITZERLAND.CSNET id a011335; 24 Mar 89 17:53 WET Date: 24 Mar 89 17:51 +0100 From: tony bell To: sankar at caip.rutgers.edu Mmdf-Warning: Parse error in original version of preceding line at SWITZERLAND.CSNET Message-Id: <352:tony at ifi.unizh.ch> Status: R In case anyone else asks (or Ron sends any more vague messages to the net), here are all the refs I have on Helge Ritter's work on topological maps: [1] "Kohonen's Self-Organizing Maps: exploring their computational capabilities" in Proc. IEEE ICNN 1988, San Diego. [2] "Convergence Properties of Kohonen's Topology Conserving Maps: fluctuations, stability and dimension selection" submitted to Biol. Cybernetics. [3] "Extending Kohonen's self-organising mapping algorithm to learn Ballistic Movements" in the book "Neural Computers", Eckmiller & von der Malsburg (eds). [4] "Topology conserving mappings for learning motor tasks" in the book "Neural Networks for Computing", Denker (ed), AIP Conf. proceedings, Snowbird, 1986. The second one in particular uses some heavy statistical techniques (the inputs are seen as a Markov process and a Fokker-Planck equation describes the learning) in order to prove that the map will reach equilibrium when the learning rate is time dependent (i.e. it decays). Ritter's PhD thesis covers all his work, but it's in German. Now, Ritter is at the University of Illinois. I hope this helps you and I don't mind if you post this to the net if you think people are interested enough. yours, Tony Bell.
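The conservation property behind Ken Miller's proportional form of the update rule, quoted earlier in this thread, is easy to check numerically. The following Python/NumPy fragment is my own illustration with made-up sizes and values; it is not code from any of the papers cited above.

```python
# Numerical check of the proportional (normalized) update rule discussed
# in this thread:
#     du_ij/dt = a * ( e_j / sum_k e_k  -  u_ij / sum_k u_ik ),  i in N_c
# Summing over j makes the bracket vanish, so the total weight on each
# cell is conserved. Sizes and values here are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(1)
U = rng.random((5, 8))                    # 5 cells, 8 input lines
U /= U.sum(axis=1, keepdims=True)         # start with per-cell sums of 1
e = rng.random(8)                         # an arbitrary input pattern
a = 0.1                                   # gain

dU = a * (e / e.sum() - U / U.sum(axis=1, keepdims=True))
U_new = U + dU

conserved = np.allclose(dU.sum(axis=1), 0.0)       # per-cell change sums to 0
still_normalized = np.allclose(U_new.sum(axis=1), 1.0)
```

Both checks come out true for any input pattern, which is exactly the point: with inputs and weights normalized to sum to 1, the denominators equal 1 and the simpler textbook form of the rule preserves normalization automatically.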
>From Connectionists-Request at q.cs.cmu.edu Fri Mar 24 22:07:14 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA23834; Fri, 24 Mar 89 22:07:06 EST Received: from ri.cmu.edu by Q.CS.CMU.EDU id aa22170; 24 Mar 89 13:28:20 EST Received: from MAUI.CS.UCLA.EDU by RI.CMU.EDU; 24 Mar 89 13:26:10 EST Return-Path: Received: by maui.cs.ucla.edu (Sendmail 5.59/2.16) id AA25252; Fri, 24 Mar 89 10:25:07 PST Date: Fri, 24 Mar 89 10:25:07 PST From: Geunbae Lee Message-Id: <8903241825.AA25252 at maui.cs.ucla.edu> To: lwyse at bucasb.bu.edu Subject: Re: questions on kohonen's map Cc: connectionists at ri.cmu.edu Status: R >What does "ordering" mean when you're projecting inputs to a lower dimensional >space? It means topological ordering. >For example, with the "Peano" type curves that result from a one-D >neighborhood learning a 2-D input distribution, it is obviously NOT >true that nearby points in the input space maximally activate nearby >points on the neighborhood chain. It depends on what you mean by "nearby". If it is nearby in a relative sense (in topological relation), not an absolute sense, then the nearby points in the input space DO maximally activate nearby points on the neighborhood chain.
--Geunbae Lee AI Lab, UCLA >From Connectionists-Request at q.cs.cmu.edu Sat Mar 25 02:26:12 1989 Received: from Q.CS.CMU.EDU by caip.rutgers.edu (5.59/SMI4.0/RU1.0/3.03) id AA26264; Sat, 25 Mar 89 02:26:06 EST Received: from ri.cmu.edu by Q.CS.CMU.EDU id aa25584; 24 Mar 89 17:55:35 EST Received: from XEROX.COM by RI.CMU.EDU; 24 Mar 89 17:53:44 EST Received: from Semillon.ms by ArpaGateway.ms ; 24 MAR 89 14:53:32 PST Date: 24 Mar 89 14:53 PST From: chrisley.pa at xerox.com Subject: Re: questions on kohonen's maps In-Reply-To: lwyse at bucasb.BU.EDU's message of Tue, 21 Mar 89 13:59:02 EST To: lwyse at bucasb.bu.edu Cc: connectionists at ri.cmu.edu Message-Id: <890324-145332-8519 at Xerox> Status: R Lonce (lwyse at bucasb.BU.EDU) writes: "What does "ordering" mean when you're projecting inputs to a lower dimensional space? For example, the "Peano" type curves that result from a one-D neighborhood learning a 2-D input distribution, it is obviously NOT true that nearby points in the input space maximally activate nearby points on the neighborhood chain." It is not true that nearby points in input space are always mapped to nearby points in the output space when the mapping is dimensionality-reducing, agreed. But 'ordering' still makes sense. The map is topology-preserving if the dependency is in the other direction, i.e., if nearby points in output space are always activated by nearby points in input space. Lonce goes on to say: "In this case, it is not even clear that "untangling" the neighborhood is of utmost importance, since a tangled chain can still do a very good job of divvying up the space almost equally between its nodes." I agree that topology preservation is not necessarily of utmost importance, but it may be useful in some applications, such as the ones I mentioned a few messages back (phoneme recognition, inverse kinematics, etc.).
Also, there is 1) the interest in the properties of self-organizing systems in themselves, even when an application cannot immediately be found; and 2) the observation that for some reason the brain seems to use topology-preserving maps (with the one-way dependency I mentioned above), which, although they *could* be computationally unnecessary or even disadvantageous, are probably in fact, nature being what she is, good solutions to tough real-time problems.

Ron Chrisley

After April 14th, please send personal email to Chrisley at vax.ox.ac.uk

From Connectionists-Request at q.cs.cmu.edu Sun Mar 26 03:40:59 1989
Date: Sat, 25 Mar 89 22:17:59 pst
From: Ken Miller
Message-Id: <8903260617.AA08352 at phyb>
To: Connectionists at cs.cmu.edu
Subject: Normalization of weights in Kohonen algorithm

Re point 3 of a recent posting about the Kohonen algorithm:

"3. In Kohonen's book "Self-Organization and Associative Memory", Ch. 5, the algorithm for weight adaptation does not produce normalized weights."

The algorithm

  du_{ij}/dt = a(t) [ e_j(t) - u_{ij}(t) ],   i in N_c

where u are the weights, e is the input pattern, and N_c is the topological neighborhood of the maximally responding cell, should I believe be written

  du_{ij}/dt = a(t) [ e_j(t) / \sum_k e_k(t)  -  u_{ij}(t) / \sum_k u_{ik}(t) ],   i in N_c.

That is, the change should move the jth synaptic weight on the ith cell, as a PROPORTION of all the synaptic weights on the ith cell, in the direction of matching the PROPORTION of the input that was incoming on the jth line.
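The conservation property of this proportional rule is easy to check numerically. A minimal sketch, with arbitrary illustrative values for a(t), e, and u on a single cell i in N_c:

```python
# Numerical check of the proportional update rule above
# (a, e, and u are arbitrary illustrative values, not from any real run).
a = 0.1                      # learning rate a(t)
e = [0.2, 0.5, 0.3]          # input pattern e_j(t)
u = [0.7, 0.1, 0.2]          # weights u_{ij}(t) on one cell i

se = sum(e)                  # \sum_k e_k(t)
su = sum(u)                  # \sum_k u_{ik}(t)
du = [a * (e[j] / se - u[j] / su) for j in range(3)]

print(sum(du))               # the increments sum to (essentially) zero ...
u_new = [u[j] + du[j] for j in range(3)]
print(sum(u_new))            # ... so the per-cell weight sum is unchanged
```

Since the bracketed term is a difference of two quantities that each sum to 1 over j, the increments cancel in the sum regardless of the particular e and u chosen.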
Note that in this case \sum_j du_{ij}(t)/dt = 0, so the weights remain normalized in the sense that the sum over each cell remains constant. If the inputs are normalized to sum to 1 (\sum_k e_k(t) = 1), the first denominator can be omitted. If the weights begin normalized to sum to 1 on each cell (\sum_k u_{ik}(t) = 1 for all i), then they will remain normalized to sum to 1, and hence the second denominator can be omitted as well. Perhaps Kohonen was assuming these normalizations and hence dispensing with the denominators?

ken miller (ken at phyb.ucsf.edu)

From mcvax!fib.upc.es!millan at uunet.UU.NET Fri Mar 31 04:09:00 1989
From: mcvax!fib.upc.es!millan at uunet.UU.NET (Jose del R. MILLAN)
Date: 31 Mar 89 17:09 +0800
Subject: TR available
Message-ID: <92*millan@fib.upc.es>

The following Tech. Report is available. Requests should be sent to MILLAN at FIB.UPC.ES

________________________________________________________________________

Learning by Back-Propagation: a Systolic Algorithm and its Transputer Implementation

Technical Report LSI-89-15

Jose del R. MILLAN
Dept. de Llenguatges i Sistemes Informatics
Universitat Politecnica de Catalunya

Pau BOFILL
Dept. d'Arquitectura de Computadors
Universitat Politecnica de Catalunya

ABSTRACT

In this paper we present a systolic algorithm for back-propagation, a supervised, iterative, gradient-descent connectionist learning rule. The algorithm works on feedforward networks in which connections may skip layers, and it fully exploits the spatial and training parallelisms inherent to back-propagation. Spatial parallelism arises during the propagation of activity ---forward--- and of error ---backward--- for a particular input-output pair; when this computation is carried out simultaneously for all input-output pairs, training parallelism is obtained. In the spatial dimension, a single systolic ring carries out sequentially the three main steps of the learning rule ---forward, backward, and weight-increment update.
Furthermore, the same pattern of matrix delivery is used in both the forward and the backward passes; in this manner, the algorithm preserves the similarity of the forward and backward passes in the original model. The resulting systolic algorithm is dual with respect to the pattern of matrix delivery ---either columns or rows. Finally, an implementation of the systolic algorithm for the spatial dimension is derived that uses a linear ring of Transputer processors.
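For readers who want the three steps named in the abstract in concrete form, here is a sequential sketch of one back-propagation cycle ---forward pass, backward pass, weight-increment update--- for a tiny fully connected net. The network sizes, training data, and learning rate are illustrative choices of mine; this shows only the computation, not the systolic ring or its Transputer implementation.

```python
import math
import random

random.seed(1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# 2 inputs -> 2 hidden units -> 1 output, with biases (sizes are illustrative)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b1 = [0.0, 0.0]
W2 = [random.uniform(-1, 1) for _ in range(2)]
b2 = 0.0

data = [([0.0, 0.0], 0.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

def forward(x):
    h = [sigmoid(sum(W1[i][j] * x[j] for j in range(2)) + b1[i]) for i in range(2)]
    y = sigmoid(sum(W2[i] * h[i] for i in range(2)) + b2)
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

before = loss()
lr = 0.5
for _ in range(500):
    for x, t in data:
        h, y = forward(x)                   # step 1: forward, propagate activity
        dy = (y - t) * y * (1 - y)          # step 2: backward, propagate error
        dh = [dy * W2[i] * h[i] * (1 - h[i]) for i in range(2)]
        b2 -= lr * dy                       # step 3: weight-increment update
        for i in range(2):
            W2[i] -= lr * dy * h[i]
            b1[i] -= lr * dh[i]
            for j in range(2):
                W1[i][j] -= lr * dh[i] * x[j]
after = loss()
print(before, after)   # squared error drops after training
```

In the systolic formulation, step 1 and step 2 traverse the same weight matrices in opposite directions, which is what makes a single ring and a single pattern of matrix delivery sufficient for both passes.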