From jlm+ at andrew.cmu.edu Thu Sep 1 16:10:27 1988
From: jlm+ at andrew.cmu.edu (James L. McClelland)
Date: Thu, 1 Sep 88 16:10:27 -0400 (EDT)
Subject: Comments on Pinker's Replies to Harnad
Message-ID: <0X7Ouny00jWDM2TdAk@andrew.cmu.edu>

I tried to send this yesterday but for some reason it appears to have slipped through the cracks. The copy I sent to Steve got through, but apparently the one to connectionists didn't.

=============================================================

Steve --

In the first of your two messages, there seemed to be a failure to entertain the possibility that a network which is neither a strict implementation of a rule system nor a pattern associator of the type described by Rumelhart and me might capture the past tense phenomena.

The principal shortcoming of our network, in my view, was that it treated the problem of past-tense formation as a problem in which one generates the past tense of a word from its present tense. This of course cannot be the right way to do things, for reasons which you describe at some length in your paper. However, THIS problem has nothing to do with whether a network or some other method is used for going from present to past tense.

Several researchers are now exploring models that take as input a distributed representation of the intended meaning, and generate as output a description of the phonological properties of the utterance that expresses that meaning. Such a network must have at least one hidden layer to do this task. Note that such a network would naturally be able to exploit the common structure of the various different versions of English inflectional morphology. It is already clear that it would have a much easier time learning inflection rather than word-reversal as a way of mastering the past tense, etc.
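[A minimal sketch, in Python, of the sort of architecture described in the preceding paragraph; the layer sizes, feature counts, and function names are illustrative assumptions, not a description of any published model.]

import numpy as np

rng = np.random.default_rng(0)

n_meaning = 50      # distributed semantic features for the intended meaning
n_inflection = 4    # e.g. present / past / progressive / participle
n_hidden = 60       # at least one hidden layer is required for this mapping
n_phonology = 80    # distributed phonological features of the inflected form

# untrained weights of a one-hidden-layer net: meaning + inflection -> phonology;
# in practice they would be trained by back-propagation on (meaning, form) pairs
W1 = rng.normal(0.0, 0.1, (n_meaning + n_inflection, n_hidden))
W2 = rng.normal(0.0, 0.1, (n_hidden, n_phonology))

def produce(meaning, inflection):
    """Map a distributed meaning plus an inflectional feature bundle to a
    distributed description of the phonological form that expresses it."""
    x = np.concatenate([meaning, inflection])
    hidden = np.tanh(x @ W1)
    return 1.0 / (1.0 + np.exp(-(hidden @ W2)))   # phonological feature activations

Because the same hidden layer serves every inflection of every verb, regular past-tense, plural, and progressive marking could share structure in such a net, which is the point about the common structure of English inflectional morphology made above.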
What remain to be addressed are issues about the nature and onset of use of the regular inflection in English. Suffice it to say here that the claims you and Prince make about the sharp distinction between the regular and irregular systems deserve very close scrutiny. I for one find the arguments you give in favor of this view unconvincing. We will be writing at more length on these matters, but for now I just wanted two points to be clear:

1) The argument about what class of models a particular model's shortcomings exemplify is not an easy one to resolve, and there is considerable science and (yes) mathematics to be done to understand just what the classes are and what can be taken as examples of them. Just what generalization you believe you have reason to claim your arguments allow you to make has not always been clear. In the first of your two recent messages you state:

Our concern is not with (the mathematical question of) what nets can or cannot do in principle, but with which theories are true, and our conclusions were about pattern associators using distributed phonological representations. We showed that it is unlikely that human children learn the regular rule the way such a pattern associator learns the regular rule, because it is simply the wrong tool for the job.

After receiving the message containing the above I wrote the following:

< Now, the model Rumelhart and I proposed was a pattern associator using distributed phonological representations, but so are the other kinds of models that people are currently exploring; they happen, though, to use such representations at the output and not the input and to have hidden layers. I strongly suspect that you would like your argument to apply to the broad class of models which might be encompassed by the phrase "pattern associators using distributed phonological representations", and I know for a fact that many readers think that this is what you intend. However, I think it is much more likely that your arguments apply to the much narrower class of models which map distributed phonological representations of present tense to distributed phonological representations of past tense. >

In your longer, second note, you are very clear in stating that you intend your arguments to be taken against the narrow class of models that map phonology to phonology. I do hope that this sensible view gets propagated, as I think many may feel that you think you have a more general case. Indeed, your second message takes a general attitude that I find I can agree with: Let's do some more research and find out what can and can't be done and what the important taxonomic classes of architecture types might be.

2) There's quite a bit more empirical research to be done even characterizing accurately the facts about the past tense. I believe this research will show that you have substantially overstated the empirical situation in several respects. Just as one example, you and Prince state the following:

The baseball term _to fly out_, meaning 'make an out by hitting a fly ball that gets caught', is derived from the baseball noun _fly (ball)_, meaning 'ball hit on a conspicuously parabolic trajectory', which is in turn related to the simple strong verb _fly_, 'proceed through the air'. Everyone says 'he flied out'; no mere mortal has yet been observed to have "flown out" to left field.

You repeated this at Cog Sci two weeks ago. Yet in October of 87 I received the message appended below, which directly contradicts your claim. As you state in your second, more constructive message, we ALL need to be very clear about what the facts are and not to rush around making glib statements!

Jay McClelland

=======================================================
[The following is appended with the consent of the author.]

Date: Sun, 11 Oct 87 21:20:55 PDT
From: elman at amos.ling.ucsd.edu (Jeff Elman)
To: der at psych.stanford.edu, jlm at andrew.cmu.edu
Subject: flying baseball players

Heard in Thursday's play-off game between the Tigers and Twins:

"...and he flew out to left field...he's...OUT!"

What was that P&P were saying?!

Jeff
=======================================================

From jlm+ at andrew.cmu.edu Thu Sep 1 16:17:59 1988
From: jlm+ at andrew.cmu.edu (James L. McClelland)
Date: Thu, 1 Sep 88 16:17:59 -0400 (EDT)
Subject: Cognitive Science and Connectionist Models
In-Reply-To: <4483.588995267@DST.BOLTZ.CS.CMU.EDU>
References: <4483.588995267@DST.BOLTZ.CS.CMU.EDU>
Message-ID:

I've been meaning to send this message for a long time; the recent discussion about the proceedings of cognitive science has pushed me over the edge. [Actually I sent this yesterday but as with my previous mail this one seems to have gotten lost as well.]

The journal Cognitive Science (the publication of the Cognitive Science Society) has a commitment to the exploration of connectionist models. I am one of the senior editors and the editorial board includes several prominent connectionists. I speak for the journal in saying that we welcome connectionist research with an interdisciplinary flavor. There will be a group of connectionist papers coming out shortly.
If you want to submit, read the instructions for authors inside the back cover of a recent issue. If you want to subscribe, write to Ablex Publishing, 355 Chestnut St., Norwood, NJ 07648, or join the society by writing to:

Alan Lesgold, Secretary-Treasurer
Learning Research and Development Center
University of Pittsburgh
Pittsburgh, PA 15260

Membership is just a bit more than a plain subscription and gets you announcements about meetings etc. as well as the journal.

-- Jay McClelland

From steve at cogito.mit.edu Fri Sep 2 12:13:20 1988
From: steve at cogito.mit.edu (Steve Pinker)
Date: Fri, 2 Sep 88 12:13:20 edt
Subject: VERY brief note on Steven Harnad's reply to answers
Message-ID: <8809021614.AA09711@ATHENA.MIT.EDU>

In his reply to our answers to his questions, Harnad writes that:

-Looking at the actual behavior and empirical fidelity of connectionist models is not the right way to test connectionist hypotheses;
-Developmental, neural, reaction time, and brain-damage data should be put aside in evaluating psychological theories;
-The meaning of the word "learning" should be stipulated to apply only to extracting statistical regularities from input data;
-Induction has philosophical priority over innatism.

We don't have much to say here (thank God, you are probably all thinking). We disagree sharply with the first two claims, and have no interest whatsoever in discussing the last two.

Alan Prince
Steven Pinker

From FROSTROMS%CPVB.SAINET.MFENET at NMFECC.ARPA Thu Sep 1 17:05:02 1988
From: FROSTROMS%CPVB.SAINET.MFENET at NMFECC.ARPA (FROSTROMS%CPVB.SAINET.MFENET@NMFECC.ARPA)
Date: Thu, 1 Sep 88 14:05:02 PDT
Subject: A Harder Learning Problem
Message-ID: <880901140502.20200215@NMFECC.ARPA>

This is a (delayed) response to Alexis P. Wieland's posting of Fri Aug 5 on the spiral problem, _A Harder Learning Problem_:

> One of the tasks that we've been using at MITRE to test and compare our
> learning algorithms is to distinguish between two intertwined spirals.
> This task uses a net with 2 inputs and 1 output. The inputs correspond
> to points, and the net should output a 1 on one spiral and
> a 0 on the other. Each of the spirals contains 3 full revolutions.
> This task has some nice features: it's very non-linear, it's relatively
> difficult (our spiffed up learning algorithm requires ~15-20 million
> presentations = ~150-200 thousand epochs = ~1-2 days of cpu on a (loaded)
> Sun4/280 to learn, ... we've never succeeded at getting vanilla bp to
> correctly converge), and because you have 2 in and 1 out you can *PLOT*
> the current transfer function of the entire network as it learns.
>
> I'd be interested in seeing other people try this or a related problem.

Here at SAIC, Dennis Walker obtained the following results:

"I tried the spiral problem using the standard Back Propagation model in ANSim (Artificial Neural System Simulation Environment) and found that neither spiffed-up learning algorithms nor tricky learning rate adjustments are necessary to find a solution to this difficult problem. Our network had two hidden layers -- a 2-20-10-1 structure for a total of 281 weights. No intra-layer connections were necessary. The learning rates for all 3 layers were set to 0.1 with the momentums set to 0.7. Batching was used for weight updating. Also, an error tolerance of 0.15 was used: as long as the output was within 0.15 of the target no error was assigned. It took ANSim 13,940 cycles (passes through the data) to get the outputs within 0.3 of the targets.
(In ANSim, the activations range from -0.5 to 0.5 instead of the usual 0 to 1 range.) Using the SAIC Delta Floating Point Processor with ANSim, this took less than 27 minutes to train (~0.114 seconds/pass). I also tried reducing the network size to 2-16-8-1 and again was able to train the network successfully, but it took an unbelievable 300K cycles! This is definitely a tough problem."

Stephen A. Frostrom
Science Applications International Corporation
10260 Campus Point Drive
San Diego, CA 92121
(619) 546-6404
frostroms at SAIC-CPVB.arpa
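[For anyone who wants to try the task, here is a minimal plain-NumPy sketch of the setup described above: two intertwined three-revolution spirals, a 2-20-10-1 network of sigmoid units, batch back-propagation with learning rate 0.1, momentum 0.7, and a 0.15 error tolerance. The data generator, the 0-to-1 activation range (ANSim uses -0.5 to 0.5), the initialization, and all names are illustrative assumptions rather than ANSim or MITRE code, and convergence within the cycle counts quoted above is not guaranteed.]

import numpy as np

rng = np.random.default_rng(0)

def make_spirals(n=97, turns=3):
    # two interleaved spirals of `turns` full revolutions: one labeled 1, the other 0
    r = (np.arange(n) + 1.0) / n
    phi = r * turns * 2.0 * np.pi
    s1 = np.stack([r * np.cos(phi), r * np.sin(phi)], axis=1)
    X = np.vstack([s1, -s1])                      # second spiral = first rotated 180 degrees
    y = np.concatenate([np.ones(n), np.zeros(n)]).reshape(-1, 1)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X, y = make_spirals()
sizes = [2, 20, 10, 1]                            # the 2-20-10-1 structure Walker describes
W = [rng.uniform(-0.5, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]
vW = [np.zeros_like(w) for w in W]                # momentum terms
vb = [np.zeros_like(c) for c in b]
lr, momentum, tol = 0.1, 0.7, 0.15                # learning rate, momentum, error tolerance

for epoch in range(300000):
    # forward pass over the whole training set (batch weight updating)
    acts = [X]
    for w, c in zip(W, b):
        acts.append(sigmoid(acts[-1] @ w + c))
    out = acts[-1]
    if np.all(np.abs(y - out) < 0.3):             # stopping criterion: within 0.3 of targets
        print("converged after", epoch, "cycles")
        break
    err = y - out
    err[np.abs(err) < tol] = 0.0                  # no error assigned within the tolerance band
    delta = err * out * (1.0 - out)               # error times sigmoid derivative
    for layer in range(len(W) - 1, -1, -1):
        gW = acts[layer].T @ delta / len(X)
        gb = delta.mean(axis=0)
        if layer > 0:                             # propagate the error before updating this layer
            delta = (delta @ W[layer].T) * acts[layer] * (1.0 - acts[layer])
        vW[layer] = momentum * vW[layer] + lr * gW
        vb[layer] = momentum * vb[layer] + lr * gb
        W[layer] += vW[layer]
        b[layer] += vb[layer]

Because the net has 2 inputs and 1 output, the learned decision function can be plotted over the plane at any point in training, which is the feature of the benchmark Wieland points to.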
From steve at cogito.mit.edu Thu Sep 1 13:06:38 1988
From: steve at cogito.mit.edu (Steve Pinker)
Date: Thu, 1 Sep 88 13:06:38 edt
Subject: Input to Past Tense Net
Message-ID: <8809011707.AA25787@ATHENA.MIT.EDU>

Dear Jay,

We of course agree with you completely that there's a lot of work to be done in exploring both the properties of nets and the relevant empirical data.

On the input/output of nets for the past tense: We agree that some of the problems with RM'86 can be attributed to its using distributed phonological representations of the stem as input. We also agree that by using a different kind of input some of those problems would be diminished. But models that "take as input a distributed representation of the intended meaning, and generate as output a description of the phonological properties of the utterance that expresses the meaning" are on the wrong track. As we showed in OLC (pp. 110-114), the crucial aspects of the input are not its semantic properties, but whether the root of its lexical entry is marked as 'irregular', which in turn often depends on the grammatical category of the root. Two words with different roots will usually have different meanings, but the difference is epiphenomenal -- there's no *systematic*, generalization-supporting pattern between verb semantics and regularity. As we noted, there are words with high semantic similarity and different past tense forms ('hit/hit', 'strike/struck', 'slap/slapped') and words with low semantic similarity and the same past tense forms ('come=arrive/came', 'come=have an orgasm/came', 'become/became', 'overcome/overcame', 'come to one's senses/came to one's senses', etc.).

On flying-out: We're not sure what the Elman anecdote is supposed to imply. The phenomena are quite clear: a word of Category X that is transparently derived from a word of Category Y is regular with respect to inflectional rules applying to X. That is why the vast majority of the time one hears 'flied out', not 'flew out' ('flew out' is a vanishingly rare anecdote worthy of an e-mail message; 'flied-out' usages would over-run mboxes if we bothered to publicly document every instance). That's also why all the other examples of unambiguous cross-category conversion one can think of are regular (see OLC p. 111). That's also why you can add to this list of regulars that are homophonous with an irregular indefinitely (e.g. 'Mary out-Sally-Rided/*out-Sally-Rode Sally Ride', 'Alcatraz out-Sing-Singed/*out-sang-sang Sing Sing', etc.). And that's why you find the phenomenon in different categories ('Toronto Maple Leafs') and in other languages. In other words we have an absolutely overwhelming empirical tendency toward overregularizing cross-categorially derived verbs and an extremely simple and elegant explanation for it (OLC 111-112). If one is also interested in accounting for one-shot violations like the Elman anecdote there are numerous hypotheses to test (an RM86-like model that doesn't apply the majority of the time (?); a speech error (OLC n. 32); hypercorrection (OLC p. 127); derivational ambiguity (OLC n. 16); and no doubt others).

In general: What the facts are telling us is that the right way to set up a net for the past tense is to have an input vector that encodes grammatical category, root/derived status, etc. Perhaps such a net would "merely implement" a traditional grammar, but perhaps it would shed new light on the problem, solving some previous difficulties. What baffles us is why this obvious step would be anathema to so many connectionists. There seems to be a puzzling trend in connectionist approaches to language -- the goal of exploring the properties of nets as psycholinguistic models is married to the goal of promoting a particular view of language that eschews grammatical representations of any sort at any cost and tries to use knowledge-driven processing, associationist-style learning, or both as a substitute. In practice the empirical side of this effort often relies on isolated anecdotes and examples and ignores the vast amount of systematic research on the phenomena at hand. There's no reason why connectionist work on language has to proceed this way, as Paul Smolensky for one has pointed out. Why not exploit the discoveries of linguistics and psycholinguistics, instead of trying to ignore or rewrite them? Our understanding of both connectionism and of language would be the better for it.

Steve Pinker
Alan Prince
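[One concrete way to read the suggestion about the input vector above, sketched as hypothetical Python; the feature sizes, the category inventory, and the helper names are invented for illustration and are not from OLC. The idea is that the input for each verb concatenates its phonological features with explicit codes for the grammatical category of its root and for whether the verb is derived from a word of another category, so that homophonous 'fly' (verb root) and 'fly out' (derived from the noun 'fly (ball)') present different inputs to the network.]

import numpy as np

CATEGORIES = ["V", "N", "A"]                      # grammatical category of the lexical root
N_PHON = 60                                       # size of the distributed phonological vector

def one_hot(value, inventory):
    v = np.zeros(len(inventory))
    v[inventory.index(value)] = 1.0
    return v

def encode_input(phonology, root_category, derived_from_other_category):
    # concatenate phonological features with explicit codes for the root's
    # grammatical category and for cross-category derivation
    assert phonology.shape == (N_PHON,)
    derived_flag = np.array([1.0 if derived_from_other_category else 0.0])
    return np.concatenate([phonology, one_hot(root_category, CATEGORIES), derived_flag])

# 'fly' (a verb root) and baseball 'fly out' (a verb derived from the noun 'fly (ball)')
# sound alike but present different inputs to the network:
phon_flai = np.zeros(N_PHON)                      # placeholder phonological encoding of /flai/
x_fly_verb_root = encode_input(phon_flai, "V", False)
x_fly_out_denominal = encode_input(phon_flai, "N", True)

On such an encoding, a network could in principle learn that the derived-from-another-category flag makes the regular suffix the default, which is the regularity described above.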
From harnad at Princeton.EDU Sat Sep 3 16:03:16 1988
From: harnad at Princeton.EDU (Stevan Harnad)
Date: Sat, 3 Sep 88 16:03:16 edt
Subject: On Modeling and Its Constraints (P&P PS)
Message-ID: <8809032003.AA15194@mind>

Pinker & Prince attribute the following 4 points (not quotes) to me, indicating that they sharply disagree with (1) and (2) and have no interest whatsoever in discussing (3) and (4):

(1) Looking at the actual behavior and empirical fidelity of connectionist models is not the right way to test connectionist hypotheses.

This was not the issue, as any attentive follower of the discussion can confirm. The question was whether Pinker & Prince's article was to be taken as a critique of the connectionist approach in principle, or just of the Rumelhart & McClelland 1986 model in particular.

(2) Developmental, neural, reaction time, and brain-damage data should be put aside in evaluating psychological theories.

This was a conditional methodological point; it is not correctly stated in (2): IF one has a model for a small fragment of human cognitive performance capacity (a "toy" model), a fragment that one has no reason to suppose to be functionally self-contained and independent of the rest of cognition, THEN it is premature to try to bolster confidence in the model by fitting it to developmental (neural, reaction time, etc.) data. It is a better strategy to try to reduce the model's vast degrees of freedom by scaling up to a larger and larger fragment of cognitive performance capacity. This certainly applies to past-tense learning (although my example was chess-playing and doing factorials). It also seems to apply to all cognitive models proposed to date. "Psychological theories" will begin when these toy models begin to approach lifesize; then fine-tuning and implementational details may help decide between asymptotic rivals.

[Here's something for connectionists to disagree with me about: I don't think there is a solid enough fact known about the nervous system to warrant "constraining" cognitive models with it. Constraints are handicaps; what's needed in the toy world that contemporary modeling lives in is more power and generality in generating our performance capacities. If "constraints" help us to get that, then they're useful (just as any source of insight, including analogy and pure fantasy, can be useful). Otherwise they are just arbitrary burdens. The only face-valid "constraint" is our cognitive capacity itself, and we all know enough about that already to provide us with competence data till doomsday. Fine-tuning details are premature; we haven't even come near the station yet.]

(3) The meaning of the word "learning" should be stipulated to apply only to extracting statistical regularities from input data.

(4) Induction has philosophical priority over innatism.

These are substantive issues, very relevant to the issues under discussion (and not decidable by stipulation). However, obviously, they can only be discussed seriously with interested parties.
Stevan Harnad
harnad at mind.princeton.edu

From marchman at amos.ling.ucsd.edu Mon Sep 5 18:04:26 1988
From: marchman at amos.ling.ucsd.edu (Virginia Marchman)
Date: Mon, 5 Sep 88 15:04:26 PDT
Subject: past tense debate
Message-ID: <8809052204.AA04578@amos.ling.ucsd.edu>

Jumping in on the recent discussion about connectionism and the learning of the English past tense, I would like to make the following 2 points:

(1) The data on acquisition of the past tense in real children may be very different from the patterns assumed by either side in this debate.

(2) Networks can simulate "default" strategies that mimic the categorial rules defended by P&P, but the emergence of such rule-like behavior can depend on statistical properties of the input language (a constant input, not the discontinuous input used by R&M). This finding may be relevant to discussions for both "sides" in light of the behavioral (human) data I allude to in (1).

(1) As a psychologist interested in the empirical facts which characterize the acquisition of the past tense (and other domains of linguistic knowledge), I agree with McClelland's comment directed to Pinker and Prince that

> There's quite a bit more empirical research to be
> done [to] even characterize accurately the facts about
> the past tense. I believe this research will
> show you that you have substantially
> overstated the empirical situation in several respects.

(Re: reply to S. Harnad, Connectionist Net, 8/31/88)

After OLC was released in tech report form (Occasional Paper #33, 1987), I wrote a paper arguing that P&P may have underestimated the complexity and degree of individual variation inherent in the process of acquiring the English past tense ("Rules and Regularities in the acquisition of the English past tense." Center for Research in Language Newsletter, UCSD, vol. 2, #4, April, 1988). However, it is difficult for me to believe that developmental data are (in fact, or in principle) "too impoverished" to substantively contribute to the debate between the symbolic and connectionist accounts (S. Harnad, "On Theft vs. Honest Toil", Connectionist Net, 8/31/88).

In the paper, I presented data on the production of past tense forms by English-speaking children between the ages of 3 and 8, using an elicitation technique essentially identical to the one used by Bybee & Slobin (i.e., the data cited in the original R&M paper). While I was fully expecting to see the standard "stages" of overgeneralization and "U-shaped" development, the data suggested that I should stop and re-think the standard characterization of the acquisition of inflectional morphology. First, my data indicated that a child can be in the "stage" of overgeneralizing the "add -ed" rule anywhere between 3 and 7 years of age. Second, errors took several forms beyond the one emphasized by P&P, i.e. overgeneralization of the "-ed" rule to irregular forms. Instead, errors seem to result from the misapplication of *several* (at least two) past tense formation processes. For example, identity mapping (e.g. "hit --> hit") was incorrectly applied to forms from several different classes (both regulars and irregulars that require a vowel change). Vowel changes were inappropriately applied to regulars and irregulars alike (including examples like "pick --> puck"). Furthermore, children committed these "irregularizations" of regular forms at the same time (i.e., within the same child) that they also committed the better-known error of regularizing irregular forms.
Although individual children had "favorite" error types, the different error patterns were not concentrated in any particular age range. These data provide two challenges to the stage model so often assumed by investigators on either side of the symbolic/connectionist debate:

(a) Why is it that children with very *different* amounts of linguistic experience (e.g., 4 year olds and 7 year olds) over- and undergeneralize verbs in qualitatively similar ways? This degree of individual variation within and across age levels in "rate" of acquisition among normal children may be outside acceptable levels of tolerance for a stage model. At the very least, additional evidence is needed to conclusively assume that acquisition proceeds in a "U-shaped" fashion from rote to rule to rule+rote mechanisms.

(b) In several interesting ways, children can be shown to treat irregular and regular verbs similarly during acquisition. Exactly what evidence does one need to show that the regular transformation (add -ed) has a privileged status *during acquisition*? Although overextension of the -ed rule is the most frequent error type overall, there was little in my data upon which to claim that regulars and irregulars are *qualitatively* different at any point in the learning process.

As I state in the conclusion: ".... addressing at least some of the interesting questions for language acquisition requires looking beyond what children are supposed to be doing within any one "stage" of development. I emphasized the idiosyncratic and multi-faceted nature of children's rule-governed systems and asked whether the three-phased model is the most useful metaphor for understanding how children deal with the complexities inherent in the *systems* of language at various points in development. Rather than looking for ways to explain qualitative changes in rule types and their domain of operation, it may be more useful to shift theoretical emphasis onto acquisition as a protracted resolution of several competing and interdependent sub-systems."

(2) In a Technical Report that will be available in the next 4-6 weeks ("Pattern Association in a Back Propagation Network: Implications for Child Language Acquisition", Center for Research in Language, UCSD), Kim Plunkett (psykimp%dkarh02.bitnet) and I will report on a series of approx. 20 simulations conducted during the last 8 months at UCSD. Our goal was to extend the original R&M work with particular focus on the developmental aspects of the model by exploring the interaction of input assumptions with the specific learning properties of the patterns that the simulation is required to associate from input to output. Our first explorations of this problem confirmed the claim by P&P (OLC) that the U-shaped developmental performance of the R&M simulation was indeed highly sensitive to the discontinuity in vocabulary size and structure imposed upon the model. In our simulations, we did NOT introduce any "artificial" discontinuities in the input to the network across the learning period. We restricted ourselves to mappings between phonological strings -- although we agree with both P&P and McClelland that children use more sources of information (e.g. semantics) in the acquisition of an inflectional system like the past tense.
It is certainly not our goal to suggest that linguistic categories (i.e. phonology, semantics) play no role in the acquisition of language, nor that a connectionist network that is required to perform phonological-to-phonological mappings is faced with the same task as a child learning language. But the results from these simulations may present useful information about the effects of different input characteristics on the kinds of errors a net will produce -- including some understanding of the conditions under which "rule-like" behaviors will and will not emerge. And these error patterns (and the individual variability obtained -- where different simulations stand for different individuals) can shed some light on the "real" phenomena that are of most concern. In our mixture of approaches, we are trying to systematically explore the assumptions of both the symbolic and connectionist approaches to acquisition, keeping what kids "really" do firmly in mind.

For our simulations, we constructed a language that consists of legal English CVC, VCC, and CCV strings. Each present and past tense form was represented using a fixed-length distributed phonological feature system. The task for each network was to learn (using back-propagation) approximately 500 phonological-to-phonological mappings where the present tense forms are transformed to the past tense via one of four types of "rules": Arbitrary (any phoneme can go to any other phoneme, like GO --> WENT), Vowel Change (12 possible English vowel changes, analogous to COME --> CAME), Identity map (no change, analogous to HIT --> HIT), and the turning on of a suffix (one of three depending on the voicing of the final phoneme in the stem, analogous to WALK --> WALKED). Input strings were randomly assigned to verb classes, and therefore *no information was provided which tells the network to which class a particular verb belongs*.

One primary goal of this work was to outline the particular configuration of vocabulary input (i.e. "diet") that allowed the system to achieve "adult-like competence" in the past tense, with "child-like" stages in between. Across simulations, we systematically varied the overall number of unique forms that undergo each transformation (i.e., class size), as well as the number of times each class member is presented to the system per epoch (token frequency). We experimented with several different class size and token ratios that, according to estimates out there in the literature, represent the vocabulary configuration of the past tense system in English (e.g., arbitraries are relatively few in number but are highly frequent). We used two measures of performance/acquisition after every sweep through the vocabulary: 1) rate of learning (overall error rate), and 2) success at achieving the target output forms (overall "hit" rate, consonant "hits", vowel "hits" and suffix "hits"). With these, we determined the degree to which the network was achieving the target, as well as the tendency for the network to, for example, turn on a suffix when it shouldn't, change a vowel when it should identity map, etc. *at every point along the learning curve*.
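[A hypothetical sketch, in Python, of the kind of artificial vocabulary just described; the phoneme inventory, the class sizes, the token frequencies, and every name here are invented for illustration and are not the materials used in the UCSD simulations.]

import random

random.seed(1)

CONSONANTS = list("pbtdkgmnszfvl")
VOWELS = list("aeiou")
VOICELESS = set("ptkfs")
SUFFIX = {True: "t", False: "d"}                  # crude voicing-conditioned allomorphs

def make_stem():
    shape = random.choice(["CVC", "VCC", "CCV"])
    return "".join(random.choice(VOWELS if ch == "V" else CONSONANTS) for ch in shape)

def past_tense(stem, rule):
    if rule == "arbitrary":                       # like GO -> WENT: an unrelated form
        return make_stem()
    if rule == "vowel_change":                    # like COME -> CAME: change the vowel(s)
        return "".join(random.choice(VOWELS) if ch in VOWELS else ch for ch in stem)
    if rule == "identity":                        # like HIT -> HIT: no change
        return stem
    return stem + SUFFIX[stem[-1] in VOICELESS]   # regular suffixation, like WALK -> WALKED

# class sizes (number of unique forms) and token frequencies (repetitions per epoch)
# are the manipulated "diet"; these particular numbers are made up for illustration
CLASS_SIZE = {"arbitrary": 10, "vowel_change": 90, "identity": 50, "suffix": 350}
TOKEN_FREQ = {"arbitrary": 15, "vowel_change": 5, "identity": 5, "suffix": 1}

stems = [make_stem() for _ in range(sum(CLASS_SIZE.values()))]
random.shuffle(stems)                             # class membership is random, not phonological

training_set, i = [], 0
for rule, size in CLASS_SIZE.items():
    for stem in stems[i:i + size]:
        pair = (stem, past_tense(stem, rule))
        training_set.extend([pair] * TOKEN_FREQ[rule])
    i += size

# each string would then be coded as a fixed-length vector of distributed phonological
# features and the present-to-past mapping learned with back-propagation

Varying CLASS_SIZE and TOKEN_FREQ across simulations is the sense in which the "diet" is manipulated while the network itself stays the same.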
I will not describe all of the results here; however, one finding is particularly relevant to the current discussion. In several of our simulations, the network tended to adopt a "default" suffixation strategy when it formed the past tense of verbs. That is, even though the system was getting a high proportion of both the "regular" and the "irregular" (arbitrary, vowel change and identity) verbs correct, the most common errors made by the system at various points in development are best described as overgeneralizations of the "add -ed" rule. However, other error types (analogous to the "irregularizations" described above) also occurred. Certain configurations of class size (# of forms) and token frequency (# of exemplars repeated) resulted in a network that adopted suffixation as its "default" strategy; yet, in other simulations (i.e., vocabulary configurations), the network adopted "identity mapping" as its guide through the acquisition of the vocabulary. Overgeneralizations of the identity mapping procedure were prevalent in several simulations, as was the tendency to incorrectly change a vowel.

It is important to stress that these different outcomes occurred in the *same* network (e.g., 3 layer, 20 input units, etc.), each one exposed to a different combination of regular and irregular input. Emergence of a default strategy (a rule?) at certain points in learning depended not on tagging of the input (as P&P suggest), but on the ratio of regulars and irregulars in the input to which the system was exposed. This pattern of performance could *not* have been determined by the phonological characteristics of members of either the regular or the irregular classes. That is, phonological information was available to the system (within the distributed feature representation) but the phonological structure of the stem did not determine class membership (i.e., performance was not determined by the identifiability of which "class" of relationships would obtain between the input and the output).

The input-sensitivity of error patterns in our simulations may come as bad news to those who (1) care about what children do, and (2) believe that children go through a universal U-shaped pattern of development. However, as I suggest in my CRL paper, this familiar characterization of "real" children may not be the most useful for understanding the acquisition process. Default mappings, rule-like in nature, can emerge in a system that is given no explicit information about class membership (bad news for P&P?), but such an outcome is by no means guaranteed. Our current and future work includes a comparison of this set of simulations with additional sets in which information about class membership is explicitly "tagged" in the system (as P&P assume), models in which phonological similarity in the stem is varied systematically (to determine whether default mappings still emerge), and models in which semantic information is also available (as everyone on earth assumes must be the case for a realistic model of language learning).

Virginia Marchman
Department of Psychology C-009
UCSD
La Jolla, CA 92093
marchman at amos.ucsd.ling.edu

From marchman at amos.ling.ucsd.edu Tue Sep 6 20:09:40 1988
From: marchman at amos.ling.ucsd.edu (Virginia Marchman)
Date: Tue, 6 Sep 88 17:09:40 PDT
Subject: Past tense debate -- address correction
Message-ID: <8809070009.AA09152@amos.ling.ucsd.edu>

It appears that I provided the wrong email address on my posting of 9/5/88. Sorry for the inconvenience.
-virginia

the correct address is: marchman at amos.ling.ucsd.edu

From prince at cogito.mit.edu Tue Sep 6 21:03:23 1988
From: prince at cogito.mit.edu (Alan Prince)
Date: Tue, 6 Sep 88 21:03:23 edt
Subject: Final Word on Harnad's Final Word
Message-ID: <8809070104.AA11135@ATHENA.MIT.EDU>

``The Eye's Plain Version is a Thing Apart''

Whatever the intricacies of the other substantive issues that Harnad deals with in such detail, for him the central question must always be: "whether Pinker & Prince's article was to be taken as a critique of the connectionist approach in principle, or just of the Rumelhart & McClelland 1986 model in particular" (Harnad 1988c, cf. 1988a,b).

At this we are mildly abashed: we don't understand the continuing insistence on exclusive "or". It is no mystery that our paper is a detailed analysis of one empirical model of a corner (of a corner) of linguistic capacity; nor is it obscure that from time to time, when warranted, we draw broader conclusions (as in section 8). Aside from the 'ambiguities' arising from Harnad's humpty-dumpty-ish appropriation of words like 'learning', we find that the two modes of reasoning coexist in comfort and symbiosis. Harnad apparently wants us to pledge allegiance to one side (or the other) of a phony disjunction. May we politely refuse?

S. Pinker
A. Prince

From bondc at iuvax.cs.indiana.edu Wed Sep 7 07:13:02 1988
From: bondc at iuvax.cs.indiana.edu (Clay M Bond)
Date: Wed, 7 Sep 88 06:13:02 EST
Subject: No subject
Message-ID:

>From Connectionists-Request at q.cs.cmu.edu Wed Sep 7 02:21:24 1988
>Received: from B.GP.CS.CMU.EDU by Q.CS.CMU.EDU; 6 Sep 88 21:06:22 EDT
>Received: from C.CS.CMU.EDU by B.GP.CS.CMU.EDU; 6 Sep 88 21:04:48 EDT
>Received: from ATHENA (ATHENA.MIT.EDU.#Internet) by C.CS.CMU.EDU with TCP; Tue 6 Sep 88 21:04:28-EDT
>Received: by ATHENA.MIT.EDU (5.45/4.7) id AA11135; Tue, 6 Sep 88 21:04:19 EDT
>Message-Id: <8809070104.AA11135 at ATHENA.MIT.EDU>
>Date: Tue, 6 Sep 88 21:03:23 edt
>From: Alan Prince
>Site: MIT Center for Cognitive Science
>To: connectionists at c.cs.cmu.edu
>Subject: Final Word on Harnad's Final Word
>Status: R
>
>
>``The Eye's Plain Version is a Thing Apart''
>
>Whatever the intricacies of the other substantive issues that
>Harnad deals with in such detail, for him the central question
>must always be: "whether Pinker & Prince's article was to be taken
>as a critique of the connectionist approach in principle, or just of
>the Rumelhart & McClelland 1986 model in particular" (Harnad 1988c, cf.
>1988a,b).
>
>At this we are mildly abashed: we don't understand the continuing
>insistence on exclusive "or". It is no mystery that our paper
>is a detailed analysis of one empirical model of a corner (of a
>corner) of linguistic capacity; nor is it obscure that from time
>to time, when warranted, we draw broader conclusions (as in section 8).
>Aside from the 'ambiguities' arising from Harnad's humpty-dumpty-ish
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
>appropriation of words like 'learning', we find that the two modes
>of reasoning coexist in comfort and symbiosis. Harnad apparently
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>wants us to pledge allegiance to one side (or the other) of a phony
>disjunction. May we politely refuse?
>
>S. Pinker
>A. Prince

It certainly says a great deal about the MITniks that when confronted with a valid criticism of their assumptions which they cannot defend they resort to smugness and condescension.
No one has to comment on their maturity or status as scientists; they say more by their nastiness than anyone else could.

I requested to be included on this mailing list because I am a cognitive scientist and am currently involved in connectionist research. Intelligent, scientific discussion is productive for all. Childish trash such as Pinker and Prince's response above is not welcome in my mail queue. If you have nothing of substance to say, then please don't presume that my time can be wasted. Send such pre-adult filth to alt.flame, P and P. And if you don't have the basic intelligence to perceive a very important and obvious disjunction of issues, then you certainly have no business with BAs, much less PhDs.

Sincerely,
C. Bond

Flames to: /dev/null

From jose at tractatus.bellcore.com Wed Sep 7 17:03:24 1988
From: jose at tractatus.bellcore.com (Stephen J Hanson)
Date: Wed, 7 Sep 88 17:03:24 EDT
Subject: observations
Message-ID: <8809072103.AA28301@tractatus.bellcore.com>

I thought it interesting in the various exchanges that Pinker and Prince never bothered to provide an alternative model for what seems to be a clear set of phenomena in language acquisition. Rumelhart and McClelland did have a model -- and it kind of worked... even if they maybe should have considered, in other experiments, using other kinds of features (perhaps sentential syntactic or semantic). Nonetheless, the model has/had interesting properties, could be extended, tested and analyzed, was well defined in terms of failures and successes, and apparently provides some heuristics for more experiments and refinements and improvements on the basic model -- I'm not sure what more one could ask for. The complaints concerning the nature of pattern associators seem odd and off the mark -- probably a simple misunderstanding concerning technical issues. And the data concerning verb past tense acquisition are obviously important -- I doubt R&M would disagree. So what's the problem?

I, and perhaps others watching all the words fly by (no, I have nothing to say about flying words), wonder what exactly is going on here. Is there another model waiting in the wings that can compete with the R&M model? What specific alternative approaches really exist for modeling verb past tense acquisition (notice this does mean learning)? If there are no others, perhaps P&P and R&M should work on an improved model together.

Stephen J. Hanson (jose at bellcore.com)

From bates at amos.ling.ucsd.edu Wed Sep 7 18:14:21 1988
From: bates at amos.ling.ucsd.edu (Elizabeth Bates)
Date: Wed, 7 Sep 88 15:14:21 PDT
Subject: observations
Message-ID: <8809072214.AA12917@amos.ling.ucsd.edu>

As a child language researcher and a by-stander in the current debate, I would like to reassure some of the AI folks about the good intentions on both sides. Unfortunately, the current argument has deteriorated to the academic equivalent of "Your mother wears army boots!". But there is valid stuff behind it all. My sympathies tend to lie more on the connectionist side, but P&P deserve our careful attention for several reasons.

(1) They are (in my humble view) the first of the vocal critics of PDP who have bothered to look carefully at the details of even ONE model, as opposed to those (like Fodor and Pylyshyn) who have pulled their 1960's arguments out of the closet and dusted them off in the smug conviction that nothing has changed.
(2) Although I think P&P overstate the strength of their empirical case (i.e. they are wrong on many counts about the intuitions of adults and the behavior of children) they do take the empirical evidence seriously, something I wish practitioners on BOTH sides of the aisle would do more often.

(3) Steve Pinker is one of the few child language researchers who has indeed put forward a (reasonably) coherent model of the learning process. It is far too nativist for me, in the sense that it solves too many problems by stipulation (..."Let us assume that the child knows some version of X-bar theory...."). As any mathematician knows, the more you do by assumption, the less you have to prove. In that (limited) sense, I agree with Stevan Harnad. But I strongly recommend that interested network subscribers take a good look at Steve Pinker's book and decide for themselves.

There is indeed a nasty habit of speech at MIT, an irritating smugness that does not contribute to the progress of science. I probably like that less than anyone. But there is also real substance and a lot of sweat that has gone into the P&P work on connectionism. They deserve to be answered on those terms (try ignoring the tone of voice -- you'll need the practice if you have or plan to have adolescent children).

Having said that, let me underscore the value of looking carefully at real human data in developing models, arguments and counterarguments about the acquisition and use of language. One of the worst flaws in the R&M model was the abrupt change in the input that they used to create a U-shaped function -- in the cherished belief, based on many text-book accounts, that such a U-shaped development exists in children. To borrow a phrase from our sainted vice-president: READ MY LIPS! There is no U-shaped function, no sudden drop from one way of forming the past tense to another. There is, instead, a protracted competition between forms that may drag on for years, and there is considerable individual variability in the process. I recommend that you (re)read Virginia Marchman's comments to get a better hold of the facts.

Similar arguments can be made about the supposedly crisp intuitions P&P claim for adults (fly --> (flew,flied)). They have raised an interesting behavioral domain for our consideration, but I can assure you that adult behavior and adult intuitions are not crisp at all. The Elman anecdote that Jay McClelland brought to our attention is not irrelevant, nor is it isolated. I have a reasonably good control over the English language myself, and yet I still vacillate in forming the past tense of many verbs (is it "sneaked" or "snuck"?). Crisp intuitions and U-shaped functions are idealizations invented by linguists, accepted by psycholinguists who should have known better, passed on to computer scientists and perpetuated in simulations even by people like R&M who are ideologically predisposed to think otherwise.

--Elizabeth Bates (bates at amos.ling.ucsd.edu)

From Dave.Touretzky at B.GP.CS.CMU.EDU Wed Sep 7 22:49:31 1988
From: Dave.Touretzky at B.GP.CS.CMU.EDU (Dave.Touretzky@B.GP.CS.CMU.EDU)
Date: Wed, 07 Sep 88 22:49:31 EDT
Subject: schedule for the upcoming NIPS conference
Message-ID: <2710.589690171@DST.BOLTZ.CS.CMU.EDU>

A copy of the preliminary schedule for the upcoming NIPS conference (November 28-December 1, with workshops December 1-3) appears below. NIPS is a single-track, purely scientific conference. The program committee, chaired by Scott Kirkpatrick, was very selective: only 25% of submissions were accepted this year. There will be 25 oral presentations and 60 posters.
The proceedings will be available around the end of April '89, but they can be ordered now from Morgan Kaufmann Publishers, P.O. Box 50490, Palo Alto, CA 94303-9953; tel. 415-578-9911. Prepublication price is $33.95, plus $2.25 postage ($4.00 for overseas orders). California residents must add sales tax. Specify that you want "Advances in Neural Information Processing Systems".

PRELIMINARY PROGRAM, NIPS '88
Denver, November 29-December 1, 1988

Tuesday AM
__________

SESSION O1: Learning and Generalization
________________________________________

Invited Talk
8:30 O1.1: "Birdsong Learning", Mark Konishi, Division of Biology, California Institute of Technology

Contributed Talks
9:10 O1.2: "Comparing Generalization by Humans and Adaptive Networks", M. Pavel, M.A. Gluck, V. Henkle, Department of Psychology, Stanford University
9:40 O1.3: "An Optimality Principle for Unsupervised Learning", T. Sanger, AI Lab, MIT
10:10 Break
10:30 O1.4: "Learning by Example with Hints", Y.S. Abu-Mostafa, California Institute of Technology, Department of Electrical Engineering
11:00 O1.5: "Associative Learning Via Inhibitory Search", D.H. Ackley, Cognitive Science Research Group, Bell Communications Research, Morristown, NJ
11:30 O1.6: "Speedy Alternatives to Back Propagation", J. Moody, C. Darken, Computer Science Department, Yale University

Tuesday PM
__________

12:00 Poster Preview I

SESSION P1A: Learning and Generalization
_________________________________________

P1A.1: "Efficient Parallel Learning Algorithms for Neural Networks", A. Kramer, Prof. A. Sangiovanni-Vincentelli, Department of EECS, U.C. Berkeley
P1A.2: "Properties of a Hybrid Neural Network-Classifier System", Lawrence Davis, Bolt Beranek and Newman Laboratories, Cambridge, MA
P1A.3: "Self Organizing Neural Networks For The Identification Problem", M.R. Tenorio, Wei-Tsih Lee, School of Electrical Engineering, Purdue University
P1A.4: "Comparison of Multilayer Networks and Data Analysis", P. Gallinari, S. Thiria, F. Fogelman-Soulie, Laboratoire d'Intelligence Artificielle, Ecole des Hautes Etudes en Informatique, Universite' de Paris 5, 75 006 Paris, France
P1A.5: "Neural Networks and Principal Component Analysis: Learning from Examples without Local Minima", P. Baldi, K. Hornik, Department of Mathematics, University of California, San Diego
P1A.6: "Learning by Choice of Internal Representations", Tal Grossman, Ronny Meir, Eytan Domany, Department of Electronics, Weizmann Institute of Science
P1A.7: "What Size Net Gives Valid Generalization?", D. Haussler, E.B. Baum, Department of Computer and Information Sciences, University of California, Santa Cruz
P1A.8: "Mean Field Annealing and Neural Networks", G. Bilbro, T.K. Miller, W. Snyder, D. Van den Bout, M. White, R. Mann, Department of Electrical and Computer Engineering, North Carolina State University
P1A.9: "Connectionist Learning of Expert Preferences by Comparison Training", G. Tesauro, University of Illinois at Urbana-Champaign, Champaign, IL
P1A.10: "Dynamic Hypothesis Formation in Connectionist Networks", M.C. Mozer, Department of Psychology and Computer Science, University of Toronto
P1A.11: "Digit Recognition Using a Multi-Architecture Feed Forward Neural Network", W.R. Gardner, L. Pearlstein, Department of Electrical Engineering, University of Delaware
P1A.12: "The Boltzmann Perceptron: A Multi-Layered Feed-Forward Network Equivalent to the Boltzmann Machine", Eyal Yair, Allen Gersho, Center For Information Processing Research, University of California
P1A.13: "Adaptive Neural-Net Preprocessing for Signal Detection in Non-Gaussian Noise", R.P. Lippmann, P.E. Beckmann, MIT Lincoln Laboratory, Lexington, MA
P1A.14: "Training Multilayer Perceptrons with the Extended Kalman Algorithm", S. Singhal, L. Wu, Bell Communications Research, Morristown, NJ
P1A.15: "GEMINI: Gradient Estimation through Matrix Inversion after Noise Injection", Y. LeCun, C.C. Galland, G.E. Hinton, Computer Science Department, University of Toronto
P1A.16: "Analysis of Recurrent Backpropagation", P.Y. Simard, M.B. Ottaway, D.H. Ballard, Department of Computer Science, University of Rochester
P1A.17: "Scaling and Generalization in Neural Networks: a Case Study", Subutai Ahmad, Gerald Tesauro, Center for Complex Systems Research, University of Illinois at Urbana-Champaign
P1A.18: "Does the Neuron "Learn" Like the Synapse?", R. Tawel, Jet Propulsion Laboratory, California Institute of Technology
P1A.19: "Experiments on Network Learning by Exhaustive Search", D. B. Schwartz, J. S. Denker, S. A. Solla, AT&T Bell Laboratories, Holmdel, NJ
P1A.20: "Some Comparisons of Constraints for Minimal Network Construction with Backpropagation", Stephen Jose Hanson, Lorien Y. Pratt, Bell Communications Research, Morristown, NJ
P1A.21: "Implementing the Principle of Maximum Information Preservation: Local Algorithms for Biological and Synthetic Networks", Ralph Linsker, IBM Thomas J. Watson Research Center, Yorktown Heights, NY
P1A.22: "Biological Implications of a Pulse-Coded Reformulation of Klopf's Differential-Hebbian Learning Algorithm", M.A. Gluck, D. Parker, E. Reifsnider, Department of Psychology, Stanford University

SESSION P1B: Applications
__________________________

P1B.1: "Comparison of Two LP Parametric Representations in a Neural Network-based Speech Recognizer", K.K. Paliwal, Tata Institute of Fundamental Research, Homi Bhabha Road, Bombay-400005, India
P1B.2: "Nonlinear Dynamical Modeling of Speech Using Neural Networks", N. Tishby, AT&T Bell Laboratories, Murray Hill, NJ
P1B.3: "Use of Multi-Layered Networks for Coding Speech with Phonetic Features", Y. Bengio, R. De Mori, School of Computer Science, McGill University
P1B.4: "Speech Production Using Neural Network with Cooperative Learning Mechanism", M. Komura, A. Tanaka, International Institute for Advanced Study of Social Information Science, Fujitsu Limited, Japan
P1B.5: "Temporal Representations in a Connectionist Speech System", E.J. Smythe, Computer Science Department, Indiana University
P1B.6: "TheoNet: A Connectionist Network Implementation of a Solar Flare Forecasting Expert System (Theo)", R. Fozzard, L. Ceci, G. Bradshaw, Department of Computer Science & Psychology, University of Colorado at Boulder
P1B.7: "An Information Theoretic Approach to Rule-Based Connectionist Expert Systems", R.M. Goodman, J.W. Miller, P. Smyth, Department of Electrical Engineering, California Institute of Technology, Pasadena, CA
P1B.8: "Neural TV Image Compression Using Hopfield Type Networks", M. Naillon, J.B. Theeten, G. Nocture, Laboratoires d'Electronique et de Physique Appliquee (LEP), France
P1B.9: "Neural Net Receivers in Spread-Spectrum Multiple-Access Communication Systems", B.P. Paris, G. Orsak, M.K. Varanasi, B. Aazhang, Department of Electrical & Computer Engineering, Rice University
P1B.10: "Performance of Synthetic Neural Network Classification of Noisy Radar Signals", I. Jouny, F.D. Garber, Department of Electrical Engineering, The Ohio State University
P1B.11: "The Neural Analog Diffusion-Enhancement Layer (NADEL) and Early Visual Processing", A.M. Waxman, M. Seibert, Laboratory for Sensory Robotics, Boston University
P1B.12: "A Cooperative Network for Color Segmentation", A. Hurlbert, T. Poggio, Center for Biological Information Processing, Whitaker College
P1B.13: "Neural Network Star Pattern Recognition for Spacecraft Attitude Determination and Control", P. Alvelda, M.A. San Martin, C.E. Bell, J. Barhen, The Jet Propulsion Laboratory, California Institute of Technology
P1B.14: "Neural Networks that Learn to Discriminate Similar Kanji Characters", Yoshihiro Mori, Kazuhiko Yokosawa, ATR Auditory and Visual Perception Research Laboratories, Osaka, Japan
P1B.15: "Further Explorations in the Learning of Visually-Guided Reaching: Making MURPHY Smarter", B.W. Mel, Center for Complex Systems Research, University of Illinois
P1B.16: "Using Backpropagation to Learn the Dynamics of a Real Robot Arm", K. Goldberg, B. Pearlmutter, Department of Computer Science, Carnegie-Mellon University

SESSION O2: Applications
_________________________

Invited Talk
2:20 O2.1: "Speech Recognition," John Bridle, Royal Radar Establishment, Malvern, U.K.

Contributed Talks
3:00 O2.2: "Modularity in Neural Networks for Speech Recognition," A. Waibel, Carnegie Mellon University
3:30 O2.3: "Applications of Error Back-propagation to Phonetic Classification," H.C. Leung, V.W. Zue, Department of Electrical Eng. & Computer Science, MIT
4:00 O2.4: "Neural Network Recognizer for Hand-Written Zip Code Digits: Representations, Algorithms, and Hardware," J.S. Denker, H.P. Graf, L.D. Jackel, R.E. Howard, W. Hubbard, D. Henderson, W.R. Gardner, H.S. Baird, I. Guyon, AT&T Bell Laboratories, Holmdel, NJ
4:30 O2.5: "ALVINN: An Autonomous Land Vehicle in a Neural Network," D.A. Pomerleau, Computer Science Department, Carnegie Mellon University
5:00 O2.6: "A Combined Multiple Neural Network Learning System for the Classification of Mortgage Insurance Applications and Prediction of Loan Performance," S. Ghosh, E.A. Collins, C. L. Scofield, Nestor Inc., Providence, RI

8:00 Poster Session I

Wednesday AM
____________

SESSION O3: Neurobiology
_________________________

Invited Talk
8:30 O3.1: "Cricket Wind Detection," John Miller, Department of Zoology, UC Berkeley

Contributed Talks
9:10 O3.2: "A Passive, Shared Element Analog Electronic Cochlea," D. Feld, J. Eisenberg, E.R. Lewis, Department of Electrical Eng. & Computer Science, University of California, Berkeley
9:40 O3.3: "Neuronal Maps for Sensory-motor Control in the Barn Owl," C.D. Spence, J.C. Pearson, J.J. Gelfand, R.M. Peterson, W.E. Sullivan, David Sarnoff Research Ctr, Subsidiary of SRI International, Princeton, NJ
10:10 Break
10:30 O3.4: "Simulating Cat Visual Cortex: Circuitry Underlying Orientation Selectivity," U.J. Wehmeier, D.C. Van Essen, C. Koch, Division of Biology, California Institute of Technology
11:00 O3.5: "Model of Ocular Dominance Column Formation: Analytical and Computational Results," K.D. Miller, J.B. Keller, M.P. Stryker, Department of Physiology, University of California, San Francisco
11:30 O3.6: "Modeling a Central Pattern Generator in Software and Hardware: Tritonia in Sea Moss," S. Ryckebusch, C. Mead, J. M. Bower, Computational Neural Systems Program, Caltech

Wednesday PM
____________

12:00 Poster Preview II

SESSION P2A: Structured Networks
_________________________________

P2A.1: "Training a 3-Node Neural Network is NP-Complete," A. Blum, R.L. Rivest, MIT Lab for Computer Science
P2A.2: "A Massively Parallel Self-Tuning Context-Free Parser," E. Santos Jr., Department of Computer Science, Brown University
P2A.3: "A Back-Propagation Algorithm With Optimal Use of Hidden Units," Y. Chauvin, Thomson CSF, Inc. / Stanford University
P2A.4: "Analyzing the Energy Landscapes of Distributed Winner-Take-All Networks," D.S. Touretzky, Computer Science Department, Carnegie Mellon University
P2A.5: "Dynamic, Non-Local Role Bindings and Inferencing in a Localist Network For Natural Language Understanding," T.E. Lange, M.G. Dyer, Computer Science Department, University of California, Los Angeles
P2A.6: "Spreading Activation Over Distributed Microfeatures," J. Hendler, Department of Computer Science, University of Maryland
P2A.7: "Short-term Memory as a Metastable State: A Model of Neural Oscillator For A Unified Submodule," A.B. Kirillov, G.N. Borisyuk, R.M. Borisyuk, Ye.I. Kovalenko, V.I. Kryukov, V.I. Makarenko, V.A. Chulaevsky, Research Computer Center, USSR Academy of Sciences
P2A.8: "Statistical Prediction with Kanerva's Sparse Distributed Memory," D. Rogers, Research Institute for Advanced Computer Science, NASA Ames Research Ctr, Moffett Field, CA
P2A.9: "Image Restoration By Mean Field Annealing," G.L. Bilbro, W.E. Snyder, Dept. of Electrical and Computer Engineering, North Carolina State University
P2A.10: "Automatic Local Annealing," J. Leinbach, Department of Psychology, Carnegie-Mellon University
P2A.11: "Neural Networks for Model Matching and Perceptual Organization," E. Mjolsness, G. Gindi, P. Anandan, Department of Computer Science, Yale University
P2A.12: "On the k-Winners-Take-All Feedback Network and Applications," E. Majani, R. Erlanson, Y. Abu-Mostafa, Jet Propulsion Laboratory, California Institute of Technology
P2A.13: "An Adaptive Network that Learns Sequences of Transitions," C.L. Winter, Science Applications International Corporation, Tucson, Arizona
P2A.14: "Convergence and Pattern-Stabilization in the Boltzmann Machine," M. Kam, R. Cheng, Department of Electrical and Computer Eng., Drexel University

SESSION P2B: Neurobiology
__________________________

P2B.1: "Storage of Covariance By The Selective Long-Term Potentiation and Depression of Synaptic Strengths In The Hippocampus", P.K. Stanton, J. Jester, S. Chattarji, T.J. Sejnowski, Department of Biophysics, The Johns Hopkins University
P2B.2: "A Mathematical Model of the Olfactory Bulb", Z. Li, J.J. Hopfield, Division of Biology, California Institute of Technology
P2B.3: "A Model of Neural Control of the Vestibulo-Ocular Reflex", M.G. Paulin, S. Ludtke, M. Nelson, J.M. Bower, Division of Biology, California Institute of Technology
P2B.4: "Associative Learning in Hermissenda: A Lumped Parameter Computer Model of Neurophysiological Processes", Daniel L. Alkon, Francis Quek, Thomas P. Vogl, Environmental Research Institute of Michigan, Arlington, VA
P2B.5: "Reconstruction of the Electric Fields of the Weakly Electric Fish Gnathonemus Petersii Generated During Exploratory Activity", B. Rasnow, M.E. Nelson, C. Assad, J.M. Bower, Department of Physics, California Institute of Technology
P2B.6: "A Model for Resolution Enhancement (Hyperacuity) in Sensory Representation", J. Miller, J. Zhang, Department of Zoology, University of California, Berkeley
P2B.7: "Coding Schemes for Motion Computation in Mammalian Cortex", H.T. Wang, B.P. Mathur, C. Koch, Rockwell International Science Ctr., Thousand Oaks, CA
P2B.8: "Theory of Self-Organization of Cortical Maps", S. Tanaka, NEC Corporation Fundamental Res. Lab., Kawasaki Kanagawa, 213 JAPAN
P2B.9: "A Bifurcation Theory Approach to the Programming of Periodic Attractors in Network Models of Olfactory Cortex", Bill Baird, Department of Biophysics, University of California at Berkeley
P2B.10: "Neuronal Cartography: population coding and resolution enhancement through arrays of broadly tuned cells", Pierre Baldi, Walter Heiligenberg, Department of Mathematics, University of California, San Diego
P2B.11: "Learning the Solution to the Aperture Problem for Pattern Motion with a Hebb Rule", M.I. Sereno, Division of Biology, California Institute of Technology
P2B.12: "A Model for Neural Directional Selectivity that Exhibits Robust Direction of Motion Computation", N.M. Grzywacz, F.R. Amthor, Center for Biological Information Processing, Whitaker College, Cambridge, MA
P2B.13: "A Low-Power CMOS Circuit which Emulates Temporal Electrical Properties of Neurons", J. Meador, C. Cole, Department of Electrical and Computer Engineering, Washington State University
P2B.14: "A General Purpose Neural Network Simulator for Implementing Realistic Models of Neural Circuits", M.A. Wilson, U.S. Bhalla, J.D. Uhley, J.M. Bower, Division of Biology, California Institute of Technology

SESSION P2C: Implementation
____________________________

P2C.1: "MOS Charge Storage of Adaptive Networks," R.E. Howard, D.B. Schwartz, AT&T Bell Laboratories, Holmdel, NJ
P2C.2: "A Self-Learning Neural Network," A. Hartstein, R.H. Koch, IBM-Thomas J. Watson Research Center, Yorktown Heights, NY
P2C.3: "An Analog VLSI Chip for Cubic Spline Surface Interpolation," J.G. Harris, Division of Computation and Neural Systems, California Institute of Technology
P2C.4: "Analog Implementation of Shunting Neural Networks," B. Nabet, R.B. Darling, R.B. Pinter, Department of Electrical Engineering, University of Washington
P2C.5: "Stability of Analog Neural Networks with Time Delay," C.M. Marcus, R.M. Westervelt, Division of Applied Sciences, Harvard University
P2C.6: "Analog subthreshold VLSI circuit for interpolating sparsely sampled 2-D surfaces using resistive networks," J. Luo, C. Koch, C. Mead, Division of Biology, California Institute of Technology
P2C.7: "A Physical Realization of the Winner-Take-All Function," J. Lazzaro, C.A. Mead, Computer Science, California Institute of Technology
P2C.8: "General Purpose Neural Analog Computer," P. Mueller, J. Van der Spiegel, D. Blackman, J. Dao, C. Donham, R. Furman, D.P. Hsieh, M. Loinaz, Department of Biochemistry and Biophysics, University of Pennsylvania
P2C.9: "A Silicon Based Photoreceptor Sensitive to Small Changes in Light Intensity," C.A. Mead, T. Delbruck, California Institute of Technology, Pasadena, CA
P2C.10: "A Digital Realisation of Self-Organising Maps," M.J. Johnson, N.M. Allinson, K. Moon, Department of Electronics, University of York, England
P2C.11: "Training of a Limited-Interconnect, Synthetic Neural IC," M.R. Walker, L.A. Akers, Center for Solid-State Electronics Research, Arizona State University
P2C.12: "Electronic Receptors for Tactile Sensing," A.G. Andreou, Department of Electrical and Computer Engineering, The Johns Hopkins University
P2C.13: "Cooperation in an Optical Associative Memory Based on Competition," D.M. Liniger, P.J. Martin, D.Z. Anderson, Department of Physics & Joint Inst. for Laboratory Astrophysics, University of Colorado, Boulder

SESSION O4: Computational Structures
_____________________________________

Invited Talk
2:20 O4.1: "Symbol Processing in the Brain," Geoffrey Hinton, Computer Science Department, University of Toronto

Contributed Talks
3:00 O4.2: "Towards a Fractal Basis for Artificial Intelligence," Jordan Pollack, New Mexico State University, Las Cruces, NM
3:30 O4.3: "Learning Sequential Structure In Simple Recurrent Networks," D. Servan-Schreiber, A. Cleeremans, J.L. McClelland, Department of Psychology, Carnegie-Mellon University
4:00 O4.4: "Short-Term Memory as a Metastable State: 'Neurolocator,' A Model of Attention," V.I. Kryukov, Research Computer Center, USSR Academy of Sciences
4:30 O4.5: "Heterogeneous Neural Networks for Adaptive Behavior in Dynamic Environments," R.D. Beer, H.J. Chiel, L.S. Sterling, Center for Automation and Intelligent Sys. Res., Case Western Reserve University, Cleveland, OH
5:00 O4.6: "A Link Between Markov Models and Multilayer Perceptrons," H. Bourlard, C.J. Wellekens, Philips Research Laboratory, Brussels, Belgium

7:00 Conference Banquet
9:00 Plenary Speaker: "Neural Architecture and Function," Valentino Braitenberg, Max Planck Institut fur Biologische Kybernetik, West Germany

Thursday AM
___________

SESSION O5: Applications
_________________________

Invited Talk
8:30 O5.1: "Robotics, Modularity, and Learning," Rodney Brooks, AI Lab, MIT

Contributed Talks
9:10 O5.2: "The Local Nonlinear Inhibition Circuit," S. Ryckebusch, J. Lazzaro, M. Mahowald, California Institute of Technology, Pasadena, CA
9:40 O5.3: "An Analog Self-Organizing Neural Network Chip," J. Mann, S. Gilbert, Lincoln Laboratory, MIT, Lexington, MA
10:10 Break
10:30 O5.4: "Performance of a Stochastic Learning Microchip," J. Alspector, B. Gupta, R.B. Allen, Bellcore, Morristown, NJ
11:00 O5.5: "A Fast, New Synaptic Matrix For Optically Programmed Neural Networks," C.D. Kornfeld, R.C. Frye, C.C. Wong, E.A. Rietman, AT&T Bell Laboratories, Murray Hill, NJ
11:30 O5.6: "Programmable Analog Pulse-Firing Neural Networks," Alan F. Murray, Lionel Tarassenko, Alister Hamilton, Department of Electrical Engineering, University of Edinburgh, Scotland, UK

12:00 Poster Session II

From steve at psyche.mit.edu Thu Sep 8 12:59:30 1988 From: steve at psyche.mit.edu (Steve Pinker) Date: Thu, 8 Sep 88 12:59:30 edt Subject: Comments on Marchman's note Message-ID: <8809081700.AA07322@ATHENA.MIT.EDU>

One thing that is not in dispute in the past tense debate: we could use more data on children's development and on the behavior of network models designed to acquire morphological regularities. It is good to see Virginia Marchman contribute useful results on these problems. In a complex area, however, it is especially important to be clear about the factual and theoretical claims under contention. In OLC, we praised R-M for their breadth of coverage of developmental data (primarily a diverse set of findings from Bybee & Slobin's experiments), and reviewed all of these data plus additional experimental, diary, and transcript studies. The thrust of Marchman's note is that "the data on the acquisition of the past tense in real children may be very different from the patterns assumed by either side in this debate".
More specifically, she cites Jay McClelland's recent prediction that future research will show that we have "substantially overstated the empirical situation in several respects". We are certainly prepared to learn that future research will modify our current summary of the data or fail to conform to predictions. But Marchman's experiment, as valuable as it is, largely replicates results that have been in the literature for some time and that have been discussed at length, most recently, by R-M and ourselves. Furthermore, the data she presents are completely consistent with the picture presented in OLC, and she does not actually document a single case where we "underestimated the complexity and degree of individual variation inherent in the process of acquiring the English past tense".

1. Marchman reports that 'a child can be in the "stage" of overgeneralizing the "add -ed" rule anywhere between 3 and 7 years old.' The fact that overregularizations occur over a span of several years is well-known in the literature, documented most thoroughly in the important work of Kuczaj in the late 1970's. It figures prominently in the summary of children's development in OLC (e.g. p. 137).

2. She calls into question the characterization of children's development as following a 'U'-shaped curve. The 'U'-sequence that R-M and we were referring to is simply that (i) very young children do not overregularize from the day they begin to talk, but can use some correct past tense forms (e.g. 'came') for a while before (ii) overregularizations (e.g. 'comed') appear in their speech, which (iii) diminish by adulthood. Thus if you plot percentage of overregularizations against time, the curve is nonmonotonic in a way that can be described as an inverted-U. This is all that we (or anyone else) mean by 'stages' or 'U-shaped development', no more, no less. No one claims that the transitions are discrete, or that the behavior within the stages is simple or homogeneous (this should be clear to anyone reading R-M or OLC). Marchman does not present any data that contradict this familiar characterization. Nor could she; her study is completely confined to children within stage (ii).

3. She reports that "errors took several forms beyond the one emphasized by P&P, i.e. overgeneralization of the "-ed" rule to irregular forms. Instead, errors seem to result from the misapplication of *several* (at least two) past tense formation processes" (identity mapping, vowel changes, and addition of 'ed'). But the fact that children can say 'bringed', 'brang', and 'brung' is hardly news. (We noted that these errors exist (e.g. 'bite/bote', p. 161, p. 180) and that they are rarer than '-ed' overregularizations (p. 160).) As for its role in the past tense debate, in OLC much attention is devoted to the acquisition of multiple regularization mechanisms in general (pp. 130-136) and identity-mapping (pp. 145-151) and vowel-shift subregularities (pp. 152-157) in particular. (Marchman does call attention to the fact that vowel-change subregularization errors can occur for *regular* verbs, as in 'pick/puck'. We find cases like 'trick/truck' in our naturalistic data as well. Interestingly, the R-M model never did this. All of its suprathreshold vowel-shift errors with regular verbs blended the vowel-change with a past tense ending (e.g. 'sip/sepped'). Indeed even among the irregulars it came up with a bare vowel-change response in only 1 out of its 16 outputs.
This is symptomatic of one of the major design problems of the model: its distributed representations make it prone to blending regularities rather than entertaining them as competitors.)

4. Contrary to the claim that we neglect individual variation in children, we explicitly discuss it in a number of places (see, e.g. p. 144).

5. Marchman writes, "In several interesting ways, children can be shown to treat irregular and regular verbs similarly during acquisition." This is identical to the claim in OLC (pp. 130-131, 135-136), though of course the interpretation of this fact is open to debate. We emphasized that the regularity of the English '-ed' rule and the irregularity of the (e.g.) 'ow/ew' alternation are not innate, but are things the child has to figure out from an input sample. This learning cannot be instantaneous and thus "the child who *has not yet figured out* the distinction between regular, subregular, and idiosyncratic cases will display behavior that is similar to a system that is *incapable of making* the distinction" (p. 136).

6. According to Marchman, we suggest that regulars and irregulars are tagged as such in the input. To our knowledge, no one has made this very implausible claim, certainly not us. On the contrary, we are explicitly concerned with the learning problems resulting from the fact that the distinction is *not* marked in the input (pp. 128-136).

7. Finally, Marchman previews a report of a set of runs from a new network simulation of past tense acquisition. We look forward to a full report, at which point detailed comparisons will become possible. At this point, in comparing her work to the OLC description of the past tense, she appears to have misinterpreted what we mean by the 'default' status of the regular rule. She writes as if it means that the regular rule is productively overgeneralized. However, the point of our discussion of the difference between the irregular and regular subsystems (pp. 114-123) is that there are about six criteria distinguishing regular from irregular alternations that go beyond the mere fact of generalizability itself. These criteria are the basis of the claim (reiterated in the Cog Sci Soc talk) that the regular rule acts as a 'default', in contrast to what happens in the R-M model (pp. 123-125). Marchman does not deal with these issues.

In sum, Marchman's data are completely consistent with the empirical picture presented in OLC. Steven Pinker Alan Prince

From steve at psyche.mit.edu Fri Sep 9 00:47:10 1988 From: steve at psyche.mit.edu (Steve Pinker) Date: Fri, 9 Sep 88 00:47:10 edt Subject: Two Observations of E. Bates Message-ID: <8809090448.AA17720@ATHENA.MIT.EDU>

(1) Concerning the development of the past tense, Elizabeth Bates writes "there is no U-shaped function", based on Marchman's data. This implies that researchers in the area have made some fundamental error that vitiates their attempts at theory. But, as noted in our comments on Marchman, the 'U'-sequence that everyone refers to is simply that (i) very young children do not overregularize from the day they begin to talk, but can use some correct past tense forms (e.g. 'came') for a while before (ii) overregularizations (e.g. 'comed') appear in their speech, which (iii) diminish by adulthood. Thus if you plot percentage of overregularizations against time, the curve is nonmonotonic in a way that can be described as an inverted-U. This is all that we (or anyone else) mean by 'stages' or 'U-shaped development', no more, no less.
No one claims that the transitions are discrete, or that the behavior within the stages is simple or homogeneous (this should be clear to anyone reading R&M or OLC). Marchman herself does not present any data that contradict this familiar characterization. Nor could she; her study is completely confined to children within stage (ii). Further discussion of the relation between Marchman's data and the empirical picture drawn in R-M, OLC and other studies can be found in our remarks on Marchman.

(2) Bates runs two issues together:
-whether judgments are always "crisp" ('sneaked' versus 'snuck'),
-whether verbs derived from nouns and adjectives are regular ('out-Sally-Rided' versus 'overrode').
The implication is that endemic sogginess of judgment, overlooked or suppressed by linguists, makes it impossible to say anything about regularization-through-derivation. Noncrispness of judgments of irregular forms was confronted explicitly in OLC, which has a pretty thorough documentation of the phenomenon (p. 116-117, p. 118-119, and the entire Appendix). The important thing about the cross-category effect is that it implies that the linguistic notions 'irregular', 'root', and 'syntactic category' have mentally-represented counterparts; it also emerges from a conspicuously narrow exposure to the data (since learners are not flooded with examples of denominal verbs that happen to be homophonous with irregulars); and it is found with consistency across languages. The effect could be true whether the judgments respect part-of-speech distinctions absolutely or probabilistically. As long as a significant proportion of the variance is uniquely accounted for by syntactic category, there is something to explain. In fact, of course, most of the relevant judgments are quite clear (*high-stuck the goalie, *kung the checkers piece; OLC p. 111), and there can be little question that syntactic category is a compelling force for regularization, far more potent than unaided semantics (e.g. 'he cut/*cutted a deal'; OLC pp. 112-113). We regard this effect (due largely to work by Kiparsky) as a major, surprising discovery about the way linguistic systems are organized. For specific hypotheses about *when* and *why* some such judgments should be fuzzy, see Note 17 (p. 112) and pp. 126-127. Alan Prince Steven Pinker

From bates at amos.ling.ucsd.edu Fri Sep 9 16:49:25 1988 From: bates at amos.ling.ucsd.edu (Elizabeth Bates) Date: Fri, 9 Sep 88 13:49:25 PDT Subject: Two Observations of E. Bates Message-ID: <8809092049.AA03836@amos.ling.ucsd.edu>

Does the U-shaped function, then, mean nothing more to P&P than the claim that errors come and go? If so, I see little here that cries out for unique qualitative mechanisms and/or representations, above and beyond garden-variety learning. However, even allowing the weak version of the U that P&P describe, it is still not inevitable that children begin with irregulars, then over-regularize. My own daughter, for example, passed directly into over-regularizations of "go" and "come" as her first-ever past tense forms. It seems to me that the question re whether unitary mental categories are required (to account for the irregular/regular contrast) ought to revolve around the presence of evidence for a *qualitative* difference between the two.
Otherwise, we are merely haggling over the price.....-liz

From bever at prodigal.psych.rochester.EDU Sat Sep 10 01:09:35 1988 From: bever at prodigal.psych.rochester.EDU (bever@prodigal.psych.rochester.EDU) Date: Sat, 10 Sep 88 01:09:35 EDT Subject: Light Message-ID: <8809100509.4381@prodigal.psych.rochester.edu>

Recent correspondence has focussed on the performance level of the Rumelhart and McClelland past tense learning model and subsequent models, under varying conditions of feeding. Pinker and Prince point out that the model is unsuccessful by normal statistical standards. The responses so far seem to be: (1) that's always the way it is with new models (Harnad), (2) adults may perform more like the model than P&P assume (Bates, Elman, McClelland) and (3) children may not conform to the rules very well either (Bates, Marchman). We think that the exact performance level and pattern of the model is not the only test of its validity, for the following reasons: 1) Such models work only insofar as they presuppose rule-based structures. 2) The past-tense overgeneralization errors are errors of behavior not knowledge. Many statistically valid models of phenomena are fundamentally incorrect: for example, Ptolemaic astronomy was reputed to be quite accurate for its day, especially compared with the original Copernican hypothesis. The question is, WHY does a model perform the way it does? We have demonstrated (Lachter and Bever, 1988; Bever, 1988) that existing connectionist models learn to simulate rule-governed behavior only insofar as the relevant structures are built into the model or the way it is fed. What would be important to show is that such models could achieve the same performance level and characteristics without structures and feeding schemes which already embody what is to be learned. At the moment, insofar as the models succeed statistically, they confirm the view that language learning presupposes structural hypotheses on the part of the child, and helpful input from the world. The exact performance level and pattern of children or models is of limited importance for another reason: what is at issue is linguistic KNOWLEDGE, not language behavior. There is considerable evidence that the overgeneralization behavior is a speech production error, not an error of linguistic knowledge. Children explicitly know the difference between the way they say the past tense of a verb and the way they ought to say it - the child says 'readed' for the same kind of reason that it says 'puscetti' - overgeneralization in speech production of a statistically valid property of the language. Most significant is the fact that a child knows when an adult teases it by imitating the way it says such words. Whatever the success or failure of an inductive model, it must fail to discover the distinction between structural knowledge and language behavior, a distinction which every child knows, and a distinction which is vital to understanding both the knowledge and the behavior the child exhibits. In failing to make the distinction, the more a model succeeds at mimicking the behavior, the clearer it becomes that it does NOT acquire the knowledge. The view that a bit of 'knowledge' is simply a 'behavioral generalization', taken to an extreme, begs the question about the representation of the distinction: insofar as it answers the question at all, it gets it wrong. Connectionist models offer a new way to study the role of statistically valid generalizations in the acquisition of complex structures.
For example, such models may facilitate the study of how structural hypotheses might be confirmed and extended behaviorally by the data the child receives (Bever, 1988): the models are potentially exquisite analytic engines which can detect subtle regularities in the environment, given a particular representational scheme. We think this may be their ultimate contribution to behavioral science. But they solve the puzzle about the relationship between structure and behavior no more than an adding machine tells us about the relationship between the nature of numbers and how children add and subtract. Tom Bever Joel Lachter Bever or Lachter @psych.prodigal.rochester.edu

References:
Recent net correspondence between Bates, Elman, Harnad, Marchman, McClelland, Pinker and Prince.
Bever, T.G. (1988) The Demons and the Beast - Modular and Nodular kinds of Knowledge. University of Rochester Technical Report, #48; to appear in Georgopoulos, C. and Ishihara, R. (Eds.), Interdisciplinary approaches to language. Kluwer, Dordrecht, in press.
Lachter, J. and Bever, T.G. (1988) The relation between linguistic structure and associative theories of language learning -- A constructive critique of some connectionist learning models. Cognition, 28, pp. 195-247.

From bondc at iuvax.cs.indiana.edu Sat Sep 10 11:11:14 1988 From: bondc at iuvax.cs.indiana.edu (Clay M Bond) Date: Sat, 10 Sep 88 10:11:14 EST Subject: No subject Message-ID:

>We think that the exact performance level and pattern of the model is not
>the only test of its validity, for the following reasons:
>
>1) Such models work only insofar as they presuppose rule-based
>structures.
>
>2) The past-tense overgeneralization errors are errors of behavior not
>knowledge.

These are not reasons. They are only so if both sides accept that structures are rule-based and that there is some difference between behavior and knowledge. For those who do not accept these assumptions, you have no test of validity; you cannot evaluate a model if you are making different assumptions.

>The exact performance level and pattern of children or models is
>of limited importance for another reason: what is at issue is
>linguistic KNOWLEDGE, not language behavior. There is
>considerable evidence that the overgeneralization behavior is a
>speech production error, not an error of linguistic knowledge.

Again, there is no such evidence without first assuming that there exists some difference between knowledge and behavior.

>representational scheme. We think this may be their ultimate
>contribution to behavioral science. But they solve the puzzle
>about the relationship between structure and behavior no more
>than an adding machine tells us about the relationship between
>the nature of numbers and how children add and subtract.

Once again, you have made no point above. Your arguments are remarkably similar to SLA projects which start out assuming the existence of UG, present data, and then conclude that UG exists. Whether one takes an agnostic position on these related differentiations, knowledge/behavior, competence/performance, brain/mind, micro/macrocognition is not relevant. What is relevant is that those who insist that these differentiations exist are obligated to show empirically exactly how they operate, where they reside, and how they map onto actual neurological processes, something they have conveniently ignored so far. That they must exist is highly debatable, to say the least; this, I think, is possibly the greatest contribution connectionism has offered.
Until such time as these things are proven, they will remain religious issues/tenets. Clay Bond Indiana University Department of Linguistics, bondc at iuvax.cs.indiana.edu

From hinton at ai.toronto.edu Sat Sep 10 16:58:29 1988 From: hinton at ai.toronto.edu (Geoffrey Hinton) Date: Sat, 10 Sep 88 16:58:29 EDT Subject: Bever's claims Message-ID: <88Sep10.141907edt.681@neat.ai.toronto.edu>

In a recent message, Bever claims the following: "We have demonstrated (Lachter and Bever, 1988; Bever, 1988) that existing connectionist models learn to simulate rule-governed behavior only insofar as the relevant structures are built into the model or the way it is fed. What would be important to show is that such models could achieve the same performance level and characteristics without structures and feeding schemes which already embody what is to be learned." This claim irks me since I have already explained to him that there are connectionist networks that really do discover representations that are not built into the initial network. One example is the family trees network described in Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986) Learning representations by back-propagating errors. Nature, 323, pages 533-536. I would like Bever to clearly state whether he thinks his "demonstration" applies to this (existing) network, or whether he is simply criticizing networks that lack hidden units. Geoff

From jlm+ at andrew.cmu.edu Sat Sep 10 13:35:21 1988 From: jlm+ at andrew.cmu.edu (James L. McClelland) Date: Sat, 10 Sep 88 13:35:21 -0400 (EDT) Subject: Light In-Reply-To: <8809100509.4381@prodigal.psych.rochester.edu> References: <8809100509.4381@prodigal.psych.rochester.edu> Message-ID:

It is true that there are different kinds of behavior which we could assess any model with respect to. One kind of task involves language use (production, comprehension); another is language judgement. Many connectionist models to date have addressed performance rather than judgement, but there is no intrinsic reason why judgements cannot be addressed in these models. Indeed, it is becoming standard to use the goodness of match between an expected pattern and an obtained pattern as a measure of tacit knowledge, say, of what should follow what in a sentence. Such errors can be used as the basis for some kinds of judgements. I do not mean to say that connectionists have already shown that their models account for the full range of factors that influence such judgements; but at least many of us take the view (at least implicitly) that the SAME connection information that governs performance can also be used to sustain various types of judgements. With regard to such judgements, at least as far as the past tense is concerned, the facts seem not to fit perfectly with Lachter and Bever's claims. Kuczaj [Child Development, 1978, p. 319] reports data from children aged 3:4 to 9:0. These children made grammaticality judgements of a variety of kinds of past-tense forms. The probability that each type of form was judged correctly is given below from his Table 1 on p. 321:

                                              Age Group
                                          Under 5   5 & 6   7 & up
 Grammatical No-Change verbs (hit)          1.00     1.00    1.00
 Regularized no-change verbs (hitted)        .28      .55     .05
 Grammatical Change verbs (ate)              .84      .94    1.00
*Regularized Change verbs (eated)            .89      .60     .26*
 Past + ed forms for Change verbs (ated)     .26      .57     .23

Marked with asterisks above is the line containing what Lachter and Bever call the overgeneralization error.
It will be seen that children of every age group studied found these sorts of forms acceptable at least to some degree. It is particularly clear in the youngest age group that such strings seem highly grammatical. The fact that there are other error types which show a much lower rate of acceptability for this group indicates that the high acceptance rate for the regularized forms is not simply due to a generalized tendency to accept anything in this age group. I do not want to suggest that there is a perfect correlation between performance in judgement tasks and measures obtained from either natural or elicited production data: One of the few things we know for certain is that different tasks elicit differences in performance. However the data clearly indicate that the child's judgements are actually strikingly similar to the patterns seen in naturalistic regularization data [Kuczaj, 1977, Journal of Verbal Learning and Verbal Behavior, p. 589]. First, the late emergence of "ated" type forms in natural production relative to "eated" type forms is reflected in the judgement data. Second, both in production and acceptance, regularized forms of no-change verbs score low relative to regularized forms of other types of exceptions. Kuczaj [78] even went so far as to ask kids what they thought their mothers would say when given a choice between correct, regularized, and past+ed. Their judgements of what they thought their mothers would say were virtually identical to their judgements of what they thought they would say at all age groups. In both kinds of judgements, choice of eated type responses drops monotonically while ated type responses peak in group 2. Jay McClelland

From PH706008%BROWNVM.BITNET at VMA.CC.CMU.EDU Sat Sep 10 17:14:50 1988 From: PH706008%BROWNVM.BITNET at VMA.CC.CMU.EDU (PH706008%BROWNVM.BITNET@VMA.CC.CMU.EDU) Date: Sat, 10 Sep 88 17:14:50 EDT Subject: Yann Le Cun's e-mail address Message-ID:

Does anyone know Yann Le Cun's e-mail address at the University of Toronto? Thanks in advance. --Charles Bachmann : ph706008 at brownvm Brown University

From jlm+ at andrew.cmu.edu Sat Sep 10 17:01:25 1988 From: jlm+ at andrew.cmu.edu (James L. McClelland) Date: Sat, 10 Sep 88 17:01:25 -0400 (EDT) Subject: correction Message-ID:

In my reply to Bever, the data in the table are the probability that each type of form is judged acceptable. I said "correctly" rather than "acceptable"; actually, judging some of the forms acceptable is an error; of course, it was that children made such errors that was the point of the message. Sorry if I confused anyone. -- Jay

From bondc at iuvax.cs.indiana.edu Sat Sep 10 20:28:17 1988 From: bondc at iuvax.cs.indiana.edu (Clay M Bond) Date: Sat, 10 Sep 88 19:28:17 EST Subject: No subject Message-ID:

Geoff Hinton:
>In a recent message, Bever claims the following:
>
>"We have demonstrated (Lachter and Bever, 1988; Bever, 1988) that existing
>connectionist models learn to simulate rule-governed behavior only insofar as
>the relevant structures are built into the model or the way it is fed. What
>
>This claim irks me since I have already explained to him that there are
>connectionist networks that really do discover representations that are not
>built into the initial network ...

I might say the same. The current project I am working on, along with Elise Breen, though in its infant stages, is an iac acquisition net, and no relevant structures, as Bever calls them, were built in. Our results so far are promising, though inconclusive.
I do not see, however, why one should expect mentalists to take data into account. They have always scorned data in favor of "intuition".

<<<<<<<<<<<<******<<<<<<<<<<<<******>>>>>>>>>>>>******>>>>>>>>>>>>
<< Clay Bond Indiana University Department of Linguistics >>
<< ARPA: bondc at iuvax.cs.indiana.edu >>
<<<<<<<<<<<<******<<<<<<<<<<<<******>>>>>>>>>>>>******>>>>>>>>>>>>

From Dave.Touretzky at B.GP.CS.CMU.EDU Sun Sep 11 10:10:17 1988 From: Dave.Touretzky at B.GP.CS.CMU.EDU (Dave.Touretzky@B.GP.CS.CMU.EDU) Date: Sun, 11 Sep 88 10:10:17 EDT Subject: Yann Le Cun's e-mail address In-Reply-To: Your message of Sat, 10 Sep 88 17:14:50 -0400. Message-ID: <595.589990217@DST.BOLTZ.CS.CMU.EDU>

Yann LeCun's address is yann%ai.toronto.edu at relay.cs.net. Please: if you're trying to locate someone's net address, send mail first to connectionists-request at cs.cmu.edu. The mailing list maintainers will be happy to help you. -- Dave

From hendler at dormouse.cs.umd.edu Sun Sep 11 11:20:57 1988 From: hendler at dormouse.cs.umd.edu (Jim Hendler) Date: Sun, 11 Sep 88 11:20:57 EDT Subject: more fuel for the fire Message-ID: <8809111520.AA14998@dormouse.cs.umd.edu>

While I come down on the side of the connectionists in the recent debates, I think some of our critics, and some of the criticisms of Bever and P&P, do focus on an area that is a weakness of most of the distributed models: it is one thing to learn features/structures/etc., it is another to apply these things appropriately during cognitive processing. While, for example, Geoff's model could be said to have generalized a feature corresponding to `gender', we would be hard pressed to claim that it could somehow make gender-based inferences. The structured connectionists have gone far beyond the distributed models when it comes to this. The models, albeit not learned, can make inferences based on probabilities and classifications and the like (cf. Shastri etc.). I believe that it is crucial to provide an explanation of how distributed representations can make similar inferences. One approach, which I am currently pursuing, is to use the weight spaces learned by distributed models as if they were structured networks -- spreading activation among the units and seeing what happens (the results look promising). Other approaches will surely be suggested and pursued. Thus, to reiterate my main point -- the fact that a backprop (or other) model has learned a function doesn't mean diddly until the internal representations built during that learning can be applied to other problems, can make appropriate inferences, etc. To be a cognitive model (and that is what our critics are nay-saying) we must be able to learn, our forte, but also to THINK, a true weakness of many of our current systems. -Jim H.

From hinton at ai.toronto.edu Mon Sep 12 13:05:43 1988 From: hinton at ai.toronto.edu (Geoffrey Hinton) Date: Mon, 12 Sep 88 13:05:43 EDT Subject: more fuel for the fire In-Reply-To: Your message of Sun, 11 Sep 88 11:20:57 -0400. Message-ID: <88Sep12.102617edt.98@neat.ai.toronto.edu>

The family trees model does make some simple inferences based on the features it has learned. It does the equivalent of inferring that a person's father's wife is the person's mother. Of course, it can only do this for people it knows about, and there are many more inferences that it cannot do.
Geoff

From alexis at marzipan.mitre.org Mon Sep 12 12:24:07 1988 From: alexis at marzipan.mitre.org (Alexis Wieland) Date: Mon, 12 Sep 88 12:24:07 EDT Subject: The Four-Quadrant Problem Message-ID: <8809121624.AA02190@marzipan.mitre.org.>

Well, so much for my quest for a problem which *requires* more than 2 layers. Hopefully this will exhaust the issue ... The conclusions are:
- a 2-layer net with a finite number of threshold units can't do the 4-quad problem to arbitrary accuracy (I'll demonstrate in a moment).
- any arbitrarily large subspace *can* be approximated to arbitrary accuracy with a finite number of nodes.
- with non-hard-threshold units (including sigmoids) and assuming infinite precision (and you *need* that infinite precision) you *can* do the 4-quad problem with 2 layers.

Demonstration that 2 layers of a finite number of threshold units can't do it: Assume that it can be done. Each node in the first layer creates a linear partition in the plane described by the input space. This finite set of partitions (lines) intersects at a finite number of points. Consider a circle centered at the origin which encloses all of these intersections. Assume without loss of generality that quads 1&3 have a greater value than quads 2&4, which is subsequently thresholded by the second layer node. Above the circle, crossing from left to right across the y-axis (or any "don't care" band) must result in a net gain (since quad-2 < quad-1), so the weights connecting those nodes to the output node must sum > 0. Below the circle a similar argument has the same sum < 0. The sum can't be both > and < 0, a contradiction, therefore it can't be done.

*BUT* Doing It With Other than thresholding units: If the hidden layer's transfer function is f(x) = 0 for x<=0 and f(x) = x for x>=0, then (as Ron Chrisley showed me at INNS) you can use a four node hidden layer. Given input weights (w,w), (-w,-w), (w,-w), (-w, w) on the first layer (no thresholds) and weights (w2, w2, -w2, -w2) from the hidden layer to the thresholding output node you've got it. I would add that you can approximate the semi-linear node to arbitrary accuracy with thresholding units if you only want to go out a finite distance. Also, you're really only using the fact that the transfer function is monotonically increasing -- as is a sigmoid (at least in theory). So if the four nodes in the hidden layer have *any* non-zero threshold you can use the same network with sigmoid units. Eventually your top node will have sigmoid(x + threshold) - sigmoid(x) as its input (which gets *very* small very fast) but in theory this should work.

In conclusion, this is a task that is quite simple with a 3-layer net (put a partition on the x and y axes and XOR their outputs). Instead, you can use an infinite number of nodes in 2 layers or see how well your particular computer deals with numbers like 10**-(10**10**10**...) with sigmoid nodes. In short, while it is clearly cleaner and more compact to do it with 3 layers, it *can* be done in theory with networks (which are not pragmatically realizable) with 2 layers.
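The semi-linear construction described above is easy to check numerically. The following sketch (Python with NumPy, not part of the original exchange) assumes unit weights w = w2 = 1; the function and variable names are illustrative only, and it is meant as a toy verification of the construction, not a definitive implementation.

# Toy check of the four-quadrant construction, assuming w = w2 = 1.
import numpy as np

def four_quadrant_net(points, w=1.0, w2=1.0):
    # First layer: four semi-linear units, f(z) = max(0, z), no thresholds,
    # with input weights (w, w), (-w, -w), (w, -w), (-w, w).
    W1 = w * np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
    hidden = np.maximum(0.0, points @ W1.T)
    # Second layer: weights (w2, w2, -w2, -w2) into a node thresholded at zero.
    w_out = w2 * np.array([1.0, 1.0, -1.0, -1.0])
    return hidden @ w_out > 0.0          # True should mean quadrants 1 & 3

rng = np.random.default_rng(0)
pts = rng.uniform(-10.0, 10.0, size=(100000, 2))
target = pts[:, 0] * pts[:, 1] > 0.0     # quadrant parity: 1 & 3 vs. 2 & 4
mismatches = int(np.sum(four_quadrant_net(pts) != target))
print("disagreements with quadrant parity:", mismatches)   # expect 0

With unit weights the output node receives |x+y| - |x-y|, which equals 2*min(|x|,|y|) when x and y have the same sign and -2*min(|x|,|y|) when they differ, so its sign tracks the quadrant parity exactly; that is what the check above confirms.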
Alexis Wieland wieland at mitre.arpa

From PVR%BGERUG51.BITNET at VMA.CC.CMU.EDU Mon Sep 12 18:04:00 1988 From: PVR%BGERUG51.BITNET at VMA.CC.CMU.EDU (Patrick Van Renterghem / Transputer Lab) Date: Mon, 12 Sep 88 18:04 N Subject: Change of subject (neural network coprocessor boards) References: > The Transputer Lab, Grotesteenweg Noord 2, +32 91 22 57 55 Message-ID:

Hello connectionists, I am not a fanatic pro or con of connectionism and neural networks, but I am more interested in their applications than their basics. I have the following questions:
* what kind of applications are neural networks used for? I know pattern recognition is a favorite subject, and I would like to know more about specific realizations (and performance, compared to algorithmic information processing), but there must be other application areas ??!!?? How about robotics, expert systems, image processing, ...
* What coprocessor boards exist, what is the price, performance, their drawbacks, advantages, ... Addresses of manufacturers would be appreciated.
Thanks in advance, Patrick Van Renterghem, State University of Ghent, Automatic Control Lab, Transputer Lab div., Grotesteenweg Noord 2, B-9710 Ghent-Zwijnaarde, Belgium. P.S.: Companies listening can send me information right away.

From steve at psyche.mit.edu Mon Sep 12 14:02:50 1988 From: steve at psyche.mit.edu (Steve Pinker) Date: Mon, 12 Sep 88 14:02:50 edt Subject: Reply to Bates' second note Message-ID: <8809121803.AA15544@ATHENA.MIT.EDU>

In her second note, Bates writes as if the argument for "unique qualitative mechanisms" were based entirely on the existence of a U-shaped learning curve. This reduction has the virtue of simplicity, but it bears little resemblance to the actual arguments in the literature, which work from a range of linguistic, psycholinguistic, and developmental evidence. In our paper we discuss a variety of developmental data independent of U-hood that bear on the question of what kinds of mental mechanisms are involved (OLC, pp. 139-145). We also examine the qualitative differences between the irregular and regular past-tense systems in some detail (pp. 108-125). Of course the issue is still open, but we doubt that the debate is ultimately going to turn on slogan-sized chunks of assertion. Aiming for another reduction, Bates asks, "Does the U-shaped function, then, mean nothing more to P&P than the claim that errors come and go?" What the U-shaped function means to us is "a function that is shaped like a U". You don't get a function shaped like a U merely if "errors come and go". You also need some correct performance around the time when errors come. Otherwise the function (percentage error vs. time) could be monotonically decreasing, not U-shaped. The evidence that Bates first cited against the U-shaped curve was based on a study that had nothing to do with the matter; then comes the terminological dispute. At this point, we'd like to sign off on the round robin. We welcome further inquiries, comments, and reprint requests at our own addresses. Alan Prince: prince at cogito.mit.edu Steven Pinker: steve at psyche.mit.edu

From Scott.Fahlman at B.GP.CS.CMU.EDU Mon Sep 12 20:41:22 1988 From: Scott.Fahlman at B.GP.CS.CMU.EDU (Scott.Fahlman@B.GP.CS.CMU.EDU) Date: Mon, 12 Sep 88 20:41:22 EDT Subject: "Layers" Message-ID:

I think it would help us all to follow these discussions if, when people want to talk about "N-layer networks" for some N, they would make it clear exactly what they are talking about.
Does N refer to layers of units or layers of tunable weights? If layers of units, are we counting only layers of hidden units, or are we including the output layers, or both the input and output layers? I think I've seen at least one paper that uses each of these definitions. Unfortunately, there seems to be no universal agreement on this bit of terminology, and without such agreement it requires a lot of work to figure out from context what is being claimed. Sometimes a researcher will carefully define what he means by "layer" in one message -- I think Alexis Wieland did this -- but then launch into a multi-message discussion spread over a couple of weeks. Again, this makes extra work for people trying to understand the discussion, since it's hard to keep track of who is using what kinds of layers, and it's a pain to go back searching through old messages. Perhaps I'm the only one who is confused by this. Does anyone believe that there *is* a standard or obvious definition for "layer" that we all should adhere to? It would be nice if we could all adopt the same terminology, but it may be too late in this case. -- Scott From harnad at Princeton.EDU Mon Sep 12 23:22:23 1988 From: harnad at Princeton.EDU (Stevan Harnad) Date: Mon, 12 Sep 88 23:22:23 edt Subject: On the Care & Feeding of Learning Models Message-ID: <8809130322.AA11408@mind> Tom Bever (bevr at db1.cc.rochester.edu) wrote: > Recent correspondence has focussed on the performance level of the > Rumelhart and McClelland past tense learning model and subsequent > models, under varying conditions of feeding... We have demonstrated > (Lachter and Bever, 1988; Bever, 1988) that existing connectionist > models learn to simulate rule-governed behavior only insofar as the > relevant structures are built into the model or the way it is fed. > What would be important to show is that such models could achieve > the same performance level and characteristics without structures > and feeding schemes which already embody what is to be learned. I don't understand the "feeding" metaphor. If I feed an inductive device data that have certain regularities, along with feedback as to what the appropriate response would be (say, for the sake of simplicity, they are all members of a dichotomy: Category C or Category Not-C), and the device learns to perform the response (here, dichotomization), presumably by inducing the regularities statistically from the data, what is there about this "feeding" regimen that "already embodies what is to be learned" (and hence, presumably, constitutes some sort of cheating)? Rather than cheating, it seems to me that rules that are come by in this way are the wages of "honest toil." Perhaps there is a suppressed "poverty-of-the-stimulus" premise here, to the effect that we are only considering data that are so underdetermined as to make their underlying regularities uninducible (i.e., the data do not sufficiently "embody" their underlying regularities to allow them to be picked up statistically). If this is what Bever has in mind, it would seem that this putative poverty has to be argued for explicitly, on a case by case basis. Or is the problem doubts about whether nets can do nontrivial generalization from their data-sets? But then wouldn't this too have to be argued separately? But objections to "feeding conditions" alone...? Is the objection that nets are being spoon-fed, somehow? How? Trial-and-error-sampling sounds more like doing it the old-fashioned way. Biased samples? Loaded samples? 
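As a concrete version of the scenario sketched in the preceding paragraphs -- an inductive device that is given nothing but labelled instances of Category C versus Not-C and a correction signal -- the toy sketch below (Python with NumPy) may help. The "regularity", the data, and all names are invented for illustration, and the sketch stands in for no particular model in this debate; it simply shows a device inducing a dichotomy by error-correction alone, with no description of the rule built in.

# Toy sketch: perceptron-style error-correction on labelled C / Not-C data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))            # stimuli described by two features
labels = X[:, 0] + 2.0 * X[:, 1] > 0.0   # the hidden regularity defining Category C

w = np.zeros(2)                          # the device starts with no built-in rule
b = 0.0
for _ in range(50):                      # repeated passes with corrective feedback
    for x, y in zip(X, labels):
        if ((x @ w + b) > 0.0) != y:     # wrong response -> adjust the weights
            step = 1.0 if y else -1.0
            w += step * x
            b += step

accuracy = float(np.mean(((X @ w + b) > 0.0) == labels))
print(f"proportion of training items now categorized correctly: {accuracy:.2f}")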
Stevan Harnad

From hinton at ai.toronto.edu Tue Sep 13 13:52:16 1988 From: hinton at ai.toronto.edu (Geoffrey Hinton) Date: Tue, 13 Sep 88 13:52:16 EDT Subject: "Layers" In-Reply-To: Your message of Mon, 12 Sep 88 20:41:22 -0400. Message-ID: <88Sep13.111249edt.407@neat.ai.toronto.edu>

As Scott points out, the problem is that a net with one hidden layer has:
3 layers of units (including input and output)
2 layers of modifiable weights
1 layer of hidden units.
Widrow has objected (quite reasonably) to calling the input units "units" since they don't have modifiable incoming weights, nor do they have a non-linear I/O function. So that means we will never agree on counting the total number of layers. The number of layers of modifiable weights is unambiguous, but has the problem that most people think of the "neurons" as forming the layers, and also it gets complicated when connections skip layers (of units). Terminology can be made unambiguous by referring to the number of hidden layers. This has a slight snag when the first layer of weights (counting from the input) is unmodifiable, since the units in the next layer are then not true hidden units (they don't learn representations), but we can safely leave it to the purists and flamers to worry about that. I strongly suggest that people NEVER use the term "layers" by itself. Either say "n hidden layers" or say "n+1 layers of modifiable weights". I don't think attempts to legislate in favor of one or the other of these alternatives will work. Geoff

From munnari!chook.ua.oz.au!guy at uunet.UU.NET Wed Sep 14 11:22:43 1988 From: munnari!chook.ua.oz.au!guy at uunet.UU.NET (guy smith) Date: Wed, 14 Sep 88 09:22:43 CST Subject: layer terminology Message-ID: <8809140131.AA21292@uunet.UU.NET>

re: what does N-layer really mean? I agree with Scott Fahlman that the lack of an accepted meaning for N-layer is confusing.
In the context of nets as clearly layered as Back Propagation nets, I think 'N' should refer to the number of layers of weights, which is also the number of layers of non-input nodes. Thus, a 0-layer net makes no sense, a single node is a 1-layer net, and the minimal net that can solve the XOR problem (calculating the parity of two binary inputs) is a 2-layer net. There is at least one rationale for this choice. If an N-layer net uses the outputs of an M-layer net for its input, you end up with an N+M layer net. Yours Pedantically, Guy Smith.

From moody-john at YALE.ARPA Tue Sep 13 21:27:55 1988
From: moody-john at YALE.ARPA (john moody)
Date: Tue, 13 Sep 88 21:27:55 EDT
Subject: network labeling conventions
Message-ID: <8809140127.AA08532@NEBULA.SUN3.CS.YALE.EDU>

I agree with Scott Fahlman that there is need for a standard labeling convention for multilayered networks. The convention which I prefer for an "N Layer Network" is diagramed below. Such a network has "(N-1) Internal Layers of Units" and "N Layers of Weights". Each connection has the same layer index as its post-synaptic processing unit. The output units are "Layer N". The input lines are not enumerated as a layer since they usually do no processing; for consistency, however, the input lines can be identified as "Layer 0". As a matter of style, I think it is confusing to use the same graphic symbols for input lines as for the non-linear processing units, since any operation performed on the input data prior to its arrival at the first layer of connections is really pre-processing and not part of the network computation proper. Along the same lines, it would be useful to use a distinguishing symbol when linear output units are used in networks which perform mappings from R^n to R^m.

Layer N Units (Outputs)     O O O O O O     Activations A^N_n
Layer N Weights                 /|\         Weight Values W^N_nm
                               / | \
Layer N-1 Units             O O O O O O     Activations A^(N-1)_m
Layer N-1 Weights               /|\         Weight Values W^(N-1)_ml
                               / | \
        .                        .
        .                        .
Layer 2 Units               O O O O O O     Activations A^2_k
Layer 2 Weights                 /|\         Weight Values W^2_kj
                               / | \
Layer 1 Units               O O O O O O     Activations A^1_j
Layer 1 Weights                 /|\         Weight Values W^1_ji
                               / | \
Layer 0 (Input Lines)       . . . . . .     Input Activations A^0_i

--John Moody
-------

From skrzypek at CS.UCLA.EDU Wed Sep 14 14:59:36 1988
From: skrzypek at CS.UCLA.EDU (Dr Josef Skrzypek)
Date: Wed, 14 Sep 88 11:59:36 PDT
Subject: "Layers"
In-Reply-To: Geoffrey Hinton's message of Tue, 13 Sep 88 13:52:16 EDT <88Sep13.111249edt.407@neat.ai.toronto.edu>
Message-ID: <8809141859.AA26284@lanai.cs.ucla.edu>

A layer of "neurons" means that all units (cells) in this layer are at the same FUNCTIONAL distance from some reference point, e.g. the input. It is rather simple and unambiguous. There is no need to discriminate against the input units by calling them something else but "units". Think of them as the same type of "neurons" which have a specialized transduction function. For example, photoreceptors transduce photons into electrical signals while other "neurons" transduce neurotransmitter-modulated ionic fluxes into electrical signals. Such an input unit might have many modifiable weights, some from lateral connections, others from the feedback pathways and one dedicated to the main transduction function. A similar argument can be used for output units or "hidden" units (why do they hide? and from whom?). A layer should refer to "neurons" (units) and not synapses (weights) because it is possible to have multiple synaptic interactions between two layers of "neurons". A layer of units, regardless of their function, is rather unambiguous.

Josef
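Rendering the labeling convention above as code may also help. In the illustrative Python sketch below (the layer sizes and the logistic nonlinearity are arbitrary choices, not part of anyone's proposal), A[0] holds the Layer 0 input activations and weight layer l connects unit layer l-1 to unit layer l, so an N-layer network has N layers of weights and N-1 internal layers of units:

----------------------------------------------------------------------
# Forward pass written with the unit-layer indexing described above:
# A[0] are the input lines ("Layer 0"), weight layer l shares the index
# of its post-synaptic unit layer l, and the output units are layer N.
import math, random

random.seed(1)

def forward(weights, biases, x):
    A = [x]                                  # A[0]: input activations
    for l in range(1, len(weights) + 1):     # weight/unit layers 1..N
        W, b = weights[l - 1], biases[l - 1]
        z = [sum(W[j][i] * A[l - 1][i] for i in range(len(A[l - 1]))) + b[j]
             for j in range(len(W))]
        A.append([1.0 / (1.0 + math.exp(-zj)) for zj in z])  # logistic units
    return A                                 # A[l]: activations of unit layer l

sizes = [4, 3, 2]   # layer 0 (input lines), layer 1 (internal), layer 2 (outputs)
weights = [[[random.gauss(0, 1) for _ in range(sizes[l - 1])]
            for _ in range(sizes[l])] for l in range(1, len(sizes))]
biases = [[0.0] * sizes[l] for l in range(1, len(sizes))]

A = forward(weights, biases, [0.1, 0.9, 0.3, 0.5])
print("layers of weights:", len(weights),
      "| internal layers of units:", len(sizes) - 2,
      "| output activations:", [round(a, 3) for a in A[-1]])
----------------------------------------------------------------------

In this convention the sketch's network is a 2-layer network with one internal layer of units, matching the labels in the diagram.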
From lakoff at cogsci.berkeley.edu Wed Sep 14 15:18:49 1988
From: lakoff at cogsci.berkeley.edu (George Lakoff)
Date: Wed, 14 Sep 88 12:18:49 PDT
Subject: No subject
Message-ID: <8809141918.AA01563@cogsci.berkeley.edu>

To: Pinker and Prince
From: George Lakoff
Re: Representational adequacy and implementability

Perhaps it's time to turn the discussion back on P&P and discuss the adequacy of the alternative they advocate. Let us distinguish first between learning and representation. Most generative linguistics involves representation and says nothing about learning. Representations there are constructed by linguists. However that theory of representation has some deep problems, a couple of which have come up in the discussion of past tenses. Here are two problems:

1. Generative phonology cannot represent prototype structures of the sort Bybee and Slobin described, and which arise naturally -- and hence can be described easily -- in connectionist models. As for regular cases: If one puts aside learning and concentrates on representation, there is no reason why one could not hand-construct representations of regularities in connectionist networks, so that general principles are represented by patterns of weights. If this is the case, then, on representational grounds, connectionist foundations for linguistics would appear to be more adequate than generative foundations that use symbol-manipulation algorithms. If generative phonologists can represent the irregular cases, then let's see the representations. Moreover, it would seem that if such prototype phenomena cannot be represented generatively, then a Pinker-style learning device, which learns generative representations, should not be able to learn such prototype phenomena, since a learning device can't learn something it can't represent.

2. P&P blithely assume that generative linguistic representations could be implemented in the brain's neural networks. There is reason to doubt this. Generative phonology uses sequentially-ordered operations that generate proof-like `derivations'. These sequentially-ordered operations do not occur in real time. (No one claims that they do or should, since that would make psychologically indefensible claims.) The derivations are thought of as being like mathematical proofs, which stand outside of time. Now everything that happens in the brain does happen in real time. The question is: Can non-temporal operations like the rules of generative phonology be implemented in a brain? Can they even be implemented in neural networks at all? If so, what is the implementation like, and would it make assumptions incompatible with what brains can do? And what happens to the intermediate stages of derivations in such an implementation? Such stages are claimed by generative phonologists to have ``psychological reality'', but if they don't occur in real time what reality do they have? For P&P to defend generative phonology as a possible alternative, they must show, not just assume, that the nontemporal sequential operations of generative phonology can be implemented, preserving generations, in neural networks operating in real time. I have not heard any evidence coming from P&P on this issue.

Incidentally, I have outlined a phonological theory that has no atemporal sequential operations and no derivations, that can state real linguistic generalizations, and that can be implemented in connectionist networks.
A brief discussion appears in my paper in the Proceedings of the 1988 Connectionist Summer School, to be published soon by Morgan Kaufman. Well? Do Pinker and Prince have a demonstration that generative phonology can be implemented in neural networks or not? If the atemporal sequential operations of generative phonology cannot be implemented in brain's neural networks, that is a very good reason to give up on generative phonology as a cognitively-plausible theory. * * * Incidentally, I agree with Harnard that nothing P&P said in their paper has any ultimate consequences for the adequacy of connectionist foundations for linguistics. I am, in fact, on the basis of what I know about both existing generative foundations and possible connectionist foundations, I am much more optimistic about connectionist foundations. From alexis at marzipan.mitre.org Wed Sep 14 15:17:18 1988 From: alexis at marzipan.mitre.org (Alexis Wieland) Date: Wed, 14 Sep 88 15:17:18 EDT Subject: Layer Conventions Message-ID: <8809141917.AA01447@marzipan.mitre.org.> Maybe I'm just a pessimist, but I think we're always going to have to define what we mean by layers in a specific context and proceed from there. Geoff points out that conventions become muddled when you have "skip level" arcs (which are becoming pretty prevalent at least in our neck of the woods). It gets worse with feedback and down right ugly with random connections or in networks that dynamically change/grow (yes, we're playing with those too). And we all *know* that lateral connections within a layer don't increase the layer count, but what about laterally connected net with limited feedback (my graphics system starts making simplifying assumptions about now). It really depend on how *you* conceptualize them (or how your graphics draws them). And then what about Hopfield/Boltzman/Cauchy/... nets which are fully bi-directionally connected? Is that one very connected layer or a separate layer per node; and what if it has input/output from somewhere else? "Layers" are nice intuitive constructs which are enormously helpful in describing nets, but (following the INNS preident's speaking style) it's rather like good and evil, we all know what they are until we have to give precise definitions. I have a sinking feeling that we will always be the field that can't agree how to count. alexis. wieland at MITRE.arpa From pratt at paul.rutgers.edu Wed Sep 14 15:51:22 1988 From: pratt at paul.rutgers.edu (Lorien Y. Pratt) Date: Wed, 14 Sep 88 15:51:22 EDT Subject: Updated schedule for fall Rutgers Neural Network colloquium series Message-ID: <8809141951.AA01417@zztop.rutgers.edu> Fall, 1988 Neural Networks Colloquium Series at Rutgers The field of Neural networks (Connectionism, Parallel Distributed Processing) has enjoyed a resurgence in recent years, and has important implications for computer scientists interested in artificial intelligence, parallel processing, and other areas. This fall, the Rutgers Department of Computer Science is hosting several researchers in the field of neural networks to talk about their work. Talks are open to the public, and will be held on Fridays at 11:10 in the 7th floor lounge of Hill Center on the Busch campus of Rutgers University. Refreshments will be served beforehand and we hope most speakers to be available for informal discussion over lunch afterwards. Our tentative schedule follows. 
The schedule will no doubt change throughout the semester; the latest version can always be found in paul.rutgers.edu:/grad/u4/pratt/Colloquia/schedule or aramis.rutgers.edu:/aramis/u1/pratt/Colloquia/schedule. In addition, abstracts for each talk will be posted locally. Speaker Date (tentative) Title ------- ------- ----------------- Sara Solla 9/16/88 Learning and Generalization in Layered Bell Labs Neural Networks David Touretzky 9/23/88 What is the relationship between CMU connectionist and symbolic models? What can we expect from a connectionist knowledge representation? Steve Hanson 9/30/88 Some comments and variations on back Bellcore propagation Hector Sussmann 10/14/88 The theory of Boltzmann machine Rutgers Math learning Josh Alspector 10/21/88 Neural network implementations in Bellcore hardware Mark Jones 11/11/88 Knowledge representation in Bell Labs connectionist networks, including inheritance reasoning and default logic. E. Tzanakou 11/18/88 --unknown-- Rutgers biomed Bob Allen 12/2/88 A neural network which uses language Bellcore From terry at cs.jhu.edu Thu Sep 15 00:50:11 1988 From: terry at cs.jhu.edu (Terry Sejnowski ) Date: Thu, 15 Sep 88 00:50:11 edt Subject: "Layers" Message-ID: <8809150450.AA01503@crabcake.cs.jhu.edu> Lets not lock ourselves into a terminology that applies to only a special case -- feedforward nets. When feedback connections are allowed the relationships between the units become more complex. For example, in Pineda's recurrent backprop algorithm, the distinction between input and output units is blurred -- the same unit can have both roles. The only distinction that remains is that of hidden units, and the number of synapses separating a given hidden units from a given input or output unit. The topologies can get quite complex. Terry ----- From pratt at paul.rutgers.edu Fri Sep 16 09:55:37 1988 From: pratt at paul.rutgers.edu (Lorien Y. Pratt) Date: Fri, 16 Sep 88 09:55:37 EDT Subject: David Touretzky on connectionist vs. symbolic models, knowledge rep. Message-ID: <8809161355.AA01444@zztop.rutgers.edu> Fall, 1988 Neural Networks Colloquium Series at Rutgers presents David Touretzky Carnegie-Mellon University Room 705 Hill center, Busch Campus Friday September 23, 1988 at 11:10 am Refreshments served before the talk Abstract My talk will explore the relationship between connectionist models and symbolic models, and ask what sort of things we should expect from a connectionist knowledge representation. In particular I'm interested in certain natural language tasks, like prepositional phrase attachment, which people do rapidly and unconsciously but which involve complicated inferences and a huge amount of world knowledge. From hi.pittman at MCC.COM Fri Sep 16 12:01:00 1988 From: hi.pittman at MCC.COM (James Arthur Pittman) Date: Fri, 16 Sep 88 11:01 CDT Subject: "Layers" In-Reply-To: <8809150450.AA01503@crabcake.cs.jhu.edu> Message-ID: <19880916160115.0.PITTMAN@DIMEBOX.ACA.MCC.COM> Could you give a reference for Pineda's recurrent backprop algorithm? Sounds interesting. And by the way, whats a crabcake? From jam at bu-cs.bu.edu Fri Sep 16 14:22:16 1988 From: jam at bu-cs.bu.edu (Jonathan Marshall) Date: Fri, 16 Sep 88 14:22:16 EDT Subject: 1988 Tech Report Message-ID: <8809161822.AA25086@bu-cs.bu.edu> The following material is available as Boston University Computer Science Department Tech Report #88-010. 
It may be obtained from rmb at bu-cs.bu.edu or by writing to Regina Blaney, Computer Science Dept., Boston Univ., 111 Cummington St., Boston, MA 02215, U.S.A. I think the price is $7.00. ----------------------------------------------------------------------- SELF-ORGANIZING NEURAL NETWORKS FOR PERCEPTION OF VISUAL MOTION Jonathan A. Marshall ABSTRACT The human visual system overcomes ambiguities, collectively known as the aperture problem, in its local measurements of the direction in which visual objects are moving, producing unambiguous percepts of motion. A new approach to the aperture problem is presented, using an adaptive neural network model. The neural network is exposed to moving images during a developmental period and develops its own structure by adapting to statistical characteristics of its visual input history. Competitive learning rules ensure that only connection ``chains'' between cells of similar direction and velocity sensitivity along successive spatial positions survive. The resultant self-organized configuration implements the type of disambiguation necessary for solving the aperture problem and operates in accord with direction judgments of human experimental subjects. The system not only accommodates its structure to long-term statistics of visual motion, but also simultaneously uses its acquired structure to assimilate, disambiguate, and represent visual motion events in real-time. ------------------------------------------------------------------------ I am now at the Center for Research in Learning, Perception, and Cognition, 205 Elliott Hall, University of Minnesota, Minneapolis, MN 55414. I can still be reached via my account jam at bu-cs.bu.edu . --J.A.M. From Dave.Touretzky at B.GP.CS.CMU.EDU Fri Sep 16 21:02:19 1988 From: Dave.Touretzky at B.GP.CS.CMU.EDU (Dave.Touretzky@B.GP.CS.CMU.EDU) Date: Fri, 16 Sep 88 21:02:19 EDT Subject: Layers In-Reply-To: Your message of Fri, 16 Sep 88 14:28:00 -0400. <590437703/mjw@F.GP.CS.CMU.EDU> Message-ID: <1053.590461339@DST.BOLTZ.CS.CMU.EDU> > From: Michael.Witbrock at F.GP.CS.CMU.EDU > Let the distance between two units be defined as the *minimal* number of > modifiable weights forming a path between them (i.e. the number of > weights on the shortest path between the two nodes) . > Then the Layer in which a unit lies is the minimal distance between it > and an input unit. I think you meant to use MAXIMAL distance in the definition of which layer a unit lies in. If one uses minimal distance, then in a net with direct connections from input to output, the output layer would always be layer 1, even if there were hidden units forming layers 2, 3, etc. For this definition to make sense, it should always be the case that if unit i has a connection to unit j, then Layer(i) <= Layer(j). > The number of layers in the network is the maximum value of the distance between any unit and an input unit. We should tighten this up by specifying that it's ONE PLUS the maximum distance between any unit and an input unit, EXCLUDING CYCLES. This definition is fine for feed-forward nets, but it isn't very satisfying for recurrent nets like Pineda's. Imagine a recurrent backprop net in which every unit was connected to every other. If such a net has N units, then by Michael's definition it has N layers. What's really strange is that layers 1 through N-1 are empty, and layer N has N units in it. The notion of layers is just not as useful in recurrent networks. It is perhaps better to speak in terms of modules. 
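Before the discussion moves on to modules, the maximal-distance rule just described can be made concrete with a short illustrative Python sketch (the connectivity is hypothetical, and only feedforward nets are handled):

----------------------------------------------------------------------
# Layer assignment for a feedforward net: a unit's layer is its MAXIMAL
# distance, counted in modifiable weights, from the input units.  Passing
# combine=min instead shows why the minimal-distance version misbehaves
# when there are direct input-to-output connections.
def layers(inputs, edges, combine=max):
    # edges: (pre, post) pairs, one modifiable weight each; no cycles
    layer = {u: 0 for u in inputs}
    changed = True
    while changed:                        # relax until all path lengths settle
        changed = False
        for pre, post in edges:
            if pre in layer:
                cand = layer[pre] + 1
                best = layer.get(post)
                new = cand if best is None else combine(best, cand)
                if new != best:
                    layer[post] = new
                    changed = True
    return layer

# units 0,1: inputs; 2: hidden; 3: output, plus a direct 0 -> 3 connection
edges = [(0, 2), (1, 2), (2, 3), (0, 3)]
print("max-distance layers:", layers([0, 1], edges, combine=max))
print("min-distance layers:", layers([0, 1], edges, combine=min))
----------------------------------------------------------------------

With the maximal distance the output unit sits in layer 2 even though it also receives a direct connection from an input, so Layer(i) <= Layer(j) holds whenever unit i feeds unit j; with the minimal distance it would be pulled down to layer 1, which is the anomaly noted above.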
A module might be defined as a set of units with similar connectivity patterns, or as a set of units that are densely connected to each other and less densely connected to units in other modules. This isn't a nice, clean, graph-theoretic definition, but then whoever said life was as simple as graph theory?

-- Dave

From todd at galadriel.STANFORD.EDU Fri Sep 16 21:09:08 1988
From: todd at galadriel.STANFORD.EDU (Peter Todd)
Date: Fri, 16 Sep 88 18:09:08 PDT
Subject: Layers
In-Reply-To: Your message of Fri, 16 Sep 88 14:28:00 EDT. <590437703/mjw@F.GP.CS.CMU.EDU>
Message-ID:

I would, in fact, argue AGAINST that definition, because, for instance, in the following example:

   O O
  /|\ |\
 / | \| \
 | O O |
 \ | /| /
  \|/ |/
   O O

(O's are units, all rest are connections) we have a TWO layer network (max. number of weights from input units to any other units) and yet EVERY non-input unit is in the FIRST layer (min. number of weights from input unit to any unit). Seems pretty counterintuitive. --peter todd

From kawahara at av-convex.ntt.jp Sat Sep 17 09:47:31 1988
From: kawahara at av-convex.ntt.jp (Hideki KAWAHARA)
Date: Sat, 17 Sep 88 22:47:31+0900
Subject: I love BP. But your time is over. (News from Japan).
Message-ID: <8809171347.AA25230@av-convex.NTT.jp>

I love BP. But, your time is over. I presented our new method for designing feedforward artificial neural networks, which can approximate an arbitrary continuous mapping from an n-dimensional hyper-cube to m-dimensional space, at the IEICE-Japan technical meeting on 16/Sept./1988. The method SPAN (Saturated Projection Algorithm for Neural network design) can incorporate a-priori knowledge on the mapping to be approximated. Computational steps required for training a specific network are several hundredths or thousandths of those required by conventional BP procedures. SPAN, I hope, will replace a considerable amount of the thoughtless applications of BP which are usually found in this feverish atmosphere of Neuro-computing in Japan. And I also hope this will let researchers change their attention to the more essential problems (representations, dynamics, associative memory, inference....and so on). This doesn't mean that SPAN covers BP completely. Instead, SPAN is cooperative with BP, LVQ by Kohonen, and many other neural algorithms. Only thoughtless applications will be discouraged. The IEICE technical report is already available on request. However, it is written in Japanese. An elaborated English version will be available by the end of this year. If you are interested in our report, please mail to the address given at the end of this mail.

References:
[1] Kawahara, H. and Irino, T.: "A Procedure for Designing 3-Layer Neural Networks Which Approximate Arbitrary Continuous Mapping: Applications to Pattern Processing," PRU88-54 IEICE Technical Report, Vol.88, No.177, pp.47-54, (Sept. 1988). (in Japanese) This is the report mentioned above. Sorry for using the ambiguous term 3-Layer: networks designed by SPAN have one hidden layer with two adjustable weight-layers.
[2] Irie, B. and Miyake, S.: "Capabilities of Three-layered Perceptrons," ICNN88, pp.I-641-648, (1988).
[3] Funahashi, K.: "On the Capabilities of Neural Networks," MBE88-52 IEICE Technical Report, pp.127-134, (July 1988). (in Japanese).

These are useful for understanding SPAN. [2] gives an explicit algorithm for designing neural networks which can approximate an arbitrary continuous mapping. [3] provides mathematical proof of statements given in [2]. However, these results provide no practical implementations.
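For readers without access to [2] or [3], the flavor of the result they establish is the standard single-hidden-layer approximation statement (a generic formulation, not a quotation from either report): for any continuous f : K -> R, with K a compact subset of R^n, and any epsilon > 0,

\[
\exists\, N,\; c_i,\theta_i\in\mathbb{R},\; w_i\in\mathbb{R}^{n}:\quad
\Bigl|\, f(x)\;-\;\sum_{i=1}^{N} c_i\,\sigma\!\left(w_i\cdot x+\theta_i\right)\Bigr| \;<\; \varepsilon
\quad\text{for all } x\in K,
\]

where sigma is a fixed bounded, monotone, continuous sigmoid; mappings into R^m are handled coordinate-wise. Such results guarantee that a suitable sum exists but, as noted above, provide no practical construction and no bound on how large N must be.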
[3] is submitted to the INNS journal. ---------------------------------------------------------------- Reports presented at the IECE technical meeting on Pattern Recognition and Understanding 16/Sept./1988, Tokyo, Japan. --excerpts ------- Special session on Pattern Recognition and Neural Networks (3)PRU88-50:"On the Learning Network, Stochastic Vector Machine," Saito, H. and Ejima, T., Nagaoka University of Technology. (4)PRU88-51:"An Order Estimation of Stochastic Process Model using Neural Networks," Ohtsuki, N., Kushiro National College of Technology, Miyanaga, Y. and Tochinai, K., Faculty of Engineering, Hokkaido University, and Ito, H., Kushiro National College of Technology. (5)PRU88-52:"Selection of High-Accurate Spectra using Neural Model," Hiroshige, M., Miyanaga, Y. and Tochinai, K., Faculty of Engineering, Hokkaido University. (6)PRU88-53:"Stereo Disparity Detection with Neural Network Model," Maeda, E. and Okudaira, M., NTT Human Interface Laboratories. (7)PRU88-54:"A Procedure for Designing 3-Layer Neural Networks which Approximate Arbitrary Continuous Mapping: Applications to Pattern Processing, Kawahara, H. and Irino, T., NTT Basic Research Laboratories. (8)PRU88-55:"Speaker-Independent Word Recognition using Dynamic Programming Neural Networks," Isotani, R. Yoshida, K. Iso, K. Watanabe, T. and Sakoe, H., C&C Information Technology Res. Labs. NEC Corporation. (9)PRU88-56:"Character Recognition by Neuro Pattern Matching," Tsukui, Y. and Hirai, Y., Univ. of Tsukuba. (10)PRU88-57:"Recognition of Hand-written Alphanumeric Characters by Artificial Neural Network," Koda, T. Takagi, H. and Shimeki, Y., Central Research Laboratories Matsushita Electric Industrial Co., Ltd. (11)PRU88-58:"Character Recognition using Neural Network," Yamada, K. Kami, H. Mizoguchi, M. and Temma, T., C&C Information Technology Research Laboratories, NEC Corporation. (12)PRU88-59:"Aiming at a Large Scale Neural Network," Mori, Y., ATR Auditory and Visual Perception Research Laboratories. These are all written in Japanese. If you can read Japanese, you can order these technical reports to the following address. -------- The Institute of Electronics, Information and Communication Engineers Kikai-shinko-Kaikan Bldg., 5-8, Shibakoen 3 chome, Minato-ku, TOKYO, 105 JAPAN. ------- The price will be about $8.00 (please add postage (about $5.00 ??)). You can also find some of the authors listed above at the NIPS meeting in Denver. (Speaking for myself, I'd like to attend it. However the budgetary conditions.......) Next month, we have several annual meetings with special sessions on neural networks. I'll report them in the near future. Hideki Kawahara --------------------------------------------------------- e-mail: kawahara%nttlab.ntt.JP at RELAY.CS.NET (from ARPA) s-mail: Hideki Kawahara Information Science Research Laboratory NTT Basic Research Laboratories 3-9-11, Midori-cho, Musashino, TOKYO, 180 JAPAN. tel: +81 422 59 2276 fax: +81 422 59 3016 --------------------------------------------------------- From moody-john at YALE.ARPA Mon Sep 19 15:12:57 1988 From: moody-john at YALE.ARPA (john moody) Date: Mon, 19 Sep 88 15:12:57 EDT Subject: Speedy Alternatives to Back Propagation Message-ID: <8809191913.AA02990@NEBULA.SUN3.CS.YALE.EDU> At Yale, we have been studying two classes of neurally- inspired learning algorithms which offer 1000-fold speed increases over back propagation for learning real-valued functions. 
These algorithms are "Learning with localized receptive fields" and "An interpolating, multi-resolution CMAC", where CMAC means Cerebellar Model Articulation Con- troller. Both algorithms were presented in talks entitled "Speedy Alternatives to Back Propagation" given at Snowbird (April '88), nEuro '88 (Paris, June '88), and INNS (Boston, September '88). A research report describing the localized receptive fields approach is now available. Another research report describing the CMAC models will be available in about two weeks. To receive copies of these, please send a request to Judy Terrell at terrell at yalecs.bitnet, terrell at yale.arpa, or terrell at cs.yale.edu. Be sure to include your mailing address. There is no charge for the research reports, and they are written in English! An abstract follows. --John Moody Learning with Localized Receptive Fields John Moody and Christian Darken Yale Computer Science Department PO Box 2158 Yale Station, New Haven, CT 06520 Research Report YALEU/DCS/RR-649 September 1988 Abstract We propose a network architecture based upon localized receptive field units and an efficient method for training such a network which combines self-organized and supervised learning. The network architecture and learning rules are appropriate for real-time adaptive signal processing and adaptive control. For a test problem, predicting a chaotic timeseries, the network learns 1000 times faster in digital simulation time than a three layer perceptron trained with back propagation, but requires about ten times more training data to achieve comparable prediction accuracy. This research report will appear in the Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann, Publishers 1988. The work was supported by ONR grant N00014-86-K-0310, AFOSR grant F49620-88-C0025, and a Purdue Army subcontract. ------- From sontag at fermat.rutgers.edu Mon Sep 19 15:27:23 1988 From: sontag at fermat.rutgers.edu (Eduardo Sontag) Date: Mon, 19 Sep 88 15:27:23 EDT Subject: Recent abstracts Message-ID: <8809191927.AA00586@control.rutgers.edu> I enclose abstracts of some recent technical reports. (Ignore the skip numbers; the rest are not in any manner related to NN's.) ***Any suggestions as to which journal to send 88-08 to*** would be highly appreciated. (There don't appear to be any journals geared towards very mathematical papers in NN's, it would seem.) ________________________________________________________________________ ABSTRACTS OF SYCON REPORTS Rutgers Center for Systems and Control Hill Center, Rutgers University, New Brunswick, NJ 08903 E-mail: sycon at fermat.rutgers.edu [88-01] Two algorithms for the Boltzmann machine: description, implementation, and a preliminary comparative study of performance , Lorien Y. Pratt and H J. Sussmann, July 88. This report compares two algorithms for learning in neural networks: the Boltzmann and modified Boltzmann machines. The Boltzmann machine has been extensively studied in the past; we have recently developed the modified Boltzmann machine. We present both algorithms and discuss several considerations which must be made for their implementation. We then give a complexity analysis and preliminary empirical comparison of the two algorithms' learning ability on a benchmark problem. For this problem, the modified Boltzmann machine is shown to learn slightly slower than the Boltzmann machine. However, the modified algorithm does not require the user to build an annealing schedule to be used for training. 
Since building this schedule constitutes a significant amount of the engineering time for the Boltzmann algorithm, we feel that our modified algorithm may be superior to the classical one. Since we have not yet performed a rigorous comparison of the two algorithms' performance, it may also be possible to optimize the parameters to the modified algorithm so that the learning speed is comparable to the classical version. [88-02] Some remarks on the backpropagation algorithm for neural net learning , Eduardo D. Sontag, July 88. (13 pages.) This report contains some remarks about the backpropagation method for neural net learning. We concentrate in particular in the study of local minima of error functions and the growth of weights during learning. [88-03] On the convergence of learning algorithms for Boltzmann machines , H J. Sussmann, July 88. (46 pages.) We analize a learning algorithm for Boltzmann machines, based on the usual alternation between ``learning'' and ``hallucinating'' phases. We prove rigorously that, for suitable choices of the parameters, the evolution of the weights follows very closely, with very high probability, an integral trajectory of the gradient of the likelihood function whose global maxima are exactly the desired weight patterns. An abridged version of this report will appear in the Proceedings of the 27th IEEE Conference on Decision and Control, December 1988. [88-08] Backpropagation can give rise to spurious local minima even for networks without hidden layers , Eduardo D. Sontag and H J. Sussmann, Sept 88. (15 pages.) We give an example of a neural net without hidden layers and with a sigmoid transfer function, and a corresponding training set of binary vectors, for which the sum of the squared errors, regarded as a function of the weights, has a local minimum which is not a global minimum From Roni.Rosenfeld at B.GP.CS.CMU.EDU Mon Sep 19 16:35:55 1988 From: Roni.Rosenfeld at B.GP.CS.CMU.EDU (Roni.Rosenfeld@B.GP.CS.CMU.EDU) Date: Mon, 19 Sep 88 16:35:55 EDT Subject: Notes from your friendly CONNECTIONISTS mailing list maintainer Message-ID: <8588.590704555@RONI.BOLTZ.CS.CMU.EDU> Fellow neurons, Autumn is here, and with it - a new academic year. This means many of you will be changing your e-mail address, which in turns means we will be receiving many error messages for every messages posted to CONNECTIONISTS. To help us deal with the expected mess, we ask that you observe the following: - If your old address is about to be disabled, please notify us promptly so that we may remove it from the list. - Please check that your new or forwarding address works well before you report it to us. If it does not work for us, we will have no way of contacting you and will have to remove you from the list. Thank you for your cooperation. While I'm at it, here's some more: To keep the traffic on the CONNECTIONISTS mailing list to a minimum, we ask that you take special care not to send inappropriate mail to the list. - Requests for addition to the list, change of address and other administrative matters should be sent to: "connectionists-request at cs.cmu.edu" (note the exact spelling: many "connectionists", one "request"). If you mention our mailing list to someone who may apply to be added to it, please make sure they use the above and NOT "connectionists at cs.cmu.edu". - Requests for e-mail addresses of people who are believed to subscribe to CONNECTIONISTS should be sent to postmaster at appropriate-site. 
If the site address is unknown, send your request to "connectionists-request at cs.cmu.edu" and we'll do our best to help. A phone call to the appropriate institution may sometimes be simpler and faster. - Note that in many mail programs a reply to a message is automatically "CC"-ed to all the addresses on the "To" and "CC" lines of the original message. If the mailer you use has this property, please make sure your personal response (request for a Tech Report etc.) is NOT broadcast over the net. Roni Rosenfeld connectionists-request at cs.cmu.edu From laura%suspicion.Princeton.EDU at Princeton.EDU Tue Sep 20 11:44:29 1988 From: laura%suspicion.Princeton.EDU at Princeton.EDU (Laura Hawkins) Date: Tue, 20 Sep 88 11:44:29 EDT Subject: Princeton University Cognitive Studies Talk Message-ID: <8809201544.AA00762@suspicion.Princeton.EDU> TITLE: Connectionist Language Users SPEAKER: Robert Allen, Bell Communications Research DATE: September 26 LOCATION: Princeton University Langfeld Lounge, Green Hall Corner of Washington Road and Williams Street TIME: Noon ABSTRACT: An important property of neural networks is their ability to integrate various sources of information through activation values. By presenting both "verbal" and "perceptual" codes a sequential back-propagation network may be trained to "use language." For instance, networks can answer questions about objects that appear in a perceptual microworld. Moreover, this paradigm handles many problems of reference, such as pronoun anaphora, quite naturally. Thus this approach, which may be termed Connectionist Language Users (CLUs), introduces a computational linguistics that is holistic. Extensions to be discussed include the use of relative clauses, action verbs, grammars, planning in a blocks world, and multi-agent "conversations." From pratt at paul.rutgers.edu Tue Sep 20 14:16:04 1988 From: pratt at paul.rutgers.edu (Lorien Y. Pratt) Date: Tue, 20 Sep 88 14:16:04 EDT Subject: Stephen Hanson to speak on back propagation at Rutgers Message-ID: <8809201816.AA05749@zztop.rutgers.edu> Fall, 1988 Neural Networks Colloquium Series at Rutgers Some comments and variations on back propagation ------------------------------------------------ Stephen Jose Hanson Bellcore Cognitive Science Lab, Princeton University Room 705 Hill center, Busch Campus Friday September 30, 1988 at 11:10 am Refreshments served before the talk Abstract Backpropagation is presently one of the most widely used learning techniques in connectionist modeling. Its popularity, however, is beset with many criticisms and concerns about its use and potential misuse. There are 4 sorts of criticisms that one hears: (1) it is a well known statistical technique (least squares) (2) it is ignorant (3) it is slow--(local minima, its NP complete) (4) it is ad hoc--hidden units as "fairy dust" I believe these four types of criticisms are based on fundamental misunderstandings about the use and relation of learning methods to the world, the relation of ontogeny to phylogeny, the relation of simple neural models to neuroscience and the nature of "weak" learning theories. I will discuss these issues in the context of some variations on backpropagation. 
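As a point of reference for criticism (1) in the abstract above, here is a minimal, self-contained Python sketch of plain backpropagation as gradient descent on a sum-of-squared-errors cost. The one-hidden-layer architecture, the XOR training set, the learning rate, and the epoch count are arbitrary illustrative choices, not anything taken from the talk:

----------------------------------------------------------------------
# Plain backpropagation: gradient descent on E = 0.5*(y - t)^2 for a
# network with one hidden layer of logistic units and a logistic output.
import math, random

random.seed(0)
sig = lambda z: 1.0 / (1.0 + math.exp(-z))

X = [[0, 0], [0, 1], [1, 0], [1, 1]]      # XOR, a familiar toy problem
T = [0, 1, 1, 0]

H = 3                                     # hidden units
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0
lr = 0.5

for epoch in range(20000):
    for x, t in zip(X, T):
        # forward pass
        h = [sig(sum(W1[j][i] * x[i] for i in range(2)) + b1[j]) for j in range(H)]
        y = sig(sum(W2[j] * h[j] for j in range(H)) + b2)
        # backward pass: squared-error derivative through the logistic units
        dy = (y - t) * y * (1 - y)
        dh = [dy * W2[j] * h[j] * (1 - h[j]) for j in range(H)]
        # gradient-descent updates (with an unlucky initialization this can
        # settle in a poor local minimum -- criticism (3) above)
        for j in range(H):
            W2[j] -= lr * dy * h[j]
            for i in range(2):
                W1[j][i] -= lr * dh[j] * x[i]
            b1[j] -= lr * dh[j]
        b2 -= lr * dy

for x, t in zip(X, T):
    h = [sig(sum(W1[j][i] * x[i] for i in range(2)) + b1[j]) for j in range(H)]
    y = sig(sum(W2[j] * h[j] for j in range(H)) + b2)
    print(x, "target", t, "output %.3f" % y)
----------------------------------------------------------------------

Seen this way, the procedure is iterative least-squares fitting of a nonlinear model, which is the reading behind criticism (1).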
From Alex.Waibel at SPEECH2.CS.CMU.EDU Thu Sep 22 13:06:41 1988 From: Alex.Waibel at SPEECH2.CS.CMU.EDU (Alex.Waibel@SPEECH2.CS.CMU.EDU) Date: Thu, 22 Sep 88 13:06:41 EDT Subject: Scaling in Neural Nets Message-ID: Below the abstract to a paper describing our recent research addressing the problem of scaling in neural networks for speech recognition. We show that by exploiting the hidden structure (previously learned abstractions) of speech in a modular way and applying "conectionist glue", larger more complex networks can be constructed at only small additional cost in learning time and complexity. Resulting recognition performance is as good or better than comparable monolithically trained nets and as good as the smaller network modules. This work was performed at ATR Interpreting Telephony Research Laboratories, in Japan. I am now working at Carnegie Mellon University, so you may request copies from me here or directly from Japan. >From CMU: Dr. Alex Waibel Computer Science Department Carnegie-Mellon University Pittsburgh, PA 15213 phone: (412) 268-7676 email: ahw at speech2.cs.cmu.edu >From Japan, please write for technical report TR-I-0034 (with CC to me), to: Ms. Kazumi Kanazawa ATR Interpreting Telephony Research Laboratories Twin 21 MID Tower, 2-1-61 Shiromi, Higashi-ku, Osaka, 540, Japan email: kddlab!atr-la.atr.junet!kanazawa at uunet.UU.NET Please CC to: ahw at speech2.cs.cmu.edu ------------------------------------------------------------------------- Modularity and Scaling in Large Phonemic Neural Networks Alex Waibel, Hidefumi Sawai, Kiyohiro Shikano ATR Interpreting Telephony Research Laboratories ABSTRACT Scaling connectionist models to larger connectionist systems is difficult, because larger networks require increasing amounts of training time and data and the complexity of the optimization task quickly reaches computationally unmanageable proportions. In this paper, we train several small Time-Delay Neural Networks aimed at all phonemic subcategories (nasals, fricatives, etc.) and report excellent fine phonemic discrimination performance for all cases. Exploiting the hidden structure of these smaller phonemic subcategory networks, we then propose several techniques that allow us to "grow" larger nets in an incremental and modular fashion without loss in recognition performance and without the need for excessive training time or additional data. These techniques include {\em class discriminatory learning, connectionist glue, selective/partial learning and all-net fine tuning}. A set of experiments shows that stop consonant networks (BDGPTK) constructed from subcomponent BDG- and PTK-nets achieved up to 98.6% correct recognition compared to 98.3% and 98.7% correct for the component BDG- and PTK-nets. Similarly, an incrementally trained network aimed at {\em all} consonants achieved recognition scores of 95.9% correct. These result were found to be comparable to the performance of the subcomponent networks and significantly better than several alternative speech recognition strategies. From jordan at psyche.mit.edu Thu Sep 22 15:40:05 1988 From: jordan at psyche.mit.edu (Michael Jordan) Date: Thu, 22 Sep 88 15:40:05 edt Subject: Scaling in Neural Nets Message-ID: <8809221941.AA05878@ATHENA.MIT.EDU> Would you please send me a copy? Michael I. 
Jordan E10-034C MIT Cambridge, MA 02139 From kawahara at av-convex.ntt.jp Thu Sep 22 12:41:53 1988 From: kawahara at av-convex.ntt.jp (Hideki KAWAHARA) Date: Fri, 23 Sep 88 01:41:53+0900 Subject: A beautiful subset of SPAN, Radial Basis Function. Message-ID: <8809221641.AA07190@av-convex.NTT.jp> Things are changing very rapidly. I visited the ATR labs. yesterday before. I have inspiring discussion with Funahashi, Irie and the other ATR researchers on the SPAN concepts. Funahashi finally gave me a copy of a paper on Radial Basis Function by Broomhead of the RSRE. I read it in the super express SHINKANSEN from OSAKA to TOKYO. For my surprise, the RBF concept was a beautiful and useful subset of the SPAN. You can take benefit of important conceptual framework in neural network design by reading RBF paper. It is written in English. :-) I'm certain now that the NIPS conference will be remembered as a turning point from BP to the next generation algorithms, because I'm sure the presentation by Bridle will be on the RBF method. I'm also trying to attend the NIPS conference. Reference: D.S. Broomhead and D.Lowe: "Radial Basis Functions, Multi-Variable Functional Interpolation and Adaptive Networks," RSRE Memorandum No.4148, Royal Signals & Radar Establishment, (RSRE Malvern, WORCS.) ---- Hideki Kawahara PS: Copy of my report is free of charge. Previous mail is somewhat mis-leading. PPS: The English version of our report on SPAN will be available within a month. PPPS: The IECE office won't deliver their technical reports to the foreign countries. However, If you really need some of them, I think I can provide some assistance. ---------------------------------------------- e-mail: kawahara%nttlab.ntt.jp at RELAY.CS.NET Hideki Kawahara NTT Basic Research Labs. 3-9-11 Midori-cho Musashino, TOKYO 180, JAPAN ---------------------------------------------- From netlist at psych.Stanford.EDU Fri Sep 23 09:19:12 1988 From: netlist at psych.Stanford.EDU (Mark Gluck) Date: Fri, 23 Sep 88 06:19:12 PDT Subject: Stanford Adaptive Networks Colloquium Message-ID: Stanford University Interdisciplinary Colloquium Series: Adaptive Networks and their Applications Oct. 4th (Tuesday, 3:15pm) ************************************************************************** Connectionist Prediction Systems: Relationship to Least-Squares Estimation and Dynamic Programming RICHARD S. SUTTON GTE Laboratories Incorporated 40 Sylvan Road Waltham, MA 02254 ************************************************************************** - Abstract - In this talk I will present two examples of productive interplay between connectionist machine learning and more traditional engineering areas. The first concerns the problem of learning to predict time series. I will briefly review previous approaches including least squares linear estimation and the newer nonlinear backpropagation methods, and then present a new class of methods called Temporal-Difference (TD) methods. Whereas previous methods are driven by the error or difference between predictions and actual outcomes, TD methods are similarly driven the difference between temporally successive predictions. This idea is also the key idea behind the learning in Samuel's checker player, in Holland's bucket brigade, and in Barto, Sutton & Anderson's pole-balancer. TD methods can be more efficient computationally because their errors are available immediately after the predictions are made, without waiting for a final outcome. 
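To illustrate the sentence above, here is a small Python sketch of a TD(0) predictor on a bounded random walk (the five-state walk, the 50/50 step probabilities, the step-size, and the episode count are illustrative choices, not material from the talk):

----------------------------------------------------------------------
# TD(0) prediction on a bounded random walk: states 0..4, start in the
# middle, move left or right at random, outcome 0 at the left edge and 1
# at the right edge.  V[s] predicts the final outcome; each nonterminal
# update is driven by the difference between temporally successive
# predictions rather than by the final outcome itself.
import random

random.seed(0)
N_STATES = 5
alpha = 0.05
V = [0.5] * N_STATES                     # initial predictions

for episode in range(5000):
    s = N_STATES // 2                    # start in the middle state
    while True:
        s_next = s + random.choice((-1, 1))
        if s_next < 0:                   # left termination: outcome 0
            V[s] += alpha * (0.0 - V[s])
            break
        if s_next >= N_STATES:           # right termination: outcome 1
            V[s] += alpha * (1.0 - V[s])
            break
        V[s] += alpha * (V[s_next] - V[s])   # successive-prediction difference
        s = s_next

print("learned predictions:", ["%.2f" % v for v in V])
print("true probabilities :", ["%.2f" % ((i + 1) / 6.0) for i in range(N_STATES)])
----------------------------------------------------------------------

Each update can be applied as soon as the next prediction is available, which is the sense in which the errors are available immediately.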
More surprisingly, they can also be more efficient in terms of how much data is needed to achieve a particular level of accuracy. Formal results will be presented concerning the computational complexity, convergence, and optimality of TD methods. Possible areas of application of TD methods include temporal pattern recognition such as speech recognition and weather forecasting, the learning of heuristic evaluation functions, and learning control. Second, I would like to present work on the theory of TD methods used in conjunction with reinforcement learning techniques to solve control problems. ************************************************************************** Location: Room 380-380W, which can be reached through the lower level between the Psychology and Mathematical Sciences buildings. Technical Level: These talks will be technically oriented and are intended for persons actively working in related areas. They are not intended for the newcomer seeking general introductory material. Information: To be added to the network mailing list, netmail to netlist at psych.stanford.edu For additional information, contact Mark Gluck (gluck at psych.stanford.edu). Upcomming talks: Nov. 22: Mike Jordan (MIT) Dec. 6: Ralph Linsker (IBM) * * * Co-Sponsored by: Departments of Electrical Engineering (B. Widrow) and Psychology (D. Rumelhart, M. Pavel, M. Gluck), Stanford Univ. From kawahara at av-convex.ntt.jp Sun Sep 25 01:29:21 1988 From: kawahara at av-convex.ntt.jp (Hideki KAWAHARA) Date: Sun, 25 Sep 88 14:29:21+0900 Subject: RSRE address Message-ID: <8809250529.AA17939@av-convex.NTT.jp> Several readers requested RSRE address mentioned in my previous mail. Followings are all what I know. Royal Signals & Rader Establishment St Andrews Road Great Malvern Worcestreshire WR14 3PS, UK D.S.Broomhead e-mail from USA: dsb%rsre.mod.uk at relay.mod.uk e-mail from UK : dsb%rsre.mod.uk at uk.ac.ucl.cs.nss David Lowe e-mail from USA: dl%rsre.mod.uk at relay.mod.uk e-mail from UK : dl%rsre.mod.uk at uk.ac.ucl.cs.nss Please contact them to get the paper on RBF. Hideki Kawahara PS: I can't access e-mail for next several days. Excuse me for delay in reply. ------------------------------------------------------ e-mail: kawahara%nttlab.ntt.jp at RELAY.CS.NET (from ARPA) s-mail: Hideki Kawahara Information Science Research Laboratory NTT Basic Research Laboratories 3-9-11 Midori-cho Musashino-shi, TOKYO 180, JAPAN ------------------------------------------------------ From niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Mon Sep 26 00:03:02 1988 From: niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (M. Niranjan) Date: Mon, 26 Sep 88 00:03:02 BST Subject: radial basis functions Message-ID: <1483.8809252303@dsl.eng.cam.ac.uk> Re: Hideki KAWAHARA's recent postings on Radial basis functions Radial basis functions as pattern classifiers is a kind of Kernel discriminant analysis ("Kernel Discriminant Analysis" by Hand, Research Studies Press, 1982). In KDA, a class conditional probability density function is estimated as a weighted sum of kernel functions centred on the training examples (and then Bayes' type classification); in RBF, the discriminant function itself is calculated as a weighted sum of kernel functions. In this sense, RBF is superior to KDA, I think. It forms class boundaries by segments of hyper-spheres (rather than hyper-planes for a BP type network). Something very similar to RBF is the method of potential functions. 
This works something like placing weighted electric charges on every training example and the equi-potential lines act as class boundaries. I think the green book by Duda and Hart mention this, but the original reference is, Aizerman, M.A., Braverman, E.M. \& Rozonoer, L.I. (1964): ``On the method of potential functions''; Avtomatika i Telemekhanika, {\bf Vol. 26, No. 11}, 2086-2088. (This is in Russian, but there is a one-to-one translation in most electrical engineering libraries) Also, if you make a network with one hidden layer of 'spherical graded units' (Hanson and Burr, "Knowledge representation in connectionist networks"), and a simple perceptron as output unit (plus some simplifying assumptions), you can derive the RBF method!! niranjan From russ at baklava.mitre.org Mon Sep 26 08:42:11 1988 From: russ at baklava.mitre.org (Russell Leighton) Date: Mon, 26 Sep 88 08:42:11 EDT Subject: Psychnet In-Reply-To: Psychology Newsletter and Bulletin Board's message of Sun, 25 Sep 88 <8809260843.AA26526@mitre.arpa> Message-ID: <8809261242.AA01985@baklava.mitre.org.> Please include me on your distribution list. Please use Thanks, Russ. ARPA: russ at mitre.arpa Russell Leighton M.S. Z406 MITRE Signal Processing Lab 7525 Colshire Dr. McLean, Va. 22102 USA From jose at tractatus.bellcore.com Tue Sep 27 09:37:55 1988 From: jose at tractatus.bellcore.com (Stephen J Hanson) Date: Tue, 27 Sep 88 09:37:55 EDT Subject: rbfs Message-ID: <8809271337.AA15038@tractatus.bellcore.com> >Re: Hideki KAWAHARA's recent postings on Radial basis functions >Also, if you make a network with one hidden layer of 'spherical graded >units' (Hanson and Burr, "Knowledge representation in connectionist >networks"), and a simple perceptron as output unit (plus some simplifying >assumptions), you can derive the RBF method!! >>niranjan It's also worth noting that any sort of generalized dichotomy (discriminant) can be naturally embedded in Back-prop nets--in terms of polynomial boundaries (also suggested in Hanson & Burr) or any sort of generalized volume or edge one would like (sigma-pi for example are simple rectangular volumes). I believe that this sort of variation has a relation to synaptic-dendritic interactions which one might imagine could be considerably more complex than linear. However, I suspect there is a tradeoff in terms neuron complexity and learning generality as one increases the complexity of the discriminant or predicate that one is using-- consequently as componential network power increases the advantage of network computation may decrease. (as usual "generalized discriminants" was suggested previously in statistical and pattern recognition literature-- Duda and Hart, pp. 134-138. and also see Tou & Gonzalez, Pattern Recognition Principles, Addison-Wesley, 1974, pp. 48-52-- Btw--I don't think the fact that many sorts of statistical methods seem to "pop out" of neural network approaches also means that neural network framework is somehow derivative--remember that many of the statistical models and methods are ad hoc and explicitly rely on "normative" sorts of assumptions which may provide the only connection to some other sort of statistical method. In fact, i think it is rather remarkable that such simple sorts of "neural like" assumptions can lead to families of such powerful sorts of general methods.) 
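Since the last few messages describe the RBF idea only in words, here is a minimal illustrative Python sketch of a radial-basis-function discriminant: Gaussian kernels centred on a subset of the training examples, with only the linear output weights trained (here by a simple LMS/delta rule). The two-class "inside vs. outside a circle" data, the kernel width, the number of centres, and the learning rate are all arbitrary choices for the sketch:

----------------------------------------------------------------------
# RBF discriminant: the output is a weighted sum of Gaussian kernels
# centred on stored examples; only the output weights are adapted.
import math, random

random.seed(0)

def make_point():
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    return (x, y), (1 if x * x + y * y < 0.5 else 0)   # class C: inside the circle

train = [make_point() for _ in range(100)]
centers = [p for p, _ in train[:30]]      # kernels sit on a subset of the examples
width = 0.35

def phi(p, c):                            # radial basis function
    d2 = (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
    return math.exp(-d2 / (2 * width * width))

w = [0.0] * len(centers)
b = 0.0
lr = 0.05
for epoch in range(300):                  # LMS on the output weights only
    for p, label in train:
        feats = [phi(p, c) for c in centers]
        out = sum(wi * f for wi, f in zip(w, feats)) + b
        err = label - out
        w = [wi + lr * err * f for wi, f in zip(w, feats)]
        b += lr * err

test = [make_point() for _ in range(200)]
correct = sum(1 for p, label in test
              if (sum(wi * phi(p, c) for wi, c in zip(w, centers)) + b > 0.5)
              == (label == 1))
print("test accuracy: %.2f" % (correct / len(test)))
----------------------------------------------------------------------

The resulting decision boundary is built from pieces of the kernels' spherical level sets rather than from hyperplanes, which is the contrast drawn earlier in the thread; swapping in other kernels or volumes gives the kind of generalized discriminants discussed in the message above.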
Stephen Hanson From schmidhu at tumult.informatik.tu-muenchen.de Tue Sep 27 07:26:30 1988 From: schmidhu at tumult.informatik.tu-muenchen.de (Juergen Schmidhuber) Date: Tue, 27 Sep 88 10:26:30 -0100 Subject: Abstract available Message-ID: <8809270926.AA19521@tumult.informatik.tu-muenchen.de> This is the abstract of an extended abstract of the description of some ongoing work that will be presented at the conference `Connectionism in Perspective' in Zurich. THE NEURAL BUCKET BRIGADE Juergen Schmidhuber For several reasons standard back-propagation (BP) in recurrent networks does not make too much sense in typical non-stationary environments. We identify the main problem of BP in not being `really local', meaning that BP is not what we call `local in time'. Doing some constructive criticism we introduce a learning method for neural networks that is `really local' and still allows credit-assignment for states that are `hidden in time'. ------------------- For those who are interested in the extended abstract there are copies available. (There also will be a more detailed and more formal treatment in the proceedings of CiP.) Include a physical address in your reply. Juergen From Roni.Rosenfeld at B.GP.CS.CMU.EDU Wed Sep 28 19:32:03 1988 From: Roni.Rosenfeld at B.GP.CS.CMU.EDU (Roni.Rosenfeld@B.GP.CS.CMU.EDU) Date: Wed, 28 Sep 88 19:32:03 EDT Subject: MIRRORS/II: Connectionist simulation software Message-ID: <637.591492723@RONI.BOLTZ.CS.CMU.EDU> The following is being posted on behalf of James Reggia. (please Please PLEASE do not reply to me or to "connectionists") Roni Rosenfeld connectionists-request at cs.cmu.edu ------- Forwarded Message MIRRORS/II Connectionist Simulator Available MIRRORS/II is a general-purpose connectionist simulator which can be used to implement a broad spectrum of connec- tionist (neural network) models. MIRRORS/II is dis- tinguished by its support of an extensible high-level non- procedural language, an indexed library of networks, spread- ing activation methods, learning methods, event parsers and handlers, and a generalized event-handling mechanism. The MIRRORS/II language allows relatively inexperienced computer users to express the structure of a network that they would like to study and the parameters which will con- trol their particular connectionist model simulation. Users can select an existing spreading activation/learning method and other system components from the library to complete their connectionist model; no programming is required. On the other hand, more advanced users with programming skills who are interested in research involving new methods for spreading activation or learning can still derive major benefits from using MIRRORS/II. The advanced user need only write functions for the desired procedural components (e.g., spreading activation method, control strategy, etc.). Based on language primitives specified by the user MIRRORS/II will incorporate the user-written components into the connection- ist model; no changes to the MIRRORS/II system itself are required. Connectionist models developed using MIRRORS/II are not limited to a particular processing paradigm. Spreading activation methods, and Hebbian learning, competitive learn- ing, and error back-propogation are among the resources found in the MIRRORS/II library. MIRRORS/II provides both synchronous and asynchronous control strategies that deter- mine which nodes should have their activation values updated during an iteration. 
Users can also provide their own con- trol strategies and have control over a simulation through the generalized event handling mechanism. Simulations produced by MIRRORS/II have an event- handling mechanism which provides a general framework for scheduling certain actions to occur during a simulation. MIRRORS/II supports system-defined events (constant/cyclic input, constant/cyclic output, clamp, learn, display and show) and user-defined events. An event command (e.g., the input-command) indicates which event is to occur, when it is to occur, and which part of the network it is to affect. Simultaneously occurring events are prioritized according to user specification. At run time, the appropriate event handler performs the desired action for the currently- occurring event. User-defined events can redefine the work- ings of system-defined events or can create new events needed for a particular application. MIRRORS/II is implemented in Franz Lisp and will run under Opuses 38, 42, and 43 of Franz Lisp on UNIX systems. It is currently running on a MicroVAX, VAX and SUN 3. If you are interested in obtaining more detailed information about the MIRRORS/II system see D'Autrechy, C. L. et al., 1988, "A General-Purpose Simulation Environment for Develop- ing Connectionist Models," Simulation, 51, 5-19. The MIRRORS/II software and reference manual are available for no charge via tape or ftp. If you are interested in obtain- ing a copy of the software send e-mail to mirrors at mimsy.umd.edu or ...!uunet!mimsy!mirrors or send mail to Lynne D'Autrechy University of Maryland Department of Computer Science College Park, MD 20742 ------- End of Forwarded Message From niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Thu Sep 29 11:48:11 1988 From: niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (M. Niranjan) Date: Thu, 29 Sep 88 11:48:11 BST Subject: RBFs Message-ID: <2577.8809291048@dsl.eng.cam.ac.uk> David Lowe (of RSRE) says their work on Radial basis functions is published in, Complex systems Vol 2 No 3, pp269-303, 1988. niranjan From jam%bu-cs.BU.EDU at bu-it.bu.edu Thu Sep 29 13:30:09 1988 From: jam%bu-cs.BU.EDU at bu-it.bu.edu (jam%bu-cs.BU.EDU@bu-it.bu.edu) Date: Thu, 29 Sep 88 13:30:09 EDT Subject: Neural networks & visual motion perception Message-ID: <8809291730.AA16018@bucse.bu.edu> The following material is available as Boston University Computer Science Department Tech Report #88-010. It may be obtained from pam at bu-cs.bu.edu or by writing to Pam Pletz, Computer Science Dept., Boston Univ., 111 Cummington St., Boston, MA 02215, U.S.A. It is 100 pages long, and the price is $7.00. ----------------------------------------------------------------------- SELF-ORGANIZING NEURAL NETWORKS FOR PERCEPTION OF VISUAL MOTION Jonathan A. Marshall ABSTRACT The human visual system overcomes ambiguities, collectively known as the aperture problem, in its local measurements of the direction in which visual objects are moving, producing unambiguous percepts of motion. A new approach to the aperture problem is presented, using an adaptive neural network model. The neural network is exposed to moving images during a developmental period and develops its own structure by adapting to statistical characteristics of its visual input history. Competitive learning rules ensure that only connection ``chains'' between cells of similar direction and velocity sensitivity along successive spatial positions survive. 
The resultant self-organized configuration implements the type of disambiguation necessary for solving the aperture problem and operates in accord with direction judgments of human experimental subjects. The system not only accommodates its structure to long-term statistics of visual motion, but also simultaneously uses its acquired structure to assimilate, disambiguate, and represent visual motion events in real-time. ------------------------------------------------------------------------ I am now at the Center for Research in Learning, Perception, and Cognition, 205 Elliott Hall, University of Minnesota, Minneapolis, MN 55414. I can still be reached via my account jam at bu-cs.bu.edu . --J.A.M. From kawahara at av-convex.ntt.jp Thu Sep 29 09:47:53 1988 From: kawahara at av-convex.ntt.jp (Hideki KAWAHARA) Date: Thu, 29 Sep 88 22:47:53+0900 Subject: Neural network capabilities and alternatives to BP Message-ID: <8809291347.AA07886@av-convex.NTT.jp> Dear colleagues: First of all, I have to apologize that my previous mails have somewhat rude tone and un-intended negative effects. I would like to correct them by making my points clear and will try to supply usable and traceable information. I suggested too many things with too few evidences. What I want to point out are as follows. (1) Neural network capabilities and learning algorithms are different problems. Separating these problems will clarify their characteristics better. (2) Theoretically, feed-forward networks with one hidden layer can approximate any arbitrary continuous mapping from n dimensional hypercube to m dimensional space. However, networks designed according to procedures suggested by the theory (like Irie-Miyake) will suffer from so-called "combinatorial explosion" problems, because complexity of the network is proportional to the degrees of freedom of the input space. Irie-Miyake proof is based on multi-dimensional Fourier transform. An interesting demonstration of neural network capabilities can be implemented using CT(Computerized Tomography) procedures. (Irie once said that his inspiration came from his knowledge on CT.) (3) In pattern processing applications, there is a useful class of neural network architectures including RBF. They are not likely to suffer from "combinatorial explosion" problems, because the network complexity in this case is mainly bounded by the number of clusters in input space. In other words, the degrees of freedom is usually proportional to the number of clusters. (Thank you for providing useful information on RBF and PGU. Hanson's article and Niranjan's article supplied additional information.) (4) There are simple transformations for converting feed-forward networks to the networks which are members of a class mentioned in (3). PGU introduced by Hanson and Burr is one of such extensions. However, there are at least two cases where linear graded units can form Radial Basis Functions. Case(1): If input vectors are distributed only on a surface of a hypersphere, output of a linear graded unit will be a RBF. Case(2): If input vectors are auto-correlation coefficients of input signals, and if weight vectors of a linear graded unit is calculated from the maximum likelihood spectral parameters of a reference spectrum, output of a linear graded unit also will be a RBF. (5) These transformations and various neural network learning algorithms can be combined to work together. For example, self-organizing feature map can be utilized for preparing reference points of RBF. 
A BP-based procedure can be used for fine tuning. (6) The procedures in (3) and (4) suggest a prototype-based perception model, because the hidden units in this case correspond to reference vectors in the input space. This is a local representation. Even if we choose an RBF with a broader radius, it resembles coarse coding at best. This contrasts somewhat with our experience using BP, where distributed representations usually emerge as internal representations. This is an interesting point to discuss. (7) My point of view: I agree with Hanson's view that neural networks are not mere derivatives of statistical methods. I believe that neural networks are fruitful sources of important algorithms which have not yet been discovered. This doesn't imply that neural networks simply implement those algorithms. It implies that we can extract those algorithms if we carefully investigate their functions using appropriate formalisms and abstractions. I hope this mail clarifies my points, contributes to increasing our knowledge of neural network characteristics, and stimulates productive discussion. Hideki Kawahara NTT Basic Research Laboratories. Reference: Itakura, F.: "Minimum Prediction Residual Principle Applied to Speech Recognition," IEEE Trans., ASSP-23, pp. 67-72, Feb. 1975. (This is the original paper. The Itakura measure may be found in many textbooks on speech processing.) From pratt at paul.rutgers.edu Fri Sep 30 16:53:45 1988 From: pratt at paul.rutgers.edu (Lorien Y. Pratt) Date: Fri, 30 Sep 88 16:53:45 EDT Subject: Hector Sussmann to speak on formal analysis of Boltzmann Machine Learning Message-ID: <8809302053.AA03471@zztop.rutgers.edu> Fall, 1988 Neural Networks Colloquium Series at Rutgers On the theory of Boltzmann Machine Learning ------------------------------------------- Hector Sussmann Rutgers University Mathematics Department Room 705, Hill Center, Busch Campus Friday October 14, 1988 at 11:10 am Refreshments served before the talk Abstract The Boltzmann machine is an algorithm for learning in neural networks, involving alternation between a ``learning'' and ``hallucinating'' phase. In this talk, I will present a Boltzmann machine algorithm for which it can be proven that, for suitable choices of the parameters, the weights converge so that the Boltzmann machine correctly classifies all training data. This is because the evolution of the weights follows very closely, with very high probability, an integral trajectory of the gradient of the likelihood function whose global maxima are exactly the desired weight patterns.
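The abstract above is about the weights tracking an integral trajectory of the gradient of the likelihood. For a network small enough to enumerate, that gradient can be computed exactly and followed directly. The sketch below does so for a fully visible Boltzmann machine (no hidden units, hence no "hallucinating" phase is needed); it is a simplified illustration of the quantity Sussmann's analysis concerns, not a reconstruction of his algorithm, and the four training patterns are invented.

-----------------------------------------------------------------------
import itertools
import numpy as np

def model_stats(W, b):
    # Exact <s_i s_j> and <s_i> under p(s) ~ exp(0.5 s^T W s + b.s),
    # s in {-1,+1}^n, computed by brute-force enumeration (n is tiny).
    n = len(b)
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
    score = 0.5 * np.einsum('ki,ij,kj->k', states, W, states) + states @ b
    p = np.exp(score - score.max())
    p /= p.sum()
    return np.einsum('k,ki,kj->ij', p, states, states), p @ states

# Invented training patterns over four +/-1 units.
data = np.array([[ 1,  1, -1, -1],
                 [ 1, -1,  1, -1],
                 [-1,  1,  1,  1],
                 [-1, -1, -1, -1]], dtype=float)
data_ss = data.T @ data / len(data)        # clamped-phase statistics <s_i s_j>
data_s = data.mean(axis=0)

n = data.shape[1]
W, b, eta = np.zeros((n, n)), np.zeros(n), 0.1
for _ in range(2000):
    model_ss, model_s = model_stats(W, b)
    gW = data_ss - model_ss                # gradient of the mean log-likelihood
    np.fill_diagonal(gW, 0.0)              # no self-connections
    W += eta * gW                          # follow the (exact) gradient
    b += eta * (data_s - model_s)

model_ss, model_s = model_stats(W, b)
print(abs(model_ss - data_ss).max(), abs(model_s - data_s).max())
# Both gaps shrink toward zero: the weights have moved along the likelihood
# gradient to a maximum where the model statistics match the training data.
-----------------------------------------------------------------------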
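Returning to Kawahara's Case (1) two messages back: the claim that a linear graded unit acts as a radial basis function when its inputs lie on a hypersphere follows from ||x - w||^2 = ||x||^2 - 2 w.x + ||w||^2; with ||x|| fixed, the net input w.x is a decreasing function of the distance between the input and the weight vector. A quick numerical check, with random unit-norm vectors standing in for real data:

-----------------------------------------------------------------------
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 8
w = rng.normal(size=d)                            # weights of one linear graded unit
x = rng.normal(size=(1000, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)     # put the inputs on the unit hypersphere

net_input = x @ w
dist = np.linalg.norm(x - w, axis=1)

# On the sphere, w.x = (||x||^2 + ||w||^2 - ||x - w||^2) / 2, so the net input
# (and hence the unit's squashed output) depends on the input only through its
# distance to w -- the behaviour of a (broad) radial basis function.
assert np.allclose(net_input, (1.0 + w @ w - dist ** 2) / 2.0)

out = sigmoid(net_input)
order = np.argsort(dist)
assert np.all(np.diff(out[order]) <= 1e-12)       # output falls as distance grows
-----------------------------------------------------------------------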
From steve at cogito.mit.edu Fri Sep 2 12:13:20 1988 From: steve at cogito.mit.edu (Steve Pinker) Date: Fri, 2 Sep 88 12:13:20 edt Subject: VERY brief note on Steven Harnad's reply to answers Message-ID: <8809021614.AA09711@ATHENA.MIT.EDU> In his reply to our answers to his questions, Harnad writes that: -Looking at the actual behavior and empirical fidelity of connectionist models is not the right way to test connectionist hypotheses; -Developmental, neural, reaction time, and brain-damage data should be put aside in evaluating psychological theories. -The meaning of the word "learning" should be stipulated to apply only to extracting statistical regularities from input data. -Induction has philosophical priority over innatism. We don't have much to say here (thank God, you are probably all thinking). We disagree sharply with the first two claims, and have no interest whatsoever in discussing the last two. Alan Prince Steven Pinker From FROSTROMS%CPVB.SAINET.MFENET at NMFECC.ARPA Thu Sep 1 17:05:02 1988 From: FROSTROMS%CPVB.SAINET.MFENET at NMFECC.ARPA (FROSTROMS%CPVB.SAINET.MFENET@NMFECC.ARPA) Date: Thu, 1 Sep 88 14:05:02 PDT Subject: A Harder Learning Problem Message-ID: <880901140502.20200215@NMFECC.ARPA> This is a (delayed) response to Alexis P. Wieland's posting of Fri Aug 5 on the spiral problem: _A Harder Learning Problem_ : > One of the tasks that we've been using at MITRE to test and compare our > learning algorithms is to distinguish between two intertwined spirals. > This task uses a net with 2 inputs and 1 output. The inputs correspond > to points, and the net should output a 1 on one spiral and > a 0 on the other. Each of the spirals contains 3 full revolutions. > This task has some nice features: it's very non-linear, it's relatively > difficult (our spiffed up learning algorithm requires ~15-20 million > presentations = ~150-200 thousand epochs = ~1-2 days of cpu on a (loaded) > Sun4/280 to learn, ... we've never succeeded at getting vanilla bp to > correctly converge), and because you have 2 in and 1 out you can *PLOT* > the current transfer function of the entire network as it learns. > > I'd be interested in seeing other people try this or a related problem. Here at SAIC, Dennis Walker obtained the following results: "I tried the spiral problem using the standard Back Propagation model in ANSim (Artificial Neural System Simulation Environment) and found that neither spiffed-up learning algorithms nor tricky learning rate adjustments are necessary to find a solution to this difficult problem. Our network had two hidden layers -- a 2-20-10-1 structure for a total of 281 weights. No intra-layer connections were necessary. The learning rates for all 3 layers were set to 0.1 with the momentum set to 0.7. Batching was used for weight updating. Also, an error tolerance of 0.15 was used: as long as the output was within 0.15 of the target no error was assigned. It took ANSim 13,940 cycles (passes through the data) to get the outputs within 0.3 of the targets. (In ANSim, the activations range from -0.5 to 0.5 instead of the usual 0 to 1 range.) Using the SAIC Delta Floating Point Processor with ANSim, this took less than 27 minutes to train (~0.114 seconds/pass). I also tried reducing the network size to 2-16-8-1 and again was able to train the network successfully, but it took an unbelievable 300K cycles! This is definitely a tough problem." Stephen A. Frostrom Science Applications International Corporation 10260 Campus Point Drive San Diego, CA 92121 (619) 546-6404 frostroms at SAIC-CPVB.arpa
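For anyone who wants to try the task Wieland and Walker describe, the snippet below builds a two-spirals training set with two inputs, one output, and three full revolutions per spiral. The radius and angle schedule and the 97-points-per-spiral count are one common parameterization of this benchmark, not necessarily the exact MITRE data.

-----------------------------------------------------------------------
import numpy as np

def two_spirals(points_per_spiral=97, turns=3.0, max_radius=6.5):
    # Two interlocking spirals in the plane; spiral 1 is spiral 0 rotated
    # by 180 degrees.  Parameter values are illustrative.
    i = np.arange(points_per_spiral)
    theta = i * (turns * 2 * np.pi) / points_per_spiral
    r = max_radius * (points_per_spiral - i) / points_per_spiral
    x0 = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
    X = np.vstack([x0, -x0])
    y = np.concatenate([np.zeros(points_per_spiral), np.ones(points_per_spiral)])
    return X, y

X, y = two_spirals()
print(X.shape, y.mean())    # (194, 2) 0.5
-----------------------------------------------------------------------

Feeding these points to a back-propagation net along the lines of Walker's 2-20-10-1 configuration (learning rate 0.1, momentum 0.7, batched updates) should reproduce the flavor of the experiments reported above.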
From steve at cogito.mit.edu Thu Sep 1 13:06:38 1988 From: steve at cogito.mit.edu (Steve Pinker) Date: Thu, 1 Sep 88 13:06:38 edt Subject: Input to Past Tense Net Message-ID: <8809011707.AA25787@ATHENA.MIT.EDU> Dear Jay, We of course agree with you completely that there's a lot of work to be done in exploring both the properties of nets and the relevant empirical data. On the input/output of nets for the past tense: We agree that some of the problems with RM'86 can be attributed to its using distributed phonological representations of the stem as input. We also agree that by using a different kind of input some of those problems would be diminished. But models that "take as input a distributed representation of the intended meaning, and generate as output a description of the phonological properties of the utterance that expresses the meaning" are on the wrong track. As we showed in OLC (pp. 110-114), the crucial aspects of the input are not its semantic properties, but whether the root of its lexical entry is marked as 'irregular', which in turn often depends on the grammatical category of the root. Two words with different roots will usually have different meanings, but the difference is epiphenomenal -- there's no *systematic*, generalization-supporting pattern between verb semantics and regularity. As we noted, there are words with high semantic similarity and different past tense forms ('hit/hit', 'strike/struck', 'slap/slapped') and words with low semantic similarity and the same past tense forms ('come=arrive/came'; 'come=have an organism/came', 'become/became', 'overcome/overcame', 'come to one's senses/came to one's senses', etc.). On flying-out: We're not sure what the Elman anecdote is supposed to imply. The phenomena are quite clear: a word of Category X that is transparently derived from a word of Category Y is regular with respect to inflectional rules applying to X. That is why the vast majority of the time one hears 'flied out', not 'flew out' ('flew out' is a vanishingly rare anecdote worthy of an e-mail message; 'flied-out' usages would over-run mboxes if we bothered to publicly document every instance). That's also why all the other examples of unambiguous cross-category conversion one can think of are regular (see OLC p. 111). That's also why you can add to this list of regulars that are homophonous with an irregular indefinitely (e.g. 'Mary out-Sally-Rided/*out-Sally-Rode Sally Ride'; 'Alcatraz out-Sing-Singed/*out-sang-sang Sing Sing', etc.). And that's why you find the phenomenon in different categories ('Toronto Maple Leafs') and in other languages. In other words we have an absolutely overwhelming empirical tendency toward overregularizing cross-categorially derived verbs and an extremely simple and elegant explanation for it (OLC 111-112). If one is also interested in accounting for one-shot violations like the Elman anecdote there are numerous hypotheses to test (an RM86-like model that doesn't apply the majority of the time (?); a speech error (OLC n. 32); hypercorrection (OLC p. 127); derivational ambiguity (OLC n. 16), and no doubt others.) In general: What the facts are telling us is that the right way to set up a net for the past tense is to have an input vector that encodes grammatical category, root/derived status, etc.
Perhaps such a net would "merely implement" a traditional grammar, but perhaps it would shed new light on the problem, solving some previous difficulties. What baffles us is why this obvious step would be anathema to so many connectionists. There seems to be a puzzling trend in connectionist approaches to language -- the goal of exploring the properties of nets as psycholinguistic models is married to the goal of promoting a particular view of language that eschews grammatical representations of any sort at any cost and tries to use knowledge-driven processing, associationist-style learning, or both as a substitute. In practice the empirical side of this effort often relies on isolated anecdotes and examples and ignores the vast amount of systematic research on the phenomena at hand. There's no reason why connectionist work on language has to proceed this way, as Paul Smolensky for one has pointed out. Why not exploit the discoveries of linguistics and psycholinguistics, instead of trying to ignore or rewrite them? Our understanding of both connectionism and of language would be the better for it. Steve Pinker Alan Prince
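To make "an input vector that encodes grammatical category, root/derived status, etc." concrete, here is one hypothetical encoding. The category inventory, the two flags, and the stand-in phonological vector are invented for illustration; they are not taken from OLC or from either side's actual proposals.

-----------------------------------------------------------------------
import numpy as np

CATEGORIES = ["verb_root", "denominal_verb", "deverbal_noun"]   # invented inventory
PHON_FEATURES = 10        # placeholder for a distributed phonological code

def encode(entry):
    # entry: grammatical category, derived/root status, whether the root is
    # marked irregular, plus a phonological feature vector.
    cat = np.zeros(len(CATEGORIES))
    cat[CATEGORIES.index(entry["category"])] = 1.0
    flags = np.array([float(entry["derived"]), float(entry["irregular_root"])])
    return np.concatenate([cat, flags, entry["phonology"]])

# 'fly' the strong verb vs. 'fly out' the baseball verb derived from the noun
# 'fly (ball)': same phonology (faked here with one shared vector), different
# structural features -- which is what licenses 'flew' vs. 'flied'.
phon_fly = np.random.default_rng(1).random(PHON_FEATURES)
fly     = {"category": "verb_root",      "derived": False, "irregular_root": True,  "phonology": phon_fly}
fly_out = {"category": "denominal_verb", "derived": True,  "irregular_root": False, "phonology": phon_fly}

print(encode(fly)[:5])         # structural slots differ ...
print(encode(fly_out)[:5])
print(np.allclose(encode(fly)[5:], encode(fly_out)[5:]))   # ... the phonology does not
-----------------------------------------------------------------------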
From harnad at Princeton.EDU Sat Sep 3 16:03:16 1988 From: harnad at Princeton.EDU (Stevan Harnad) Date: Sat, 3 Sep 88 16:03:16 edt Subject: On Modeling and Its Constraints (P&P PS) Message-ID: <8809032003.AA15194@mind> Pinker & Prince attribute the following 4 points (not quotes) to me, indicating that they sharply disagree with (1) and (2) and have no interest whatsoever in discussing (3) and (4): (1) Looking at the actual behavior and empirical fidelity of connectionist models is not the right way to test connectionist hypotheses. This was not the issue, as any attentive follower of the discussion can confirm.
The question was whether Pinker & Prince's article was to be taken as a critique of the connectionist approach in principle, or just of the Rumelhart & McClelland 1986 model in particular. (2) Developmental, neural, reaction time, and brain-damage data should be put aside in evaluating psychological theories. This was a conditional methodological point; it is not correctly stated in (2): IF one has a model for a small fragment of human cognitive performance capacity (a "toy" model), a fragment that one has no reason to suppose to be functionally self-contained and independent of the rest of cognition, THEN it is premature to try to bolster confidence in the model by fitting it to developmental (neural, reaction time, etc.) data. It is a better strategy to try to reduce the model's vast degrees of freedom by scaling up to a larger and larger fragment of cognitive performance capacity. This certainly applies to past-tense learning (although my example was chess-playing and doing factorials). It also seems to apply to all cognitive models proposed to date. "Psychological theories" will begin when these toy models begin to approach lifesize; then fine-tuning and implementational details may help decide between asymptotic rivals. [Here's something for connectionists to disagree with me about: I don't think there is a solid enough fact known about the nervous system to warrant "constraining" cognitive models with it. Constraints are handicaps; what's needed in the toy world that contemporary modeling lives in is more power and generality in generating our performance capacities. If "constraints" help us to get that, then they're useful (just as any source of insight, including analogy and pure fantasy can be useful). Otherwise they are just arbitrary burdens. The only face-valid "constraint" is our cognitive capacity itself, and we all know enough about that already to provide us with competence data till doomsday. Fine-tuning details are premature; we haven't even come near the station yet.] (3) The meaning of the word "learning" should be stipulated to apply only to extracting statistical regularities from input data. (4) Induction has philosophical priority over innatism. These are substantive issues, very relevant to the issues under discussion (and not decidable by stipulation). However, obviously, they can only be discussed seriously with interested parties. Stevan Harnad harnad at mind.princeton.edu From marchman at amos.ling.ucsd.edu Mon Sep 5 18:04:26 1988 From: marchman at amos.ling.ucsd.edu (Virginia Marchman) Date: Mon, 5 Sep 88 15:04:26 PDT Subject: past tense debate Message-ID: <8809052204.AA04578@amos.ling.ucsd.edu> Jumping in on the recent discussion about connectionism and the learning of the English past tense, I would like to make the following 2 points: (1) The data on acquisition of the past tense in real children may be very different from the patterns assumed by either side in this debate. (2) Networks can simulate "default" strategies that mimic the categorial rules defended by P&P, but the emergence of such rule-like behavior can depend on statistical properties of the input language (a constant input, not the discontinuous input used by R&M). This finding may be relevant to discussions for both "sides" in light of the behavioral (human) data I allude to in (1). 
(1) As a psychologist interested in the empirical facts which characterize the acquisition of the past tense (and other domains of linguistic knowledge), I agree with McClelland's comment directed to Pinker and Prince that > There's quite a bit more empirical research to be > done [to] even characterize accurately the facts about > the past tense. I believe this research will > show you that you have substantially > overstated the empirical situation in several respects. (Re: reply to S. Harnad, Connectionist Net, 8/31/88) After OLC was released in tech report form (Occasional Paper #33, 1987), I wrote a paper arguing that P&P may have underestimated the complexity and degree of individual variation inherent in the process of acquiring the English past tense ("Rules and Regularities in the acquisition of the English past tense." Center for Research in Language Newsletter, UCSD, vol. 2, #4, April, 1988). However, it is difficult for me to believe that developmental data are (in fact, or in principle) "too impoverished" to substantively contribute to the debate between the symbolic and connectionist accounts (S. Harnad, "On Theft vs. Honest Toil", Connectionist Net, 8/31/88). In the paper, I presented data on the production of past tense forms by English-speaking children between the ages of 3 and 8, using an elicitation technique essentially identical to the one used by Bybee & Slobin (i.e., the data cited in the original R&M paper). While I was fully expecting to see the standard "stages" of overgeneralization and "U-shaped" development, the data suggested that I should stop and re-think the standard characterization of the acquisition of inflectional morphology. First, my data indicated that a child can be in the "stage" of overgeneralizing the "add -ed" rule anywhere between 3 and 7 years of age. Second, errors took several forms beyond the one emphasized by P&P, i.e. overgeneralization of the "-ed" rule to irregular forms. Instead, errors seem to result from the misapplication of *several* (at least two) past tense formation processes. For example, identity mapping (e.g. "hit --> hit") was incorrectly applied to forms from several different classes (both regulars and irregulars that require a vowel change). Vowel changes were inappropriately applied to regulars and irregulars alike (including examples like "pick --> puck"). Furthermore, children committed these "irregularizations" of regular forms at the same time (i.e., within the same child) that they also committed the better-known error of regularizing irregular forms. Although individual children had "favorite" error types, the different errors patterns were not concentrated in any particular age range. These data provide two challenges to the stage model so often assumed by investigators on either side of the symbolic/connectionist debate: (a) Why is it that children with very *different* amounts of linguistic experience (e.g., 4 year olds and 7 year olds) over- and undergeneralize verbs in qualitatively similar ways? This degree of individual variation within and across age levels in "rate" of acquisition among normal children may be outside acceptable levels of tolerance for a stage model. At the very least, additional evidence is needed to conclusively assume that acquisition proceeds in a "U-shaped" fashion from rote to rule to rule+rote mechanisms. (b) In several interesting ways, children can be shown to treat irregular and regular verbs similarly during acquisition. 
Exactly what evidence does one need to show that the regular transformation (add -ed) has a privileged status *during acquisition*? Although overextension of the -ed rule is the most frequent error type overall, there was little in my data upon which to claim that regulars and irregulars are *qualitatively* different at any point in the learning process. As I state in the conclusion: ".... addressing at least some of the interesting questions for language acquistion requires looking beyond what children are supposed to be doing within any one "stage" of development. I emphasized the idiosyncratic and multi-faceted nature of children's rule-governed systems and asked whether the three-phased model is the most useful metaphor for understanding how children deal with the complexities inherent in the *systems* of language at various points in development. Rather than looking for ways to explain qualitative changes in rule types and their domain of operation, it may be more used to shift theoretical emphasis onto acquisition as a protracted resolution of several competing and interdependent sub-systems." (2) In a Technical Report that will be available in the next 4-6 weeks ("Pattern Association in a Back Propagation Network: Implications for Child Language Acquisition", Center for Research in Language, UCSD), Kim Plunkett (psykimp%dkarh02.bitnet) and I will report on a series of approx. 20 simulations conducted during the last 8 months at UCSD. Our goal was to extend the original R&M work with particular focus on the developmental aspects of the model by exploring the interaction of input assumptions with the specific learning properties of the patterns that the simulation is required to associate from input to output. Our first explorations in this problem confirmed the claim by P&P (OLC), that the U-shaped developmental performance of the R&M simulation was indeed highly sensitive to the discontinuity in vocabulary size and structure imposed upon the model. In our simulations, we did NOT introduce any "artificial" discontinuities in the input to the network across the learning period. We restricted ourselves to mappings between phonological strings -- although we agree with both P&P and McClelland that children use more sources of information (e.g. semantics) in the acquisition of an inflectional system like the past tense. It is certainly not our goal to suggest that linguistic categories (i.e. phonology, semantics) play no role in the acquisition of language, nor that a connectionist network that is required to perform phonological-to-phonological mappings is faced with the same task as a child learning language. But the results from these simulations may present useful information about the effects of different input characteristics on the kinds of errors a net will produce -- including some understanding of the conditions under which "rule-like" behaviors will and will not emerge. And, these error patterns (and the individual variability obtained -- where different simulations stand for different individuals) can shed some light on the "real" phenomena that is of the most concern. In our mixture of approaches, we are trying to systematically explore the assumptions of both the symbolic and connectionist approaches to acquisition, keeping what kids "really" do firmly in mind. For our simulations, we constructed a language that consists of legal English CVC, VCC, and CCV strings. Each present and past tense form was represented using a fixed-length distributed phonological feature system. 
The task for each network was to learn (using back-propagation) approximately 500 phonological-to-phonological mappings where the present tense forms are transformed to the past tense via one of four types of "rules": Arbitrary (any phoneme can go to any other phoneme, like GO --> WENT), Vowel Change (12 possible English vowel changes, analogous to COME --> CAME), Identity map (no change, analogous to HIT --> HIT), and the turning on of a suffix (one of three depending on the voicing of the final phoneme in the stem, analogous to WALK --> WALKED). Input strings were randomly assigned to verb classes and therefore, *no information was provided which tells the network to which class a particular verb belongs*. One primary goal of this work was to outline the particular configuration of vocabulary input (i.e. "diet") that allowed the system to achieve "adult-like competence" in the past tense, with "child-like" stages in between. Across simulations, we systematically varied the overall number of unique forms that undergo each transformation (i.e., class size), as well as the number of times each class member is presented to the system per epoch (token frequency). We experimented with several different class size and token ratios that, according to estimates out there in the literature, represent the vocabulary configuration of the past tense system in English (e.g., arbitraries are relatively few in number but are highly frequent). We used two measures of performance/acquisition after every sweep through the vocabulary: 1) rate of learning (overall error rate), and 2) success at achieving the target output forms (overall "hit" rate, consonant "hits", vowel "hits" and suffix "hits"). With these, we determined the degree to which the network was achieving the target, as well as the tendency for the network to, for example, turn on a suffix when it shouldn't, change a vowel when it should identity map, etc. *at every point along the learning curve*. I will not describe all of the results here, however, one finding is particularily relevant to the current discussion. In several of our simulations, the network tended to adopt a "default" suffixation strategy when it formed the past tense of verbs. That is, even though the system was getting a high proportion of the both the "regular" and the "irregular" (arbitrary, vowel change and identity) verbs correct, the most common errors made by the system at various points in development are best described as overgeneralizations of the "add -ed" rule. However, other error types (analogous to the "irregularizations" described above) also occurred. Certain configurations of class size (# of forms) and token frequency (# of exemplars repeated) resulted in a network that adopted suffixation as its "default" strategy; yet, in other simulations (i.e., vocabulary configurations), the network adopted "identity mapping" as its guide through the acquisition of the vocabulary. Overgeneralizations of the identity mapping procedure were prevalent in several simulations, as was the tendency to incorrectly change a vowel. It is important to stress that these different outcomes occurred in the *same* network (e.g., 3 layer, 20 input units, etc.), each one exposed to a different combination of regular and irregular input. Emergence of a default strategy (a rule?) at certain points in learning depended not on tagging of the input (as P&P suggest), but on the ratio of regulars and irregulars in the input to which the system was exposed. 
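A rough sketch of the kind of training-set construction just described: an artificial vocabulary of stems randomly assigned to the four mapping classes, with class size (number of types) and token frequency as the knobs. The particular numbers, the three-letter strings, and the single "-ed" suffix are stand-ins chosen only to echo the "few but frequent arbitraries, many infrequent regulars" idea; the actual simulations used a fixed-length distributed phonological feature code and voicing-conditioned suffixes.

-----------------------------------------------------------------------
import random

random.seed(0)
CONS, VOWELS = "bdgklmnprst", "aeiou"

def make_stem():
    return random.choice(CONS) + random.choice(VOWELS) + random.choice(CONS)

def past(stem, rule):
    if rule == "identity":          # analogous to hit -> hit
        return stem
    if rule == "vowel_change":      # analogous to come -> came
        return stem[0] + random.choice([v for v in VOWELS if v != stem[1]]) + stem[2]
    if rule == "suffix":            # analogous to walk -> walked (one suffix only here)
        return stem + "ed"
    return make_stem()              # "arbitrary": analogous to go -> went

# Class sizes and token frequencies are the experimental variables; these
# values are made up (arbitraries: few types, many tokens; regulars: the reverse).
config = {"arbitrary": (2, 15), "vowel_change": (30, 5),
          "identity": (30, 5), "suffix": (200, 1)}

training_set = []
for rule, (n_types, tokens_per_type) in config.items():
    for _ in range(n_types):
        stem = make_stem()
        training_set.extend([(stem, past(stem, rule))] * tokens_per_type)

random.shuffle(training_set)
print(len(training_set), training_set[:3])
-----------------------------------------------------------------------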
This pattern of performance could *not* have been determined by the phonological characteristics of members of either the regular or the irregular classes. That is, phonological information was available to the system (within the distributed feature representation) but the phonological structure of the stem did not determine class membership (i.e., performance was not determined by the identifiability of which "class" of relationships would obtain between the input and the output). The input-sensitivity of error patterns in our simulations may come as bad news to those who (1) care about what children do, and (2) believe that children go through a universal U-shaped pattern of development. However, as I suggest in my CRL paper, this familiar characterization of "real" children may not be the most useful for understanding the acquisition process. Default mappings, rule-like in nature, can emerge in a system that is given no explicit information about class membership (bad news for P&P?), but such an outcome is by no means guaranteed. Our current and future work includes a comparison of this set of simulations with additional sets in which information about class membership is explicitly "tagged" in the system (as P&P assume), models in which phonological similarity in the stem is varied systematically (to determine whether default mappings still emerge), and models in which semantic information is also available (as everyone on earth assumes must be the case for a realistic model of language learning). Virginia Marchman Department of Psychology C-009 UCSD La Jolla, CA 92093 marchman at amos.ucsd.ling.edu From marchman at amos.ling.ucsd.edu Tue Sep 6 20:09:40 1988 From: marchman at amos.ling.ucsd.edu (Virginia Marchman) Date: Tue, 6 Sep 88 17:09:40 PDT Subject: Past tense debate -- address correction Message-ID: <8809070009.AA09152@amos.ling.ucsd.edu> It appears that I provided the wrong email address on my posting of 9/5/88. Sorry for the inconvenience. -virginia the correct address is: marchman at amos.ling.ucsd.edu From prince at cogito.mit.edu Tue Sep 6 21:03:23 1988 From: prince at cogito.mit.edu (Alan Prince) Date: Tue, 6 Sep 88 21:03:23 edt Subject: Final Word on Harnad's Final Word Message-ID: <8809070104.AA11135@ATHENA.MIT.EDU> ``The Eye's Plain Version is a Thing Apart'' Whatever the intricacies of the other substantive issues that Harnad deals with in such detail, for him the central question must always be: "whether Pinker & Prince's article was to be taken as a critique of the connectionist approach in principle, or just of the Rumelhart & McClelland 1986 model in particular" (Harnad 1988c, cf. 1988a,b). At this we are mildly abashed: we don't understand the continuing insistence on exclusive "or". It is no mystery that our paper is a detailed analysis of one empirical model of a corner (of a corner) of linguistic capacity; nor is it obscure that from time to time, when warranted, we draw broader conclusions (as in section 8). Aside from the 'ambiguities' arising from Harnad's humpty-dumpty-ish appropriation of words like 'learning', we find that the two modes of reasoning coexist in comfort and symbiosis. Harnad apparently wants us to pledge allegiance to one side (or the other) of a phony disjunction. May we politely refuse? S. Pinker A. 
Prince From bondc at iuvax.cs.indiana.edu Wed Sep 7 07:13:02 1988 From: bondc at iuvax.cs.indiana.edu (Clay M Bond) Date: Wed, 7 Sep 88 06:13:02 EST Subject: No subject Message-ID: >From Connectionists-Request at q.cs.cmu.edu Wed Sep 7 02:21:24 1988 >Received: from B.GP.CS.CMU.EDU by Q.CS.CMU.EDU; 6 Sep 88 21:06:22 EDT >Received: from C.CS.CMU.EDU by B.GP.CS.CMU.EDU; 6 Sep 88 21:04:48 EDT >Received: from ATHENA (ATHENA.MIT.EDU.#Internet) by C.CS.CMU.EDU with TCP; Tue 6 Sep 88 21:04:28-EDT >Received: by ATHENA.MIT.EDU (5.45/4.7) id AA11135; Tue, 6 Sep 88 21:04:19 EDT >Message-Id: <8809070104.AA11135 at ATHENA.MIT.EDU> >Date: Tue, 6 Sep 88 21:03:23 edt >From: Alan Prince >Site: MIT Center for Cognitive Science >To: connectionists at c.cs.cmu.edu >Subject: Final Word on Harnad's Final Word >Status: R > > >``The Eye's Plain Version is a Thing Apart'' > >Whatever the intricacies of the other substantive issues that >Harnad deals with in such detail, for him the central question >must always be: "whether Pinker & Prince's article was to be taken >as a critique of the connectionist approach in principle, or just of >the Rumelhart & McClelland 1986 model in particular" (Harnad 1988c, cf. >1988a,b). > >At this we are mildly abashed: we don't understand the continuing >insistence on exclusive "or". It is no mystery that our paper >is a detailed analysis of one empirical model of a corner (of a >corner) of linguistic capacity; nor is it obscure that from time >to time, when warranted, we draw broader conclusions (as in section 8). >Aside from the 'ambiguities' arising from Harnad's humpty-dumpty-ish ^^^^^^^^^^^^^^^^^^^^^^^^^^ >appropriation of words like 'learning', we find that the two modes >of reasoning coexist in comfort and symbiosis. Harnad apparently ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >wants us to pledge allegiance to one side (or the other) of a phony >disjunction. May we politely refuse? > >S. Pinker >A. Prince It certainly says a great deal about the MITniks that when confronted with a valid criticism of their assumptions which they cannot defend they resort to smugness and condescension. No one has to comment on their maturity or status as scientists; they say more by their nastiness than anyone else could. I requested to be included on this mailing list because I am a cognitive scientist and am currently involved in connectionist research. Intel- ligent, scientific discussion is productive for all. Childish trash such as Pinker and Prince's response above is not welcome in my mail queue. If you have nothing of substance to say, then please don't presume that my time can be wasted. Send such pre-adult filth to alt.flame, P and P. And if you don't have the basic intelligence to perceive a very important and obvious disjunction of issues, then you certainly have no business with BAs, much less PhDs. Sincerely, C. Bond Flames to: /dev/null From jose at tractatus.bellcore.com Wed Sep 7 17:03:24 1988 From: jose at tractatus.bellcore.com (Stephen J Hanson) Date: Wed, 7 Sep 88 17:03:24 EDT Subject: observations Message-ID: <8809072103.AA28301@tractatus.bellcore.com> I thought it interesting in the various exchanges that Pinker and Prince never bothered to provide an alternative model for what seems to clear set of phenomenon in language acquisition. Rumelhart and McClelland did have a model--and it kind of worked.. even if they maybe should have considered in other experiments using other kinds of features (perhaps sentential syntactic or semantic). 
Nonetheless, the model has/had interesting properties, could be extended, tested and analyzed, was well defined in terms of failures and successes, and apparently provides some heuristics for more experiments and refinements and improvements on the basic model --I'm not sure what more one could ask for. The complaints concerning the nature of pattern associators seems odd and off the mark--probably a simple misunderstanding concerning technical issues. And the data concerning verb past tense acquisition are obviously important-- I doubt R & M would disagree. So what's the problem? I and perhaps others watching all the words fly (no, I have nothing to say about flying words) by wonder what exactly is going on here--Is there another model waiting in the wings that can compete with the R & M model? What specific alternative approaches really exist for modeling verb past tense acquisition (notice this does mean learning)? If there are no others, perhaps P &P and R &M should work on improved model together. Stephen J. Hanson (jose at bellcore.com) From bates at amos.ling.ucsd.edu Wed Sep 7 18:14:21 1988 From: bates at amos.ling.ucsd.edu (Elizabeth Bates) Date: Wed, 7 Sep 88 15:14:21 PDT Subject: observations Message-ID: <8809072214.AA12917@amos.ling.ucsd.edu> As a child language researcher and a by-stander in the current debate, I would like to reassure some of the AI folks about the good intentions on both sides. Unfortunately, the current argument has deteriorated to the academic equivalent of "Your mother wears army boots!". But there is valid stuff behind it all. My sympathies tend to lie more on the connectionist side, but P&P deserve our careful attention for several reasons. (1) They are (in my humble view) the first of the vocal critics of PDP who have bothered to look carefully at the details of even ONE model, as opposed to those (like Fodor and Pylyshyn) who have pulled their 1960's arguments out of the closet and dusted them off in the smug conviction that nothing has changed. (2) Although I think P&P overstate the strength of their empirical case (i.e. they are wrong on many counts about the intuitions of adults and the behavior of children) they do take the empirical evidence seriously, something I wish practicioners on BOTH sides of the aisle would do more often. (3) Steve Pinker is one of the few child language researchers who has indeed put forward a (reasonably) coherent model of the learning process. It is far too nativist for me, in the sense that it solves too many problems by stipulation (..."Let us assume that the child knows some version of X-bar theory...."). As any mathematician knows, the more you do by assumption, the less you have to prove. In that (limited) sense, I agree with Steven Harnad. But I strongly recommend that interested network subscribers take a good look at Steve Pinker's book and decide for themselves. There is indeed a nasty habit of speech at MIT, an irritating smugness that does not contribute to the progress of science. I probably like that less than anyone. But there is also real substance and a lot of sweat that has gone into the P&P work on connectionism. They deserve to be answered on those terms (try ignoring the tone of voice -- you'll need the practice if you have or plan to have adolescent children). Having said that, let me underscore the value of looking carefully at real human data in developing models, arguments and counterarguments about the acquisition and use of language. 
One of the worst flaws in the R&M model was the abrupt change in the input that they used to create a U-shaped function -- in the cherished belief, based on many text-book accounts, that such a U-shaped development exists in children. To borrow a phrase from our sainted vice-president: READ MY LIPS! There is no U-shaped function, no sudden drop from one way of forming the past tense to another. There is, instead, a protracted competition between forms that may drag on for years, and there is considerable individual variability in the process. I recommend that you (re)read Virginia Marchman's comments to get a better hold of the facts. Similar arguments can be made about the supposedly crisp intuitions P&P claim for adults (fly --> (flew,flied)). They have raised an interesting behavioral domain for our consideration, but I can assure you that adult behavior and adult intuitions are not crisp at all. The Elman anecdote that Jay McClelland brought to our attention is not irrelevant, nor is it isolated. I have a reasonably good control over the English language myself, and yet I still vacillate in passivizing many forms (is it "sneaked" or "snuck"?). Crisp intuitions and U-shaped functions are idealizations invented by linguists, accepted by psycholinguists who should have known better, passed on to computer scientists and perpetuated in simulations even by people like R&M who are ideologically predisposed to think otherwise. --Elizabeth Bates (bates at amos.ling.ucsd.edu). From Dave.Touretzky at B.GP.CS.CMU.EDU Wed Sep 7 22:49:31 1988 From: Dave.Touretzky at B.GP.CS.CMU.EDU (Dave.Touretzky@B.GP.CS.CMU.EDU) Date: Wed, 07 Sep 88 22:49:31 EDT Subject: schedule for the upcoming NIPS conference Message-ID: <2710.589690171@DST.BOLTZ.CS.CMU.EDU> A copy of the preliminary schedule for the upcoming NIPS conference (November 28-December 1, with workshops December 1-3) appears below. NIPS is a single-track, purely scientific conference. The program committee, chaired by Scott Kirkpatrick, was very selective: only 25% of submissions were accepted this year. There will be 25 oral presentations and 60 posters. The proceedings will be available around the end of April '89, but they can be ordered now from Morgan Kaufmann Publishers, P.O. Box 50490, Palo Alto, CA 94303-9953; tel. 415-578-9911. Prepublication price is $33.95, plus $2.25 postage ($4.00 for overseas orders). California residents must add sales tax. Specify that you want "Advances in Neural Information Processing Systems". PRELIMINARY PROGRAM, NIPS '88 Denver, November 29-December 1, 1988 Tuesday AM __________ SESSION O1: Learning and Generalization ________________________________________ Invited Talk 8:30 O1.1: "Birdsong Learning", Mark Konishi, Division of Biology, California Institute of Technology Contributed Talks 9:10 O1.2: "Comparing Generalization by Humans and Adaptive Networks", M. Pavel, M.A. Gluck, V. Henkle, Department of Psychology, Stanford University 9:40 O1.3: "An Optimality Principle for Unsupervised Learn- ing", T. Sanger, AI Lab, MIT 10:10 Break 10:30 O1.4: "Learning by Example with Hints", Y.S. Abu- Mostafa, California Institute of Technology, Department of Electrical Engineering 11:00 O1.5: "Associative Learning Via Inhibitory Search", D.H. Ackley, Cognitive Science Research Group, Bell Communi- cation Research, Morristown NJ 11:30 O1.6: "Speedy Alternatives to Back Propagation", J. Moody, C. 
Darken, Computer Science Department, Yale Univer- sity Tuesday PM __________ 12:00 Poster Preview I SESSION P1A: Learning and Generalization _________________________________________ P1A.1: "Efficient Parallel Learning Algorithms for Neural Networks", A. Kramer, Prof. A. Sangiovanni-Vincentelli, De- partment of EECS, U.C. Berkeley P1A.2: "Properties of a Hybrid Neural Network-Classifier System", Lawrence Davis, Bolt Beranek and Newman Laborato- ries, Cambridge, MA P1A.3: "Self Organizing Neural Networks For The Identifica- tion Problem", M.R. Tenorio, Wei-Tsih Lee, School of Elec- trical Engineering, Purdue University P1A.4: "Comparison of Multilayer Networks and Data Analy- sis", P. Gallinari, S. Thiria, F. Fogelman-Soulie, Laboratoire d'Intelligence Artificielle, Ecole des Hautes Etudes en Informatique, Universite' de Paris 5, 75 006 Paris, France P1A.5: "Neural Networks and Principal Component Analysis: Learning from Examples, without Local Minima", P. Baldi, K. Hornik, Department of Mathematics, University of California, San Diego P1A.6: "Learning by Choice of Internal Representations", Tal Grossman, Ronny Meir, Eytan Domany, Department of Elec- tronics, Weizmann Institute of Science P1A.7: "What size Net Gives Valid Generalization?", D. Haussler, E.B. Baum, Department of Computer and Information Sciences, University of California, Santa Cruz P1A.8: "Mean Field Annealing and Neural Networks", G. Bilbro, T.K. Miller, W. Snyder, D. Van den Bout, M White, R. Mann, Department of Electrical and Computer Engineering, North Carolina State University P1A.9: "Connectionist Learning of Expert Preferences by Comparison Training", G. Tesauro, University of Illinois at Urbana-Champign, Champaign, IL P1A.10: "Dynamic Hypothesis Formation in Connectionist Net- works", M.C. Mozer, Department of Psychology and Computer Science, University of Toronto P1A.11: "Digit Recognition Using a Multi-Architecture Feed Forward Neural Network", W.R. Gardner, L. Pearlstein, De- partment of Electrical Engineering, University of Delaware P1A.12: "The Boltzmann Perceptron: A Multi-Layered Feed- Forward Network, Equivalent to the Boltzmann Machine", Eyal Yair, Allen Gersho, Center For Information Processing Re- search, University of California P1A.13: "Adaptive Neural-Net Preprocessing for Signal De- tection in Non-Gaussian Noise", R.P. Lippmann, P.E. Beckmann, MIT Lincoln Laboratory, Lexington, MA P1A.14: "Training Multilayer Perceptrons with the Extended Kalman Algorithm", S. Singhal, L. Wu, Bell Communications Research, Morristown, NJ P1A.15: "GEMINI: Gradient Estimation through Matrix Inver- sion after Noise Injection", Y. LeCun, C.C. Galland, G.E. Hinton, Computer Science Department, University of Toronto P1A.16: "Analysis of Recurrent Backpropagation", P.Y. Simard, M.B. Ottaway, D.H. Ballard, Department of Computer Science, University of Rochester P1A.17: "Scaling and Generalization in Neural Networks: a Case Study", Subutai Ahmad, Gerald Tesauro, Center for Com- plex Systems Research, University of Illinois at Urbana- Champaign P1A.18: "Does the Neuron "Learn" Like the Synapse?", R. Tawel, Jet Propulsion Laboratory, California Institute of Technology P1A.19: "Experiments on Network Learning by Exhaustive Search", D. B. Schwartz, J. S. Denker, S. A. Solla, AT&T Bell Laboratories, Holmdel, NJ P1A.20: "Some Comparisons of Constraints for Minimal Net- work Construction, with Backpropagation", Stephen Jose Hanson, Lorien Y. 
Pratt, Bell Communications Research, Morristown, NJ P1A.21: "Implementing the Principle of Maximum Information Preservation: Local Algorithms for Biological and Synthetic Networks", Ralph Linsker, IBM Thomas J. Watson Research Cen- ter, Yorktown Heights, NY P1A.22: "Biological Implications of a Pulse-Coded Reformu- lation of Klopf's Differential-, Hebbian Learning Algo- rithm", M.A. Gluck, D. Parker, E. Reifsnider, Department of Psychology, Stanford University SESSION P1B: Applications __________________________ P1B.1: "Comparison of Two LP Parametic Representations in a Neural Network-based, Speech Recognizer", K.K. Paliwal, Tata Institute of Fundamental Research, Homi Bhabha Road, Bombay-400005, India P1B.2: "Nonlinear Dynamical Modeling of Speech Using Neural Networks", N. Tishby, AT&T Bell Laboratories, Murray Hill, NJ P1B.3: "Use of Multi-Layered Networks for Coding Speech with Phonetic Features", Y. Bengio, R. De Mori, School of Computer Science, McGill University P1B.4: "Speech Production Using Neural Network with Cooper- ative Learning Mechanism", M. Komura, A. Tanaka, Interna- tional Institute for Advanced Study of Social Information Science, Fujitsu Limited, Japan P1B.5: "Temporal Representations in a Connectionist Speech System", E.J. Smythe, Computer Science Department, Indiana University P1B.6: "TheoNet: A Connectionist Network Implementation of a Solar Flare Forecasting Expert System (Theo)", R. Fozzard, L. Ceci, G. Bradshaw, Department of Computer Science & Psy- chology, University of Colorado at Boulder P1B.7: "An Information Theoretic Approach to Rule-Based Connectionist Expert Systems", R.M. Goodman, J.W. Miller, P. Smyth, Department of Electrical Engineering California In- stitute of Technology, Pasadena, CA P1B.8: "Neural TV Image Compression Using Hopfield Type Networks", M. Naillon, J.B. Theeten, G. Nocture, Laboratoires d'Electronique et de Physique Appliquee (LEP1), France P1B.9: "Neural Net Receivers in Spread-Spectrum Multiple- Access Communication Systems", B.P. Paris, G. Orsak, M.K. Varanasi, B. Aazhang, Department of Electrical & Computer Engineering, Rice University P1B.10: "Performance of Synthetic Neural Network Classi- fication of Noisy Radar Signals", I. Jouny, F.D. Garber, De- partment of Electrical Engineering, The Ohio State University P1B.11: "The Neural Analog Diffusion-Enhancement Layer (NADEL) and Early Visual, Processing", A.M. Waxman, M. Seibert, Laboratory for Sensory Robotics, Boston University P1B.12: "A Cooperative Network for color Segmentation", A. Hurlbert, T. Poggio, Center for Biological Information Proc- essing, Whitaker College P1B.13: "Neural Network Star Pattern Recognition for Spacecraft Attitude Determination, and Control", P. Alvelda, M.A. San Martin, C.E. Bell, J.Barhen, The Jet Propulsion Laboratory, California Institute of Technology, P1B.14: "Neural Networks that Learn to Discriminate Similar Kanji Characters", Yoshihiro Mori, Kazuhiko Yokosawa, ATR Auditory and Visual Perception Research Laboratories, Osaka, Japan P1B.15: "Further Explorations in the Learning of Visually- Guided Reaching: Making, MURPHY Smarter", B.W. Mel, Center for Complex Systems Research, University of Illinois P1B.16: "Using Backpropagation to Learn the Dynamics of a Real Robot Arm", K. Goldberg, B. Pearlmutter, Department of Computer Science, Carnegie-Mellon University SESSION O2: Applications _________________________ Invited Talk 2:20 O2.1: "Speech Recognition," John Bridle, Royal Radar Establishment, Malvern, U.K. 
Contributed Talks 3:00 O2.2: "Modularity in Neural Networks for Speech Recog- nition," A. Waibel, Carnegie Mellon University 3:30 O2.3: "Applications of Error Back-propagation to Pho- netic Classification," H.C. Leung, V.W. Zue, Department of Electrical Eng. & Computer Science, MIT 4:00 O2.4: "Neural Network Recognizer for Hand-Written Zip Code Digits: Representations,, Algorithms, and Hardware," J.S. Denker, H.P. Graf, L.D. Jackel, R.E. Howard, W. Hubbard, D. Henderson, W.R. Gardner, H.S. Baird, I. Guyon, AT&T Bell Laboratories, Holmdel, NJ 4:30 O2.5: "ALVINN: An Autonomous Land Vehicle in a Neural Network," D.A. Pomerleau, Computer Science Department, Carnegie Mellon University 5:00 O2.6: "A Combined Multiple Neural Network Learning System for the Classification of, Mortgage Insurance Appli- cations and Prediction of Loan Performance," S. Ghosh, E.A. Collins, C. L. Scofield, Nestor Inc., Providence, RI 8:00 Poster Session I Wednesday AM ____________ SESSION O3: Neurobiology _________________________ Invited Talk 8:30 O3.1: "Cricket Wind Detection," John Miller, Depart- ment of Zoology, UC Berkeley Contributed Talks 9:10 O3.2: "A Passive, Shared Element Analog Electronic Cochlea," D. Feld, J. Eisenberg, E.R. Lewis, Department of Electrical Eng. & Computer Science, University of California, Berkeley 9:40 O3.3: "Neuronal Maps for Sensory-motor Control in the Barn Owl," C.D. Spence, J.C. Pearson, J.J. Gelfand, R.M. Peterson, W.E. Sullivan, David Sarnoff Research Ctr, Subsid- iary of SRI International, Princeton, NJ 10:10 Break 10:30 O3.4: "Simulating Cat Visual Cortex: Circuitry Under- lying Orientation Selectivity," U.J. Wehmeier, D.C. Van Essen, C. Koch, Division of Biology, California Institute of Technology 11:00 O3.5: Model of Ocular Dominance Column Formation: Ana- lytical and Computational, Results," K.D. Miller, J.B. Keller, M.P. Stryker, Department of Physiology, University of California, San Francisco 11:30 O3.6: "Modeling a Central Pattern Generator in Soft- ware and Hardware:, Tritonia in Sea Moss," S. Ryckebusch, C. Mead, J. M. Bower, Computational Neural Systems Program, Caltech Wednesday PM ____________ 12:00 Poster Preview II SESSION P2A: Structured Networks _________________________________ P2A.1: "Training a 3-Node Neural Network is NP-Complete," A. Blum, R.L. Rivest, MIT Lab for Computer Science P2A.2: "A Massively Parallel Self-Tuning Context-Free Parser," E. Santos Jr., Department of Computer Science, Brown University, P2A.3: "A Back-Propagation Algorithm With Optimal Use of Hidden Units," Y. Chauvin, Thomson CSF, Inc./ Stanford Uni- versity P2A.4: "Analyzing the Energy Landscapes of Distributed Winner-Take-All Networks," D.S. Touretzky, Computer Science Department, Carnegie Mellon University P2A.5: "Dynamic, Non-Local Role Bindings and Inferencing in a Localist Network For Natural Language Understanding," T.E. Lange, M.G. Dyer, Computer Science Department, University of California, Los Angeles P2A.6: "Spreading Activation Over Distributed Microfea- tures," J. Hendler, Department of Computer Science, Univer- sity of Maryland P2A.7: "Short-term Memory as a Metastable State: A Model of Neural Oscillator For A Unified Submodule," A.B. Kirillov, G.N. Borisyuk, R.M. Borisyuk, Ye.I. Kovalenko, V.I. Kryukov, V.I. Makarenko, V.A. Chulaevsky, Research Computer Center, USSR Academy of Sciences P2A.8: "Statistical Prediction with Kanerva's Sparse Dis- tributed Memory," D. 
Rogers, Research Institute for Advanced Computer Science, NASA Ames Research Ctr, Moffett Field, CA P2A.9: "Image Restoration By Mean Field Annealing," G.L. Bilbro, W.E. Snyder, Dept. of Electrical and Computer Engi- neering, North Carolina State University P2A.10: "Automatic Local Annealing," J. Leinbach, Depart- ment of Psychology, Carnegie-Mellon University P2A.11: "Neural Networks for Model Matching and Perceptual Organization," E. Mjolsness, G. Gindi, P. Anandan, Depart- ment of Computer Science, Yale University P2A.12: "On the k-Winners-Take-All Feedback Network and Ap- plications," E. Majani, R. Erlanson, Y. Abu-Mostafa, Jet Propulsion Laboratory, California Institute of Technology, P2A.13: "An Adaptive Network that Learns Sequences of Tran- sitions," C.L. Winter, Science Applications International Corporation, Tucson, Arizona P2A.14: "Convergence and Pattern-Stabilization in the Boltzmann Machine," M. Kam, R. Cheng, Department of Elec- trical and Computer Eng., Drexel University SESSION P2B: Neurobiology __________________________ P2B.1: "Storage of Covariance By The Selective Long-Term Potentiation and Depression of, Synaptic Strengths In The Hippocampus", P.K. Stanton, J. Jester, S. Chattarji, T.J. Sejnowski, Department of Biophysics, The Johns Hopkins Uni- versity P2B.2: "A Mathematical Model of the Olfactory Bulb", Z. Li, J.J. Hopfield, Division of Biology, California Institute of Technology P2B.3: "A Model of Neural Control of the Vestibulo-Ocular Reflex", M.G. Paulin, S.Ludtke, M. Nelson, J.M. Bower, Divi- sion of Biology, California Institute of Technology P2B.4: "Associative Learning in Hermissenda: A Lumped Pa- rameter Computer Model, of Neurophysiological Processes", Daniel L. Alkon, Francis Quek, Thomas P. Vogl, Environmental Research Institute of Michigan, Arlington, VA P2B.5: "Reconstruction of the Electric Fields of the Weakly Electric Fish Gnathonemus, Petersii Generated During Explor- atory Activity", B. Rasnow, M.E. Nelson, C. Assad, J.M. Bower, Department of Physics, California Institute of Tech- nology P2B.6: "A Model for Resolution Enhancement (Hyperacuity) in Sensory Representation" J. Miller, J. Zhang, Department of Zoology, University of California, Berkeley P2B.7: "Coding Schemes for Motion Computation in Mammalian Cortex", H.T. Wang, B.P. Mathur, C. Koch, Rockwell Interna- tional Science Ctr., Thousand Oaks, CA P2B.8: "Theory of Self- Organization of Cortical Maps", S. Tanaka, NEC Corporation- Fundamental Res. Lab., Kawasaki Kanagawa, 213 JAPAN P2B.9: "A Bifurcation Theory Approach to the Programming of Periodic Attractors, in Network Models of Olfactory Cortex", Bill Baird, Department of Biophysics, University of California at Berkeley P2B.10: "Neuronal Cartography: population coding and resol- ution enhancement, through arrays of broadly tuned cells", Pierre Baldi, Walter Heiligenberg, Department of Mathemat- ics, University of California, San Diego P2B.11: "Learning the Solution to the Aperture Problem for Pattern Motion with a Hebb Rule", M.I. Sereno, Division of Biology, California Institute of Technology P2B.12: "A Model for Neural Directional Selectivity that Exhibits Robust Direction of, Motion Computation", N.M. Grzywacz, F.R. Amthor, Center for Biological Information Processing, Whitaker College, Cambridge, MA P2B.13: "A Low-Power CMOS Circuit which Emulates Temporal Electrical Properties of, Neurons", J. Meador, C. 
Cole, De- partment of Electrical and Computer Engineering, Washington State University P2B.14: "A General Purpose Neural Network Simulator for Im- plementing Realistic Models of Neural Circuits", M.A. Wilson, U.S. Bhalla, J.D. Uhley, J.M. Bower, Division of Bi- ology, California Institute of Technology, SESSION P2C: Implementation ____________________________ P2C.1: "MOS Charge Storage of Adaptive Networks," R.E. Howard, D.B. Schwartz, AT&T Bell Laboratories, Holmdel, NJ P2C.2: "A Self-Learning Neural Network," A. Hartstein, R.H. Koch, IBM-Thomas J. Watson Research Center, Yorktown Heights, NY P2C.3: "An Analog VLSI Chip for Cubic Spline Surface In- terpolation," J.G. Harris, Division of Computation and Neural Systems, California Institute of Technology P2C.4: "Analog Implementation of Shunting Neural Networks," B. Nabet, R.B. Darling, R.B. Pinter, Department of Elec- trical Engineering University of Washington P2C.5: "Stability of Analog Neural Networks with Time De- lay," C.M. Marcus, R.M. Westervelt, Division of Applied Sci- ences, Harvard University P2C.6: "Analog subthreshold VLSI circuit for interpolating sparsely sampled 2-D, surfaces using resistive networks," J. Luo, C. Koch, C. Mead, Division of Biology California Insti- tute of Technology P2C.7: "A Physical Realization of the Winner-Take-All Func- tion," J. Lazzaro, C.A. Mead, Computer Science California Institute of Technology P2C.8: "General Purpose Neural Analog Computer," P. Mueller, J. Van der Spiegel, D. Blackman, J. Dao, C. Donham, R. Furman, D.P. Hsieh, M. Loinaz, Department of Biochemistry and Biophysics, University of Pennsylvania P2C.9: "A Silicon Based Photoreceptor Sensitive to Small Changes in Light Intensity," C.A. Mead, T. Delbruck, California Institute of Technology Pasadena, CA P2C.10: "A Digital Realisation of Self-Organising Maps," M.J. Johnson, N.M. Allinson, K. Moon, Department of Elec- tronics, University of York, England P2C.11: "Training of a Limited-Interconnect, Synthetic Neural IC," M.R. Walker, L.A. Akers, Center for solid-State Electronics Research, Arizona State University P2C.12: "Electronic Receptors for Tactile Sensing," A.G. Andreou, Department of Electrical and Computer Engineering, The Johns Hopkins University P2C.13: "Cooperation in an Optical Associative Memory Based on Competition," D.M. Liniger, P.J. Martin, D.Z. Anderson, Department of Physics & Joint Inst. for Laboratory Astrophysics, University of Colorado, Boulder SESSION O4: Computational Structures _____________________________________ Invited Talk 2:20 O4.1: "Symbol Processing in the Brain," Geoffrey Hinton, Computer Science Department, University of Toronto Contributed Talks 3:00 O4.2: "Towards a Fractal Basis for Artificial Intelli- gence," Jordan Pollack, New Mexico State University, Las Cruces, NM 3:30 O4.3: "Learning Sequential Structure In Simple Recur- rent Networks," D. Servan-Schreiber, A. Cleeremans, J.L. McClelland, Department of Psychology, Carnegie-Mellon Uni- versity 4:00 O4.4: "Short-Term Memory as a Metastable State "Neurolocator," A Model of Attention", V.I. Kryukov, Re- search Computer Center, USSR Academy of Sciences 4:30 O4.5: "Heterogeneous Neural Networks for Adaptive Be- havior in Dynamic Environments," R.D. Beer, H.J. Chiel, L.S. Sterling, Center for Automation and Intelligent Sys. Res., Case Western Reserve University, Cleveland, OH 5:00 O4.6: "A Link Between Markov Models and Multilayer Perceptions," H. Bourlard, C.J. 
Wellekens, Philips Research Laboratory, Brussels, Belgium 7:00 Conference Banquet 9:00 Plenary Speaker "Neural Architecture and Function," Valentino Braitenberg, Max Planck Institut fur Biologische Kybernetik, West Germany Thursday AM ___________ SESSION O5: Applications _________________________ Invited Talk 8:30 O5.1: "Robotics, Modularity, and Learning," Rodney Brooks, AI Lab, MIT Contributed Talks 9:10 O5.2: "The Local Nonlinear Inhibition Circuit," S. Ryckebusch, J. Lazzaro, M. Mahowald, California Institute of Technology, Pasadena, CA 9:40 O5.3: "An Analog Self-Organizing Neural Network Chip," J. Mann, S. Gilbert, Lincoln Laboratory, MIT, Lexington, MA 10:10 Break 10:30 O5.4: "Performance of a Stochastic Learning Micro- chip," J. Alspector, B. Gupta, R.B. Allen, Bellcore, Morristown, NJ 11:00 O5.5: "A Fast, New Synaptic Matrix For Optically Pro- grammed Neural Networks," C.D. Kornfeld, R.C. Frye, C.C. Wong, E.A. Rietman, AT&T Bell Laboratories, Murray Hill, NJ 11:30 O5.6: "Programmable Analog Pulse-Firing Neural Net- works," Alan F. Murray, Lionel Tarassenko, Alister Hamilton, Department of Electrical Engineering, University of Edinburgh Scotland, UK 12:00 Poster Session II From steve at psyche.mit.edu Thu Sep 8 12:59:30 1988 From: steve at psyche.mit.edu (Steve Pinker) Date: Thu, 8 Sep 88 12:59:30 edt Subject: Comments on Marchman's note Message-ID: <8809081700.AA07322@ATHENA.MIT.EDU> One thing that is not in dispute in the past tense debate: we could use more data on children's development and on the behavior of network models designed to acquire morphological regularities. It is good to see Virginia Marchman contribute useful results on these problems. In a complex area, however, it especially important to be clear about the factual and theoretical claims under contention. In OLC, we praised R-M for their breadth of coverage of developmental data (primarily a diverse set of findings from Bybee & Slobin's experiments), and reviewed all of these data plus additional experimental, diary, and transcript studies. The thrust of Marchman's note is that "the data on the acquisition of the past tense in real children may be very different from the patterns assumed by either side in this debate". More specifically, she cites Jay McClelland's recent prediction that future research will show that we have "substantially overstated the empirical situation in several respects". We are certainly prepared to learn that future research will modify our current summary of the data or fail to conform to predictions. But Marchman's experiment, as valuable as it is, largely replicates results that have been in the literature for some time and that have been discussed at length, most recently, by R-M and ourselves. Furthermore, the data she presents are completely consistent with the picture presented in OLC, and she does not actually document a single case where we "underestimated the complexity and degree of individual variation inherent in the process of acquiring the English past tense". 1. Marchman reports that 'a child can be in the "stage" of overgeneralizing the "add -ed" rule anywhere between 3 and 7 years old.' The fact that overregularizations occur over a span of several years is well-known in the literature, documented most thoroughly in the important work of Kuczaj in the late 1970's. It figures prominently in the summary of children's development in OLC (e.g. p. 137). 2. She calls into question the characterization of children's development as following a 'U'-shaped curve. 
The 'U'- sequence that R-M and we were referring to is simply that (i) very young children do not overregularize from the day they begin to talk, but can use some correct past tense forms (e.g. 'came') for a while before (ii) overregularizations (e.g. 'comed') appear in their speech, which (iii) diminish by adulthood. Thus if you plot percentage of overregularizations against time, the curve is nonmonotonic in a way that can be described as an inverted-U. This is all that we (or everyone else) means by 'stages' or 'U-shaped development', no more, no less. No one claims that the transitions are discrete, or that the behavior within the stages is simple or homogeneous (this should be clear to anyone reading R-M or OLC). Marchman does not present any data that contradict this familiar characterization. Nor could she; her study is completely confined to children within stage (ii). 3. She reports that "errors took several forms beyond the one emphasized by P&P, i.e. overgeneralization of the "-ed" rule to irregular forms. Instead, errors seem to result from the misapplication of *several* (at least two) past tense formation processes" (identity mapping, vowel changes, and addition of 'ed'). But the fact that children can say 'bringed', 'brang', and 'brung' is hardly news. (We noted that these errors exist (e.g. 'bite/bote', p. 161, p. 180) and that they are rarer than '-ed' overregularizations (p. 160).) As for its role in the past tense debate, in OLC much attention is devoted to the acquisition of multiple regularization mechanisms in general (pp. 130-136) and identity-mapping (pp. 145-151) and vowel-shift subregularities (pp. 152-157) in particular. (Marchman does call attention to the fact that vowel-change subregularization errors can occur for *regular* verbs, as in 'pick/puck'. We find cases like 'trick/truck' in our naturalistic data as well. Interestingly, the R-M model never did this. All of its suprathreshold vowel-shift errors with regular verbs blended the vowel-change with a past tense ending (e.g.'sip/sepped'). Indeed even among the irregulars it came up with a bare vowel-change response in only 1 out of its 16 outputs. This is symptomatic of one of the major design problems of the model: its distributed representations makes it prone to blending regularities rather than entertaining them as competitors.) 4. Contrary to the claim that we neglect individual variation in children, we explicitly discuss it in a number of places (see, e.g. p. 144). 5. Marchman writes, "In several interesting ways, children can be shown to treat irregular and regular verbs similarly during acquisition." This is identical to the claim in OLC (pp. 130-131, 135-136), though of course the interpretation of this fact is open to debate. We emphasized that the regularity of the English '-ed' rule and the irregularity of the (e.g.) 'ow/ew' alternation are not innate, but are things the child has to figure out from an input sample. This learning cannot be instantaneous and thus "the child who *has not yet figured out* the distinction between regular, subregular, and idiosyncratic cases will display behavior that is similar to a system that is *incapable of making* the distinction" (p. 136). 6. According to Marchman, we suggest that regulars and irregulars are tagged as such in the input. To our knowledge, no one has made this very implausible claim, certainly not us. 
On the contrary, we are explicitly concerned with the learning problems resulting from the fact that the distinction is *not* marked in the input (pp.128-136). 7. Finally, Marchman previews a report of a set of runs from a new network simulation of past tense acquisition. We look forward to a full report, at which point detailed comparisons will become possible. At this point, in comparing her work to the OLC description of the past tense, she appears to have misinterpreted what we mean by the 'default' status of the regular rule. She writes as if it means that the regular rule is productively overgeneralized. However, the point of our discussion of the difference between the irregular and regular subsystems (pp. 114-123) is that there are about 6 criteria distinguishing regular from irregular alternations that go beyond the mere fact of generalizability itself. These criteria are the basis of the claim (reiterated in the Cog Sci Soc talk) that the regular rule acts as a 'default', in contrast to what happens in the R-M model (pp. 123-125). Marchman does not deal with these issues. In sum, Marchman's data are completely consistent with the empirical picture presented in OLC. Steven Pinker Alan Prince From steve at psyche.mit.edu Fri Sep 9 00:47:10 1988 From: steve at psyche.mit.edu (Steve Pinker) Date: Fri, 9 Sep 88 00:47:10 edt Subject: Two Observations of E. Bates Message-ID: <8809090448.AA17720@ATHENA.MIT.EDU> (1) Concerning the development of the past tense, Elizabeth Bates writes "there is no U-shaped function", based on Marchman's data. This implies that researchers in the area have made some fundamental error that vitiates their attempts at theory. But, as noted in our comments on Marchman, the 'U'-sequence that everyone refers to is simply that (i) very young children do not overregularize from the day they begin to talk, but can use some correct past tense forms (e.g. 'came') for a while before (ii) overregularizations (e.g. 'comed') appear in their speech, which (iii) diminish by adulthood. Thus if you plot percentage of overregularizations against time, the curve is nonmonotonic in a way that can be described as an inverted-U. This is all that we (or everyone else) mean by 'stages' or 'U-shaped development', no more, no less. No one claims that the transitions are discrete, or that the behavior within the stages is simple or homogeneous (this should be clear to anyone reading R&M or OLC). Marchman herself does not present any data that contradict this familiar characterization. Nor could she; her study is completely confined to children within stage (ii). Further discussion of the relation between Marchman's data and the empirical picture drawn in R-M, OLC and other studies can be found in our remarks on Marchman. (2) Bates runs two issues together: -whether judgments are always "crisp" ('sneaked' versus 'snuck'), -whether verbs derived from nouns and adjectives are regular ('out-Sally-Rided' versus 'overrode'). The implication is that endemic sogginess of judgment, overlooked or suppressed by linguists, makes it impossible to say anything about regularization-through-derivation. Noncrispness of judgments of irregular forms was confronted explicitly in OLC, which has a pretty thorough documentation of the phenomenon (p. 116-117, p. 118-119, and the entire Appendix). 
The important thing about the cross-category effect is that it implies that the linguistic notions 'irregular', 'root', and 'syntactic category' have mentally-represented counterparts; it also emerges from a conspicuously narrow exposure to the data (since learners are not flooded with examples of denominal verbs that happen to be homophonous with irregulars); it is found with consistency across languages. The effect could be true whether the judgments respect part-of-speech distinctions absolutely or probabilistically. As long as a significant proportion of the variance is uniquely accounted for by syntactic category, there is something to explain. In fact, of course, most of the relevant judgments are quite clear (*high-stuck the goalie, *kung the checkers piece; OLC p. 111), and there can be little question that syntactic category is a compelling force for regularization, far more potent than unaided semantics (e.g. 'he cut/*cutted a deal'; OLC pp. 112-113). We regard this effect (due largely to work by Kiparsky) as a major, surprising discovery about the way linguistic systems are organized. For specific hypotheses about *when* and *why* some such judgments should be fuzzy, see Note 17 (p. 112) and pp. 126-127. Alan Prince Steven Pinker From bates at amos.ling.ucsd.edu Fri Sep 9 16:49:25 1988 From: bates at amos.ling.ucsd.edu (Elizabeth Bates) Date: Fri, 9 Sep 88 13:49:25 PDT Subject: Two Observations of E. Bates Message-ID: <8809092049.AA03836@amos.ling.ucsd.edu> Does the U-shaped function, then, mean nothing more to P&P than the claim that errors come and go? If so, I see little here that cries out for unique qualitative mechanisms and/or representations, above and beyond garden-variety learning. However, even allowing the weak version of the U that P&P describe, it is still not inevitable that children begin with irregulars, then over-regularize. My own daughter, for example, passed directly into over-regularizations of "go" and "come" as her first-ever past tense forms. It seems to me that the question re whether unitary mental categories are required (to account for the irregular/regular contrast) ought to revolve around the presence of evidence for a *qualitative* difference between the two. Otherwise, we are merely haggling over the price..... -liz From bever at prodigal.psych.rochester.EDU Sat Sep 10 01:09:35 1988 From: bever at prodigal.psych.rochester.EDU (bever@prodigal.psych.rochester.EDU) Date: Sat, 10 Sep 88 01:09:35 EDT Subject: Light Message-ID: <8809100509.4381@prodigal.psych.rochester.edu> Recent correspondence has focussed on the performance level of the Rumelhart and McClelland past tense learning model and subsequent models, under varying conditions of feeding. Pinker and Prince point out that the model is unsuccessful by normal statistical standards. The responses so far seem to be: (1) that's always the way it is with new models (Harnad), (2) adults may perform more like the model than P&P assume (Bates, Elman, McClelland) and (3) children may not conform to the rules very well either (Bates, Marchman). We think that the exact performance level and pattern of the model is not the only test of its validity, for the following reasons: 1) Such models work only insofar as they presuppose rule-based structures. 2) The past-tense overgeneralization errors are errors of behavior not knowledge. 
Many statistically valid models of phenomena are fundamentally incorrect: for example, Ptolomeic astronomy was reputed to be quite accurate for its day, especially compared with the original Copernican hypothesis. The question is, WHY does a model perform the way it does? We have demonstrated (Lachter and Bever, 1988; Bever, 1988) that existing connectionist models learn to simulate rule-governed behavior only insofar as the relevant structures are built into the model or the way it is fed. What would be important to show is that such models could achieve the same performance level and characteristics without structures and feeding schemes which already embody what is to be learned. At the moment, insofar as the models succeed statistically, they confirm the view that language learning presupposes structural hypotheses on the part of the child, and helpful input from the world. The exact performance level and pattern of children or models is of limited importance for another reason: what is at issue is linguistic KNOWLEDGE, not language behavior. There is considerable evidence that the overgeneralization behavior is a speech production error, not an error of linguistic knowledge. Children explicitly know the difference between the way they say the past tense of a verb and the way they ought to say it - the child says 'readed' for the same kind of reason that it says 'puscetti' - overgeneralization in speech production of a statistically valid property of the language. Most significant is the fact that a child knows when an adult teases it by imitating the way it says such words. Whatever the success or failure of an inductive model, it must fail to discover the distinction between structural knowledge, and language behavior, a distinction which every child knows, and a distinction which is vital to understanding both the knowledge and the behavior the child exhibits. In failing to make the distinction, the more a model succeeds at mimicking the behavior, the clearer it becomes that it does NOT acquire the knowledge. The view that a bit of 'knowledge' is simply a 'behavioral generalization' taken to an extreme, begs the question about the representation of the distinction: insofar as it answers the question at all, it gets it wrong. Connectionist models offer a new way to study the role of statistically valid generalizations in the acquisition of complex structures. For example, such models may facilitate the study of how structural hypotheses might be confirmed and extended behaviorally by the data the child receives (Bever, 1988): the models are potentially exquisite analytic engines which can detect subtle regularities in the environment, given a particular representational scheme. We think this may be their ultimate contribution to behavioral science. But they solve the puzzle about the relationship between structure and behavior no more than an adding machine tells us about the relationship between the nature of numbers and how children add and subtract. Tom Bever Joel Lachter Bever or Lachter @psych.prodigal.rochester.edu References: Recent net correspondence between Bates, Elman, Harnad, Marchman, McClelland, Pinker and Prince. Bever. T.G., 1988. The Demons and the Beast - Modular and Nodular kinds of Knowledge. University of Rochester Technical Report, #48; to appear in Georgeopolous, C. and Ishihara, R. (Eds). Interdisciplinary approaches to language. Kluwer, Dordrecht, in press. Lachter J. and Bever, T.G. 
(1988) The relation between linguistic structure and associative theories of language learning -- A constructive critique of some connectionist learning models. Cognition, 28, pp 195-247. b From bondc at iuvax.cs.indiana.edu Sat Sep 10 11:11:14 1988 From: bondc at iuvax.cs.indiana.edu (Clay M Bond) Date: Sat, 10 Sep 88 10:11:14 EST Subject: No subject Message-ID: >We think that the exact performance level and pattern of the model is not >the only test of its validity, for the following reasons: > >1) Such models work only insofar as they presuppose rule-based >structures. > >2) The past-tense overgeneralization errors are errors of behavior not >knowledge. These are not reasons. They are only so if both sides accept that structures are rule-based and that there is some difference between behavior and know- ledge. For those who do not accept these assumptions, you have no test of validity; you cannot evaluate a model if you are making different assumptions. >The exact performance level and pattern of children or models is >of limited importance for another reason: what is at issue is >linguistic KNOWLEDGE, not language behavior. There is >considerable evidence that the overgeneralization behavior is a >speech production error, not an error of linguistic knowledge. Again, there is no such evidence without first assuming that there exists some difference between knowledge and behavior. >representational scheme. We think this may be their ultimate >contribution to behavioral science. But they solve the puzzle >about the relationship between structure and behavior no more >than an adding machine tells us about the relationship between >the nature of numbers and how children add and subtract. Once again, you have made no point above. Your arguments are remarkably similar to SLA projects which start out assuming the existence of UG, present data, and then conclude that UG exists. Whether one takes an agnostic position on these related differentiations, knowledge/behavior, competence/performance, brain/mind, micro/macrocognition is not relevant. What is relevant is that those who insist that these dif- ferentiations exists are obligated to show empirically exactly how they operate, where they reside, and how they map onto actual neurological pro- cesses, something they have conveniently ignored so far. That they must exist is highly debateable, to say the least; this, I think, is possibly the greatest contribution connectionism has offered. Until such time as these things are proven, they will remain religious issues/tenets. Clay Bond Indiana University Department of Linguistics, bondc at iuvax.cs.indiana.edu From hinton at ai.toronto.edu Sat Sep 10 16:58:29 1988 From: hinton at ai.toronto.edu (Geoffrey Hinton) Date: Sat, 10 Sep 88 16:58:29 EDT Subject: Bever's claims Message-ID: <88Sep10.141907edt.681@neat.ai.toronto.edu> In a recent message, Bever claims the following: "We have demonstrated (Lachter and Bever, 1988; Bever, 1988) that existing connectionist models learn to simulate rule-governed behavior only insofar as the relevant structures are built into the model or the way it is fed. What would be important to show is that such models could achieve the same performance level and characteristics without structures and feeding schemes which already embody what is to be learned." This claim irks me since I have already explained to him that there are connectionist networks that really do discover representations that are not built into the initial network. 
One example is the family trees network described in Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986) Learning representations by back-propagating errors. Nature, 323, pages 533-536. I would like Bever to clearly state whether he thinks his "demonstration" applies to this (existing) network, or whether he is simply criticizing networks that lack hidden units. Geoff From jlm+ at andrew.cmu.edu Sat Sep 10 13:35:21 1988 From: jlm+ at andrew.cmu.edu (James L. McClelland) Date: Sat, 10 Sep 88 13:35:21 -0400 (EDT) Subject: Light In-Reply-To: <8809100509.4381@prodigal.psych.rochester.edu> References: <8809100509.4381@prodigal.psych.rochester.edu> Message-ID: It is true that there are different kinds of behavior which we could assess any model with respect to. One kind of task involves language use (production, comprehension); another is language judgement. Many connectionist models to date have addressed performance rather than judgement, but there is no intrinsic reason why judgements cannot be addressed in these models. Indeed, it is becoming standard to use the goodness of match between an expected pattern and an obtained pattern as a measure of tacit knowledge, say, of what should follow what in a sentence. Such errors can be used as the basis for some kinds of judgements. I do not mean to say that connectionists have already shown that their models account for the full range of factors that influence such judgements; but at least many of us take the view (at least implicitly) that the SAME connection information that governs performance can also be used to sustain various types of judgements. With regard to such judgements, at least as far as the past tense is concerned, the facts seem not to fit perfectly with Lachter and Bever's claims. Kuczaj [Child Development, 1978, p. 319] reports data from children aged 3:4 to 9:0. These children made grammaticality judgements of a variety of kinds of past-tense forms. The probability that each type of form was judged correctly is given below from his table 1 on p. 321:

                                            Age Group
                                      Under 5    5&6    7 & up
 Grammatical No-Change verbs (hit)       1.00    1.00     1.00
 Regularized no-change verbs (hitted)     .28     .55      .05
 Grammatical Change verbs (ate)           .84     .94     1.00
*Regularized Change verbs (eated)         .89     .60      .26*
 Past + ed forms for Change verbs (ated)  .26     .57      .23

Marked with asterisks above is the line containing what Lachter and Bever call the overgeneralization error. It will be seen that children of every age group studied found these sorts of forms acceptable at least to some degree. It is particularly clear in the youngest age group that such strings seem highly grammatical. The fact that there are other error types which show a much lower rate of acceptability for this group indicates that the high acceptance rate for the regularized forms is not simply due to a generalized tendency to accept anything in this age group. I do not want to suggest that there is a perfect correlation between performance in judgement tasks and measures obtained from either natural or elicited production data: One of the few things we know for certain is that different tasks elicit differences in performance. However, the data clearly indicate that the child's judgements are actually strikingly similar to the patterns seen in naturalistic regularization data [Kuczaj, 1977, Journal of Verbal Learning and Verbal Behavior, p. 589]. First, the late emergence of "ated" type forms in natural production relative to "eated" type forms is reflected in the judgement data. 
Second, both in production and acceptance, regularized forms of no-change verbs score low relative to regularized forms of other types of exceptions. Kuczaj [78] even went so far as to ask kids what they thought their mothers would say when given a choice between correct, regularized, and past+ed. Their judgements of what they thought their mothers would say were virtually identical to their judgements of what they thought they would say at all age groups. In both kinds of judgements, choice of eated type responses drops monotonically while ated type responses peak in group 2. Jay McClelland From PH706008%BROWNVM.BITNET at VMA.CC.CMU.EDU Sat Sep 10 17:14:50 1988 From: PH706008%BROWNVM.BITNET at VMA.CC.CMU.EDU (PH706008%BROWNVM.BITNET@VMA.CC.CMU.EDU) Date: Sat, 10 Sep 88 17:14:50 EDT Subject: Yann Le Cun's e-mail address Message-ID: Does anyone know Yann Le Cun's e-mail address at the University of Toronto? Thanks in advance. --Charles Bachmann : ph706008 at brownvm Brown University From jlm+ at andrew.cmu.edu Sat Sep 10 17:01:25 1988 From: jlm+ at andrew.cmu.edu (James L. McClelland) Date: Sat, 10 Sep 88 17:01:25 -0400 (EDT) Subject: correction Message-ID: In my reply to Bever, the data in the table are probability that each type of form is judged acceptable. I said "correctly" rather than "acceptable"; actually, judging some of the forms acceptable is an error; of course, it was that children made such errors that was the point of the message. Sorry if I confused anyone. -- Jay From bondc at iuvax.cs.indiana.edu Sat Sep 10 20:28:17 1988 From: bondc at iuvax.cs.indiana.edu (Clay M Bond) Date: Sat, 10 Sep 88 19:28:17 EST Subject: No subject Message-ID: Geoff Hinton: >In a recent message, Bever claims the following: > >"We have demonstrated (Lachter and Bever, 1988; Bever, 1988) that existing >connectionist models learn to simulate rule-governed behavior only insofar as >the relevant structures are built into the model or the way it is fed. What > >This claim irks me since I have already explained to him that there are >connectionist networks that really do discover representations that are not >built into the initial network ... I might say the same. The current project I am working on, along with Elise Breen, though in its infant stages, is an iac acquisition net, and no rele- vant structures, as Bever calls them, were built in. Our results so far are promising, though inconclusive. I do not see, however, why one should expect mentalists to take data into account. They have always scorned data in favor of "intuition". <<<<<<<<<<<<******<<<<<<<<<<<<******>>>>>>>>>>>>******>>>>>>>>>>>> << Clay Bond Indiana University Department of Linguistics >> << ARPA: bondc at iuvax.cs.indiana.edu >> <<<<<<<<<<<<******<<<<<<<<<<<<******>>>>>>>>>>>>******>>>>>>>>>>>> From Dave.Touretzky at B.GP.CS.CMU.EDU Sun Sep 11 10:10:17 1988 From: Dave.Touretzky at B.GP.CS.CMU.EDU (Dave.Touretzky@B.GP.CS.CMU.EDU) Date: Sun, 11 Sep 88 10:10:17 EDT Subject: Yann Le Cun's e-mail address In-Reply-To: Your message of Sat, 10 Sep 88 17:14:50 -0400. Message-ID: <595.589990217@DST.BOLTZ.CS.CMU.EDU> Yann LeCun's address is yann%ai.toronto.edu at relay.cs.net. Please: if you're trying to locate someone's net address, send mail first to connectionsts-request at cs.cmu.edu. The mailing list maintainers will be happy to help you. 
-- Dave From hendler at dormouse.cs.umd.edu Sun Sep 11 11:20:57 1988 From: hendler at dormouse.cs.umd.edu (Jim Hendler) Date: Sun, 11 Sep 88 11:20:57 EDT Subject: more fuel for the fire Message-ID: <8809111520.AA14998@dormouse.cs.umd.edu> While I come down on the side of the connectionists in the recent debates, I think some of our critics, and some of the criticisms of Bever and P&P, do focus on an area that is a weakness of most of the distributed models: it is one thing to learn features/structures/etc., it is another to apply these things appropriately during cognitive processing. While, for example, Geoff's model could be said to have generalized a feature corresponding to `gender', we would be hard pressed to claim that it could somehow make gender-based inferences. The structured connectionists have gone far beyond the distributed when it comes to this. The models, albeit not learned, can make inferences based on probabilities and classifications and the like (cf. Shastri etc.). I believe that it is crucial to provide an explanation of how distributed representations can make similar inferences. One approach, which I am currently pursuing, is to use the weight spaces learned by distributed models as if they were structured networks -- spreading activation among the units and seeing what happens (the results look promising). Other approaches will surely be suggested and pursued. Thus, to reiterate my main point -- the fact that a backprop (or other) model has learned a function doesn't mean diddly until the internal representations built during that learning can be applied to other problems, can make appropriate inferences, etc. To be a cognitive model (and that is what our critics are nay-saying) we must be able to learn, our forte, but also to THINK, a true weakness of many of our current systems. -Jim H. From hinton at ai.toronto.edu Mon Sep 12 13:05:43 1988 From: hinton at ai.toronto.edu (Geoffrey Hinton) Date: Mon, 12 Sep 88 13:05:43 EDT Subject: more fuel for the fire In-Reply-To: Your message of Sun, 11 Sep 88 11:20:57 -0400. Message-ID: <88Sep12.102617edt.98@neat.ai.toronto.edu> The family trees model does make some simple inferences based on the features it has learned. It does the equivalent of inferring that a person's father's wife is the person's mother. Of course, it can only do this for people it knows about, and there are many more inferences that it cannot do. Geoff From alexis at marzipan.mitre.org Mon Sep 12 12:24:07 1988 From: alexis at marzipan.mitre.org (Alexis Wieland) Date: Mon, 12 Sep 88 12:24:07 EDT Subject: The Four-Quadrant Problem Message-ID: <8809121624.AA02190@marzipan.mitre.org.> Well, so much for my quest for a problem which *requires* more than 2 layers. Hopefully this will exhaust the issue ... The conclusions are:

- a 2-layer net of a finite number of threshold units can't do the 4-quad problem to arbitrary accuracy (I'll demonstrate in a moment).
- any arbitrarily large subspace *can* be approximated to arbitrary accuracy with a finite number of nodes.
- with non-hard-threshold units (including sigmoids) and assuming infinite precision (and you *need* that infinite precision) you *can* do the 4-quad problem with 2 layers.

Demonstration that 2-layers of finite numbers of threshold units can't do it: Assume that it can be done. Each node in the first layer creates linear partitions in the plane described by the input space. This finite set of partitions (lines) intersects at a finite number of points. 
Consider a circle centered at the origin which encloses all of these intersections. Assume without loss of generality that quads 1&3 have a greater value than quads 2&4, which is subsequently thresholded by the second layer node. Above the circle, crossing from left to right across the y-axis (or any "don't care" band) must result in a net gain (since quad-2 < quad-1), so the weights connecting those nodes to the output node must sum > 0. Below the circle a similar argument has the same sum < 0. The sum can't be both > and < 0, a contradiction, therefore it can't be done. *BUT* Doing It With Other than Thresholding Units: If the hidden layer's transfer function is f(x) = 0 for x <= 0 and f(x) = x for x > 0, then Ron Chrisley showed me at INNS that you can use a four node hidden layer. Given input weights (w,w), (-w,-w), (w,-w), (-w, w) on the first layer (no thresholds) and weights (w2, w2, -w2, -w2) from the hidden layer to the thresholding output node, you've got it. I would add that you can approximate the semi-linear node to arbitrary accuracy with thresholding units if you only want to go out a finite distance. Also, you're really only using the fact that the transfer function is monotonically increasing -- as is a sigmoid (at least in theory). So if the four nodes in the hidden layer have *any* non-zero threshold you can use the same network and use sigmoid units. Eventually your top node will have sigmoid(x + threshold) - sigmoid(x) as its input (which gets *very* small very fast), but in theory this should work. In conclusion, this is a task that is quite simple with a 3-layer net (put a partition on the x and y axes and XOR their outputs). Instead, you can use an infinite number of nodes in 2 layers or see how well your particular computer deals with numbers like 10**-(10**10**10**...) with sigmoid nodes. In short, while it is clearly cleaner and more compact to do it with 3 layers, it *can* be done in theory with networks (which are not pragmatically realizable) with 2 layers. (A short numerical sketch of the four-node construction appears after the next message.) Alexis Wieland wieland at mitre.arpa From PVR%BGERUG51.BITNET at VMA.CC.CMU.EDU Mon Sep 12 18:04:00 1988 From: PVR%BGERUG51.BITNET at VMA.CC.CMU.EDU (Patrick Van Renterghem / Transputer Lab) Date: Mon, 12 Sep 88 18:04 N Subject: Change of subject (neural network coprocessor boards) References: > The Transputer Lab, Grotesteenweg Noord 2, +32 91 22 57 55 Message-ID: Hello connectionists, I am not a fanatic pro or con of connectionism and neural networks, but I am more interested in their applications than their basics. I have the following questions: * What kind of applications are neural networks used for? I know pattern recognition is a favorite subject, and I would like to know more about specific realizations (and performance, compared to algorithmic information processing), but there must be other application areas ??!!?? How about robotics, expert systems, image processing, ... * What coprocessor boards exist, and what is the price, performance, their drawbacks, advantages, ... Addresses of manufacturers would be appreciated. Thanks in advance, Patrick Van Renterghem, State University of Ghent, Automatic Control Lab, Transputer Lab div., Grotesteenweg Noord 2, B-9710 Ghent-Zwijnaarde, Belgium. P.S.: Companies listening can send me information right away. 
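[A minimal numerical sketch of the four-node construction from Alexis Wieland's message above, for readers who want to check it. Nothing below comes from the original posting: the choice of Python, the weight values w = w2 = 1.0, the sample points, and the function names are illustrative assumptions. Only the architecture follows that message -- one hidden layer of four units with transfer function f(x) = 0 for x <= 0 and f(x) = x for x > 0, input weights (w,w), (-w,-w), (w,-w), (-w,w), and output weights (w2, w2, -w2, -w2) into a hard-thresholding unit.]

    # Numerical check of the two-layer (one hidden layer) four-quadrant construction.

    def f(x):
        # Semi-linear transfer function: 0 for x <= 0, x for x > 0.
        return max(0.0, x)

    def four_quadrant_net(x, y, w=1.0, w2=1.0):
        # Hidden layer: four units with input weights (w,w), (-w,-w), (w,-w), (-w,w).
        h = [f(w * x + w * y),
             f(-w * x - w * y),
             f(w * x - w * y),
             f(-w * x + w * y)]
        # Output layer: weights (w2, w2, -w2, -w2) into a thresholding unit.
        net = w2 * h[0] + w2 * h[1] - w2 * h[2] - w2 * h[3]
        return 1 if net > 0 else 0    # 1 for quadrants 1 & 3, 0 for quadrants 2 & 4

    if __name__ == "__main__":
        for x, y in [(2.0, 3.0), (-1.5, -0.5), (-2.0, 4.0), (0.7, -0.3)]:
            print((x, y), "->", four_quadrant_net(x, y))

[Why it works: the output unit's net input reduces to w*w2*(|x+y| - |x-y|), which is positive exactly when x and y have the same sign, so the hard threshold separates quadrants 1 & 3 from quadrants 2 & 4, as claimed in the message.]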
From steve at psyche.mit.edu Mon Sep 12 14:02:50 1988 From: steve at psyche.mit.edu (Steve Pinker) Date: Mon, 12 Sep 88 14:02:50 edt Subject: Reply to Bates' second note Message-ID: <8809121803.AA15544@ATHENA.MIT.EDU> In her second note, Bates writes as if the argument for "unique qualitative mechanisms" was based entirely on the existence of a U-shaped learning curve. This reduction has the virtue of simplicity, but it bears little resemblance to the actual arguments in the literature, which work from a range of linguistic, psycholinguistic, and developmental evidence. In our paper we discuss a variety of developmental data independent of U-hood that bear on the question of what kinds of mental mechanisms are involved (OLC, pp.139-145). We also examine the qualitative differences between the irregular and regular past-tense systems in some detail (pp.108-125). Of course the issue is still open, but we doubt that the debate is ultimately going to turn on slogan-sized chunks of assertion. Aiming for another reduction, Bates asks, "Does the U-shaped function, then, mean nothing more to P&P than the claim that errors come and go?" What the U-shaped function means to us is "a function that is shaped like a U". You don't get a function shaped like a U merely if "errors come and go". You also need some correct performance around the time when errors come. Otherwise the function (percentage error vs. time) could be monotonically decreasing, not U-shaped. The evidence that Bates first cited against the U-shaped curve was based on a study that had nothing to do with the matter; then comes the terminological dispute. At this point, we'd like to sign off on the round robin. We welcome further inquiries, comments, and reprint requests at our own addresses. Alan Prince: prince at cogito.mit.edu Steven Pinker: steve at psyche.mit.edu From Scott.Fahlman at B.GP.CS.CMU.EDU Mon Sep 12 20:41:22 1988 From: Scott.Fahlman at B.GP.CS.CMU.EDU (Scott.Fahlman@B.GP.CS.CMU.EDU) Date: Mon, 12 Sep 88 20:41:22 EDT Subject: "Layers" Message-ID: I think it would help us all to follow these discussions if, when people want to talk about "N-layer networks" for some N, they would make it clear exactly what they are talking about. Does N refer to layers of units or layers of tunable weights? If layers of units, are we counting only layers of hidden units, or are we including the output layers, or both the input and output layers? I think I've seen at least one paper that uses each of these definitions. Unfortunately, there seems to be no universal agreement on this bit of terminology, and without such agreement it requires a lot of work to figure out from context what is being claimed. Sometimes a researcher will carefully define what he means by "layer" in one message -- I think Alexis Wieland did this -- but then launch into a multi-message discussion spread over a couple of weeks. Again, this makes extra work for people trying to understand the discussion, since it's hard to keep track of who is using what kinds of layers, and it's a pain to go back searching through old messages. Perhaps I'm the only one who is confused by this. Does anyone believe that there *is* a standard or obvious definition for "layer" that we all should adhere to? It would be nice if we could all adopt the same terminology, but it may be too late in this case. 
-- Scott From harnad at Princeton.EDU Mon Sep 12 23:22:23 1988 From: harnad at Princeton.EDU (Stevan Harnad) Date: Mon, 12 Sep 88 23:22:23 edt Subject: On the Care & Feeding of Learning Models Message-ID: <8809130322.AA11408@mind> Tom Bever (bevr at db1.cc.rochester.edu) wrote: > Recent correspondence has focussed on the performance level of the > Rumelhart and McClelland past tense learning model and subsequent > models, under varying conditions of feeding... We have demonstrated > (Lachter and Bever, 1988; Bever, 1988) that existing connectionist > models learn to simulate rule-governed behavior only insofar as the > relevant structures are built into the model or the way it is fed. > What would be important to show is that such models could achieve > the same performance level and characteristics without structures > and feeding schemes which already embody what is to be learned. I don't understand the "feeding" metaphor. If I feed an inductive device data that have certain regularities, along with feedback as to what the appropriate response would be (say, for the sake of simplicity, they are all members of a dichotomy: Category C or Category Not-C), and the device learns to perform the response (here, dichotomization), presumably by inducing the regularities statistically from the data, what is there about this "feeding" regimen that "already embodies what is to be learned" (and hence, presumably, constitutes some sort of cheating)? Rather than cheating, it seems to me that rules that are come by in this way are the wages of "honest toil." Perhaps there is a suppressed "poverty-of-the-stimulus" premise here, to the effect that we are only considering data that are so underdetermined as to make their underlying regularities uninducible (i.e., the data do not sufficiently "embody" their underlying regularities to allow them to be picked up statistically). If this is what Bever has in mind, it would seem that this putative poverty has to be argued for explicitly, on a case by case basis. Or is the problem doubts about whether nets can do nontrivial generalization from their data-sets? But then wouldn't this too have to be argued separately? But objections to "feeding conditions" alone...? Is the objection that nets are being spoon-fed, somehow? How? Trial-and-error-sampling sounds more like doing it the old-fashioned way. Biased samples? Loaded samples? Stevan Harnad 
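[As an aside, the "inductive device" scenario Harnad describes is easy to make concrete. The sketch below is purely illustrative and is not taken from the thread: the hidden regularity defining Category C (x1 + x2 > 1), the perceptron-style update rule, the sample sizes, and the Python phrasing are all assumptions chosen for illustration. It is meant only as a toy instance of a device that induces a dichotomy statistically from data plus corrective feedback.]

    import random

    def in_category_c(x):
        # The regularity hidden in the data; the device is never told this rule,
        # it only receives feedback on individual examples.
        return 1 if x[0] + x[1] > 1.0 else 0

    def train(n_samples=5000, lr=0.1, seed=0):
        random.seed(seed)
        w, b = [0.0, 0.0], 0.0
        for _ in range(n_samples):
            x = [random.uniform(-2.0, 2.0), random.uniform(-2.0, 2.0)]
            feedback = in_category_c(x)                          # the "appropriate response"
            guess = 1 if w[0] * x[0] + w[1] * x[1] + b > 0.0 else 0
            err = feedback - guess                               # trial-and-error correction
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
        return w, b

    if __name__ == "__main__":
        w, b = train()
        test = [[random.uniform(-2.0, 2.0), random.uniform(-2.0, 2.0)] for _ in range(1000)]
        hits = sum(1 for x in test
                   if (1 if w[0] * x[0] + w[1] * x[1] + b > 0.0 else 0) == in_category_c(x))
        print("weights:", w, "bias:", b, "test accuracy:", hits / len(test))

[The device is given only examples and feedback, never the rule for Category C; the learned weights come to approximate the category boundary, which is the "honest toil" induction at issue.]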
From hinton at ai.toronto.edu Tue Sep 13 13:52:16 1988 From: hinton at ai.toronto.edu (Geoffrey Hinton) Date: Tue, 13 Sep 88 13:52:16 EDT Subject: "Layers" In-Reply-To: Your message of Mon, 12 Sep 88 20:41:22 -0400. Message-ID: <88Sep13.111249edt.407@neat.ai.toronto.edu> As Scott points out, the problem is that a net with one hidden layer has:
3 layers of units (including input and output)
2 layers of modifiable weights
1 layer of hidden units.
Widrow has objected (quite reasonably) to calling the input units "units" since they don't have modifiable incoming weights, nor do they have a non-linear I/O function. So that means we will never agree on counting the total number of layers. The number of layers of modifiable weights is unambiguous, but has the problem that most people think of the "neurons" as forming the layers, and also it gets complicated when connections skip layers (of units). Terminology can be made unambiguous by referring to the number of hidden layers. This has a slight snag when the first layer of weights (counting from the input) is unmodifiable, since the units in the next layer are then not true hidden units (they don't learn representations), but we can safely leave it to the purists and flamers to worry about that. I strongly suggest that people NEVER use the term "layers" by itself. Either say "n hidden layers" or say "n+1 layers of modifiable weights". I don't think attempts to legislate in favor of one or the other of these alternatives will work. Geoff From munnari!chook.ua.oz.au!guy at uunet.UU.NET Wed Sep 14 11:22:43 1988 From: munnari!chook.ua.oz.au!guy at uunet.UU.NET (guy smith) Date: Wed, 14 Sep 88 09:22:43 CST Subject: layer terminology Message-ID: <8809140131.AA21292@uunet.UU.NET> re: what does N-layer really mean? I agree with Scott Fahlman that the lack of an accepted meaning for N-layer is confusing. In the context of nets as clearly layered as Back Propagation nets, I think 'N' should refer to the number of layers of weights, which is also the number of layers of non-input nodes. Thus, a 0-layer net makes no sense, a single node is a 1-layer net, and the minimal net that can solve the XOR problem (calculating the parity of two binary inputs) is a 2-layer net. There is at least one rationale for this choice. If an N-layer net uses the outputs of an M-layer net for its input, you end up with an N+M layer net. Yours Pedantically, Guy Smith. From moody-john at YALE.ARPA Tue Sep 13 21:27:55 1988 From: moody-john at YALE.ARPA (john moody) Date: Tue, 13 Sep 88 21:27:55 EDT Subject: network labeling conventions Message-ID: <8809140127.AA08532@NEBULA.SUN3.CS.YALE.EDU> I agree with Scott Fahlman that there is a need for a standard labeling convention for multilayered networks. The convention which I prefer for an "N Layer Network" is diagramed below. Such a network has "(N-1) Internal Layers of Units" and "N Layers of Weights". Each connection has the same layer index as its post-synaptic processing unit. The output units are "Layer N".
The input lines are not enumerated as a layer since they usually do no processing; for consistency, however, the input lines can be identified as "Layer 0". As a matter of style, I think it is confusing to use the same graphic symbols for input lines as for the non-linear processing units, since any operation performed on the input data prior to its arrival at the first layer of connections is really pre-processing and not part of the network computation proper. Along the same lines, it would be useful to use a distinguishing symbol when linear output units are used in networks which perform mappings from R^n to R^m.

Layer N Units (Outputs)     O O O O O O     Activations A^N_n
Layer N Weights                 /|\         Weight Values W^N_nm
                               / | \
Layer N-1 Units             O O O O O O     Activations A^(N-1)_m
Layer N-1 Weights               /|\         Weight Values W^(N-1)_ml
                               / | \
      .                          .
      .                          .
Layer 2 Units               O O O O O O     Activations A^2_k
Layer 2 Weights                 /|\         Weight Values W^2_kj
                               / | \
Layer 1 Units               O O O O O O     Activations A^1_j
Layer 1 Weights                 /|\         Weight Values W^1_ji
                               / | \
Layer 0 (Input Lines)       . . . . . .     Input Activations A^0_i

--John Moody ------- From skrzypek at CS.UCLA.EDU Wed Sep 14 14:59:36 1988 From: skrzypek at CS.UCLA.EDU (Dr Josef Skrzypek) Date: Wed, 14 Sep 88 11:59:36 PDT Subject: "Layers" In-Reply-To: Geoffrey Hinton's message of Tue, 13 Sep 88 13:52:16 EDT <88Sep13.111249edt.407@neat.ai.toronto.edu> Message-ID: <8809141859.AA26284@lanai.cs.ucla.edu> A layer of "neurons" means that all units (cells) in this layer are at the same FUNCTIONAL distance from some reference point, e.g. the input. It is rather simple and unambiguous. There is no need to discriminate against the input units by calling them something other than "units". Think of them as the same type of "neurons" which have a specialized transduction function. For example, photoreceptors transduce photons into electrical signals while other "neurons" transduce neurotransmitter-modulated ionic fluxes into electrical signals. Such an input unit might have many modifiable weights, some from lateral connections, others from the feedback pathways, and one dedicated to the main transduction function. A similar argument can be made for output units or "hidden" units (why do they hide? and from whom?). A layer should refer to "neurons" (units) and not synapses (weights) because it is possible to have multiple synaptic interactions between two layers of "neurons". A layer of units, regardless of their function, is rather unambiguous. Josef From lakoff at cogsci.berkeley.edu Wed Sep 14 15:18:49 1988 From: lakoff at cogsci.berkeley.edu (George Lakoff) Date: Wed, 14 Sep 88 12:18:49 PDT Subject: No subject Message-ID: <8809141918.AA01563@cogsci.berkeley.edu> To: Pinker and Prince From: George Lakoff Re: Representational adequacy and implementability Perhaps it's time to turn the discussion back on P&P and discuss the adequacy of the alternative they advocate. Let us distinguish first between learning and representation. Most generative linguistics involves representation and says nothing about learning. Representations there are constructed by linguists. However, that theory of representation has some deep problems, a couple of which have come up in the discussion of past tenses. Here are two problems: 1. Generative phonology cannot represent prototype structures of the sort Bybee and Slobin described, and which arise naturally -- and hence can be described easily -- in connectionist models.
As for regular cases: If one puts aside learning and concentrates on representation, there is no reason why one could not hand-construct representations of regularities in connectionist networks, so that general principles are represented by patterns of weights. If this is the case, then, on representational grounds, connectionist foundations for linguistics would appear to be more adequate than generative foundations that use symbol-manipulation algorithms. If generative phonologists can represent the irregular cases, then let's see the representations. Moreover, it would seem that if such prototype phenomena cannot be represented generatively, then a Pinker-style learning device, which learns generative representations, should not be able to learn such prototype phenomena, since a learning device can't learn something it can't represent. 2. P&P blithely assume that generative linguistic representations could be implemented in the brain's neural networks. There is reason to doubt this. Generative phonology uses sequentially ordered operations that generate proof-like `derivations'. These sequentially-ordered operations do not occur in real time. (No one claims that they do or should, since that would make psychologically indefensible claims.) The derivations are thought of as being like mathematical proofs, which stand outside of time. Now everything that happens in the brain does happen in real time. The question is: Can non-temporal operations like the rules of generative phonology be implemented in a brain? Can they even be implemented in neural networks at all? If so, what is the implementation like, and would it make any assumptions incompatible with what brains can do? And what happens to the intermediate stages of derivations in such an implementation? Such stages are claimed by generative phonologists to have ``psychological reality'', but if they don't occur in real time, what reality do they have? For P&P to defend generative phonology as a possible alternative, they must show, not just assume, that the nontemporal sequential operations of generative phonology can be implemented, preserving generations, in neural networks operating in real time. I have not heard any evidence coming from P&P on this issue. Incidentally, I have outlined a phonological theory that has no atemporal sequential operations and no derivations, that can state real linguistic generalizations, and that can be implemented in connectionist networks. A brief discussion appears in my paper in the Proceedings of the 1988 Connectionist Summer School, to be published soon by Morgan Kaufmann. Well? Do Pinker and Prince have a demonstration that generative phonology can be implemented in neural networks or not? If the atemporal sequential operations of generative phonology cannot be implemented in the brain's neural networks, that is a very good reason to give up on generative phonology as a cognitively plausible theory. * * * Incidentally, I agree with Harnad that nothing P&P said in their paper has any ultimate consequences for the adequacy of connectionist foundations for linguistics. In fact, on the basis of what I know about both existing generative foundations and possible connectionist foundations, I am much more optimistic about connectionist foundations.
From alexis at marzipan.mitre.org Wed Sep 14 15:17:18 1988 From: alexis at marzipan.mitre.org (Alexis Wieland) Date: Wed, 14 Sep 88 15:17:18 EDT Subject: Layer Conventions Message-ID: <8809141917.AA01447@marzipan.mitre.org.> Maybe I'm just a pessimist, but I think we're always going to have to define what we mean by layers in a specific context and proceed from there. Geoff points out that conventions become muddled when you have "skip level" arcs (which are becoming pretty prevalent at least in our neck of the woods). It gets worse with feedback and downright ugly with random connections or in networks that dynamically change/grow (yes, we're playing with those too). And we all *know* that lateral connections within a layer don't increase the layer count, but what about a laterally connected net with limited feedback (my graphics system starts making simplifying assumptions about now). It really depends on how *you* conceptualize them (or how your graphics draws them). And then what about Hopfield/Boltzmann/Cauchy/... nets which are fully bi-directionally connected? Is that one very connected layer or a separate layer per node; and what if it has input/output from somewhere else? "Layers" are nice intuitive constructs which are enormously helpful in describing nets, but (following the INNS president's speaking style) it's rather like good and evil: we all know what they are until we have to give precise definitions. I have a sinking feeling that we will always be the field that can't agree on how to count. alexis. wieland at MITRE.arpa From pratt at paul.rutgers.edu Wed Sep 14 15:51:22 1988 From: pratt at paul.rutgers.edu (Lorien Y. Pratt) Date: Wed, 14 Sep 88 15:51:22 EDT Subject: Updated schedule for fall Rutgers Neural Network colloquium series Message-ID: <8809141951.AA01417@zztop.rutgers.edu> Fall, 1988 Neural Networks Colloquium Series at Rutgers The field of Neural networks (Connectionism, Parallel Distributed Processing) has enjoyed a resurgence in recent years, and has important implications for computer scientists interested in artificial intelligence, parallel processing, and other areas. This fall, the Rutgers Department of Computer Science is hosting several researchers in the field of neural networks to talk about their work. Talks are open to the public, and will be held on Fridays at 11:10 in the 7th floor lounge of Hill Center on the Busch campus of Rutgers University. Refreshments will be served beforehand and we hope most speakers will be available for informal discussion over lunch afterwards. Our tentative schedule follows. The schedule will no doubt change throughout the semester; the latest version can always be found in paul.rutgers.edu:/grad/u4/pratt/Colloquia/schedule or aramis.rutgers.edu:/aramis/u1/pratt/Colloquia/schedule. In addition, abstracts for each talk will be posted locally.

Speaker            Date (tentative)   Title
-------            ----------------   -----------------
Sara Solla         9/16/88            Learning and Generalization in Layered
  Bell Labs                           Neural Networks
David Touretzky    9/23/88            What is the relationship between
  CMU                                 connectionist and symbolic models? What
                                      can we expect from a connectionist
                                      knowledge representation?
Steve Hanson       9/30/88            Some comments and variations on back
  Bellcore                            propagation
Hector Sussmann    10/14/88           The theory of Boltzmann machine
  Rutgers Math                        learning
Josh Alspector     10/21/88           Neural network implementations in
  Bellcore                            hardware
Mark Jones         11/11/88           Knowledge representation in
  Bell Labs                           connectionist networks, including
                                      inheritance reasoning and default logic.
E. Tzanakou        11/18/88           --unknown--
  Rutgers biomed
Bob Allen          12/2/88            A neural network which uses language
  Bellcore

From terry at cs.jhu.edu Thu Sep 15 00:50:11 1988 From: terry at cs.jhu.edu (Terry Sejnowski ) Date: Thu, 15 Sep 88 00:50:11 edt Subject: "Layers" Message-ID: <8809150450.AA01503@crabcake.cs.jhu.edu> Let's not lock ourselves into a terminology that applies to only a special case -- feedforward nets. When feedback connections are allowed the relationships between the units become more complex. For example, in Pineda's recurrent backprop algorithm, the distinction between input and output units is blurred -- the same unit can have both roles. The only distinction that remains is that of hidden units, and the number of synapses separating a given hidden unit from a given input or output unit. The topologies can get quite complex. Terry ----- From pratt at paul.rutgers.edu Fri Sep 16 09:55:37 1988 From: pratt at paul.rutgers.edu (Lorien Y. Pratt) Date: Fri, 16 Sep 88 09:55:37 EDT Subject: David Touretzky on connectionist vs. symbolic models, knowledge rep. Message-ID: <8809161355.AA01444@zztop.rutgers.edu> Fall, 1988 Neural Networks Colloquium Series at Rutgers presents David Touretzky Carnegie-Mellon University Room 705 Hill center, Busch Campus Friday September 23, 1988 at 11:10 am Refreshments served before the talk Abstract My talk will explore the relationship between connectionist models and symbolic models, and ask what sort of things we should expect from a connectionist knowledge representation. In particular I'm interested in certain natural language tasks, like prepositional phrase attachment, which people do rapidly and unconsciously but which involve complicated inferences and a huge amount of world knowledge. From hi.pittman at MCC.COM Fri Sep 16 12:01:00 1988 From: hi.pittman at MCC.COM (James Arthur Pittman) Date: Fri, 16 Sep 88 11:01 CDT Subject: "Layers" In-Reply-To: <8809150450.AA01503@crabcake.cs.jhu.edu> Message-ID: <19880916160115.0.PITTMAN@DIMEBOX.ACA.MCC.COM> Could you give a reference for Pineda's recurrent backprop algorithm? Sounds interesting. And by the way, what's a crabcake? From jam at bu-cs.bu.edu Fri Sep 16 14:22:16 1988 From: jam at bu-cs.bu.edu (Jonathan Marshall) Date: Fri, 16 Sep 88 14:22:16 EDT Subject: 1988 Tech Report Message-ID: <8809161822.AA25086@bu-cs.bu.edu> The following material is available as Boston University Computer Science Department Tech Report #88-010. It may be obtained from rmb at bu-cs.bu.edu or by writing to Regina Blaney, Computer Science Dept., Boston Univ., 111 Cummington St., Boston, MA 02215, U.S.A. I think the price is $7.00. ----------------------------------------------------------------------- SELF-ORGANIZING NEURAL NETWORKS FOR PERCEPTION OF VISUAL MOTION Jonathan A. Marshall ABSTRACT The human visual system overcomes ambiguities, collectively known as the aperture problem, in its local measurements of the direction in which visual objects are moving, producing unambiguous percepts of motion. A new approach to the aperture problem is presented, using an adaptive neural network model. The neural network is exposed to moving images during a developmental period and develops its own structure by adapting to statistical characteristics of its visual input history. Competitive learning rules ensure that only connection ``chains'' between cells of similar direction and velocity sensitivity along successive spatial positions survive.
The resultant self-organized configuration implements the type of disambiguation necessary for solving the aperture problem and operates in accord with direction judgments of human experimental subjects. The system not only accommodates its structure to long-term statistics of visual motion, but also simultaneously uses its acquired structure to assimilate, disambiguate, and represent visual motion events in real-time. ------------------------------------------------------------------------ I am now at the Center for Research in Learning, Perception, and Cognition, 205 Elliott Hall, University of Minnesota, Minneapolis, MN 55414. I can still be reached via my account jam at bu-cs.bu.edu . --J.A.M. From Dave.Touretzky at B.GP.CS.CMU.EDU Fri Sep 16 21:02:19 1988 From: Dave.Touretzky at B.GP.CS.CMU.EDU (Dave.Touretzky@B.GP.CS.CMU.EDU) Date: Fri, 16 Sep 88 21:02:19 EDT Subject: Layers In-Reply-To: Your message of Fri, 16 Sep 88 14:28:00 -0400. <590437703/mjw@F.GP.CS.CMU.EDU> Message-ID: <1053.590461339@DST.BOLTZ.CS.CMU.EDU> > From: Michael.Witbrock at F.GP.CS.CMU.EDU > Let the distance between two units be defined as the *minimal* number of > modifiable weights forming a path between them (i.e. the number of > weights on the shortest path between the two nodes) . > Then the Layer in which a unit lies is the minimal distance between it > and an input unit. I think you meant to use MAXIMAL distance in the definition of which layer a unit lies in. If one uses minimal distance, then in a net with direct connections from input to output, the output layer would always be layer 1, even if there were hidden units forming layers 2, 3, etc. For this definition to make sense, it should always be the case that if unit i has a connection to unit j, then Layer(i) <= Layer(j). > The number of layers in the network is the maximum value of the distance between any unit and an input unit. We should tighten this up by specifying that it's ONE PLUS the maximum distance between any unit and an input unit, EXCLUDING CYCLES. This definition is fine for feed-forward nets, but it isn't very satisfying for recurrent nets like Pineda's. Imagine a recurrent backprop net in which every unit was connected to every other. If such a net has N units, then by Michael's definition it has N layers. What's really strange is that layers 1 through N-1 are empty, and layer N has N units in it. The notion of layers is just not as useful in recurrent networks. It is perhaps better to speak in terms of modules. A module might be defined as a set of units with similar connectivity patterns, or as a set of units that are densely connected to each other and less densely connected to units in other modules. This isn't a nice, clean, graph-theoretic definition, but then whoever said life was as simple as graph theory? -- Dave From todd at galadriel.STANFORD.EDU Fri Sep 16 21:09:08 1988 From: todd at galadriel.STANFORD.EDU (Peter Todd) Date: Fri, 16 Sep 88 18:09:08 PDT Subject: Layers In-Reply-To: Your message of Fri, 16 Sep 88 14:28:00 EDT. <590437703/mjw@F.GP.CS.CMU.EDU> Message-ID: I would, in fact, argue AGAINST that definition, because, for instance, in the following example:

 O   O
/|\  |\
/ | \| \
| O  O |
\ | /| /
 \|/ |/
  O  O

(O's are units, all rest are connections) we have a TWO layer network (max. number of weights from input units to any other units) and yet EVERY non-input unit is in the FIRST layer (min. number of weights from input unit to any unit). Seems pretty counterintuitive.
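To make the two counting conventions above concrete, here is a minimal sketch in Python (the editor's illustration, not code from any of the posted messages; the connectivity of the example net is only one reading of the picture above, and all names are made up):

# A unit's layer under the two definitions discussed above: the minimal
# vs. the maximal number of weight-layers on a path from an input unit.
# The net is a dict mapping each unit to the units it feeds into;
# assumes a feed-forward (acyclic) net.
def layer_indices(connections, inputs, use_max=True):
    units = set(inputs)
    for pre, posts in connections.items():
        units.add(pre)
        units.update(posts)
    layer = {u: 0 for u in inputs}
    choose = max if use_max else min
    for _ in range(len(units)):          # enough passes for any acyclic net
        for pre, posts in connections.items():
            if pre not in layer:
                continue
            for post in posts:
                old = [layer[post]] if post in layer else []
                layer[post] = choose(old + [layer[pre] + 1])
    return layer

# Hypothetical net in the spirit of the example: every non-input unit
# receives a direct connection from an input, but the outputs can also
# be reached through the middle units.
net = {"i1": ["m1", "m2", "o1"], "i2": ["m2", "o2"],
       "m1": ["o1"], "m2": ["o1", "o2"]}
print(layer_indices(net, ["i1", "i2"], use_max=True))   # outputs land in layer 2
print(layer_indices(net, ["i1", "i2"], use_max=False))  # every non-input unit in layer 1

Under the max-distance reading this net is two weight-layers deep; under the min-distance reading every non-input unit sits in layer 1, which is the counterintuitive outcome being pointed out.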
--peter todd From kawahara at av-convex.ntt.jp Sat Sep 17 09:47:31 1988 From: kawahara at av-convex.ntt.jp (Hideki KAWAHARA) Date: Sat, 17 Sep 88 22:47:31+0900 Subject: I love BP. But your time is over. (News from Japan). Message-ID: <8809171347.AA25230@av-convex.NTT.jp> I love BP. But, your time is over. I presented our new method for designing feedforward artificial neural networks, which can approximate an arbitrary continuous mapping from an n-dimensional hyper-cube to m-dimensional space, at the IEICE-Japan technical meeting on 16/Sept./1988. The method SPAN (Saturated Projection Algorithm for Neural network design) can incorporate a priori knowledge of the mapping to be approximated. The computational steps required for training a specific network are a few hundredths or thousandths of those required by conventional BP procedures. SPAN, I hope, will replace a considerable amount of the thoughtless application of BP that is usually found in the current feverish atmosphere of neuro-computing in Japan. I also hope this will let researchers turn their attention to the more essential problems (representations, dynamics, associative memory, inference....and so on). This doesn't mean that SPAN covers BP completely. Instead, SPAN is cooperative with BP, LVQ by Kohonen, and many other neural algorithms. Only thoughtless applications will be discouraged. The IEICE technical report is already available on request. However, it is written in Japanese. An elaborated English version will be available by the end of this year. If you are interested in our report, please mail to the address given at the end of this mail. References: [1]Kawahara, H. and Irino, T.: "A Procedure for Designing 3-Layer Neural Networks Which Approximate Arbitrary Continuous Mapping: Applications to Pattern Processing," PRU88-54 IEICE Technical Report, Vol.88, No.177, pp.47-54, (Sept.1988). (in Japanese) This is the report mentioned above. Sorry for using the ambiguous term "3-Layer"; networks designed by SPAN have one hidden layer with two adjustable weight layers. [2]Irie, B. and Miyake, S.: "Capabilities of Three-layered Perceptrons," ICNN88, pp.I-641-648, (1988). [3]Funahashi, K.: "On the Capabilities of Neural Networks," MBE88-52 IEICE Technical Report, pp.127-134, (July 1988). (in Japanese). These are useful for understanding SPAN. [2] gives an explicit algorithm for designing neural networks which can approximate an arbitrary continuous mapping. [3] provides mathematical proofs of the statements given in [2]. However, these results provide no practical implementations. [3] has been submitted to the INNS journal. ---------------------------------------------------------------- Reports presented at the IEICE technical meeting on Pattern Recognition and Understanding, 16/Sept./1988, Tokyo, Japan. --excerpts ------- Special session on Pattern Recognition and Neural Networks (3)PRU88-50:"On the Learning Network, Stochastic Vector Machine," Saito, H. and Ejima, T., Nagaoka University of Technology. (4)PRU88-51:"An Order Estimation of Stochastic Process Model using Neural Networks," Ohtsuki, N., Kushiro National College of Technology, Miyanaga, Y. and Tochinai, K., Faculty of Engineering, Hokkaido University, and Ito, H., Kushiro National College of Technology. (5)PRU88-52:"Selection of High-Accurate Spectra using Neural Model," Hiroshige, M., Miyanaga, Y. and Tochinai, K., Faculty of Engineering, Hokkaido University. (6)PRU88-53:"Stereo Disparity Detection with Neural Network Model," Maeda, E. and Okudaira, M., NTT Human Interface Laboratories.
(7)PRU88-54:"A Procedure for Designing 3-Layer Neural Networks which Approximate Arbitrary Continuous Mapping: Applications to Pattern Processing, Kawahara, H. and Irino, T., NTT Basic Research Laboratories. (8)PRU88-55:"Speaker-Independent Word Recognition using Dynamic Programming Neural Networks," Isotani, R. Yoshida, K. Iso, K. Watanabe, T. and Sakoe, H., C&C Information Technology Res. Labs. NEC Corporation. (9)PRU88-56:"Character Recognition by Neuro Pattern Matching," Tsukui, Y. and Hirai, Y., Univ. of Tsukuba. (10)PRU88-57:"Recognition of Hand-written Alphanumeric Characters by Artificial Neural Network," Koda, T. Takagi, H. and Shimeki, Y., Central Research Laboratories Matsushita Electric Industrial Co., Ltd. (11)PRU88-58:"Character Recognition using Neural Network," Yamada, K. Kami, H. Mizoguchi, M. and Temma, T., C&C Information Technology Research Laboratories, NEC Corporation. (12)PRU88-59:"Aiming at a Large Scale Neural Network," Mori, Y., ATR Auditory and Visual Perception Research Laboratories. These are all written in Japanese. If you can read Japanese, you can order these technical reports to the following address. -------- The Institute of Electronics, Information and Communication Engineers Kikai-shinko-Kaikan Bldg., 5-8, Shibakoen 3 chome, Minato-ku, TOKYO, 105 JAPAN. ------- The price will be about $8.00 (please add postage (about $5.00 ??)). You can also find some of the authors listed above at the NIPS meeting in Denver. (Speaking for myself, I'd like to attend it. However the budgetary conditions.......) Next month, we have several annual meetings with special sessions on neural networks. I'll report them in the near future. Hideki Kawahara --------------------------------------------------------- e-mail: kawahara%nttlab.ntt.JP at RELAY.CS.NET (from ARPA) s-mail: Hideki Kawahara Information Science Research Laboratory NTT Basic Research Laboratories 3-9-11, Midori-cho, Musashino, TOKYO, 180 JAPAN. tel: +81 422 59 2276 fax: +81 422 59 3016 --------------------------------------------------------- From moody-john at YALE.ARPA Mon Sep 19 15:12:57 1988 From: moody-john at YALE.ARPA (john moody) Date: Mon, 19 Sep 88 15:12:57 EDT Subject: Speedy Alternatives to Back Propagation Message-ID: <8809191913.AA02990@NEBULA.SUN3.CS.YALE.EDU> At Yale, we have been studying two classes of neurally- inspired learning algorithms which offer 1000-fold speed increases over back propagation for learning real-valued functions. These algorithms are "Learning with localized receptive fields" and "An interpolating, multi-resolution CMAC", where CMAC means Cerebellar Model Articulation Con- troller. Both algorithms were presented in talks entitled "Speedy Alternatives to Back Propagation" given at Snowbird (April '88), nEuro '88 (Paris, June '88), and INNS (Boston, September '88). A research report describing the localized receptive fields approach is now available. Another research report describing the CMAC models will be available in about two weeks. To receive copies of these, please send a request to Judy Terrell at terrell at yalecs.bitnet, terrell at yale.arpa, or terrell at cs.yale.edu. Be sure to include your mailing address. There is no charge for the research reports, and they are written in English! An abstract follows. 
--John Moody Learning with Localized Receptive Fields John Moody and Christian Darken Yale Computer Science Department PO Box 2158 Yale Station, New Haven, CT 06520 Research Report YALEU/DCS/RR-649 September 1988 Abstract We propose a network architecture based upon localized receptive field units and an efficient method for training such a network which combines self-organized and supervised learning. The network architecture and learning rules are appropriate for real-time adaptive signal processing and adaptive control. For a test problem, predicting a chaotic timeseries, the network learns 1000 times faster in digital simulation time than a three layer perceptron trained with back propagation, but requires about ten times more training data to achieve comparable prediction accuracy. This research report will appear in the Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann, Publishers 1988. The work was supported by ONR grant N00014-86-K-0310, AFOSR grant F49620-88-C0025, and a Purdue Army subcontract. ------- From sontag at fermat.rutgers.edu Mon Sep 19 15:27:23 1988 From: sontag at fermat.rutgers.edu (Eduardo Sontag) Date: Mon, 19 Sep 88 15:27:23 EDT Subject: Recent abstracts Message-ID: <8809191927.AA00586@control.rutgers.edu> I enclose abstracts of some recent technical reports. (Ignore the skip numbers; the rest are not in any manner related to NN's.) ***Any suggestions as to which journal to send 88-08 to*** would be highly appreciated. (There don't appear to be any journals geared towards very mathematical papers in NN's, it would seem.) ________________________________________________________________________ ABSTRACTS OF SYCON REPORTS Rutgers Center for Systems and Control Hill Center, Rutgers University, New Brunswick, NJ 08903 E-mail: sycon at fermat.rutgers.edu [88-01] Two algorithms for the Boltzmann machine: description, implementation, and a preliminary comparative study of performance , Lorien Y. Pratt and H J. Sussmann, July 88. This report compares two algorithms for learning in neural networks: the Boltzmann and modified Boltzmann machines. The Boltzmann machine has been extensively studied in the past; we have recently developed the modified Boltzmann machine. We present both algorithms and discuss several considerations which must be made for their implementation. We then give a complexity analysis and preliminary empirical comparison of the two algorithms' learning ability on a benchmark problem. For this problem, the modified Boltzmann machine is shown to learn slightly slower than the Boltzmann machine. However, the modified algorithm does not require the user to build an annealing schedule to be used for training. Since building this schedule constitutes a significant amount of the engineering time for the Boltzmann algorithm, we feel that our modified algorithm may be superior to the classical one. Since we have not yet performed a rigorous comparison of the two algorithms' performance, it may also be possible to optimize the parameters to the modified algorithm so that the learning speed is comparable to the classical version. [88-02] Some remarks on the backpropagation algorithm for neural net learning , Eduardo D. Sontag, July 88. (13 pages.) This report contains some remarks about the backpropagation method for neural net learning. We concentrate in particular in the study of local minima of error functions and the growth of weights during learning. [88-03] On the convergence of learning algorithms for Boltzmann machines , H J. 
Sussmann, July 88. (46 pages.) We analyze a learning algorithm for Boltzmann machines, based on the usual alternation between ``learning'' and ``hallucinating'' phases. We prove rigorously that, for suitable choices of the parameters, the evolution of the weights follows very closely, with very high probability, an integral trajectory of the gradient of the likelihood function whose global maxima are exactly the desired weight patterns. An abridged version of this report will appear in the Proceedings of the 27th IEEE Conference on Decision and Control, December 1988. [88-08] Backpropagation can give rise to spurious local minima even for networks without hidden layers , Eduardo D. Sontag and H J. Sussmann, Sept 88. (15 pages.) We give an example of a neural net without hidden layers and with a sigmoid transfer function, and a corresponding training set of binary vectors, for which the sum of the squared errors, regarded as a function of the weights, has a local minimum which is not a global minimum. From Roni.Rosenfeld at B.GP.CS.CMU.EDU Mon Sep 19 16:35:55 1988 From: Roni.Rosenfeld at B.GP.CS.CMU.EDU (Roni.Rosenfeld@B.GP.CS.CMU.EDU) Date: Mon, 19 Sep 88 16:35:55 EDT Subject: Notes from your friendly CONNECTIONISTS mailing list maintainer Message-ID: <8588.590704555@RONI.BOLTZ.CS.CMU.EDU> Fellow neurons, Autumn is here, and with it - a new academic year. This means many of you will be changing your e-mail address, which in turn means we will be receiving many error messages for every message posted to CONNECTIONISTS. To help us deal with the expected mess, we ask that you observe the following: - If your old address is about to be disabled, please notify us promptly so that we may remove it from the list. - Please check that your new or forwarding address works well before you report it to us. If it does not work for us, we will have no way of contacting you and will have to remove you from the list. Thank you for your cooperation. While I'm at it, here's some more: To keep the traffic on the CONNECTIONISTS mailing list to a minimum, we ask that you take special care not to send inappropriate mail to the list. - Requests for addition to the list, change of address and other administrative matters should be sent to: "connectionists-request at cs.cmu.edu" (note the exact spelling: many "connectionists", one "request"). If you mention our mailing list to someone who may apply to be added to it, please make sure they use the above and NOT "connectionists at cs.cmu.edu". - Requests for e-mail addresses of people who are believed to subscribe to CONNECTIONISTS should be sent to postmaster at appropriate-site. If the site address is unknown, send your request to "connectionists-request at cs.cmu.edu" and we'll do our best to help. A phone call to the appropriate institution may sometimes be simpler and faster. - Note that in many mail programs a reply to a message is automatically "CC"-ed to all the addresses on the "To" and "CC" lines of the original message. If the mailer you use has this property, please make sure your personal response (request for a Tech Report etc.) is NOT broadcast over the net.
Roni Rosenfeld connectionists-request at cs.cmu.edu From laura%suspicion.Princeton.EDU at Princeton.EDU Tue Sep 20 11:44:29 1988 From: laura%suspicion.Princeton.EDU at Princeton.EDU (Laura Hawkins) Date: Tue, 20 Sep 88 11:44:29 EDT Subject: Princeton University Cognitive Studies Talk Message-ID: <8809201544.AA00762@suspicion.Princeton.EDU> TITLE: Connectionist Language Users SPEAKER: Robert Allen, Bell Communications Research DATE: September 26 LOCATION: Princeton University Langfeld Lounge, Green Hall Corner of Washington Road and Williams Street TIME: Noon ABSTRACT: An important property of neural networks is their ability to integrate various sources of information through activation values. By presenting both "verbal" and "perceptual" codes a sequential back-propagation network may be trained to "use language." For instance, networks can answer questions about objects that appear in a perceptual microworld. Moreover, this paradigm handles many problems of reference, such as pronoun anaphora, quite naturally. Thus this approach, which may be termed Connectionist Language Users (CLUs), introduces a computational linguistics that is holistic. Extensions to be discussed include the use of relative clauses, action verbs, grammars, planning in a blocks world, and multi-agent "conversations." From pratt at paul.rutgers.edu Tue Sep 20 14:16:04 1988 From: pratt at paul.rutgers.edu (Lorien Y. Pratt) Date: Tue, 20 Sep 88 14:16:04 EDT Subject: Stephen Hanson to speak on back propagation at Rutgers Message-ID: <8809201816.AA05749@zztop.rutgers.edu> Fall, 1988 Neural Networks Colloquium Series at Rutgers Some comments and variations on back propagation ------------------------------------------------ Stephen Jose Hanson Bellcore Cognitive Science Lab, Princeton University Room 705 Hill center, Busch Campus Friday September 30, 1988 at 11:10 am Refreshments served before the talk Abstract Backpropagation is presently one of the most widely used learning techniques in connectionist modeling. Its popularity, however, is beset with many criticisms and concerns about its use and potential misuse. There are 4 sorts of criticisms that one hears: (1) it is a well known statistical technique (least squares) (2) it is ignorant (3) it is slow--(local minima, its NP complete) (4) it is ad hoc--hidden units as "fairy dust" I believe these four types of criticisms are based on fundamental misunderstandings about the use and relation of learning methods to the world, the relation of ontogeny to phylogeny, the relation of simple neural models to neuroscience and the nature of "weak" learning theories. I will discuss these issues in the context of some variations on backpropagation. From Alex.Waibel at SPEECH2.CS.CMU.EDU Thu Sep 22 13:06:41 1988 From: Alex.Waibel at SPEECH2.CS.CMU.EDU (Alex.Waibel@SPEECH2.CS.CMU.EDU) Date: Thu, 22 Sep 88 13:06:41 EDT Subject: Scaling in Neural Nets Message-ID: Below the abstract to a paper describing our recent research addressing the problem of scaling in neural networks for speech recognition. We show that by exploiting the hidden structure (previously learned abstractions) of speech in a modular way and applying "conectionist glue", larger more complex networks can be constructed at only small additional cost in learning time and complexity. Resulting recognition performance is as good or better than comparable monolithically trained nets and as good as the smaller network modules. This work was performed at ATR Interpreting Telephony Research Laboratories, in Japan. 
I am now working at Carnegie Mellon University, so you may request copies from me here or directly from Japan. From CMU: Dr. Alex Waibel Computer Science Department Carnegie-Mellon University Pittsburgh, PA 15213 phone: (412) 268-7676 email: ahw at speech2.cs.cmu.edu From Japan, please write for technical report TR-I-0034 (with CC to me), to: Ms. Kazumi Kanazawa ATR Interpreting Telephony Research Laboratories Twin 21 MID Tower, 2-1-61 Shiromi, Higashi-ku, Osaka, 540, Japan email: kddlab!atr-la.atr.junet!kanazawa at uunet.UU.NET Please CC to: ahw at speech2.cs.cmu.edu ------------------------------------------------------------------------- Modularity and Scaling in Large Phonemic Neural Networks Alex Waibel, Hidefumi Sawai, Kiyohiro Shikano ATR Interpreting Telephony Research Laboratories ABSTRACT Scaling connectionist models to larger connectionist systems is difficult, because larger networks require increasing amounts of training time and data and the complexity of the optimization task quickly reaches computationally unmanageable proportions. In this paper, we train several small Time-Delay Neural Networks aimed at all phonemic subcategories (nasals, fricatives, etc.) and report excellent fine phonemic discrimination performance for all cases. Exploiting the hidden structure of these smaller phonemic subcategory networks, we then propose several techniques that allow us to "grow" larger nets in an incremental and modular fashion without loss in recognition performance and without the need for excessive training time or additional data. These techniques include {\em class discriminatory learning, connectionist glue, selective/partial learning and all-net fine tuning}. A set of experiments shows that stop consonant networks (BDGPTK) constructed from subcomponent BDG- and PTK-nets achieved up to 98.6% correct recognition compared to 98.3% and 98.7% correct for the component BDG- and PTK-nets. Similarly, an incrementally trained network aimed at {\em all} consonants achieved recognition scores of 95.9% correct. These results were found to be comparable to the performance of the subcomponent networks and significantly better than several alternative speech recognition strategies. From jordan at psyche.mit.edu Thu Sep 22 15:40:05 1988 From: jordan at psyche.mit.edu (Michael Jordan) Date: Thu, 22 Sep 88 15:40:05 edt Subject: Scaling in Neural Nets Message-ID: <8809221941.AA05878@ATHENA.MIT.EDU> Would you please send me a copy? Michael I. Jordan E10-034C MIT Cambridge, MA 02139 From kawahara at av-convex.ntt.jp Thu Sep 22 12:41:53 1988 From: kawahara at av-convex.ntt.jp (Hideki KAWAHARA) Date: Fri, 23 Sep 88 01:41:53+0900 Subject: A beautiful subset of SPAN, Radial Basis Function. Message-ID: <8809221641.AA07190@av-convex.NTT.jp> Things are changing very rapidly. I visited the ATR labs the day before yesterday. I had an inspiring discussion with Funahashi, Irie and the other ATR researchers on the SPAN concepts. Funahashi finally gave me a copy of a paper on Radial Basis Functions by Broomhead of the RSRE. I read it in the super express SHINKANSEN from OSAKA to TOKYO. To my surprise, the RBF concept was a beautiful and useful subset of SPAN. You can benefit from an important conceptual framework for neural network design by reading the RBF paper. It is written in English. :-) I'm certain now that the NIPS conference will be remembered as a turning point from BP to the next generation of algorithms, because I'm sure the presentation by Bridle will be on the RBF method.
I'm also trying to attend the NIPS conference. Reference: D.S. Broomhead and D.Lowe: "Radial Basis Functions, Multi-Variable Functional Interpolation and Adaptive Networks," RSRE Memorandum No.4148, Royal Signals & Radar Establishment, (RSRE Malvern, WORCS.) ---- Hideki Kawahara PS: Copies of my report are free of charge; my previous mail was somewhat misleading. PPS: The English version of our report on SPAN will be available within a month. PPPS: The IEICE office won't deliver its technical reports to foreign countries. However, if you really need some of them, I think I can provide some assistance. ---------------------------------------------- e-mail: kawahara%nttlab.ntt.jp at RELAY.CS.NET Hideki Kawahara NTT Basic Research Labs. 3-9-11 Midori-cho Musashino, TOKYO 180, JAPAN ---------------------------------------------- From netlist at psych.Stanford.EDU Fri Sep 23 09:19:12 1988 From: netlist at psych.Stanford.EDU (Mark Gluck) Date: Fri, 23 Sep 88 06:19:12 PDT Subject: Stanford Adaptive Networks Colloquium Message-ID: Stanford University Interdisciplinary Colloquium Series: Adaptive Networks and their Applications Oct. 4th (Tuesday, 3:15pm) ************************************************************************** Connectionist Prediction Systems: Relationship to Least-Squares Estimation and Dynamic Programming RICHARD S. SUTTON GTE Laboratories Incorporated 40 Sylvan Road Waltham, MA 02254 ************************************************************************** - Abstract - In this talk I will present two examples of productive interplay between connectionist machine learning and more traditional engineering areas. The first concerns the problem of learning to predict time series. I will briefly review previous approaches including least squares linear estimation and the newer nonlinear backpropagation methods, and then present a new class of methods called Temporal-Difference (TD) methods. Whereas previous methods are driven by the error or difference between predictions and actual outcomes, TD methods are similarly driven by the difference between temporally successive predictions. This idea is also the key idea behind the learning in Samuel's checker player, in Holland's bucket brigade, and in Barto, Sutton & Anderson's pole-balancer. TD methods can be more efficient computationally because their errors are available immediately after the predictions are made, without waiting for a final outcome. More surprisingly, they can also be more efficient in terms of how much data is needed to achieve a particular level of accuracy. Formal results will be presented concerning the computational complexity, convergence, and optimality of TD methods. Possible areas of application of TD methods include temporal pattern recognition such as speech recognition and weather forecasting, the learning of heuristic evaluation functions, and learning control. Second, I would like to present work on the theory of TD methods used in conjunction with reinforcement learning techniques to solve control problems. ************************************************************************** Location: Room 380-380W, which can be reached through the lower level between the Psychology and Mathematical Sciences buildings. Technical Level: These talks will be technically oriented and are intended for persons actively working in related areas. They are not intended for the newcomer seeking general introductory material.
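Since the abstract above describes TD methods only in words, here is a minimal sketch of the core idea -- a prediction is updated toward the next prediction rather than waiting for the final outcome. This is the editor's illustration of a plain TD(0) update with linear predictions, not code from the talk; the feature vectors, step size, and function name are all made up.

import numpy as np

def td0_episode(features, outcome, w, alpha=0.1):
    # features: one feature vector per time step; outcome: the final result.
    # The prediction at step t is w . x_t.  Its target is the *next*
    # prediction, except at the last step, where it is the actual outcome.
    for t in range(len(features)):
        x_t = features[t]
        prediction = w @ x_t
        if t + 1 < len(features):
            target = w @ features[t + 1]   # temporally successive prediction
        else:
            target = outcome               # the outcome is needed only once
        w = w + alpha * (target - prediction) * x_t
    return w

# toy usage: 50 episodes of 3 steps with 4 features each, all ending in 1.0
rng = np.random.default_rng(0)
w = np.zeros(4)
for _ in range(50):
    w = td0_episode(rng.random((3, 4)), 1.0, w)

The point is visible in the loop: the error for step t is available as soon as step t+1 is observed, which is why a TD update does not have to wait for the final outcome.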
Information: To be added to the network mailing list, netmail to netlist at psych.stanford.edu For additional information, contact Mark Gluck (gluck at psych.stanford.edu). Upcoming talks: Nov. 22: Mike Jordan (MIT) Dec. 6: Ralph Linsker (IBM) * * * Co-Sponsored by: Departments of Electrical Engineering (B. Widrow) and Psychology (D. Rumelhart, M. Pavel, M. Gluck), Stanford Univ. From kawahara at av-convex.ntt.jp Sun Sep 25 01:29:21 1988 From: kawahara at av-convex.ntt.jp (Hideki KAWAHARA) Date: Sun, 25 Sep 88 14:29:21+0900 Subject: RSRE address Message-ID: <8809250529.AA17939@av-convex.NTT.jp> Several readers requested the RSRE address mentioned in my previous mail. The following is all I know. Royal Signals & Radar Establishment St Andrews Road Great Malvern Worcestershire WR14 3PS, UK D.S.Broomhead e-mail from USA: dsb%rsre.mod.uk at relay.mod.uk e-mail from UK : dsb%rsre.mod.uk at uk.ac.ucl.cs.nss David Lowe e-mail from USA: dl%rsre.mod.uk at relay.mod.uk e-mail from UK : dl%rsre.mod.uk at uk.ac.ucl.cs.nss Please contact them to get the paper on RBF. Hideki Kawahara PS: I can't access e-mail for the next several days. Excuse me for the delay in replying. ------------------------------------------------------ e-mail: kawahara%nttlab.ntt.jp at RELAY.CS.NET (from ARPA) s-mail: Hideki Kawahara Information Science Research Laboratory NTT Basic Research Laboratories 3-9-11 Midori-cho Musashino-shi, TOKYO 180, JAPAN ------------------------------------------------------ From niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Mon Sep 26 00:03:02 1988 From: niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (M. Niranjan) Date: Mon, 26 Sep 88 00:03:02 BST Subject: radial basis functions Message-ID: <1483.8809252303@dsl.eng.cam.ac.uk> Re: Hideki KAWAHARA's recent postings on Radial basis functions Radial basis functions as pattern classifiers are a kind of kernel discriminant analysis ("Kernel Discriminant Analysis" by Hand, Research Studies Press, 1982). In KDA, a class conditional probability density function is estimated as a weighted sum of kernel functions centred on the training examples (and then Bayes' type classification); in RBF, the discriminant function itself is calculated as a weighted sum of kernel functions. In this sense, RBF is superior to KDA, I think. It forms class boundaries by segments of hyper-spheres (rather than hyper-planes for a BP type network). Something very similar to RBF is the method of potential functions. This works something like placing weighted electric charges on every training example and the equi-potential lines act as class boundaries. I think the green book by Duda and Hart mentions this, but the original reference is Aizerman, M.A., Braverman, E.M. \& Rozonoer, L.I. (1964): ``On the method of potential functions''; Avtomatika i Telemekhanika, {\bf Vol. 26, No. 11}, 2086-2088. (This is in Russian, but there is a one-to-one translation in most electrical engineering libraries) Also, if you make a network with one hidden layer of 'spherical graded units' (Hanson and Burr, "Knowledge representation in connectionist networks"), and a simple perceptron as output unit (plus some simplifying assumptions), you can derive the RBF method!!
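To make the RBF/KDA contrast above concrete, here is a small sketch in Python, with Gaussian kernels centred on the training examples, a least-squares fit of the output weights for the RBF route, and equal priors for the KDA route. All of these choices, and the names, are the editor's assumptions rather than anything prescribed in the references cited.

import numpy as np

def gaussian_kernel(X, centres, width):
    # matrix of exp(-||x - c||^2 / (2*width^2)) for every (x, centre) pair
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def fit_rbf_discriminant(X, y, width=1.0):
    # RBF route: fit weights so that the weighted sum of kernels
    # approximates the class labels (+1/-1) directly.
    Phi = gaussian_kernel(X, X, width)            # centres = training examples
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return lambda Xnew: gaussian_kernel(Xnew, X, width) @ w

def kda_classify(X, y, width=1.0):
    # KDA route: kernel estimate of each class-conditional density,
    # then pick the larger one (equal priors assumed).
    def density(Xnew, Xclass):
        return gaussian_kernel(Xnew, Xclass, width).mean(axis=1)
    return lambda Xnew: np.where(
        density(Xnew, X[y > 0]) >= density(Xnew, X[y < 0]), 1.0, -1.0)

# toy usage: two clusters in the plane
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
f = fit_rbf_discriminant(X, y, width=0.5)
g = kda_classify(X, y, width=0.5)
print(np.sign(f(X)), g(X))

The two routes use exactly the same kernel matrix; they differ only in whether the weighted sum is fitted as a discriminant directly (the RBF route) or read as class-conditional density estimates that are then compared (the KDA route).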
niranjan From russ at baklava.mitre.org Mon Sep 26 08:42:11 1988 From: russ at baklava.mitre.org (Russell Leighton) Date: Mon, 26 Sep 88 08:42:11 EDT Subject: Psychnet In-Reply-To: Psychology Newsletter and Bulletin Board's message of Sun, 25 Sep 88 <8809260843.AA26526@mitre.arpa> Message-ID: <8809261242.AA01985@baklava.mitre.org.> Please include me on your distribution list. Please use Thanks, Russ. ARPA: russ at mitre.arpa Russell Leighton M.S. Z406 MITRE Signal Processing Lab 7525 Colshire Dr. McLean, Va. 22102 USA From jose at tractatus.bellcore.com Tue Sep 27 09:37:55 1988 From: jose at tractatus.bellcore.com (Stephen J Hanson) Date: Tue, 27 Sep 88 09:37:55 EDT Subject: rbfs Message-ID: <8809271337.AA15038@tractatus.bellcore.com> >Re: Hideki KAWAHARA's recent postings on Radial basis functions >Also, if you make a network with one hidden layer of 'spherical graded >units' (Hanson and Burr, "Knowledge representation in connectionist >networks"), and a simple perceptron as output unit (plus some simplifying >assumptions), you can derive the RBF method!! >>niranjan It's also worth noting that any sort of generalized dichotomy (discriminant) can be naturally embedded in Back-prop nets--in terms of polynomial boundaries (also suggested in Hanson & Burr) or any sort of generalized volume or edge one would like (sigma-pi for example are simple rectangular volumes). I believe that this sort of variation has a relation to synaptic-dendritic interactions which one might imagine could be considerably more complex than linear. However, I suspect there is a tradeoff in terms neuron complexity and learning generality as one increases the complexity of the discriminant or predicate that one is using-- consequently as componential network power increases the advantage of network computation may decrease. (as usual "generalized discriminants" was suggested previously in statistical and pattern recognition literature-- Duda and Hart, pp. 134-138. and also see Tou & Gonzalez, Pattern Recognition Principles, Addison-Wesley, 1974, pp. 48-52-- Btw--I don't think the fact that many sorts of statistical methods seem to "pop out" of neural network approaches also means that neural network framework is somehow derivative--remember that many of the statistical models and methods are ad hoc and explicitly rely on "normative" sorts of assumptions which may provide the only connection to some other sort of statistical method. In fact, i think it is rather remarkable that such simple sorts of "neural like" assumptions can lead to families of such powerful sorts of general methods.) Stephen Hanson From schmidhu at tumult.informatik.tu-muenchen.de Tue Sep 27 07:26:30 1988 From: schmidhu at tumult.informatik.tu-muenchen.de (Juergen Schmidhuber) Date: Tue, 27 Sep 88 10:26:30 -0100 Subject: Abstract available Message-ID: <8809270926.AA19521@tumult.informatik.tu-muenchen.de> This is the abstract of an extended abstract of the description of some ongoing work that will be presented at the conference `Connectionism in Perspective' in Zurich. THE NEURAL BUCKET BRIGADE Juergen Schmidhuber For several reasons standard back-propagation (BP) in recurrent networks does not make too much sense in typical non-stationary environments. We identify the main problem of BP in not being `really local', meaning that BP is not what we call `local in time'. Doing some constructive criticism we introduce a learning method for neural networks that is `really local' and still allows credit-assignment for states that are `hidden in time'. 
------------------- For those who are interested in the extended abstract there are copies available. (There also will be a more detailed and more formal treatment in the proceedings of CiP.) Include a physical address in your reply. Juergen From Roni.Rosenfeld at B.GP.CS.CMU.EDU Wed Sep 28 19:32:03 1988 From: Roni.Rosenfeld at B.GP.CS.CMU.EDU (Roni.Rosenfeld@B.GP.CS.CMU.EDU) Date: Wed, 28 Sep 88 19:32:03 EDT Subject: MIRRORS/II: Connectionist simulation software Message-ID: <637.591492723@RONI.BOLTZ.CS.CMU.EDU> The following is being posted on behalf of James Reggia. (please Please PLEASE do not reply to me or to "connectionists") Roni Rosenfeld connectionists-request at cs.cmu.edu ------- Forwarded Message MIRRORS/II Connectionist Simulator Available MIRRORS/II is a general-purpose connectionist simulator which can be used to implement a broad spectrum of connectionist (neural network) models. MIRRORS/II is distinguished by its support of an extensible high-level non-procedural language, an indexed library of networks, spreading activation methods, learning methods, event parsers and handlers, and a generalized event-handling mechanism. The MIRRORS/II language allows relatively inexperienced computer users to express the structure of a network that they would like to study and the parameters which will control their particular connectionist model simulation. Users can select an existing spreading activation/learning method and other system components from the library to complete their connectionist model; no programming is required. On the other hand, more advanced users with programming skills who are interested in research involving new methods for spreading activation or learning can still derive major benefits from using MIRRORS/II. The advanced user need only write functions for the desired procedural components (e.g., spreading activation method, control strategy, etc.). Based on language primitives specified by the user, MIRRORS/II will incorporate the user-written components into the connectionist model; no changes to the MIRRORS/II system itself are required. Connectionist models developed using MIRRORS/II are not limited to a particular processing paradigm. Spreading activation methods, Hebbian learning, competitive learning, and error back-propagation are among the resources found in the MIRRORS/II library. MIRRORS/II provides both synchronous and asynchronous control strategies that determine which nodes should have their activation values updated during an iteration. Users can also provide their own control strategies and have control over a simulation through the generalized event handling mechanism. Simulations produced by MIRRORS/II have an event-handling mechanism which provides a general framework for scheduling certain actions to occur during a simulation. MIRRORS/II supports system-defined events (constant/cyclic input, constant/cyclic output, clamp, learn, display and show) and user-defined events. An event command (e.g., the input-command) indicates which event is to occur, when it is to occur, and which part of the network it is to affect. Simultaneously occurring events are prioritized according to user specification. At run time, the appropriate event handler performs the desired action for the currently occurring event. User-defined events can redefine the workings of system-defined events or can create new events needed for a particular application.
MIRRORS/II is implemented in Franz Lisp and will run under Opuses 38, 42, and 43 of Franz Lisp on UNIX systems. It is currently running on a MicroVAX, VAX and SUN 3. If you are interested in obtaining more detailed information about the MIRRORS/II system see D'Autrechy, C. L. et al., 1988, "A General-Purpose Simulation Environment for Develop- ing Connectionist Models," Simulation, 51, 5-19. The MIRRORS/II software and reference manual are available for no charge via tape or ftp. If you are interested in obtain- ing a copy of the software send e-mail to mirrors at mimsy.umd.edu or ...!uunet!mimsy!mirrors or send mail to Lynne D'Autrechy University of Maryland Department of Computer Science College Park, MD 20742 ------- End of Forwarded Message From niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK Thu Sep 29 11:48:11 1988 From: niranjan%digsys.engineering.cambridge.ac.uk at NSS.Cs.Ucl.AC.UK (M. Niranjan) Date: Thu, 29 Sep 88 11:48:11 BST Subject: RBFs Message-ID: <2577.8809291048@dsl.eng.cam.ac.uk> David Lowe (of RSRE) says their work on Radial basis functions is published in, Complex systems Vol 2 No 3, pp269-303, 1988. niranjan From jam%bu-cs.BU.EDU at bu-it.bu.edu Thu Sep 29 13:30:09 1988 From: jam%bu-cs.BU.EDU at bu-it.bu.edu (jam%bu-cs.BU.EDU@bu-it.bu.edu) Date: Thu, 29 Sep 88 13:30:09 EDT Subject: Neural networks & visual motion perception Message-ID: <8809291730.AA16018@bucse.bu.edu> The following material is available as Boston University Computer Science Department Tech Report #88-010. It may be obtained from pam at bu-cs.bu.edu or by writing to Pam Pletz, Computer Science Dept., Boston Univ., 111 Cummington St., Boston, MA 02215, U.S.A. It is 100 pages long, and the price is $7.00. ----------------------------------------------------------------------- SELF-ORGANIZING NEURAL NETWORKS FOR PERCEPTION OF VISUAL MOTION Jonathan A. Marshall ABSTRACT The human visual system overcomes ambiguities, collectively known as the aperture problem, in its local measurements of the direction in which visual objects are moving, producing unambiguous percepts of motion. A new approach to the aperture problem is presented, using an adaptive neural network model. The neural network is exposed to moving images during a developmental period and develops its own structure by adapting to statistical characteristics of its visual input history. Competitive learning rules ensure that only connection ``chains'' between cells of similar direction and velocity sensitivity along successive spatial positions survive. The resultant self-organized configuration implements the type of disambiguation necessary for solving the aperture problem and operates in accord with direction judgments of human experimental subjects. The system not only accommodates its structure to long-term statistics of visual motion, but also simultaneously uses its acquired structure to assimilate, disambiguate, and represent visual motion events in real-time. ------------------------------------------------------------------------ I am now at the Center for Research in Learning, Perception, and Cognition, 205 Elliott Hall, University of Minnesota, Minneapolis, MN 55414. I can still be reached via my account jam at bu-cs.bu.edu . --J.A.M. 
From kawahara at av-convex.ntt.jp Thu Sep 29 09:47:53 1988
From: kawahara at av-convex.ntt.jp (Hideki KAWAHARA)
Date: Thu, 29 Sep 88 22:47:53+0900
Subject: Neural network capabilities and alternatives to BP
Message-ID: <8809291347.AA07886@av-convex.NTT.jp>

Dear colleagues:

First of all, I have to apologize that my previous mails had a somewhat rude tone and unintended negative effects. I would like to correct them by making my points clear, and I will try to supply usable and traceable information. I suggested too many things with too little evidence. The points I want to make are as follows.

(1) Neural network capabilities and learning algorithms are different problems. Separating them will clarify the characteristics of each.

(2) Theoretically, feed-forward networks with one hidden layer can approximate any continuous mapping from an n-dimensional hypercube to an m-dimensional space. However, networks designed according to procedures suggested by the theory (like Irie-Miyake) will suffer from so-called "combinatorial explosion" problems, because the complexity of the network is proportional to the degrees of freedom of the input space. The Irie-Miyake proof is based on the multi-dimensional Fourier transform. An interesting demonstration of neural network capabilities can be implemented using CT (computerized tomography) procedures. (Irie once said that his inspiration came from his knowledge of CT.)

(3) In pattern processing applications, there is a useful class of neural network architectures, including RBFs. They are not likely to suffer from "combinatorial explosion" problems, because the network complexity in this case is mainly bounded by the number of clusters in the input space. In other words, the degrees of freedom are usually proportional to the number of clusters. (Thank you for providing useful information on RBFs and PGUs. Hanson's article and Niranjan's article supplied additional information.)

(4) There are simple transformations for converting feed-forward networks into networks that belong to the class mentioned in (3). The PGU introduced by Hanson and Burr is one such extension. However, there are at least two cases where linear graded units can form radial basis functions (see the numerical sketch after this list). Case (1): if the input vectors are distributed only on the surface of a hypersphere, the output of a linear graded unit will be an RBF. Case (2): if the input vectors are auto-correlation coefficients of input signals, and if the weight vector of a linear graded unit is calculated from the maximum-likelihood spectral parameters of a reference spectrum, the output of a linear graded unit will also be an RBF.

(5) These transformations and various neural network learning algorithms can be combined to work together. For example, a self-organizing feature map can be used to prepare the reference points of an RBF network, and a BP-based procedure can be used for fine tuning.

(6) The procedures in (3) and (4) suggest a prototype-based perception model, because the hidden units in this case correspond to reference vectors in the input space. This is a local representation. Even if we choose an RBF with a broader radius, it resembles coarse coding at best. This contrasts with our experience using BP, where distributed representations usually emerge as internal representations. This is an interesting point to discuss.

(7) My point of view: I agree with Hanson's view that neural networks are not mere derivatives of statistical methods. I believe that neural networks are fruitful sources of important algorithms that have not yet been discovered. This does not mean that neural networks simply implement those algorithms; it means that we can extract those algorithms if we carefully investigate their functions using appropriate formalisms and abstractions.
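As a concrete check of case (1) in point (4): for unit-norm input and weight vectors, ||x - w||^2 = 2 - 2(w . x), so a graded unit's output, a monotone function of w . x, depends on the input only through its distance from w, i.e., it behaves as a radial basis function on the sphere. The Python sketch below verifies this numerically; the dimension, the sample count, and the logistic nonlinearity are assumptions made for illustration, not part of Kawahara's argument.

# Numerical check: on the unit hypersphere, a linear graded unit's response is radial.
import math
import random

def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(1)
dim = 5
w = normalize([random.gauss(0, 1) for _ in range(dim)])

for _ in range(5):
    x = normalize([random.gauss(0, 1) for _ in range(dim)])
    dot = sum(wi * xi for wi, xi in zip(w, x))
    sq_dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    # The first two columns agree (||x - w||^2 = 2 - 2*(w . x)), so the unit's
    # output logistic(w . x) is a fixed monotone function of the radial distance.
    print(round(sq_dist, 4), round(2 - 2 * dot, 4), round(logistic(dot), 4))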
I hope this mail clarifies my points, contributes to increasing our knowledge of neural network characteristics, and stimulates productive discussions.

Hideki Kawahara
NTT Basic Research Laboratories

Reference:
Itakura, F.: "Minimum Prediction Residual Principle Applied to Speech Recognition," IEEE Trans. ASSP-23, pp. 67-72, Feb. 1975. (This is the original paper. The Itakura measure may be found in many textbooks on speech processing.)

From pratt at paul.rutgers.edu Fri Sep 30 16:53:45 1988
From: pratt at paul.rutgers.edu (Lorien Y. Pratt)
Date: Fri, 30 Sep 88 16:53:45 EDT
Subject: Hector Sussmann to speak on formal analysis of Boltzmann Machine Learning
Message-ID: <8809302053.AA03471@zztop.rutgers.edu>

Fall, 1988 Neural Networks Colloquium Series at Rutgers

On the theory of Boltzmann Machine Learning
-------------------------------------------
Hector Sussmann
Rutgers University Mathematics Department

Room 705 Hill Center, Busch Campus
Friday, October 14, 1988, at 11:10 am
Refreshments served before the talk

Abstract

The Boltzmann machine is an algorithm for learning in neural networks, involving alternation between a ``learning'' and a ``hallucinating'' phase. In this talk, I will present a Boltzmann machine algorithm for which it can be proven that, for suitable choices of the parameters, the weights converge so that the Boltzmann machine correctly classifies all training data. This is because the evolution of the weights follows very closely, with very high probability, an integral trajectory of the gradient of the likelihood function, whose global maxima are exactly the desired weight patterns.
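For readers unfamiliar with the rule the abstract's likelihood-gradient argument concerns: the standard Boltzmann machine update (Ackley, Hinton, and Sejnowski, 1985) changes each weight by the difference between pairwise co-occurrence statistics gathered in the clamped (``learning'') and free-running (``hallucinating'') phases. The sketch below shows only that update step, given already-estimated statistics; it is not Sussmann's algorithm, and the learning rate and toy statistics are assumptions made for illustration.

# Classic Boltzmann machine weight update, given phase statistics:
#   delta w_ij = epsilon * (<s_i s_j>_clamped - <s_i s_j>_free)

def boltzmann_update(weights, clamped_stats, free_stats, epsilon=0.05):
    # One gradient-ascent step on the log-likelihood: each off-diagonal weight
    # moves by the difference between clamped and free co-occurrence statistics.
    n = len(weights)
    for i in range(n):
        for j in range(n):
            if i != j:
                weights[i][j] += epsilon * (clamped_stats[i][j] - free_stats[i][j])
    return weights

# Toy example with three units; entries are estimates of E[s_i * s_j] in each phase.
w = [[0.0] * 3 for _ in range(3)]
clamped = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.1], [0.2, 0.1, 1.0]]
free = [[1.0, 0.5, 0.4], [0.5, 1.0, 0.3], [0.4, 0.3, 1.0]]
print(boltzmann_update(w, clamped, free))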