Reply to S. Harnad's questions, longer version

Steve Pinker steve at cogito.mit.edu
Tue Aug 30 18:46:06 EDT 1988


Dear Stevan,

This letter is a reply to your posted list of questions and
observations alluding to our paper "On language and connectionism:
Analysis of a PDP model of language acquisition" (Pinker & Prince,
1988; see also Prince and Pinker, 1988).  The questions are based on
misunderstandings of our papers, in which they are already answered.

(1) Contrary to your suggestion, we never claimed that pattern
associators cannot learn the past tense rule, or anything else, in
principle. Our concern is with which theories of the psychology of
language are true.  This question cannot be answered from an armchair
but only by examining what people learn and how they learn it.  Our
main conclusion is that the claim that the English past tense rule is
learned and represented as a pattern-associator with distributed
representations over phonological features for input and output forms
(e.g., the Rumelhart-McClelland 1986 model) is false.  That's because
what pattern-associators are good at is precisely what the regular
rule doesn't need. Pattern associators are designed to pick up
patterns of correlation among input and output features. The regular
past tense alternation, as acquired by English speakers, is not
systematically sensitive to phonological features.  Therefore some of
the failures of the R-M model we found are traceable to its trying to
handle the regular rule with an architecture inappropriate to the
regular rule.

We therefore predict that these failures should be seen in other
network models that compute the regular past tense alternation using
pattern associators with distributed phonological representations
(*not* all conceivable network models, in general, in principle,
forever, etc.).  This prediction has been confirmed.  Egedi and Sproat
(1988) devised a network model that retained the assumption of
associations between distributed phonological representations but
otherwise differed radically from the R-M model: it had three layers,
not two; it used a back-propagation learning rule, not just the simple
perceptron convergence procedure; it used position-specific
phonological features, not context-dependent ones; and it had a
completely different output decoder. Nonetheless its successes and
failures were virtually identical to those of the R-M model.
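For concreteness, here is a minimal sketch (ours, purely illustrative -- not the R-M or Egedi-Sproat code; all names and vectors are made up) of what a two-layer pattern associator trained with the perceptron convergence procedure amounts to: each output feature unit is an independent threshold unit over the distributed input features, so the only thing the device can exploit is feature-to-feature correlation.

```python
# Toy two-layer pattern associator: distributed feature vector in,
# distributed feature vector out, perceptron convergence procedure.
# Purely illustrative; not a reconstruction of any published model.

def step(x):
    return 1 if x > 0 else 0

def train(pairs, n_in, n_out, epochs=50, lr=1.0):
    # one weight matrix plus biases; each output unit is an
    # independent perceptron over the input features
    w = [[0.0] * n_in for _ in range(n_out)]
    b = [0.0] * n_out
    for _ in range(epochs):
        for x, t in pairs:
            for j in range(n_out):
                y = step(sum(w[j][i] * x[i] for i in range(n_in)) + b[j])
                err = t[j] - y
                if err:
                    for i in range(n_in):
                        w[j][i] += lr * err * x[i]
                    b[j] += lr * err
    return w, b

def predict(w, b, x):
    return [step(sum(wj[i] * x[i] for i in range(len(x))) + bj)
            for wj, bj in zip(w, b)]

# Toy "phonological" vectors: the mapping copies the input and sets a
# final "+past" feature -- exactly the kind of input-output feature
# correlation a pattern associator is built to pick up.
pairs = [([1, 0, 1, 0], [1, 0, 1, 1]),
         ([0, 1, 1, 0], [0, 1, 1, 1]),
         ([1, 1, 0, 0], [1, 1, 0, 1])]
w, b = train(pairs, 4, 4)
```

The point of the sketch is what it *cannot* do: every statistic it learns is a correlation over phonological features, which is just what the regular alternation does not depend on.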

(2) You claim that 

     "the regularities you describe -- both in the irregulars and
     the regulars -- are PRECISELY the kinds of invariances you
     would expect a statistical pattern learner that was sensitive
     to higher order correlations to be able to learn successfully.
     In particular, the form-independent default option for the
     regulars should be readily inducible from a representative
     sample."

This is an interesting claim and we strongly encourage you to back it
up with argument and analysis; a real demonstration of its truth would
be a significant advance.  It's certainly false of the R-M and
Egedi-Sproat models.  There's a real danger in this kind of glib
commentary of trivializing the issues by assuming that net models are
a kind of miraculous wonder tissue that can do anything.  The
brilliance of the Rumelhart and McClelland (1986) paper is that they
studiously avoided this trap. In the section of their paper called
"Learning regular and exceptional patterns in a pattern associator"
they took great pains to point out that pattern associators are good
at specific things, especially exploiting statistical regularities in
the mapping from one set of featural patterns to another. They then
made the interesting empirical claim that these basic properties of the
pattern associator model lie at the heart of the acquisition of the
past tense. Indeed, the properties of the model afforded it some
interesting successes with the *irregular* alternations, which fall
into family resemblance clusters of the sort that pattern associators
handle in interesting ways.  But it is exactly these properties of the
model that made it fail at the *regular* alternation, which does not
form family resemblance clusters.

We like to think that these kinds of comparisons make for productive
empirical science. The successes of the pattern associator
architecture for irregulars teach us something about the psychology
of the irregulars (basically a memory phenomenon, we argue), and its
failures for the regulars teach us something about the psychology of
the regulars (use of a default rule, we argue).  Rumelhart and
McClelland disagree with us over the facts but not over the key
empirical tests. They hold that pattern associators have particular
aptitudes that are suited to modeling certain kinds of processes,
which they claim are those of cognition.  One can argue for or against
this and learn something about psychology while so doing.  Your claim
about a 'statistical pattern learner...sensitive to higher order
correlations' is essentially impossible to evaluate.

(3) We're mystified that you attribute to us the claim that "past
tense formation is not learnable in principle." The implication is
that our critique of the R-M model was based on the assertion that the
rule is unlearned and that this is the key issue separating us from
R&M.  Therefore -- you seem to reason -- if the rule is learned, it is
learned by a network. But both parts are wrong. No one in his right
mind would claim that the English past tense rule is "built in".  We
spent a full seven pages (130-136) of 'OLC' presenting a simple model
of how the past tense rule might be learned by a symbol manipulation
device.  So obviously we don't believe it can't be learned. The
question is how children in fact do it.

The only way we can make sense of this misattribution is to suppose
that you equate "learnable" with "learnable by some (nth-order)
statistical algorithm". The underlying presupposition is that
statistical modeling (of an undefined character) has some kind of
philosophical priority over other forms of analysis; so that if
statistical modeling seems somehow possible-in-principle, then
rule-based models (and the problems they solve) can be safely ignored.
As a kind of corollary, you seem to assume that unless the input is so
impoverished as to rule out all statistical modeling, rule theories
are irrelevant; that rules are impossible without major
stimulus-poverty. In our view, the question is not CAN some (ungiven)
algorithm 'learn' it, but DO learners approach the data in that
fashion. Poverty-of-the-stimulus considerations are one out of many
sources of evidence in this issue. (In the case of the past tense
rule, there is a clear P-of-S argument for at least one aspect of the
organization of the inflectional system: across languages, speakers
automatically regularize verbs derived from nouns and adjectives
(e.g., 'he high-sticked/*high-stuck the goalie'; 'she braked/*broke
the car'), despite virtually no exposure to crucial informative data in
childhood. This is evidence that the system is built around
representations corresponding to the constructs 'word', 'root', and
'irregular'; see OLC 110-114.)

(4) You bring up the old distinction between rules that describe
overall behavior and rules that are explicitly represented in a
computational device and play a causal role in its behavior.  Perhaps,
as you say, "these are not crisp issues, and hence not a solid basis
for a principled critique". But it was Rumelhart and McClelland who
first brought them up, and it was the main thrust of their paper. We
tend to agree with them that the issues are crisp enough to motivate
interesting research, and don't just degenerate into discussions of
logical possibilities. We just disagree about which conclusions are
warranted. We noted that (a) the R-M model is empirically incorrect,
therefore you can't use it to defend any claims for whether or not
rules are explicitly represented; (b) if you simply wire up a network
to do exactly what a rule does, by making every decision about how to
build the net (which features to use, what its topology should be,
etc.) by consulting the rule-based theory, then that's a clear sense
in which the network "implements" the rule.  The reason is that the
hand-wiring and tweaking of such a network would not be motivated by
principles of connectionist theory; at the level at which the
manipulations are carried out, the units and connections are
indistinguishable from one another and could be wired together any way
one pleased. The answer to the question "Why is the network wired up
that way?" would come from the rule-theory; for example, "Because the
regular rule is a default operation that is insensitive to stem
phonology". Therefore in the most interesting sense such a network
*is* a rule. The point carries over to more complex cases, where one
would have different subnetworks corresponding to different parts of
rules.  Since it is the fact that the network implements such-and-such
a rule that is doing the work of explaining the phenomenon, the
question now becomes, is there any reason to believe that the rule is
implemented in that way rather than some other way?
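For instance (a toy illustration of our own, not anyone's proposed model), one could hand-wire a single unit whose every parameter is dictated by the rule theory: the weights on stem-phonology features are set to zero precisely because the rule says the default ignores stem phonology, and an "irregular" flag gates it off.

```python
# Hand-wired "regular rule" unit. Every design decision here comes
# from the rule-based theory, not from connectionist principles --
# which is the sense in which such a network implements the rule.

def regular_unit(stem_features, is_irregular):
    # weights on phonological features are deliberately zero:
    # the default operation is insensitive to stem phonology
    w_phon = [0.0] * len(stem_features)
    w_irregular = -10.0          # a stored irregular suppresses the default
    bias = 1.0                   # the unit fires by default
    net = sum(w * f for w, f in zip(w_phon, stem_features))
    net += w_irregular * (1 if is_irregular else 0) + bias
    return 1 if net > 0 else 0   # 1 = "attach -ed"
```

Asked "Why is the unit wired up that way?", the only available answer is the rule theory itself.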

Please note that we are *not* asserting that no PDP model of any sort
could ever acquire linguistic knowledge without directly implementing
linguistic rules. Our hope, of course, is that as the discussion
proceeds, models of all kinds will become more sophisticated and
ambitious. As we said in our Conclusion, "These problems are exactly
that, problems.  They do not demonstrate that interesting PDP models
of language are impossible in principle. At the same time, they show
that there is no basis for the belief that connectionism will dissolve
the difficult puzzles of language, or even provide radically new
solutions to them."

So to answer the catechism:

(a) Do we believe that English past tense formation is not learnable?
Of course we don't!

(b) If it is learnable, is it specifically unlearnable by nets?  No,
there may be some nets that can learn it; certainly any net that is
intentionally wired up to behave exactly like a rule-learning
algorithm can learn it. Our concern is not with (the mathematical
question of) what nets can or cannot do in principle, but about which
theories are true, and our analysis was of pattern associators using
distributed phonological representations. We showed that it is
unlikely that human children learn the regular rule the way such a
pattern associator learns the regular rule, because it is simply the
wrong tool for the job. Therefore it's not surprising that the
developmental data confirm that children do not behave the way such a
pattern associator behaves.

(c) If past tense formation is learnable by nets, but only if the
invariance that the net learns and that causally constrains its
successful performance is describable as a "rule", what's wrong with
that? Absolutely nothing! -- just like there's nothing wrong with
saying that past tense formation is learnable by a bunch of
precisely-arranged molecules (viz., the brain) such that the
invariance that the molecules learn, etc. etc.  The question is, what
explains the facts of human cognition? Pattern associator networks
have some interesting properties that can shed light on certain kinds
of phenomena, such as irregular past tense forms.  But it is simply a
fact about the regular past tense alternation in English that it is
not that kind of phenomenon.  You can focus on the interesting
empirical properties of pattern associators, and use them to explain
certain things (but not others), or you can generalize them to a class
of universal devices that can explain nothing without appeals to the
rules that they happen to implement. But you can't have it both ways.

Steven Pinker
Department of Brain and Cognitive Sciences
E10-018
MIT
Cambridge, MA 02139
steve at cogito.mit.edu

Alan Prince
Program in Cognitive Science
Department of Psychology
Brown 125
Brandeis University
Waltham, MA 02254-9110
prince at brandeis.bitnet

References:

Egedi, D. M. & Sproat, R. W. (1988) Neural nets and natural language
morphology. AT&T Bell Laboratories, Murray Hill, NJ 07974.

Pinker, S. & Prince, A. (1988) On language and connectionism: Analysis
of a parallel distributed processing model of language acquisition.
Cognition, 28, 73-193. Reprinted in S. Pinker & J.  Mehler (Eds.),
Connections and symbols. Cambridge, MA: Bradford Books/MIT Press.

Prince, A. & Pinker, S. (1988) Rules and connections in human
language. Trends in Neurosciences, 11, 195-202.

Rumelhart, D. E. & McClelland, J. L. (1986) On learning the past
tenses of English verbs. In J. L. McClelland, D. E. Rumelhart, & The
PDP Research Group, Parallel distributed processing: Explorations in
the microstructure of cognition. Volume 2: Psychological and
biological models. Cambridge, MA: Bradford Books/MIT Press.

