On Theft vs. Honest Toil (Pinker & Prince Discussion, cont'd)

Stevan Harnad harnad at Princeton.EDU
Wed Aug 31 16:39:33 EDT 1988


Pinker & Prince write in reply:

>>  Contrary to your suggestion, we never claimed that pattern associators
>>  cannot learn the past tense rule, or anything else, in principle.

I've reread the paper, and unfortunately I still find it ambiguous:
For example, one place (p. 183) you write:
   "These problems are exactly that, problems. They do not demonstrate
   that interesting PDP models of language are impossible in principle."
But elsewhere (p. 179) you write:
   "the representations used in decomposed, modular systems are
   abstract, and many aspects of their organization cannot be learned
   in any obvious way." [Does past tense learning depend on any of
   this unlearnable organization?]
On p. 181 you write:
   "Perhaps it is the limitations of these simplest PDP devices --
   two-layer association networks -- that causes problems for the
   R & M model, and these problems would diminish if more
   sophisticated kinds of PDP networks were used."
But earlier on the same page you write:
   "a model that can learn all possible degrees of correlation among a
   set of features is not a model of a human being" [Sounds like a
   Catch-22...]

It's because of this ambiguity that my comments were made in the form of
conditionals and questions rather than assertions. But we now stand
answered: You do NOT claim "that pattern associators cannot learn the
past tense rule, or anything else, in principle."

[Oddly enough, I do: if by "pattern associators" you mean (as you mostly
seem to mean) 2-layer perceptron-style nets like the R & M model, then I
would claim that they cannot learn the kinds of things Minsky showed they
couldn't learn, in principle. Whether or not more general nets (e.g., PDP
models with hidden layers, back-prop, etc.) will turn out to have corresponding
higher-order limitations seems to be an open question at this point.]
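The Minsky limit mentioned above can be made concrete with a small sketch (my own illustration, in modern notation, not anything from the paper): a two-layer perceptron with no hidden units can learn AND, which is linearly separable, but can never learn XOR, which is not.

```python
# A two-layer (input -> output, no hidden units) perceptron trained by the
# classic perceptron learning rule. AND is linearly separable and is learned;
# XOR is not, so no weight setting can classify all four cases correctly.

def train_perceptron(samples, epochs=100):
    """Run the perceptron learning rule and return the trained classifier."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out
            w[0] += err * x1
            w[1] += err * x2
            b += err
    return lambda x1, x2: 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [(x, x[0] & x[1]) for x in inputs]
XOR = [(x, x[0] ^ x[1]) for x in inputs]

and_errors = sum(train_perceptron(AND)(*x) != t for x, t in AND)
xor_errors = sum(train_perceptron(XOR)(*x) != t for x, t in XOR)
print(and_errors, xor_errors)  # 0 errors on AND; at least 1 on XOR, always
```

Whatever weights the training ends with, at least one XOR case must be misclassified, since no line separates {(0,1),(1,0)} from {(0,0),(1,1)}; that is the in-principle limit, as opposed to a mere failure of practice.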

You go on to quote my claim that:

     "the regularities you describe -- both in the
     irregulars and the regulars -- are PRECISELY the kinds of
     invariances you would expect a statistical pattern     
     learner that was sensitive to higher order correlations to
     be able to learn successfully. In particular, the
     form-independent default option for the regulars should be
     readily inducible from a representative sample."

and then you comment:

>>  This is an interesting claim and we strongly encourage you to back it
>>  up with argument and analysis; a real demonstration of its truth would
>>  be a significant advance. It's certainly false of the R-M and
>>  Egedi-Sproat models. There's a real danger in this kind of glib
>>  commentary of trivializing the issues by assuming that net models are
>>  a kind of miraculous wonder tissue that can do anything.

I don't understand the logic of your challenge. You've disavowed
having claimed that any of this was unlearnable in principle. Why is it
glibber to conjecture that it's learnable in practice than that it's
unlearnable in practice? From everything you've said, it certainly
LOOKS perfectly learnable: Sample a lot of forms and discover that the
default invariance turns out to work well in most cases (i.e., the
"regulars"; the rest, the "irregulars," have their own local invariances,
likewise inducible from statistical regularities in the data).
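To make the conjecture concrete, here is a toy sketch (my construction, not the R & M model, with a made-up miniature corpus) of inducing the form-independent default from nothing but frequency statistics: tally the stem-to-past transformations in a sample, let the most frequent one serve as the default, and store the rest as local exceptions.

```python
# Toy induction of the "regular" default from a sample of (stem, past) pairs.
from collections import Counter

sample = [
    ("walk", "walked"), ("talk", "talked"), ("jump", "jumped"),
    ("play", "played"), ("call", "called"), ("look", "looked"),
    ("go", "went"), ("sing", "sang"), ("ring", "rang"),
]

def transformation(stem, past):
    """Describe the change as (material removed, material added)."""
    if past.startswith(stem):
        return ("", past[len(stem):])   # e.g. walk -> walked is ("", "ed")
    return (stem, past)                 # irregular: memorize the whole form

counts = Counter(transformation(s, p) for s, p in sample)
default = counts.most_common(1)[0][0]   # ("", "ed"): the statistical default
exceptions = {s: p for s, p in sample if transformation(s, p) != default}

def past_tense(stem):
    return exceptions.get(stem, stem + default[1])

print(past_tense("walk"), past_tense("go"), past_tense("blick"))
# the default extends form-independently to novel stems like "blick"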

This has nothing to do with a belief in wonder tissue. It was precisely
to avoid irrelevant stereotypes of that sort that the first
posting was prominently preceded by the disclaimer that I happen to be
a sceptic about connectionism's actual accomplishments and an agnostic
about its future potential. My critique was based solely on the logic of
your argument against connectionism (in favor of symbolism). Based
only on what you've written about its underlying regularities, past
tense rule learning simply doesn't seem to pose a serious challenge for a
statistical learner -- not in principle, at any rate. It seems to have
stumped R & M 86 and E & S 88 in practice, but how many tries is
that? It is possible, for example, as suggested by your valid analysis of
the limitations of the Wickelfeature representation, that some of the
requisite regularities are simply not reflected in this phonological
representation, or that other learning (e.g. plurals) must complement
past-tense data. This looks more like an entry-point problem
(see (1) below), however, rather than a problem of principle for
connectionist learning of past tense formation. After all, there's no
serious underdetermination here; it's not like looking for a needle in
a haystack, or NP-complete, or anything like that.

I agree that R & M made rather inflated general claims on the basis of
the limited success of R & M 86. But (to me, at any rate) the only
potentially substantive issue here seems to be the one of principle (about
the relative scope and limits of the symbolic vs. the connectionistic
approach). Otherwise we're all just arguing about the scope and limits
of R & M 86 (and perhaps now also E & S 88).

Two sources of ambiguity seem to be keeping this disagreement
unnecessarily vague:

(1) There is an "entry-point" problem in comparing a toy model (e.g.,
R & M 86) with a lifesize cognitive capacity (e.g., the human ability
to form past tenses): The capacity may not be modular; it may depend on
other capacities. For example, as you point out in your article, other
phonological and morphological data and regularities (e.g.,
pluralization) may contribute to successful past tense formation. Here
again, the challenge is to come up with a PRINCIPLED limitation, for
otherwise the connectionist can reasonably claim that there's no reason
to doubt that those further regularities could have been netted exactly
the same way (if they had been the target of the toy model); the entry
point just happened to be arbitrarily downstream. I don't say this
isn't hand-waving; but it can't be interestingly blocked by hand-waving
in the opposite direction.

(2) The second factor is the most critical one: learning. You
put a lot of weight on the idea that if nets turn out to behave
rulefully then this is a vindication of the symbolic approach.
However, you make no distinction between rules that are built in (as
"constraints," say) and rules that are learned. The endstate may be
the same, but there's a world of difference in how it's reached -- and
that may turn out to be one of the most important differences between
the symbolic approach and connectionism: Not whether they use
rules, but how they come by them -- by theft or honest toil. Typically,
the symbolic approach builds them in, whereas the connectionistic one
learns them from statistical regularities in its input data. This is
why the learnability issue is so critical. (It is also what makes it
legitimate for a connectionist to conjecture, as in (1) above, that if
a task is nonmodular, and depends on other knowledge, then that other
knowledge too could be acquired the same way: by learning.)

>>  Your claim about a 'statistical pattern learner...sensitive to higher
>>  order correlations' is essentially impossible to evaluate.

There are in principle two ways to evaluate it, one empirical and
open-ended, the other analytical and definitive. You can demonstrate
that specific regularities can be learned from specific data by getting
a specific learning model to do it (but its failure would only be evidence
that that model fails for those data). The other way is to prove analytically
that certain kinds of regularities are (or are not) learnable from
certain kinds of data (in certain ways, I might add, because
connectionism may be only one candidate class of statistical learning
algorithms). Poverty-of-the-stimulus arguments attempt to demonstrate
the latter (i.e., unlearnability in principle).

>>  We're mystified that you attribute to us the claim that "past
>>  tense formation is not learnable in principle."... No one in his right
>>  mind would claim that the English past tense rule is "built in".  We
>>  spent a full seven pages (130-136) of 'OLC' presenting a simple model
>>  of how the past tense rule might be learned by a symbol manipulation
>>  device. So obviously we don't believe it can't be learned.

Here are some extracts from OLC 130ff:

   "When a child hears an inflected verb in a single context, it is
   utterly ambiguous what morphological category the inflection is
   signalling... Pinker (1984) suggested that the child solves this
   problem by "sampling" from the space of possible hypotheses defined
   by combinations of an innate finite set of elements, maintaining
   these hypotheses in the provisional grammar, and testing them
   against future uses of that inflection, expunging a hypothesis if
   it is counterexemplified by a future word. Eventually... only
   correct ones will survive." [The text goes on to describe a
   mechanism in which hypothesis strength grows with success frequency
   and diminishes with failure frequency through trial and error.]
   "Any adequate rule-based theory will have to have a module that
   extracts multiple regularities at several levels of generality,
   assign them strengths related to their frequency of exemplification
   by input verbs, and let them compete in generating a past tense for
   a given verb."

It's not entirely clear from the description on pp. 130-136 (probably
partly because of the finessed entry-point problem) whether (i) this is an
innate parameter-setting or fine-tuning model, as it sounds, with the
"learning" really just choosing among or tuning the built-in parameter
settings, or whether (ii) there's genuine bottom-up learning going on here.
If it's the former, then that's not what's usually meant by "learning."
If it's the latter, then the strength-adjusting mechanism sounds equivalent
to a net, one that could just as well have been implemented nonsymbolically.
(You do state that your hypothetical module would be equivalent to R & M's in
many respects, but it is not clear how this supports the symbolic approach.)
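The strength-adjusting mechanism quoted above can be sketched as follows (my rendering of the OLC description, with invented candidate rules and learning rates, not P & P's actual module): each candidate regularity's strength grows with its frequency of success and shrinks when it is counterexemplified, and the strongest competitor wins.

```python
# Competing candidate rules whose strengths track success/failure frequency --
# an incremental weight-adjustment scheme of just the kind a net performs.
rules = {
    "add -ed":   lambda s: s + "ed",
    "i -> a":    lambda s: s.replace("i", "a"),
    "no change": lambda s: s,
}
strength = {name: 0.0 for name in rules}

corpus = [
    ("walk", "walked"), ("jump", "jumped"), ("play", "played"),
    ("sing", "sang"), ("ring", "rang"), ("hit", "hit"),
    ("call", "called"), ("look", "looked"),
]

for stem, past in corpus:
    for name, rule in rules.items():
        if rule(stem) == past:
            strength[name] += 0.5      # exemplification raises strength
        else:
            strength[name] -= 0.125    # counterexamples lower it

def predict(stem):
    # the candidate regularities compete; the strongest one wins
    winner = max(rules, key=lambda n: strength[n])
    return rules[winner](stem)

print(predict("walk"), predict("sing"))
# "walked", but also over-regularized "singed" -- the kind of error
# children's "goed"/"singed" stage exhibits
```

Whether this counts as symbol manipulation or as a (nonsymbolically implementable) net is exactly the question at issue.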

[It's also unclear what to make of the point you add in your reply (again
partly because of the entry-point problem):
>>  "(In the case of the past tense rule, there is a clear P-of-S argument
>>  for at least one aspect of the organization of the inflectional system...)"
Is this or is this not a claim that all or part of English past tense
formation is not learnable (from the data available to the child) in
principle? There seems to be some ambiguity (or perhaps ambivalence) here.]

>>  The only way we can make sense of this misattribution is to suppose
>>  that you equate "learnable" with "learnable by some (nth-order)
>>  statistical algorithm". The underlying presupposition is that
>>  statistical modeling (of an undefined character) has some kind of
>>  philosophical priority over other forms of analysis; so that if
>>  statistical modeling seems somehow possible-in-principle, then
>>  rule-based models (and the problems they solve) can be safely ignored.

Yes, I equate learnability with an algorithm that can extract
statistical regularities (possibly nth order) from input data.
Connectionism seems to be (an interpretation of) a candidate class of
such algorithms; so does multiple nonlinear regression. The question of
"philosophical priority" is a deep one (on which I've written:
"Induction, Evolution and Accountability," Ann. NY Acad. Sci. 280,
1976). Suffice it to say that induction has epistemological priority
over innatism (or such a case can be made) and that a lot of induction
(including hypothesis-strengthening by sampling instances) has a
statistical character. It is not true that where statistical induction
is possible, rule-based models must be ignored (especially if the
rule-based models learn by what is equivalent to statistics anyway),
only that the learning NEED not be implemented symbolically. But it is
true that where a rule can be learned from regularities in the data,
it need not be built in. [Ceterum sentio: there is an entry-point
problem for symbols that I've also written about: "Categorical
Perception," Cambr. U. Pr. 1987. I describe there a hybrid approach
in which symbolic and nonsymbolic representations, including a
connectionistic component, are put together bottom-up in a principled
way that avoids spuriously pitting connectionism against symbolism.]

>>  As a kind of corollary, you seem to assume that unless the input is so
>>  impoverished as to rule out all statistical modeling, rule theories
>>  are irrelevant; that rules are impossible without major stimulus-poverty.

No, but I do think there's an entry-point problem. Symbolic rules can
indeed be used to implement statistical learning, or even to preempt it, but
they must first be grounded in nonsymbolic learning or in innate
structures. Where there is learnability in principle, learning does
have "philosophical (actually methodological) priority" over innateness.

>>  In our view, the question is not CAN some (ungiven) algorithm
>>  'learn' it, but DO learners approach the data in that fashion.
>>  Poverty-of-the-stimulus considerations are one out of many
>>  sources of evidence in this issue...
>>  developmental data confirm that children do not behave the way such a
>>  pattern associator behaves.

Poverty-of-the-stimulus arguments are the cornerstone of modern
linguistics because, if they are valid, they entail that certain
rules are unlearnable in principle (from the data available to the
child) and hence that a learning model must fail for such cases.
The rule system itself must accordingly be attributed to the brain,
rather than just the general-purpose inductive wherewithal to learn the
rules from experience.

Where something IS learnable in principle, there is of course still a
question as to whether it is indeed learned in practice rather than
being innate; but neither (a) the absence of data on whether it is learned
nor (b) the existence of a rule-based model that confers it on the child
for free provides very strong empirical guidance in such a case. In any
event, developmental performance data themselves seem far too
impoverished to decide between rival theories at this stage. It seems
advisable to devise theories that account for more lifesize chunks of our
asymptotic (adult) performance capacity before trying to fine-tune them
with developmental (or neural, or reaction-time, or brain-damage) tests
or constraints. (Standard linguistic theory has in any case found it
difficult to find either confirmation or refutation in developmental
data to date.)

By way of a concrete example, suppose we had two pairs of rival toy
models, symbolic vs. connectionistic, one pair doing chess-playing and
the other doing factorials. (By a "toy" model I mean one that models
some arbitrary subset of our total cognitive capacity; all models to
date, symbolic and connectionistic, are toy models in this sense.) The
symbolic chess player and the connectionistic chess player both
perform at the same level; so do the symbolic and connectionistic
factorializers. It seems evident that so little is known about how people
actually learn chess and factorials that "developmental" support would
hardly be a sound basis for choosing between the respective pairs of models
(particularly because of the entry-point problem, since these skills
are unlikely to be acquired in isolation). A much more principled way
would be to see how they scaled up from this toy skill to more and
more lifesize chunks of cognitive capacity. (It has to be conceded,
however, that the connectionist models would have a marginal lead in
this race, because they would already be using the same basic
[statistical learning] algorithm for both tasks, and for all future tasks,
presumably, whereas the symbolic approach would have to be making its
rules on the fly, an increasingly heavy load.)

I am agnostic about who would win this race; connectionism may well turn
out to be side-lined early because of a higher-order Perceptron-like limit
on its rule-learning ability, or because of principled unlearnability
handicaps. Who knows? But the race is on. And it seems obvious that
it's far too early to use developmental (or neural) evidence to decide
which way to bet. It's not even clear that it will remain a 2-man race
for long -- or that a finish might not be more likely as a
collaborative relay. (Nor is the one who finishes first or gets
farthest guaranteed to be the "real" winner -- even WITH developmental
and neural support. But that's just normal underdetermination.)

>>  if you simply wire up a network to do exactly what a rule does, by
>>  making every decision about how to build the net (which features to
>>  use, what its topology should be, etc.) by consulting the rule-based
>>  theory, then that's a clear sense in which the network "implements"
>>  the rule

What if you don't WIRE it up but TRAIN it up? That's the case at
issue here, not the one you describe. (I would of course agree that if
nets wire in a rule as a built-in constraint, that's theft, not
honest toil, but that's not the issue!)

Stevan Harnad
harnad at mind.princeton.edu
