No subject

dhw@santafe.edu dhw at santafe.edu
Mon Mar 15 15:03:22 EST 1993


Hinton says:

>>>
First, consider K nearest neighbors, where a validation set is used to pick K
and all K neighbors get to vote on the answer.  It seems fair to call this a
"local" algorithm.  If the training data contains a fraction p of all the
possible cases of n-bit parity, then each novel test case will have about pn
neighbors in the training set that differ by one bit.  It will also have about
pn(n-1)/2 training neighbors that differ by 2 bits, and so on.  So for
reasonably large n we will get correct performance by picking K so that we
tend to get all the training cases that differ by 1 bit and most of the far
more numerous training cases that differ by 2 bits. For very small values of p
we need to consider more distant neighbors to get this effect to work, and
this requires larger values of n.
>>>
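
(For concreteness, here is a rough Python sketch of the counting
argument Geoff describes - a brute-force K nearest neighbor vote on
n-bit parity. The code and the particular values of n, p and K are my
own illustration, not Geoff's, and are only meant to make the effect
visible.)

    # Brute-force K nearest neighbor vote on n-bit parity.
    # Illustrative sketch only; parameters are not Geoff's.
    import itertools, random

    def knn_parity_accuracy(n=10, p=0.5, K=30, trials=200, seed=0):
        rng = random.Random(seed)
        cases = list(itertools.product((0, 1), repeat=n))
        rng.shuffle(cases)
        m = int(p * len(cases))
        train, test = cases[:m], cases[m:]
        parity = lambda x: sum(x) % 2
        correct = 0
        for x in rng.sample(test, min(trials, len(test))):
            # sort training cases by Hamming distance to the test case
            nbrs = sorted(train, key=lambda t: sum(a != b for a, b in zip(t, x)))
            votes = [parity(t) for t in nbrs[:K]]
            guess = 1 if 2 * sum(votes) > len(votes) else 0
            correct += (guess == parity(x))
        return correct / min(trials, len(test))

    # With K large enough to reach the (correct-parity) distance-2
    # neighbors (e.g. K=30 for n=10, p=1/2) the vote tends to come out
    # right; with K small (say K=3) the distance-1 neighbors dominate
    # and the vote is almost always wrong.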



Geoff's argument here seems very plausible, but it is not quite right.



First off, as I stated in my posting, I think we've been here before,
about a year ago ... my main point there was that how backprop
does on XOR is only of historical interest. But since the broth has
been stirred up again ...



One can make a strong argument that using cross-validation, as in
Geoff's scheme, is, by definition, non-local. When describing a
learning algorithm as "local", one (or at least I) implicitly means
that the guess it makes in response to a novel test input depends
only on its nearest neighbors in the training set. K nearest
neighbor for fixed (small) K is a local learning algorithm, as I
stated in my original posting. Geoff wishes to claim that the learning
algorithm which chooses K = K* via cross-validation and then uses K*
nearest neighbor is also local. Such a learning algorithm is
manifestly global, however - on average, changes in the training set
far away will affect (perhaps drastically) how the algorithm responds
to a novel test input, since they will affect the calculated
cross-validation errors, and therefore the choice of K*.
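
(To make this concrete, here is a minimal sketch of the kind of
procedure in question - K chosen on a held-out validation set, then
K* nearest neighbor used for all test queries. The code, the hold-out
scheme and the candidate K's are my own arbitrary choices; the point
is only that K* is computed from the entire training set, so every
training case, however distant, can influence the guess at a given
test input.)

    # "Choose K by validation, then use K* nearest neighbor."
    # K* is a function of the whole training set, so the final guess
    # at any test input depends on distant cases as well as nearby ones.
    # Sketch only; the hold-out scheme and candidate K's are arbitrary.

    def hamming(a, b):
        return sum(u != v for u, v in zip(a, b))

    def knn_guess(train, x, K):
        # train is a list of (input_tuple, label) pairs
        nbrs = sorted(train, key=lambda t: hamming(t[0], x))[:K]
        votes = sum(label for _, label in nbrs)
        return 1 if 2 * votes > len(nbrs) else 0

    def choose_K(train, valid, candidate_Ks=(1, 3, 5, 15, 31)):
        # pick the K with the fewest errors on the held-out set
        errors = lambda K: sum(knn_guess(train, x, K) != y for x, y in valid)
        return min(candidate_Ks, key=errors)

    def cv_knn_predict(train, valid, x):
        K_star = choose_K(train, valid)      # depends on *all* of the data
        return knn_guess(train, x, K_star)   # the "local" part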

In short, one should not be misled by concentrating on the fact that
an overall algorithm has, *as one of its parts*, an algorithm which, if
used by itself, is local (namely, K* nearest neighbor). It is the
*entire* algorithm which is clearly the primary object of interest.
And if that algorithm uses cross-validation, it is not "local" in the
(reasonable) way it's defined above.

Indeed, one might argue that it is precisely this global character
which makes cross-validation such a useful heuristic - it allows
information from the entire training set to affect one's guess, and it
does so w/o resulting in a fit going through all those points in the
training set, with all the attendant "overtraining" dangers of such a
fit.

On the other hand, if K had been set beforehand, rather than via
cross-validation, and if for some reason one had set K particularly
high, then, as Geoff correctly points out, the algorithm wouldn't
perform poorly on parity at all. Moreover, consider changing the
definition of "local", along the lines of "a learning algorithm is
local if, on average, the pairs in the set {single input-output
pairs from the training set such that changing any single one of those
pairs has a big effect on the guess} all lie close to the test input
value", with some appropriate definition of "big". For such a
definition, cross-validation-based K nearest neighbor might be local.
(The idea is that for a big enough training set, changing a single
point far away will have little effect, on average, on calculated
cross-validation errors. On the other hand, for K* large, changing a
single nearby point will also have little effect, so it's not clear
that this modified definition of "local" will save Hinton's argument.)
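
(One way to operationalize this modified definition, again as a rough
sketch of my own that reuses the toy hamming() and cv_knn_predict()
routines above: flip each training label in turn, re-run the whole
procedure - including the re-selection of K* - and record how far from
the test input the cases whose change actually moved the guess lie.)

    # Which single training cases, if changed, change the guess at x?
    # Under the modified definition, "local" would mean those cases all
    # lie close to x. Reuses hamming() and cv_knn_predict() from above.

    def influential_cases(train, valid, x):
        baseline = cv_knn_predict(train, valid, x)
        influential = []
        for i, (inp, label) in enumerate(train):
            perturbed = list(train)
            perturbed[i] = (inp, 1 - label)     # flip one output bit
            if cv_knn_predict(perturbed, valid, x) != baseline:
                influential.append(hamming(inp, x))
        return sorted(influential)   # distances of the cases that mattered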

I didn't feel any of this was worth getting into in detail in my
original posting. In particular, I did not discuss cross-validation or
other such schemes, because the people to whom I was responding did
not discuss cross-validation. For the record though, in that posting I
was thinking of K nearest neighbor where K is fixed, and on the order
of n. For such a scenario, everything I said is true, as Geoff's own
reasoning shows.



Geoff goes on

>>>
Second, backprop generalizes parity pretty well (on NOVEL test cases) provided
you give it a chance.  If we use the "standard" net with n hidden units (we
can get by with less) we have (n+1)^2 connections.  To get several bits of
constraint per connection we need p 2^n >> (n+1)^2 where p is the fraction of
all possible cases that are used for training.  For n=6 there are only 64
possible cases and we use 49 connections so this isn't possible.  For n=10, if
we train on 512 cases we only get around 3 or 4% errors on the remaining
cases.  Of course training is slow for 10 bit parity: About 2000 epochs even
with adaptive learning rates on each connection (and probably a million epochs
if you choose a bad enough version of backprop.)
>>>
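
(For anyone who wants to get a feel for this experiment, here is a
rough Python sketch of the 10-bit parity setup, using an off-the-shelf
multilayer perceptron as a stand-in for "standard" backprop with n
logistic hidden units - i.e. the (n+1)^2 = 121 connections Geoff
mentions. The hyperparameters are my own guesses; I make no claim of
reproducing his exact error rates or epoch counts.)

    # 10-bit parity: train on 512 of the 1024 cases, test on the rest.
    # Sketch only - a scikit-learn MLP standing in for "standard"
    # backprop; hyperparameters are guesses, not Geoff's settings.
    import itertools
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    n = 10
    X = np.array(list(itertools.product((0, 1), repeat=n)))
    y = X.sum(axis=1) % 2                        # n-bit parity labels

    idx = np.random.default_rng(0).permutation(len(X))
    train, test = idx[:512], idx[512:]           # p = 1/2 of all cases

    net = MLPClassifier(hidden_layer_sizes=(n,), activation='logistic',
                        max_iter=5000, tol=1e-6, random_state=0)
    net.fit(X[train], y[train])
    print("error on novel cases:", 1.0 - net.score(X[test], y[test]))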

By and large, I agree with what Geoff says here. However, I think it
misses the point. XOR is
parity for low n, not high n, and my comments were based on extending
XOR (since that's what the people I was responding to were talking
about). Accordingly, it's the low n case which was of interest, and as
Geoff agrees, backprop dies a gruesome death for low n. Again, I
didn't want to get into all this in my original posting.



However, while we're on the subject of cross-validation and the like,
I'd like to direct the attention of the connectionist community to an
article by Schaffer in the Feb. '93 issue of the ML journal, which
suggests that cross-validation fails as often as it succeeds, on
average. So it is
by no means a panacea. Formal arguments on this topic can be found in a
Complex Systems paper of mine from last year, and also in a new paper,
directly addressing Schaffer's experiments, which I plan to post in a
week or so.






David Wolpert


