yet more interp. vs. gen.

David Wolpert dhw at t13.Lanl.GOV
Tue Sep 10 13:53:35 EDT 1991


Geoff Hinton writes, concerning using generalizers on the
parity problem:

"Curiously, K nearest neighbors can do a good job of generalizing
parity from randomly distributed examples IFF we use generalization on
a validation set to determine the optimal value of K.  Each n-bit input
vector has n neighbors that differ on 1 bit, but order n^2 neighbors
that differ on two bits (and have the same parity). So the
generalization to a validation set will be best for K small and even."

1) I don't understand the conclusion; for K small (e.g., <~ n), as
Geoff points out, the nearest neighbors are the points differing by 1 bit,
which have the wrong parity, so you run into the problem of increasingly
bad generalization as the training set grows. In fact,

2) I've done tests with K <= n on parity, using a weighted (by Hamming
distance) average of those K neighbors, and generalization error
off of the training set definitely gets worse as the training set
size increases, asymptoting at 100% error. Indeed, if K <= n, I
don't care what kind of weighting one uses, or what the actual value of
K is; 100% error will occur for a training set consisting of every point
in the space except the single off-training-set question (all K nearest
neighbors of that question then lie at Hamming distance 1 and so have
the wrong parity).
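
To make this concrete, here is a rough sketch of the kind of test I
mean; the particular n, K, and the 1/distance weighting below are
illustrative choices, not my exact setup:

# Rough sketch: Hamming-distance-weighted K-nearest-neighbor guessing on
# n-bit parity. The values of n, K, and the training-set sizes are
# illustrative only.
import itertools, random

def parity(x):
    return sum(x) % 2

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def knn_guess(q, train, K):
    # take the K nearest training points and weight each vote by
    # inverse Hamming distance
    nearest = sorted(train, key=lambda x: hamming(q, x))[:K]
    vote = 0.0
    for x in nearest:
        w = 1.0 / hamming(q, x)
        vote += w if parity(x) else -w
    return 1 if vote > 0 else 0

n, K = 8, 4
space = list(itertools.product((0, 1), repeat=n))
for m in (32, 64, 128, 2 ** n - 1):      # growing training sets
    train = random.sample(space, m)
    test = [q for q in space if q not in train]
    errors = sum(knn_guess(q, train, K) != parity(q) for q in test)
    # off-training-set error; per the argument above it should climb
    # toward 1.0 as m grows
    print(m, errors / len(test))

With the largest training set (all of the space but one point), every
nearest neighbor of the held-out point sits at Hamming distance 1, so
the weighted vote is always wrong.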

3) I don't doubt that if one uses cross-validation to set K one can do
better than if one doesn't (that selection procedure is sketched at the
end of this note). (In fact, this is exactly the tack I took in my 1990
Neural Networks article to beat NETtalk using a weighted average
generalizer.) However, I should point out that if we're
broadening the discussion to allow the technique of cross-validation,
then one should consider non-linear-time-series-type techniques, like
fan generalizers. On the parity problem it turns out that the guessing
of such generalizers is more accurate than either backprop or nearest
neighbors. For example, I've recently run a test on the parity problem
with n = 24, with a training set consisting solely of the 301 sample
points with 2 or fewer of the 24 bits on; fan generalizers have
*perfect* generalization to each of the remaining 16 million + sample
points. Not a single error. (With a bit of work, I've managed to prove
why this behavior occurs.) For some other sets of sample points (e.g.,
for most randomly distributed training sets) the kind of fan generalizers
I've been experimenting with can only make guesses for a subset of the
questions off of the training set; for those points where they can't
make a guess one must use some other technique. Nonetheless, for those
questions where they *can* make a guess, in the thousand or so
experiments I've run on the parity problem they NEVER make an error.
	I should point out that fan generalizers obviously can't have behavior
superior to backprop on *all* target functions. Nonetheless, fan
generalizers *do* have superior behavior for 8 out of the 9 simple
target functions that have been tested so far; even for those target functions
where they don't generalize perfectly, they make fewer errors than
backprop.
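
For reference, here is a rough sketch of the validation-set selection
of K that Geoff describes (what I'm calling cross-validation above);
the value of n, the unweighted majority vote, and the particular
train/validation/test split are illustrative choices only:

# Rough sketch: pick K for nearest-neighbor parity guessing by its error
# on a held-out validation set. All sizes and the voting rule are
# illustrative.
import itertools, random

def parity(x):
    return sum(x) % 2

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def knn_guess(q, train, K):
    nearest = sorted(train, key=lambda x: hamming(q, x))[:K]
    ones = sum(parity(x) for x in nearest)
    return 1 if 2 * ones > K else 0      # majority vote; ties go to 0

n = 8
space = list(itertools.product((0, 1), repeat=n))
random.shuffle(space)
train, valid, test = space[:100], space[100:150], space[150:]

def valid_error(K):
    return sum(knn_guess(q, train, K) != parity(q) for q in valid) / len(valid)

# choose the K that generalizes best to the held-out validation set,
# then measure error on the remaining off-training-set points
best_K = min(range(1, n + 1), key=valid_error)
test_error = sum(knn_guess(q, train, best_K) != parity(q) for q in test) / len(test)
print(best_K, test_error)

Per Geoff's argument, the validation error should favor K small and
even; whether that helps off the training set is what's at issue above.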



David

