redundancy and generalization

max.coltheart at mrc-apu.cam.ac.uk
Mon Oct 21 23:04:38 EDT 1991


Consider the eight words PAT PAD CAT CAD POT POD COT COD. Give a net the
task of translating these from letters to phonemes. Choose any subset of,
say, four items as the training set and, after training to asymptote, test
performance on the other four. Even with a training set that contains all
the information needed for the test set (e.g. PAT POD CAT COD exemplifies
every letter-phoneme pairing twice), the various architectures we have been
trying score 0% on the generalization set. In this example the net learns
nothing about the third letter, so in the generalization test it translates
PAD as "pat", POT as "pod", COT as "cod" and CAD as "cat". Is this problem,
trivial for rule-learning algorithms, insoluble for any system that learns
by error-correction?
 
Tom Dietterich writes:
 
>Generally speaking, in noise-free domains, windowing works quite well.
>A very high-performing decision tree can be learned with a relatively
>small window.  However, for noisy data, the general experience has
>been that the window eventually grows to include the entire training set.
>Jason Catlett (Sydney U) recently completed his dissertation on
>testing windowing and various other related tricks on datasets of
>roughly 100K examples (straight classification problems).  I recommend
>his papers and thesis.
>
>His main conclusion is that if you want high performance, you need to
>look at all of the data.
"The window eventually grows to include the entire training set" = "the
system is incapable of generalizing accurately ". Note that noise isn't the
problem. In
my example, there's no noise, and no generalization
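 
For readers unfamiliar with windowing, here is a rough sketch of the
general procedure the quoted passage refers to (Python, assuming a generic
fit/predict learner; an illustration of the idea, not anyone's actual
implementation):
 
import random

def windowing(examples, labels, fit, predict, initial_size=100):
    # Start from a small random window of the training data.
    indices = list(range(len(examples)))
    random.shuffle(indices)
    window = set(indices[:initial_size])
    while True:
        model = fit([examples[i] for i in window],
                    [labels[i] for i in window])
        # Examples outside the window that the current model misclassifies.
        wrong = [i for i in indices
                 if i not in window and predict(model, examples[i]) != labels[i]]
        if not wrong:
            # Clean data: often stops while the window is still small.
            return model, window
        # Noisy data: the window tends to keep growing until it contains
        # the whole training set.
        window.update(wrong)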
 
Max Coltheart
max.coltheart at mrc-apu.cam.ac.uk
 

