'malicious' training tokens

John.Hampshire@SPEECH2.CS.CMU.EDU
Tue Mar 13 16:59:24 EST 1990


Fundamentally I agree with Gerry Tesauro's argument that
training tokens in the vicinity of a random vector's true
class boundaries are the best ones to use if you want
good generalization --- these tokens will delineate the
optimal class boundaries.  I think there's a caveat though:

Say that you could consistently obtain training tokens in the
vicinity of the optimal class boundaries.  If you could
get an arbitrarily large number of independent training tokens,
then you could build a perceptron (or for non-linear class boundaries,
an MLP) with appropriate connectivity for good generalization.
If, however, you were severely limited in the number of
independent training samples you could obtain (again, all of them
near the optimal class boundaries), then you'd have too little
data to avoid bad inferences about the underlying distribution
--- and you'd get crummy generalization.  This would happen
because your classifier needs enough parameters to learn the
training tokens; yet that degree of parameterization leads to
rotten generalization when data are scarce.
In cases of limited training data, then, it may be better to
have training tokens near the modes of the class conditional
densities (and away from the optimal boundaries) in order for
you to at least make a good inference about the prototypical
nature of the RV being classified.  These tokens would also require
a lower degree of parameterization in your classifier, which
would give better performance on disjoint test data.  I haven't
read Baum's paper and I wouldn't presume to put words in
anyone's mouth, but maybe this is what he was getting at by
characterizing near-boundary tokens as 'malicious'.
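The trade-off above can be played with numerically.  Here's a minimal
sketch (my own construction, not from Baum or Tesauro): two hypothetical
1-D Gaussian class-conditional densities with means at -2 and +2 and unit
variance, so the optimal boundary sits at x = 0.  With only a handful of
training tokens, we estimate a one-parameter threshold classifier from
tokens drawn near the boundary versus tokens drawn near the class modes,
and score each on a large disjoint test set.  The sampling regions and
sample sizes are illustrative choices, and which scheme wins depends on
the particular draw.

```python
import numpy as np

# Two 1-D Gaussian class-conditional densities, means -2 and +2, unit
# variance, equal priors; the optimal class boundary is x = 0.
rng = np.random.default_rng(0)

def sample_class(mean, n, region):
    """Rejection-sample n tokens from N(mean, 1) restricted to region."""
    out = []
    while len(out) < n:
        x = rng.normal(mean, 1.0, size=4 * n)
        out.extend(x[region(x)][: n - len(out)])
    return np.array(out)

n_train = 5  # severely limited training data, as in the post

# Tokens near the optimal boundary (|x| < 1) ...
near_bdry_neg = sample_class(-2.0, n_train, lambda x: np.abs(x) < 1.0)
near_bdry_pos = sample_class(+2.0, n_train, lambda x: np.abs(x) < 1.0)
# ... versus tokens near the modes (within 0.5 of each class mean).
near_mode_neg = sample_class(-2.0, n_train, lambda x: np.abs(x + 2.0) < 0.5)
near_mode_pos = sample_class(+2.0, n_train, lambda x: np.abs(x - 2.0) < 0.5)

def threshold(neg, pos):
    """One-parameter classifier: threshold halfway between sample means."""
    return 0.5 * (neg.mean() + pos.mean())

# Large disjoint test set drawn from the true densities.
test_neg = rng.normal(-2.0, 1.0, 10000)
test_pos = rng.normal(+2.0, 1.0, 10000)

def accuracy(t):
    return 0.5 * ((test_neg < t).mean() + (test_pos >= t).mean())

acc_bdry = accuracy(threshold(near_bdry_neg, near_bdry_pos))
acc_mode = accuracy(threshold(near_mode_neg, near_mode_pos))
print(f"near-boundary tokens: {acc_bdry:.3f}, near-mode tokens: {acc_mode:.3f}")
```

A richer version of the same experiment would replace the one-parameter
threshold with a more heavily parameterized classifier, which is where
the overfitting effect described above really bites.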

Incidentally, Duda & Hart (sec. 5.5) give a convergence proof for the
perceptron criterion function that illustrates why it takes so
long to learn 'malicious' training tokens near the optimal class
boundaries.
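The slow convergence has a well-known quantitative face: for linearly
separable data the number of fixed-increment corrections is bounded by
(R / gamma)^2, where R is the largest sample norm and gamma is the margin
of the closest token, so near-boundary tokens (small gamma) can force
many more corrections.  A small sketch on hypothetical data of my own
choosing (two classes separable through the origin with margin 0.1):

```python
import numpy as np

# Fixed-increment perceptron rule on hypothetical, linearly separable
# 2-D data.  The unit separator (0, 1) gives every token a margin of
# at least gamma = 0.1, so the number of corrections is bounded by
# (R / gamma)^2 with R = max ||x||.
X = np.array([[1.0, 0.1], [1.0, -0.1], [0.5, 0.2], [0.5, -0.2]])
y = np.array([+1, -1, +1, -1])

def train_perceptron(X, y, max_epochs=1000):
    """Cycle through the tokens, correcting w on each misclassification."""
    w = np.zeros(X.shape[1])
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi         # fixed-increment correction
                updates += 1
                mistakes += 1
        if mistakes == 0:            # a clean pass: converged
            return w, updates
    return w, updates

w, updates = train_perceptron(X, y)
R = np.linalg.norm(X, axis=1).max()
print(f"updates = {updates}, worst-case bound = {(R / 0.1) ** 2:.0f}")
```

Shrinking the second coordinate of the first two tokens (moving them
toward the boundary) shrinks gamma and inflates the worst-case bound
quadratically, which is one way to see why 'malicious' tokens are slow
to learn.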

John

