Ph.D. thesis: combining nearest neighbor classifiers

David Skalak skalak at us.ibm.com
Wed Nov 19 15:11:05 EST 1997


The following Ph.D. dissertation is available:

Prototype Selection for Composite Nearest Neighbor Classifiers
David B. Skalak
Dept. of Computer Science
University of Massachusetts
Amherst, MA 01003-4610

The dissertation and other publications can be obtained from my homepage:

http://www.cs.cornell.edu/Info/People/skalak

The dissertation can also be retrieved directly from

http://www.cs.cornell.edu/Info/People/skalak/thesis-header-dspace.ps.gz


Keywords:
classifier combination, ensemble methods, stacked generalization,
boosting, accuracy-diversity graphs, classifier sampling,
k-nearest neighbor, prototype selection, reference selection,
editing algorithms, instance taxonomy, coarse classification,
deliberate misclassification


Abstract:
Combining the predictions of a set of classifiers has been shown to be an
effective way to create composite classifiers that are more accurate than any
of the component classifiers.  Increased accuracy has been shown in a variety
of real-world applications, ranging from protein sequence identification to
determining the fat content of ground meat.

Despite such individual successes, fundamental questions about classifier
combination remain unanswered, such as ``Can classifiers from any
given model class be combined to create a composite classifier with higher
accuracy?''  or ``Is it possible to increase the accuracy of a given
classifier by combining its predictions with those of only a small number of
other classifiers?''.  The goal of this dissertation is to provide answers to
these and closely related questions with respect to a particular model class,
the class of nearest neighbor classifiers.

We undertake the first in-depth study of combining nearest neighbor
classifiers.  Although previous research has questioned the
utility of combining nearest neighbor classifiers, we introduce algorithms
that combine a small number of component nearest neighbor classifiers, where
each of the components stores a small number of prototypical instances.  In
a variety of domains, we show that these algorithms yield composite
classifiers that are more accurate than a nearest neighbor classifier that
stores all training instances as prototypes.
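
To give a concrete flavor of such composite classifiers, here is a minimal
sketch in Python.  It is an illustration only, not the dissertation's
algorithms: a few 1-nearest-neighbor components, each storing a handful of
randomly sampled prototypes, with predictions combined by simple majority
vote.  The function names, the choice of 1-NN components, and the voting
scheme are assumptions made for the example.

    # Sketch of a composite nearest neighbor classifier (illustrative only).
    # Assumptions: 1-NN components, prototypes chosen by plain random sampling,
    # predictions combined by unweighted majority vote.
    import numpy as np
    from collections import Counter

    def one_nn_predict(prototypes, labels, x):
        """Label of the prototype closest to x (Euclidean distance)."""
        dists = np.linalg.norm(prototypes - x, axis=1)
        return labels[np.argmin(dists)]

    def build_components(X, y, n_components=3, n_prototypes=5, seed=0):
        """Each component stores a small, randomly sampled set of prototypes."""
        rng = np.random.default_rng(seed)
        components = []
        for _ in range(n_components):
            idx = rng.choice(len(X), size=n_prototypes, replace=False)
            components.append((X[idx], y[idx]))
        return components

    def composite_predict(components, x):
        """Majority vote over the component 1-NN predictions."""
        votes = [one_nn_predict(P, L, x) for P, L in components]
        return Counter(votes).most_common(1)[0][0]
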

The research presented in this dissertation also extends previous work on
prototype selection for an independent nearest neighbor classifier.  We show
that in many domains, storing a very small number of prototypes can provide
classification accuracy greater than or equal to that of a nearest neighbor
classifier that stores all training instances.  We extend previous work by
demonstrating that algorithms that rely primarily on random sampling can
effectively choose a small number of prototypes.
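
The random-sampling idea for a single classifier can be sketched in the same
spirit (again an illustrative assumption, not the dissertation's exact
procedure): draw many small candidate prototype sets at random, score each on
the training data, and keep the best-scoring set.

    # Sketch of prototype selection by random sampling (illustrative only).
    import numpy as np

    def one_nn_accuracy(prototypes, proto_labels, X, y):
        """Fraction of (X, y) classified correctly by 1-NN over the prototypes."""
        correct = 0
        for x, target in zip(X, y):
            dists = np.linalg.norm(prototypes - x, axis=1)
            correct += proto_labels[np.argmin(dists)] == target
        return correct / len(X)

    def select_prototypes(X, y, n_prototypes=5, n_samples=200, seed=0):
        """Keep the best of n_samples randomly drawn prototype subsets."""
        rng = np.random.default_rng(seed)
        best_idx, best_acc = None, -1.0
        for _ in range(n_samples):
            idx = rng.choice(len(X), size=n_prototypes, replace=False)
            acc = one_nn_accuracy(X[idx], y[idx], X, y)
            if acc > best_acc:
                best_idx, best_acc = idx, acc
        return X[best_idx], y[best_idx]
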

Regards,
David.

David B. Skalak, Ph.D.
Senior Data Mining Analyst
IBM Global Business Intelligence Solutions
IBM tie-line:  320-9774; Telephone: 607 257-5510
Internet: skalak at us.ibm.com
