Q: Statistical Evaluation of Classifiers

Arthur Flexer arthur at mail4.ai.univie.ac.at
Mon Jan 2 22:05:27 EST 1995


Dear colleagues,

I am looking for references to systematic accounts of the problem of the
statistical evaluation of (statistical, machine learning or neural network)
classifiers, i.e. systematic accounts of statistical tests which can be
employed if one wants to determine whether observed performance differences
are indeed caused by the varied independent variables (e.g. kind of method,
certain parameters of the method, data set used, ...) and not by mere chance.
What I am looking for are the appropriate statistical tests for the
significance of the observed performance differences.

To make things clearer, let me simplify the case and give some references
to the literature that I have already found:

Problem 1:
You have one method of classification (e.g. a neural network) and one data
set.
There are several network parameters to tune (number of layers, learning rate,
...) and you are looking for optimal performance on your data set.
So the independent variables are the network's parameters and the dependent
variable is accuracy (i.e. percent of correct classifications) (see Kibler &
Langley 1988 for an account of machine learning as an experimental science).

For each parameter setting, multiple neural networks should be trained to
rule out the influence of different training sets, weight initialisations and
so on. One could even employ resampling schemes like the bootstrap, the
jackknife and the like (see Michie et al. 1994 for an overview) for each
parameter setting.
The mean accuracy over all runs then serves as the observed performance
criterion for each parameter setting.
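
Just to make the repeated-run setup concrete, here is a small sketch in Python
(made-up numbers; train_and_test() is a hypothetical stand-in for training one
network under a given parameter setting and returning its test-set accuracy).
The bootstrap part merely resamples the collected run accuracies to attach a
standard error to the mean:

# Sketch: mean accuracy and its bootstrap standard error over repeated runs
# for ONE parameter setting.
import numpy as np

rng = np.random.default_rng(0)

def train_and_test(params):
    # hypothetical placeholder: train a network with these parameters on a
    # fresh split/initialisation and return its test-set accuracy
    return 0.80 + 0.02 * rng.standard_normal()

params = {"hidden_layers": 2, "learning_rate": 0.1}   # one parameter setting
accuracies = np.array([train_and_test(params) for _ in range(30)])

boot_means = np.array([rng.choice(accuracies, size=len(accuracies),
                                  replace=True).mean()
                       for _ in range(1000)])
print(f"mean accuracy {accuracies.mean():.3f}, "
      f"bootstrap s.e. {boot_means.std(ddof=1):.3f}")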

A statistical test that can be employed to test the significance of the
differences in observed mean accuracies is the t-test. Finnoff et al. 1992
and Hergert et al. 1992 use "a robust modification of a t-test statistic" for
such comparisons.

Problem 2:
is similar to Problem 1. Instead of one method of classification and one data
set, there are several methods of classification and you want to know which
of them shows optimal performance on one data set.
Just proceed as stated under Problem 1, replacing the parameter settings with
the different methods.
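
If the different methods are evaluated on the same resampled training/test
splits, the per-split accuracies are paired and a paired t-test is the natural
analogue (again only a sketch with made-up numbers):

# Sketch: paired comparison of two classifiers evaluated on the SAME splits.
import numpy as np
from scipy import stats

acc_net  = np.array([0.81, 0.79, 0.83, 0.80, 0.82])  # method 1, splits 1..5
acc_tree = np.array([0.78, 0.80, 0.79, 0.77, 0.80])  # method 2, same splits

t, p = stats.ttest_rel(acc_net, acc_tree)
print(f"paired t = {t:.2f}, p = {p:.4f}")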

Problem 3:
is quite difficult and I have no solution yet :).
You have one method of classification and several data sets. You want to know
on which data set your algorithm performs best (again in terms of mean
accuracy). The problem is that the different data sets have different numbers
of classes and different class probabilities. E.g. one data set has N=100 with
50 members in the first class and 50 in the second. Another data set has N=100
with 20 members in the first class, 30 in the second, 20 in the third and 30
in the fourth.
Therefore, an accuracy of 50% would be only as good as chance on the first
data set, but might be quite an achievement on the second. This problem has
been addressed by Kononenko & Bratko 1991 from an information-based point of
view.
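
Just to spell out the chance baselines implied by the class priors (this is
only the prior/majority baseline, not the information-based criterion of
Kononenko & Bratko):

# Sketch: chance-level baselines implied by the class priors of each data set.
import numpy as np

datasets = {
    "two equal classes":   np.array([50, 50]) / 100.0,          # N=100
    "four uneven classes": np.array([20, 30, 20, 30]) / 100.0,  # N=100
}

for name, priors in datasets.items():
    random_guess = float(np.sum(priors ** 2))  # guess class i with prob. p_i
    majority     = float(priors.max())         # always predict the largest class
    print(f"{name}: random-guess accuracy {random_guess:.2f}, "
          f"majority-class accuracy {majority:.2f}")

For the first data set both baselines are 0.50, so 50% accuracy is no better
than chance; for the second they are 0.26 and 0.30, so 50% accuracy is well
above chance there.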

Problem 4:
would of course be the ultimate:
Several methods and several data sets.

As you can see from the references given above, I am aware that there
*are* some pointers in the literature. But as the problem of classification
has been around for quite a while (at least for statisticians), I am wondering
whether a systematic and extensive overview of the methods to employ already
exists.

On the other hand, awareness of the need for such statistical evaluation is
often very low :(.

So the question is: Is there already a comprehensive text on these matters, or
do we all have to pick the information out of the standard statistics textbooks?

Regards and thanks for any help, Arthur.

-----------------------------------------------------------------------------
Arthur Flexer					       arthur at ai.univie.ac.at
Austrian Research Inst. for Artificial Intelligence    +43-1-5336112(Tel)
Schottengasse 3, A-1010 Vienna, Austria, Europe        +43-1-5320652(Fax) 

Literature:

Finnoff W., Hergert F., Zimmermann H.G.: Improving Generalization
	Performance by Nonconvergent Model Selection Methods, in Aleksander I. &
	Taylor J.(eds.), Artificial Neural Networks, 2, North-Holland,
	Amsterdam, pp.233-236, 1992.

Hergert F., Zimmermann H.G., Kramer U., Finnoff W.: Domain Independent
	Testing and Performance Comparisons for Neural Networks, in 
	Aleksander I. &	Taylor J.(eds.), Artificial Neural Networks, 2,
	North-Holland, Amsterdam, pp.1071-1076, 1992.

Kononenko I., Bratko I.: Information-Based Evaluation Criterion for
	Classifiers' Performance, Machine Learning, 6(1), 1991.

Kibler D., Langley P.: Machine Learning as an Experimental Science, Machine
	Learning, 3(1), 5-8, 1988.
  	
Michie D., Spiegelhalter D.J., Taylor C.C.(eds.): Machine Learning, Neural
	and Statistical Classification, Ellis Horwood, England, 1994.


