Cross-validation theory
dhw@santafe.edu
Sat Aug 13 01:19:30 EDT 1994
Mark Plutowski recently said on Connectionists:
>>>
A discussion on the Machine Learning List prompted the question
"Have theoretical conditions been established under which
cross-validation is justified?"
The answer is "Yes."
>>>
As Mark goes on to point out, there are decades
of work on cross-validation from a likelihood-driven
sampling theory perspective. Indeed, Mark's thesis is a
major addition to that literature.
Mark then correctly notes that this literature doesn't *directly*
apply to the discussion on the ML list, since that discussion involves
off-training-set rather than iid error.
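
(To make that distinction concrete, here is a minimal sketch, assuming a
toy finite input space and a learner that simply memorizes its training
set; every particular below is an illustration of mine, not taken from
any of the work cited in this note. The point is that iid error averages
over the whole input distribution, training points included, while
off-training-set error averages only over the points the learner never saw.)

    import random

    random.seed(0)
    X = list(range(20))                 # toy finite input space
    target = {x: x % 2 for x in X}      # "true" function: parity of x
    train_x = random.sample(X, 10)      # training inputs
    memo = {x: target[x] for x in train_x}

    def guess(x):
        # Memorize the training set; guess 0 everywhere else.
        return memo.get(x, 0)

    # iid error: zero-one loss averaged over the full (uniform) distribution.
    iid_err = sum(guess(x) != target[x] for x in X) / len(X)

    # Off-training-set error: same loss, but only over unseen points.
    ots = [x for x in X if x not in train_x]
    ots_err = sum(guess(x) != target[x] for x in ots) / len(ots)

    print("iid error:", iid_err, "off-training-set error:", ots_err)

The iid figure is diluted by the memorized training points; the
off-training-set figure is not, which is why the two can behave very
differently.
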
It should be noted that there is another important distinction between
the framework Mark uses and the implicit framework in the ML list
discussion; the latter has been concerned with zero-one loss, whereas
Mark's work concentrates on quadratic loss. The no-free-lunch results
being discussed on the ML list change form drastically if one uses
quadratic rather than zero-one loss. That should not be too surprising,
given, for example, the results in Michael Perrone's work on combining
estimators under quadratic loss.
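
(Again as an illustration only: here is a minimal sketch of how the
choice of loss enters a cross-validation estimate. The data-generating
process, the constant-prediction "model," and the fold scheme are all
assumptions chosen for the example, not anyone's published method.)

    import random

    random.seed(1)
    n, k = 100, 5
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [1 if x + random.gauss(0, 0.3) > 0 else 0 for x in xs]

    def fit_mean(train):
        # "Model": predict the training-set mean of y (a constant).
        return sum(y for _, y in train) / len(train)

    quad = zo = 0.0
    pairs = list(zip(xs, ys))
    for fold in range(k):
        test = [p for i, p in enumerate(pairs) if i % k == fold]
        train = [p for i, p in enumerate(pairs) if i % k != fold]
        m = fit_mean(train)
        # Quadratic loss on the held-out fold.
        quad += sum((y - m) ** 2 for _, y in test) / len(test)
        # Zero-one loss on the same fold, thresholding at 0.5.
        zo += sum((m > 0.5) != y for _, y in test) / len(test)

    print("CV quadratic:", quad / k, "CV zero-one:", zo / k)

The two scores are estimates of different quantities, and in general
they need not rank candidate models the same way.
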
It's also worth noting that even in the regime of iid error and quadratic
loss, there is still much to be understood. For example, in the
average-data scenario of sampling theory statistics that Mark uses,
asymptotic properties are better understood than finite-data properties.
And in the Bayesian this-data framework, very little is known for any
data-size regime (though some work by Dawid on this subject comes to mind).
David Wolpert