No subject

dhw@santafe.edu dhw at santafe.edu
Fri Aug 12 16:32:37 EDT 1994


Mark Plutowski recently said on connectionist:


>>>
A discussion on the Machine Learning List prompted the question 

"Have theoretical conditions been established under which 
cross-validation is justified?"  

The answer is "Yes." 
>>>



Mark is being polite by not using names; I am the one he is
(implicitly) taking to task, for the following comment on the
ML list:

>>>
... an assumption *must always* be
present if we are to have any belief in learnability in the problem
at hand...
However, to give just one example, nobody has yet delineated
just what those assumptions are for the technique of cross-validation.
>>>

Mark is completely correct in his (implicit) criticism.

As he says, there have in fact been decades of work analyzing 
cross-validation from a sampling theory perspective. Mark's thesis is 
a major contribution to this literature. Any implications coming from my
message that such literature doesn't exist or is somehow invalid
are completely mistaken and were not intended. (Indeed, I've had 
several very illuminating discussions with Mark about his thesis!)
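For readers unfamiliar with the technique under discussion, here is a minimal leave-one-out cross-validation sketch; the polynomial model, data, and degree are hypothetical choices made purely for illustration, not anything from Mark's thesis:

```python
import numpy as np

# Toy regression problem (hypothetical): noisy samples of a sine curve.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=30)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=30)

def fit_predict(train_x, train_y, test_x, degree=3):
    # Ordinary least-squares polynomial fit; the degree is arbitrary here.
    coeffs = np.polyfit(train_x, train_y, degree)
    return np.polyval(coeffs, test_x)

# Leave-one-out: each point is predicted by a model trained on the rest.
loo_errors = []
for i in range(len(x)):
    mask = np.arange(len(x)) != i
    pred = fit_predict(x[mask], y[mask], x[i:i + 1])
    loo_errors.append((pred[0] - y[i]) ** 2)

# The mean held-out squared error estimates expected (iid) quadratic
# generalization error -- the quantity sampling-theory analyses address.
cv_estimate = np.mean(loo_errors)
print(cv_estimate)
```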

The only defense for the imprecision of my comment is that I made 
it in the context of the ongoing discussion on the ML list that 
Mark referred to. That discussion concerned off-training set error 
rather than iid error, so my comments implicitly assumed off-training 
set error. And as Mark notes, his results (and the others in the 
literature) don't extend to that kind of error.
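The iid/off-training-set distinction is easy to state concretely. The following sketch (a deliberately contrived memorizing learner on a small discrete input space, all hypothetical) shows the two error measures diverging: iid error lets test points coincide with training inputs, while off-training-set (OTS) error forbids that:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete input space so "off-training-set" is well defined.
X_SPACE = np.arange(20)

def target(x):
    # The (unknown) true labeling rule for this toy problem.
    return (x % 3 == 0).astype(int)

# Training sample drawn iid (with replacement) from the input space.
train_x = rng.choice(X_SPACE, size=10)
train_y = target(train_x)

def predict(x):
    # A memorizer: recall the stored label for seen inputs,
    # guess the majority training label otherwise.
    seen = dict(zip(train_x, train_y))
    majority = int(train_y.mean() >= 0.5)
    return np.array([seen.get(xi, majority) for xi in x])

# iid error: test inputs drawn from the SAME distribution, so they
# may repeat training inputs (which the memorizer gets exactly right).
iid_x = rng.choice(X_SPACE, size=10000)
iid_err = np.mean(predict(iid_x) != target(iid_x))

# OTS error: test inputs restricted to those NOT in the training sample.
ots_inputs = np.setdiff1d(X_SPACE, train_x)
ots_x = rng.choice(ots_inputs, size=10000)
ots_err = np.mean(predict(ots_x) != target(ots_x))

print(iid_err, ots_err)
```

The memorizer's iid error is diluted by the test points it has already seen; its OTS error is not, which is why guarantees proved for iid error need not carry over.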

(Another important distinction between the framework Mark uses and the
implicit framework in the ML list discussion is that the latter has
been concerned with zero-one loss, whereas Mark's work concentrates on
quadratic loss. The no-free-lunch results being discussed on the ML
list change form drastically if one uses quadratic rather than
zero-one loss. That should be no surprise, given, for example, the
results in Michael Perrone's work.)
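A one-point illustration of why the choice of loss matters so much (the numbers are made up; the averaging step is in the spirit of Perrone-style ensemble averaging, not a reproduction of his results): under quadratic loss, averaging two predictors can never do worse than the mean of their individual losses, by convexity of the square; under zero-one loss there is no such guarantee.

```python
y = 1.0                        # true target value
f1, f2 = 0.9, 0.0              # two individual predictions (hypothetical)
avg = (f1 + f2) / 2            # the ensemble average: 0.45

# Quadratic loss: the average's loss is at most the mean individual loss.
quad = lambda f: (f - y) ** 2
mean_quad = (quad(f1) + quad(f2)) / 2   # 0.505
avg_quad = quad(avg)                    # 0.3025 -- averaging helps

# Zero-one loss (threshold at 0.5): the average lands on the wrong side,
# so its loss EXCEEDS the mean individual loss.
zo = lambda f: float((f >= 0.5) != (y >= 0.5))
mean_zo = (zo(f1) + zo(f2)) / 2         # 0.5
avg_zo = zo(avg)                        # 1.0 -- averaging hurts here

print(avg_quad, mean_quad, avg_zo, mean_zo)
```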

While on the subject of cross-validation and iid error, though, it's
interesting to note that there is still much to be understood. For
example, in the average-data scenario of sampling theory statistics
that Mark uses, asymptotic properties are better understood than
finite data properties. And in the Bayesian this-data framework, very
little is known for any data-size regime (though some work by Dawid on
this subject comes to mind).




David Wolpert




