compressibility and generalization

hicks at cs.titech.ac.jp
Mon Dec 11 20:01:05 EST 1995


"Michael Perrone" <mpp at watson.ibm.com> wrote:
>[hicks at cs.titech.ac.jp wrote:]
>> PSS. What is anti-cross validation?
>Suppose we are given a set of functions and a crossvalidation data set.
>The CV and Anti-CV algorithms are as follows:
>     CV: Choose the function with the best  performance on the CV set.
>Anti-CV: Choose the function with the worst performance on the CV set.
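
In code, the two rules are just opposite extremes of the same selection
step.  A minimal sketch (the squared-error loss and the list-of-(x, y)-pairs
format for the CV set are my own illustrative assumptions, not anything from
Michael's post):

    def cv_error(f, cv_set):
        # Mean squared error of candidate f on the held-out CV set.
        return sum((f(x) - y) ** 2 for x, y in cv_set) / len(cv_set)

    def cv_select(candidates, cv_set):
        # CV: the candidate with the best (lowest) CV-set error.
        return min(candidates, key=lambda f: cv_error(f, cv_set))

    def anti_cv_select(candidates, cv_set):
        # Anti-CV: the candidate with the worst (highest) CV-set error.
        return max(candidates, key=lambda f: cv_error(f, cv_set))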

case 1: 
*	Either the target function is (noise/incompressible/has no structure),
or none of the candidate functions has any correlation with the target
function.*
	In this case both Anti-CV and CV provide (ON AVERAGE) equal
improvement in prediction ability: none.  For that matter, so will ANY method
of selection.  (The sketch after case 2 below simulates this.)
	Moreover, if we plot the number of data points used for training
against the estimated error (estimated on the held-out residual data), we
will (ON AVERAGE) see no decrease in estimated error.  Since CV provides an
estimated prediction error, it can also tell us "you might as well be using
anti-cross-validation, or random selection for that matter, because any of
them will be equally useless".

case 2: 
*	The target (is compressible/has structure), and some of the candidate
functions are positively correlated with the target function.*
	In this case CV will outperform anti-CV (ON AVERAGE), as the sketch
below illustrates.
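
Both cases can be checked with a small Monte Carlo experiment that averages
over many samples drawn from a fixed target, which is exactly the ON AVERAGE
sense defined below.  The sketch is mine, not anything from the thread: the
candidate set, the 0/1 loss, and the two toy targets are arbitrary
illustrative choices.  In case 1 every candidate has true error exactly 0.5,
so CV, anti-CV, and random selection come out equal on average; in case 2,
CV should land near 0.1, anti-CV near 0.9, and random selection near 0.5.

    import random

    def err(f, data):
        # 0/1 loss: the fraction of points the candidate gets wrong.
        return sum(f(x) != y for x, y in data) / len(data)

    def trial(sample_xy, candidates, n_cv=20, n_test=2000):
        # One sample from the fixed target: a small CV set for selection
        # and a large fresh set to estimate true prediction error.
        cv = [sample_xy() for _ in range(n_cv)]
        test = [sample_xy() for _ in range(n_test)]
        cv_pick = min(candidates, key=lambda f: err(f, cv))    # CV
        anti_pick = max(candidates, key=lambda f: err(f, cv))  # anti-CV
        rand_pick = random.choice(candidates)                  # random
        return err(cv_pick, test), err(anti_pick, test), err(rand_pick, test)

    # Four candidate classifiers of x in [0, 1).
    candidates = [lambda x: x < 0.5, lambda x: x >= 0.5,
                  lambda x: True, lambda x: False]

    def case1():
        # Incompressible target: the label is a fair coin flip, independent
        # of x, so every candidate has true error exactly 0.5.
        return random.random(), random.random() < 0.5

    def case2():
        # Structured target: label = (x < 0.5) with 10% label noise, so one
        # candidate is positively correlated with the target.
        x = random.random()
        return x, (x < 0.5) if random.random() < 0.9 else (x >= 0.5)

    for name, sample_xy in [("case 1", case1), ("case 2", case2)]:
        runs = [trial(sample_xy, candidates) for _ in range(1000)]
        avg = tuple(sum(r[i] for r in runs) / len(runs) for i in range(3))
        print("%s: CV=%.3f  anti-CV=%.3f  random=%.3f" % ((name,) + avg))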


By ON AVERAGE I mean the expectation across the ensemble of samples for
a FIXED target function.  This is different from the ensemble and distribution
of target functions, which is a much bigger question.  We already know much
about the ensemble of samples from a fixed target function.  I am not
avoiding the issue of the ensemble or distribution of target functions, but
merely showing that we have 2 general cases, and that in both of them CV is
never WORSE than anti-CV.  It follows that whatever the distribution of
targets is, CV is never worse (ON AVERAGE) than anti-CV.

I don't believe this contradicts NFL (No Free Lunch) in any way.  It just
clarifies the role that CV can play.

Learning and monitoring prediction error go hand in hand.
This is even more true when the underlying function
may be changing and the data arrive as an infinite stream.
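
As an illustration of that last point, here is a sketch of test-then-train
("prequential") monitoring on a stream whose underlying function changes
halfway through.  The LMS learner, the smoothing factor, and the drifting
stream are all hypothetical choices of mine; the point is only that each
point is scored before it is learned from, so the monitored error is an
honest running estimate of prediction error, and a change in the target
shows up as a bump in it.

    import random

    class LMS:
        # One-weight least-mean-squares learner, updated online.
        def __init__(self, lr=0.05):
            self.w, self.lr = 0.0, lr
        def predict(self, x):
            return self.w * x
        def update(self, x, y):
            self.w += self.lr * (y - self.predict(x)) * x

    def prequential(stream, model, alpha=0.02):
        # Test-then-train: score each point before learning from it, and
        # keep an exponentially weighted running error.
        running = 0.0
        for t, (x, y) in enumerate(stream):
            loss = (model.predict(x) - y) ** 2
            model.update(x, y)
            running += alpha * (loss - running)
            yield t, running

    def drifting_stream(n=4000):
        # Hypothetical target that flips halfway: y = x, then y = -x.
        for t in range(n):
            x = random.gauss(0, 1)
            slope = 1.0 if t < n // 2 else -1.0
            yield x, slope * x + random.gauss(0, 0.1)

    for t, e in prequential(drifting_stream(), LMS()):
        if t % 500 == 0:
            print(t, round(e, 3))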


Craig Hicks
Tokyo Institute of Technology


