compressibility and generalization

hicks@cs.titech.ac.jp
Thu Dec 7 19:49:53 EST 1995



finnoff at predict.com (William Finnoff) wrote:
>Reading some of the recent postings concerning NFL theorems, it appears
>that there is still some misunderstandings about what they refer to in
>the versions dealing with statistical inference.   For example,  Craig
>Hicks writes:
>> (paraphrase: I want to clarify the meaning of the following assertion)
>>  (A) cross-validation works as well as anti-cross 
>>      validation (paraphrase: on average)

finnoff at predict.com (William Finnoff) continued:
>An example of this 
>would be the case of a two by two contingency table
>where the inputs are, say, 0=patient received treatment A,
>1=patient received treatment B, and values of the dependent variable
>are 0=patient died within three months, or 1=patient still alive
>after three months.  ... Using the example given above, this corresponds
>to cases where the training data contains no examples of 
>of a patient receiving one of the treatments (for example, where
>the training data only contains examples of patients
>that have received treatment A).   

Since there is no data for treatment B, how can we use cross-validation?  In
this case statement (A) above is not wrong, but it is implicitly occurring
within a context where there is no data to use for cross-validation.  If so,
isn't it rather a trivial statement?  Possibly misleading?
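To make the point concrete, here is a minimal sketch (the data, the two
prediction rules, and their names are all hypothetical, not from the example
above): when every training row received treatment A, the held-out folds of
leave-one-out cross-validation never contain treatment B, so two rules that
differ only in what they predict for treatment B get identical scores.

```python
# Hypothetical 2x2-table data: every row received treatment A (input 0).
# Outcome labels: 1 = alive after three months, 0 = died.
data = [(0, 1), (0, 0), (0, 1), (0, 1)]   # no rows with treatment B (input 1)

def loo_error(predict):
    """Leave-one-out error of a rule mapping (training rows, treatment) -> outcome."""
    errs = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        errs += predict(train, x) != y
    return errs / len(data)

def majority_vote_on_A(train):
    """Majority outcome among treatment-A rows (ties broken toward 'alive')."""
    ones = sum(y for t, y in train if t == 0)
    return int(2 * ones >= sum(1 for t, _ in train if t == 0))

# Two rules that agree on treatment A but disagree on the unseen treatment B:
def rule_b_alive(train, x):
    return majority_vote_on_A(train) if x == 0 else 1   # B-patients live

def rule_b_dead(train, x):
    return majority_vote_on_A(train) if x == 0 else 0   # B-patients die

print(loo_error(rule_b_alive), loo_error(rule_b_dead))
# identical scores: the held-out folds never contain treatment B,
# so cross-validation cannot separate the two rules
```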

finnoff at predict.com (William Finnoff) continued:
>The NFL theorems state that in this case, unless there is some other prior
>information available about the performance of treatment B in keeping patients
>alive, all predictions are equivalent in their average expected performance.

I certainly wouldn't expect cross-validation to work when it can't even be
applied.  And I think it would work just as well as anti-cross validation,
whatever that is, since anti-cross validation is not being applied either.  In
fact, both would score `0', not only on average but every time, since neither
is being used.

----

After further study and reading postings to this list,
my current understanding is that (A) merely means that for any problem
	(cross validation >= 0)
in the sense that it will never be deceptive (never < 0)
when taking the average across the ensemble of samplings.

However, by taking a straight average over a certain infinite
(and arguably universal) ensemble of problems we can obtain
	Expectation[cross validation] = 0
because in this ensemble the positively scoring problems are a vanishingly
small proportion.
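This averaging argument can be checked by a small simulation; the setup below
(domain size, the two constant hypotheses, the scoring convention) is my own
illustrative choice, not anything from the NFL papers.  Draw the target
uniformly at random from all Boolean functions, let "cross-validation" keep
the hypothesis that scores better on held-out training labels and
"anti-cross-validation" keep the worse one, and both achieve the same average
off-training-set accuracy.

```python
import random

random.seed(0)

DOMAIN = range(8)            # a small finite input space
TRAIN  = list(DOMAIN)[:5]    # training inputs
TEST   = list(DOMAIN)[5:]    # off-training-set inputs

def h0(x): return 0          # hypothesis "always 0"
def h1(x): return 1          # hypothesis "always 1"

def trial():
    # Draw the target uniformly at random from all Boolean functions on DOMAIN.
    target = {x: random.randint(0, 1) for x in DOMAIN}
    # Agreement with the training labels (a stand-in for a cross-validation score).
    s0 = sum(h0(x) == target[x] for x in TRAIN)
    s1 = sum(h1(x) == target[x] for x in TRAIN)
    # Cross-validation keeps the better-scoring hypothesis;
    # anti-cross-validation deliberately keeps the worse one.
    cv, anti = (h0, h1) if s0 >= s1 else (h1, h0)
    # Score each choice by off-training-set accuracy.
    def ots(h):
        return sum(h(x) == target[x] for x in TEST) / len(TEST)
    return ots(cv), ots(anti)

n = 20000
cv_scores, anti_scores = zip(*(trial() for _ in range(n)))
cv_avg, anti_avg = sum(cv_scores) / n, sum(anti_scores) / n
print(cv_avg, anti_avg)      # both come out near 0.5 under the uniform prior
```

Because the target's off-training-set labels are independent of the training
labels under the uniform prior, no selection rule can average better than
chance there, which is exactly the Expectation[cross validation] = 0 claim.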

This is exciting, because in our universe at the present time evidently 
	Expectation[cross validation] > 0,
which implies a non-uniform prior over the ensemble of problems.
Or are we just choosing our problems unfairly?  
And if so, what algorithm are we using (or is using us) to choose them?

Craig Hicks           hicks at cs.titech.ac.jp 
Ogawa Laboratory, Dept. of Computer Science 
Tokyo Institute of Technology, Tokyo, Japan 

PS.  I do not claim to be clear on all the issues,
or to be free from misunderstandings, by any means.

PPS. What is anti-cross validation?
