NFL Summary

hicks@cs.titech.ac.jp
Sun Dec 10 09:24:29 EST 1995


Michael Perrone writes:
>   I think that the NFL results point towards what I feel are extremely
>   interesting research topics:
> ...
>      Can we identify a set of assumptions that are equivalent to the
>      assumption that CV model selection improves generalization?

CV is nothing more than the random sampling of prediction ability.  If the
average over the ensemble of samplings of this ability on two different models
A and B comes out showing that A is better than B, then by definition A is
better than B.  This assumes only that the true domain and the ensemble of all
samplings coincide.  Therefore CV will not, on average, cause a LOSS in
prediction ability.  That is, when it fails, it fails gracefully,
on average.  It cannot be consistently deceptive.
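
For concreteness, here is a minimal sketch of that comparison in modern
Python with scikit-learn (tooling which of course postdates this post); the
synthetic data and the two particular models are illustrative assumptions of
mine, not anything taken from the discussion above:

# Sketch: compare models A and B by their average held-out error over
# repeated random splits, and prefer the one with the better average.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)

model_a = KNeighborsRegressor(n_neighbors=5)   # "model A"
model_b = LinearRegression()                   # "model B"

cv = KFold(n_splits=10, shuffle=True, random_state=0)
score_a = cross_val_score(model_a, X, y, cv=cv,
                          scoring="neg_mean_squared_error").mean()
score_b = cross_val_score(model_b, X, y, cv=cv,
                          scoring="neg_mean_squared_error").mean()

# Higher (less negative) average score = better estimated prediction ability.
print(f"CV estimate, model A: {score_a:.3f}")
print(f"CV estimate, model B: {score_b:.3f}")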

(A quick note:  Sometimes it is advocated that a complexity
parameter be set by splitting the data set into training and testing
sets and using CV, and that, with the complexity parameter then fixed,
the whole data set be used to train the other parameters.
Behind this is an ASSUMPTION that the complexity parameter is
independent of the other parameters.  Of course it often works
in practice, but it violates the principle in the above paragraph,
so I do not count this as real CV here.)
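
For illustration only, here is a sketch of that two-step recipe, with a
ridge penalty standing in (by assumption) for the "complexity parameter";
the data and the model are hypothetical choices of mine:

# Step 1: pick the complexity parameter on a train/validation split.
# Step 2: with it fixed, train the remaining parameters on all the data.
# This implicitly assumes the best penalty is roughly independent of the
# other parameters, which is exactly the assumption noted above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10) + 0.5 * rng.standard_normal(200)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            random_state=1)

alphas = [0.01, 0.1, 1.0, 10.0]
val_errors = [mean_squared_error(
                  y_val, Ridge(alpha=a).fit(X_tr, y_tr).predict(X_val))
              for a in alphas]
best_alpha = alphas[int(np.argmin(val_errors))]

final_model = Ridge(alpha=best_alpha).fit(X, y)  # refit on the whole set
print(f"chosen alpha = {best_alpha}")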

Two prerequisites exist to obtain a GAIN with CV (see the sketch after this
list):

	1) The objective function must be "compressible", i.e., it cannot
	   be noise.
	2) We must have a model which can recognize the structure in
	   the data.  This structure might be quite hard to see, as in
	   chaotic signals.
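
Here is a small sketch of those two prerequisites (a synthetic example of my
own, again using scikit-learn): the same flexible model is scored by CV on a
compressible target and on pure noise.

# Sketch: CV reports a gain only when there is structure for the model to find.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(300, 1))
y_structured = np.sin(4 * X[:, 0]) + 0.1 * rng.standard_normal(300)  # (1) holds
y_noise = rng.standard_normal(300)                                   # (1) fails

model = KNeighborsRegressor(n_neighbors=10)  # flexible enough for the sine, (2)
r2_structured = cross_val_score(model, X, y_structured, cv=10,
                                scoring="r2").mean()
r2_noise = cross_val_score(model, X, y_noise, cv=10, scoring="r2").mean()

print(f"CV R^2 on structured target: {r2_structured:.2f}")  # clearly positive
print(f"CV R^2 on pure noise:        {r2_noise:.2f}")        # about zero or below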

I think NFL says that on average CV will not obtain GAINful results, because
the chance that a randomly selected problem and a randomly selected algorithm
will hit it off is vanishingly small.  (Or even any fixed problem and a
randomly selected algorithm.)

But I think it tells us something more important as well.  It tells us that
not using CV means we are always implicitly trusting our a priori knowledge.
Any reasonable learning algorithm can always predict the training data, or a
"smoothed" version of it.  But because of the NFL theorem, this, over the
ensemble of all algorithms and problems, means nothing.  On average there will
be no improvement in the off training set error.  Fortunately, CV will report
this fact by showing a zero correlation between prediction and true value on
the off training set data. (Of course this is only the performance of CV on
average over the ensemble of off training set data sets; CV may be deceptive
for any single off training set data set.)  Thus, we shouldn't think we can do away with
CV unless we admit to having great faith in our prior.
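
As an illustration of that "zero correlation" report (a synthetic example of
my own, not something from the post): fit any learner to an incompressible
target and correlate its CV predictions with the held-out truth.

# Sketch: on pure-noise labels, the correlation between CV predictions and the
# true held-out values is near zero on average.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 5))
y = rng.standard_normal(500)  # incompressible: labels independent of X

# Each point is predicted by a model trained without that point's fold.
preds = cross_val_predict(KNeighborsRegressor(n_neighbors=10), X, y, cv=10)

corr = np.corrcoef(preds, y)[0, 1]
print(f"correlation(prediction, truth) on held-out data: {corr:+.3f}")  # ~0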

Going back to NFL, I think it poses another very interesting problem:
Suppose we have "a foot in the door", that is, an algorithm which makes
some sense of the data by showing some degree of prediction capability.  Can
we always use this prediction ability to gain better prediction ability?  Can
we perform something like steepest descent over the space of algorithms, ONCE
we are started on a slope?  Is there a provable
snowball effect?
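
A purely speculative toy version of such a search (my own construction, not
anything proposed above): a greedy, CV-guided walk over a one-parameter
family of models, which keeps moving only while the CV estimate improves.

# Sketch: greedy "steepest descent" over a tiny discrete space of algorithms,
# starting from a model that already shows some weak prediction ability.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(300, 1))
y = np.sin(5 * X[:, 0]) + 0.1 * rng.standard_normal(300)

def cv_score(k):
    """CV estimate of prediction ability for a k-nearest-neighbour model."""
    return cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y, cv=10).mean()

k = 50  # starting model: weak but better-than-nothing prediction ability
best = cv_score(k)
while True:
    # Look at neighbouring models in the (one-dimensional) "algorithm space".
    candidates = [k2 for k2 in (k - 5, k + 5) if 1 <= k2 <= 100]
    scores = {k2: cv_score(k2) for k2 in candidates}
    k_next, s_next = max(scores.items(), key=lambda kv: kv[1])
    if s_next <= best:
        break  # no neighbour improves the CV estimate: stop
    k, best = k_next, s_next

print(f"greedy search settled on k={k} with CV R^2 {best:.2f}")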

I think NFL reminds us that we are already rolling down the hill,
and we shouldn't think otherwise.

Craig Hicks
Tokyo Institute of Technology

