No free lunch for Cross Validation!

zhuh zhuh at helios.aston.ac.uk
Thu Dec 14 13:12:43 EST 1995


Dear Colleagues,

A little while ago someone claimed that 
    Cross validation will benefit from the presence of any structure,
    and if there is no structure it does no harm; 
yet
    NFL explicitly states that any structure can be equally good or
    bad for a given method, depending on how the two match each other;
yet
    It was further claimed that they do not conflict with each other.
 
I was quite curious and did the following five-minute experiment to
find out which is correct.
 
Suppose we have a Gaussian variable x, with mean mu and unit variance.
We have the following three estimators for estimating mu from a
sample of size n.
  A: The sample mean.  It is optimal in both the Maximum Likelihood
and Least Mean Squares senses.
  B: The maximum of the sample.  It is a bad estimator in any reasonable sense.
  C: Cross validation to choose between A and B, with one extra data point.
 
The numerical result with n=16, averaged over 10000 samples, gives the
following mean squared errors:
        A: 0.0627    B: 3.4418    C: 0.5646
This clearly shows that cross validation IS harmful in this case,
despite the fact that it is based on a larger sample.  NFL still wins!
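In case you want to reproduce this, here is a minimal Python sketch of
the experiment.  (The selection rule -- keep whichever estimate lies
closer to the single held-out point, i.e. least squared error on the
validation point -- is one straightforward reading of "cross validation
with one extra data point"; the function names are mine.)

```python
import random

def trial(rng, n=16, mu=0.0):
    # Draw a sample of size n from N(mu, 1).
    sample = [rng.gauss(mu, 1.0) for _ in range(n)]
    a = sum(sample) / n      # estimator A: sample mean
    b = max(sample)          # estimator B: sample maximum
    # Estimator C: cross validation on one extra point --
    # keep whichever estimate lies closer to the held-out point.
    holdout = rng.gauss(mu, 1.0)
    c = a if abs(a - holdout) <= abs(b - holdout) else b
    return (a - mu) ** 2, (b - mu) ** 2, (c - mu) ** 2

def experiment(trials=10000, seed=0):
    rng = random.Random(seed)
    errs = [trial(rng) for _ in range(trials)]
    return [sum(e[i] for e in errs) / trials for i in range(3)]

if __name__ == "__main__":
    mse_a, mse_b, mse_c = experiment()
    print("A: %.4f   B: %.4f   C: %.4f" % (mse_a, mse_b, mse_c))
```

With these settings the ordering MSE(A) < MSE(C) < MSE(B) comes out
clearly, matching the figures above (exact numbers depend on the seed).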
 
Many of you might jump on me at this point: But this is a very
artificial example, which is not what normally occurs in practice.
To this I have two answers, short and long.
 
The short answer is one of principle.  Any counter-example, however
artificial, clearly demolishes the hope that cross validation is a
"universally beneficial method".
 
The longer answer is divided into several parts, which will hopefully
answer potential criticisms from every angle:
 
1. The cross validation is performed on an extra data point.  We are not
requiring it to perform as well as the mean over all 17 data points.  If it
cannot extract more information from the one extra data point, a minimum
requirement is that it preserve the information in the original 16 points.
But it cannot even do this.
 
2. The maximum of a sample is the 100th percentile.  The median is the 50th
percentile, which is in fact quite a reasonable estimator.  Let us use
a larger cross validation set (of size k), and replace B with a
different percentile.  The result is that, for the median, CV needs k>2
to work.  For the 70th percentile CV needs k>16.  The required k increases
dramatically with the percentile.
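To make this concrete, here is a sketch that generalises the experiment:
B becomes the p-th percentile of the sample, and C does cross validation
on k extra points by keeping whichever estimate is closer to the mean of
the validation set (equivalent to least squared error on the k held-out
points).  The nearest-rank percentile convention and the selection rule
are my own choices for illustration; other reasonable conventions give
the same qualitative picture.

```python
import random

def percentile(sorted_sample, p):
    # Simple nearest-rank percentile on an already-sorted sample,
    # with p in [0, 1]; p=1.0 gives the maximum.
    idx = min(len(sorted_sample) - 1, int(round(p * (len(sorted_sample) - 1))))
    return sorted_sample[idx]

def experiment(p, k, n=16, mu=0.0, trials=10000, seed=0):
    """Return (MSE of A, MSE of B, MSE of C, fraction of trials where C picked B)."""
    rng = random.Random(seed)
    se_a = se_b = se_c = picked_b = 0.0
    for _ in range(trials):
        sample = [rng.gauss(mu, 1.0) for _ in range(n)]
        a = sum(sample) / n                # A: sample mean
        b = percentile(sorted(sample), p)  # B: p-th percentile
        vbar = sum(rng.gauss(mu, 1.0) for _ in range(k)) / k  # validation mean
        # C: keep whichever estimate is closer to the validation mean.
        if abs(b - vbar) < abs(a - vbar):
            c = b
            picked_b += 1
        else:
            c = a
        se_a += (a - mu) ** 2
        se_b += (b - mu) ** 2
        se_c += (c - mu) ** 2
    return se_a / trials, se_b / trials, se_c / trials, picked_b / trials
```

For p=1.0 (the maximum), the fraction of trials in which CV is fooled
into picking B shrinks as k grows, which is why higher percentiles
demand larger validation sets.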
 
3. It is not true that we have set up a case in which cross validation
can't win.  There is indeed a small probability that a sample is so
bad that the sample maximum is a better estimate than the sample
mean.  However, to exploit such rare chances to good effect, k must be at
least several hundred (maybe exponential) while n=16.  We know such k
exists, since k=infinity certainly helps.  Yet to adopt such a method
is clearly absurd.
 
4. Although we have chosen estimator A to be the known optimal
estimator in this case, it can be replaced by something else.  For
example, both A and B can be reasonable averages over
percentiles, so that without detailed analysis it may appear that
cross validation might give a C which is better than both A and B.
Such beliefs can be defeated by similar counter-examples.
 
5. The above scheme of cross validation may appear different from what
is familiar, but here is a "practical example" which shows that it is
indeed what people normally do.  Suppose we have a random variable
which is either Gaussian or Cauchy.  Consider the following three
estimators:
    A: Sample mean: It has 100% efficiency for Gaussian, and 0%
efficiency for Cauchy.
    B: Sample median: It is 2/pi=63.66% efficient for Gaussian and
8/pi^2=81.06% efficient for Cauchy.
    C: Cross validation on an additional sample of size k, to choose
between A and B.
Intuitively it appears quite reasonable to expect cross validation to
pick out the correct one most of the time, so that, averaged
over all samples, C ought to be superior to both A and B.  But no!!
This will depend on the PRIOR mixing probability of these two sub-models.
If the variable is in fact always Gaussian, then we have just seen that
for n=16, CV will be worse unless k>2.  The same holds even more strongly
in the reversed case, since the mean is an essentially useless estimator
for the Cauchy.
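The Gaussian/Cauchy example can be sketched in the same style.  Here C
chooses between the sample mean and the sample median by total absolute
deviation on k held-out points.  (I use absolute rather than squared
deviation as the validation loss, since squared loss has no finite mean
under Cauchy; this choice, and the function names, are mine, for
illustration.)

```python
import math
import random

def cauchy(rng, mu=0.0):
    # Standard Cauchy draw via the inverse-CDF method.
    return mu + math.tan(math.pi * (rng.random() - 0.5))

def pick_rate(dist, k, n=16, mu=0.0, trials=5000, seed=0):
    """Fraction of trials in which CV picks the median over the mean."""
    rng = random.Random(seed)
    draw = (lambda: rng.gauss(mu, 1.0)) if dist == "gaussian" \
        else (lambda: cauchy(rng, mu))
    picked_median = 0
    for _ in range(trials):
        sample = sorted(draw() for _ in range(n))
        a = sum(sample) / n                            # A: sample mean
        b = (sample[n // 2 - 1] + sample[n // 2]) / 2  # B: sample median
        holdout = [draw() for _ in range(k)]
        # C: keep whichever estimate has smaller total absolute
        # deviation on the k held-out points.
        loss_a = sum(abs(a - x) for x in holdout)
        loss_b = sum(abs(b - x) for x in holdout)
        if loss_b < loss_a:
            picked_median += 1
    return picked_median / trials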

6. In any of the above cases, "anti cross validation" would be even
more disastrous.

If you are not convinced by these arguments, or if you want to know 
more about efficiency, then maybe the following reference can help:
Fisher, R.A.: Theory of statistical estimation, Proc. Camb. Phil. Soc.,
Vol. 22, pp. 700-725, 1925.
 
If you are more or less convinced, I have the following speculation:
 
Several centuries ago, the French Academy of Science (or was it the
Royal Society?) decided that it would no longer examine
inventions of "perpetual motion machines", on the grounds that the Law
of Energy Conservation was so reliable that it would defeat any such
attempt.  History proved that this was a wise decision, which redirected
effort towards designing machines that utilise the energy in fuel.
 
Should we expect the same fate for "the universally beneficial
methods" in the face of NFL?  Should we put more effort in designing
methods which use prior information? 

   posterior information <= prior information + data information.

--
Huaiyu Zhu, PhD                   email: H.Zhu at aston.ac.uk
Neural Computing Research Group   http://neural-server.aston.ac.uk/People/zhuh
Dept of Computer Science          ftp://cs.aston.ac.uk/neural/zhuh
    and Applied Mathematics       tel: +44 121 359 3611 x 5427
Aston University,                 fax: +44 121 333 6215
Birmingham B4 7ET, UK              
