NFL, practice, and CV

zhuh zhuh at helios.aston.ac.uk
Tue Dec 19 10:14:20 EST 1995


This is in reply to the criticisms by Craig Hicks and Kevin Cherkauer,
and will be my last posting in this thread.

Craig Hicks thought that my statement (A)

> >You can't make every term positive in your balance sheet, if the grand
> >total is bound to be zero.

is contradictory to his statements (B)

> There ARE functions which are always non-negative, but which under 
> an appropriate measure integrate to 0.
> It only requires that 
> 
> 	1) the support of the non-negative values is vanishingly small,
> 	2) the non-negative values are bounded 

But they are actually talking about different things.  There is a big 
difference between positive and non-negative.  For all practical purposes, 
the functions described by (B) can be regarded as identically zero.  
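To make this concrete, here is a small numerical sketch (my own illustration, not from the thread): a function that is non-negative everywhere but positive only on a single point, a set of Lebesgue measure zero, integrates to zero, and random samples almost surely never see the positive value.

```python
import random

random.seed(0)

# f is non-negative everywhere, but positive only on the single point {0.5},
# a set of Lebesgue measure zero -- so its integral over [0, 1] is 0.
def f(x):
    return 1.0 if x == 0.5 else 0.0

# Monte Carlo estimate of the integral: the samples almost surely miss {0.5},
# so for all practical purposes f is identically zero.
n = 100_000
estimate = sum(f(random.random()) for _ in range(n)) / n
print(estimate)
```

This is exactly the sense in which such functions "can be regarded as identically zero": no sampling procedure will ever distinguish them from zero.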

Translating back to the original topic, statement (B) becomes

(C) There is an algorithm which is always no worse than random guessing,
on any prior, provided that
	1) The priors on which it performs better than random guessing
	   have zero probability of occurring in practice.
	2) It cannot be infinitely better on these priors.

It is true that something improbable may still be possible, but this is
only of academic interest.  In most modern treatments of function spaces,
functions are identified only up to sets of measure zero, so that phrases
like "almost everywhere" or "almost surely" are redundant.

I suspect that, due to the way the NFL theorems are proved, even (C) is
impossible, but this does not matter anyway, because (C) itself is of no
practical interest whatsoever.

> ----

Concerning cross validation, Craig wrote
> 
> There is another important issue which needs to be clarified, and that is the
> definition of CV and the kinds of problems to which it can be applied.  Now
> anybody can make whatever definition they want, and then come to some
> conclusions based upon that definition, and that conclusion may be correct
> given that definition.  However, there are also advantages to sharing a common
> intellectual currency.  

At the risk of a little over-simplification, I would summarise the two
usages of CV as follows:

(CV1)	A method for evaluating estimates,
(CV2)	A method for evaluating estimators.

The key difference is that in (CV1), a decision is made for each sample,
while in (CV2) a decision is made for all samples.

If (CV1) is applied to two algorithms A and B, then we can always define
a third algorithm C, which always chooses whichever of the estimates given
by A and B is favoured by (CV1).  But my previous counter-example shows
that, averaged over all samples, C can be worse than A.  One may seek
refuge in statements like "an optimal decision for each sample does not
mean an optimal decision for all samples".  Well, such incoherent inference
is the defining characteristic of non-Bayesian statistics.  In Bayesian
decision theory it is well known that
	A method is optimal iff it is optimal on almost all samples
	(excluding various measure-zero anomalies).

The case of (CV2) is quite different.  It operates at a higher level than
algorithms like A and B.  It is in fact a statistical estimator mapping
(D, A, f) to a real number r, where D is a finite data set, A is a given
algorithm, f is an objective function, and r is the predicted average
performance.  It should therefore be compared with other such methods.
This appears not to be a topic considered in this discussion.
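As an illustrative sketch of this mapping (my own, with hypothetical choices of A and f), (CV2) can be written as a function taking (D, A, f) to a single number r via k-fold cross validation:

```python
import random
import statistics

def cv2_score(D, A, f, k=5):
    """(CV2): estimate the average performance r of algorithm A on data D,
    under objective f, by k-fold cross validation.  The result describes
    the algorithm, not any single fitted estimate."""
    D = list(D)
    random.shuffle(D)
    folds = [D[i::k] for i in range(k)]
    losses = []
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        estimate = A(train)                           # fit A on the other folds
        losses += [f(estimate, x) for x in folds[i]]  # score on the held-out fold
    return statistics.mean(losses)

# Example: data from a unit Gaussian, A = sample mean, f = squared error.
random.seed(1)
D = [random.gauss(0.0, 1.0) for _ in range(200)]
r = cv2_score(D, statistics.mean, lambda est, x: (est - x) ** 2)
print(round(r, 3))  # close to 1, the variance of the data
```

Note that r is a property of the pair (A, f) on data sets like D; comparing such numbers across algorithms is a different activity from the per-sample selection of (CV1).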

--------------
Kevin Cherkauer wrote
> 
> You forgot
> 
>    D: Anti-cross validation to choose between A and B, with one extra data
>       point.

Well, I did not forget that; see point 6, which you have quoted below.
> 
> I don't understand your claim that "cross validation IS harmful in this case."
> You seem to equate "harmful" with "suboptimal." 

See my original answer, points 1. and 4.

> 	Cross validation is a technique
> we use to guess the answer when we don't already know the answer. 

This is true for any statistical estimator.

>	You give
> technique A the benefit of your prior knowledge of the true answer, but C must
> operate without this knowledge. 

The prior knowledge is that the distribution is a unit Gaussian with
unspecified mean; the true answer is its mean.  No, they are not the
same thing.  C also operates with the knowledge that the distribution
is a unit Gaussian, but it refuses to use this knowledge (which implies
A is better than B).  Instead, it insists on evaluating A and B on a
cross validation set.  That is why it performs miserably.
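This setup can be simulated.  The following is a hypothetical reconstruction, not the exact counter-example from my earlier post: assume A is the sample mean (optimal here), B is a deliberately worse estimator (the first observation), and C holds out one point to choose between them.

```python
import random
import statistics

random.seed(0)

def A(data):   # sample mean -- the better estimator for a unit Gaussian
    return statistics.mean(data)

def B(data):   # a deliberately worse estimator: just the first observation
    return data[0]

def C(data):
    # (CV1)-style selection: hold out the last point, score A and B on it,
    # and return whichever estimate the held-out point favours.
    train, held = data[:-1], data[-1]
    est_a, est_b = A(train), B(train)
    return est_a if (est_a - held) ** 2 <= (est_b - held) ** 2 else est_b

true_mean = 1.7            # unknown to the algorithms
trials, n = 20_000, 6
mse = {"A": 0.0, "B": 0.0, "C": 0.0}
for _ in range(trials):
    data = [random.gauss(true_mean, 1.0) for _ in range(n)]
    for name, alg in (("A", A), ("B", B), ("C", C)):
        mse[name] += (alg(data) - true_mean) ** 2 / trials

print({k: round(v, 3) for k, v in mse.items()})
```

Averaged over samples, C's mean squared error lies above A's: C pays for ignoring the prior knowledge that A is better, even though each of its per-sample choices looks locally reasonable.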

>	A fair comparison would pit C against D, not C
> against A. As you say:
> 
> >6. In any of the above cases, "anti cross validation" would be even
> >more disastrous. 

If the definition were that "an algorithm is good if it is no worse than
the worst algorithm", then I would have no objection.  Well, almost any
algorithm would be good in this sense.  However, if the phrase "in any of
the above cases" is dropped without putting a prior restriction in place
as a remedy, then it is also true that every algorithm is as bad as the
worst algorithm.

Huaiyu

PS. I think I have already talked enough about this subject, so I'll shut
up from now on, unless there's anything new to say.  A more systematic
treatment of these subjects, rather than counter-examples, can be found
at the ftp site below.

--
Huaiyu Zhu, PhD                   email: H.Zhu at aston.ac.uk
Neural Computing Research Group   http://neural-server.aston.ac.uk/People/zhuh
Dept of Computer Science          ftp://cs.aston.ac.uk/neural/zhuh
    and Applied Mathematics       tel: +44 121 359 3611 x 5427
Aston University,                 fax: +44 121 333 6215
Birmingham B4 7ET, UK              



More information about the Connectionists mailing list