NFL once again, I'm afraid

David Wolpert dhw at santafe.edu
Wed Dec 20 20:00:48 EST 1995


First and foremost, I would like to request that this NFL thread fade
out. It is only sowing confusion - people should read the papers on
NFL to understand NFL.

  [[ Moderator's note: I concur.  We've had enough "No Free Lunch" discussion
  for a while; people are starting to protest.  Future discussion should be
  done in email.  -- Dave Touretzky, CONNECTIONISTS moderator ]]

Full stop.

*After* that, after there is common grounding, we can all debate.
There is much else that the connectionists list is more appropriate for in the
meantime.

(To repeat: ftp.santafe.edu, pub/dhw_ftp, nfl.1.ps.Z and nfl.2.ps.Z.)

Please, I'm on my knees, use the time that would have been spent
thrashing on the connectionists list in a more fruitful fashion. Like by
reading the NFL papers. :-)

***

Hicks writes:

>>>
case 1: 
*	Either the target function is (noise/uncompressible/has no structure),
or none of the candidate functions have any correlation with the target
function.*
Since CV provides an estimated prediction error,
it can also tell us "you might as well be using anti-cross validation, or
random selection for that matter, because it will be equally useless".
>>>

This is wrong.

Construct the following algorithm: "If CV says one of the algorithms
under consideration has particularly low error in comparison to the
other, use that algorithm. Otherwise, choose randomly among the
algorithms."

Averaged over all targets, this will do exactly as well as the
algorithm that always guesses randomly among the algorithms. (For
zero-one loss, either OTS error or IID error with a big input space,
etc.)

So you cannot rely on CV's error estimate *at all* (unless you impose
a prior over targets or some such, etc.).

Alternatively, keep in mind the following simple argument: In its
uniform-prior-over-targets formulation, NFL holds even for error
distributions conditioned on *any* property of the training set. So in
particular, you can condition on having a training set for which CV
says "yep, I'm sure; choose that one". And NFL still holds. So even in
those cases where CV "is sure", by following CV, you'll die as often
as not.
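
For the skeptical, here is a toy Monte Carlo sketch of both points above.
It is mine, not from the papers, and every name in it is made up for
illustration. It assumes zero-one loss, noise-free data on distinct inputs,
OTS error, and a uniform distribution over Boolean targets on a small input
space. The two candidate "algorithms" are deliberately trivial constant
predictors, so their CV estimate reduces to training-set error.

import random

N_INPUTS = 16      # size of the input space X
N_TRAIN  = 8       # number of distinct training inputs
TRIALS   = 20000

# Two trivial candidate algorithms: constant predictors.
algos = [lambda x: 0, lambda x: 1]

def cv_estimate(predict, train):
    # Zero-one error on the training set; for data-independent predictors
    # this is what any cross-validation scheme reduces to.
    return sum(predict(x) != y for x, y in train) / len(train)

def ots_error(predict, target, train_xs):
    # Zero-one error on the off-training-set inputs.
    test = [x for x in range(N_INPUTS) if x not in train_xs]
    return sum(predict(x) != target[x] for x in test) / len(test)

cv_sum = rand_sum = sure_sum = 0.0
sure_n = 0
for _ in range(TRIALS):
    target = [random.randint(0, 1) for _ in range(N_INPUTS)]  # uniform target
    train_xs = random.sample(range(N_INPUTS), N_TRAIN)
    train = [(x, target[x]) for x in train_xs]

    # "If CV says one algorithm has lower error, use it; else choose randomly."
    ests = [cv_estimate(a, train) for a in algos]
    if ests[0] != ests[1]:                          # CV "is sure"
        chosen = algos[0] if ests[0] < ests[1] else algos[1]
        err = ots_error(chosen, target, train_xs)
        sure_sum += err
        sure_n += 1
    else:
        err = ots_error(random.choice(algos), target, train_xs)
    cv_sum += err

    # The selector that always chooses randomly.
    rand_sum += ots_error(random.choice(algos), target, train_xs)

print("CV-guided selector, mean OTS error:  %.3f" % (cv_sum / TRIALS))
print("Random selector,    mean OTS error:  %.3f" % (rand_sum / TRIALS))
print("CV-guided, trials where CV was sure: %.3f" % (sure_sum / max(sure_n, 1)))

All three numbers come out at about 0.5: averaged over targets, following
CV is no better than choosing randomly, and restricting attention to the
trials where CV was "sure" doesn't change that.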




>>>
case 2: 
*	The target (is compressible/has structure), and some of the candidate
functions are positively correlated with the target function.*
	In this case CV will outperform anti-CV (ON AVERAGE).
>>>

This is wrong.

As has been mentioned many times, having structure in the target, by
itself, gains you nothing. And as has also been mentioned, if "the
candidate functions are positively correlated with the target
function", then in fact *anti-CV wins*.

READ THE PAPERS.


>>>
By ON AVERAGE I mean the expectation across the ensemble of samples for
a FIXED target function.  This is different from the ensemble and distribution
of target functions, which is a much bigger question.
>>>

This distinction is irrelevant. There are versions of NFL that address
both of these cases (as well as many others).

READ THE PAPERS.


*****


Lemm writes:

>>>
1.) In short, NFL assumes that data, i.e. information of the form y_i=f(x_i),
do not contain information about function values on a non-overlapping
test set.
>>>

This is wrong.

See all the previous discussion about how NFL holds even if you
restrict yourself to targets with a lot of structure. The problem is
that the structure can hurt just as easily as it can help. The data
set need not carry zero information about the test set; it is enough
that the limited kinds of information it does carry can "confuse" the
learning algorithm at hand.

READ THE PAPERS.



>>>
This is done by postulating "unrestricted uniform" priors, 
or uniform hyperpriors over nonuniform priors...
>>>

This is wrong. There is (obviously) a version of NFL that holds for
uniform priors. And there is another version in which one averages
over all priors - so the uniform prior has measure 0. But one can also
restrict oneself to averaging only over those priors "with a lot of
structure", and again get NFL.

And there are many other versions of NFL in which there is *no* prior,
because things are conditioned on a fixed target. Exactly as in
(non-Bayesian) sampling theory statistics.

Some of those alternative NFL results involve saying "if you're
conditioning on a target, there are as many such targets where you die
as where you do well". 
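
In rough paraphrase (my notation, not a verbatim statement of the
theorem): for OTS zero-one error and any two learning algorithms 1 and 2,

\[
  \sum_{f} E_{1}(C \mid f, m) \;=\; \sum_{f} E_{2}(C \mid f, m),
\]

where the sum runs over all target functions f and m is the training-set
size. Whatever targets algorithm 1 beats algorithm 2 on are exactly
balanced by targets on which it loses.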

Other NFL results never vary the target *in any sense*, even to
compare different targets. Rather they vary something concerning the
generalizer. This is the case with the more sophisticated cross-validation
results, for example.

READ THE PAPERS.




>>>
There is much information which is not of this 
"single sharp data" type. (Examples see below.)
>>>

*Obviously* if you have extra information and/or knowledge beyond that
in the training set, you can (often) do better than random guessing. That's
what Bayesian analysis is all about. More generally, as I have proven
in [1], the probability of error can be written as a non-Euclidean
inner product between the learning algorithm and the posterior. So
obviously if your posterior is structured in an appropriate manner,
that can be exploited by the algorithm.
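
Roughly, and in my notation here rather than as a verbatim statement of
the result in [1], the expected cost takes the form

\[
  E(C \mid d) \;=\; \sum_{h,\,f} \operatorname{Er}(h, f, d)\; P(h \mid d)\; P(f \mid d),
\]

where P(h|d) is the learning algorithm (how it maps a training set d to
hypotheses h), P(f|d) is the posterior over targets, and Er(h,f,d) is a
fixed "metric" determined by the loss function and the sampling
distribution over (off-training-set) inputs. The details in [1] differ,
but the point is that the algorithm and the posterior enter only through
this inner-product structure: a posterior "aligned" with the algorithm
gives low expected error, a misaligned one gives high expected error.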

This was never the issue, however. The issue had to do with "blind"
supervised learning, in which one has no such additional
information. As in COLT, for example.

You're arguing apples and oranges here.




>>>
4) Real measurements (especially of continuous variables)
normally do also NOT have the form y_i=f(x_i) !
They mostly perform some averaging over f(x_i) or
at least they have some noise on the x_i (as small as you like, but present).
>>>

Again, this is obvious. And stated explicitly in the papers,
moreover. And completely irrelevant to the current discussion. The
issue at hand has *always* been "sharp" data. And if you look at
what's done in the neural net community, or in COLT, 95% of it assumes
"sharp data".

Indeed, there are many other assumptions almost always made and almost
never true that Lemm has missed. Like making a "weak filtering
assumption": assume the target and the distribution over inputs are
independent. But again, just like in COLT, we're starting simple here,
with such assumptions intact.

READ THE PAPERS.



>>>
This shows that smoothness of the expectation (in contrast to uniform priors) 
is the result of the measurement process and therefore
is a real phenomena for "effective" functions.
>>>

To give one simple example, what about categorical data, where
there is not even a partial ordering over the inputs? What does
"locally smooth" even mean then?

And even if we're dealing with real-valued spaces, if there's input
space noise, NFL simply changes to become a statement concerning test set
elements that are sufficiently far (on the scale of the input space
noise) from the elements of the training set. The input space noise
makes the math messier, but doesn't change the underlying phenomenon.

(Readers interested in previous work on the relationship between local
(!) regularization, smoothness, and input noise should see Bishop's
Neural Computation article of about 6 months ago.)



>>>
Even more: situations without "priors" are VERY artificial.
So if we specify the "priors" (and the lesson from NFL is
that we should if we want to make a good theory) 
then we cannot use NFL anymore. (What should it be used for then?)
>>>

Sigh. 

1) I am a Bayesian whenever feasible. (In fact, I've been taken to
task for being "too Bayesian".) But situations without obvious priors
- or where eliciting the priors is not trivial and you don't have the
time - are in fact *very* common.

A simple example is a project I am currently involved in for detecting
phone fraud for MCI. Quick, tell me the prior probability that a
fraudulent call arises from area code 617 vs. the prior probability
that a non-fraudulent call does...

2) Essentially all of COLT is non-Bayesian. (Although some of it makes
assumptions about things like the support of the priors.) You haven't
a prayer of really understanding what COLT has to say without keeping
in mind the admonitions of NFL.

3) As I've now said until I'm blue in the face, NFL is only the
starting point. What it's "good for", beyond proving to people that
they must pay attention to their assumptions, be wary of COLT-type
claims, etc., is: head-to-head minimax theory, scrambled algorithms
theory, hypothesis-averaging theory, etc., etc., etc.

READ THE PAPERS.
 

****


Zhu writes:

>>>
I quite agree with Joerg's observation about learning algorithms in
practice, and the priors they use.  The key difference is

	Is it legitimate to be vague about prior?

Put it another way,

	Do you claim the algorithm can pick up whatever prior automatically,
	instead of being specified before hand?

My answer is NO, to both questions, because for an algorithm to be good on 
any prior is exactly the same as for an algorithm to be good without prior,
as NFL told us.
>>>

Yes!

Everybody, LISTEN TO ZHU!!!!









David Wolpert





[1] - Wolpert, D. "The Relationship Between PAC, the Statistical
Physics Framework, the Bayesian Framework, and the VC Framework", in
"The Mathematics of Generalization", D. Wolpert (Ed.), Addison-Wesley,
1995.

