Correcting misunderstandings about NFL

David Wolpert dhw at santafe.edu
Fri Dec 1 11:18:19 EST 1995


This posting is to correct some misunderstandings that were recently
posted concerning the NFL theorems. I also draw attention to some of
the incorrect interpretations commonly ascribed to certain COLT
results.

***


Joerg Lemm writes:

>>>
1.) If there is no relation between the function values
    on the test and training set
    (i.e. P(f(x_j)=y|Data) equal to the unconditional P(f(x_j)=y) ),
    then, having only training examples y_i = f(x_i) (=data) 
    from a given function, it is clear that I cannot learn anything 
    about values of the function at different arguments, 
    (i.e. for f(x_j), with x_j not equal to any x_i = nonoverlapping
test set).
>>>

Well put. Now here's the tough question: Vapnik *proves* that it is
unlikely (for large enough training sets and small enough VC dimension
generalizers) for error on the training set and full "generalization
error" to be greatly different. Regardless of the target. Using this,
Baum and Haussler even wrote a paper "What size net gives valid
generalization?" in which no assumptions whatsoever are made about the
target, and yet the authors are able to provide a response to the
question of their title. HOW IS THAT POSSIBLE GIVEN WHAT YOU JUST
WROTE????

NFL is "obvious". And so are VC bounds on generalization error (well,
maybe not "obvious"). And so is the PAC "proof" of Occam's razor. And
yet the latter two bound generalization error (for those cases where
training set error is small enough) without making any assumptions
about the target. What gives?

The answer: The math of those works is correct. But far more care must
be exercised in the interpretation of that math than you will find in
those works. The care involves paying attention to what goes on the
right-hand side of the conditioning bars in one's probabilities, and
the implications of what goes there. Unfortunately, such conditioning
bars are completely absent in those works...

(In fact, the sum-total of the difference between Bayesian and COLT
approaches to supervised batch learning lies in what's on the
right-hand side of those bars, but that's another story. See [2].)

As an example, it is widely realized that VC bounds suffer from being
worst-case. However, there is another hugely important caveat to those
bounds. The community as a whole simply is not aware of that caveat,
because the caveat concerns what goes on the right-hand side of the
conditioning bar, and this is NEVER made explicit.

This caveat is the fact that VC bounds do NOT concern

Pr(IID generalization error |
        observed error on the training set, training set size,
        VC dimension of the generalizer).

But you wouldn't know that to read the claims made on behalf of those
bounds ...
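
(Roughly speaking - and this is my paraphrase, not a quote from those
works - what the VC theorems do bound is something like

Pr(|error on the training set - IID generalization error| > epsilon |
        target, training set size, VC dimension of the generalizer),

with the probability taken over the random draw of the training set
for a fixed target. Swapping which quantities sit to the left and to
the right of that conditioning bar is exactly the step that is not
licensed without further assumptions.)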

To give one simple example of the ramifications of this: Let's say you
have a favorite low-VC generalizer. And in the course of your career
you parse through learning problems, either explicitly or (far more
commonly) without even thinking about it. When you come across one
with a large training set on which your generalizer has small error,
you want to invoke Vapnik to say you have assurances about full
generalization error.

Well, sorry. You don't and you can't. You simply can't escape Bayes by
using confidence intervals. Confidence intervals in general (not just
in VC work) have the annoying property that as soon as you try to use
them, very often you contradict the underlying statistical assumptions
behind them. Details are in [1] and in the discussion of "We-Learn-It
Inc." in [2].
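
To see what that distinction does to you in practice, here is a
minimal Monte Carlo sketch - a toy construction of my own, not code
from [1] or [2]; the domain size, training set size, and
majority-vote generalizer are just illustrative choices. Under a
uniform prior over targets, even after you condition on having
observed low error on a training set, the expected off-training-set
error of your low-VC generalizer is still 50%:

    import random

    N_X, M, TRIALS = 30, 10, 50000    # domain size, training set size, Monte Carlo trials
    random.seed(0)

    kept, ots_err_sum = 0, 0.0
    for _ in range(TRIALS):
        # draw a target uniformly at random, and a training set of M inputs
        target = [random.randint(0, 1) for _ in range(N_X)]
        train_x = random.sample(range(N_X), M)
        train_y = [target[x] for x in train_x]

        # a very low-VC generalizer: predict the majority training label
        # everywhere (ties broken toward 1)
        h = int(sum(train_y) * 2 >= M)
        train_err = sum(y != h for y in train_y) / M

        # condition on the event you actually observe:
        # "low error on this training set"
        if train_err <= 0.2:
            test_x = [x for x in range(N_X) if x not in train_x]
            ots_err_sum += sum(target[x] != h for x in test_x) / len(test_x)
            kept += 1

    print("off-training-set error, given low training error:",
          ots_err_sum / kept)          # comes out near 0.5

The point is not that Vapnik's math is wrong - it isn't - but that the
event you condition on when you "use" the bound this way is not the
event his theorem is about.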



>>>
2.) We are considering two of those (influence) relations P(f(x_j)=y|Data):
    one, named A, for the true nature (=target) and one, named B, for our 
    model under study (=generalizer).
    Let P(A and B) be the joint probability distribution for the
    influence relations for target and generalizer.

3.) Of course, we do not know P(A and B), but in good old Bayesian tradition,
    we can construct a (hyper-)prior P(C) over the family of probability 
    distributions of the joint distributions C = P(A and B).
 
4.) NFL now uses the very special prior assumption
    P(A and B) = P(A)P(B)
>>>

If I understand you correctly, I would have to disagree. NFL does not
rest on that particular choice of P(C) - more formally, if you average
over all priors, you again get NFL. So the set of priors for which
your favorite algorithm does *worse than random* is just as large as
the set for which it does better. (In this sense, the uniform prior is
a typical prior, not a pathological one, out on the edge of the
space. It is certainly not a "very special prior".)

In fact, that's one of the major points of NFL - it's not to see what
life would be like if this or that were uniform, but to use such
uniformity as a mathematical tool, to get a handle on the underlying
geometry of inference, the size of the various spaces (e.g., the size
of the space of priors for which you lose to random), etc.

The math *starts* with NFL, and then goes on to many other things (see
[1]). It's only the beginning chapter of the textbook.
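
Since this claim is so often met with disbelief, here is a tiny
exhaustive check you can run - again a toy construction of my own, not
code from [1]; the 10-point input space and the two algorithms are
arbitrary. Averaged uniformly over every possible target, a "sensible"
rule and its deliberately perverse opposite have identical
off-training-set error:

    from itertools import product

    X = list(range(10))                # a 10-point input space
    train_x = [0, 1, 2, 3]             # a fixed set of training inputs
    test_x = [x for x in X if x not in train_x]

    def majority(train_y):
        # algorithm A: predict the most common training label everywhere
        return int(sum(train_y) * 2 >= len(train_y))

    def anti_majority(train_y):
        # algorithm B: predict the least common training label everywhere
        return 1 - majority(train_y)

    algs = {"majority": majority, "anti-majority": anti_majority}
    err = {name: 0.0 for name in algs}
    targets = list(product([0, 1], repeat=len(X)))   # all 2^10 targets
    for f in targets:
        train_y = [f[x] for x in train_x]
        for name, alg in algs.items():
            h = alg(train_y)
            err[name] += sum(f[x] != h for x in test_x) / len(test_x)

    for name in algs:
        print(name, err[name] / len(targets))        # both print exactly 0.5

And replacing the uniform average over targets with a uniform average
over priors over targets leaves the conclusion unchanged.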






>>>
I say that it is rational to believe 
(and David does so too, I think) that in real life cross-validation 
works better in more cases than  anti-cross-validation.
>>>

Oh, most definitely.

There are several issues here: 1) what gives with all the "prior-free"
general proofs of COLT, given NFL, 2) purely theoretical issues (e.g.,
as mentioned before, characterizing the relationship between target
and generalizers needed for xval. to beat anti-xval.) and 3) perhaps
most provocatively of all, seeing if NFL (and the associated
mathematical structure) can help you generalize in the real world
(e.g., with head-to-head minimax distinctions between generalizers).
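
On point 2, here is a rough sketch of the kind of calculation involved
- again a toy setup of my own, not anything from [1]; the
nearest-neighbor base generalizers and the threshold-function family
are just illustrative. Averaged over all targets, cross-validation and
anti-cross-validation tie exactly, as NFL says they must; restrict the
targets to a structured family and cross-validation pulls well ahead:

    from itertools import product

    X = list(range(8))
    train_x = [0, 2, 5, 7]
    test_x = [x for x in X if x not in train_x]

    def nn(train, x):
        # base generalizer 1: copy the label of the nearest training input
        return min(train, key=lambda p: abs(p[0] - x))[1]

    def anti_nn(train, x):
        # base generalizer 2: negate that label
        return 1 - nn(train, x)

    def loo_error(alg, train):
        # leave-one-out (cross-validation) error of a base generalizer
        return sum(alg([p for p in train if p != q], q[0]) != q[1]
                   for q in train) / len(train)

    def ots_error(alg, train, f):
        return sum(alg(train, x) != f[x] for x in test_x) / len(test_x)

    def average_errors(targets):
        err = {"xval": 0.0, "anti-xval": 0.0}
        for f in targets:
            train = [(x, f[x]) for x in train_x]
            pick = min((nn, anti_nn), key=lambda a: loo_error(a, train))  # xval's choice
            other = anti_nn if pick is nn else nn                         # anti-xval's choice
            err["xval"] += ots_error(pick, train, f)
            err["anti-xval"] += ots_error(other, train, f)
        return {k: round(v / len(targets), 3) for k, v in err.items()}

    print("all 2^8 targets   :", average_errors(list(product([0, 1], repeat=len(X)))))
    print("threshold targets :", average_errors([[int(x >= t) for x in X] for t in X]))

What the theory asks for is a characterization of exactly which
relationships between the prior over targets and the pair of
generalizers put you in the second situation rather than the first.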


***



Finally, Eric Baum weighs in:


>>>
Barak Pearlmutter remarked that saying
	We have *no* a priori reason to believe that targets with "low
	Kolmogorov complexity" (or anything else) are/not likely to
	occur in the real world.
(which I gather was a quote from David Wolpert?)
is akin to saying we have no a priori reason to believe there is non-random
structure in the world, which is not true, since we make great
predictions about the world.
>>>

Well, let's get a bit formal here. Take all the problems we've ever
tried to make "great predictions" on. Let's even say that these
problems were randomly chosen from those in the real world (i.e., no
selection effects of people simply not reporting when their
predictions were not so great). And let's for simplicity say that all
the predictions were generated by the same generalizer - the algorithm
in the brain of Eric Baum will do as a straw man.

Okay. Now take all those problems together and view them as one huge
training set. Better still, add in all the problems that Eric's
ancestors addressed, so that the success of his DNA is also taken
into account. That's still one training set. It's a huge one, but it's
tiny in comparison to the full spaces it lives in.

Saying we (Eric) make "great predictions" simply means that the
xvalidation error of our generalizer (Eric) on that training set is
small. (You train on part of the data, and predict on the rest.)
Formally (!!!!!), this gives no assurances whatsoever about any
behavior off-training-set. As I've stated before, without assumptions,
you cannot conclude that low xvalidation error leads to low
off-training-set generalization error. And of course, each passing
second, each new scene you view, is "off-training-set".

The fallacy in Eric's claim was noted all the way back by
Hume. Success at inductive inference cannot formally establish the
utility of using inductive inference. To claim that it can, you have to
invoke inductive inference, and that, as any second grader can tell
you, is circular reasoning.


Practically speaking, of course, none of this is a concern in the real
world. We are all (me included) quite willing to conclude there is
structure in the real world.  But as was noted above, what we do in
practice is not the issue. The issue is one of theory.

***

It's very similar to high-energy physics. There are a bunch of
physical constants that, if only slightly varied, would (seem to) make
life impossible. Why do they have the values they have? Some invoke
the anthropic principle to answer this - we wouldn't be around if they
had other values. QED. But many find this a bit of a cop-out, and
search for something more fundamental. After all, you could have
stopped the progress of physics at any point in the past if you had
simply gotten everyone to buy into the anthropic principle at that
point in time.

Similarly with inductive inference. You could just cop out and say
"anthropic principle" - if inference were not possible, we wouldn't be
having this debate. But that's hardly a satisfying answer.


***


Eric goes on:

>>>
Consider the problem of learning
to predict the pressure of a gas from its temperature. Wolpert's theorem,
and his faith in our lack of prior about the world, predict,
that any learning algorithm whatever is as likely
to be good as any other. This is not correct.
>>>

To give two examples from just the past month, I'm sure MCI and
Coca-Cola would be astonished to know that the algorithms they're so
pleased with were designed for them by someone having "faith in our
lack of prior about the world".

Less glibly, let me address this claim about my "faith" with two
quotes from the NFL for supervised learning paper. The first is in the
introduction, and the second in a section entitled "On uniform
averaging". So neither is exactly hidden ...

1) "It cannot be emphasized enough that no claim is being made ... that
all algorithms are equivalent in the real world."

2) "The uniform sums over targets ... weren't chosen because there is
strong reason to believe that all targets are equally likely to arise
in practice. Indeed, in many respects it is absurd to ascribe such a
uniformity over possible targets to the real world. Rather the uniform
sums were chosen because such sums are a useful theoretical tool with
which to analyze supervised learning."


Finally, given that I'm mixing it up with Eric on NFL, I can't help
but quote the following from his "What size net gives valid
generalization" paper:

"We have given bounds (independent of the target) on the training set
size vs. neural net size needed such that valid generalization can be
expected." 

(Parenthetical comment added - and true.)

Nowhere in the paper is there any discussion whatsoever of the
apparent contradiction between this statement and NFL-type concerns.
Indeed, as mentioned above, with only the conditioning-bar-free
mathematics in Eric's paper, there is no way to resolve the
contradiction. In this particular sense, that paper is extremely
misleading. (See discussion above on misinterpretations of Vapnik's
results.)
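
(For readers without that paper to hand: as I recall the result, the
bound is, up to constants, of the form m = O((W/epsilon) log(N/epsilon))
training examples for a net with W weights and N threshold units, such
that if all but an epsilon/2 fraction of those examples are classified
correctly, generalization error below epsilon "can be expected". The
probability in that statement is again taken over the random draw of
the training set for a fixed target - which is why it can be
independent of the target, and also why it is not a statement about
Pr(generalization error | observed error on the training set).)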



>>>
Creatures evolving in this "play world" would exploit this structure and
understand their world in terms of it. There are other things they would
find hard to predict. In fact, it may be mathematically valid to say that
one could mathematically construct equally many functions on which
these creatures would fail to make good predictions. But so what?
So would their competition. This is not relevant to looking for
one's key, which is best done under the lamppost, where one has a
hope of finding it. In fact, it doesn't seem that the play world
creatures would care about all these other functions at all.
>>>

I'm not sure I quite follow this. In particular, the comment about the
"competition" seems to be wrong.

Let me carry Eric's metaphor a bit further, though, and point out
that it makes a hell of a lot more sense to pull out a flashlight and
explore the surrounding territory for your key than it does to spend
all your time with your head down, banging into the lamppost. And NFL
is such a flashlight.




David Wolpert


[1] The current versions of the NFL for supervised learning papers,
nfl.ps.1.Z and nfl.ps.2.Z, at ftp.santafe.edu, in pub/dhw_ftp.

[2] "The Relationship between PAC, the Statistical Physics Framework,
the Bayesian Framework, and the VC Framework", in *The Mathematics of
Generalization*, D. Wolpert Ed., Addison-Wesley, 1995.







