Non-randomness is no panacea

David Wolpert dhw at santafe.edu
Mon Dec 4 19:49:47 EST 1995


Craig Hicks writes:


>>>>>
(1) >Practically speaking of course, none of this is a concern in the real
>world. We are all (me included) quite willing to conclude there is
>structure in the real world.  But as was noted above, what we do in
>practice is not the issue. The issue is one of theory.

(2) >Okay. Now take all those problems together and view them as one huge
>training set. Better still, add in all the problems that Eric's
>ancestors addressed, so that the success of his DNA is also taken
>into account. That's still one training set. It's a huge one, but it's
>tiny in comparison to the full spaces it lives in.

The above statements seem to me to be contradictory in some sense.
>>>>

Not at all. The second statement is concerned with theoretical issues,
whereas the first one is concerned with practical issues. The
distinction is ubiquitous in science and engineering. Even in the
little corner of academia known as supervised learning, most people
are content to distinguish the concerns of COLT (theory) from those of
what-works-in-practice.

>>>
"(1)" is saying we should, when discussing generalization, 
not concern ourselves with the real universe in which we live,
but should consider theoretical alternative universes as well.
>>>

Were you referring to (2) instead? Neither statement says anything
like "we should not concern ourselves with the real universe".



>>>
On the other hand "(2)" seems to say that the real universe in which we live
is itself sufficiently "diverse" that any single approach to generalization
must on average perform the same as any other.
>>>

Again, I would have hoped that nothing I have said could be construed
as saying something like that. It may or may not be true, but you said
it, not me. :-) I am sorry if you were somehow given the wrong
impression.



>>>>
I feel we ought to distinguish between a single universe (ours for example),
and the ensemble of possible universes.
>>>>

This is a time-worn concern. Read up on the past two centuries' worth
of battles between Bayesians and non-Bayesians...



>>>>
Let's suppose a universe which is an N-dimensional binary (0/1) vector 
random variable X,
whose elements are iid with p(0)=p(1)=(1/2).  Apparently there is no structure
in this universe.
>>>>

NO!!! Forgive my ... passion, but as I've said many times now, even in
a purely random universe there are many very deep distinctions between
the behavior of different learning algorithms (and in this sense there
is plenty of "structure") - head-to-head minimax distinctions, for
example, or uniform convergence theory a la Vapnik. Please read the
relevant papers: ftp.santafe.edu, pub/dhw_ftp, nfl.ps.1.Z and
nfl.ps.2.Z.
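
To make the "purely random universe" point concrete, here is a minimal
sketch (a toy of my own construction, not taken from the nfl papers):
on a tiny Boolean domain, average the off-training-set (OTS) zero-one
error of two very different generalizers uniformly over every possible
target. The domain, training set, and algorithm names are illustrative
choices only.

    from itertools import product

    X = list(product([0, 1], repeat=3))      # 8-point input space
    train_x = X[:4]                          # fixed training inputs
    test_x = X[4:]                           # off-training-set inputs

    def nearest_neighbour(train, x):
        # predict the label of the Hamming-closest training input
        return min(train,
                   key=lambda t: sum(a != b for a, b in zip(t[0], x)))[1]

    def always_zero(train, x):
        # a deliberately "dumb" generalizer, for contrast
        return 0

    def mean_ots_error(algorithm):
        total = 0.0
        # average uniformly over all 2^8 binary target functions
        for labels in product([0, 1], repeat=len(X)):
            target = dict(zip(X, labels))
            train = [(x, target[x]) for x in train_x]
            errs = [algorithm(train, x) != target[x] for x in test_x]
            total += sum(errs) / len(test_x)
        return total / 2 ** len(X)

    print(mean_ots_error(nearest_neighbour))   # 0.5
    print(mean_ots_error(always_zero))         # 0.5

Both averages are exactly 1/2; the distinctions I mentioned (minimax
behavior, uniform convergence) live in quantities other than this
particular average.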




>>>>
(b) In any given universe, we can expect structure to be present.

Would I be correct in saying that only (b) needs to be true in order
for cross-validation to be profitable?
>>>>

Nope. The structure can just as easily negate the usefulness of
cross-validation as establish it. And in fact, the version of NFL in
which one fixes the target and then averages over generalizers says
that the state of the universe is (in a certain precise sense), by
itself, irrelevant. Structure or not, that fact alone cannot determine
the utility of cross-validation.
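
As a hedged illustration of that fixed-target version (again a toy of
my own, with illustrative inputs and target): fix any target and any
training set on a small Boolean domain, then average the OTS zero-one
error over every possible way a deterministic generalizer could label
the off-training-set points. Whatever "structure" the target has drops
out of the average.

    from itertools import product

    # Off-training-set inputs and an arbitrary fixed target on them;
    # swap in any target you like - the result below does not change.
    test_x = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
    target = {x: x[0] ^ x[2] for x in test_x}

    total = 0.0
    # enumerate every possible set of predictions a deterministic
    # generalizer could make on the off-training-set points
    for preds in product([0, 1], repeat=len(test_x)):
        errors = [p != target[x] for p, x in zip(preds, test_x)]
        total += sum(errors) / len(test_x)

    print(total / 2 ** len(test_x))   # 0.5, regardless of the target

That is all the toy is meant to show; it says nothing, one way or the
other, about how a particular generalizer such as cross-validation
fares on a particular target.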

***

Although I think further discussion of Kolmogorov complexity is at
best tangential, Juergen Schmidhuber's recent comment deserves a
response. He writes:


>>>>>
(2) Back to: what does this have to do with machine learning? As a
first step, we may simply apply Solomonoff's theory of inductive
inference to a dynamic system or ``universe''. Loosely speaking,
in a universe whose history is compressible, we may expect to
generalize well.
>>>>

How could this be true? Nothing has been specified in Juergen's
statement about the loss function, how test sets are generated (IID
vs. off-training-set vs. who knows what), the generalizer used, how it
is related (if at all) to the prior over targets (a prior which, I
take it, Juergen wishes to be "compressible"), the noise process,
whether there is noise in the inputs as well as the outputs, etc.,
etc. Yet all of those factors are crucial in determining the efficacy
of the generalizer.

Obviously if your generalizer *knows* the "compression scheme of the
universe", knows the noise process, etc., then it will generalize
well. Is that what you're saying, Juergen? It reduces to saying that if
you know the prior, you can perform Bayes-optimally. There is
certainly no disputing that statement.
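
To spell that out in symbols (notation mine, and suppressing noise for
brevity): knowing the prior P(f) over targets lets you form the
posterior and predict by minimizing posterior expected loss,

    \hat{y}(x) \;=\; \arg\min_{y'} \sum_{f} L\bigl(y', f(x)\bigr)\, P(f \mid d),
    \qquad P(f \mid d) \;\propto\; P(d \mid f)\, P(f),

which is Bayes-optimal for loss L given training set d.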

It is worth bearing in mind, though, that NFL can be cast in terms of
averages over priors. In that guise, it says that there are just as
many priors - just as many ways of having a universe be
"compressible", loosely speaking - for which your favorite algorithm
dies as there are for which it shines.

In fact, it's not hard to show that an average over only those priors
that are more than a certain distance from the uniform prior results
in NFL - under such an average, for OTS error, etc., all algorithms
have the same expected performance.
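
Schematically (the notation here is mine, not lifted from the papers):
write E(C | d, a, \pi) for the expected OTS error of algorithm a, given
training set d, when targets are drawn from prior \pi. The
prior-averaged form of NFL says that

    \int_{\Pi} E(C \mid d, a, \pi)\, d\pi

does not depend on a when \Pi is the set of all priors over targets,
and - this is the point above - it stays independent of a when \Pi is
cut down to the priors lying at least some fixed distance from the
uniform prior. The precise conditions (OTS error, the loss, etc.) are
in the nfl papers mentioned earlier.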

The simple fact of having a non-uniform prior does not mean that
better-than-random generalization arises.

***

Structure, compressibility, whatever you want to call it - it can hurt
just as readily as it can help. The simple claim that there is
non-randomness in the universe does not establish that any particular
algorithm performs better than random guessing.

To all those who dispute this, I ask that they present a theorem
relating generalization error to "compressibility". (To do this, of
course, they will have to specify the loss function, the noise
process, etc.) Not words, but math - and not just math concerning
Kolmogorov complexity considered in isolation, but math presenting a
formal relationship between generalization error and
"compressibility". (A relationship that doesn't reduce to the
statement that if you have information concerning the prior, you can
exploit it to generalize well - no rediscovery of the wheel, please.)



David Wolpert






