compressibility and generalization

hicks@cs.titech.ac.jp
Mon Dec 4 20:40:08 EST 1995


On Sun, 3 Dec 1995 12:40:25, Juergen Schmidhuber wrote:
>(2) Back to: what does this have to do with machine learning? As a
>first step, we may simply apply Solomonoff's theory of inductive
>inference to a dynamic system or ``universe''. Loosely speaking,
>in a universe whose history is compressible, we may expect to
>generalize well.  A simple, old counting argument shows: most
>computable universes are incompressible. Therefore, in most
>computable universes you won't generalize well (this is related
>to what has been (re)discovered in NFL).

In an earlier communication I hypothesized that a typical universe would have
structure that could be exploited by cross-validation.  This communication
from Juergen Schmidhuber contradicts my hypothesis, I think, because of the
existence of the "simple, old counting argument" showing that "most computable
universes are incompressible".  I stand corrected.

The point I really wanted clarified was what was meant by the assertion
that in a typical universe

(A)	cross-validation works as well as anti-cross validation


I will just talk about the problem of (deterministic or stochastic) function
estimation.  I can accept that for any set of model functions, there will be
an infinity of problems where cross-validation will be of no assistance,
because that model does not have the capacity to predict future input/output
relations from any finite set of examples from the past.  This could be either
because the true function is pure noise, or because it looks like pure noise
from the perspective of any function from the set of candidate model
functions.  In this case there will be no correlation between predictions and
samples, and cross-validation will do its job of telling us that the
generalization error is not decreasing.
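The pure-noise case above can be sketched numerically.  The following is a
minimal illustration of my own (not from Schmidhuber's argument): the target
outputs are independent of the inputs, and k-fold cross-validation correctly
reports that no candidate model's held-out error drops below the noise level.
The two candidate fitters (a constant mean predictor and a nearest-neighbor
memorizer) are arbitrary choices for the sketch.

```python
# Sketch: cross-validation on a pure-noise target.  Held-out MSE stays at
# or above the noise variance for every candidate model, so CV correctly
# reports that the generalization error is not decreasing.
import random
import statistics

random.seed(0)

def kfold_mse(xs, ys, fit, k=5):
    """Average held-out mean squared error over k folds."""
    n = len(xs)
    fold_errs = []
    for fold in range(k):
        test = set(range(fold, n, k))
        train_x = [x for i, x in enumerate(xs) if i not in test]
        train_y = [y for i, y in enumerate(ys) if i not in test]
        predict = fit(train_x, train_y)
        fold_errs.append(statistics.mean(
            (ys[i] - predict(xs[i])) ** 2 for i in sorted(test)))
    return statistics.mean(fold_errs)

def fit_mean(train_x, train_y):
    # Constant predictor: the training-set mean.
    m = statistics.mean(train_y)
    return lambda x: m

def fit_nearest(train_x, train_y):
    # Memorizer: predict the output of the nearest training input.
    pairs = list(zip(train_x, train_y))
    def predict(x):
        return min(pairs, key=lambda p: abs(p[0] - x))[1]
    return predict

# Target is pure noise: y ~ N(0, 1), independent of x.
xs = [random.random() for _ in range(500)]
ys = [random.gauss(0.0, 1.0) for _ in range(500)]

for name, fit in [("mean", fit_mean), ("nearest", fit_nearest)]:
    print(name, round(kfold_mse(xs, ys, fit), 2))
```

Both estimates land near the noise variance of 1 (the memorizer, which
overfits, does somewhat worse), i.e. cross-validation is not deceived.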

However, I interpret the assertion that anti-cross validation can be expected
to work as well as cross-validation to mean that we can equally well expect
cross-validation to lie.  That is, if cross-validation is telling us that the
generalization error is decreasing, we can expect, on average, that the true
generalization error is not decreasing.

Isn't this a contradiction, if we assume that the samples are really randomly
chosen?  Of course, we can a posteriori always choose a worst case function
which fits the samples taken so far, but contradicts the learned model
elsewhere.  But if we turn things around and randomly sample that deceptive
function anew, the learned model will probably be different, and
cross-validation will behave as it should.

I think this follows from the principle that the empirical distribution over
an ever larger number of samples converges to the true distribution of a
single sample (assuming the true distribution is stationary).
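That convergence principle is easy to check numerically.  A minimal sketch of
my own (the three-outcome distribution is an arbitrary example): as the number
of i.i.d. draws from a fixed source grows, the largest gap between empirical
and true probabilities shrinks toward zero.

```python
# Sketch: the empirical distribution of i.i.d. samples from a stationary
# source converges to the true distribution as the sample count grows.
import random
from collections import Counter

random.seed(0)

# An arbitrary fixed ("stationary") discrete distribution.
true_dist = {"a": 0.5, "b": 0.3, "c": 0.2}
outcomes = list(true_dist)
weights = [true_dist[o] for o in outcomes]

def max_deviation(n):
    """Largest gap between empirical and true probabilities after n draws."""
    counts = Counter(random.choices(outcomes, weights=weights, k=n))
    return max(abs(counts[o] / n - true_dist[o]) for o in outcomes)

for n in (100, 10_000, 1_000_000):
    print(n, round(max_deviation(n), 4))
```

The printed deviations decrease as n grows, which is just the law of large
numbers applied outcome by outcome.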

Does assertion (A) mean that this principle fails in alternative universes?

Respectfully Yours,

	Craig Hicks

Craig Hicks           hicks at cs.titech.ac.jp 
Ogawa Laboratory, Dept. of Computer Science 
Tokyo Institute of Technology, Tokyo, Japan 
