No Free Lunch? (Delayed reply)

zhuh zhuh at helios.aston.ac.uk
Wed Nov 8 07:07:49 EST 1995


Pau Bofill paraphrases Wolpert et al.'s paper "No Free Lunch..." [1] as

>	"any two search algorithms have exactly the same expected
>	performance, over the space of all optimization problems
>	that can be defined within a search space."

This captures the trivial part of the argument.  However, Wolpert
et al. also show that you can never cook up a method that is good for
an arbitrary non-uniform prior: for every prior on which it performs
better than it would on a uniform prior, there is another on which it
performs worse by the same amount.
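
As a toy illustration of this averaging argument (my own construction
in Python, not code from [1]): enumerate every fitness function on a
four-point search space and compare two fixed, non-revisiting search
orders.  Under the uniform prior both achieve exactly the same
expected best-found value.

    # Toy NFL check: average the best value found by two search orders
    # over ALL fitness functions on a small space (uniform prior).
    from itertools import product

    X = range(4)                  # search points
    Y = range(3)                  # possible fitness values

    def best_after_k(order, f, k=2):
        """Best fitness seen after the first k evaluations."""
        return max(f[x] for x in order[:k])

    order_a = [0, 1, 2, 3]        # "algorithm" A: left-to-right scan
    order_b = [3, 2, 1, 0]        # "algorithm" B: right-to-left scan

    all_f = [dict(zip(X, v)) for v in product(Y, repeat=len(X))]

    mean_a = sum(best_after_k(order_a, f) for f in all_f) / len(all_f)
    mean_b = sum(best_after_k(order_b, f) for f in all_f) / len(all_f)
    print(mean_a, mean_b)         # identical, as the theorem predicts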

This has important implications, given that neural network researchers
are rarely precise about the distribution of optimisation problems
their algorithms are aimed at.  This is an issue worth raising, and
there is a principled way of dealing with it.

The essential point is that without specifying the prior P(f) over
the space of target distributions that a learning algorithm is meant to
cope with, there is no objective way of comparing one algorithm 
with another.  If you want to claim that good performance on a selection
of problems from some domain implies that good performance should be
expected on other problems from that domain, then you need to know
enough about the prior distribution of problems to claim that your test
suite forms a representative sample of it.
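
Concretely (a minimal sketch with an invented toy prior, not taken
from the cited papers): once P(f) is written down, expected
performance can be estimated objectively by sampling problems from
it.  A test suite licenses generalisation claims only insofar as it
stands in for such a sample.

    # Monte Carlo estimate of expected loss under an assumed prior P(f).
    import random
    random.seed(0)

    def sample_problem():
        """Draw a 1-d quadratic target with a random optimum -- a
        stand-in for P(f); any generative model of tasks would do."""
        c = random.uniform(-1.0, 1.0)
        return lambda x: (x - c) ** 2

    def run_algorithm(f, n_evals=20):
        """Toy random-search 'algorithm': best value found."""
        return min(f(random.uniform(-1.0, 1.0)) for _ in range(n_evals))

    losses = [run_algorithm(sample_problem()) for _ in range(1000)]
    print(sum(losses) / len(losses))   # expected loss under the prior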

Scientifically verifiable claims about a learning algorithm should
specify at least the following three things:
1. the prior over target distributions.
2. the loss function.
3. the model.
The first two are combined in Wolpert's work as the "prior on the 
fitness function".
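
To see how the first two ingredients pin down an estimator (a
standard textbook construction, not drawn from the papers cited):
with a discrete set of candidate parameters, Bayes' rule gives a
posterior, and the estimate is whatever minimises posterior expected
loss -- so changing the loss changes the estimate.

    # Prior + loss determine the estimator (coin-bias example).
    thetas = [0.2, 0.5, 0.8]              # candidate biases (the model)
    prior  = {0.2: 0.25, 0.5: 0.5, 0.8: 0.25}

    def likelihood(t, heads, tails):
        return t ** heads * (1 - t) ** tails

    def posterior(heads, tails):
        w = {t: prior[t] * likelihood(t, heads, tails) for t in thetas}
        z = sum(w.values())
        return {t: w[t] / z for t in thetas}

    post = posterior(heads=7, tails=3)

    # Squared-error loss => posterior mean; 0-1 loss => posterior mode.
    post_mean = sum(t * p for t, p in post.items())
    post_mode = max(post, key=post.get)
    print(post_mean, post_mode)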

Conversely, it is shown [2,3,4] that 
1. Given the first specification, there is always a uniquely defined 
"ideal estimator".
2. If the loss function is chosen as the information divergence 
(Kullback-Leibler divergence) [5], then the ideal estimate retains 
all the information in the prior and the training data, and any 
learning algorithm must be a function of this ideal estimate.
3. If the purpose of a learning algorithm is to keep information, then 
it must approximate the ideal estimate.
4. For any computational model, the optimal estimate within the model
is always an appropriate projection of the ideal estimate onto the 
model, as sketched below.
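
The following sketch (again my own construction; here the "ideal
estimate" is simply a full posterior predictive over three outcomes)
illustrates points 1 and 4: the optimal estimate within a restricted
model family is the family member closest to the ideal estimate in
information (Kullback-Leibler) divergence.

    # KL projection of an "ideal" distribution onto a model family.
    import math

    ideal = [0.5, 0.3, 0.2]               # posterior predictive

    def model(q):                         # family: (q, q, 1 - 2q)
        return [q, q, 1 - 2 * q]

    def kl(p, r):
        return sum(pi * math.log(pi / ri)
                   for pi, ri in zip(p, r) if pi > 0)

    qs = [i / 1000 for i in range(1, 500)]
    q_star = min(qs, key=lambda q: kl(ideal, model(q)))
    print(q_star, model(q_star))          # q* = 0.4: the projection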

In essence, the endeavour to design better learning rules is
fundamentally identical to the activity of finding good priors for
application problems.

Furthermore, the usual objection that in practice the prior is
something vague that cannot be quantified does not hold up.  It is
shown [6] that even when training on data sets retrieved by ftp from
the Internet, with little or no description of the problems, a
reasonably good prior is still available.
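
For flavour only (this is a generic empirical-Bayes construction of
my own, not the method of [6]): hyperparameters of a prior can be
read off from raw data with no accompanying problem description, by
matching the marginal variability of the observations.

    # Empirical Bayes: estimate a Gaussian prior from the data itself.
    import statistics

    def empirical_prior(x, noise_var):
        """x_i ~ N(theta_i, noise_var), theta_i ~ N(mu, t2); the
        marginal variance of the x_i is t2 + noise_var."""
        mu = statistics.mean(x)
        t2 = max(statistics.pvariance(x) - noise_var, 0.0)
        return mu, t2

    def posterior_mean(xi, mu, t2, noise_var):
        """Shrink each observation toward the estimated prior mean."""
        w = t2 / (t2 + noise_var)
        return mu + w * (xi - mu)

    x = [3.1, 0.2, 1.9, 2.6, -0.4, 1.1]   # one observation per task
    mu, t2 = empirical_prior(x, noise_var=1.0)
    print([round(posterior_mean(xi, mu, t2, 1.0), 2) for xi in x])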

REFERENCES:

[1] Wolpert, D. H. and Macready, W. G.:
  No Free Lunch Theorems for Search, Santa Fe Inst. SFI-TR-95-02-010
  ftp://ftp.santafe.edu/pub/wgm/nfl.ps

[2] Zhu, H. and Rohwer, R.:
  Bayesian Invariant Measurements of Generalisation, 1995
  ftp://cs.aston.ac.uk/neural/zhuh/letter.ps.Z
  
[3] Zhu, H. and Rohwer, R.:
  Measurements of Generalisation Based on Information, MANNA conference,
  Oxford, 1995 (to appear in Ann. Math. Artif. Intell.)
  ftp://cs.aston.ac.uk/neural/zhuh/generalisation-manna.ps.Z

[4] Zhu, H. and Rohwer, R.:
  A Bayesian Geometric Theory of Statistical Inference, 1995
  ftp://cs.aston.ac.uk/neural/zhuh/stat.ps.Z

[5] Amari, S.:
  Differential-Geometrical Methods in Statistics, 
  Springer-Verlag, 1985.

[6] Zhu, H. and Rohwer, R.:
  Bayesian regression filters and the issue of priors, 1995
  ftp://cs.aston.ac.uk/neural/zhuh/reg_fil_prior.ps.Z


--
Dr. Huaiyu Zhu                           zhuh at aston.ac.uk
Neural Computing Research Group
Dept of Computer Sciences and Applied Mathematics
Aston University, Birmingham B4 7ET, UK

