3 TechReports on Measuring Generalisation ...

H ZHU zhuh at helios.aston.ac.uk
Fri Sep 1 08:47:23 EDT 1995


Is there any well-defined meaning to statements like 
    "Learning rule A is better than learning rule B"?

The answer is yes, as long as three things are specified: the prior,
which is the distribution of problems to be solved; the information
divergence, which measures how far the estimated distribution is from
the true distribution; and the model, which is the space of all
representable solutions.
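
For a concrete handle on the second ingredient, here is a minimal
Python sketch (with made-up numbers) that uses the Kullback-Leibler
divergence, one familiar choice of information divergence, to score two
estimates of the same true discrete distribution; the reports treat the
divergence in general, and KL appears here only as an example.

    import numpy as np

    def kl_divergence(p_true, p_est, eps=1e-12):
        # KL divergence D(p_true || p_est) between two discrete distributions.
        p_true, p_est = np.asarray(p_true, float), np.asarray(p_est, float)
        return float(np.sum(p_true * (np.log(p_true + eps) - np.log(p_est + eps))))

    # Hypothetical numbers: two learning rules' estimates of the same truth.
    p_true = [0.5, 0.3, 0.2]
    rule_a = [0.45, 0.35, 0.20]
    rule_b = [0.70, 0.20, 0.10]
    print(kl_divergence(p_true, rule_a))  # smaller divergence = closer to the truth
    print(kl_divergence(p_true, rule_b))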

The following three Technical Reports develop the theory needed to
evaluate and compare neural network learning rules and other
statistical estimators.

ftp://cs.aston.ac.uk/neural/zhuh/discrete.ps.Z
ftp://cs.aston.ac.uk/neural/zhuh/continuous.ps.Z
ftp://cs.aston.ac.uk/neural/zhuh/generalisation.ps.Z

Bayesian Invariant Measurements of Generalisation for Discrete Distributions
Bayesian Invariant Measurements of Generalisation for Continuous Distributions
Information Geometric Measurements of Generalisation
                by Huaiyu Zhu and Richard Rohwer

                            ABSTRACT

Neural networks can be considered as statistical models, and learning
rules as statistical estimators.  They should be compared in the
framework of Bayesian decision theory, with information divergence as
the loss function.  This ensures coherence (an estimator is optimal if
and only if it gives optimal estimates for almost all the data) and
invariance (the optimality condition does not depend on one-to-one
transformations of the input, output and parameter spaces).  The main
result is that the ideal optimal estimator is given as an appropriate
average over the posterior.  The optimal estimator restricted to any
particular model is given by an appropriate projection of the ideal
optimal estimator onto the model.  The ideal optimal estimator is a
sufficient statistic, so all practical learning rules are functions of
it.  They can also be regarded as approximations to it if preserving
the information in the data is the sole utility.
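
To make the main result concrete, here is a minimal numerical sketch in
Python.  It assumes, purely for illustration, the Kullback-Leibler
divergence D(p||q) as the loss and a Dirichlet posterior over a
three-outcome distribution; neither assumption is the reports' general
setting.  Under KL loss the expected divergence over the posterior is
minimised by the posterior mean of p, illustrating the average over the
posterior, and restricting the same minimisation to a hypothetical
one-parameter model illustrates the projection onto the model.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical setup: three outcomes, observed counts, flat Dirichlet prior.
    counts = np.array([7, 2, 1])
    posterior_samples = rng.dirichlet(counts + 1.0, size=20000)  # samples of the true p

    def expected_kl(q, samples, eps=1e-12):
        # Monte Carlo estimate of E_p[ D(p || q) ] over the posterior samples of p.
        return float(np.mean(np.sum(samples * (np.log(samples + eps) - np.log(q + eps)), axis=1)))

    # Ideal optimal estimate: with KL loss the q-dependent term depends on p only
    # linearly, so the expected divergence is minimised by the posterior mean of p.
    q_ideal = posterior_samples.mean(axis=0)

    # Model-restricted estimate: minimise the same expected divergence over the
    # hypothetical one-parameter model q(t) = (t, (1-t)/2, (1-t)/2).
    ts = np.linspace(0.01, 0.99, 99)
    q_model = min((np.array([t, (1 - t) / 2, (1 - t) / 2]) for t in ts),
                  key=lambda q: expected_kl(q, posterior_samples))

    print("ideal estimate        :", q_ideal)
    print("model-restricted q(t) :", q_model)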

This new theory of statistical inference retains many of the desirable
properties of the least mean squares theory for linear Gaussian
models, yet is applicable to any statistical estimation problem,
including all neural network learning rules (deterministic and
stochastic; supervised, reinforcement and unsupervised).

Comments are welcome and very much appreciated!

-- 
Dr. Huaiyu Zhu                                  zhuh at aston.ac.uk
Neural Computing Research Group
Dept of Computer Sciences and Applied Mathematics
Aston University, Birmingham B4 7ET, UK

