Evaluating learning algorithms (Was Re: Connectionist symbol processing: any progress?)

Huaiyu Zhu zhuh at santafe.edu
Mon Aug 17 19:33:13 EDT 1998


Dear Connectionists,

First, I apologize for jumping into an already lengthy debate between Lev
Goldfarb and Mitsu Hadeishi.  I will try to be succinct. I hope you will
find the following worth reading. 

To begin with, let me claim that it is not contradictory to say:
1. The most important issue IS performance of learning rules.
2. Some quantitative measurement (loosely called "metric") is needed.
3. There is no intrinsic Euclidean metric on the space of learning
algorithms.
4. The geometry and topology of input space is generally irrelevant.

Now let me explain why they must be true, and what kind of theory can be
developed to satisfy all of them:

The key observation is that learning algorithms act on the space of
probability distributions.  Unless we are doing rote learning, we cannot
assume the data are generated by a deterministic mechanism.  Therefore any
proper measure of performance should be derived from some geometry of the
space of probability distributions.

Now it is clear that even for a discrete set X, the space of functions on
X is still a linear space equipped with norms and so on, and the space
of probability distributions on X is still a differentiable manifold with
intrinsic geometric structures.  In fact the function-space case can
always be regarded as a special case of the probability-distribution case.
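
To make the discrete case concrete, here is a tiny numerical sketch
(assuming NumPy; it is only my own illustration, not material from the
paper cited below).  It checks that the Kullback-Leibler divergence
between nearby distributions on a 3-point set is, to second order, the
quadratic form given by the Fisher metric, one of the intrinsic geometric
structures mentioned above.

    # Illustrative sketch: the KL divergence between nearby points on the
    # simplex is locally the Fisher-metric quadratic form, not a Euclidean one.
    import numpy as np

    p  = np.array([0.5, 0.3, 0.2])          # a point on the probability simplex
    dp = 1e-3 * np.array([1.0, -2.0, 1.0])  # a tangent vector: components sum to zero
    q  = p + dp                             # a nearby distribution

    kl   = np.sum(p * np.log(p / q))        # D(p || q)
    quad = 0.5 * np.sum(dp**2 / p)          # (1/2) dp' G dp, Fisher metric G = diag(1/p)

    print(kl, quad)                         # the two agree to second order in dp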

The space of probability distributions is not a Euclidean space.  However,
the beautiful theory of Information Geometry developed by Amari and others
shows that it behaves almost as if it had a squared distance.  For
example, there is a Pythagorean theorem for cross-entropy (the
Kullback-Leibler divergence) that enables us to do things very similar to
taking averages and projecting onto a linear subspace in a Euclidean space.
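
Here is a small numerical check of that Pythagorean relation (again just
an illustrative sketch of my own, assuming NumPy).  A joint distribution
p(x,y) is projected onto the family of independent distributions; the
projection q is the product of p's marginals, and for any other
independent distribution r we get D(p||r) = D(p||q) + D(q||r).

    # Illustrative sketch: Pythagorean relation for the KL divergence under
    # projection onto the family of independent (product) distributions.
    import numpy as np

    def kl(a, b):
        return np.sum(a * np.log(a / b))

    rng = np.random.default_rng(0)
    p = rng.random((3, 4)); p /= p.sum()            # an arbitrary joint p(x,y)
    q = np.outer(p.sum(axis=1), p.sum(axis=0))      # product of p's marginals
    rx, ry = rng.random(3), rng.random(4)
    r = np.outer(rx / rx.sum(), ry / ry.sum())      # an arbitrary independent r

    print(kl(p, r), kl(p, q) + kl(q, r))            # the two sides coincide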

Information Geometry fully specifies the amount of information that is
lost by various computations.  (A learning algorithm tries to reduce the
data with minimum loss of information.)  However, our information is not
solely contained in the data.  Different conclusions can be drawn from the
same data if different statistical assumptions are made.  Such assumptions
can be specified by a prior (a distribution over all the possible
distributions); technically, this is the content of the complete class
theorem.  The prior and the data should be consistently combined by a
Bayesian method.  It can be shown that learning algorithms can be
completely and quantitatively evaluated in this framework.
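
As a toy illustration of such an evaluation (my own sketch in NumPy, not
the construction in the paper below): put a Beta prior on the parameter of
a Bernoulli source, form the posterior from some data, and score two
learning rules, the maximum likelihood estimate and the posterior mean, by
their posterior expected KL divergence from the true distribution.

    # Illustrative sketch: Bayesian evaluation of two learning rules by
    # posterior expected KL divergence (Beta-Bernoulli toy model).
    import numpy as np

    rng = np.random.default_rng(0)
    a0, b0 = 1.0, 1.0                                # Beta(1,1) prior over theta
    data = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])  # ten binary observations
    k, n = data.sum(), data.size
    a, b = a0 + k, b0 + n - k                        # Beta posterior after the data

    def bern_kl(t, t_hat):                           # D( Bernoulli(t) || Bernoulli(t_hat) )
        return t * np.log(t / t_hat) + (1 - t) * np.log((1 - t) / (1 - t_hat))

    theta = rng.beta(a, b, size=100000)              # Monte Carlo samples from the posterior
    for name, est in [("max likelihood", k / n), ("posterior mean", a / (a + b))]:
        print(name, bern_kl(theta, est).mean())      # posterior expected loss of each rule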

The input space is generally irrelevant because usually we only want to
learn the conditional distribution of the output given the input.  In the
case of unsupervised learning, such as independent component analysis, it
is still the space of distributions that is relevant.  Again, information
geometry has helped to make several breakthroughs in recent years.  (Check
for papers by Amari, Cardoso, Sejnowski and many others.  See, e.g.,
http://www.cnl.salk.edu/~tewon/ica_cnl.html.)
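
One more small sketch (NumPy again, purely illustrative): the natural
score of a model of p(y|x), the expected conditional KL divergence,
depends only on the conditional distributions and the input frequencies,
so arbitrarily relabelling the input space leaves the evaluation
unchanged.  This is what I mean by the geometry of the input space being
irrelevant.

    # Illustrative sketch: the conditional-KL score is invariant under an
    # arbitrary relabelling (permutation) of the input space.
    import numpy as np

    rng = np.random.default_rng(0)
    px  = rng.random(5);  px /= px.sum()                        # input distribution p(x)
    pyx = rng.random((5, 3)); pyx /= pyx.sum(1, keepdims=True)  # true conditional p(y|x)
    qyx = rng.random((5, 3)); qyx /= qyx.sum(1, keepdims=True)  # model conditional q(y|x)

    def score(px, pyx, qyx):                   # E_x [ D( p(.|x) || q(.|x) ) ]
        return np.sum(px[:, None] * pyx * np.log(pyx / qyx))

    perm = rng.permutation(5)                  # an arbitrary relabelling of the inputs
    print(score(px, pyx, qyx), score(px[perm], pyx[perm], qyx[perm]))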

The following short paper (4 pages) contains an outline with a little more
technical detail.  The numerous articles by Prof. Amari, and especially
his 1985 book, should prove extremely useful regarding information geometry.

      H. Zhu, Bayesian geometric theory of learning algorithms. 
      Proc. of Intnl. Conf. Neural Networks (ICNN'97), 
      Vol.2, pp.1041-1044. Houston, 9-12 June, 1997.
      ftp://ftp.santafe.edu/pub/zhuh/ig-learn-icnn.ps 

Hope you enjoyed reading to this point.

Huaiyu
-- 
Huaiyu Zhu                      Tel: 1 505 984 8800 ext 305       
Santa Fe Institute              Fax: 1 505 982 0565
1399 Hyde Park Road             mailto:zhuh at santafe.edu          
Santa Fe, NM 87501              http://www.santafe.edu/~zhuh/  
USA                             ftp://ftp.santafe.edu/pub/zhuh/  


