New paper

David Wolpert dhw at santafe.edu
Tue Apr 21 17:54:03 EDT 1992


********* DO NOT FORWARD TO OTHER MAILING LISTS ******

The following paper has just been placed in neuroprose. It is a major
revision of an earlier preprint of the same title, and appears in the
current issue of Complex Systems.


ON THE CONNECTION BETWEEN IN-SAMPLE TESTING AND
GENERALIZATION ERROR.

David H. Wolpert, The Santa Fe Institute, 1660 Old Pecos Trail, Suite
A, Santa Fe, NM, 87501.


Abstract: This paper proves that it is impossible to justify a correlation
between reproduction of a training set and generalization error off the
training set using only a priori reasoning. As a result, the use in the real
world of any generalizer which fits a hypothesis function to a training set
(e.g., the use of back-propagation) is implicitly predicated on an assumption
about the physical universe. This paper shows how this assumption can be
expressed in terms of a non-Euclidean inner product between two vectors,
one representing the physical universe and one representing the generalizer.
In deriving this result, a novel formalism for addressing machine learning is
developed. This new formalism can be viewed as an extension of the
conventional "Bayesian" formalism, one which (amongst other things) allows
one to address the case where one's assumed "priors" are not exactly
correct. The most important feature of this new formalism is that it uses an
extremely low-level event space, consisting of triples of {target function,
hypothesis function, training set}. Partly as a result of this feature, most
other formalisms that have been constructed to address machine learning
(e.g., PAC, the Bayesian formalism, the "statistical mechanics" formalism)
are special cases of the formalism presented in this paper. Consequently,
such formalisms are capable of addressing only a subset of the issues
addressed in this paper. In fact, the formalism of this paper can be used to
address all generalization issues of which I am aware: over-training, the
need to restrict the number of free parameters in the hypothesis function,
the problems associated with a "non-representative" training set, whether
and when cross-validation works, whether and when stacked generalization
works, whether and when a particular regularizer will work, etc. A summary
of some of the more important results of this paper concerning these and
related topics can be found in the conclusion.

**********************************************

To retrieve this paper, which comes in two parts, do the following:

unix> ftp archive.cis.ohio-state.edu
login> anonymous
password> neuron
ftp> binary
ftp> cd pub/neuroprose
ftp> get wolpert.reichenbach-1.ps.Z
ftp> get wolpert.reichenbach-2.ps.Z
ftp> quit
unix> uncompress wolpert.reichenbach-1.ps.Z
unix> uncompress wolpert.reichenbach-2.ps.Z
unix> lpr wolpert.reichenbach-1.ps    # or however you print out postscript
unix> lpr wolpert.reichenbach-2.ps    # or however you print out postscript


