Paradigmatic over-fitting
Rick BELEW
belew%FRULM63.BITNET at pucc.Princeton.EDU
Fri May 15 08:55:04 EDT 1992
There has been a great deal of discussion in the GA community concerning
the use of particular functions as a "test suite" against which different
methods (e.g., of performing cross-over) might be compared. The GA is
perhaps particularly well-suited to this mode of analysis, given the way
arbitrary "evaluation" functions are neatly cleaved from characteristics
of the algorithm itself. We have argued before that dichotomizing the
GA/evaluation function relationship in this fashion is inappropriate [W.
Hart & R. K. Belew, ICGA'91]. This note, however, is intended to focus on
the general use of test sets, in any fashion.
Ken DeJong set an ambitious precedent with his thesis [K. DeJong, U.
Michigan, 1975]. As part of a careful empirical investigation of several
major dimensions of the GA, DeJong identified a set of five functions that
seemed to him (at that time) to provide a wide variety of test conditions
affecting the algorithm's performance. Since then, the resulting "De Jong
functions F1-F5" have assumed almost mythic status, and continue to be one
of the primary ways in which new GA techniques are evaluated.
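For concreteness, here is a small Python sketch of the first two of these
functions as they are usually quoted (F1, the "sphere", and F2, Rosenbrock's
saddle). The dimensionality, constants and domains follow the common
textbook statement rather than a transcription of the thesis, so treat it
as illustrative only.

    # F1 and F2 as commonly quoted; the 3-variable form of F1 and the
    # usual domains (|x_i| <= 5.12 for F1, |x_i| <= 2.048 for F2) are the
    # standard textbook statement, assumed here rather than checked
    # against the thesis.
    def f1_sphere(x):
        """F1: sum of squares over three variables, minimized at the origin."""
        return sum(xi * xi for xi in x)

    def f2_rosenbrock(x):
        """F2: Rosenbrock's saddle, minimized at x = (1, 1)."""
        x1, x2 = x
        return 100.0 * (x1 * x1 - x2) ** 2 + (1.0 - x1) ** 2

    print(f1_sphere([0.5, -0.5, 1.0]))   # 1.5
    print(f2_rosenbrock([1.0, 1.0]))     # 0.0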
Within the last several years, a number of researchers have re-evaluated
the DeJong test suite and found it wanting in one respect or another.
Some have felt that it does not provide a real test for cross-over, some
that the set does not accurately characterize the general space of
evaluation functions, and some that naturally occurring problems provide a
much better evaluation than anything synthesized on theoretical grounds.
There is merit to each of these criticisms, and all of this discussion has
furthered our understanding of the GA.
But it seems to me that somewhere along the line the original DeJong suite
became vilified. It isn't just that "hind-sight is always 20-20." I
want to argue that DeJong's functions WERE excellent, so good that they
now ARE a victim of their own success. My argument goes beyond any
particular features of these functions or the GA, and therefore won't make
historical references beyond those just sketched. \footnote{If I'm right
though, it would be an interesting exercise in history of science to
confirm it, with a careful analysis of just which test functions were
used, when, by whom, with citation counting, etc.} It will rely instead on
some fundamental facts from machine learning.
Paradigmatic Over-fitting
"Over-fitting" is a widely recognized phenonemon in machine learning (and
before that, statistics). It refers to a tendancy by learning algorithms
to force the rule induced from a training corpus to agree with this data
set too closely, at the expense of generalization to other instances. We
have all probably seen the example of the same data set fit with two
polynomials, one that is correct and a second, higher-order one that also
attempts to fit the data's noise. A more recent example is provided by
some neural networks, which generalize much better to unseen data if their
training is stopped a bit early, even though further epochs of training
would continue to reduce the observed error on the training set.
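For readers who have not seen the polynomial example worked out, here is a
small sketch, assuming numpy is available; the quadratic, the noise level
and the two degrees are arbitrary choices for illustration.

    # Fit the same noisy sample with a low-order and a high-order polynomial.
    # The high-order fit typically shows the lower training error but the
    # larger error against the underlying curve, i.e. it over-fits.
    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(-1.0, 1.0, 12)
    y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 \
              + rng.normal(0.0, 0.1, x_train.size)

    x_dense = np.linspace(-1.0, 1.0, 200)
    y_truth = 1.0 + 2.0 * x_dense - 3.0 * x_dense**2

    for degree in (2, 9):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        true_err = np.mean((np.polyval(coeffs, x_dense) - y_truth) ** 2)
        print(degree, train_err, true_err)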
I suggest entire scientific disciplines can suffer a similar fate. Many
groups of scientists have found it useful to identify a particular data
set, test suite or "model animal" (i.e., particular species or even
genetic strains that become {\em de rigueur} for certain groups of
biologists). In fact, collective agreement as to the validity and utility
of scientific artifacts like this is critically involved in defining the
"paradigms" (a la Kuhn) in which scientists work. Scientifically, there
are obvious benefits to coordinated use of common test sets. For example,
a wide variety of techniques can be applied to common data and the results
of these various experiments can be compared directly. But if science is
also seen as an inductive process, over-fitting suggests there may also be
dangers inherent in this practice.
Initially, standardized test sets are almost certain to help any field
evaluate alternative methods; suppose they show that technique A1 is
superior to B1 and C1. But as soon as the results of these experiments
are used to guide the development ("skew the sampling") of new methods (to
A2, A3 and A4 for example), our confidence in the results of this second
set of experiments as accurate reflections of what will be found generally
true must diminish. Over time, then, the same data set that initially
served the field well can come to actually impede progress by creating a
false characterization of the real problems to be solved. The problem is
that the time-scale of scientific induction is so much slower than that of
our computational methods that the biases resulting from "paradigmatic
over-fitting" may be very difficult to recognize.
Machine learning also offers some remedies to the dilemma of over-training.
The general idea is to use more than one data set for training, or more
accurately, partition available training data into subsets. Then,
portions of the training set can be methodically held back in order to
compare the result induced from one subset with that induced from another
(via cross-validation, jack-knifing, etc.).
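In code, the simplest version of this remedy looks roughly like the
following sketch of k-fold cross-validation; 'fit' and 'error' are
placeholders for whatever induction method and loss are actually in use.

    # Hold each fold back in turn, induce a rule from the rest, and score
    # the rule on the held-back fold; report the average held-out error.
    def k_fold_cross_validation(data, k, fit, error):
        folds = [data[i::k] for i in range(k)]   # k roughly equal subsets
        held_out_errors = []
        for i in range(k):
            held_out = folds[i]
            training = [ex for j, fold in enumerate(folds)
                        if j != i for ex in fold]
            model = fit(training)                # induce from the other folds
            held_out_errors.append(error(model, held_out))
        return sum(held_out_errors) / k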
How might this procedure be applied to science? It would be somewhat
artificial to purposefully identify but then hold back some data sets,
perhaps for years. More natural strategies with about the same effect
seem workable, however. First, a field should maintain MULTIPLE data
sets, to minimize aberrations due to any one. Second, each of these can
only be USED FOR A LIMITED TIME, to be replaced by new ones.
The problem is that even these modest conventions require significant
"discipline discipline." Accomplishing any coordination across
independent-minded scientists is difficult, and the use of shared data
sets is a fairly effortless way to accomplish useful coordination. Data
sets are difficult to obtain in the first place, and convincing others to
become familiar with them even harder; these become inertial forces that
will make scientists reluctant to part with the classic data sets they
know well. Evaluating results across multiple data sets also makes new
problems for reviewers and editors. And, because the time-scale of
scientific induction is so long relative to the careers of the scientists
involved, the costs associated with all these concrete problems, relative
to theoretical ones due to paradigmatic over-fitting, will likely seem
huge: "Why should I give up familiar data sets when we previously agreed
to their validity, especially since my methods seem to be working better
and better on them?!"
Back to the GA
The GA community, at present, seems fairly healthy according to this
analysis. In addition to De Jong's, people like Dave Ackley have
generated very useful sets of test functions. There are now test suites
that have many desirable properties, like being "GA-hard,"
"GA-deceptive," "royal road," "practical" and "real-world." So there are
clearly plenty of tests.
For this group, my main point is that this plurality is very desirable. I
too am dissatisfied with De Jong's test suite, but I am equally
dissatisfied with any ONE of the more recently proposed alternatives. I
suggest it's time we move beyond debates about whose tests are most
illuminating. If we ever did pick just one set to use for testing GAs it
would --- like De Jong's --- soon come to warp the development of GAs
according to ITS inevitable biases. What we need are more sophisticated
analyses and methodologies that allow a wide variety of testing
procedures, each showing something different.
Flame off,
Rik Belew
[I owe the basic insight --- that an entire discipline can be seen to
over-fit to a limited training corpus --- to conversations with Richard
Palmer and Rich Sutton, at the Santa Fe Institute in March, 1992. Of
course, all blame for damage occurring as the neat, little insight was
stretched into this epistle concerning GA research is mine alone.]
Richard K. Belew
Computer Science & Engr. Dept (0014)
Univ. California - San Diego
La Jolla, CA 92093
rik at cs.ucsd.edu
From now until about 20 July I will be working in Paris:
c/o J-A. Meyer
Groupe de BioInformatique
URA686. Ecole Normale Superieure
46 rue d'Ulm
75230 PARIS Cedex05
France
Tel: 44 32 36 23
Fax: 44 32 39 01
belew at wotan.ens.fr