Paradigmatic over-fitting
Rick BELEW
belew%FRULM63.BITNET at pucc.Princeton.EDU
Fri May 15 08:55:04 EDT 1992
There has been a great deal of discussion in the GA community concerning
the use of particular functions as a "test suite" against which different
methods (e.g., of performing cross-over) might be compared. The GA is
perhaps particularly well-suited to this mode of analysis, given the way
arbitrary "evaluation" functions are neatly cleaved from characteristics
of the algorithm itself. We have argued before that dichotomizing the
GA/evaluation function relationship in this fashion is inappropriate [W.
Hart & R. K. Belew, ICGA'91]. This note, however, is intended to focus on
the general use of test sets, in any fashion.
Ken DeJong set an ambitious precedent with his thesis [K. DeJong, U.
Michigan, 1975]. As part of a careful empirical investigation of several
major dimensions of the GA, DeJong identified a set of five functions that
seemed to him (at that time) to provide a wide variety of test conditions
affecting the algorithm's performance. Since then, the resulting "De Jong
functions F1-F5" have assumed almost mythic status, and continue to be one
of the primary ways in which new GA techniques are evaluated.
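For concreteness, here is a small Python sketch of the first two of these
functions as they are usually quoted (F1, the "sphere", and F2, Rosenbrock's
saddle). The dimensionality, constants and domains follow the common
textbook statement rather than a transcription of the thesis, so treat it
as illustrative only.

    # F1 and F2 as commonly quoted; the 3-variable form of F1 and the
    # usual domains (|x_i| <= 5.12 for F1, |x_i| <= 2.048 for F2) are the
    # standard textbook statement, assumed here rather than checked
    # against the thesis.
    def f1_sphere(x):
        """F1: sum of squares over three variables, minimized at the origin."""
        return sum(xi * xi for xi in x)

    def f2_rosenbrock(x):
        """F2: Rosenbrock's saddle, minimized at x = (1, 1)."""
        x1, x2 = x
        return 100.0 * (x1 * x1 - x2) ** 2 + (1.0 - x1) ** 2

    print(f1_sphere([0.5, -0.5, 1.0]))   # 1.5
    print(f2_rosenbrock([1.0, 1.0]))     # 0.0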
Within the last several years, a number of researchers have re-evaluated
the DeJong test suite and found it wanting in one respect or another.
Some have felt that it does not provide a real test for cross-over, some
that the set does not accurately characterize the general space of
evaluation functions, and some that naturally occurring problems provide a
much better evaluation than anything synthesized on theoretical grounds.
There is merit to each of these criticisms, and all of this discussion has
furthered our understanding of the GA.
But it seems to me that somewhere along the line the original DeJong suite
became vilified. It isn't just that "hind-sight is always 20-20." I
want to argue that DeJong's functions WERE excellent, so good that they
now ARE a victim of their own success. My argument goes beyond any
particular features of these functions or the GA, and therefore won't make
historical references beyond those just sketched. \footnote{If I'm right
though, it would be an interesting exercise in history of science to
confirm it, with a careful analysis of just which test functions were
used, when, by whom, with citation counting, etc.} It will rely instead on
some fundamental facts from machine learning.
Paradigmatic Over-fitting
"Over-fitting" is a widely recognized phenonemon in machine learning (and
before that, statistics). It refers to a tendancy by learning algorithms
to force the rule induced from a training corpus to agree with this data
set too closely, at the expense of generalization to other instances. We
have all probably seen the example of the same data set fit with two
polynomials, one that is correct and a second, higher-order one that also
attempts to fit the data's noise. A more recent example is provided by
some neural networks, which generalize much better to unseen data if their
training is stopped a bit early, even though further epochs of training
would continue to reduce the observed error on the training set.
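For readers who have not seen the polynomial example worked out, here is a
small sketch, assuming numpy is available; the quadratic, the noise level
and the two degrees are arbitrary choices for illustration.

    # Fit the same noisy sample with a low-order and a high-order polynomial.
    # The high-order fit typically shows the lower training error but the
    # larger error against the underlying curve, i.e. it over-fits.
    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(-1.0, 1.0, 12)
    y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 \
              + rng.normal(0.0, 0.1, x_train.size)

    x_dense = np.linspace(-1.0, 1.0, 200)
    y_truth = 1.0 + 2.0 * x_dense - 3.0 * x_dense**2

    for degree in (2, 9):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        true_err = np.mean((np.polyval(coeffs, x_dense) - y_truth) ** 2)
        print(degree, train_err, true_err)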
I suggest entire scientific disciplines can suffer a similar fate. Many
groups of scientists have found it useful to identify a particular data
set, test suite or "model animal" (i.e., particular species or even
genetic strains that become {\em de rigueur} for certain groups of
biologists). In fact, collective agreement as to the validity and utility
of scientific artifacts like this is critically involved in defining the
"paradigms" (a la Kuhn) in which scientists work. Scientifically, there
are obvious benefits to coordinated use of common test sets. For example,
a wide variety of techniques can be applied to common data and the results
of these various experiments can be compared directly. But if science is
also seen as an inductive process, over-fitting suggests there may also be
dangers inherent in this practice.
Initially, standardized test sets are almost certain to help any field
evaluate alternative methods; suppose they show that technique A1 is
superior to B1 and C1. But as soon as the results of these experiments
are used to guide the development ("skew the sampling") of new methods (to
A2, A3 and A4 for example), our confidence in the results of this second
set of experiments as accurate reflections of what will be found generally
true must diminish. Over time, then, the same data set that initially
served the field well can come to actually impede progress by creating a
false characterization of the real problems to be solved. The problem is
that the time-scale of scientific induction is so much slower than that of
our computational methods that the biases resulting from "paradigmatic
over-fitting" may be very difficult to recognize.
Machine learning also offers some remedies to the dilemma of over-training.
The general idea is to use more than one data set for training, or more
accurately, partition available training data into subsets. Then,
portions of the training set can be methodically held back in order to
compare the result induced from one subset with that induced from another
(via cross-validation, jack-knifing, etc.).
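In code, the simplest version of this remedy looks roughly like the
following sketch of k-fold cross-validation; 'fit' and 'error' are
placeholders for whatever induction method and loss are actually in use.

    # Hold each fold back in turn, induce a rule from the rest, and score
    # the rule on the held-back fold; report the average held-out error.
    def k_fold_cross_validation(data, k, fit, error):
        folds = [data[i::k] for i in range(k)]   # k roughly equal subsets
        held_out_errors = []
        for i in range(k):
            held_out = folds[i]
            training = [ex for j, fold in enumerate(folds)
                        if j != i for ex in fold]
            model = fit(training)                # induce from the other folds
            held_out_errors.append(error(model, held_out))
        return sum(held_out_errors) / k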
How might this procedure be applied to science? It would be somewhat
artificial to purposefully identify but then hold back some data sets,
perhaps for years. More natural strategies with about the same effect
seem workable, however. First, a field should maintain MULTIPLE data
sets, to minimize aberrations due to any one. Second, each of these can
only be USED FOR A LIMITED TIME, to be replaced by new ones.
The problem is that even these modest conventions require significant
"discipline discipline." Accomplishing any coordination across
independent-minded scientists is difficult, and the use of shared data
sets is a fairly effortless way to accomplish useful coordination. Data
sets are difficult to obtain in the first place, and convincing others to
become familiar with them even harder; these become inertial forces that
will make scientists reluctant to part with the classic data sets they
know well. Evaluating results across multiple data sets also makes new
problems for reviewers and editors. And, because the time-scale of
scientific induction is so long relative to the careers of the scientists
involved, the costs associated with all these concrete problems, relative
to theoretical ones due to paradigmatic over-fitting, will likely seem
huge: "Why should I give up familiar data sets when we previously agreed
to their validity, especially since my methods seem to be working better
and better on them?!"
Back to the GA
The GA community, at present, seems fairly healthy according to this
analysis. In addition to De Jong's, people like Dave Ackley have
generated very useful sets of test functions. There are now test suites
that have many desirable properties, like being "GA-hard,"
"GA-deceptive," "royal road," "practical" and "real-world." So there are
clearly plenty of tests.
For this group, my main point is that this plurality is very desirable. I
too am dissatisfied with De Jong's test suite, but I am equally
dissatisfied with any ONE of the more recently proposed alternatives. I
suggest it's time we move beyond debates about whose tests are most
illuminating. If we ever did pick just one set to use for testing GAs it
would --- like De Jong's --- soon come to warp the development of GAs
according to ITS inevitable biases. What we need are more sophisticated
analyses and methodologies that allow a wide variety of testing
procedures, each showing something different.
Flame off,
Rik Belew
[I owe the basic insight --- that an entire discipline can be seen to
over-fit to a limited training corpus --- to conversations with Richard
Palmer and Rich Sutton, at the Santa Fe Institute in March, 1992. Of
course, all blame for damage occurring as the neat, little insight was
stretched into this epistle concerning GA research is mine alone.]
Richard K. Belew
Computer Science & Engr. Dept (0014)
Univ. California - San Diego
La Jolla, CA 92093
rik at cs.ucsd.edu
From now until about 20 July I will be working in Paris:
c/o J-A. Meyer
Groupe de BioInformatique
URA686. Ecole Normale Superieure
46 rue d'Ulm
75230 PARIS Cedex05
France
Tel: 44 32 36 23
Fax: 44 32 39 01
belew at wotan.ens.fr