validation sets
    Wray Buntine 
    wray at ptolemy.arc.nasa.gov
       
    Mon Nov 25 21:04:20 EST 1991
    
    
  
Jude Shavlik says:
>  The question of whether or not validation sets are useful can easily be
>  answered, at least on specific datasets.  We have run that experiment and
>  found that devoting some training examples to validation is useful (ie,
>  training on N examples does worse than training on N-k and validating on k).
>  
>  This same issue comes up with decision-tree learners (where the validation set
>  is often called a "tuning set", as it is used to prune the decision tree).  I
>  believe there people have also found it is useful to devote some examples to
>  pruning/validating.
Sorry Jude, but I couldn't let this one slip by.
Use of a validation set in decision-tree learners produces great results
ONLY when you have LOTS and LOTS of data.  When you have less data,
cross-validation or use of a well put together complexity/penalty
term (i.e.  carefully thought out MDL, weight decay/elimination,
            Bayesian maximum posterior, regularization, etc. etc. etc.)
works much better.  If the penalty term isn't well thought out
(e.g.  the early work on feed-forward networks, such as weight
       decay/elimination, was still toying with a new idea, so I'd
       call it not well thought out, although revolutionary for the
       time)
then performance isn't as good.  The best results with trees so far
come from "averaging", i.e.  probabilistically combining the
predictions of many different trees; this is experimental confirmation
of the COLT-91 Haussler et al. style of results.
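To make "averaging" concrete, here is a minimal sketch (my own
illustration in Python, not code from any of the papers mentioned) of
probabilistically combining the class-probability estimates of several
trees, each weighted by how strongly it is believed:

    # Combine class-probability estimates from many trees.
    # Everything here is illustrative: the trees and their weights
    # are assumed to come from elsewhere (e.g. posterior weights).
    def average_trees(trees, weights, example):
        """trees   -- functions mapping an example to {label: prob}
           weights -- one nonnegative weight per tree
           example -- the case to classify"""
        total = float(sum(weights))
        combined = {}
        for tree, w in zip(trees, weights):
            for label, p in tree(example).items():
                combined[label] = combined.get(label, 0.0) + (w / total) * p
        return combined  # a single, averaged class distribution

    # Predict the class with the highest combined probability:
    # best = max(average_trees(trees, weights, x).items(),
    #            key=lambda kv: kv[1])[0]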
NB:  good penalty terms are discussed in Nowlan & Hinton, Buntine & Weigend,
     and MacKay, and probably in lots of other places ...
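Similarly, to make "penalty term" concrete, a minimal sketch of the
simplest case: plain weight decay added to a squared-error fit, so all
of the data can be used for training and none has to be held out.  The
linear model, learning rate and decay constant are illustrative
assumptions only, not anyone's published recipe.

    # Gradient descent on  mean((Xw - y)^2) + decay * sum(w^2).
    # The decay term plays the role a held-out validation set would
    # otherwise play in controlling complexity.
    import numpy as np

    def penalized_fit(X, y, decay=0.01, lr=0.1, steps=1000):
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            residual = X @ w - y
            grad = 2.0 * X.T @ residual / len(y) + 2.0 * decay * w
            w -= lr * grad
        return w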
Jude's comments:
>  found that devoting some training examples to validation is useful (ie,
>  training on N examples does worse than training on N-k and validating on k).
This only applies because they haven't included a reasonable penalty term.
Get with it, guys!
> I think there is also an important point about "proper" experimental
> methodology lurking in the discussion.  If one is using N examples for weight
> adjustment (or whatever kind of learning one is doing) and also use k examples 
> for selecting among possible final answers, one should report that their 
> testset accuracy resulted from N+k training examples.
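For concreteness, the bookkeeping being asked for looks roughly like
this (a sketch with placeholder learn/prune/evaluate functions, not
anyone's actual experimental code): the tuning examples come out of the
training data, the test set is touched exactly once, and the reported
training-set size is N+k.

    import random

    def run_experiment(data, learn, prune, evaluate, test_fraction=0.3, k=100):
        random.shuffle(data)
        n_test = int(test_fraction * len(data))
        test, rest = data[:n_test], data[n_test:]

        tuning, training = rest[:k], rest[k:]    # k for tuning, N for learning
        model = prune(learn(training), tuning)   # tuning guides pruning only

        accuracy = evaluate(model, test)         # test set used exactly once
        print("trained on %d examples (N=%d + k=%d)"
              % (len(rest), len(training), len(tuning)))
        return accuracy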
There's an interesting example of NOT doing this properly recently in
the Machine Learning journal.  See Mingers in Machine Learning 3(4),
1989, then see our experimental work in Buntine and Niblett in
Machine Learning 7, 1992.  Mingers produced an otherwise *excellent*
paper, but got peculiar results (to those experienced in the area)
because of mixing the "tuning set" with the "validation set".
Wray Buntine
NASA Ames Research Center                 phone:  (415) 604 3389
Mail Stop 269-2                           fax:    (415) 604 3594
Moffett Field, CA, 94035 		  email:  wray at kronos.arc.nasa.gov
	
    
    