No subject

dhw@santafe.edu
Mon Jul 26 17:51:04 EDT 1993


Tom Dietterich writes:


>>>
This analysis predicts that using a committee of very diverse
algorithms (i.e., having diverse approximation errors) would yield
better performance (as long as the committee members are competent)
than a committee made up of a single algorithm applied multiple times
under slightly varying conditions.
>>>

There is a good deal of heuristic and empirical evidence supporting
this claim. In general, when using stacking to combine generalizers,
one wants them to be as "orthogonal" as possible, as Tom maintains.

Indeed, one might even want to choose constituent generalizers which
behave poorly stand-alone, just so that they are sufficiently
different from one another when one combines them. (This is somewhat
similar to what's going on w/ error-correcting codes, if one considers
each (output code bit)-learning algorithm to be a different
generalizer, trying to correctly classify things.)
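
(To make the combining concrete, here is a minimal sketch of stacking
a few deliberately different generalizers, in modern Python and
assuming scikit-learn is available; the particular constituent
learners and the level-2 combiner are illustrative choices, not part
of the claim.)

    # Stacking "orthogonal" generalizers: the level-2 combiner is fit on the
    # constituents' held-out (cross-validated) guesses, so it sees how they
    # actually err, not how well they fit their own training data.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    def stack(X, y, level1, level2, cv=5):
        # One column of held-out guesses per constituent generalizer.
        Z = np.column_stack([cross_val_predict(g, X, y, cv=cv) for g in level1])
        level2.fit(Z, y)                         # learn how to combine the guesses
        fitted = [g.fit(X, y) for g in level1]   # refit the constituents on all the data
        def predict(X_new):
            Z_new = np.column_stack([g.predict(X_new) for g in fitted])
            return level2.predict(Z_new)
        return predict

    # Deliberately diverse constituents:
    level1 = [DecisionTreeClassifier(), KNeighborsClassifier(), GaussianNB()]
    # predict = stack(X, y, level1, LogisticRegression())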

In fact, consider the situation where one uses very different
generalizers, and their errors are highly correlated on a particular
data set, so that *as far as the data is concerned* those generalizers
are identical. For such situations, stacking (or any other kind of
combining) cannot help - all the guesses will be the same. However, in
such a situation you have good reason to suspect that you are
data-limited - there is simply nothing more to be milked from the
data. (An example of this phenomenon occurs with the intron-exon
prediction problem; for some data sets, ID3, backprop, and a simple
memory-based algorithm not only get the same error rate; they also
have highly correlated errors, making mistakes in the same situations.)
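
(A rough way to check for this in practice - a sketch only, assuming
one already has each generalizer's held-out guesses in hand:)

    # If very different generalizers make their mistakes on the same held-out
    # cases, combining them cannot buy much; the data is likely the bottleneck.
    import numpy as np

    def error_overlap(predictions, y_true):
        """predictions: dict mapping a generalizer's name to its held-out guesses."""
        errors = {n: (p != y_true).astype(int) for n, p in predictions.items()}
        names = list(errors)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                both = np.sum(errors[a] & errors[b])      # wrong together
                either = np.sum(errors[a] | errors[b])    # wrong at all
                print(a, b, both / max(either, 1))        # near 1.0 -> data-limited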


>>>
In the error-correcting code work, we compared a committee of decision
trees to an error-correcting output procedure that also used decision
trees.  The members of the committee were generated by training on
different subsamples of the data (as in stacking), but the combination
method was simple voting.  No matter how many trees we added to the
committee, we could not come close to achieving the same performance
on the nettalk task as with the error-correcting output coding
procedure.
>>>

Well, if I understand this correctly, Tom's using simple voting to
combine, w/o any regard to behavior on the validation set. This will
rarely be the best way of doing things. It amounts to training many
times on subsets of the training data and then voting, rather than
training once on the whole data; as such, this scheme might even
result in worse generalization generically (whether it improves or
hurts probably depends on the generalizer's learning curve for the
problem in question).

Moreover, if (as in this case) one is playing w/ algorithms which are
essentially identical (the members of the "committee"), then one might
as well go whole-hog, and use the formulation of stacking designed to
improve a single algorithm. (In this formulation, one uses partitions
of the training set to try to find correlations between the *signed*
error of the learning algorithm, and factors like its guess, the input
value, distance to training set, etc.). Alternatively, as Tom well
points out, you could modify the algorithms to make them substantially
different from one another.
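
(Here is a sketch of that single-algorithm formulation, for a
regression-style setting; the features - the guess, the input, the
distance to the training set - follow the description above, but the
particular base learner and correcting learner are left open and are
assumptions for illustration only.)

    # Learn the *signed* error of a single generalizer from held-out partitions,
    # then subtract the predicted error from its future guesses.
    import numpy as np
    from sklearn.model_selection import KFold

    def stacked_correction(X, y, base, corrector, n_splits=5):
        feats, signed_err = [], []
        for tr, te in KFold(n_splits=n_splits).split(X):
            base.fit(X[tr], y[tr])
            guess = base.predict(X[te])
            # Distance from each held-out point to its nearest training point.
            d = np.min(np.linalg.norm(X[te][:, None, :] - X[tr][None, :, :], axis=-1), axis=1)
            feats.append(np.column_stack([guess, X[te], d]))
            signed_err.append(guess - y[te])              # signed, not absolute, error
        corrector.fit(np.vstack(feats), np.concatenate(signed_err))
        base.fit(X, y)                                    # refit on all the data
        def predict(X_new, X_train=X):
            guess = base.predict(X_new)
            d = np.min(np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=-1), axis=1)
            return guess - corrector.predict(np.column_stack([guess, X_new, d]))
        return predict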

Of course, one can also have fun by combining error-correcting-code
algorithms based on different generalizers (say by stacking).

***

Tom also writes:

>>>
There are three basic
components that determine generalization error:

   * inherent error in the data (which determines the bayes optimal error rate)
   * small sample size
   * approximation error (which prevents the algorithm from correctly
     expressing the bayes optimal hypothesis, even with infinite
samples)
>>>

One has to be a bit careful here; there's a lot of other stuff going
on besides these three "components".

Since Tom seems to have a Bayesian context in mind, it's worth
analyzing things a bit from that perspective. In general,

Pr(generalization error E | data D)

		=

   a (non-Euclidean) inner product between 

Pr(hypothesis h | data D) (i.e., one's learning algorithm)

	and

Pr("truth" f | data D)) (i.e., the posterior distribution).

Tom's component 3, "approximation error", seems to refer to having a
poor alignment (so to speak) between Pr(h | D) and Pr(f | D); he seems
to have in mind a scenario in which the learning algorithm is not
optimally designed for the posterior.
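
(A toy discrete version of that inner product, just to fix ideas; the
three-point off-training-set space and the particular distributions
below are made up purely for illustration.)

    # Expected off-training-set error as an inner product:
    #   sum over h, f of  Pr(h | D) * Pr(f | D) * error(h, f).
    import itertools
    import numpy as np

    fs = list(itertools.product([0, 1], repeat=3))   # possible off-training-set truths f
    hs = fs                                          # possible learner guesses h

    def err(h, f):
        return np.mean([hi != fi for hi, fi in zip(h, f)])

    def expected_error(p_h_given_D, p_f_given_D):
        return sum(p_h_given_D[h] * p_f_given_D[f] * err(h, f)
                   for h in hs for f in fs)

    # The same learning algorithm (all mass on one h) against a posterior
    # aligned with it versus one peaked elsewhere:
    p_h = {h: 0.0 for h in hs}; p_h[(0, 0, 0)] = 1.0
    aligned    = {f: (0.9 if f == (0, 0, 0) else 0.1 / 7) for f in fs}
    misaligned = {f: (0.9 if f == (1, 1, 1) else 0.1 / 7) for f in fs}
    print(expected_error(p_h, aligned), expected_error(p_h, misaligned))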

The first two components seem to instead refer to "lack of sharpness"
(over f) of Pr(f | D). More precisely, they seem to refer to having
the likelihood, Pr(D | f), broad, when viewed as a function of f.

If this is indeed the context Tom has in mind, there is another factor
to consider as well: Since the posterior is proportional to the
likelihood times the prior, one also has to consider "lack of
sharpness" in Pr(f), and more generally how aligned the prior is with
the likelihood.

In other words, there are situations in which one has no
"approximation error", no error in the data, and a large sample, but
there is still a large chance of large generalization error. This
occurs for example if the prior f-peak is broad, but far removed from
the likelihood f-peak.

(Generically such situations are unlikely - that's the essential logic
behind confidence intervals, VC stuff, etc. - but there's no way to
assure that one isn't in such a situation in a given learning
problem. And if you're running a program in which you search for
something, like low error on the training set, then most bets - even
conventional VC bets, despite dogma to the contrary - are off. This is
because conventional VC results refer to the distribution

Pr(|gen. error E - training set error S| > bound | training set size m),

which is NOT the same as

Pr(|E - S| > bound | S, m);

to calculate Pr(|E - S| > bound | S, m) you need to know something
about Pr(f). Indeed, it's trivial to construct examples in which one
can find S = 0, but whenever that occurs, one knows, w/ 100%
certainty, that E is near maximal.)
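
(One such construction, spelled out as a toy computation; the
memorizing learner and the adversarial prior are chosen purely for
illustration.)

    # A prior under which S = 0 guarantees maximal off-training-set error E:
    # every allowed truth f agrees with the data on the training inputs but
    # flips the learner's guess on every other input.
    import numpy as np

    X_all = np.arange(8)                      # tiny input space
    train, off = X_all[:4], X_all[4:]
    y_train = np.array([0, 1, 1, 0])

    def learner(train_x, train_y):
        table = dict(zip(train_x, train_y))
        # Memorize the training data; guess 0 everywhere else.
        return lambda xs: np.array([table.get(x, 0) for x in xs])

    h = learner(train, y_train)
    f = {**dict(zip(train, y_train)),         # matches the data on the training set
         **dict(zip(off, 1 - h(off)))}        # flips the learner off the training set

    S = np.mean(h(train) != np.array([f[x] for x in train]))   # training error = 0
    E = np.mean(h(off)   != np.array([f[x] for x in off]))     # off-training error = 1
    print(S, E)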

Moreover, if one instead defines "generalization error" to be
off-training set error, then in fact you'll *always* have large error,
if Pr(f) is uniform over f. (This is proven in a number of different
guises in the papers referenced below. Intuitively, it holds because
if all f are equally likely, any one off-training set behavior of f is
as likely as any other, and the training set tells you nothing.) This
result is completely independent of what learning algorithm you use,
what VC analysis says, and the like, and well-exemplifies the
importance of the prior Pr(f).
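
(A brute-force check of this on a toy Boolean problem; the sizes and
the particular fixed guess are arbitrary.)

    # Average the off-training-set error of any fixed guess over all equally
    # likely truths f consistent with the training data: the answer is 1/2,
    # no matter what the guess is.
    import itertools
    import numpy as np

    n_off = 3
    learner_guess_off = (1, 1, 0)            # any fixed off-training-set guess
    errs = [np.mean([g != t for g, t in zip(learner_guess_off, f_off)])
            for f_off in itertools.product([0, 1], repeat=n_off)]
    print(np.mean(errs))                     # -> 0.5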





David Wolpert



References

[1] Wolpert, D. "On the connection between in-sample testing and
generalization error." Complex Systems, vol. 6, pp. 47-94 (1992).

[2] Wolpert, D. "On overfitting-avoidance as bias." Not yet published
(but placed in the neuroprose archive a couple of months ago). A
related article, "Overfitting avoidance as bias," was published by
C. Schaffer in the February issue of Machine Learning.

