"Orthogonality" of the generalizers being combined

David Wolpert dhw at santafe.edu
Sat Jul 1 20:02:19 EDT 1995


In his recent posting, Nathan Intrator writes

>>>
 combining, or in the simple case
averaging estimators is effective only if these estimators are made
somehow to be independent.
>>>

This is an extremely important point. Its importance extends beyond
the issue of generalization accuracy, however. For example, I once did
a set of experiments trying to stack together ID3, backprop, and a
nearest-neighbor scheme, in the vanilla way. The data set was splice
junction prediction. The stacking didn't improve things much at
all. Looking at things in detail, I found that the reason for this was
that not only did the three generalizers have identical cross-validation
error rates, but *their guesses were synchronized*: they tended to guess
the same thing as one another.
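A minimal sketch of this diagnostic in modern terms: gather out-of-fold
guesses from several very different learners and measure how often they
agree. Everything here is illustrative (scikit-learn obviously postdates
this post; the synthetic data stands in for the splice-junction set, and
a decision tree, a small MLP, and k-NN stand in for ID3, backprop, and
the nearest-neighbor scheme):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; swap in the real task.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

learners = {
    "tree": DecisionTreeClassifier(random_state=0),        # stand-in for ID3
    "mlp":  MLPClassifier(max_iter=2000, random_state=0),  # stand-in for backprop
    "knn":  KNeighborsClassifier(n_neighbors=5),           # nearest neighbor
}

# Cross-validated guesses for each learner.
guesses = {name: cross_val_predict(clf, X, y, cv=5)
           for name, clf in learners.items()}

# Individual cross-validation error rates.
for name, g in guesses.items():
    print(f"{name}: error rate = {np.mean(g != y):.3f}")

# Pairwise agreement between guesses. If these are all near 1.0, the
# learners are "synchronized" with respect to this data set, and
# stacking them is unlikely to buy much.
names = list(guesses)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        agree = np.mean(guesses[a] == guesses[b])
        print(f"agreement({a}, {b}) = {agree:.3f}")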

In other words, although those generalizers are about as different
from one another as can be, *as far as the data set in question was
concerned*, they were practically identical. This is a great flag that
one is in a data-limited scenario. That is, if very different
generalizers perform identically, that's a good sign that you're
screwed.

Which is a roundabout way of saying that the independence Nathan
refers to is always with respect to the data set at hand. This is
discussed in a bit of detail in the papers referenced below.

***

Getting back to the precise subject of Nathan's posting: Those
interested in a formal analysis touching on how the generalizers being
combined should differ from one another should read the Anders Krogh
paper (to appear in NIPS 7) that I mentioned in my previous
posting. A more intuitive discussion of this issue occurs in my
original paper on stacking, where there's a whole page of text
elaborating on the fact that "one wants the generalizers being
combined to (loosely speaking) 'span the space' of algorithms and be
'mutually orthogonal'" to as great a degree as possible.
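As a concrete illustration of combining dissimilar generalizers, here is
a minimal stacking sketch, again in modern scikit-learn terms (its
StackingClassifier postdates all of the papers mentioned here, and the
choice of level-1 combiner and the synthetic data are assumptions for
illustration, not the setup of the original experiments):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=1)),
        ("mlp",  MLPClassifier(max_iter=2000, random_state=1)),
        ("knn",  KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(),  # level-1 combiner
    cv=5,  # level-1 inputs are out-of-fold level-0 guesses
)
stack.fit(X_tr, y_tr)
print("stacked accuracy:", stack.score(X_te, y_te))

The cv argument is what implements the partition idea: each level-0
generalizer's contribution to the level-1 training set comes from data
it was not fit on.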

Indeed, that was one of the surprising aspects of Leo Breiman's
seminal paper - he got significant improvement even though the
generalizers he combined were quite similar to one another.

David Wolpert

***

Stacking can be used for things other than generalizing. The example
mentioned above is using it as a flag for when you're
data-limited. Another use is as an empirical-Bayes method of setting
hyperpriors. These and other non-generalization uses of stacking are
discussed in the following two papers:


Wolpert, D. H. (1992). "How to deal with multiple possible
generalizers". In Fast Learning and Invariant Object Recognition,
B. Soucek (Ed.), pp. 61-80. Wiley and Sons.

Wolpert, D. H. (1993). "Combining generalizers using partitions of the
learning set". In 1992 Lectures in Complex Systems, L. Nadel and
D. Stein (Eds.), pp. 489-500. Addison-Wesley.

