combining generalizers' guesses

David Wolpert dhw at santafe.edu
Fri Jul 23 11:39:22 EDT 1993


Mahesan Niranjan writes:

>>>
The committee of networks doing the energy prediction (committee members
chosen by ranking models by performance on cross-validation set, and the
average performance of these being better than the best member) is a
somewhat surprising result to me. Surprising because the average predictions
are taken without weighting by the model probabilities (which are difficult
to compute). In practice, even for linear models in Gaussian noise, I find
probabilities tend to differ by large numbers, for models that look
very similar. Hence if these are difficult to evaluate and are assumed
equal, I would have expected the average performance to be worse than the
best member.
>>>

In general, when using stacking to combine guesses of separate
generalizers (i.e., when combining guesses by examining validation
set behavior), one doesn't simply perform an unweighted average,
as MacKay did, but rather a weighted average.

For example, in Leo Breiman's "Stacked regressions" paper of last
year, he combined guesses by means of a weighted average. The
weights were set to minimize LMS error on the validation sets.
(Sets plural because J-fold partitioning of the training data
was used, as in cross-validation, rather than a single split
into a training set and a validation set.)
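
As a rough illustration, here is a minimal sketch of that scheme in
Python/NumPy. This is my own reconstruction, not Leo's code; the
fit/predict interface for the base models and the use of scipy's SLSQP
solver are assumptions on my part:

    # Sketch of stacked regression a la Breiman (1992): build "level 1"
    # data from J-fold out-of-fold predictions, then find non-negative
    # weights summing to 1 that minimize squared error.
    import numpy as np
    from scipy.optimize import minimize

    def out_of_fold_predictions(models, X, y, n_folds=10):
        # Each column of Z holds one base model's out-of-fold guesses.
        n = len(y)
        folds = np.array_split(np.random.permutation(n), n_folds)
        Z = np.zeros((n, len(models)))
        for fold in folds:
            train = np.setdiff1d(np.arange(n), fold)
            for j, model in enumerate(models):
                model.fit(X[train], y[train])        # fit on the other folds
                Z[fold, j] = model.predict(X[fold])  # guess on the held-out fold
        return Z

    def stacking_weights(Z, y):
        # Weighted-average coefficients: non-negative, summing to 1,
        # chosen to minimize LMS error of Z @ w against the targets y.
        k = Z.shape[1]
        objective = lambda w: np.sum((Z @ w - y) ** 2)
        result = minimize(objective, np.full(k, 1.0 / k), method="SLSQP",
                          bounds=[(0.0, None)] * k,
                          constraints=[{"type": "eq",
                                        "fun": lambda w: np.sum(w) - 1.0}])
        return result.x

Once the weights are found, each base model is refit on all of the
training data, and the combined guess on a new point is simply the
weighted average of the base models' guesses.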

In literally hundreds of regression experiments, Leo found that this
combination almost always beat picking the single model selected by
cross-validation, and never (substantially) lost to it.

In essence, in this scheme validation set behavior is being used to 
estimate the model "probabilities" Niranjan refers to.

Also, in MacKay's defense, just because he "got the probabilities
wrong" doesn't imply his average would be worse than just choosing
the single best model. Here are just a few of the other factors to consider:

1) What is the relationship between the mis-assignment of model
probabilities, a model's guess, and the optimal guess?

2) How do estimation errors (due to finite validation sets and
finite training sets) come into play?

Also, it should be noted that there are other ways to perform
stacking (either to combine generalizers or to improve a single
one) which do not rely on techniques interpretable in terms of
"model probabilities". For example, rather than combining generalizers
via the generalizer "find the hyperplane (w/ non-negative, summing-to-1
coefficients) w/ the minimal LMS error on the data", which is what
Leo did, one can instead use nearest neighbor algorithms, or even neural
nets.
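
For instance, here is a sketch (again my own illustration, with
scikit-learn's KNeighborsRegressor standing in for any nearest
neighbor scheme) of using a nearest-neighbor rule as the combining
generalizer, on the same level-1 data Z built in the sketch above:

    # The level-1 inputs are the base models' out-of-fold guesses (Z);
    # the level-1 targets are the true values (y). A nearest-neighbor
    # rule then learns how to map a vector of guesses to a final guess.
    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    def fit_knn_combiner(Z, y, n_neighbors=25):
        combiner = KNeighborsRegressor(n_neighbors=n_neighbors)
        combiner.fit(Z, y)
        return combiner

    def combined_guess(combiner, models, x_new):
        # Collect each base model's guess on the new point, then let the
        # nearest-neighbor combiner produce the final prediction.
        z_new = np.array([[m.predict(x_new.reshape(1, -1))[0] for m in models]])
        return combiner.predict(z_new)[0]

Note that a fairly large n_neighbors keeps the combiner itself from
bouncing around with the data, which matters for the point below.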

In general though, one should use a "second level"
generalizer which has low variance, i.e., one which doesn't bounce
around a lot w/ the data. Otherwise you can easily run into the kinds of
problems Niranjan worries about.



David Wolpert


References:

Breiman, L., "Stacked regressions", TR 367, Dept. of Statistics, Univ. of
California, Berkeley (1992).

Wolpert, D., "Stacked Generalization", Neural Networks, vol. 5,
241-259 (1992). (Aside from an early tech. report, the original public
presentation of the idea was at Snowbird '91.)

I also managed to convince Zhang, Mesirov and Waltz to try combining
with stacking rather than with non-validation-set-based methods (like
those Qian and Sejnowski used) for the problem of predicting protein
secondary structure. Their (encouraging) results appeared last year in JMB.

