Paper announcements

dhw@almaden.ibm.com
Thu Aug 22 18:33:47 EDT 1996


                        *** Paper Announcements ***

====================================================================

The following new paper is now available via anonymous ftp to
ftp.santafe.edu, in the directory pub/dhw_ftp, under the names BS.ps.Z
and BS.ps.Z.encoded.

Any comments are welcome.

*

COMBINING STACKING WITH BAGGING TO IMPROVE A LEARNING ALGORITHM


                        by

        David H. Wolpert and William G. Macready



Abstract: In bagging \cite{breiman:bagging} one uses bootstrap
replicates of the training set \cite{efron:computers,
efron.tibshirani:introduction} to improve a learning algorithm's
performance, often by tens of percent. This paper presents several
ways that stacking \cite{wolpert:stacked,breiman:stacked} can be used
in concert with the bootstrap procedure to achieve a further
improvement over the performance of bagging for some regression
problems. In particular, in some of the work presented here, one first
converts a single underlying learning algorithm into several learning
algorithms. This is done by bootstrap resampling the training set,
exactly as in bagging. The resultant algorithms are then combined via
stacking.  This procedure can be viewed as a variant of bagging, where
stacking rather than uniform averaging is used to achieve the
combining. The stacking improves performance over simple bagging by up
to a factor of 2 on the tested problems, and never results in worse
performance than simple bagging. In other work presented here, there
is no step of converting the underlying learning algorithm into
multiple algorithms, so it is the improve-a-single-algorithm variant
of stacking that is relevant. The precise version of this scheme
tested can be viewed as using the bootstrap and stacking to estimate
the input-dependence of the statistical bias and then correct for it.
The results are preliminary, but again indicate that combining
stacking with the bootstrap can be helpful.
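As a concrete illustration of the first variant described above (bootstrap
resampling the training set to obtain several copies of a single base
algorithm, then combining them by stacking rather than by uniform
averaging), here is a minimal sketch in Python. It is illustrative only and
is not the paper's exact procedure: it assumes scikit-learn-style
estimators, and it builds the level-1 (stacking) data from each replicate's
out-of-bag predictions, which is one plausible choice among several.

    # Minimal sketch: bag a base regressor by bootstrap resampling, then
    # combine the bootstrap models with a stacked linear combiner instead
    # of a uniform average.  Illustrative only; not the paper's procedure.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.linear_model import Ridge

    def bag_then_stack(X, y, n_replicates=20, seed=0):
        rng = np.random.default_rng(seed)
        m = len(X)
        models = []
        # Level-1 (stacking) data: each model's predictions on the training
        # points that its bootstrap replicate happened to leave out.
        Z = np.full((m, n_replicates), np.nan)
        for j in range(n_replicates):
            idx = rng.integers(0, m, size=m)        # one bootstrap replicate
            model = DecisionTreeRegressor().fit(X[idx], y[idx])
            models.append(model)
            oob = np.setdiff1d(np.arange(m), idx)   # points the replicate missed
            if oob.size:
                Z[oob, j] = model.predict(X[oob])
        # For simplicity, fill the in-bag entries with each model's own
        # in-sample predictions; purely held-out level-1 data would be cleaner.
        for j, model in enumerate(models):
            gaps = np.isnan(Z[:, j])
            if gaps.any():
                Z[gaps, j] = model.predict(X[gaps])
        combiner = Ridge(alpha=1.0).fit(Z, y)       # the stacking step
        def predict(X_new):
            Z_new = np.column_stack([mdl.predict(X_new) for mdl in models])
            return combiner.predict(Z_new)
        return predict

Replacing the Ridge combiner with a uniform average over the columns of
Z_new recovers ordinary bagging, which makes the two schemes easy to
compare on a given regression problem.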


====================================================================

The following paper has been previously announced. A new version,
incorporating major modifications of the original, is now available at
ftp.santafe.edu, in pub/dhw_ftp, as estimating.baggings.error.ps.Z or
estimating.baggings.error.ps.Z.encoded. The new version shows in
particular how the generalization error of a bagged version of a
learning algorithm can be estimated with more accuracy than that
afforded by using cross-validation on the original algorithm.

Any comments are welcome.

*

AN EFFICIENT METHOD TO ESTIMATE BAGGING'S GENERALIZATION ERROR



                        by

        David H. Wolpert and William G. Macready



Abstract: In bagging \cite{Breiman:Bagging} one uses bootstrap
replicates of the training set \cite{Efron:Stat,BootstrapIntro} to try
to improve a learning algorithm's performance. The computational
requirements for estimating the resultant generalization error on a
test set by means of cross-validation are often prohibitive; for
leave-one-out cross-validation one needs to train the underlying
algorithm on the order of $m\nu$ times, where $m$ is the size of the
training set and $\nu$ is the number of replicates.  This paper
presents several techniques for exploiting the bias-variance
decomposition \cite{Geman:Bias, Wolpert:Bias} to estimate the
generalization error of a bagged learning algorithm without invoking
yet more training of the underlying learning algorithm. The best of
our estimators exploits stacking \cite{Wolpert:Stack}. In a set of
experiments reported here, it was found to be more accurate than both
the alternative cross-validation-based estimator of the bagged
algorithm's error and the cross-validation-based estimator of the
underlying algorithm's error. This improvement was particularly
pronounced for small test sets. This suggests a novel justification
for using bagging: improved estimation of generalization error.
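To make the computational point concrete: with a training set of size
m = 1,000 and nu = 50 bootstrap replicates, leave-one-out cross-validation
of the bagged predictor would require on the order of m * nu = 50,000
trainings of the underlying algorithm, compared with the 50 trainings
already performed to build the bag. The sketch below shows one well-known
way of reusing those already-trained replicates, an out-of-bag style error
estimate. It is included only to illustrate the "no further training" idea
and is not necessarily the bias-variance or stacking-based estimator
developed in the paper; the names `models` and `bootstrap_indices` are
hypothetical.

    # Illustrative only: an out-of-bag style estimate of a bagged
    # regressor's squared error, reusing the models already trained for
    # bagging (no additional training).  A standard related idea, not
    # necessarily the estimator presented in the paper.
    import numpy as np

    def oob_squared_error(models, bootstrap_indices, X, y):
        # `models` and `bootstrap_indices` are the fitted replicates and the
        # index arrays used to build them (hypothetical names for whatever
        # the bagging code produced).
        in_bag = [set(map(int, idx)) for idx in bootstrap_indices]
        errors = []
        for i in range(len(X)):
            # Average only the models whose replicate omitted point i.
            preds = [mdl.predict(X[i:i+1])[0]
                     for mdl, bag in zip(models, in_bag)
                     if i not in bag]
            if preds:                  # point i was out-of-bag at least once
                errors.append((np.mean(preds) - y[i]) ** 2)
        return float(np.mean(errors))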


====================================================================

The following paper has been previously announced. A new version,
incorporating major modifications of the original, is now available at
ftp.santafe.edu, in pub/dhw_ftp, as bias.plus.ps.Z or
bias.plus.ps.Z.encoded. The new version contains in particular an
analysis of the Friedman effect, discussed in Jerry Friedman's
recently announced paper on 0-1 loss.

Any comments are welcome.

*

        ON BIAS PLUS VARIANCE


                by

        David H. Wolpert


Abstract: This paper presents several additive "corrections" to the
conventional quadratic loss bias-plus-variance formula. One of these
corrections is appropriate when the target is not fixed (as in
Bayesian analysis) and training sets are also averaged over (as in the
conventional bias-plus-variance formula). Another additive correction
casts conventional fixed-training-set Bayesian analysis directly in
terms of bias-plus-variance. Another correction is appropriate for
measuring full generalization error over a test set rather than (as
with conventional bias-plus-variance) error at a single point. Yet
another correction can help explain the recent counter-intuitive
bias-variance decomposition of Friedman for zero-one loss. After
presenting these corrections, this paper discusses some other
loss-function-specific aspects of supervised learning. In particular,
there is a discussion of the fact that if the loss function is a
metric (e.g., zero-one loss), then there is a bound on the change in
generalization error accompanying a change of the algorithm's guess
from h1 to h2 that depends only on h1 and h2, and not on the target. This
paper ends by presenting versions of the bias-plus-variance formula
appropriate for logarithmic and quadratic scoring, and then all the
additive corrections appropriate to those formulas. All the correction
terms presented in this paper are a covariance between the learning
algorithm and the posterior distribution over targets. Accordingly,
in the (very common) contexts in which those terms apply, there is not
a "bias-variance trade-off", or a "bias-variance dilemma", as one
often hears. Rather, there is a bias-variance-covariance trade-off.
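For reference, the conventional quadratic-loss formula that the corrections
modify, written for a fixed input x with y the target value and \hat{y} the
learned guess (training sets averaged over), is the familiar noise plus
bias-squared plus variance identity. When y and \hat{y} are not independent,
the exact identity picks up precisely a covariance term, which is the kind
of correction the abstract refers to; the metric-loss bound mentioned above
also follows from a one-line triangle-inequality argument. Both are sketched
below in LaTeX; the precise definitions (in particular, what is held fixed
and what is averaged over) are those of the paper, not this sketch.

    % Conventional quadratic-loss decomposition at a fixed input x, with
    % y the target value and \hat{y} the learned guess (training sets
    % averaged over); it holds exactly when y and \hat{y} are independent:
    \[
      E\!\left[(y-\hat{y})^2\right]
        = \underbrace{\mathrm{Var}(y)}_{\text{noise}}
        + \underbrace{\bigl(E[y]-E[\hat{y}]\bigr)^2}_{\text{bias}^2}
        + \underbrace{\mathrm{Var}(\hat{y})}_{\text{variance}} .
    \]
    % When y and \hat{y} are not independent (e.g., the target is not
    % fixed), the exact identity is
    \[
      E\!\left[(y-\hat{y})^2\right]
        = \mathrm{Var}(y) + \bigl(E[y]-E[\hat{y}]\bigr)^2 + \mathrm{Var}(\hat{y})
          - 2\,\mathrm{Cov}(y,\hat{y}),
    \]
    % which is the sense in which the corrections are covariances between
    % the learning algorithm and the distribution over targets.
    %
    % The metric-loss bound follows from the triangle inequality: for a
    % metric loss L and any target t,
    % |L(h_1(x),t) - L(h_2(x),t)| \le L(h_1(x),h_2(x)), and hence
    \[
      \Bigl| \, E_{x,t}\!\left[L(h_1(x),t)\right]
              - E_{x,t}\!\left[L(h_2(x),t)\right] \Bigr|
        \;\le\; E_{x}\!\left[L(h_1(x),h_2(x))\right],
    \]
    % a bound that depends only on h_1 and h_2, not on the target.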


