Multiple Models, Committee of nets etc...

hicks@cs.titech.ac.jp
Sun Aug 1 16:14:14 EDT 1993


Michael P. Perrone writes:

>Tom Dietterich writes:
>> This analysis predicts that using a committee of very diverse
>> algorithms (i.e., having diverse approximation errors) would yield
>> better performance (as long as the committee members are competent)
>> than a committee made up of a single algorithm applied multiple times
>> under slightly varying conditions.
>
>and David Wolpert writes:
>>There is a good deal of heuristic and empirical evidence supporting
>>this claim. In general, when using stacking to combine generalizers,
>>one wants them to be as "orthogonal" as possible, as Tom maintains.
>
>One minor result from my thesis shows that when the estimators are
>orthogonal in the sense that
>
>              E[n_i(x)n_j(x)] = 0 for all i<>j
>
>where n_i(x) = f(x) - f_i(x), f(x) is the target function, f_i(x) is
>the i-th estimator and the expected value is over the underlying 
>distribution; then the MSE of the average estimator goes like 1/N
>times the average of the MSE of the estimators where N is the number 
>of estimators in the population.  
>
>This is a shocking result because all we have to do to get arbitrarily 
>good performance is to increase the size of our estimator population!
>Of course in practice, the nets are correlated and the result is no
>longer true.
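
(For what it's worth, the quoted 1/N result is easy to check numerically.  The
little Python/numpy sketch below just simulates a population of hypothetical
estimators whose errors are independent zero-mean noise, so the orthogonality
condition holds by construction; the sizes and noise level are made up.)

import numpy as np

rng = np.random.default_rng(0)
N = 20          # number of estimators in the population
M = 100_000     # number of sample points x

f = rng.normal(size=M)                       # target values f(x)
noise = rng.normal(scale=0.5, size=(N, M))   # independent errors n_i(x)
f_i = f + noise                              # individual estimators f_i(x)

mse_individual = np.mean((f_i - f) ** 2, axis=1)   # MSE of each estimator
f_avg = f_i.mean(axis=0)                           # the average estimator
mse_average = np.mean((f_avg - f) ** 2)

print("average of individual MSEs: ", mse_individual.mean())
print("MSE of the average estimator:", mse_average)
print("prediction (avg MSE / N):    ", mse_individual.mean() / N)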


The matrix E[n_i(x)n_j(x)] may not be known, but an estimate E[n'_i(x)n'_j(x)]
can be obtained from a second set of training data, held out from the data used
to train the generalizers in the first place.  Here n'_i(x) = f'(x) - f_i(x),
where f'(x) is the target value given by the held-out data, f_i(x) is the i-th
estimator, E[n'_i(x)] = 0, and the expected value is taken over the held-out
data.  Take the eigenvectors (with non-zero eigenvalues) of E[n'_i(x)n'_j(x)]
and you have a set of generalizers (each a linear combination of the original
generalizers) whose errors are orthogonal, i.e. uncorrelated over the held-out
data: E[n'_i(x)n'_j(x)] = 0 for all i<>j.  They can even be normalized by the
square roots of their eigenvalues so that E[n'_i(x)n'_i(x)] = 1 for all i.
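
A minimal sketch of this de-correlation step, in Python/numpy (illustrative
only: preds is assumed to hold the N generalizers' outputs at M held-out
points, y the corresponding target values, and decorrelate is just a name I
made up):

import numpy as np

def decorrelate(preds, y):
    # preds: (N, M) array of generalizer outputs f_i(x) on held-out points
    # y:     (M,) array of held-out targets f'(x)
    errors = y[None, :] - preds                    # n'_i(x) = f'(x) - f_i(x)
    errors = errors - errors.mean(axis=1, keepdims=True)  # enforce E[n'_i(x)] = 0
    C = errors @ errors.T / errors.shape[1]        # estimate of E[n'_i(x)n'_j(x)]
    eigvals, eigvecs = np.linalg.eigh(C)
    keep = eigvals > 1e-12                         # drop zero eigenvalues
    # Scale each eigenvector by 1/sqrt(eigenvalue) so the combined error
    # components come out with unit variance as well as zero correlation.
    W = eigvecs[:, keep] / np.sqrt(eigvals[keep])
    return W

The columns of W are the combination weights: W.T @ errors has (approximately)
the identity matrix as its correlation over the held-out points, which is the
de-correlation and normalization described above, and W.T @ preds gives the
corresponding combined generalizers.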

To summarize, in practice the generalizers can be de-correlated (to the extent
that their errors are linearly independent) by forming new generalizers from
appropriate linear combinations of the originals.


I have an unrelated comment regarding Harris Drucker's earlier mail about using
synthetic data to improve performance.  Wouldn't it be true to say that, given
a choice between learning with N synthetically created examples and learning
with N novel training examples, the latter is, on average, going to give better
results?  If so, then using synthetic data is a way to stretch your training
data; something like potato helper.
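
(To make the "stretching" concrete, here is a toy Python sketch that pads a
training set with noise-jittered copies of the existing examples.  It is only
an illustration with made-up names and parameters, not Drucker's actual scheme
for generating synthetic data.)

import numpy as np

def stretch(X, y, copies=2, scale=0.05, seed=0):
    # Append 'copies' jittered versions of each input, reusing the labels.
    rng = np.random.default_rng(seed)
    X_synth = np.concatenate([X + rng.normal(scale=scale, size=X.shape)
                              for _ in range(copies)])
    y_synth = np.tile(y, copies)
    return np.concatenate([X, X_synth]), np.concatenate([y, y_synth])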


