Yet more on averaging

Thomas M. Breuel tmb at idiap.ch
Wed Aug 18 02:29:42 EDT 1993


dhw at santafe.edu writes:
|To sum it up: one can not prove averaging to be preferable to a scheme
|like using the alphabet to pick. Michael's result shows instead that
|averaging the guess is better (over multiple trials) than randomly
|picking amongst the guesses.
|
|Which simply means that one should not randomly pick amongst the
|guesses. It does *not* mean that one should average rather than use
|some other (arbitrarily silly) single-valued scheme.

I would like to strengthen this point a little.

In general, averaging is clearly not optimal, nor even justifiable on
theoretical grounds.  For example, let us take the classification case
and let us assume that each neural network $i$ returns an estimate
$p^i_j(x)$ of the probability that the object belongs to class $j$
given the measurement $x$.

Consider now the case in which we know that the predictions of those
networks are statistically independent (for example, because they are
run on independent parts of the input data).  Then, we should really
multiply the probabilities estimated by each network, rather than
computing a weighted sum.  That is, we should make a decision
according to the maximum of $\prod_i p^i_j(x)$, not according to the
maximum of $\sum_i w_i p^i_j(x)$ (assuming a 0-1 loss function).
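The contrast between the two decision rules can be sketched in a few
lines of Python (a minimal illustration; the function names and the
example probabilities are my own, not from any particular system):

```python
def combine_product(estimates):
    # estimates: list over networks i of per-class probabilities p^i_j.
    # For statistically independent predictors, multiply the per-class
    # probabilities and pick the argmax (0-1 loss).
    n_classes = len(estimates[0])
    scores = [1.0] * n_classes
    for p_i in estimates:
        for j in range(n_classes):
            scores[j] *= p_i[j]
    return max(range(n_classes), key=scores.__getitem__)

def combine_average(estimates, weights=None):
    # Weighted sum of the same estimates, for comparison.
    if weights is None:
        weights = [1.0 / len(estimates)] * len(estimates)
    n_classes = len(estimates[0])
    scores = [0.0] * n_classes
    for w, p_i in zip(weights, estimates):
        for j in range(n_classes):
            scores[j] += w * p_i[j]
    return max(range(n_classes), key=scores.__getitem__)

# One network is certain the object is NOT class 0; two others mildly
# favor class 0.  The product rule lets the confident veto win, while
# averaging is swayed by the two mild votes.
estimates = [[0.0, 1.0], [0.9, 0.1], [0.9, 0.1]]
print(combine_product(estimates))  # -> 1
print(combine_average(estimates))  # -> 0
```

The point of the example is that the two rules can disagree on the
very same estimates, so the choice between them is not a matter of
indifference.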

As another example, consider the case in which we have an odd number
of experts.  If they are trained and designed individually in a
particularly peculiar way, it might turn out that the optimal decision
rule is to output class 1 if an odd number of them pick class 1, and
pick class 0 otherwise.
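Such a parity rule is trivial to state but impossible to express as a
weighted sum of the experts' votes (it is the XOR problem in disguise).
A minimal sketch, with an illustrative function name of my own choosing:

```python
def parity_vote(votes):
    # votes: 0/1 class picks from an odd number of experts.
    # Output class 1 iff an odd number of experts picked class 1.
    # No linear combination of the votes with a fixed threshold can
    # reproduce this rule for more than one expert.
    return sum(votes) % 2

print(parity_vote([1, 0, 0]))  # -> 1  (one expert for class 1)
print(parity_vote([1, 1, 0]))  # -> 0  (two experts for class 1)
```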

Now, Michael probably limits the scope of his claims in his thesis to
exclude such cases (I only had a brief look, I must admit), but I
think it is important to make the point that, without some additional
assumptions, averaging is just a heuristic and not necessarily
optimal.

Still, linear combinations of the outputs of classifiers, regressors,
and networks seem to be useful in practice for improving
classification rates in many cases.  Lots of practical experience in
both statistics and neural networks points in that direction.

				Thomas.



