Yet more on averaging

dhw@santafe.edu dhw at santafe.edu
Tue Aug 17 21:26:08 EDT 1993



In several recent e-mail conversations, Michael Perrone and I have
gotten to where I think we agree with each other in substance, although
we disagree a bit on emphasis. To complete the picture for the 
connectionist community and present the other side to Michael's
recent posting:



In my back pocket, I have a number. I'll fine you according to the
squared difference between your guess for the number and its actual
value. Okay, should you guess 3 or 5? Obviously you can't answer. 7 or
5? Same response. 5 or a random sample of 3 or 7? Now, as Michael points
out, you *can* answer: 5.
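(A small sketch, not part of the original exchange, makes the arithmetic explicit: for any hidden number t, a coin flip between 3 and 7 incurs exactly the squared error of guessing 5 plus the variance of the flip, which is 4.)

```python
# Hypothetical illustration of the back-pocket game: for every hidden
# number t, the fixed guess 5 beats a uniform random pick between 3
# and 7 in expected squared error -- the random pick pays an extra
# penalty equal to its variance.
def squared_loss(guess, t):
    return (guess - t) ** 2

def expected_loss_random_pick(guesses, t):
    # Expected squared error when sampling uniformly among the guesses.
    return sum(squared_loss(g, t) for g in guesses) / len(guesses)

for t in range(-10, 11):
    fixed = squared_loss(5, t)                          # always guess 5
    random_pick = expected_loss_random_pick([3, 7], t)  # flip between 3 and 7
    # Gap is the variance of the pick: ((3-5)**2 + (7-5)**2) / 2 = 4
    assert random_pick == fixed + 4
```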

However, I'm not as convinced as Michael that this actually tells us
anything of practical use. How should you use this fact to help you
guess the number in my back pocket? Seems to me you can't.

The bottom line, as I see it: arguments like Michael's show that one
should always use a single-valued learning algorithm rather than a
stochastic one. (Subtle caveat: If used only once, there is no
difference between a stochastic learning algorithm and a single-valued
one; multiple trials are implicitly assumed here.)

But if one has before one a smorgasbord of single-valued learning
algorithms, one cannot infer that one should average over them. Even if I
choose amongst them in a really stupid way (say, according to the
alphabetical listing of their creators), *so long as I am consistent
and single-valued in how I make my choice*, I have no assurance that doing
this will give worse results than averaging them.
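(A toy check of this point, using made-up numbers rather than anything from the original post: with guesses 3 and 7, the average 5 is not uniformly better than an arbitrary but fixed choice of one guess; which wins depends on the hidden number.)

```python
# Sketch: averaging is not provably better than a consistent,
# arbitrarily chosen single-valued pick -- it depends on the target.
guesses = [3, 7]
average = sum(guesses) / len(guesses)  # 5.0
fixed_pick = guesses[1]                # a "silly" but consistent choice: 7

t = 7                                  # hidden number happens to be 7
assert (fixed_pick - t) ** 2 < (average - t) ** 2  # fixed pick wins: 0 < 4

t = 3                                  # ...but for t = 3 averaging wins
assert (average - t) ** 2 < (fixed_pick - t) ** 2  # 4 < 16
```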

To sum it up: one cannot prove averaging to be preferable to a scheme
like using the alphabet to pick. Michael's result shows instead that
averaging the guess is better (over multiple trials) than randomly
picking amongst the guesses.
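(The general fact behind this, sketched numerically below on random made-up data: for any set of guesses and any target, the squared error of the averaged guess equals the expected squared error of a uniform random pick minus the variance of the guesses, so averaging never loses to random picking.)

```python
# Numerical check of the bias-variance identity underlying the claim:
# E[(g - t)^2] = (mean(g) - t)^2 + Var(g)  for a uniform pick g,
# so the averaged guess is never worse than a random pick.
import random

random.seed(0)
for _ in range(1000):
    guesses = [random.uniform(-10, 10) for _ in range(5)]
    t = random.uniform(-10, 10)
    mean = sum(guesses) / len(guesses)
    avg_loss = (mean - t) ** 2                                # loss of averaging
    rand_loss = sum((g - t) ** 2 for g in guesses) / len(guesses)  # random pick
    var = sum((g - mean) ** 2 for g in guesses) / len(guesses)
    assert abs(rand_loss - (avg_loss + var)) < 1e-9
    assert avg_loss <= rand_loss + 1e-9
```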

Which simply means that one should not randomly pick amongst the
guesses. It does *not* mean that one should average rather than use
some other (arbitrarily silly) single-valued scheme.



David Wolpert



Disclaimer: All the above notwithstanding, I personally *would*
use some sort of averaging scheme in practice. The only issue of
contention here is what is *provably* the way one should generalize.
In addition to disseminating the important result concerning
the sub-optimality of stochastic schemes (of which there are many
in the neural nets community!), Michael is to be commended for
bringing this entire fascinating subject to the attention of the
community.
