Provable optimality of averaging generalizers

dhw@santafe.edu
Tue Aug 10 15:58:41 EDT 1993


Michael Perrone writes:


>>>
In the case of averaging for MSE optimization (the meat and potatoes of 
neural networks) and any other convex measure, the improvement due
to averaging is independent of the distribution - on-training or off-.
It depends only on the topology of the optimization measure. 

It is important to note that this result does NOT say the average is
better than any individual estimate - only better than the average
population performance.  For example, if one had a reliable selection
criterion for deciding which element of a population of estimators was
the best and that estimator was better than the average estimator,
then just choose the better one. (Going one step further, simply use
the selection criterion to choose the best estimator from all possible
weighted averages of the elements of the population.) As David Wolpert
pointed out, any estimator can be confounded by a pathological data
sample and therefore there doesn't exist a *guaranteed* method for
deciding which estimator is the best from a population in all cases.
Weak (as opposed to guaranteed) selection criteria exist in the
form of cross-validation (in all of its flavors).  Coupling
cross-validation with averaging is a good idea since one gets the best
of both worlds, particularly for problems with insufficient data.

I think that another very interesting direction for research (as David Wolpert
alluded to) is the investigation of more reliable selection criteria.
>>>


***

Well, I agree with the last two paragraphs, but not the first. At
least not exactly as written. Although Michael is making an
interesting and important point, I think it helps to draw attention to
some things:

I) First, I haven't yet gone through Michael's work in detail, but it
seems to me that the "measures" Michael is referring to really only
make sense as real-world cost functions (otherwise known as loss
functions, sometimes as risk functions, etc.). Indeed many very
powerful learning algorithms (e.g., memory based reasoning) are not
directly cast as finding the minimum on an energy surface, be it
"convex" or otherwise. For such algorithms, "measures" come in with
the cost function.

In fact, *by definition*, one is only interested in such real-world
cost - results concerning anything else do not concern the primary
object of interest.

With costs, an example of a convex surface is the quadratic cost
function, which says that given truth f, your penalty for guessing h
is given by the function (f - h)^2. For such a cost, Michael's result
holds essentially because by guessing the average you reduce variance
but keep the same bias (as compared to the average over all
guesses). In other words, it holds because for any f, h1, and h2,
[(h1 + h2)/2 - f]^2 <= [(h1 - f)^2 + (h2 - f)^2] / 2. (When f, h1,
and h2 refer to distributions rather than single values, as Michael
rightly points out, you have to worry about other issues before making
this statement, like whether the distributions are correlated with one
another.)
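
For the skeptical reader, here is a small numerical sketch of this
inequality in Python (my own illustration, not Michael's; the sampling
ranges are arbitrary):

    import random

    # Check that the quadratic cost of the average guess is bounded
    # above by the average of the individual quadratic costs:
    #   ((h1 + h2)/2 - f)^2 <= ((h1 - f)^2 + (h2 - f)^2) / 2
    for _ in range(100000):
        f  = random.uniform(-10, 10)   # "truth"
        h1 = random.uniform(-10, 10)   # first guess
        h2 = random.uniform(-10, 10)   # second guess
        cost_of_avg_guess = ((h1 + h2) / 2 - f) ** 2
        avg_of_costs = ((h1 - f) ** 2 + (h2 - f) ** 2) / 2
        assert cost_of_avg_guess <= avg_of_costs + 1e-12  # float tolerance
    print("inequality held in all trials")

(The gap between the two sides is exactly (h1 - h2)^2 / 4, which is
the variance-reduction interpretation given above.)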

***

It should be realized though that there are many non-convex cost
functions in the real world. For example, when doing classification,
one popular cost function is zero-one. This function says you get credit
for guessing exactly correctly, and if you miss, it doesn't matter
what you guessed; all misses "cost you" the same.
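
To see concretely how this breaks the averaging guarantee, consider a
made-up example in Python (the particular values f = 1, h1 = 1, h2 = 3
are arbitrary):

    # Zero-one cost: 0 if the guess equals the truth, 1 otherwise.
    def zero_one_cost(f, h):
        return 0 if h == f else 1

    f, h1, h2 = 1, 1, 3      # truth and two guesses
    h_avg = (h1 + h2) / 2    # the averaged guess, 2.0

    cost_of_average  = zero_one_cost(f, h_avg)                             # 1
    average_of_costs = (zero_one_cost(f, h1) + zero_one_cost(f, h2)) / 2   # 0.5

    # 1 > 0.5: the averaged guess is *worse* than the average of the
    # individual costs, which could not happen under a convex cost.
    print(cost_of_average, average_of_costs)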

This cost function is implicit in much of PAC, stat. mech. of
learning, etc. Moreover, in Bayesian decision theory, guessing the
weights which maximize the posterior probability P(weights | data)
(which in the Bayesian perspective of neural nets is exactly what is
done in backprop with weight decay) is the optimal strategy only for
this zero-one cost.

Now if we take this zero-one cost function and evaluate it only off
the training set, it is straightforward to prove that for a uniform
Pr(target function), the probability of a certain value of cost, given
the data, is independent of the learning algorithm. (The same result
holds for other cost functions as well, though as Michael points out,
you must be careful in trying to extend this result to convex cost
functions.)

This is true for any data set, i.e., it is not based on "pathological
data", as Michael puts it. It says that unless you can rule out a
uniform Pr(target function), you cannot prove any one algorithm to be
superior to any other (as far as this cost function is concerned).
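
This is easy to check by brute force on a toy problem. The following
sketch (my own hypothetical setup: six input points, Boolean targets)
enumerates every target function consistent with a fixed training set,
and shows that two deliberately opposite algorithms have identical
off-training-set cost distributions under a uniform Pr(target function):

    from itertools import product
    from collections import Counter

    X = range(6)            # toy input space: six points
    train_x = [0, 1, 2]     # training inputs
    train_y = [1, 1, 0]     # arbitrary fixed training labels
    test_x = [3, 4, 5]      # off-training-set inputs

    def majority(train_y, x):       # algorithm 1: guess majority training label
        return int(sum(train_y) * 2 >= len(train_y))

    def anti_majority(train_y, x):  # algorithm 2: guess the opposite
        return 1 - majority(train_y, x)

    def ots_cost(algorithm, f):     # average zero-one cost off the training set
        return sum(algorithm(train_y, x) != f[x] for x in test_x) / len(test_x)

    for algorithm in (majority, anti_majority):
        # Enumerate all Boolean target functions consistent with the data;
        # a uniform Pr(target function) makes these equally likely given it.
        costs = Counter()
        for labels in product([0, 1], repeat=len(X)):
            f = dict(zip(X, labels))
            if all(f[x] == y for x, y in zip(train_x, train_y)):
                costs[ots_cost(algorithm, f)] += 1
        print(algorithm.__name__, sorted(costs.items()))

Both algorithms print the same histogram of off-training-set costs.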


***

II) Okay. Now Michael correctly points out that even in those cases
with a convex cost "measure", you must interpret his result with
caution. I agree, and would say that this is somewhat like the famous
"two envelopes" paradox of probability theory. Consider the following:

1) Say I have three real numbers, A, B, and X. In general, it's always
true that with C = (A + B) / 2, (C - X)^2 <= [(A - X)^2 + (B - X)^2] /
2. (This is exactly analogous to having the cost of the average guess
bounded above by the average cost of the individual guesses.)

2) This means that if we had a choice of either randomly drawing one
of the numbers {A, B}, or drawing C, that on average drawing C would
give smaller quadratic cost with respect to X.

3) However, as Michael points out, this does *not* mean that if we had
just the numbers A and C, and could either draw A or C, that we should
draw C. In fact, point (1) tells us nothing whatsoever about whether A
or C is preferable (as far as quadratic cost with respect to X is
concerned).

4) In fact, now create a fifth number, D = (C + A) / 2. By the same
logic as in (1), we see that the cost (w.r.t. X) of D is less than the
average of the costs of C and A. So to the exact same degree that (1)
says we "should" guess C rather than A or B, it also says we should
guess D rather than A or C. (Note that this does *not* mean that D's
cost is necessarily less than C's, though; we don't get endlessly
diminishing costs. A numerical sketch of this chain follows the list.)

5) Step (4) can be repeated ad infinitum, getting a never-ending
sequence of "newly optimal" guesses. In particular, in the *exact*
sense in which C is "preferable" to A or B, and therefore should
"replace" them, D is preferable to A or B, and therefore should
replace *them* (and in particular replace C). So one is never left
with C as the object of choice.
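
For concreteness, here is the chain in points (4) and (5) worked out
in Python, with arbitrary made-up values (X = 0, A = 4, B = -2):

    def cost(h, X):          # quadratic cost of guess h w.r.t. X
        return (h - X) ** 2

    X, A, B = 0.0, 4.0, -2.0
    C = (A + B) / 2          # 1.0
    D = (C + A) / 2          # 2.5

    # C beats a random draw from {A, B} on average ...
    print(cost(C, X), (cost(A, X) + cost(B, X)) / 2)   # 1.0 vs 10.0
    # ... and D beats a random draw from {C, A} on average ...
    print(cost(D, X), (cost(C, X) + cost(A, X)) / 2)   # 6.25 vs 8.5
    # ... yet D's cost exceeds C's: the "newly optimal" guess got worse.
    print(cost(D, X) > cost(C, X))                     # True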

***

So (1) isn't really normative; it doesn't say one "should" guess the
average of a bunch of guesses:

6) Choosing D is better than randomly choosing amongst C or A, just as
   choosing C is better than randomly choosing amongst A or B.

7) This doesn't mean that given C, one should introduce an A and then
   guess the average of C and A (namely D) rather than C, just as it
   doesn't mean that given A, one should introduce a B and then guess
   the average of A and B (namely C) rather than A.

8) An analogy which casts some light on all this: view A and B not
as the outputs of separate single-valued learning algorithms, but rather
as the random outputs of a single learning algorithm. Under this analogy,
Michael's result, that one should always guess C rather than
randomly amongst A or B, suggests that one should always use a
deterministic, single-valued learning algorithm (i.e., just guess C)
rather than one that guesses randomly from a distribution over
possible guesses (i.e., one that guesses randomly amongst A or B).

This implication shouldn't surprise anyone familiar with Bayesian
decision theory. In fact, it's (relatively) straightforward to prove
that independent of priors or the like, for a convex cost function,
one should always use a single-valued learning algorithm rather than
one which guesses randomly. (This has probably been proven many
times. One proof can be found in Wolpert and Stolorz, On the
implementation of Bayes optimal generalizers, SFI tech. report
92-03-012.)
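
The essence of that proof is just Jensen's inequality. Here is a
minimal Monte Carlo sketch in Python (my own illustration; the
Gaussian guess distribution is an arbitrary choice):

    import random

    f = 3.0                          # the truth (arbitrary)
    # Guesses drawn by a randomized learning algorithm:
    guesses = [random.gauss(1.0, 2.0) for _ in range(100000)]

    # Expected cost of the randomized algorithm ...
    expected_cost_random = sum((h - f) ** 2 for h in guesses) / len(guesses)
    # ... versus the cost of deterministically guessing the mean guess.
    mean_guess = sum(guesses) / len(guesses)
    cost_deterministic = (mean_guess - f) ** 2

    # By Jensen's inequality for the convex quadratic cost, the
    # deterministic strategy can never do worse.
    print(cost_deterministic <= expected_cost_random)   # True
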
(Blatant self-promotion: Other interesting things proven in that
report and others in its series are: there are priors and noise
processes such that the expected cost, given the data set and that one
is using a Bayes-optimal learning algorithm, can *decrease* with added
noise; if the cost function is a proper metric, then the magnitude of
the change in expected cost if one guesses h rather than h' is bounded
above by the cost of h relative to h'; other results about using
"Bayes-optimal" generalizers predicated on an incorrect prior, etc., etc.)

***

The important point is that although Michael's result is both
intriguing and illuminating, it has no implications for what
one should do with (or in place of) a particular deterministic,
single-valued learning algorithm.  It was for such learning algorithms
that my original comments were intended.




David Wolpert


