Mixtures of experts and combining learning algorithms

David Wolpert dhw at santafe.edu
Tue Jun 27 17:09:22 EDT 1995


John Hampshire writes:


>>>>
Robbie Jacobs, Mike Jordan, and Steve Nowlan have a long series
of papers on the topic of combining estimators (maybe that's not what
they called it, but that's certainly what it is);  their works date
back to well before 92.  Likewise, I wrote a NIPS paper (90 or 91...
not worth reading) and an IEEE PAMI article (92... probably worth
reading) with Alex Waibel on this topic.  Since David's not aware of
these earlier and contemporary works, his second ''most thoroughly
researched'' claim would appear doubtful as well.
>>>>

I am aware of the ground-breaking work of Nowlan, Jacobs, Jordan,
Hinton, etc. on adaptive mixtures of experts (AME) (as well as other
related schemes John didn't mention). It is related to the subject
at hand, so I should have mentioned it in my posting. However, I have
trouble seeing exactly what John was driving at: although AME is
related, I don't think it directly addresses Gil's question.

Most (almost all?) of the work on AME I've encountered concerns a
restricted family of learning algorithms: parametric learning
algorithms that work by minimizing some cost function. (For example,
in Nowlan and Hinton (NIPS 3), one is explicitly combining neural
nets.) Loosely speaking, AME in essence "co-opts" how these algorithms
work, by combining their individual cost functions into a larger cost
function that is then minimized by varying everybody's parameters
together.
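
To make that concrete, here is a rough sketch in Python of the kind of
joint minimization I mean. This is toy code of my own: the linear
"experts", the softmax gate, the squared-error cost, and all the names
are mine, not anyone's published AME code (the actual AME papers use a
mixture likelihood rather than plain squared error).

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                               # toy inputs
    y = np.where(X[:, 0] > 0, X @ np.array([1.0, 0.0, 0.0]),    # two regimes, so that
                 X @ np.array([0.0, 1.0, 0.0]))                 # two experts make sense
    n_experts, d = 2, X.shape[1]

    def me_cost(params, X, y, n_experts):
        d = X.shape[1]
        W = params[:n_experts * d].reshape(n_experts, d)    # linear "experts"
        V = params[n_experts * d:].reshape(n_experts, d)    # gating parameters
        experts = X @ W.T                                   # each expert's prediction
        gate = np.exp(X @ V.T)
        gate = gate / gate.sum(axis=1, keepdims=True)       # softmax gating network
        pred = (gate * experts).sum(axis=1)
        return ((pred - y) ** 2).mean()                     # ONE cost over everybody's parameters

    # the experts and the gating net are fit together, in a single minimization
    fit = minimize(me_cost, rng.normal(size=2 * n_experts * d), args=(X, y, n_experts))

The experts only exist as parameter blocks inside that one cost, which
is exactly the sense in which the scheme "co-opts" them.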

However, I took Gil's question (perhaps incorrectly) to concern the
combination of *arbitrary* types of estimators, which in particular
includes estimators (like nearest neighbor) that need not be
parametric and therefore cannot readily be "co-opted". (Certainly the
work she listed, like Sharif's, concerns the combination of such
arbitrary estimators.) This simply is not the concern of most of the
work on AME.
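
To illustrate what I mean by an estimator that can't be co-opted, here
is a bare-bones nearest neighbor predictor (a toy sketch in plain
Python of my own devising, not any particular scheme Gil mentioned).
It exposes nothing but a prediction:

    def nn_predict(x, train_X, train_y, k=3):
        # rank the training points by squared distance to x and average the
        # targets of the k closest; there is no cost function being minimized,
        # and no parameters to fold into a larger joint cost
        by_dist = sorted(range(len(train_X)),
                         key=lambda i: sum((a - b) ** 2 for a, b in zip(x, train_X[i])))
        return sum(train_y[i] for i in by_dist[:k]) / k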

Now one could imagine varying AME so that the "experts" being combined
are not parameterized input-output functions but rather the outputs of
more general kinds of learning algorithms. For example, one could have
the "experts" be the end-products of assorted nearest neighbor
schemes, trained independently of one another. *After* that training
one would train the gating network to combine the individual
experts. (In contrast, in vanilla AME one trains the gating network
together with the individual experts in one go.) A rough sketch of
this post-hoc scheme appears after the quoted passage below. However:

1) It can be argued that it is stretching things to view this as AME,
especially if you adopt the perspective that AME is a kind of mixture
modelling.

2) More importantly, I already referred to this kind of scheme in my
original posting:

"Actually, there is some earlier work on combining estimators, in
which one does not partition the training set (as in stacking), but
rather uses the residuals (created by training the estimators on the
full training set) to combine those estimators. However this scheme
appears to perform worse than stacking. See for example the earlier of
the two articles by Zhang et al."
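
For concreteness, the kind of scheme referred to in that quote (and
the post-hoc gating variant sketched above in prose) looks roughly
like the following. This is my own illustrative Python, assuming each
expert object has fit/predict methods, and using a fixed least-squares
combination in place of a gating network:

    import numpy as np

    def combine_post_hoc(experts, X, y):
        # 1) train every estimator independently on the FULL training set
        for e in experts:
            e.fit(X, y)
        # 2) only then fit combination weights to their in-sample outputs
        #    (stacking differs: the combiner is fit to predictions made on
        #    held-out partitions of the training set)
        P = np.column_stack([e.predict(X) for e in experts])
        w, *_ = np.linalg.lstsq(P, y, rcond=None)
        return lambda X_new: np.column_stack([e.predict(X_new) for e in experts]) @ w

Because the combiner only ever sees in-sample behavior of the experts,
it is easy to believe it overfits relative to stacking, which is
consistent with the comparison cited above.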


***

Summarizing, we have one of two possibilities. Either

i) John is referring to a possible variant of AME that I did mention
(albeit without explicitly using the phrase "AME"), or

ii) John is referring to the more common variant of AME, which cannot
combine arbitrary kinds of estimators and therefore is not a candidate
for what (I presumed) Gil had in mind.


Obviously I am not as much of an expert on AME as John, so there
might very well be a section or two (or even a whole paper or two!)
that falls outside of those two categorizations of AME. But I think
it's fair to say that most of the work on AME is not concerned with
combining arbitrary estimators in ways other than those referred to in
my posting.

Nonetheless, I certainly would recommend that Gil (and others)
acquaint themselves with the seminal work on AME. I was definitely
remiss in not including AME in my (quickly knocked together) list of
work related to Gil's question.





David Wolpert




