Combining neural estimators, NetGene
Michael Jordan
jordan at psyche.mit.edu
Thu Jun 29 17:15:25 EDT 1995
Just a small clarifying point regarding combining estimators.
Just because two algorithms (e.g., stacking and mixtures of
experts) end up forming linear combinations of models doesn't
necessarily mean that they have much to do with each other.
It's not the architecture that counts, it's the underlying
statistical assumptions that matter---the statistical assumptions
determine how the parameters get set. Indeed, a mixture of
experts model is making the assumption that, probabilistically,
a single underlying expert is responsible for each data point.
This is very different from stacking, where there is no such
mutual exclusivity assumption. Moreover, the linear combination
rule of mixtures of experts arises only if you consider the
conditional mean of the mixture distribution; i.e., E(y|x).
When the conditional distribution of y|x has multiple modes,
which isn't unusual, a mixture model is particularly appropriate
and the linear combination rule *isn't* the right way to
summarize the distribution.
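To make that concrete, here is a minimal sketch (in Python, with a
made-up two-expert model; the gate and the experts are hypothetical,
not any particular fitted system) of why the conditional mean can be
a poor summary when y|x is bimodal:

    # Two-expert mixture in which y|x is bimodal, so the conditional
    # mean E(y|x) -- the linear combination rule -- falls between the
    # modes and describes neither of them.
    import numpy as np

    def gate(x):
        # Hypothetical gating probabilities: each expert equally likely.
        return np.array([0.5, 0.5])

    def expert_means(x):
        # Expert 1 predicts y = +x, expert 2 predicts y = -x
        # (e.g., the two branches of a multi-valued inverse mapping).
        return np.array([x, -x])

    x = 2.0
    g = gate(x)
    mu = expert_means(x)

    # Mixture conditional mean: E(y|x) = sum_i g_i(x) * mu_i(x)
    print("modes of y|x:", mu)       # [ 2. -2.]
    print("E(y|x):", np.dot(g, mu))  # 0.0 -- near neither mode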
In my view, mixtures of experts are best thought of as just
another kind of statistical model, on the same level as,
say, loglinear models or hidden Markov models. Indeed, they
are basically a statistical form of a decision tree model.
Note that decision trees embody the mutual exclusivity
assumption (by definition of "decision")---this makes
it very natural to formalize decision trees as mixture
models. (Cf. decision *graphs*, which don't make a mutual
exclusivity assumption and *aren't* handled well within the
mixture model framework.)
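As a rough sketch of that correspondence (Python; the logistic gate
and the two constant "leaf" experts are made up for illustration, not
a specific published architecture), a one-level mixture of experts is
a soft decision node, and hardening the gate recovers an ordinary
tree split:

    import numpy as np

    def soft_gate(x, threshold=0.0, sharpness=5.0):
        # Probability of routing x to expert 1 rather than expert 0.
        p1 = 1.0 / (1.0 + np.exp(-sharpness * (x - threshold)))
        return np.array([1.0 - p1, p1])

    def experts(x):
        # Two hypothetical leaf predictors (constants, for simplicity).
        return np.array([-1.0, +1.0])

    def predict(x, hard=False):
        g = soft_gate(x)
        if hard:
            # Hard split: exactly one expert is responsible for x,
            # i.e., the mutual exclusivity of a decision tree.
            return experts(x)[np.argmax(g)]
        # Soft split: the mixture's conditional mean.
        return np.dot(g, experts(x))

    print(predict(0.3))             # soft blend of the two leaves
    print(predict(0.3, hard=True))  # tree behaviour: commits to one leaf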
I would tend to place stacking at a higher level in the
inference process, as a general methodology
for--in some sense--approximating an average over a posterior
distribution on a complex model space. "Higher level" just
means that it's harder to relate stacking to a specific
generative probability model. It's the level of inference
at which everybody agrees that no one model is very likely
to be correct for any instantiation of any x---for two
reasons: because our current library of possible statistical
models is fairly impoverished, and because we have an even
more impoverished theory of how all of these models relate
to each other (i.e., how they might be parameterized instances
of some kind of super-model). This means that--in our current
state of ignorance--mutual exclusivity (and exhaustivity---the
second assumption underlying mixture models) make no sense (at
the higher levels of inference), and some kind of smart averaging
has got to be built in whether we understand it fully or not.
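As a minimal sketch of stacking in that spirit (Python; the three
"models" and their held-out predictions are made up), the combination
weights are fit to held-out predictions, with no assumption that
exactly one model is responsible for any point:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(size=200)

    # Hypothetical held-out (cross-validated) predictions from three
    # fitted models: two noisy but useful, one uninformative.
    P = np.column_stack([
        y + rng.normal(scale=0.5, size=200),
        y + rng.normal(scale=1.0, size=200),
        rng.normal(size=200),
    ])

    # Least-squares stacking weights (in practice the weights are often
    # constrained to be nonnegative and/or to sum to one).
    w, *_ = np.linalg.lstsq(P, y, rcond=None)
    print("stacking weights:", np.round(w, 2))  # uninformative model gets ~0

    # Note: no mutual exclusivity or exhaustivity is assumed here; the
    # weights just average the models as well as the data allow.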
Mike