Combining neural estimators, NetGene

Michael Jordan jordan at psyche.mit.edu
Thu Jun 29 17:15:25 EDT 1995


Just a small clarifying point regarding combining estimators.
Just because two algorithms (e.g., stacking and mixtures of 
experts) end up forming linear combinations of models doesn't 
necessarily mean that they have much to do with each other.  
It's not the architecture that counts but the underlying 
statistical assumptions, since those assumptions determine 
how the parameters get set.  Indeed, a mixture of 
experts model is making the assumption that, probabilistically, 
a single underlying expert is responsible for each data point.
This is very different from stacking, where there is no such 
mutual exclusivity assumption.  Moreover, the linear combination 
rule of mixtures of experts arises only if you consider the 
conditional mean of the mixture distribution; i.e., E(y|x).
When the conditional distribution of y|x has multiple modes,
which isn't unusual, a mixture model is particularly appropriate
and the linear combination rule *isn't* the right way to 
summarize the distribution.
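
To make the E(y|x) point concrete, here is a small numerical
sketch (in Python; every gate and expert parameter below is
invented purely for illustration and is not taken from any
particular model):

# A two-expert mixture: E(y|x) is the gate-weighted linear
# combination of the expert means, yet it can summarize p(y|x)
# badly when the conditional distribution is bimodal.
import numpy as np

def gate(x):
    # hypothetical logistic gate: P(expert 1 "owns" the point x)
    return 1.0 / (1.0 + np.exp(-4.0 * x))

def expert_means(x):
    # two hypothetical linear experts with very different predictions
    return 2.0 * x + 3.0, -2.0 * x - 3.0

def conditional_density(y, x, sigma=0.5):
    # p(y|x) = g(x) N(y; mu1(x), s^2) + (1 - g(x)) N(y; mu2(x), s^2)
    g = gate(x)
    mu1, mu2 = expert_means(x)
    normal = lambda m: (np.exp(-(y - m) ** 2 / (2 * sigma ** 2))
                        / np.sqrt(2 * np.pi) / sigma)
    return g * normal(mu1) + (1.0 - g) * normal(mu2)

def conditional_mean(x):
    # E(y|x): the linear combination rule
    g = gate(x)
    mu1, mu2 = expert_means(x)
    return g * mu1 + (1.0 - g) * mu2

x0 = 0.0                              # gate is 0.5: both experts plausible
ys = np.linspace(-6.0, 6.0, 201)
p = conditional_density(ys, x0)
print("E(y|x=0) =", conditional_mean(x0))       # 0.0, between the two modes
print("modes near y =", ys[np.argsort(p)[-2:]]) # roughly -3 and +3

The density p(y|x=0) puts almost no mass at y = 0, so the
conditional mean, though it is exactly the linear combination
rule above, is a poor summary of what the mixture actually
believes about y.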

In my view, mixtures of experts are best thought of as just 
another kind of statistical model, on the same level as,
say, loglinear models or hidden Markov models.  Indeed, they 
basically are a statistical form of a decision tree model.
Note that decision trees embody the mutual exclusivity 
assumption (by definition of "decision")---this makes 
it very natural to formalize decision trees as mixture 
models.  (Cf. decision *graphs*, which don't make a mutual 
exclusivity assumption and *aren't* handled well within the 
mixture model framework.)  I would tend to place stacking at 
a higher level in the inference process, as a general methodology 
for--in some sense--approximating an average over a posterior 
distribution on a complex model space.  "Higher level" just 
means that it's harder to relate stacking to a specific 
generative probability model.  It's the level of inference 
at which everybody agrees that no one model is very likely 
to be correct for any instantiation of any x---for two
reasons:  because our current library of possible statistical 
models is fairly impoverished, and because we have an even 
more impoverished theory of how all of these models relate 
to each other (i.e., how they might be parameterized instances 
of some kind of super-model).  This means that--in our current 
state of ignorance--mutual exclusivity (and exhaustivity---the 
second assumption underlying mixture models) makes no sense (at 
the higher levels of inference), and some kind of smart averaging 
has got to be built in whether we understand it fully or not.
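
(As a point of contrast, here is an equally minimal sketch of
stacking in the stacked-regression sense, with the combination
weights fit to held-out predictions rather than derived from any
mutual exclusivity assumption.  The data, the two-model "library",
and the fold count are all invented for illustration.)

# Stacking as "smart averaging": nonnegative weights are fit to
# cross-validated (out-of-fold) predictions, so no single model is
# assumed to be the correct one.
import numpy as np
from scipy.optimize import nnls     # nonnegative least squares

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=200)
y = np.sin(x) + 0.3 * rng.normal(size=x.shape)   # neither model below is "true"

def fit_predict(deg, x_tr, y_tr, x_te):
    # a crude model "library": polynomial fits of a given degree
    return np.polyval(np.polyfit(x_tr, y_tr, deg), x_te)

degrees = [1, 3]
folds = np.array_split(rng.permutation(len(x)), 5)

# Level-1 data: each model's out-of-fold prediction at every point.
Z = np.zeros((len(x), len(degrees)))
for te in folds:
    tr = np.setdiff1d(np.arange(len(x)), te)
    for j, d in enumerate(degrees):
        Z[te, j] = fit_predict(d, x[tr], y[tr], x[te])

# Stacking weights: a constrained linear combination of the models.
w, _ = nnls(Z, y)
print("stacking weights:", w)

# Final predictor: refit each model on all the data, combine with w.
full = np.column_stack([fit_predict(d, x, y, x) for d in degrees])
print("stacked RMSE:", np.sqrt(np.mean((full @ w - y) ** 2)))

Nothing in this procedure asserts that one of the two polynomials
generated the data; the held-out fitting is exactly the kind of
averaging-without-commitment referred to above.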

Mike


