Encoding missing values
Michael Jordan
jordan at psyche.mit.edu
Mon Feb 7 20:47:09 EST 1994
> There is at least one kind of network that has no problem (in
> principle) with missing inputs, namely a Boltzmann machine.
> You just refrain from clamping the input node whose value is
> missing, and treat it like an output node or hidden unit.
>
> This may seem to be irrelevant to anything other than Boltzmann
> machines, but I think it could be argued that nothing very much
> simpler is capable of dealing with the problem.
The above is a nice observation that is worth emphasizing;
I agree with all of it except the comment about being irrelevant
to anything else. The Boltzmann machine is actually relevant
to everything else. What the Boltzmann algorithm is doing
with the missing value is essentially the same as what the
EM algorithm for mixtures (that Ghahramani and Tresp referred
to) is doing, and epitomizes the general case of an iterative
"filling in" algorithm. The Boltzmann machine learning algorithm
is a generalized EM (GEM) algorithm. During the E step the
system computes the conditional correlation function for the
nodes under the Boltzmann distribution, where the conditioning
variables are the known data (the values of the clamped units)
and the current values of the parameters (weights). This
"fills in" the relevant statistic (the correlation function)
and allows it to be used in the generalized M step (the
contrastive Hebb rule).
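
To make the two-phase bookkeeping concrete, here is a minimal sketch
in Python/numpy; it is purely illustrative, with made-up names, binary
0/1 units, a fixed temperature, and no annealing schedule. The point
to notice is that a missing input component is simply left unclamped
in the positive phase, so it is treated exactly like a hidden unit and
its conditional statistics are what get "filled in".

import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(s, W, clamped):
    # One Gibbs sweep over the unclamped units of a stochastic binary
    # (0/1) Boltzmann machine with symmetric, zero-diagonal weights W.
    for i in np.flatnonzero(~clamped):
        p_on = 1.0 / (1.0 + np.exp(-W[i] @ s))   # P(s_i = 1 | all other units)
        s[i] = float(rng.random() < p_on)
    return s

def mean_correlations(W, clamped, s, sweeps=100):
    # Estimate the correlations <s_i s_j> under the Boltzmann distribution,
    # conditioned on the clamped units: the E-step "filling in" of the
    # relevant statistic. (No burn-in, to keep the sketch short.)
    C = np.zeros_like(W)
    for _ in range(sweeps):
        s = gibbs_sweep(s, W, clamped)
        C += np.outer(s, s)
    return C / sweeps

def boltzmann_update(W, x, observed, n_hidden, lr=0.05):
    # One positive/negative phase pass and contrastive-Hebb update for a
    # single input pattern x whose missing components are flagged by
    # observed == False.
    n_vis = x.size
    n = n_vis + n_hidden

    # Positive phase: clamp only the observed inputs; the missing input
    # components and the hidden units are left free and get sampled.
    clamped_pos = np.concatenate([observed, np.zeros(n_hidden, bool)])
    s0 = rng.integers(0, 2, n).astype(float)
    s0[:n_vis][observed] = x[observed]
    C_pos = mean_correlations(W, clamped_pos, s0)

    # Negative (free) phase: nothing clamped.
    C_neg = mean_correlations(W, np.zeros(n, bool),
                              rng.integers(0, 2, n).astype(float))

    # Generalized M step: the contrastive Hebb rule.
    W += lr * (C_pos - C_neg)
    np.fill_diagonal(W, 0.0)
    return W
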
Moreover, despite the fancy terminology, these algorithms are
nothing more (nor less) than maximum likelihood estimation,
where the likelihood function is the likelihood of the parameters
*given the data that was actually observed*. By "filling in"
missing data, you're not adding new information to the problem;
rather, you're allowing yourself to use all the information
that is in those components of the data vector that aren't
missing. (EM theory provides the justification for that
statement.) For example, if only one component of an input vector
is missing, it's obviously wasteful to neglect what the
other components are telling you. And,
indeed, if you neglect the whole vector, you will not end up
with maximum likelihood estimates for the weights (nor in
general will you get maximum likelihood estimates if you fill
in a value with the unconditional mean of that variable).
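
To make that last parenthetical concrete with the simplest purely
illustrative example: for a bivariate Gaussian with means mu_1, mu_2
and covariances sigma_11, sigma_12, sigma_22, the correct fill-in for
a missing x_1 given an observed x_2 is the conditional mean

    E[x_1 | x_2] = mu_1 + (sigma_12 / sigma_22) (x_2 - mu_2),

and, when second moments are the statistics being accumulated, the
filled-in square also needs the residual variance
sigma_11 - sigma_12^2 / sigma_22. Filling in the unconditional mean
mu_1 drops both the regression term and the residual variance, which
is why it attenuates the estimated variances and covariances and does
not give the ML answer.
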
"Filling in" is not the only way to compute ML estimates for
missing data problems, but its virtue is that it allows the
use of the same learning algorithms as would be used for complete
data (without incurring any bias, if the filling in is done correctly).
The only downside is that even if the complete-data algorithm
is one-pass (which the Boltzmann algorithm and mixture fitting
are not), the "filling-in" approach is generally iterative,
because the parameter estimates depend on the filled-in values
which in turn depend on the parameter estimates.
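
Here is a minimal sketch of that iteration for the simplest case,
a single multivariate Gaussian fit by EM when some components of each
data vector are missing (Python/numpy; the function name and interface
are just for illustration, but the mixture and Boltzmann cases have
the same alternating structure):

import numpy as np

def em_gaussian_missing(X, n_iter=50):
    # ML fit of a Gaussian (mu, Sigma) when some entries of the data
    # matrix X are NaN. The E step fills in conditional expectations
    # given the current parameters; the M step is the ordinary
    # complete-data update; they are alternated because each depends
    # on the other.
    X = np.asarray(X, float)
    n, d = X.shape
    obs = ~np.isnan(X)

    mu = np.nanmean(X, axis=0)          # crude starting values
    Sigma = np.eye(d)

    for _ in range(n_iter):
        sum_x = np.zeros(d)
        sum_xx = np.zeros((d, d))
        for i in range(n):
            o, m = obs[i], ~obs[i]
            x_hat = X[i].copy()
            C = np.zeros((d, d))        # conditional covariance of the missing part
            if m.any():
                S_oo = Sigma[np.ix_(o, o)]
                S_mo = Sigma[np.ix_(m, o)]
                K = S_mo @ np.linalg.inv(S_oo)
                # E step: conditional mean of the missing components,
                # given the observed components and current parameters.
                x_hat[m] = mu[m] + K @ (X[i, o] - mu[o])
                C[np.ix_(m, m)] = Sigma[np.ix_(m, m)] - K @ S_mo.T
            sum_x += x_hat
            sum_xx += np.outer(x_hat, x_hat) + C
        # M step: the usual complete-data formulas, applied to the
        # filled-in sufficient statistics.
        mu = sum_x / n
        Sigma = sum_xx / n - np.outer(mu, mu)
    return mu, Sigma

Note the conditional-covariance term added to the second moments:
filling in point values and then treating the data as if complete
would still understate the variances.
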
On the other hand, there are so-called "monotone" patterns
of missing data for which the filling-in approach is not
necessarily iterative. This monotone case might be of
interest, because it is relevant for problems involving
feedforward networks in which the input vectors are
complete but some of the outputs are missing. (Note that even
if all the output values for a case are missing, an ML
algorithm will not throw the case out; there is statistical
structure in the input vector that the algorithm must not
neglect.)
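
For the feedforward/regression version of that monotone case, here is
a small sketch of why no iteration is needed, under the purely
illustrative assumption of a conditional model with Gaussian errors
that are independent across output components (the function name is
made up). The likelihood of the outputs that were actually observed
simply factors over those components, so the usual one-pass or
gradient-based fit goes through with the missing output terms masked
out of the error. The sketch only covers the case where some
components of some targets are missing; it does not address exploiting
the input structure when a whole target is missing, which is the point
of the parenthesis above.

import numpy as np

def masked_sse(Y_pred, Y):
    # Half sum-of-squares over the observed output components only;
    # missing targets are marked with NaN. Up to constants this is
    # minus the log likelihood of the observed outputs under the
    # Gaussian-error assumption, so it can be minimized directly,
    # with no iterative filling in of the missing targets.
    observed = ~np.isnan(Y)
    resid = np.where(observed, Y_pred - Y, 0.0)
    return 0.5 * np.sum(resid ** 2)
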
Mike
(See Ghahramani's message for references; particularly the
Little and Rubin book).