Encoding missing values
Michael Jordan
jordan at psyche.mit.edu
Mon Feb 7 20:47:09 EST 1994
> There is at least one kind of network that has no problem (in
> principle) with missing inputs, namely a Boltzmann machine.
> You just refrain from clamping the input node whose value is
> missing, and treat it like an output node or hidden unit.
>
> This may seem to be irrelevant to anything other than Boltzmann
> machines, but I think it could be argued that nothing very much
> simpler is capable of dealing with the problem.
The above is a nice observation that is worth emphasizing;
I agree with all of it except the comment about being irrelevant
to anything else. The Boltzmann machine is actually relevant
to everything else. What the Boltzmann algorithm is doing
with the missing value is essentially the same as what the
EM algorithm for mixtures (that Ghahramani and Tresp referred
to) is doing, and epitomizes the general case of an iterative
"filling in" algorithm. The Boltzmann machine learning algorithm
is a generalized EM (GEM) algorithm. During the E step the
system computes the conditional correlation function for the
nodes under the Boltzmann distribution, where the conditioning
variables are the known data (the values of the clamped units)
and the current values of the parameters (weights). This
"fills in" the relevant statistic (the correlation function)
and allows it to be used in the generalized M step (the
contrastive Hebb rule).
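
To make the two-phase bookkeeping concrete, here is a minimal sketch
in Python/numpy; it is purely illustrative, with made-up names, binary
0/1 units, a fixed temperature, and no annealing schedule. The point
to notice is that a missing input component is simply left unclamped
in the positive phase, so it is treated exactly like a hidden unit and
its conditional statistics are what get "filled in".

import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(s, W, clamped):
    # One Gibbs sweep over the unclamped units of a stochastic binary
    # (0/1) Boltzmann machine with symmetric, zero-diagonal weights W.
    for i in np.flatnonzero(~clamped):
        p_on = 1.0 / (1.0 + np.exp(-W[i] @ s))   # P(s_i = 1 | all other units)
        s[i] = float(rng.random() < p_on)
    return s

def mean_correlations(W, clamped, s, sweeps=100):
    # Estimate the correlations <s_i s_j> under the Boltzmann distribution,
    # conditioned on the clamped units: the E-step "filling in" of the
    # relevant statistic. (No burn-in, to keep the sketch short.)
    C = np.zeros_like(W)
    for _ in range(sweeps):
        s = gibbs_sweep(s, W, clamped)
        C += np.outer(s, s)
    return C / sweeps

def boltzmann_update(W, x, observed, n_hidden, lr=0.05):
    # One positive/negative phase pass and contrastive-Hebb update for a
    # single input pattern x whose missing components are flagged by
    # observed == False.
    n_vis = x.size
    n = n_vis + n_hidden

    # Positive phase: clamp only the observed inputs; the missing input
    # components and the hidden units are left free and get sampled.
    clamped_pos = np.concatenate([observed, np.zeros(n_hidden, bool)])
    s0 = rng.integers(0, 2, n).astype(float)
    s0[:n_vis][observed] = x[observed]
    C_pos = mean_correlations(W, clamped_pos, s0)

    # Negative (free) phase: nothing clamped.
    C_neg = mean_correlations(W, np.zeros(n, bool),
                              rng.integers(0, 2, n).astype(float))

    # Generalized M step: the contrastive Hebb rule.
    W += lr * (C_pos - C_neg)
    np.fill_diagonal(W, 0.0)
    return W
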
Moreover, despite the fancy terminology, these algorithms are
nothing more (nor less) than maximum likelihood estimation,
where the likelihood function is the likelihood of the parameters
*given the data that was actually observed*. By "filling in"
missing data, you're not adding new information to the problem;
rather, you're allowing yourself to use all the information
that is in those components of the data vector that aren't
missing. (EM theory provides the justification for that
statement.) For example, if only one component of an input vector
is missing, it's obviously wasteful to neglect what the
other components are telling you. And,
indeed, if you neglect the whole vector, you will not end up
with maximum likelihood estimates for the weights (nor in
general will you get maximum likelihood estimates if you fill
in a value with the unconditional mean of that variable).
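
To make that last parenthetical concrete with the simplest purely
illustrative example: for a bivariate Gaussian with means mu_1, mu_2
and covariances sigma_11, sigma_12, sigma_22, the correct fill-in for
a missing x_1 given an observed x_2 is the conditional mean

    E[x_1 | x_2] = mu_1 + (sigma_12 / sigma_22) (x_2 - mu_2),

and, when second moments are the statistics being accumulated, the
filled-in square also needs the residual variance
sigma_11 - sigma_12^2 / sigma_22. Filling in the unconditional mean
mu_1 drops both the regression term and the residual variance, which
is why it attenuates the estimated variances and covariances and does
not give the ML answer.
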
"Filling in" is not the only way to compute ML estimates for
missing data problems, but its virtue is that it allows the
use of the same learning algorithms as would be used for complete
data (without incurring any bias, if the filling in is done correctly).
The only downside is that even if the complete-data algorithm
is one-pass (which the Boltzmann algorithm and mixture fitting
are not), the "filling-in" approach is generally iterative,
because the parameter estimates depend on the filled-in values
which in turn depend on the parameter estimates.
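
Here is a minimal sketch of that iteration for the simplest case,
a single multivariate Gaussian fit by EM when some components of each
data vector are missing (Python/numpy; the function name and interface
are just for illustration, but the mixture and Boltzmann cases have
the same alternating structure):

import numpy as np

def em_gaussian_missing(X, n_iter=50):
    # ML fit of a Gaussian (mu, Sigma) when some entries of the data
    # matrix X are NaN. The E step fills in conditional expectations
    # given the current parameters; the M step is the ordinary
    # complete-data update; they are alternated because each depends
    # on the other.
    X = np.asarray(X, float)
    n, d = X.shape
    obs = ~np.isnan(X)

    mu = np.nanmean(X, axis=0)          # crude starting values
    Sigma = np.eye(d)

    for _ in range(n_iter):
        sum_x = np.zeros(d)
        sum_xx = np.zeros((d, d))
        for i in range(n):
            o, m = obs[i], ~obs[i]
            x_hat = X[i].copy()
            C = np.zeros((d, d))        # conditional covariance of the missing part
            if m.any():
                S_oo = Sigma[np.ix_(o, o)]
                S_mo = Sigma[np.ix_(m, o)]
                K = S_mo @ np.linalg.inv(S_oo)
                # E step: conditional mean of the missing components,
                # given the observed components and current parameters.
                x_hat[m] = mu[m] + K @ (X[i, o] - mu[o])
                C[np.ix_(m, m)] = Sigma[np.ix_(m, m)] - K @ S_mo.T
            sum_x += x_hat
            sum_xx += np.outer(x_hat, x_hat) + C
        # M step: the usual complete-data formulas, applied to the
        # filled-in sufficient statistics.
        mu = sum_x / n
        Sigma = sum_xx / n - np.outer(mu, mu)
    return mu, Sigma

Note the conditional-covariance term added to the second moments:
filling in point values and then treating the data as if complete
would still understate the variances.
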
On the other hand, there are so-called "monotone" patterns
of missing data for which the filling-in approach is not
necessarily iterative. This monotone case might be of
interest, because it is relevant for problems involving
feedforward networks in which the input vectors are
complete but some of the outputs are missing. (Note that even
if all the output values for a case are missing, an ML
algorithm will not throw the case out; there is statistical
structure in the input vector that the algorithm must not
neglect.)
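
For the feedforward/regression version of that monotone case, here is
a small sketch of why no iteration is needed, under the purely
illustrative assumption of a conditional model with Gaussian errors
that are independent across output components (the function name is
made up). The likelihood of the outputs that were actually observed
simply factors over those components, so the usual one-pass or
gradient-based fit goes through with the missing output terms masked
out of the error. The sketch only covers the case where some
components of some targets are missing; it does not address exploiting
the input structure when a whole target is missing, which is the point
of the parenthesis above.

import numpy as np

def masked_sse(Y_pred, Y):
    # Half sum-of-squares over the observed output components only;
    # missing targets are marked with NaN. Up to constants this is
    # minus the log likelihood of the observed outputs under the
    # Gaussian-error assumption, so it can be minimized directly,
    # with no iterative filling in of the missing targets.
    observed = ~np.isnan(Y)
    resid = np.where(observed, Y_pred - Y, 0.0)
    return 0.5 * np.sum(resid ** 2)
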
Mike
(See Ghahramani's message for references; particularly the
Little and Rubin book).