missing data

Volker Tresp Volker.Tresp at zfe.siemens.de
Fri Feb 4 13:09:46 EST 1994




In response to  the questions raised by Lutz Prechelt concerning
the missing data problem:



In general, the solution to the missing-data problem depends on 
the missing-data mechanism. For example, if you sample the income
of a population and rich people tend to refuse to answer, the mean
of your sample is biased. To obtain an unbiased estimate,
you would have to take the missing-data mechanism into account.

The missing-data mechanism can be ignored if it is independent of 
the input and the output (in the example: the likelihood that a 
person refuses to answer is independent of the person's income). 
Most approaches assume that the missing-data mechanism can be ignored.
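
The income example can be checked with a small simulation. This is only a
sketch: the log-normal income distribution, the refusal probabilities, and
the function name sample_means are assumptions made for illustration and
are not taken from the paper.

```python
import random


def sample_means(n=100_000, seed=0):
    """Return (true mean, mean under ignorable, mean under non-ignorable missingness)."""
    rng = random.Random(seed)
    # Hypothetical income distribution (log-normal); an assumption for this sketch.
    incomes = [rng.lognormvariate(10.0, 0.5) for _ in range(n)]
    true_mean = sum(incomes) / n
    median = sorted(incomes)[n // 2]

    # Ignorable mechanism: everyone refuses with the same probability (0.3),
    # independent of income.
    ignorable = [v for v in incomes if rng.random() > 0.3]

    # Non-ignorable mechanism: people above the median refuse far more often
    # (0.6) than people below it (0.1).
    nonignorable = [
        v for v in incomes if rng.random() > (0.6 if v > median else 0.1)
    ]

    mean = lambda values: sum(values) / len(values)
    return true_mean, mean(ignorable), mean(nonignorable)


if __name__ == "__main__":
    true_mean, ign, non = sample_means()
    print("true mean:            ", round(true_mean))
    print("ignorable refusals:   ", round(ign))
    print("income-based refusals:", round(non))
```

With refusals independent of income the sample mean stays close to the
population mean; with income-dependent refusals it is pulled downward.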


There exist a number of ad hoc solutions to the missing-data problem
but it is also possible to approach the problem from a statistical point
of view. In our paper (which will be published in the upcoming 
NIPS-volume and which will be available on neuroprose
shortly) we discuss a systematic likelihood-based approach.
NN-regression can be framed as a maximum likelihood learning problem
if we assume the standard signal plus Gaussian noise model:

P(x, y) = P(x) P(y|x) \propto P(x) \exp( -(y - NN(x))^2 / (2 \sigma^2) ).


By deriving the probability density function for  a pattern with missing
features  we can formulate a likelihood function including patterns 
with complete and incomplete features.  
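
As a sketch of that step, split each input into an observed part and a
missing part, x = (x^o, x^m). This notation, and the index sets C (complete
patterns) and I (incomplete patterns), are assumptions introduced here for
illustration; the paper's own notation may differ. Assuming the missing-data
mechanism can be ignored, marginalizing the model above gives:

```latex
\begin{align*}
P(x^o, y) &= \int P(x^o, x^m)\, P(y \mid x^o, x^m)\, dx^m \\
          &\propto \int P(x^o, x^m)\,
             \exp\!\left( -\frac{(y - NN(x^o, x^m))^2}{2\sigma^2} \right) dx^m ,
\\[1ex]
\ell &= \sum_{k \in C} \log P(x_k, y_k) \;+\; \sum_{k \in I} \log P(x_k^o, y_k) .
\end{align*}
```

The first sum is the usual complete-data log-likelihood; the second sum
contributes one marginal density per pattern with missing features.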

The solution requires an integration over the missing inputs. 
In practice, the integral is approximated numerically. 
For networks of Gaussian basis functions, it is possible to obtain 
closed-form solutions (by extending the EM algorithm).
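
One simple way to approximate the integral over a missing input is Monte
Carlo sampling. The sketch below is not the procedure from the paper: it
assumes two inputs with a standard bivariate Gaussian density (correlation
rho), a missing second input x2, and a stand-in function toy_net in place of
a trained network. All names and parameter values are hypothetical.

```python
import math
import random


def gauss_pdf(v, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-(v - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def incomplete_likelihood(net, x1, y, rho=0.8, sigma=0.3, n_samples=20_000, seed=0):
    """Monte Carlo estimate of P(x1, y) when the second input x2 is missing.

    Uses P(x1, y) = P(x1) * E[ P(y | x1, x2) ], with the expectation taken
    over x2 ~ P(x2 | x1). For the assumed standard bivariate Gaussian,
    x2 | x1 is Gaussian with mean rho * x1 and variance 1 - rho**2.
    """
    rng = random.Random(seed)
    cond_mean = rho * x1
    cond_std = math.sqrt(1.0 - rho ** 2)
    total = 0.0
    for _ in range(n_samples):
        x2 = rng.gauss(cond_mean, cond_std)               # draw the missing input
        total += gauss_pdf(y, net(x1, x2), sigma ** 2)    # Gaussian noise model
    return gauss_pdf(x1, 0.0, 1.0) * total / n_samples


def toy_net(x1, x2):
    """Stand-in for a trained network NN(x); purely illustrative."""
    return math.tanh(x1 + 2.0 * x2)


if __name__ == "__main__":
    print(incomplete_likelihood(toy_net, x1=0.5, y=0.2))
```

For a linear stand-in the same quantity has a closed form, which gives a
convenient sanity check on the sampler.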

Our paper also discusses why and when ad hoc solutions (such as substituting
the mean for an unknown input) are harmful. For example, 
if the mapping is approximately linear, substituting the mean might work
quite well. In general, however, it introduces bias. 
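
A small numerical illustration of this point, under assumptions that are not
from the paper: a single unknown input drawn from a Gaussian, and two
hypothetical mappings, one linear and one quadratic. The function compare
contrasts the output at the mean input with the output averaged over the
input distribution.

```python
import random


def compare(f, mean=0.0, std=1.0, n=100_000, seed=0):
    """Return (f evaluated at the mean input, f averaged over sampled inputs)."""
    rng = random.Random(seed)
    substituted = f(mean)  # ad hoc solution: plug in the mean
    averaged = sum(f(rng.gauss(mean, std)) for _ in range(n)) / n
    return substituted, averaged


if __name__ == "__main__":
    # Linear mapping: both quantities agree up to sampling noise.
    print(compare(lambda x: 2.0 * x + 1.0))
    # Quadratic mapping: mean substitution gives 0, the average is close to 1.
    print(compare(lambda x: x * x))
```

For the linear mapping the two values coincide; for the quadratic mapping
substituting the mean systematically underestimates the expected output.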



Training with missing and noisy input data is described in:

``Training Neural Networks with Deficient Data,''
V. Tresp, S. Ahmad and R. Neuneier, in Cowan, J. D., Tesauro, G., 
and Alspector, J. (eds.), {\em  Advances in Neural Information Processing Systems 6}, Morgan Kaufmann,  1994.

A related paper by Zoubin Ghahramani and Michael Jordan will also appear 
in the  upcoming NIPS-volume.



Recall with missing and noisy data is discussed in the following paper
(available in neuroprose as ahmad.missing.ps.Z): 

``Some Solutions to the Missing Feature Problem in Vision,'' 
 S. Ahmad and  V. Tresp,  in  
{\em Advances in Neural Information Processing Systems 5,}
S. J. Hanson, J. D. Cowan,  and C. L. Giles eds.,
San Mateo, CA, Morgan Kaufmann,  1993. 



Volker Tresp		Subutai Ahmad		Ralph Neuneier
tresp at zfe.siemens.de	ahmad at interval.com 	ralph at zfe.siemens.de

