missing data
Volker Tresp
Volker.Tresp at zfe.siemens.de
Fri Feb 4 13:09:46 EST 1994
In response to the questions raised by Lutz Prechelt concerning
the missing data problem:
In general, the solution to the missing-data problem depends on
the missing-data mechanism. For example, if you sample the income
of a population and rich people tend to refuse to answer, the mean
of your sample is biased. To obtain an unbiased estimate
you would have to take the missing-data mechanism into account.
The missing-data mechanism can be ignored if it is independent of
the input and the output (in the example: the likelihood that a
person refuses to answer is independent of the person's income).
Most approaches assume that the missing-data mechanism can be ignored.
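The income example can be illustrated with a small simulation. This is only a sketch under stated assumptions: the log-normal income distribution, the two response rules, and the function name simulate_income_survey are all made up for illustration and are not taken from any real survey or from the paper.

```python
import random

def simulate_income_survey(n=50000, mu=10.5, sigma=0.6, seed=0):
    """Return (true_mean, mean_ignorable, mean_nonignorable) for a toy survey."""
    rng = random.Random(seed)
    # Hypothetical population: log-normally distributed incomes.
    incomes = [rng.lognormvariate(mu, sigma) for _ in range(n)]
    true_mean = sum(incomes) / n
    median = sorted(incomes)[n // 2]

    ignorable, nonignorable = [], []
    for income in incomes:
        # Ignorable mechanism: everyone answers with probability 0.5,
        # independent of income.
        if rng.random() < 0.5:
            ignorable.append(income)
        # Non-ignorable mechanism: the richer the person,
        # the less likely an answer.
        if rng.random() < 1.0 / (1.0 + income / median):
            nonignorable.append(income)

    def mean(values):
        return sum(values) / len(values)

    return true_mean, mean(ignorable), mean(nonignorable)

true_mean, mean_ignorable, mean_nonignorable = simulate_income_survey()
print("true mean:              ", round(true_mean))
print("ignorable mechanism:    ", round(mean_ignorable))
print("non-ignorable mechanism:", round(mean_nonignorable))
```

Under these assumptions the sample mean stays close to the population mean when refusals are independent of income, and falls clearly below it when refusals become more likely with rising income.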
There exist a number of ad hoc solutions to the missing-data problem,
but it is also possible to approach the problem from a statistical point
of view. In our paper (which will be published in the upcoming
NIPS-volume and which will be available on neuroprose
shortly) we discuss a systematic likelihood-based approach.
NN-regression can be framed as a maximum likelihood learning problem
if we assume the standard signal plus Gaussian noise model
P(x, y) = P(x) P(y|x) \propto P(x) exp(-1/(2 \sigma^2) (y - NN(x))^2).
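Spelled out as a sketch (the pattern index k, the number of patterns K, and the weight subscript w are notation introduced here for illustration): for K independent complete patterns, the log-likelihood of this model is

```latex
\log L(w) = \sum_{k=1}^{K} \log P(x_k)
          - \frac{1}{2\sigma^2} \sum_{k=1}^{K} \bigl(y_k - NN_w(x_k)\bigr)^2
          + \mathrm{const}
```

Assuming P(x) does not depend on the network weights w, maximizing this expression with respect to w is the same as minimizing the usual sum of squared errors.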
By deriving the probability density function for a pattern with missing
features we can formulate a likelihood function including patterns
with complete and incomplete features.
The solution requires an integration over the missing input.
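As a sketch of what such a density might look like, assume the input splits as x = (x^c, x^m), with x^c observed and x^m missing, and that the missing-data mechanism can be ignored. The superscripts are notation introduced here for illustration and may differ from the paper's. Marginalizing the model above over x^m gives

```latex
P(x^c, y) = \int P(x^c, x^m) \, P(y \mid x^c, x^m) \, dx^m
  \propto \int P(x^c, x^m)
     \exp\!\left(-\frac{1}{2\sigma^2} \bigl(y - NN(x^c, x^m)\bigr)^2\right) dx^m
```

A complete pattern contributes log P(x, y) to the log-likelihood, and an incomplete pattern contributes log P(x^c, y).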
In practice, the integral is approximated numerically.
For networks of Gaussian basis functions, it is possible to obtain
closed-form solutions (by extending the EM algorithm).
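One simple way to approximate such an integral numerically is plain Monte Carlo sampling. The sketch below is not the method used in the paper; it assumes that the missing input can be sampled given the known input, and all names (marginal_likelihood, toy_model, sample_standard_normal) are hypothetical. The toy model is linear so that the estimate can be compared with an exact value.

```python
import math
import random

def gaussian_pdf(y, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def marginal_likelihood(y, x_known, model, sample_missing, sigma,
                        n_samples=20000, seed=0):
    """Monte Carlo estimate of P(y | x_known): the average of
    P(y | x_known, x_missing) over samples of the missing input."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x_missing = sample_missing(rng, x_known)
        total += gaussian_pdf(y, model(x_known, x_missing), sigma)
    return total / n_samples

# Toy stand-in for the network (assumption): output is the sum of both inputs.
def toy_model(x_known, x_missing):
    return x_known + x_missing

# Assumption: the missing input is standard normal, independent of x_known.
def sample_standard_normal(rng, x_known):
    return rng.gauss(0.0, 1.0)

sigma = 0.5
estimate = marginal_likelihood(1.0, 1.0, toy_model, sample_standard_normal, sigma)

# For this linear toy case the marginal is Gaussian with variance 1 + sigma^2.
exact = gaussian_pdf(1.0, 1.0, math.sqrt(1.0 + sigma ** 2))
print(estimate, exact)
```

For a real network no such exact value is available, and the number of samples trades accuracy against cost.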
Our paper also discusses why and when ad hoc solutions, such as substituting
the mean for an unknown input, are harmful. For example,
if the mapping is approximately linear, substituting the mean might work
quite well. In general, however, it introduces bias.
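A toy illustration of this point, assuming a single standard-normal input that is always missing and a known mapping f (both assumptions made only for this sketch, with the hypothetical helper mean_substitution_gap): the prediction obtained by substituting the mean, f(mean(x)), is compared with the average output, mean(f(x)).

```python
import random

def mean_substitution_gap(f, n=100000, seed=0):
    """Return f(mean(x)) - mean(f(x)) for standard-normal samples x."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x_mean = sum(xs) / n
    average_output = sum(f(x) for x in xs) / n
    return f(x_mean) - average_output

# Linear mapping: substituting the mean reproduces the average output.
linear_gap = mean_substitution_gap(lambda x: 2.0 * x + 1.0)

# Quadratic mapping: substituting the mean underestimates the average output.
quadratic_gap = mean_substitution_gap(lambda x: x * x)

print("linear gap:   ", linear_gap)
print("quadratic gap:", quadratic_gap)
```

The gap vanishes (up to rounding) for the linear mapping and is clearly negative for the quadratic one, which is the kind of bias referred to above.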
Training with missing and noisy input data is described in:
``Training Neural Networks with Deficient Data,''
V. Tresp, S. Ahmad and R. Neuneier, in Cowan, J. D., Tesauro, G.,
and Alspector, J. (eds.), {\em Advances in Neural Information Processing Systems 6}, Morgan Kaufmann, 1994.
A related paper by Zoubin Ghahramani and Michael Jordan will also appear
in the upcoming NIPS-volume.
Recall with missing and noisy data is discussed in the following paper
(available in neuroprose as ahmad.missing.ps.Z):
``Some Solutions to the Missing Feature Problem in Vision,''
S. Ahmad and V. Tresp, in
{\em Advances in Neural Information Processing Systems 5,}
S. J. Hanson, J. D. Cowan, and C. L. Giles eds.,
San Mateo, CA, Morgan Kaufmann, 1993.
Volker Tresp              Subutai Ahmad           Ralph Neuneier
tresp at zfe.siemens.de   ahmad at interval.com   ralph at zfe.siemens.de