Encoding missing values

Lutz Prechelt prechelt at ira.uka.de
Wed Feb 2 03:58:56 EST 1994


I am currently thinking about how to encode data with attributes for
which some of the values are missing in the data set used for neural
network training and use.

An example of such data is the 'heart-disease' dataset from the UCI machine
learning database (anonymous FTP on "ics.uci.edu" [128.195.1.1], directory
"/pub/machine-learning-databases"). There are 920 records altogether with
14 attributes each. Only 299 of the records are complete, the others have one
or several missing attribute values.  11% of all values are missing.

I consider here only networks that handle arbitrary numbers of real-valued
inputs (e.g. all backpropagation-suited network types). I do NOT consider
missing output values. In this setting, I can think of several reasonable
ways to encode such missing values, depending on the kind of attribute and
how it was encoded in the first place (a small code sketch illustrating
some of these encodings follows the list):

1. Nominal attributes (that have n different possible values)
  1.1 encoded "1-of-n", i.e., one network input per possible value,
      the relevant one being 1, all others 0.
      This encoding is very general, but has the disadvantage of producing
      networks with very many connections.
      Missing values can either be represented as 'all zero' or by simply
      treating 'is missing' as just another possible input value, resulting
      in a "1-of-(n+1)" encoding.
  1.2 encoded binary, i.e., ceil(log2(n)) inputs being used like the bits in
      a binary representation of the numbers 0...n-1 (or 1...n).
      Missing values can either be represented as just another possible input
      value (probably all-bits-zero is best) or by adding an additional network
      input which is 1 for 'is missing' and 0 for 'is present'. The original
      inputs should probably be all zero in the 'is missing' case.

2. continuous attributes (or attributes treated as continuous)
  2.1 encoded as a single network input, perhaps using some monotone
      transformation to force the values into a certain distribution.
      Missing values are either encoded as a kind of 'best guess' (e.g. the
      average of the non-missing values for this attribute) or by using
      an additional network input being 0 for 'missing' and 1 for 'present' 
      (or vice versa) and setting the original attribute input either to 0
      or to the 'best guess'. (The 'best guess' variant also applies to
      nominal attributes above)

3. binary attributes (truth values)
  3.1 encoded by one input:  0=false  1=true   or vice versa
      Treat like (2.1)
  3.2 encoded by one input:  -1=false 1=true   or vice versa
      In this case we may act as for (3.1) or may just use 0 to
      indicate 'missing'.
  3.3 treat like nominal attribute with 2 possible values

4. ordinal attributes (having n different possible values, which are ordered)
  4.1 treat either like a continuous or like a nominal attribute.
      If (1.2) is chosen, a Gray code should be used.
      Continuous representation is risky unless a 'sensible' quantification
      of the possible values is available.
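
To make 1.1 and 2.1 concrete, here is a minimal sketch (the helper names
and the use of Python/NumPy are illustrative only, not part of the
encodings themselves):

import numpy as np

def encode_nominal(value, categories):
    """1-of-(n+1) encoding: one input per category plus one for 'missing'."""
    vec = np.zeros(len(categories) + 1)
    if value is None:                        # missing -> extra slot
        vec[-1] = 1.0
    else:
        vec[categories.index(value)] = 1.0
    return vec

def encode_continuous(value, best_guess):
    """Single input plus a 0/1 'is present' indicator (scheme 2.1)."""
    if value is None:
        return np.array([best_guess, 0.0])   # fill with e.g. the attribute mean
    return np.array([float(value), 1.0])

print(encode_nominal(None, ["typical", "atypical", "non-anginal", "asympt"]))
print(encode_continuous(None, best_guess=246.0))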

So much for my considerations. Now for my questions.

a) Can you think of other encoding methods that seem reasonable? Which?

b) Do you have experience with some of these methods that is worth sharing?

c) Have you compared any of the alternatives directly?

------------------------------------------------------------------------

SUMMARY:

For a), the following ideas were mentioned:
1. use statistical techniques to compute replacement values from the rest
   of the data set
2. use a Boltzmann machine to do this for you
3. use an autoencoder feed-forward network to do this for you
4. randomize over the missing values (correct in the Bayesian sense)

For b), some experience was reported. I don't know how to summarize
that nicely, so I just don't summarize at all.

For c), no explicit quantitative results were given directly.

Some replies suggest that data are not always missing at random.
The biases are often known and should be taken into account (e.g.
medical tests are more often not carried out, resulting in missing data,
for more or less healthy persons than for ill persons).

Many replies contained references to published work on this area,
from NN, machine learning, and mathematical statistics.
To ease searching for these references in the replies below, I have
marked them with the string ##REF## (if you have a 'grep' program that
extracts whole paragraphs, you can get them all out with one command).

Thanks to all who answered.
These are the trimmed versions of the replies:

------------------------------------------------------------------------

From: tgd at research.CS.ORST.EDU (Tom Dietterich)

[...for nominal attributes:]
An alternative here is to encode them as bit strings in an
error-correcting code, so that the Hamming distance between any two
bit strings is constant.  This would probably be better than a dense
binary encoding.  The cost in additional inputs is small.  I haven't
tried this though.  My guess is that distributed representations at
the input are a bad idea.
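
[ One concrete way (a sketch of mine, not something Tom tried either) to
  get bit strings with constant pairwise Hamming distance is to use rows
  of a Hadamard matrix:

import numpy as np

def hadamard(m):
    """Sylvester construction; m must be a power of two."""
    H = np.array([[1]])
    while H.shape[0] < m:
        H = np.block([[H, H], [H, -H]])
    return H

def constant_distance_codes(n_values):
    m = 1
    while m < n_values:
        m *= 2
    H = hadamard(m)
    return (H[:n_values] > 0).astype(float)   # one 0/1 codeword per value

codes = constant_distance_codes(5)
for i in range(len(codes)):
    for j in range(i + 1, len(codes)):
        # distinct Hadamard rows disagree in exactly m/2 positions
        assert (codes[i] != codes[j]).sum() == codes.shape[1] // 2
print(codes)
]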

One must always determine WHY the value is missing.  In the heart
disease data, I believe the values were not measured because other
features were believed to be sufficient in each case.  In such cases,
the network should learn to down-weight the importance of the feature
(which can be accomplished by randomizing it---see below).

In other cases, it may be more appropriate to treat a missing value as
a separate value for the feature, e.g., in survey research, where a
subject chooses not to answer a question.

[...for continuous attributes:]
Ross Quinlan suggests encoding missing values as the mean observed
output value when the value is missing.  He has tried this in his
regression tree work.

Another obvious approach is to randomize the missing values--on each
presentation of the training example, choose a different, random,
value for each missing input feature.  This is the "right thing to do"
in the Bayesian sense.
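
[ A bare-bones sketch of this randomization (Python/NumPy; drawing each
  missing entry from the observed values of that attribute is one choice
  of sampling distribution, made up by me):

import numpy as np

def fill_randomly(x, column_pools, rng):
    """x: 1-D array with np.nan marking missing entries; column_pools[i]
    holds the observed (non-missing) values of attribute i."""
    x = x.copy()
    for i in np.flatnonzero(np.isnan(x)):
        x[i] = rng.choice(column_pools[i])
    return x

rng = np.random.default_rng(0)
pools = {0: np.array([1.0, 2.0, 3.0]), 1: np.array([0.0, 1.0])}
print(fill_randomly(np.array([np.nan, 1.0]), pools, rng))
# in the training loop, every presentation of an incomplete example would
# call fill_randomly again, so each epoch sees a fresh random completion
]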

[...for binary attributes:]
I'm skeptical of the -1,0,1 encoding, but I think there is more
research to be done here.

[...for ordinal attributes:]
I would treat them as continuous.  

------------------------------------------------------------------------

From: shavlik at cs.wisc.edu (Jude W. Shavlik)

We looked at some of the methods you talked about in
the following article in the journal Machine Learning.

##REF##
%T Symbolic and Neural Network Learning Algorithms: An Experimental Comparison
%A J. W. Shavlik
%A R. J. Mooney
%A G. G. Towell
%J Machine Learning
%V 6
%N 2
%P 111-143
%D 1991

------------------------------------------------------------------------

From: hertz at nordita.dk (John Hertz)

It seems to me that the most natural way to handle missing data is to
leave them out.  You can do this if you work with a recurrent network
(fx Boltzmann machine) where the inputs are fed in by clamping the
input units to the given input values and the rest of the net relaxes
to a fixed point, after which the output is read off the output units.
If some of the input values are missing, the corresponding input units
are just left unclamped, free to relax to values most consistent with 
the known inputs. 

I have meant for a long time to try this on some medical prognosis data
I was working on, but I never got around to it, so I would be happy to
hear how it works if you try it.  
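
[ A minimal sketch of this recall scheme, assuming a small stochastic
  binary network whose symmetric weights have already been trained; the
  toy weights below are made up:

import numpy as np

def complete_pattern(W, b, clamped, n_steps=2000, T=1.0, rng=None):
    """W: symmetric weights, b: biases, clamped: dict {unit index: 0 or 1}."""
    rng = rng or np.random.default_rng()
    s = rng.integers(0, 2, size=len(b)).astype(float)
    for i, v in clamped.items():
        s[i] = v
    free = [i for i in range(len(b)) if i not in clamped]
    for _ in range(n_steps):
        i = rng.choice(free)                  # update only unclamped units
        p = 1.0 / (1.0 + np.exp(-(W[i] @ s + b[i]) / T))
        s[i] = 1.0 if rng.random() < p else 0.0
    return s   # filled-in inputs and outputs are read off the free units

# toy usage: unit 0 is a known input, unit 1 a missing input, 2-3 the rest
W = np.array([[0, 2, 1, 0], [2, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]], float)
print(complete_pattern(W, np.zeros(4), clamped={0: 1.0}))
]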

------------------------------------------------------------------------

From: jozo at sequoia.WPI.EDU (Jozo Dujmovic)

In the case of clustering benchmark programs I frequently have
the problem of estimating missing data. A relatively
simple piece of software that implements a heuristic algorithm generates
estimates with an average error of 8%. A NN will somehow
"implicitly estimate" the missing data. The two approaches
might even be in some sense equivalent (?).

Jozo

[ I suspect that they are not: when you generate values for the
  missing items and put them in the training set, the network loses the
  information that these data are only estimates. Since estimates are
  not as reliable as true input data, the network will weight inputs that
  have lots of generated values as less important. If it gets the 'is
  missing' information explicitly, it can discriminate true values from
  estimates instead. ]

------------------------------------------------------------------------

From: guy at cs.uq.oz.au

A final-year student of mine worked on the problem of dealing with missing
inputs, without much success. However, the student was not very good, so take
the following opinions with a pinch of salt.

We (very tentatively) came to the conclusion that if the inputs were
redundant, the problem was easy; if the missing input contained vital
information, the problem was pretty much impossible.

We used the heart disease data. I don't recommend it for the missing inputs
problem. All of the inputs are very good indicators of the correct result,
so missing inputs were not important. 

Apparently there is a large literature in statistics on dealing with missing
inputs. 

Anthony Adams (University of Tasmania) has published a technical report on
this. His email address is "A.Adams at cs.utas.edu.au". 

##REF##
@techreport{kn:Vamplew-91,
   author = "P. Vamplew and A. Adams", 
   address = {Hobart, Tasmania, Australia},
   institution = {Department of Computer Science, University of Tasmania},
   number = {R1-4},
   title = {Real World Problems in Backpropagation: Missing Values and Generalisability},
   year = {1991}
}

------------------------------------------------------------------------

From: Mike Southcott <mlsouth at cssip.levels.unisa.edu.au>

##REF##
I wrote a paper for the Australian conference on neural networks in 1993.
``Classification of Incomplete Data using neural networks''
Southcott, Bogner.

You may find it interesting. You may not be able to get the proceedings
for this conference, but I am in the process of digging up a postscript
copy for someone in the States, so when I do that, I will send
you a copy.

------------------------------------------------------------------------

From: Eric Saund <saund at parc.xerox.com>

I have done some work on unsupervised learning of multiple-cause
clusters in binary data, for which an appropriate encoding scheme
is -1 = FALSE, 1 = TRUE, and 0 = NO DATA.  This has worked well
for me, but my paradigm is not your standard feedforward network
and uses a different activation function from the standard
weighted sum followed by sigmoid squashing.
I presented the paper on this work at NIPS:

##REF##
Saund, Eric; 1994; "Unsupervised Learning of Mixtures of Multiple Causes
in Binary Data," in Advances in Neural Information Processing Systems -6-, 
Cowan, J., Tesauro, G, and Alspector, J., eds. Morgan Kaufmann, San Francisco.

------------------------------------------------------------------------

From: Thierry.Denoeux at hds.univ-compiegne.fr

In a recent mailing, Lutz Prechelt mentioned the interesting problem of how 
to encode attributes with missing values as inputs to a neural network.
I have recently been faced with that problem while applying neural nets to
rainfall prediction using weather radar images. The problem was to classify
pairs of "echoes" -- defined as groups of connected pixels with reflectivity
above some threshold -- taken from successive images as corresponding to
the same rain cell or not. Each pair of echoes was described by a list of
attributes. Some of these attributes, referring to the past of a sequence,
were not defined for some instances. To encode these attributes with
potentially missing values, we applied two different methods actually
suggested by Lutz:

- the replacement of the missing value by a "best-guess" value 
- the addition of a binary input indicating whether the corresponding attribute
  was present or absent.

Significantly better results were obtained by the second method.

This work was presented at ICANN'93 last September:
##REF##
X. Ding, T. Denoeux & F. Helloco (1993). Tracking rain cells in radar images
using multilayer neural networks. In Proc. of ICANN'93, Springer-Verlag, 
p. 962-967.

------------------------------------------------------------------------

From: "N. Karunanithi" <karun at faline.bellcore.com>

[...for nominal attributes:]
   Both methods have the problem of poor scalability. If the number of
possible values increases, then the number of additional inputs will
increase linearly in 1.1 and logarithmically in 1.2.
    In fact, 1-of-n encoding may be a poor choice if (1) the number
of input features is large and (2) such an expanded dimensional 
representation does not become a (semi) linearly separable problem.
Even if it becomes a linearly separable problem, the overall complexity
of the network can sometimes be very high.

[...for continuous attributes:]
This representation requires a guess. Such a transformation may not be
a proper representation in some cases. Assume that the output values
range over a large numerical interval, for example from 0.0 to 10,000.0.
If you use a simple scaling like dividing by 10,000.0 to make it
fall between 0.0 and 1.0, this will result in poor accuracy of prediction.
If the attribute is on the input side, then in theory the
scaling is unnecessary because the input layer weights will scale
accordingly. However, in practice I had a lot of problems with this
approach. Maybe a log transformation before scaling would not be a bad
choice.
If you use a closed scaling you may have problems whenever a future value
exceeds the maximum value of the numerical interval. For example,
assume that the attribute is time, say in milliseconds. Any future time
from the point of reference can exceed the limit. Hence any closed
scaling will not work properly.
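
[ To illustrate the difference, a small sketch; the particular 'open'
  transformation is just one possibility I made up, not the one from the
  papers cited below:

import numpy as np

def closed_scale(x, x_min, x_max):
    """Min-max scaling: breaks down when a future x exceeds x_max."""
    return (x - x_min) / (x_max - x_min)

def open_scale(x, scale):
    """Log-compress, then squash to (0, 1); any future value stays inside."""
    return 1.0 - np.exp(-np.log1p(x) / scale)

print(closed_scale(12000.0, 0.0, 10000.0))   # 1.2 -- outside the unit range
print(open_scale(12000.0, scale=5.0))        # about 0.85, still below 1.0
]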

[...for ordinal attributes:]
I have compared binary encoding (1.2), a Gray-coded representation and
straightforward scaling. Closed scaling seems to do a good job. I have
also compared open scaling and closed scaling and did find a significant
improvement in prediction accuracy.
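
[ For reference, the binary-reflected Gray code mentioned here (and in 4.1
  above) can be generated like this; the helper is my own:

def gray_code_bits(k, n_bits):
    g = k ^ (k >> 1)                          # binary-reflected Gray code
    return [(g >> i) & 1 for i in reversed(range(n_bits))]

for k in range(4):
    print(k, gray_code_bits(k, 2))            # 00, 01, 11, 10
]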

##REF##
N. Karunanithi, D. Whitley and Y. K. Malaiya,
   "Prediction of Software Reliability Using Connectionist Models",
    IEEE Trans. Software Eng., July 1992, pp 563-574.

N. Karunanithi and Y. K. Malaiya, "The Scaling Problem in Neural
     Networks for Software Reliability Prediction", Proc. IEEE Int.
    Symposium on Rel. Eng., Oct. 1992, pp. 776-82.

I have not found a simple solution that is general. I think
representation in general, and missing information in particular,
are open problems within connectionist research. I am not sure we will
have a magic bullet for all problems. The best approach is to come up
with a specific solution for a given problem.

------------------------------------------------------------------------

From: Bill Skaggs <bill at nsma.arizona.edu>

There is at least one kind of network that has no problem (in
principle) with missing inputs, namely a Boltzmann machine.
You just refrain from clamping the input node whose value is
missing, and treat it like an output node or hidden unit.

This may seem to be irrelevant to anything other than Boltzmann
machines, but I think it could be argued that nothing very much
simpler is capable of dealing with the problem.  When you ask
a network to handle missing inputs, you are in effect asking it
to do pattern completion on the input layer, and for this a
Boltzmann machine or some other sort of attractor network would
seem to be required. 

------------------------------------------------------------------------

From: "Scott E. Fahlman" <sef+ at cs.cmu.edu>
    
[Follow-up to Bill Skaggs:]
Good point, but perhaps in need of clarification for some readers:

There are two ways of training a Boltzmann machine.  In one (the original
form), there is no distinction between input and output units.  During
training we alternate between an instruction phase, in which all of the
externally visible units are clamped to some pattern, and a normalization
phase, in which the whole network is allow to run free.  The idea is to
modify the weights so that, when running free, the external units assume
the various pattern values in the training set in their proper frequencies.
If only some subset of the externally visible units are clamped to certain
values, the net will produce compatible completions in the other units,
again with frequencies that match this part of the training set.

A net trained in this way will (in principle -- it might take a *very* long
time for anything complicated) do what you suggest: Complete an "input"
pattern and produce a compatible output at the same time.  This works even
if the input is *totally* missing.

I believe it was Geoff Hinton who realized that a Boltzmann machine could
be trained more efficiently if you do make a distinction between input and
output units, and don't waste any of the training effort learning to
reconstruct the input.  In this model, the instruction phase clamps both
input and output units to some pattern, while the normalization phase
clamps only the input units.  Since the input units are correct in both
cases, all of the network's learning power (such as it is) goes into
producing correct patterns on the output units.  A net trained in this way
will not do input-completion.

I bring this up because I think many people will only have seen the latter
kind of Boltzmann training, and will therefore misunderstand your
observation.

By the way, one alternative method I have seen proposed for reconstructing
missing input values is to first train an auto-encoder (with some degree of
bottleneck to get generalization) on the training set, and then feed the
output of this auto-encoder into the classification net.  The auto-encoder
should be able to replace any missing values with some degree of accuracy.
I haven't played with this myself, but it does sound plausible.  If anyone
can point to a good study of this method, please post it here or send me
E-mail.
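
[ A bare-bones sketch of this auto-encoder idea; toy data, layer sizes,
  and learning rate are arbitrary choices of mine:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden=3, lr=0.5, epochs=2000):
    W1 = rng.normal(0, 0.1, (X.shape[1], n_hidden))   # encoder weights
    W2 = rng.normal(0, 0.1, (n_hidden, X.shape[1]))   # decoder weights
    for _ in range(epochs):
        H = sigmoid(X @ W1)
        Y = sigmoid(H @ W2)                  # reconstruction of the inputs
        dY = (Y - X) * Y * (1 - Y)           # squared-error gradient
        dH = (dY @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dY / len(X)
        W1 -= lr * X.T @ dH / len(X)
    return W1, W2

def reconstruct(x, means, W1, W2):
    x = np.where(np.isnan(x), means, x)      # crude initial fill
    return sigmoid(sigmoid(x @ W1) @ W2)     # feed this to the classifier

X = rng.random((50, 4))                      # stands in for complete records
W1, W2 = train_autoencoder(X)
print(reconstruct(np.array([0.2, np.nan, 0.7, np.nan]), X.mean(axis=0), W1, W2))
]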

------------------------------------------------------------------------

From: "David G. Stork" <stork at cache.crc.ricoh.com>

##REF##
There is a provably optimal method for performing classification with
missing inputs, described in Chapter 2 of "Pattern Classification and
Scene Analysis" (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork,
which avoids the ad-hoc heuristics that have been described by others.
Those interested in obtaining Chapter two via ftp should contact me.

------------------------------------------------------------------------

From: Wray Buntine <wray at ptolemy-ethernet.arc.nasa.gov>

This missing value problem is of course shared amongst all the
learning communities, artificial intelligence, statistics, pattern
recognition, etc., not just neural networks.

A classic study in this area, which includes most suggestions
I've read here so far, is

##REF##
@inproceedings{quinlan:ml6,
        AUTHOR = "J.R. Quinlan",
        TITLE = "Unknown Attribute Values in Induction",
        YEAR = 1989,
        BOOKTITLE = "Proceedings of the Sixth International
                        Machine Learning Workshop",
        PUBLISHER = "Morgan Kaufmann",
        ADDRESS = "Cornell, New York"}

The most frequently cited methods I've seen, and they're so common
amongst the different communities that it's hard to lay credit:
  1)     replace missing values by some best guess
  2)     fracture the example into a set of fractional examples,
         each with the missing value filled in somehow
  3)     call the missing value another input value

3 is a good thing to do if the values are "informatively" missing,
i.e. if someone leaves the entry "telephone number" blank in a
questionnaire, then maybe they don't have a telephone;
but it is probably not good otherwise, unless you
have loads of data and don't mind all the extra
example types generated (as already mentioned).

1 is a quick and dirty hack at 2.  How good depends on your
application.

2 is an approximation to the "correct" approach for handling
"non-informative" missing values according to the standard
"mixture model".  The mathematics for this is general and applies
to virtually any learning algorithm: trees, feed-forward nets,
linear regression, whatever.  We do it for feed-forward nets in

##REF##
@article{buntine.weigend:bbp,
        AUTHOR = "W.L. Buntine and A.S. Weigend",
        TITLE =  "Bayesian Back-Propagation",
        JOURNAL = "Complex Systems",
        Volume = 5,
        PAGES = "603--643",
        Number = 1,
        YEAR = "1991" }
and see Tresp, Ahmad & Neuneier in NIPS'94 for an implementation.
But no doubt someone probably published the general idea back in
the 50's.

I certainly wouldn't call missing values an open problem.
Rather, "efficient implementations of the standard approaches"
is, in some cases, an open problem.
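
[ To make method 2 concrete for a nominal attribute, a small sketch; the
  value probabilities would come from the data or a model, and everything
  here is illustrative:

def fracture(x, y, missing_col, value_probs):
    """Return [(x_filled, y, weight), ...] for one incomplete example."""
    pieces = []
    for value, p in value_probs.items():
        x_filled = dict(x, **{missing_col: value})
        pieces.append((x_filled, y, p))
    return pieces

probs = {"typical": 0.2, "atypical": 0.3, "asymptomatic": 0.5}
pieces = fracture({"age": 54, "cp": None}, 1, "cp", probs)
for piece in pieces:
    print(piece)
# the weighted copies then enter training with a weighted loss, e.g.
#   sum(w * (net(x) - y)**2 for x, y, w in pieces)
]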

------------------------------------------------------------------------

From: Volker Tresp <Volker.Tresp at zfe.siemens.de>

In general, the solution to the missing-data problem depends on 
the missing-data mechanism. For example, if you sample the income
of a population and rich people tend to refuse to answer, the mean
of your sample is biased. To obtain an unbiased solution
you would have to take into account the missing-data mechanism.

The missing-data mechanism can be ignored if it is independent of 
the input and the output (in the example: the likelihood that a 
person refuses to answer is independent of the person's income). 
Most approaches assume that the missing-data mechanism can be ignored.

There exist a number of ad hoc solutions to the missing-data problem
but it is also possible to approach the problem from a statistical point
of view. In our paper (which will be published in the upcoming 
NIPS-volume and which will be available on neuroprose
shortly) we discuss a systematic likelihood-based approach.
NN-regression  can be framed as a maximum likelihood learning problem
if we assume the standard signal plus Gaussian noise model  

P(x, y) =  P(x) P(y|x)    \propto P(x) exp(-1/(2 \sigma^2) (y - NN(x))^2).


By deriving the probability density function for  a pattern with missing
features  we can formulate a likelihood function including patterns 
with complete and incomplete features.  

The solution requires an integration over the missing input.
In practice, the integral is approximated numerically.
For networks of Gaussian basis functions,  it is possible to obtain 
closed-form solutions (by extending the EM algorithm).
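
[ A toy sketch of such a numerical approximation; the surrogate network,
  the noise level, and the density used to sample the missing input are
  all assumptions of mine:

import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1

def nn(x):                                   # stand-in for the trained NN(x)
    return np.sin(x[0]) + 0.5 * x[1]

def likelihood_incomplete(y, x_obs, sample_missing, n_samples=500):
    """P(y | x_obs) ~ mean over samples of P(y | x_obs, x_missing)."""
    vals = []
    for _ in range(n_samples):
        x = np.array([x_obs, sample_missing()])
        vals.append(np.exp(-((y - nn(x)) ** 2) / (2 * sigma ** 2)))
    return np.mean(vals) / np.sqrt(2 * np.pi * sigma ** 2)

# the missing second input is drawn here from an estimate of its density
print(likelihood_incomplete(1.0, 0.8, lambda: rng.normal(0.5, 0.2)))
]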

Our paper also discusses why and when ad hoc solutions -- such as substituting
the mean for an unknown input -- are harmful. For example,
if the mapping is approximately linear, substituting the mean might work
quite well. In general, however, it introduces bias.


Training with missing and noisy input data is described in:

##REF##
``Training Neural Networks with Deficient Data,''
V. Tresp, S. Ahmad and R. Neuneier, in Cowan, J. D., Tesauro, G., 
and Alspector, J. (eds.), {\em  Advances in Neural Information Processing Systems 6}, Morgan Kaufmann,  1994.

A related paper by Zoubin Ghahramani and Michael Jordan will also appear 
in the  upcoming NIPS-volume.

Recall with missing and noisy data is discussed in (available in neuroprose
as ahmad.missing.ps.Z): 

``Some Solutions to the Missing Feature Problem in Vision,'' 
 S. Ahmad and  V. Tresp,  in  
{\em Advances in Neural Information Processing Systems 5,}
S. J. Hanson, J. D. Cowan,  and C. L. Giles eds.,
San Mateo, CA, Morgan Kaufman,  1993. 

------------------------------------------------------------------------

From: Subhash Kak <kak at gate.ee.lsu.edu>

Missing values in feedback networks raise interesting questions:
should these values be considered "don't know" values, or should
they be generated in some "maximum likelihood" fashion? These issues
are discussed in the following paper:

##REF##
S.C. Kak, "Feedback neural networks: new characteristics and a
generalization", Circuits, Systems, Signal Processing, vol. 12,
no. 2, 1993, pp. 263-278.

------------------------------------------------------------------------

From: Zoubin Ghahramani <zoubin at psyche.mit.edu>

I have also been looking into the issue of encoding and learning from
missing values in a neural network. The issue of handling missing
values has been addressed extensively in the statistics literature for
obvious reasons.  To learn despite the missing values the data has to
be filled in, or the missing values integrated over. The basic
question is how to fill in the missing data. There are many different
methods for doing this in stats (mean imputation, regression
imputation, Bayesian methods, EM, etc.). For good reviews see (Little
and Rubin 1987; Little, 1992).

I do not in general recommend encoding "missing" as yet another value
to be learned over. Missing means something in a statistical sense --
that the input could be any of the values with some probability
distribution. You could, for example, augment the original data
filling in different values for the missing data points according to a
prior distribution. Then the training would assign different weights
to the artificially filled-in data points depending on how well they
predict the output (their posterior probability). This is essentially
the method proposed by Buntine and Weigend (1991). Other approaches
have been proposed by Tresp et al. (1993) and Ahmad and Tresp (1993).

I have just written a paper on the topic of learning from incomplete
data. In this paper I bring a statistical algorithm for learning from
incomplete data, called EM, into the framework of nonlinear function
approximation and classification with missing values. This approach
fits the data iteratively with a mixture model and uses that same
mixture model to effectively fill in any missing input or output
values at each step. 
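
[ A compressed sketch of the filling-in step only, assuming a Gaussian
  mixture has already been fitted; the full algorithm iterates this
  together with refitting the mixture:

import numpy as np

def conditional_fill(x, weights, means, covs):
    """x: 1-D array with np.nan for missing entries."""
    m = np.isnan(x)                          # missing mask
    o = ~m                                   # observed mask
    num, den = np.zeros(m.sum()), 0.0
    for w, mu, S in zip(weights, means, covs):
        Soo = S[np.ix_(o, o)]
        diff = x[o] - mu[o]
        # responsibility of this component for the observed part
        r = w * np.exp(-0.5 * diff @ np.linalg.solve(Soo, diff)) \
            / np.sqrt(np.linalg.det(2 * np.pi * Soo))
        # conditional mean of the missing part given the observed part
        num += r * (mu[m] + S[np.ix_(m, o)] @ np.linalg.solve(Soo, diff))
        den += r
    x_filled = x.copy()
    x_filled[m] = num / den
    return x_filled

weights = [0.5, 0.5]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
print(conditional_fill(np.array([2.8, np.nan]), weights, means, covs))
]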

You can obtain the preprint by 
	ftp psyche.mit.edu
	login: anonymous
	cd pub
	get zoubin.nips93.ps
To obtain code for the algorithm please contact me directly.

##REF##
Ahmad, S and Tresp, V (1993) "Some Solutions to the Missing Feature
Problem in Vision." In Hanson, S.J., Cowan, J.D., and Giles, C.L.,
editors, Advances in Neural Information Processing Systems 5. Morgan
Kaufmann Publishers, San Mateo, CA.

Buntine, WL, and Weigend, AS (1991) "Bayesian back-propagation." Complex
Systems. Vol 5 no 6 pp 603-43

Ghahramani, Z and Jordan MI (1994) "Supervised learning from
incomplete data via an EM approach" To appear in Cowan, J.D., Tesauro,
G., and Alspector,J. (eds.). Advances in Neural Information Processing
Systems 6.  Morgan Kaufmann Publishers, San Francisco, CA, 1994.

Little, RJA (1992) "Regression With Missing X's:  A Review." Journal of the
American Statistical Association.  Volume 87, Number 420. pp.
1227-1237

Little, RJA. and Rubin, DB (1987). Statistical Analysis with Missing
Data. Wiley, New York.

Tresp, V, Hollatz J, Ahmad S (1993) "Network structuring and training
using rule-based knowledge." In Hanson, S.J., Cowan, J.D., and
Giles, C.~L., editors,  Advances in Neural Information Processing
Systems 5. Morgan Kaufmann Publishers, San Mateo, CA.

------------------------------------------------------------------------

That's it.

  Lutz
  
Lutz Prechelt   (email: prechelt at ira.uka.de)            | Whenever you 
Institut fuer Programmstrukturen und Datenorganisation  | complicate things,
Universitaet Karlsruhe;  76128 Karlsruhe;  Germany      | they get
(Voice: ++49/721/608-4068, FAX: ++49/721/694092)        | less simple.

