Encoding missing values

Wed Feb 2 03:58:56 EST 1994

I am currently thinking about the problem of how to encode data with
attributes for which some of the values are missing in the data set for
neural network training and use.

An example of such data is the 'heart-disease' dataset from the UCI machine
learning database (anonymous FTP on "ics.uci.edu" [128.195.1.1], directory
"/pub/machine-learning-databases"). There are 920 records altogether with
14 attributes each. Only 299 of the records are complete, the others have one
or several missing attribute values.  11% of all values are missing.

I consider only networks that handle arbitrary numbers of real-valued inputs
here (e.g. all backpropagation-suited network types etc). I do NOT consider
missing output values. In this setting, I can think of several ways how to
encode such missing values that might be reasonable and depend on
the kind of attribute and how it was encoded in the first place:

1. Nominal attributes (that have n different possible values)
  1.1 encoded "1-of-n", i.e., one network input per possible value,
      the relevant one being 1 all others 0.
      This encoding is very general, but has the disadvantage of producing
      networks with very many connections.
      Missing values can either be represented as 'all zero' or by simply
      treating 'is missing' as just another possible input value, resulting
      in a "1-of-(n+1)" encoding.
  1.2 encoded binary, i.e.,  log2(n) inputs being used like the bits in a
    binary representation of the numbers 0...n-1 (or 1...n).
      Missing values can either be represented as just another possible input
      value (probably all-bits-zero is best) or by adding an additional network
      input which is 1 for 'is missing' and 0 for 'is present'. The original
      inputs should probably be all zero in the 'is missing' case.

2. continuous attributes (or attributes treated as continuous)
  2.1 encoded as a single network input, perhaps using some monotone
      transformation to force the values into a certain distribution.
      Missing values are either encoded as a kind of 'best guess' (e.g. the
      average of the non-missing values for this attribute) or by using
      an additional network input being 0 for 'missing' and 1 for 'present' 
      (or vice versa) and setting the original attribute input either to 0
      or to the 'best guess'. (The 'best guess' variant also applies to
      nominal attributes above)

3. binary attributes (truth values)
  3.1 encoded by one input:  0=false  1=true   or vice versa
      Treat like (2.1)
  3.2 encoded by one input:  -1=false 1=true   or vice versa
      In this case we may act as for (3.1) or may just use 0 to
      indicate 'missing'.
  3.3 treat like nominal attribute with 2 possible values

4. ordinal attributes (having n different possible values, which are ordered)
  4.1 treat either like continuous or like nominal attribute.
    If (1.2) is chosen, a Gray-Code should be used.
    Continuous representation is risky unless a 'sensible' quantification
    of the possible values is available.    

So far to my considerations. Now to my questions.

a) Can you think of other encoding methods that seem reasonable ?  Which ?

b) Do you have experience with some of these methods that is worth sharing ?

c) Have you compared any of the alternatives directly ?

------------------------------------------------------------------------

SUMMARY:

For a), the following ideas were mentioned:
1. use statistical techniques to compute replacement values from the rest
   of the data set
2. use a Boltzman machine to do this for you
3. use an autoencoder feed forward network to do this for you
4. randomize on the missing values (correct in the Bayesian sense)

For b), some experience was reported. I don't know how to summarize
that nicely, so I just don't summarize at all.

For c), no explicit quantitative results were given directly.

Some replies suggest that data is not always missing randomly.
The biases are often known and should be taken into account (e.g.
medical tests are not carried out (resulting in missing data) for
moreless healthy persons more often than for ill persons).

Many replies contained references to published work on this area,
from NN, machine learning, and mathematical statistics.
To ease searching for these references in the replies below, I have
marked them with the string ##REF## (if you have a 'grep' program that
extracts whole paragraphs, you can get them all out with one command).

Thanks to all who answered.
These are the trimmed versions of the replies:

------------------------------------------------------------------------

From: tgd at research.CS.ORST.EDU (Tom Dietterich)

[...for nominal attributes:]
An alternative here is to encode them as bit-strings in a
error-correcting code, so that the hamming distance between any two
bit strings is constant.  This would probably be better than a dense
binary encoding.  The cost in additional inputs is small.  I haven't
tried this though.  My guess is that distributed representations at
the input are a bad idea.

One must always determine WHY the value is missing.  In the heart
disease data, I believe the values were not measured because other
features were believed to be sufficient in each case.  In such cases,
the network should learn to down-weight the importance of the feature
(which can be accomplished by randomizing it---see below).

In other cases, it may be more appropriate to treat a missing value as
a separate value for the feature, e.g., in survey research, where a
subject chooses not to answer a question.

[...for continuous attributes:]
Ross Quinlan suggests encoding missing values as the mean observed
output value when the value is missing.  He has tried this in his
regression tree work.

Another obvious approach is to randomize the missing values--on each
presentation of the training example, choose a different, random,
value for each missing input feature.  This is the "right thing to do"
in the bayesian sense.

[...for binary attributes:]
I'm skeptical of the -1,0,1 encoding, but I think there is more
research to be done here.

[...for ordinal attributes:]
I would treat them as continuous.  

------------------------------------------------------------------------

From: shavlik at cs.wisc.edu (Jude W. Shavlik)

We looked at some of the methods you talked about in
the following article in the journal Machine Learning.

##REF##
%T Symbolic and Neural Network Learning Algorithms: An Experimental Comparison
%A J. W. Shavlik
%A R. J. Mooney
%A G. G. Towell
%J Machine Learning
%V 6
%N 2
%P 111-143
%D 1991

------------------------------------------------------------------------

From: hertz at nordita.dk (John Hertz)

It seems to me that the most natural way to handle missing data is to
leave them out.  You can do this if you work with a recurrent network
(fx Boltzmann machine) where the inputs are fed in by clamping the
input units to the given input values and the rest of the net relaxes
to a fixed point, after which the output is read off the output units.
If some of the input values are missing, the corresponding input units
are just left unclamped, free to relax to values most consistent with 
the known inputs. 

I have meant for a long time to try this on some medical prognosis data
I was working on, but I never got around to it, so I would be happy to
hear how it works if you try it.  

------------------------------------------------------------------------

From: jozo at sequoia.WPI.EDU (Jozo Dujmovic)

In the case of clustering benchmark programs I frequently have
the the problem of estimation of missing data. A relatively
simple SW that implements a heuristic algorithm generates
estimates having the average error of 8%. NN will somehow
"implicitly estimate" the missing data. The two approaches
might even be in some sense equivalent (?).

Jozo

[ I suspect that they are not: When you generate values for the
  missing items and put them in the training set, the network loses the
  information that this data is only estimated. Since estimations are
  not as reliable as true input data, the network will weigh inputs that
  have lots of generated values as less important. If it gets the 'is
  missing' information explicitly, it can discriminate true values from
  estimations instead. ]

------------------------------------------------------------------------

From: guy at cs.uq.oz.au

A final year student of mine worked on the problem of dealing with missing
inputs, without much success. However, the student as not very good, so take
the following opinions with a pinch of salt. 

We (very tentatively) came to the conclusion that if the inputs were
redundant, the problem was easy; if the missing input contained vital
information, the problem was pretty much impossible.

We used the heart disease data. I don't recommend it for the missing inputs
problem. All of the inputs are very good indicators of the correct result,
so missing inputs were not important. 

Apparently there is a large literature in statistics on dealing with missing
inputs. 

Anthony Adams (University of Tasmania) has published a technical report on
this. His email address is "A.Adams at cs.utas.edu.au". 

##REF##
@techreport{kn:Vamplew-91,
   author = "P. Vamplew and A. Adams", 
   address = {Hobart, Tasmania, Australia},
   institution = {Department of Computer Science, University of Tasmania},
   number = {R1-4},
   title = {Real World Problems in Backpropagation: Missing Values and Generalisability},
   year = {1991}
}

------------------------------------------------------------------------

From: Mike Southcott <mlsouth at cssip.levels.unisa.edu.au>

##REF##
I wrote a paper for the Australian conference on neural networks in 1993.
``Classification of Incomplete Data using neural networks''
Southcott, Bogner.

You may find it interesting. You may not be able to get the proceedings
for this conference, but I am in the process of digging up a postscript
copy for someone in the States, so when I do that, I will send
you a copy.

------------------------------------------------------------------------

From: Eric Saund <saund at parc.xerox.com>

I have done some work on unsupervised learning of mulitple cause 
clusters in binary data, for which an appropriate encoding scheme
is -1 = FALSE, 1 = TRUE, and 0 = NO DATA.  This has worked well
for me, but my paradigm is not your standard feedforward network
and uses a different activiation function from the standard
weighted sum followed by sigmoid squashing.
I presented the paper on this work at NIPS:

##REF##
Saund, Eric; 1994; "Unsupervised Learning of Mixtures of Multiple Causes
in Binary Data," in Advances in Neural Information Processing Systems -6-, 
Cowan, J., Tesauro, G, and Alspector, J., eds. Morgan Kaufmann, San Francisco.

------------------------------------------------------------------------

From: Thierry.Denoeux at hds.univ-compiegne.fr

In a recent mailing, Lutz Prechelt mentioned the interesting problem of how 
to encode attributes with missing values as inputs to a neural network.
I have recently been faced to that problem while applying neural nets to
rainfall prediction using weather radar images. The problem was to classify
pairs of "echoes" -- defined as groups of connected pixels with reflectivity
above some threshold -- taken from successive images as corresponding to
the same rain cell or not. Each pair of echoes was discribed by a list of
attributes. Some of these attributes, refering to the past of a sequence, were
not defined for some instances. To encode these attributes with potentially
missing values, we applied two different methods actually suggested by Lutz:

- the replacement of the missing value by a "best-guess" value 
- the addition of a binary input indicating whether the corresponding attribute
  was present or absent.

Significantly better results were obtained by the second method.

This work was presented at ICANN'93 last september:
##REF##
X. Ding, T. Denoeux & F. Helloco (1993). Tracking rain cells in radar images
using multilayer neural networks. In Proc. of ICANN'93, Springer-Verlag, 
p. 962-967.

------------------------------------------------------------------------

From: "N. Karunanithi" <karun at faline.bellcore.com>

[...for nominal attributes:]
   Both methods have the problem of poor scalability. If the number of
missing values increases then the number of additional inputs will
increase linearly in 1.1 and logarithmically in 1.2.
    In fact, 1-of-n encoding may be a poor choice if (1) the number
of input features is large and (2) such an expanded dimensional 
representation does not become a (semi) linearly separable problem.
Even if it becomes a linearly separable problem, the overall complexity
of the network can sometimes be very high.

[...for continuous attributes:]
This representation requires GUESS. A nominal transformation may not be
a proper representation in some cases. Assume that the output values
range over a large numerical interval. For example, from 0.0 to 10,000.0.  
If you use a simple scaling like dividing by 10,000.0 to make it
between 0.0 and 1.0, this will result in poor accuracy of prediction.
If the attribute is on the input side, then on theory the
scaling is unnecessary because the input layer weights will scale
accordingly. However, in practice I had lot of problem with this
approach. Maybe a log tranformation before scaling may not be a bad
choice.
If you use a closed scaling you may have problem whenever a future value
exceeds the maximum value of the numerical intervel. For example,
assume that the attribute is time, say in miliseconds. Any future time 
from the point of reference can exceed the limit. Hence any closed
scaling will not work properly.

[...for ordinal attributes:]
I have compared Binary Encoding (1.2), Gray-Coded representation and
straighforward scaling. Colsed scaling seems to do a good job. I have 
also compared open scaling and closed scaling and did find significant
improvement in prediction accuracy. 

###REF###
N. Karunanithi, D. Whitley and Y. K. Malaiya,
   "Prediction of Software Reliability Using Connectionist Models",
    IEEE Trans. Software Eng., July 1992, pp 563-574.