Encoding missing values

Lutz Prechelt prechelt at ira.uka.de
Wed Feb 2 03:58:56 EST 1994


I am currently thinking about how to encode data for neural network
training and use when some of the attribute values are missing in the
data set.

An example of such data is the 'heart-disease' dataset from the UCI machine
learning database (anonymous FTP on "ics.uci.edu" [128.195.1.1], directory
"/pub/machine-learning-databases"). There are 920 records altogether with
14 attributes each. Only 299 of the records are complete; the others have one
or more missing attribute values. 11% of all values are missing.

I consider here only networks that handle arbitrary numbers of real-valued
inputs (e.g. all network types suited to backpropagation). I do NOT consider
missing output values. In this setting, I can think of several ways to
encode such missing values that might be reasonable; which ones apply depends
on the kind of attribute and how it was encoded in the first place:

1. Nominal attributes (that have n different possible values)
  1.1 encoded "1-of-n", i.e., one network input per possible value, the relevant one
    being 1 and all others 0.
      This encoding is very general, but has the disadvantage of producing
      networks with very many connections.
      Missing values can either be represented as 'all zero' or by simply
      treating 'is missing' as just another possible input value, resulting
      in a "1-of-(n+1)" encoding.
  1.2 encoded binary, i.e., ceil(log2(n)) inputs being used like the bits in a
    binary representation of the numbers 0...n-1 (or 1...n).
      Missing values can either be represented as just another possible input
      value (probably all-bits-zero is best, with the real values numbered
      1...n, which may need one more bit) or by adding an additional network
      input which is 1 for 'is missing' and 0 for 'is present'. The original
      inputs should probably be all zero in the 'is missing' case.
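
A minimal sketch of the variants in (1.1) and (1.2), in Python. The function
names, the example categories and the use of None as the 'missing' marker are
assumptions made for illustration only; they are not part of any dataset
format.

```python
MISSING = None  # assumed marker for a missing attribute value


def one_of_n(value, categories, extra_missing_unit=False):
    """1-of-n code. A missing value gives all zeros, or, if
    extra_missing_unit is set, switches on an additional (n+1)-th input."""
    width = len(categories) + (1 if extra_missing_unit else 0)
    code = [0.0] * width
    if value is MISSING:
        if extra_missing_unit:
            code[-1] = 1.0  # 'is missing' treated as one more value
        return code
    code[categories.index(value)] = 1.0
    return code


def binary_with_flag(value, categories):
    """Binary code of the category index (0...n-1) followed by one
    'is missing' input; the code bits are all zero when the value is missing."""
    n_bits = max(1, (len(categories) - 1).bit_length())  # ceil(log2(n))
    if value is MISSING:
        return [0.0] * n_bits + [1.0]
    index = categories.index(value)
    bits = [float((index >> b) & 1) for b in reversed(range(n_bits))]
    return bits + [0.0]


colours = ["red", "green", "blue"]          # hypothetical nominal attribute
print(one_of_n("green", colours))           # 1-of-n
print(one_of_n(MISSING, colours))           # 'all zero' variant
print(one_of_n(MISSING, colours, True))     # 1-of-(n+1) variant
print(binary_with_flag("blue", colours))    # two code bits plus flag
print(binary_with_flag(MISSING, colours))   # zero code bits, flag set
```

In the flag variant, the all-zero code bits of a missing value coincide with
those of the first category; only the extra input tells the two apart.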

2. Continuous attributes (or attributes treated as continuous)
  2.1 encoded as a single network input, perhaps using some monotone transformation
    to force the values into a certain distribution.
      Missing values are either encoded as a kind of 'best guess' (e.g. the
      average of the non-missing values for this attribute) or by using
      an additional network input that is 0 for 'missing' and 1 for 'present'
      (or vice versa) and setting the original attribute input either to 0
      or to the 'best guess'. (The 'best guess' variant also applies to the
      nominal attributes above.)
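
The variants in (2.1) could be sketched as follows, assuming None marks a
missing value, the mean serves as the 'best guess', and at least one value
of the attribute is present. The function name and its parameters are
made up for this example.

```python
def encode_continuous(values, use_flag=True, fill_with_mean=True):
    """Encode one continuous attribute for all records.

    Missing entries (None) are replaced by the mean of the present values
    or by 0. With use_flag, a second input is added per record:
    0 for 'missing', 1 for 'present'.
    """
    present = [float(v) for v in values if v is not None]
    mean = sum(present) / len(present)  # assumes at least one present value
    fill = mean if fill_with_mean else 0.0
    rows = []
    for v in values:
        x = fill if v is None else float(v)
        if use_flag:
            rows.append([x, 0.0 if v is None else 1.0])
        else:
            rows.append([x])
    return rows


column = [1.0, None, 3.0]  # hypothetical attribute values of three records
print(encode_continuous(column))                         # best guess + flag
print(encode_continuous(column, fill_with_mean=False))   # zero + flag
print(encode_continuous(column, use_flag=False))         # best guess only
```

If a monotone transformation is applied to the attribute, the fill value
would presumably have to be computed after that transformation.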

3. Binary attributes (truth values)
  3.1 encoded by one input:  0=false  1=true   or vice versa
      Treat like (2.1).
  3.2 encoded by one input:  -1=false 1=true   or vice versa
      In this case we may act as for (3.1) or may just use 0 to indicate 'missing'.
  3.3 treated like a nominal attribute with 2 possible values
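
A small sketch of the 'use 0 for missing' option in (3.2), again assuming
None as the missing marker; the function name is invented for this example.

```python
def encode_truth_value(value):
    """Encode a truth value as -1 (false) or 1 (true), and 0 if missing."""
    if value is None:
        return 0.0
    return 1.0 if value else -1.0


print([encode_truth_value(v) for v in (True, False, None)])
```

With this coding a missing value contributes nothing to the weighted input
sums of the units it feeds, since its input is 0.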

4. Ordinal attributes (having n different possible values, which are ordered)
  4.1 treated either like a continuous or like a nominal attribute.
    If (1.2) is chosen, a Gray code should be used, so that neighbouring
    values differ in only one bit.
    A continuous representation is risky unless a 'sensible' quantification
    of the possible values is available.
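
To illustrate the Gray code suggestion in (4.1), here is a sketch using the
standard reflected binary Gray code; the function name is an assumption, and
the ordinal values are assumed to be numbered 0...n-1 in their natural order.

```python
def gray_code_bits(index, n_values):
    """Reflected binary Gray code of index (0...n_values-1) as a list of
    0/1 inputs, most significant bit first."""
    n_bits = max(1, (n_values - 1).bit_length())  # ceil(log2(n_values))
    gray = index ^ (index >> 1)
    return [float((gray >> b) & 1) for b in reversed(range(n_bits))]


# Hypothetical ordinal attribute with 4 ordered values.
for i in range(4):
    print(i, gray_code_bits(i, 4))
```

Neighbouring values differ in exactly one input, whereas in the plain binary
code the step from 1 to 2 changes two bits at once.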

So much for my considerations. Now to my questions.

a) Can you think of other encoding methods that seem reasonable? Which?

b) Do you have experience with some of these methods that is worth sharing?

c) Have you compared any of the alternatives directly?

  Lutz

Lutz Prechelt   (email: prechelt at ira.uka.de)            | Whenever you 
Institut fuer Programmstrukturen und Datenorganisation  | complicate things,
Universitaet Karlsruhe;  76128 Karlsruhe;  Germany      | they get
(Voice: ++49/721/608-4068, FAX: ++49/721/694092)        | less simple.


More information about the Connectionists mailing list