Encoding missing values

N. Karunanithi karun at faline.bellcore.com
Thu Feb 3 10:15:55 EST 1994



> I am currently thinking about the problem of how to encode data with
> attributes for which some of the values are missing in the data set for
> neural network training and use.

I am also facing the same problem. I would like to get a copy of the
responses.

>1. Nominal attributes (that have n different possible values)
>  1.1 encoded "1-of-n", i.e., one network input per possible value, the relevant one
>    being 1 all others 0.
>      This encoding is very general, but has the disadvantage of producing
>      networks with very many connections.
>      Missing values can either be represented as 'all zero' or by simply
>      treating 'is missing' as just another possible input value, resulting
>      in a "1-of-(n+1)" encoding.
>  1.2 encoded binary, i.e.,  log2(n) inputs being used like the bits in a
>    binary representation of the numbers 0...n-1 (or 1...n).
>      Missing values can either be represented as just another possible input
>      value (probably all-bits-zero is best) or by adding an additional network
>      input which is 1 for 'is missing' and 0 for 'is present'. The original
>      inputs should probably be all zero in the 'is missing' case.
>

   Both methods have the problem of poor scalability: if the number of
possible values (and hence of values that can be missing) increases, the
number of inputs grows linearly under 1.1 and logarithmically under 1.2.
    In fact, 1-of-n encoding may be a poor choice if (1) the number of
input features is large and (2) the expanded representation does not turn
the task into a (semi) linearly separable problem. Even if it does become
linearly separable, the overall complexity of the network can still be
very high.
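
To make the input-count difference concrete, here is a rough sketch in
Python of the two nominal encodings with missing-value handling (1.1 and
1.2 above). The helper names are made up for illustration only:

    import math

    def one_of_n_plus_1(value, categories):
        # 1-of-(n+1) encoding (1.1): one input per category plus a 'missing' slot.
        vec = [0] * (len(categories) + 1)
        vec[categories.index(value) if value is not None else len(categories)] = 1
        return vec

    def binary_with_missing_flag(value, categories):
        # Binary encoding (1.2): ceil(log2(n)) bit inputs plus one 'is missing' input.
        # All bits are zero when the value is missing.
        n_bits = max(1, math.ceil(math.log2(len(categories))))
        if value is None:
            return [0] * n_bits + [1]
        idx = categories.index(value)
        return [(idx >> i) & 1 for i in reversed(range(n_bits))] + [0]

    colors = ["red", "green", "blue", "yellow", "purple"]
    print(one_of_n_plus_1("blue", colors))           # 6 inputs, linear in n
    print(binary_with_missing_flag("blue", colors))  # 4 inputs, logarithmic in n
    print(binary_with_missing_flag(None, colors))    # missing: bits zero, flag set

Either way, each attribute contributes its own block of inputs, which is
where the scalability concern above comes from.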

>2. continuous attributes (or attributes treated as continuous)
>  2.1 encoded as a single network input, perhaps using some monotone transformation
>    to force the values into a certain distribution.
>      Missing values are either encoded as a kind of 'best guess' (e.g. the
>      average of the non-missing values for this attribute) or by using
>      an additional network input being 0 for 'missing' and 1 for 'present' 
>      (or vice versa) and setting the original attribute input either to 0
>      or to the 'best guess'. (The 'best guess' variant also applies to
>      nominal attributes above)

This representation requires a GUESS. A simple nominal transformation may
not be a proper representation in some cases. Assume that the output values
range over a large numerical interval, for example from 0.0 to 10,000.0.
If you use a simple scaling, such as dividing by 10,000.0 to map the values
into 0.0 to 1.0, the result will be poor prediction accuracy.
If the attribute is on the input side, then in theory the scaling is
unnecessary because the input-layer weights will scale accordingly. In
practice, however, I had a lot of problems with this approach. A log
transformation before scaling may not be a bad choice.
If you use a closed scaling, you may have a problem whenever a future value
exceeds the maximum of the numerical interval. For example, assume that the
attribute is time, say in milliseconds. Any future time from the point of
reference can exceed the limit; hence any closed scaling will not work
properly.
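
To illustrate the 'best guess' plus presence-flag idea from 2.1, together
with a log transformation before a simple closed scaling, here is a rough
Python/NumPy sketch (the function name and defaults are mine, not taken
from the papers cited below):

    import numpy as np

    def encode_continuous(values, log_transform=True):
        # Missing entries (None) are imputed with the mean of the observed
        # values ('best guess'); a second input flags 1 = present, 0 = missing.
        x = np.array([np.nan if v is None else float(v) for v in values])
        present = ~np.isnan(x)
        x[~present] = np.nanmean(x) if present.any() else 0.0

        # Log transform to compress a wide range (e.g. 0.0 .. 10,000.0)
        # before a closed scaling into [0, 1].  Note the caveat above:
        # the scaling breaks if a future value exceeds the maximum seen here.
        if log_transform:
            x = np.log1p(x)
        if x.max() > 0:
            x = x / x.max()
        return np.column_stack([x, present.astype(float)])

    # Example: three observations, one of them missing
    print(encode_continuous([12.0, None, 9500.0]))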

> 3. binary attributes (truth values)
>   3.1 encoded by one input:  0=false  1=true   or vice versa
>       Treat like (2.1)
>   3.2 encoded by one input:  -1=false 1=true   or vice versa
>       In this case we may act as for (3.1) or may just use 0 to indicate 'missing'.
>   3.3 treat like nominal attribute with 2 possible values

No comments.

> 4. ordinal attributes (having n different possible values, which are ordered)
>   4.1 treat either like continuous or like nominal attribute.
>     If (1.2) is chosen, a Gray-Code should be used.
>     Continuous representation is risky unless a 'sensible' quantification
>     of the possible values is available.    

I have compared binary encoding (1.2), a Gray-coded representation, and
straightforward scaling. Closed scaling seems to do a good job. I have also
compared open scaling and closed scaling and did find a significant
improvement in prediction accuracy.
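For reference, a Gray code maps adjacent ordinal values to bit vectors that
differ in exactly one bit, which is why it is suggested for (4.1)/(1.2). A
generic sketch in Python (not the exact encoding used in the papers below):

    def gray_encode(value, n_bits):
        # Standard binary-to-Gray conversion for an ordinal value 0 .. 2**n_bits - 1.
        gray = value ^ (value >> 1)
        return [(gray >> i) & 1 for i in reversed(range(n_bits))]

    # The eight ordinal levels 0..7 as 3-bit Gray codes
    for v in range(8):
        print(v, gray_encode(v, 3))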

(Refer to:
  N. Karunanithi, D. Whitley and Y. K. Malaiya, "Prediction of Software
    Reliability Using Connectionist Models", IEEE Trans. Software Eng.,
    July 1992, pp. 563-574.

  N. Karunanithi and Y. K. Malaiya, "The Scaling Problem in Neural Networks
    for Software Reliability Prediction", Proc. IEEE Int. Symposium on
    Software Reliability Engineering, Oct. 1992, pp. 776-782.)


> So far to my considerations. Now to my questions.
> 
> a) Can you think of other encoding methods that seem reasonable ?  Which ?
> 
> b) Do you have experience with some of these methods that is worth sharing ?
> 
> c) Have you compared any of the alternatives directly ?
> 
>   Lutz

 I have not found a simple solution that is general. I think representation
in general, and missing information in particular, are open problems within
connectionist research. I am not sure we will have a magic bullet for all
problems. The best approach is to come up with a specific solution for a
given problem.

-Karun


