Outliers (Was: "Some questions on training..")

Mark Plutowski pluto at cs.ucsd.edu
Wed Feb 9 17:52:55 EST 1994


------- previous message -------
Dr. Liu writes:

Outliers are the data points that come in an "unexpected" way, both
in the training data and in the future. For example, the data is collected
so that a proportion of them are typos. So as the size of the data gets
large, the number of outliers in them also gets large. Plutowski's
assumption, as I understand it, is that the ratio of the number of outliers
to the size of the data set is very small.

One way to look at a data set containing outliers is to assume the noise
is inhomoscedastic. Outlier data points have noise with large variance,
and good data points have noise with small variance (Liu 1994).
This is different from Plutowski's "homoscedasticity" assumption.
Since we have no intention of predicting the value of outliers,
robust estimation of both the parameters and the generalization error
requires the "removal" of the outliers.

This discussion, I hope, conveys the idea that when using
cross-validation to estimate generalization error,
some caution should be taken with regard to the
influence of bad data in the training data set.

------------
Yong Liu
Box 1843
Department of Physics
Institute for Brain and Neural Systems
Brown University
Providence, RI 02912

------- end previous message -------

Dear Dr Liu,

Yes, this points out the importance of examining the
assumptions carefully to ensure that they apply to your
particular learning task.  As another example of where these
results do not apply, note that the assumption of mean zero noise 
can be easily violated in discrimination tasks (often referred
to as "classification" tasks) where the noise involves
random misclassification of the target.  
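
As a minimal illustration of why the noise is not mean zero in that
setting, the sketch below assumes binary targets coded 0/1 and a
hypothetical flip probability of 0.2; the function name and all
parameter values are chosen for illustration only. The "noise" is
taken to be (observed target - true target).

```python
import random

def mean_label_noise(true_label, flip_prob, n_samples=100_000, seed=0):
    """Average of (observed - true) when a 0/1 label is flipped with
    probability flip_prob. Illustrative helper, not from any library."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        flipped = rng.random() < flip_prob
        observed = 1 - true_label if flipped else true_label
        total += observed - true_label
    return total / n_samples

# Assumed flip probability of 0.2 for both classes.
noise_mean_for_0 = mean_label_noise(0, 0.2)
noise_mean_for_1 = mean_label_noise(1, 0.2)
print(noise_mean_for_0, noise_mean_for_1)
```

Under these assumptions the conditional mean of the noise should come
out near +0.2 when the true target is 0 and near -0.2 when it is 1,
so it is zero for neither class.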

It also points out an appealing definition of "outlier".
My interpretation of this is the following:
When the noise variance on the target can depend upon the input
(in statistical jargon, referred to as "heteroscedasticity of
the conditional variance of Y_i given X_i"),
there is the possibility that a plot of the conditional
target variance over the input space could display
discontinuous jumps, corresponding to where it is more likely
to encounter targets that are much more "noisy" than
targets for neighboring inputs.   Is this accurate?
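
A rough sketch of the kind of plot I have in mind, under assumptions
that are entirely hypothetical: a one-dimensional input on [0, 1), a
known regression function sin(2*pi*x), and a noise standard deviation
that jumps from 0.1 to 1.0 on the interval [0.4, 0.5). The conditional
variance is estimated crudely by binning the inputs; the factor of 10
used to flag a bin is arbitrary.

```python
import math
import random
import statistics

def noise_sd(x):
    # Hypothetical noise profile: one narrow "noisy" region
    # inside an otherwise quiet input space.
    return 1.0 if 0.4 <= x < 0.5 else 0.1

def binned_noise_variance(n_samples=20_000, n_bins=10, seed=0):
    """Estimate Var(Y | X) per input bin from simulated residuals."""
    rng = random.Random(seed)
    residuals = [[] for _ in range(n_bins)]
    for _ in range(n_samples):
        x = rng.random()
        f_x = math.sin(2 * math.pi * x)  # assumed true regression function
        y = f_x + rng.gauss(0.0, noise_sd(x))
        b = min(int(x * n_bins), n_bins - 1)
        residuals[b].append(y - f_x)
    return [statistics.pvariance(r) for r in residuals]

variances = binned_noise_variance()
typical = statistics.median(variances)
# Flag bins whose variance is far above the typical (median) level.
noisy_bins = [b for b, v in enumerate(variances) if v > 10 * typical]
print(variances)
print(noisy_bins)
```

With this profile only the bin covering [0.4, 0.5) should stand out.
In practice the regression function is unknown, so the residuals
would have to come from a fitted model instead.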

I look forward to reading (Liu 94).  Can you (or anyone else)
point me to other references utilizing a similar definition
of "outlier?"  (IMHO) "outlier" is quite a value-laden term
that I tend to avoid since I feel it has multiple and
often ambiguous interpretations/definitions.  

I am currently doing work on detection of what I call
"offliers" since I have a precise definition of what this
means to me, and since I hesitate to use the term "outliers"
for the reason stated above.

= Mark


PS: I would appreciate further opinions/references/examples 
of what "outlier" means (either in practice or in theory) 
which I will summarize and post to the mailing list.   




More information about the Connectionists mailing list