[Research] Reminder - Thesis Proposal - Kaustav Das 4/3/07
Jeff Schneider
schneide at cs.cmu.edu
Mon Apr 2 17:19:39 EDT 2007
Hi Everyone,
Please come to Kaustav's thesis proposal Tuesday morning at 10.
Jeff.
-------- Original Message --------
Date: 4/3/07
Time: 10:00AM
Place: 1507 Newell-Simon Hall
Title: Detecting Anomalous Records in Large Categorical Datasets
Speaker: Kaustav Das, PhD candidate
Advisor: Jeff Schneider
Abstract:
We consider the problem of detecting anomalies in high-dimensional
categorical datasets. In most applications, anomalies are defined as
data points that are 'abnormal'. Quite often we have access to data
which consists mostly of normal records, along with a small percentage
of unlabelled anomalous records. We are interested in the problem of
unsupervised anomaly detection, where we use the unlabelled data for
training, and detect records that do not follow the definition of normality.
A standard approach is to create a model of normal data, and compare
test records against it. A probabilistic approach builds a likelihood
model from the training data. Records are tested for anomalies based on
the complete record likelihood given the probability model. For
categorical attributes, Bayes nets give a standard representation of the
likelihood. While this approach is good at finding outliers in the
dataset, it often tends to detect records with attribute values that are
rare. Sometimes, just detecting rare values of an attribute is not
desired and such outliers are not considered as anomalies in that
context. In this thesis we present an alternative definition of
anomalies, and propose an approach that compares records against the
marginal distributions of attribute subsets. We show that this is a more
meaningful way of detecting anomalies, and that it performs better on
semi-synthetic as well as real-world datasets. We propose to extend this
method to detecting anomalous groups of records. We also propose to
incorporate user feedback in a semi-supervised learning framework.
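
The contrast drawn in the abstract can be illustrated with a small sketch.
This is not the method from the thesis: the baseline below uses an
independence model as a stand-in for a Bayes net, and the alternative score
(the smallest ratio of a pairwise joint frequency to the product of its
marginals) is only one possible reading of "comparing against marginal
distributions of attribute subsets". All function names, the smoothing
constant alpha, and the toy data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
import math


def fit_counts(records):
    """Count values per attribute and value pairs per attribute pair."""
    n_attrs = len(records[0])
    single = [Counter(r[i] for r in records) for i in range(n_attrs)]
    pair = {
        (i, j): Counter((r[i], r[j]) for r in records)
        for i, j in combinations(range(n_attrs), 2)
    }
    return single, pair, len(records)


def log_likelihood(record, single, n, alpha=1.0):
    """Baseline score: full-record log-likelihood under an independence
    model (an assumed simplification of a Bayes net), add-alpha smoothed."""
    total = 0.0
    for i, value in enumerate(record):
        k = len(single[i]) + 1  # observed values plus one slot for unseen ones
        total += math.log((single[i][value] + alpha) / (n + alpha * k))
    return total


def min_pair_ratio(record, single, pair, n, alpha=1.0):
    """Alternative score (assumed form): the smallest smoothed ratio
    f(a, b) / (f(a) * f(b)) over all attribute pairs. The f values are
    smoothed frequencies, not exactly normalized probabilities."""
    best = float("inf")
    for (i, j), counts in pair.items():
        f_ab = (counts[(record[i], record[j])] + alpha) / (n + alpha)
        f_a = (single[i][record[i]] + alpha) / (n + alpha)
        f_b = (single[j][record[j]] + alpha) / (n + alpha)
        best = min(best, f_ab / (f_a * f_b))
    return best


# Toy training data: "green" is rare; "red" never occurs with "square".
train = (
    [("red", "circle")] * 48
    + [("blue", "square")] * 48
    + [("green", "circle")] * 4
)
single, pair, n = fit_counts(train)

rare_value = ("green", "circle")  # rare value, but a combination seen before
odd_combo = ("red", "square")     # common values in an unseen combination

for rec in (rare_value, odd_combo):
    print(rec, log_likelihood(rec, single, n), min_pair_ratio(rec, single, pair, n))
```

On this toy data the likelihood baseline ranks the rare-valued record as
the more anomalous of the two, while the ratio score stays above 1 for it
and drops well below 1 for the unusual combination of common values.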
Committee:
Jeff Schneider (Chair)
Gregory Cooper (University of Pittsburgh)
Geoffrey Gordon
Christos Faloutsos
--
*******************************************************************
Diane Stidle
Business & Graduate Programs Manager
Machine Learning Department
School of Computer Science
4612 Wean Hall
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213-3891
Phone: 412-268-1299
Fax: 412-268-3431
Email: diane at cs.cmu.edu
URL: http://www.ml.cmu.edu