[Research] Next Auton Lab meeting: Tuesday Feb 26, the usual time and place + an exciting topic!
Artur Dubrawski
awd at cs.cmu.edu
Tue Feb 19 17:14:55 EST 2008
Linear-Time Subset Scanning
Daniel B. Neill (neill at cs.cmu.edu)
The scan statistic is a commonly used and powerful framework for
detecting anomalous patterns. The typical scan statistic approach is to
define a fixed set of "search regions" (each search region represents a
subset of the data, which may be of potential interest), find regions
that maximize some likelihood ratio statistic, and compute the
statistical significance (or alternatively, posterior probability) of
each region. Since there are exponentially many subsets of the data, an
exhaustive search over all such subsets is computationally infeasible.
Thus a typical approach is to constrain the search regions (e.g.
searching over circles or rectangles for spatial data), but any chosen
set of search regions will have low power to detect patterns that do not
correspond to these regions.
However, many commonly used scan statistics (including the Kulldorff,
expectation-based, and nonparametric statistics) have an intriguing
property which we call "linear-time subset scanning" (LTSS): the group
which maximizes the likelihood ratio statistic can be found by ordering
the data records from most to least relevant, and only searching groups
consisting of the top-k most interesting records, requiring linear
rather than exponential time. If we have a uniform prior over all
subsets, the group found by LTSS also maximizes the posterior
probability. However, we often want to maximize the posterior under
hard constraints (e.g. some regions have zero priors) and/or soft
constraints (e.g. some regions have higher priors than others). In
these cases, the unconstrained optimum can be used as an upper bound on
the constrained optimum for branch-and-bound search, or as an informed
starting point for greedy heuristic search.
This work is still in the very early stages, so I'd like to lead an
informal brainstorming session of how LTSS can be used to find either
exact or approximate solutions to various constrained subset scan problems:
1. Network worm detection, searching over subnets of the network
2. Fast spatial scan over all distinct rectangular regions
3. Fast graph scan over all distinct connected subgraphs
4. What if we want to maximize the posterior over all subsets, with a
non-uniform prior based on self-similarity?
5. What if we are given an arbitrary, but structured, list of "valid"
subsets, and want to maximize over these?
And many other questions are worth considering, including:
* How can LTSS be applied to multivariate pattern detection, with
multiple event types and multiple data streams?
* What are the necessary and sufficient conditions for a scan statistic
to have the LTSS property?
* How is LTSS related to submodularity?
More information about the Autonlab-research
mailing list