[Research] Next Auton Lab meeting: Tuesday Feb 26, the usual time and place + an exciting topic!

Tue Feb 19 17:14:55 EST 2008

Linear-Time Subset Scanning
Daniel B. Neill (neill at cs.cmu.edu)

The scan statistic is a commonly used and powerful framework for 
detecting anomalous patterns.  The typical scan statistic approach is to 
define a fixed set of "search regions" (each search region represents a 
subset of the data, which may be of potential interest), find regions 
that maximize some likelihood ratio statistic, and compute the 
statistical significance (or alternatively, posterior probability) of 
each region.  Since there are exponentially many subsets of the data, an 
exhaustive search over all such subsets is computationally infeasible.  
Thus a typical approach is to constrain the search regions (e.g. 
searching over circles or rectangles for spatial data), but any chosen 
set of search regions will have low power to detect patterns that do not 
correspond to these regions.

However, many commonly used scan statistics (including the Kulldorff, 
expectation-based, and nonparametric statistics) have an intriguing 
property which we call "linear-time subset scanning" (LTSS): the group 
which maximizes the likelihood ratio statistic can be found by ordering 
the data records from most to least relevant, and only searching groups 
consisting of the top-k most interesting records, requiring linear 
rather than exponential time.  If we have a uniform prior over all 
subsets, the group found by LTSS also maximizes the posterior 
probability.  However, we often want to maximize the posterior under 
hard constraints (e.g. some regions have zero priors) and/or soft 
constraints (e.g. some regions have higher priors than others).  In 
these cases, the unconstrained optimum can be used as an upper bound on 
the constrained optimum for branch-and-bound search, or as an informed 
starting point for greedy heuristic search.

This work is still in the very early stages, so I'd like to lead an 
informal brainstorming session of how LTSS can be used to find either 
exact or approximate solutions to various constrained subset scan problems:

1. Network worm detection, searching over subnets of the network
2. Fast spatial scan over all distinct rectangular regions
3. Fast graph scan over all distinct connected subgraphs
4. What if we want to maximize the posterior over all subsets, with a 
non-uniform prior based on self-similarity?
5. What if we are given an arbitrary, but structured, list of "valid" 
subsets, and want to maximize over these?

And many other questions are worth considering, including:

* How can LTSS be applied to multivariate pattern detection, with 
multiple event types and multiple data streams?
* What are the necessary and sufficient conditions for a scan statistic 
to have the LTSS property?
* How is LTSS related to submodularity?