Principled Sampling for Anomaly Detection

Author(s): Brendan Juba, Christopher Musco, Fan Long, Stelios Sidiroglou-Douskos, Martin Rinard

Download: Paper (PDF)

Date: 7 Feb 2015

Document Type: Briefing Papers

Additional Documents: Slides

Associated Event: NDSS Symposium 2015

Abstract:

We present a technique and implemented system, Cassandra, for obtaining probabilistic bounds on false positive rates for anomaly detectors that process Internet data. Using a probability distribution based on PageRank and an efficient algorithm to draw samples from that probability distribution, Cassandra computes an estimated false positive rate and a probabilistic bound on the accuracy of the estimated rate. By drawing test samples from a well-defined distribution that correlates well with data seen in practice, Cassandra improves on ad hoc methods for estimating the false positive rate of anomaly detectors. Specifically, our methods give bounds that are reproducible, comparable across different anomaly detectors, and theoretically sound. Experimental results from applying Cassandra to three anomaly detectors (SIFT, SOAP, and JSAND) show that Cassandra is efficient enough to use in practice: it can sample enough inputs to obtain tight false positive rate bounds in an acceptable amount of time. These results indicate that Cassandra can, in practice, help place anomaly detection on a stronger theoretical foundation and help practitioners better understand the behavior and consequences of the anomaly detectors that they deploy.
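
To illustrate what "an estimated false positive rate and a probabilistic bound on the accuracy of the estimated rate" can look like, the sketch below applies Hoeffding's inequality to independent samples. It is a minimal illustration, not Cassandra's implementation: the abstract does not say which concentration bound the system uses, and the names `draw_benign_sample`, `flags_as_anomaly`, `hoeffding_margin`, and `samples_needed` are hypothetical. It also assumes every sampled input is benign, so that any input the detector flags counts as a false positive.

```python
import math


def hoeffding_margin(n, delta):
    """Half-width eps such that, with probability at least 1 - delta,
    the empirical rate over n i.i.d. samples is within eps of the true rate."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def samples_needed(eps, delta):
    """Smallest n for which hoeffding_margin(n, delta) <= eps."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def estimate_false_positive_rate(draw_benign_sample, flags_as_anomaly, n, delta):
    """Estimate a detector's false positive rate from n sampled inputs.

    draw_benign_sample: hypothetical callable returning one input drawn
        i.i.d. from the test distribution (assumed benign).
    flags_as_anomaly: hypothetical callable wrapping the detector; returns
        True when the detector reports the input as anomalous.
    Returns (estimated_rate, margin).
    """
    false_positives = sum(
        1 for _ in range(n) if flags_as_anomaly(draw_benign_sample())
    )
    return false_positives / n, hoeffding_margin(n, delta)
```

Under these assumptions the bound depends only on the number of samples and the chosen confidence level, not on the detector, which is one way a bound can be reproducible and comparable across detectors. Calling `samples_needed(0.01, 0.05)` gives the number of samples that is sufficient for a margin of one percentage point at 95% confidence.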