Upper level set scan statistic for detecting arbitrarily shaped hotspots
- Environmental and Ecological Statistics, 2004
Cited by 27 (16 self)
A declared need is around for geoinformatic surveillance statistical science and software infrastructure for spatial and spatiotemporal hotspot detection. Hotspot means something unusual, anomaly, aberration, outbreak, elevated cluster, critical resource area, etc. The declared need may be for monitoring, etiology, management, or early warning. The responsible factors may be natural, accidental, or intentional. This proof-of-concept paper suggests methods and tools for hotspot detection across geographic regions and across networks. The investigation proposes development of statistical methods and tools that have immediate potential for use in critical societal areas, such as public health and disease surveillance, ecosystem health, water resources and water services, transportation networks, persistent poverty typologies and trajectories, environmental justice, biosurveillance and biosecurity, among others. We introduce, for multidisciplinary use, an innovation of the health-area-popular circle-based spatial and spatiotemporal scan statistic. Our innovation employs the notion of an upper level set, and is accordingly called the upper level set scan statistic, pointing to a sophisticated analytical and computational system as the next generation of the present day popular SaTScan. Success of surveillance rests on potential elevated cluster detection capability. But the clusters can be of any shape, and cannot be captured only by circles. This is likely to give more of false alarms and more of false sense of security. What we need is capability to detect arbitrarily shaped clusters. The proposed upper level set scan statistic innovation is expected to fill this need.
Rapid Detection of Significant Spatial Clusters
- In KDD, 2004
Cited by 23 (8 self)
Given an N × N grid of squares, where each square has a count c_ij and an underlying population p_ij, our goal is to find the rectangular region with the highest density, and to calculate its significance by randomization. An arbitrary density function D, dependent on a region's total count C and total population P, can be used. For example, if each count represents the number of disease cases occurring in that square, we can use Kulldorff's spatial scan statistic D_K to find the most significant spatial disease cluster. A naive approach to finding the maximum density region requires O(N^4) time, and is generally computationally infeasible. We present a multiresolution algorithm which partitions the grid into overlapping regions using a novel overlap-kd tree data structure, bounds the maximum score of subregions contained in each region, and prunes regions which cannot contain the maximum density region. For sufficiently dense regions, this method finds the maximum density region in O((N log N)^2) time, in practice resulting in significant (20-2000x) speedups on both real and simulated datasets.
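The naive baseline mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the paper's multiresolution algorithm: it scores every axis-aligned rectangle of the grid (hence O(N^4) regions) with the Poisson log-likelihood-ratio form of Kulldorff's statistic. The function names (`kulldorff_score`, `prefix_sums`, `rect_sum`, `naive_scan`) and the use of 2D prefix sums are assumptions made for this example.

```python
import math

def kulldorff_score(c, p, c_tot, p_tot):
    """Poisson log-likelihood ratio for a region with count c and population p.

    Returns 0 unless the region's count exceeds its expected count.
    """
    if p <= 0 or p >= p_tot:
        return 0.0
    expected = c_tot * p / p_tot
    if c <= expected:
        return 0.0
    score = c * math.log(c / expected)
    if c < c_tot:
        score += (c_tot - c) * math.log((c_tot - c) / (c_tot - expected))
    return score

def prefix_sums(grid):
    """2D prefix sums so that any rectangle total costs O(1)."""
    n = len(grid)
    s = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            s[i + 1][j + 1] = grid[i][j] + s[i][j + 1] + s[i + 1][j] - s[i][j]
    return s

def rect_sum(s, r0, c0, r1, c1):
    """Total over rows r0..r1-1 and columns c0..c1-1."""
    return s[r1][c1] - s[r0][c1] - s[r1][c0] + s[r0][c0]

def naive_scan(counts, pops):
    """Score every rectangle; return (best_score, (r0, c0, r1, c1))."""
    n = len(counts)
    cs, ps = prefix_sums(counts), prefix_sums(pops)
    c_tot, p_tot = cs[n][n], ps[n][n]
    best_score, best_rect = 0.0, None
    for r0 in range(n):
        for r1 in range(r0 + 1, n + 1):
            for c0 in range(n):
                for c1 in range(c0 + 1, n + 1):
                    score = kulldorff_score(
                        rect_sum(cs, r0, c0, r1, c1),
                        rect_sum(ps, r0, c0, r1, c1),
                        c_tot, p_tot)
                    if score > best_score:
                        best_score, best_rect = score, (r0, c0, r1, c1)
    return best_score, best_rect

# Toy 4 x 4 grid: uniform population, elevated counts in a central 2 x 2 block.
pops = [[10] * 4 for _ in range(4)]
counts = [[1, 1, 1, 1],
          [1, 5, 5, 1],
          [1, 5, 5, 1],
          [1, 1, 1, 1]]
best_score, best_rect = naive_scan(counts, pops)
```

The four nested loops are what make the exhaustive search impractical on large grids; pruning regions that cannot contain the maximum, as the abstract describes, is the way around it.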
A Fast Multi-Resolution Method for Detection of Significant Spatial Overdensities
- Advances in Neural Information Processing Systems 16, 2003
Cited by 19 (6 self)
Given an N × N grid of squares, where each square s_ij has a count c_ij and an underlying population p_ij, our goal is to find the square region S with the highest density, and to calculate the significance of this region by Monte Carlo testing. Any density measure D, which depends on the total count and total population of the region, can be used. For example, if each count c_ij represents the number of disease cases occurring in that square, we can use Kulldorff's spatial scan statistic D_K to find the most significant spatial disease cluster. A naive approach to finding the region of maximum density would be to calculate the density measure for every square region: this requires O(RN^3) calculations, where R is the number of Monte Carlo replications, and hence is generally computationally infeasible. We present a novel multi-resolution algorithm which partitions the grid into overlapping regions, bounds the maximum score of subregions contained in each region, and prunes regions which cannot contain the maximum density region. For sufficiently dense regions, this method finds the maximum density region in optimal O(RN^2) time, and in practice it results in significant (10-200x) speedups as compared to the naive approach.
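A rough sketch of the Monte Carlo testing step described in this abstract, under stated assumptions: replica grids are drawn by distributing the observed total count over cells in proportion to population (one common null model, not necessarily the one used in the paper), and the square search is done by direct summation rather than by the paper's multi-resolution method. All names (`max_square_density`, `random_replica`, `monte_carlo_p_value`) are hypothetical.

```python
import random

def max_square_density(counts, pops, density):
    """Evaluate density(C, P) on every k x k square by direct summation."""
    n = len(counts)
    best = float("-inf")
    for k in range(1, n + 1):
        for i in range(n - k + 1):
            for j in range(n - k + 1):
                cells = [(r, s) for r in range(i, i + k) for s in range(j, j + k)]
                c = sum(counts[r][s] for r, s in cells)
                p = sum(pops[r][s] for r, s in cells)
                best = max(best, density(c, p))
    return best

def random_replica(pops, c_tot, rng):
    """Assumed null model: spread c_tot cases over cells proportionally to population."""
    n = len(pops)
    cells = [(i, j) for i in range(n) for j in range(n)]
    weights = [pops[i][j] for i, j in cells]
    replica = [[0] * n for _ in range(n)]
    for i, j in rng.choices(cells, weights=weights, k=c_tot):
        replica[i][j] += 1
    return replica

def monte_carlo_p_value(counts, pops, density, replications=99, seed=0):
    """Fraction of grids (replicas plus the original) scoring at least the observed maximum."""
    rng = random.Random(seed)
    c_tot = sum(map(sum, counts))
    observed = max_square_density(counts, pops, density)
    beat = sum(
        max_square_density(random_replica(pops, c_tot, rng), pops, density) >= observed
        for _ in range(replications)
    )
    return (beat + 1) / (replications + 1)

def rate(c, p):
    """Simplest possible density measure: cases per unit population."""
    return c / p

# One cell with a very large count on an otherwise flat 4 x 4 grid.
pops = [[10] * 4 for _ in range(4)]
counts = [[1] * 4 for _ in range(4)]
counts[2][1] = 30
p_cluster = monte_carlo_p_value(counts, pops, rate)

# A perfectly flat 3 x 3 grid: no region stands out.
flat_pops = [[10] * 3 for _ in range(3)]
flat_counts = [[3] * 3 for _ in range(3)]
p_flat = monte_carlo_p_value(flat_counts, flat_pops, rate)
```

Each replication repeats the full region search, which is where the factor R in the stated running times comes from.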
Detecting significant multidimensional spatial clusters
- Advances in Neural Information Processing Systems 17, 2005
Cited by 14 (7 self)
Assume a uniform, multidimensional grid of bivariate data, where each cell of the grid has a count c_i and a baseline b_i. Our goal is to find spatial regions (d-dimensional rectangles) where the c_i are significantly higher than expected given b_i. We focus on two applications: detection of clusters of disease cases from epidemiological data (emergency department visits, over-the-counter drug sales), and discovery of regions of increased brain activity corresponding to given cognitive tasks (from fMRI data). Each of these problems can be solved using a spatial scan statistic (Kulldorff, 1997), where we compute the maximum of a likelihood ratio statistic over all spatial regions, and find the significance of this region by randomization. However, computing the scan statistic for all spatial regions is generally computationally infeasible, so we introduce a novel fast spatial scan algorithm, generalizing the 2D scan algorithm of (Neill and Moore, 2004) to arbitrary dimensions. Our new multidimensional multiresolution algorithm allows us to find spatial clusters up to 1400x faster than the naive spatial scan, without any loss of accuracy.
An information-theoretic approach to detecting changes in multi-dimensional data streams
- In Proc. Symp. on the Interface of Statistics, Computing Science, and Applications, 2006
Cited by 13 (1 self)
An important problem in processing large data streams is detecting changes in the underlying distribution that generates the data. The challenge in designing change detection schemes is making them general, scalable, and statistically sound. In this paper, we take a general, information-theoretic approach to the change detection problem, which works for multidimensional as well as categorical data. We use relative entropy, also called the Kullback-Leibler distance, to measure the difference between two given distributions. The KL-distance is known to be related to the optimal error in determining whether the two distributions are the same and draws on fundamental results in hypothesis testing. The KL-distance also generalizes traditional distance measures in statistics, and has invariance properties that make it ideally suited for comparing distributions. Our scheme is general; it is nonparametric and requires no assumptions on the underlying distributions. It employs a statistical inference procedure based on the theory of bootstrapping, which allows us to determine whether our measurements are statistically significant. The scheme is also quite flexible from a practical perspective; it can be implemented using any spatial partitioning scheme that scales well with dimensionality. In addition to providing change detections, our method generalizes Kulldorff's spatial scan statistic, allowing us to quantitatively identify specific regions in space where large changes have occurred. We provide a detailed experimental study that demonstrates the generality and efficiency of our approach with different kinds of multidimensional datasets, both synthetic and real.
1 Introduction
We are collecting and storing data in unprecedented quantities and varieties: streams, images, audio, text, metadata descriptions, and even simple numbers. Over time, these data streams change as the underlying processes that generate them change. Some changes are spurious and pertain to glitches in the data. Some are genuine, caused by changes in the underlying distributions. Some changes are gradual and some are more precipitous. We would like to detect changes in a variety of settings:
Detecting anomalous records in categorical datasets
- Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007
Cited by 13 (2 self)
We consider the problem of detecting anomalies in high arity categorical datasets. In most applications, anomalies are defined as data points that are 'abnormal'. Quite often we have access to data which consists mostly of normal records, along with a small percentage of unlabelled anomalous records. We are interested in the problem of unsupervised anomaly detection, where we use the unlabelled data for training, and detect records that do not follow the definition of normality. A standard approach is to create a model of normal data, and compare test records against it. A probabilistic approach builds a likelihood model from the training data. Records are tested for anomalousness based on the complete record likelihood given the probability model. For categorical attributes, Bayes nets give a standard representation of the likelihood. While this approach is good at finding outliers in the dataset, it often tends to detect records with attribute values that are rare. Sometimes, just detecting rare values of an attribute is not desired and such outliers are not considered as anomalies in that context. We present an alternative definition of anomalies, and propose an approach of comparing against marginal distributions of attribute subsets. We show that this is a more meaningful way of detecting anomalies, and that it performs better on semi-synthetic as well as real world datasets.
Spatial Variation in Search Engine Queries
- 2008
Cited by 11 (1 self)
Local aspects of Web search — associating Web content and queries with geography — is a topic of growing interest. However, the underlying question of how spatial variation is manifested in search queries is still not well understood. Here we develop a probabilistic framework for quantifying such spatial variation; on complete Yahoo! query logs, we find that our model is able to localize large classes of queries to within a few miles of their natural centers based only on the distribution of activity for the query. Our model provides not only an estimate of a query’s geographic center, but also a measure of its spatial dispersion, indicating whether it has highly local interest or broader regional or national appeal. We also show how variations on our model can track geographically shifting topics over time, annotate a map with each location’s “distinctive queries,” and delineate the “spheres of influence” for competing queries in the same general domain.
Multiscale Advanced Raster Map Analysis System
- Definition, Design, and Development. Invited Plenary Address at the Brazilian Ecological Congress
Cited by 8 (3 self)
of the Agency and no official endorsement should be inferred.
The Hunting of the Bump: On Maximizing Statistical Discrepancy
- In SODA, 2006
Cited by 8 (4 self)
Anomaly detection has important applications in biosurveillance and environmental monitoring. When comparing measured data to data drawn from a baseline distribution, merely finding clusters in the measured data may not actually represent true anomalies. These clusters may likely be the clusters of the baseline distribution. Hence, a discrepancy function is often used to examine how different measured data is to baseline data within a region. An anomalous region is thus defined to be one with high discrepancy. In this paper, we present algorithms for maximizing statistical discrepancy functions over the space of axis-parallel rectangles. We give provable approximation guarantees, both additive and relative, and our methods apply to any convex discrepancy function. Our algorithms work by connecting statistical discrepancy to combinatorial discrepancy; roughly speaking, we show that in order to maximize a convex discrepancy function over a class of shapes, one needs only maximize a linear discrepancy function over the same set of shapes. We derive general discrepancy functions for data generated from a one-parameter exponential family. This generalizes the widely-used Kulldorff scan statistic for data from a Poisson distribution. We present an algorithm running in O((1/ε) n^2 log^2 n) that computes the maximum discrepancy rectangle to within additive error ε, for the Kulldorff scan statistic. Similar results hold for relative error and for discrepancy functions for data coming from Gaussian, Bernoulli, and gamma distributions. Prior to our work, the best known algorithms were exact and ran in time O(n^4).

