Results 1–10 of 130
Upper level set scan statistic for detecting arbitrarily shaped hotspots
Environmental and Ecological Statistics, 2004
"... A declared need is around for geoinformatic surveillance statistical science and software infrastructure for spatial and spatiotemporal hotspot detection. Hotspot means something unusual, anomaly, aberration, outbreak, elevated cluster, critical resource area, etc. The declared need may be for monit ..."
Abstract

Cited by 30 (16 self)
 Add to MetaCart
There is a declared need for geoinformatic surveillance statistical science and software infrastructure for spatial and spatiotemporal hotspot detection. A hotspot is something unusual: an anomaly, aberration, outbreak, elevated cluster, critical resource area, etc. The need may be for monitoring, etiology, management, or early warning, and the responsible factors may be natural, accidental, or intentional. This proof-of-concept paper suggests methods and tools for hotspot detection across geographic regions and across networks. The investigation proposes development of statistical methods and tools with immediate potential for use in critical societal areas, such as public health and disease surveillance, ecosystem health, water resources and water services, transportation networks, persistent poverty typologies and trajectories, environmental justice, biosurveillance, and biosecurity, among others. We introduce, for multidisciplinary use, an innovation of the circle-based spatial and spatiotemporal scan statistic popular in health applications. Our innovation employs the notion of an upper level set, and is accordingly called the upper level set scan statistic, pointing to a sophisticated analytical and computational system as the next generation of the presently popular SaTScan. Success of surveillance rests on the capability to detect elevated clusters. But clusters can be of any shape and cannot be captured by circles alone; restricting to circles is likely to yield more false alarms and a false sense of security. What is needed is the capability to detect arbitrarily shaped clusters, and the proposed upper level set scan statistic is expected to fill this need.
Rapid Detection of Significant Spatial Clusters
In KDD, 2004
"... Given an NN grid of squares, where each square has a count c i j and an underlying population p i j , our goal is to find the rectangular region with the highest density, and to calculate its significance by randomization. An arbitrary density function D, dependent on a region 's total count C and t ..."
Abstract

Cited by 28 (9 self)
 Add to MetaCart
Given an N × N grid of squares, where each square has a count c_ij and an underlying population p_ij, our goal is to find the rectangular region with the highest density, and to calculate its significance by randomization. An arbitrary density function D, dependent on a region's total count C and total population P, can be used. For example, if each count represents the number of disease cases occurring in that square, we can use Kulldorff's spatial scan statistic D_K to find the most significant spatial disease cluster. A naive approach to finding the maximum density region requires O(N^4) time, and is generally computationally infeasible. We present a multiresolution algorithm which partitions the grid into overlapping regions using a novel overlap-kd tree data structure, bounds the maximum score of subregions contained in each region, and prunes regions which cannot contain the maximum density region. For sufficiently dense regions, this method finds the maximum density region in O(N^2) time, in practice resulting in significant (20–2000x) speedups on both real and simulated datasets.
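The naive baseline this paper improves on is easy to state concretely. The sketch below (function names and grid setup are mine, not from the paper) evaluates Kulldorff's Poisson likelihood ratio for every axis-aligned rectangle using 2D prefix sums, which is exactly the O(N^4) exhaustive scan:

```python
import numpy as np

def kulldorff_score(c, b, C, B):
    """Kulldorff's Poisson log-likelihood ratio for a region with count c
    and baseline b, given grid totals C and B. Zero unless the region's
    rate exceeds the rate outside it."""
    if c <= 0 or b <= 0 or b >= B:
        return 0.0
    inside = c / b
    outside = (C - c) / (B - b)
    if inside <= outside:
        return 0.0
    score = c * np.log(inside) - C * np.log(C / B)
    if C > c:  # guard the outside term when the region holds all counts
        score += (C - c) * np.log(outside)
    return score

def max_rectangle_scan(counts, baselines):
    """Naive exhaustive scan: score every axis-aligned rectangle via 2D
    prefix sums; returns (best score, (i1, i2, j1, j2)) with half-open
    row/column ranges."""
    N = counts.shape[0]
    cs = np.zeros((N + 1, N + 1))
    bs = np.zeros((N + 1, N + 1))
    cs[1:, 1:] = counts.cumsum(0).cumsum(1)
    bs[1:, 1:] = baselines.cumsum(0).cumsum(1)
    C, B = cs[N, N], bs[N, N]
    best_score, best_rect = 0.0, None
    for i1 in range(N):
        for i2 in range(i1 + 1, N + 1):
            for j1 in range(N):
                for j2 in range(j1 + 1, N + 1):
                    # rectangle sums in O(1) from the prefix tables
                    c = cs[i2, j2] - cs[i1, j2] - cs[i2, j1] + cs[i1, j1]
                    b = bs[i2, j2] - bs[i1, j2] - bs[i2, j1] + bs[i1, j1]
                    s = kulldorff_score(c, b, C, B)
                    if s > best_score:
                        best_score, best_rect = s, (i1, i2, j1, j2)
    return best_score, best_rect
```

The multiresolution algorithm in the paper prunes most of these rectangles by bounding region scores; this brute-force version is only practical for small N.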
Spatial Variation in Search Engine Queries
2008
"... Local aspects of Web search — associating Web content and queries with geography — is a topic of growing interest. However, the underlying question of how spatial variation is manifested in search queries is still not well understood. Here we develop a probabilistic framework for quantifying such sp ..."
Abstract

Cited by 28 (1 self)
 Add to MetaCart
Local aspects of Web search — associating Web content and queries with geography — are a topic of growing interest. However, the underlying question of how spatial variation is manifested in search queries is still not well understood. Here we develop a probabilistic framework for quantifying such spatial variation; on complete Yahoo! query logs, we find that our model is able to localize large classes of queries to within a few miles of their natural centers based only on the distribution of activity for the query. Our model provides not only an estimate of a query’s geographic center, but also a measure of its spatial dispersion, indicating whether it has highly local interest or broader regional or national appeal. We also show how variations on our model can track geographically shifting topics over time, annotate a map with each location’s “distinctive queries,” and delineate the “spheres of influence” for competing queries in the same general domain.
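The paper fits a probabilistic model of query likelihood as a function of distance from a center; as a much simpler stand-in, one can estimate a center and a dispersion measure directly from observed activity locations. This is an illustrative proxy, not the authors' model:

```python
import math
import statistics

def query_center_and_dispersion(points):
    """Estimate a query's geographic center (coordinate-wise mean of its
    activity locations) and its spatial dispersion (median distance of
    activity from that center). A small median distance suggests highly
    local interest; a large one suggests regional or national appeal.

    `points` is a list of (x, y) coordinates of query activity."""
    cx = statistics.fmean(p[0] for p in points)
    cy = statistics.fmean(p[1] for p in points)
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    return (cx, cy), statistics.median(dists)
```

The actual model in the paper instead fits the probability of issuing the query as a decaying function of distance, so its dispersion parameter is scale-free rather than a raw distance.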
A Fast Multi-Resolution Method for Detection of Significant Spatial Overdensities
Advances in Neural Information Processing Systems 16, 2003
"... Given an N N grid of squares, where each square s ij has a count c ij and an underlying population p ij , our goal is to nd the square region S with the highest density, and to calculate the signi cance of this region by Monte Carlo testing. Any density measure D, which depends on the total count ..."
Abstract

Cited by 24 (7 self)
 Add to MetaCart
Given an N × N grid of squares, where each square s_ij has a count c_ij and an underlying population p_ij, our goal is to find the square region S with the highest density, and to calculate the significance of this region by Monte Carlo testing. Any density measure D, which depends on the total count and total population of the region, can be used. For example, if each count c_ij represents the number of disease cases occurring in that square, we can use Kulldorff's spatial scan statistic D_K to find the most significant spatial disease cluster. A naive approach to finding the region of maximum density would be to calculate the density measure for every square region: this requires O(RN^4) calculations, where R is the number of Monte Carlo replications, and hence is generally computationally infeasible. We present a novel multiresolution algorithm which partitions the grid into overlapping regions, bounds the maximum score of subregions contained in each region, and prunes regions which cannot contain the maximum density region. For sufficiently dense regions, this method finds the maximum density region in optimal O(RN^2) time, and in practice it results in significant (10–200x) speedups as compared to the naive approach.
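The Monte Carlo testing step mentioned here can be illustrated generically: redraw each cell's count from Poisson(baseline) under the null hypothesis of no clustering, re-run the scan on each replica, and rank the observed maximum score. A minimal sketch, with a made-up `scan_fn` interface rather than the paper's actual code:

```python
import numpy as np

def monte_carlo_pvalue(scan_fn, counts, baselines, R=999, seed=0):
    """Randomization test for a scan statistic. Under the null, each
    cell's count is drawn from Poisson(baseline); the p-value is the
    rank of the observed maximum score among the replica maxima,
    computed as (beats + 1) / (R + 1).

    `scan_fn(counts, baselines)` returns the grid's maximum region score."""
    rng = np.random.default_rng(seed)
    observed = scan_fn(counts, baselines)
    beats = sum(scan_fn(rng.poisson(baselines), baselines) >= observed
                for _ in range(R))
    return (beats + 1) / (R + 1)
```

Because the scan must be re-run on every replica grid, the R factor multiplies the whole cost — which is why the abstract quotes complexities of O(RN^4) and O(RN^2) rather than per-grid figures.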
An information-theoretic approach to detecting changes in multi-dimensional data streams
In Proc. Symp. on the Interface of Statistics, Computing Science, and Applications, 2006
"... Abstract An important problem in processing large data streams is detecting changes in the underlying distribution that generates the data. The challenge in designing change detection schemes is making them general, scalable, and statistically sound. In this paper, we take a general,informationthe ..."
Abstract

Cited by 22 (1 self)
 Add to MetaCart
An important problem in processing large data streams is detecting changes in the underlying distribution that generates the data. The challenge in designing change detection schemes is making them general, scalable, and statistically sound. In this paper, we take a general, information-theoretic approach to the change detection problem, which works for multi-dimensional as well as categorical data. We use relative entropy, also called the Kullback-Leibler distance, to measure the difference between two given distributions. The KL-distance is known to be related to the optimal error in determining whether the two distributions are the same, and draws on fundamental results in hypothesis testing. The KL-distance also generalizes traditional distance measures in statistics, and has invariance properties that make it ideally suited for comparing distributions. Our scheme is general; it is nonparametric and requires no assumptions on the underlying distributions. It employs a statistical inference procedure based on the theory of bootstrapping, which allows us to determine whether our measurements are statistically significant. The scheme is also quite flexible from a practical perspective; it can be implemented using any spatial partitioning scheme that scales well with dimensionality. In addition to providing change detections, our method generalizes Kulldorff's spatial scan statistic, allowing us to quantitatively identify specific regions in space where large changes have occurred. We provide a detailed experimental study that demonstrates the generality and efficiency of our approach with different kinds of multi-dimensional datasets, both synthetic and real.

1. Introduction. We are collecting and storing data in unprecedented quantities and varieties: streams, images, audio, text, metadata descriptions, and even simple numbers. Over time, these data streams change as the underlying processes that generate them change. Some changes are spurious and pertain to glitches in the data; some are genuine, caused by changes in the underlying distributions; some changes are gradual and some are more precipitous. We would like to detect changes in a variety of settings.
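A heavily simplified version of the bootstrap-based KL test described here might look like the following. The binning scheme, smoothing constant, and resampling procedure are illustrative choices, not the paper's exact algorithm:

```python
import numpy as np

def kl_distance(p_counts, q_counts, eps=1e-9):
    """KL divergence between two histograms, with additive smoothing so
    every bin stays strictly positive."""
    p = (p_counts + eps) / (p_counts + eps).sum()
    q = (q_counts + eps) / (q_counts + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def change_detected(window_a, window_b, n_bins=20, n_boot=500,
                    alpha=0.05, seed=0):
    """Histogram both windows on a shared binning, then compare the
    observed KL distance against a bootstrap null distribution: resample
    both windows from the pooled data (so no change is present by
    construction) and record the KL distances that arise by chance."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([window_a, window_b])
    edges = np.histogram_bin_edges(pooled, bins=n_bins)
    ha, _ = np.histogram(window_a, bins=edges)
    hb, _ = np.histogram(window_b, bins=edges)
    observed = kl_distance(ha, hb)
    null = []
    for _ in range(n_boot):
        ra = rng.choice(pooled, size=len(window_a), replace=True)
        rb = rng.choice(pooled, size=len(window_b), replace=True)
        null.append(kl_distance(np.histogram(ra, bins=edges)[0],
                                np.histogram(rb, bins=edges)[0]))
    return bool(observed > np.quantile(null, 1 - alpha))
```

The paper's actual scheme uses spatial partitioning structures (rather than fixed 1D bins) to scale to multi-dimensional data, but the bootstrap significance logic is of this general shape.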
Detecting anomalous records in categorical datasets
Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007
"... We consider the problem of detecting anomalies in high arity categorical datasets. In most applications, anomalies are defined as data points that are ’abnormal’. Quite often we have access to data which consists mostly of normal records, along with a small percentage of unlabelled anomalous records ..."
Abstract

Cited by 17 (2 self)
 Add to MetaCart
We consider the problem of detecting anomalies in high-arity categorical datasets. In most applications, anomalies are defined as data points that are 'abnormal'. Quite often we have access to data which consists mostly of normal records, along with a small percentage of unlabelled anomalous records. We are interested in the problem of unsupervised anomaly detection, where we use the unlabelled data for training and detect records that do not follow the definition of normality. A standard approach is to create a model of normal data and compare test records against it. A probabilistic approach builds a likelihood model from the training data; records are tested for anomalousness based on the complete record likelihood given the probability model. For categorical attributes, Bayes nets give a standard representation of the likelihood. While this approach is good at finding outliers in the dataset, it often tends to detect records with attribute values that are merely rare. Sometimes detecting rare values of an attribute is not desired, and such outliers are not considered anomalies in that context. We present an alternative definition of anomalies and propose an approach of comparing against marginal distributions of attribute subsets. We show that this is a more meaningful way of detecting anomalies, with better performance on semi-synthetic as well as real-world datasets.
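One simple way to realize "comparing against marginal distributions of attribute subsets" is to score each record by how rare its attribute-value *pairs* are relative to how common the individual values are. This is an illustrative sketch of the idea, not the paper's exact method:

```python
from collections import Counter
from itertools import combinations

def pairwise_anomaly_score(train, record):
    """Score a record by the minimum over attribute pairs of
    P(a_i, a_j) / (P(a_i) * P(a_j)), estimated from the training data.
    A low score flags records whose individual values are common but
    whose combination is unusual -- exactly the case that a rare-value
    detector misses. `train` is a list of equal-length tuples."""
    n = len(train)
    singles = [Counter(r[k] for r in train) for k in range(len(record))]
    score = float("inf")
    for i, j in combinations(range(len(record)), 2):
        joint = sum(1 for r in train
                    if r[i] == record[i] and r[j] == record[j]) / n
        pi = singles[i][record[i]] / n
        pj = singles[j][record[j]] / n
        if pi > 0 and pj > 0:
            score = min(score, joint / (pi * pj))
    return score
```

A score near 1 means the pair occurs about as often as independence predicts; a score near 0 means the combination is anomalously rare even though both values are individually ordinary.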
Detecting significant multidimensional spatial clusters
Advances in Neural Information Processing Systems 17, 2005
"... Assume a uniform, multidimensional grid of bivariate data, where each cell of the grid has a count ci and a baseline bi. Our goal is to find spatial regions (ddimensional rectangles) where the ci are significantly higher than expected given bi. We focus on two applications: detection of clusters of ..."
Abstract

Cited by 17 (8 self)
 Add to MetaCart
Assume a uniform, multidimensional grid of bivariate data, where each cell of the grid has a count c_i and a baseline b_i. Our goal is to find spatial regions (d-dimensional rectangles) where the c_i are significantly higher than expected given b_i. We focus on two applications: detection of clusters of disease cases from epidemiological data (emergency department visits, over-the-counter drug sales), and discovery of regions of increased brain activity corresponding to given cognitive tasks (from fMRI data). Each of these problems can be solved using a spatial scan statistic (Kulldorff, 1997), where we compute the maximum of a likelihood ratio statistic over all spatial regions, and find the significance of this region by randomization. However, computing the scan statistic for all spatial regions is generally computationally infeasible, so we introduce a novel fast spatial scan algorithm, generalizing the 2D scan algorithm of Neill and Moore (2004) to arbitrary dimensions. Our new multidimensional multiresolution algorithm allows us to find spatial clusters up to 1400x faster than the naive spatial scan, without any loss of accuracy.
Fast Subset Scan for Spatial Pattern Detection
J. Royal Statistical Society B
"... Summary. We propose a new ‘fast subset scan ’ approach for accurate and computationally efficient event detection in massive data sets. We treat event detection as a search over subsets of data records, finding the subset which maximizes some score function. We prove that many commonly used function ..."
Abstract

Cited by 17 (6 self)
 Add to MetaCart
Summary. We propose a new 'fast subset scan' approach for accurate and computationally efficient event detection in massive data sets. We treat event detection as a search over subsets of data records, finding the subset which maximizes some score function. We prove that many commonly used functions (e.g. Kulldorff's spatial scan statistic and extensions) satisfy the 'linear time subset scanning' property, enabling exact and efficient optimization over subsets. In the spatial setting, we demonstrate that proximity-constrained subset scans substantially improve the timeliness and accuracy of event detection, detecting emerging outbreaks of disease two days faster than existing methods. Keywords: Algorithms; Disease surveillance; Event detection; Scan statistics; Spatial scan
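The linear-time subset scanning property is easy to demonstrate for Kulldorff's statistic: sort locations by the priority G(s_i) = c_i / b_i and evaluate only the N prefix subsets, which the LTSS property guarantees include the optimal subset. A minimal sketch (function names are mine):

```python
import math

def kulldorff(c, b, C, B):
    """Kulldorff's Poisson log-likelihood ratio for aggregate count c and
    baseline b, given totals C and B; zero unless c/b exceeds the rate
    outside the subset."""
    if c <= 0 or b <= 0 or b >= B or c / b <= (C - c) / (B - b):
        return 0.0
    score = c * math.log(c / b) - C * math.log(C / B)
    if C > c:
        score += (C - c) * math.log((C - c) / (B - b))
    return score

def ltss_scan(counts, baselines):
    """Linear-time subset scanning: sort locations by priority c_i / b_i
    and evaluate only the N prefix subsets of the sorted order, instead
    of all 2^N subsets. Returns (best score, best subset of indices)."""
    C, B = sum(counts), sum(baselines)
    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] / baselines[i], reverse=True)
    best_score, best_subset = 0.0, []
    c = b = 0.0
    for j, i in enumerate(order):
        c += counts[i]
        b += baselines[i]
        s = kulldorff(c, b, C, B)
        if s > best_score:
            best_score, best_subset = s, sorted(order[:j + 1])
    return best_score, best_subset
```

Only the unconstrained optimization is this simple; the proximity-constrained scans in the paper repeat this prefix evaluation within each local neighborhood.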
Fast Graph Scan for Scalable Detection of Arbitrary Connected Clusters
"... This work presents GraphScan, a spatial scan method for detection of arbitrarilyshaped connected clusters. GraphScan enables efficient, exact computation of the highestscoring connected clusters, with or without proximity constraints, up to ~100 locations. BACKGROUND FlexScan [1] extends Kulldorff ..."
Abstract

Cited by 14 (6 self)
 Add to MetaCart
This work presents GraphScan, a spatial scan method for detection of arbitrarily-shaped connected clusters. GraphScan enables efficient, exact computation of the highest-scoring connected clusters, with or without proximity constraints, for up to ~100 locations.

BACKGROUND. FlexScan [1] extends Kulldorff's original spatial scan [2] to detect flexibly-shaped clusters consisting of a center location s_c and a connected subset of its k − 1 nearest neighbors. Unlike other graph-based scan methods [3], FlexScan finds the highest-scoring connected subgraph, subject to the constraint on neighborhood size k. However, its run time scales exponentially with k, and thus it is computationally infeasible (requiring over 1 week to find the highest-scoring cluster for a single day of data) for k > 30. Linear-time subset scanning (LTSS) [4] can find the most interesting subset of N locations without exhaustively searching over the exponentially many subsets. Many commonly used scan statistics satisfy the LTSS property, which guarantees that the highest-scoring subset will consist of the j highest-priority locations, for some priority function G(s_i) and some j in {1…N}. In this case, only N of the 2^N subsets must be evaluated. For example, in Kulldorff's statistic [2], we can use G(s_i) = c_i / b_i, the ratio of observed to expected count.

METHODS. While the unconstrained LTSS method may return a disconnected subset of locations, GraphScan expands on LTSS by only considering the connected subsets that have potential for the highest score. For a score function that satisfies the LTSS property, we show that if location s_i is contained in the optimal subset S*, and if removing s_i does not disconnect the subgraph, then any neighbor of s_i with higher priority will also be contained in S*. GraphScan uses this property and the graph structure to reduce the search space, pruning any subgraphs which violate the rule. This efficient search allows GraphScan to abandon proximity constraints and feasibly search over all connected subsets of nodes. Alternatively, GraphScan can be used to detect the same proximity-constrained connected clusters as FlexScan, but can scale to much larger neighborhood sizes (higher values of k).