Sampling algorithms in a stream operator
In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, 2005
Cited by 31 (4 self)
Complex queries over high-speed data streams often need to rely on approximations to keep up with their input. The research community has developed a rich literature on approximate streaming algorithms for this application. Many of these algorithms produce samples of the input stream, providing better properties than conventional random sampling. In this paper, we abstract the stream sampling process and design a new stream sample operator. We show how it can be used to implement a wide variety of algorithms that perform sampling and sampling-based aggregations. We also show how to implement the operator in Gigascope, a high-speed stream database specialized for IP network monitoring applications. As an example study, we apply the operator within such an enhanced Gigascope to perform subset-sum sampling, which is of great interest for IP network management. We evaluate this implementation on a live, high-speed Internet traffic data stream and find that (a) the operator is a flexible, versatile addition to Gigascope suitable for tuning and algorithm engineering, and (b) the operator imposes only a small evaluation overhead. To our knowledge, this is the first operational implementation of a wide variety of stream sampling algorithms at line speed within a data stream management system.
Adaptive spatial partitioning for multidimensional data streams
In ISAAC, 2004
Cited by 19 (5 self)
We propose a space-efficient scheme for summarizing multidimensional data streams. Our sketch can be used to solve spatial versions of several classical data stream queries efficiently. For instance, we can track ε-hotspots, which are congruent boxes containing at least an ε fraction of the stream, and maintain hierarchical heavy hitters in d dimensions. Our sketch can also be viewed as a multidimensional generalization of the ε-approximate quantile summary. The space complexity of our scheme is O((1/ε) log R) if the points lie in the domain [0, R]^d, where d is assumed to be a constant. The scheme extends to the sliding window model with a log(εn) factor increase in space, where n is the size of the sliding window. Our sketch can also be used to answer ε-approximate rectangular range queries over a stream of d-dimensional points.
A Space-Optimal Data-Stream Algorithm for Coresets in the Plane
Cited by 16 (6 self)
Given a point set P ⊆ R², a subset Q ⊆ P is an ε-kernel of P if for every slab W containing Q, the (1+ε)-expansion of W also contains P. We present a data-stream algorithm for maintaining an ε-kernel of a stream of points in R² that uses O(1/√ε) space and takes O(log(1/ε)) amortized time to process each point. This is the first space-optimal data-stream algorithm for this problem. As a consequence, we obtain improved data-stream approximation algorithms for other extent measures, such as width and robust kernels, as well as for ε-kernels in higher dimensions.
Continuous Query Processing in Spatiotemporal Databases
In Proceedings of the ICDE/EDBT PhD Workshop, 2004
Cited by 9 (2 self)
The tremendous increase in cellular phones, GPS-like devices, and RFIDs results in highly dynamic environments where objects as well as queries are continuously moving. In this paper, we present a continuous query processor designed specifically for highly dynamic environments (e.g., location-aware environments). We implemented the proposed continuous query processor inside the PLACE server (Pervasive Location-Aware Computing Environments), a scalable location-aware database server currently being developed at Purdue University. The PLACE server extends data stream management systems to support location-aware environments. Such environments are characterized by a wide variety of continuous spatiotemporal queries and by unbounded spatiotemporal streams. The proposed continuous query processor mainly includes: (1) developing new incremental spatiotemporal operators to support a wide variety of continuous spatiotemporal queries, (2) extending the semantics of sliding window queries to deal with spatial sliding windows as well as temporal sliding windows, and (3) providing a shared execution framework for scalable execution of a set of concurrent continuous spatiotemporal queries. Preliminary experimental evaluation shows the promising performance of the continuous query processor of the PLACE server.
Identifying high cardinality internet hosts
In Proceedings of IEEE INFOCOM, 2009
Cited by 6 (1 self)
The Internet host cardinality, defined as the number of distinct peers that an Internet host communicates with, is an important metric for profiling Internet hosts. Example applications include behavior-based network intrusion detection, P2P host identification, and server identification. However, given the tremendous number of hosts in the Internet and the high speed of its links, tracking the exact cardinality of each host is not feasible with limited memory and computation resources. Existing approaches to host cardinality counting have primarily focused on hosts of extremely high cardinalities. These methods do not work well with hosts of moderately large cardinalities, which are needed for certain kinds of host behavior profiling such as detection of P2P hosts or port scanners. In this paper, we propose an online sampling approach for identifying hosts whose cardinality exceeds some moderate prescribed threshold, e.g., 50, or falls within specific ranges. The main advantage of our approach is that it can filter out the majority of low-cardinality hosts while preserving the hosts of interest, and hence minimize the memory resources wasted by tracking irrelevant hosts. Our approach consists of three components: 1) two-phase filtering for eliminating low-cardinality hosts, 2) a thresholded bitmap for counting cardinalities, and 3) bias correction. Through both theoretical analysis and experiments using real Internet traces, we demonstrate that our approach requires much less memory than existing approaches while yielding more accurate estimates.
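As a rough illustration of bitmap-based cardinality counting, the following Python sketch estimates the number of distinct peers of a single host with a plain linear-counting bitmap. It is not the paper's thresholded bitmap and it omits the two-phase filtering and bias-correction components; the class name `BitmapCounter`, the bitmap size, and the use of SHA-256 as the hash function are all assumptions made for this example.

```python
import hashlib
import math


class BitmapCounter:
    """Linear-counting bitmap for one host (illustrative, not the paper's method)."""

    def __init__(self, num_bits=1024):
        self.num_bits = num_bits
        # One byte per bit position keeps the sketch simple; a real
        # implementation would pack the bits.
        self.bits = bytearray(num_bits)

    def add(self, peer):
        # Hash the peer identifier to a bit position and set that bit.
        digest = hashlib.sha256(peer.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % self.num_bits
        self.bits[index] = 1

    def estimate(self):
        # Linear-counting estimate: -m * ln(fraction of bits still zero).
        zeros = self.num_bits - sum(self.bits)
        if zeros == 0:
            return float("inf")  # bitmap saturated; the estimate is unbounded
        return -self.num_bits * math.log(zeros / self.num_bits)


if __name__ == "__main__":
    counter = BitmapCounter()
    for i in range(50):
        counter.add(f"10.0.0.{i}")
        counter.add(f"10.0.0.{i}")  # repeated contacts do not change the bitmap
    print(round(counter.estimate(), 1))
```

Because a repeated peer always maps to the same bit, the estimate depends only on the set of distinct peers, which is the property that makes a bitmap suitable for counting cardinality rather than traffic volume.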
Small and Stable Descriptors of Distributions for Geometric Statistical Problems
2009
Cited by 2 (2 self)
This thesis explores how to sparsely represent distributions of points for geometric statistical problems. A coreset C is a small summary of a point set P such that if a certain statistic is computed on P and C, then the difference in the results is guaranteed to be bounded by a parameter ε. Two examples of coresets are ε-samples and ε-kernels. An ε-sample can estimate the density of a point set in any range from a geometric family of ranges (e.g., disks, axis-aligned rectangles). An ε-kernel approximates the width of a point set in all directions. Both coresets have size that depends only on ε, the error parameter, not the size of the original data set. We demonstrate several improvements to these coresets and how they are useful for geometric statistical problems. We reduce the size of ε-samples for density queries in axis-aligned rectangles to nearly a square root of the size when the queries are with respect to more general families of shapes, such as disks. We also show how to construct ε-samples of probability distributions. We show how to maintain “stable” ε-kernels, that is, if the point set P changes by
Stability of ε-Kernels
2009
Cited by 2 (2 self)
Given a set P of n points in R^d, an ε-kernel K ⊆ P approximates the directional width of P in every direction within a relative (1 − ε) factor. In this paper we study the stability of ε-kernels under dynamic insertion and deletion of points to P and by changing the approximation factor ε. In the first case, we say an algorithm for dynamically maintaining an ε-kernel is stable if at most O(1) points change in K as one point is inserted or deleted from P. We describe an algorithm to maintain an ε-kernel of size O(1/ε^((d−1)/2)) in O(1/ε^((d−1)/2) + log n) time per update. Not only does our algorithm maintain a stable ε-kernel, its update time is faster than any known algorithm that maintains an ε-kernel of size O(1/ε^((d−1)/2)). Next, we show that if there is an ε-kernel of P of size κ, which may be dramatically less than O(1/ε^((d−1)/2)), then there is an (ε/2)-kernel of P of size O(min{1/ε^((d−1)/2), κ^⌊d/2⌋ log^(d−2)(1/ε)}). Moreover, there exists a point set P in R^d and a parameter ε > 0 such that if every ε-kernel of P has size at least κ, then any (ε/2)-kernel of P has size Ω(κ^⌊d/2⌋).
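The ε-kernel definition above can be made concrete with a small Python sketch for the planar case (d = 2). It computes directional widths and tests the (1 − ε) condition on a finite sample of directions, so passing the test is only a necessary check, not a proof that K is an ε-kernel. The function names and the number of sampled directions are assumptions made for this example and do not come from the paper.

```python
import math


def directional_width(points, angle):
    """Width of a planar point set in the direction given by `angle` (radians)."""
    ux, uy = math.cos(angle), math.sin(angle)
    projections = [x * ux + y * uy for x, y in points]
    return max(projections) - min(projections)


def passes_kernel_check(kernel, points, eps, num_directions=360):
    """Test width(K, u) >= (1 - eps) * width(P, u) on sampled directions u.

    Width is the same for u and -u, so sampling angles in [0, pi) suffices.
    A small tolerance absorbs floating-point rounding.
    """
    for i in range(num_directions):
        angle = math.pi * i / num_directions
        needed = (1 - eps) * directional_width(points, angle)
        if directional_width(kernel, angle) < needed - 1e-12:
            return False
    return True


if __name__ == "__main__":
    P = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5), (0.2, 0.7)]
    corners = [(0, 0), (1, 0), (1, 1), (0, 1)]
    bottom_edge = [(0, 0), (1, 0)]
    print(passes_kernel_check(corners, P, eps=0.1))
    print(passes_kernel_check(bottom_edge, P, eps=0.1))
```

In this usage example the four corners span the same extent as P in every direction, while the two points on the bottom edge have zero width in the vertical direction and therefore fail the check.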
Cluster Hull: A Technique for Summarizing Spatial Data Streams
Cited by 1 (0 self)
Recently there has been a growing interest in detecting patterns and analyzing trends in data that are generated continuously, often delivered in some fixed order and at a rapid rate, in the form of a data stream [5, 6]. When the stream
References for Data Stream Algorithms
2007
Many scenarios, such as network analysis, utility monitoring, and financial applications, generate massive streams of data. These streams consist of millions or billions of simple updates every hour, and must be processed to extract the information described in tiny pieces. These notes provide an introduction to (and set of references for) data stream algorithms, and some of the techniques that have been developed over recent years to help mine the data while avoiding drowning in these massive flows of information.
Answering linear optimization queries with an approximate stream index
We propose a SAO index to approximately answer arbitrary linear optimization queries in a sliding window of a data stream. It uses limited memory to maintain the most “important” tuples. At any time, for any linear optimization query, we can retrieve the approximate top-K tuples in the sliding window almost instantly. The larger the amount of available memory, the better the quality of the answers. More importantly, for a given amount of memory, the quality of the answers can be further improved by dynamically allocating a larger portion of the memory to the outer layers of the SAO index.
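To make the query type concrete, here is a minimal Python sketch of a linear optimization (top-K) query over a sliding window. It is an exact brute-force baseline that stores every tuple in the window, not the SAO index, which keeps only a memory-bounded subset of tuples; the class name `SlidingWindowTopK` and its methods are hypothetical names introduced for this example.

```python
from collections import deque
import heapq


class SlidingWindowTopK:
    """Exact top-K for linear scoring functions over the last `window_size` tuples."""

    def __init__(self, window_size):
        # A deque with maxlen drops the oldest tuple automatically on overflow.
        self.window = deque(maxlen=window_size)

    def insert(self, values):
        self.window.append(tuple(values))

    def top_k(self, weights, k):
        # Score each tuple by the dot product with the query's weight vector
        # and return the k highest-scoring tuples, best first.
        def score(values):
            return sum(w * v for w, v in zip(weights, values))

        return heapq.nlargest(k, self.window, key=score)


if __name__ == "__main__":
    index = SlidingWindowTopK(window_size=3)
    for row in [(1, 1), (5, 0), (0, 5), (2, 2)]:
        index.insert(row)
    # The window now holds only the three most recent tuples.
    print(index.top_k(weights=(1, 0), k=2))
```

Each query here scans the whole window, which is the cost an approximate index of the kind described above is meant to avoid.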