Sampling algorithms in a stream operator
- In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, 2005
Cited by 24 (2 self)
Complex queries over high speed data streams often need to rely on approximations to keep up with their input. The research community has developed a rich literature on approximate streaming algorithms for this application. Many of these algorithms produce samples of the input stream, providing better properties than conventional random sampling. In this paper, we abstract the stream sampling process and design a new stream sample operator. We show how it can be used to implement a wide variety of algorithms that perform sampling and sampling-based aggregations. We also show how to implement the operator in Gigascope, a high speed stream database specialized for IP network monitoring applications. As an example study, we apply the operator within such an enhanced Gigascope to perform subset-sum sampling, which is of great interest for IP network management. We evaluate this implementation on a live, high speed Internet traffic data stream and find that (a) the operator is a flexible, versatile addition to Gigascope suitable for tuning and algorithm engineering, and (b) the operator imposes only a small evaluation overhead. To our knowledge, this is the first operational implementation of a wide variety of stream sampling algorithms at line speed within a data stream management system.
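The abstract above mentions subset-sum sampling without giving details. As a rough illustration only, the sketch below uses basic threshold sampling, a generic technique for subset-sum estimation, and is not the paper's operator or its Gigascope implementation. The function names, the threshold `z`, and the synthetic packet sizes are all assumptions made for this example.

```python
import random


def threshold_sample(weights, z, rng):
    """Keep item i with probability min(1, w_i / z).

    Each kept item carries the adjusted weight max(w_i, z), which makes the
    sum of adjusted weights an unbiased estimate of the true weight sum.
    """
    sample = []
    for i, w in enumerate(weights):
        if rng.random() < min(1.0, w / z):
            sample.append((i, max(w, z)))
    return sample


def estimate_subset_sum(sample, subset):
    """Estimate the total weight of the items whose index is in `subset`."""
    return sum(adjusted for i, adjusted in sample if i in subset)


# Hypothetical workload: 10,000 packets with synthetic byte counts.
rng = random.Random(0)
weights = [rng.randint(40, 1500) for _ in range(10_000)]
subset = set(range(0, 10_000, 2))  # e.g. packets matching some filter

sample = threshold_sample(weights, z=5_000, rng=rng)
exact = sum(weights[i] for i in subset)
estimate = estimate_subset_sum(sample, subset)
print(len(sample), exact, round(estimate))
```

Raising `z` shrinks the sample at the cost of higher variance in the estimate; items with weight at least `z` are always kept with their exact weight.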
Adaptive spatial partitioning for multidimensional data streams
- In ISAAC, 2004
Cited by 17 (5 self)
We propose a space-efficient scheme for summarizing multidimensional data streams. Our sketch can be used to solve spatial versions of several classical data stream queries efficiently. For instance, we can track ε-hotspots, which are congruent boxes containing at least an ε fraction of the stream, and maintain hierarchical heavy hitters in d dimensions. Our sketch can also be viewed as a multidimensional generalization of the ε-approximate quantile summary. The space complexity of our scheme is O((1/ε) log R) if the points lie in the domain [0, R]^d, where d is assumed to be a constant. The scheme extends to the sliding window model with a log(εn) factor increase in space, where n is the size of the sliding window. Our sketch can also be used to answer ε-approximate rectangular range queries over a stream of d-dimensional points.
A Space-Optimal Data-Stream Algorithm for Coresets in the Plane
Cited by 11 (5 self)
Given a point set P ⊆ R², a subset Q ⊆ P is an ε-kernel of P if for every slab W containing Q, the (1+ε)-expansion of W also contains P. We present a data-stream algorithm for maintaining an ε-kernel of a stream of points in R² that uses O(1/√ε) space and takes O(log(1/ε)) amortized time to process each point. This is the first space-optimal data-stream algorithm for this problem. As a consequence, we obtain improved data-stream approximation algorithms for other extent measures, such as width, robust kernels, as well as ε-kernels in higher dimensions.
Continuous Query Processing in Spatiotemporal Databases
- In Proceedings of the ICDE/EDBT PhD Workshop, 2004
Cited by 8 (1 self)
The tremendous increase in cellular phones, GPS-like devices, and RFIDs results in highly dynamic environments where objects as well as queries are continuously moving. In this paper, we present a continuous query processor designed specifically for highly dynamic environments (e.g., location-aware environments). We implemented the proposed continuous query processor inside the PLACE server (Pervasive Location-Aware Computing Environments), a scalable location-aware database server currently developed at Purdue University. The PLACE server extends data stream management systems to support location-aware environments. Such environments are characterized by a wide variety of continuous spatio-temporal queries and by unbounded spatio-temporal streams. The proposed continuous query processor mainly involves: (1) developing new incremental spatio-temporal operators to support a wide variety of continuous spatio-temporal queries, (2) extending the semantics of sliding window queries to deal with spatial sliding windows as well as temporal sliding windows, and (3) providing a shared execution framework for scalable execution of a set of concurrent continuous spatio-temporal queries. Preliminary experimental evaluation shows the promising performance of the continuous query processor of the PLACE server.
Identifying high cardinality internet hosts
- In Proceedings of IEEE INFOCOM, 2009
Cited by 4 (1 self)
The Internet host cardinality, defined as the number of distinct peers that an Internet host communicates with, is an important metric for profiling Internet hosts. Example applications include behavior-based network intrusion detection, P2P host identification, and server identification. However, given the tremendous number of hosts in the Internet and the high speed of its links, tracking the exact cardinality of each host is not feasible with limited memory and computation resources. Existing approaches to host cardinality counting have primarily focused on hosts of extremely high cardinalities. These methods do not work well with hosts of moderately large cardinalities, which are needed for certain host behavior profiling such as detection of P2P hosts or port scanners. In this paper, we propose an online sampling approach for identifying hosts whose cardinality exceeds some moderate prescribed threshold, e.g. 50, or falls within specific ranges. The main advantage of our approach is that it can filter out the majority of low cardinality hosts while preserving the hosts of interest, and hence minimize the memory resources wasted by tracking irrelevant hosts. Our approach consists of three components: 1) two-phase filtering for eliminating low cardinality hosts, 2) thresholded bitmap for counting cardinalities, and 3) bias correction. Through both theoretical analysis and experiments using real Internet traces, we demonstrate that our approach requires much less memory than existing approaches while yielding more accurate estimates.
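The abstract names a thresholded bitmap for counting cardinalities but does not specify it. For orientation only, the sketch below shows a plain bitmap (linear counting) estimator, a simpler generic technique substituted here for illustration; it is not the paper's thresholded bitmap and omits its filtering and bias-correction steps. The class name, bitmap size, and hashing choice are assumptions.

```python
import hashlib
import math


class BitmapCounter:
    """Approximate count of distinct peers using a fixed-size bitmap."""

    def __init__(self, num_bits=1024):
        self.num_bits = num_bits
        # One byte per bit keeps the sketch easy to read (not memory-optimal).
        self.bits = bytearray(num_bits)

    def add(self, peer):
        """Hash the peer identifier to one bitmap position and set it."""
        digest = hashlib.sha256(peer.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % self.num_bits
        self.bits[index] = 1

    def estimate(self):
        """Linear counting estimate: -m * ln(fraction of bits still zero)."""
        zeros = self.num_bits - sum(self.bits)
        if zeros == 0:
            return float("inf")  # bitmap saturated; estimate is unbounded
        return -self.num_bits * math.log(zeros / self.num_bits)


# Hypothetical host that contacts 200 distinct peers, each one three times.
counter = BitmapCounter()
for repeat in range(3):
    for peer_id in range(200):
        counter.add(f"10.0.{peer_id // 256}.{peer_id % 256}")
print(round(counter.estimate()))
```

Repeated contacts with the same peer set the same bit, so duplicates do not inflate the estimate; accuracy degrades as the bitmap fills up.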
Stability of ε-Kernels
- 2009
Cited by 2 (2 self)
Given a set P of n points in R^d, an ε-kernel K ⊆ P approximates the directional width of P in every direction within a relative (1 − ε) factor. In this paper we study the stability of ε-kernels under dynamic insertion and deletion of points in P and under changes to the approximation factor ε. In the first case, we say an algorithm for dynamically maintaining an ε-kernel is stable if at most O(1) points change in K as one point is inserted into or deleted from P. We describe an algorithm to maintain an ε-kernel of size O(1/ε^((d−1)/2)) in O(1/ε^((d−1)/2) + log n) time per update. Not only does our algorithm maintain a stable ε-kernel, its update time is faster than any known algorithm that maintains an ε-kernel of size O(1/ε^((d−1)/2)). Next, we show that if there is an ε-kernel of P of size κ, which may be dramatically less than O(1/ε^((d−1)/2)), then there is an (ε/2)-kernel of P of size O(min{1/ε^((d−1)/2), κ^⌊d/2⌋ log^(d−2)(1/ε)}). Moreover, there exists a point set P in R^d and a parameter ε > 0 such that if every ε-kernel of P has size at least κ, then any (ε/2)-kernel of P has size Ω(κ^⌊d/2⌋).
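To make the directional-width definition above concrete, the following sketch tests the ε-kernel condition for points in the plane over a finite set of sampled directions. It only illustrates the definition and is not the maintenance algorithm from the paper. Passing the check on sampled directions is evidence of the property, not a proof of it. The function names and the number of sampled directions are assumptions.

```python
import math


def directional_width(points, theta):
    """Width of a 2-D point set along the unit direction at angle theta."""
    ux, uy = math.cos(theta), math.sin(theta)
    projections = [x * ux + y * uy for x, y in points]
    return max(projections) - min(projections)


def looks_like_kernel(points, kernel, eps, num_directions=360):
    """Check width(kernel, u) >= (1 - eps) * width(points, u) on sampled u."""
    for k in range(num_directions):
        theta = math.pi * k / num_directions  # widths repeat with period pi
        needed = (1 - eps) * directional_width(points, theta)
        # Small slack guards against floating-point rounding.
        if directional_width(kernel, theta) < needed - 1e-12:
            return False
    return True


# Hypothetical data: unit-square corners plus two interior points.
corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
points = corners + [(0.5, 0.5), (0.2, 0.7)]
print(looks_like_kernel(points, corners, eps=0.01))
print(looks_like_kernel(points, [(0.0, 0.0), (1.0, 1.0)], eps=0.1))
```

The four corners preserve every directional width because they are the convex hull vertices, while the two diagonal corners collapse to near-zero width along the opposite diagonal.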
Small and Stable Descriptors of Distributions for Geometric Statistical Problems
- 2009
Cited by 1 (1 self)
This thesis explores how to sparsely represent distributions of points for geometric statistical problems. A coreset C is a small summary of a point set P such that if a certain statistic is computed on P and C, then the difference in the results is guaranteed to be bounded by a parameter ε. Two examples of coresets are ε-samples and ε-kernels. An ε-sample can estimate the density of a point set in any range from a geometric family of ranges (e.g., disks, axis-aligned rectangles). An ε-kernel approximates the width of a point set in all directions. Both coresets have size that depends only on ε, the error parameter, not the size of the original data set. We demonstrate several improvements to these coresets and how they are useful for geometric statistical problems. We reduce the size of ε-samples for density queries in axis-aligned rectangles to nearly a square root of the size when the queries are with respect to more general families of shapes, such as disks. We also show how to construct ε-samples of probability distributions. We show how to maintain “stable” ε-kernels, that is, if the point set P changes by
References for Data Stream Algorithms
- 2007
Many scenarios, such as network analysis, utility monitoring, and financial applications, generate massive streams of data. These streams consist of millions or billions of simple updates every hour, and must be processed to extract the information described in tiny pieces. These notes provide an introduction to (and set of references for) data stream algorithms, and some of the techniques that have been developed over recent years to help mine the data while avoiding drowning in these massive flows of information.
Answering linear optimization queries with an approximate stream index
We propose a SAO index to approximately answer arbitrary linear optimization queries in a sliding window of a data stream. It uses limited memory to maintain the most “important” tuples. At any time, for any linear optimization query, we can retrieve the approximate top-K tuples in the sliding window almost instantly. The larger the amount of available memory, the better the quality of the answers. More importantly, for a given amount of memory, the quality of the answers can be further improved by dynamically allocating a larger portion of the memory to the outer layers of the SAO index.

