Results 1 - 10 of 23
Sketching streams through the net: Distributed approximate query tracking
- In VLDB, 2005
"... While traditional database systems optimize for performance on one-shot query processing, emerging large-scale monitoring applications require continuous tracking of complex dataanalysis queries over collections of physically-distributed streams. Thus, effective solutions have to be simultaneously s ..."
Abstract - Cited by 78 (20 self)
While traditional database systems optimize for performance on one-shot query processing, emerging large-scale monitoring applications require continuous tracking of complex data-analysis queries over collections of physically-distributed streams. Thus, effective solutions have to be simultaneously space/time efficient (at each remote monitor site), communication efficient (across the underlying communication network), and provide continuous, guaranteed-quality approximate query answers. In this paper, we propose novel algorithmic solutions for the problem of continuously tracking a broad class of complex aggregate queries in such a distributed-streams setting. Our tracking schemes maintain approximate query answers with provable error guarantees, while simultaneously optimizing the storage space and processing time at each remote site, as well as the communication cost across the network. In a nutshell, our algorithms rely on tracking general-purpose randomized sketch summaries of local streams at remote sites, along with concise prediction models of local site behavior, in order to produce highly communication- and space/time-efficient solutions. The end result is a powerful approximate query tracking framework that readily incorporates several complex analysis queries (including distributed join and multi-join aggregates, and approximate wavelet representations), thus giving the first known low-overhead tracking solution for such queries in the distributed-streams model. Experiments with real data validate our approach, revealing significant savings over naive solutions as well as our analytical worst-case guarantees.
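To make the tracking scheme concrete, here is a minimal sketch of the core idea as the abstract describes it: each remote site maintains a randomized sketch of its local stream plus a copy of the prediction the coordinator holds, and communicates only when the two drift apart. The AMSSketch class, the static last-shipped-sketch prediction model, and the drift threshold theta are illustrative assumptions, not the paper's actual protocol.

```python
import zlib

class AMSSketch:
    """Toy AMS-style sketch: each counter accumulates the stream with a
    deterministic pseudo-random +1/-1 sign per (row, item) pair."""
    def __init__(self, width, seed=0):
        self.c = [0.0] * width
        self.seed = seed

    def _sign(self, row, item):
        h = zlib.crc32(f"{self.seed}:{row}:{item}".encode())
        return 1.0 if h & 1 else -1.0

    def update(self, item, delta=1.0):
        for r in range(len(self.c)):
            self.c[r] += self._sign(r, item) * delta

    def l2_dist(self, other):
        return sum((a - b) ** 2 for a, b in zip(self.c, other.c)) ** 0.5


class RemoteSite:
    def __init__(self, width, theta, send):
        self.local = AMSSketch(width)      # sketch of the true local stream
        self.predicted = AMSSketch(width)  # copy the coordinator also holds
        self.theta = theta                 # tolerated local drift
        self.send = send                   # callback that ships a sketch upstream

    def observe(self, item):
        self.local.update(item)
        # Static prediction model: the coordinator assumes the last shipped
        # sketch is still current, so communicate only on large drift.
        if self.local.l2_dist(self.predicted) > self.theta:
            self.predicted.c = list(self.local.c)
            self.send(self.predicted.c)
```

Larger theta means fewer messages but looser answers at the coordinator, which is the space/communication/accuracy trade-off the abstract describes.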
Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams
- ACM Transactions on Database Systems (TODS), 2004
"... We consider the problem of efficiently processing continuous queries over multiple continuous data streams inthe presence of constraints on the datastreams. We specify several types of constraints, and for each constrainttype we identify an “ adherence parameter ” that captures how closely a given s ..."
Abstract - Cited by 59 (9 self)
We consider the problem of efficiently processing continuous queries over multiple continuous data streams in the presence of constraints on the data streams. We specify several types of constraints, and for each constraint type we identify an “adherence parameter” that captures how closely a given stream or joining pair of streams adheres to a constraint of that type. We then present a query execution algorithm that takes k-constraints over streams into account in order to reduce memory overhead. In general, the tighter the adherence parameters are in the k-constraints, the less memory is required. Furthermore, if input streams do not adhere to constraints within the specified adherence parameters, our algorithm automatically degrades gracefully to provide continuous approximate answers. We have implemented our approach in a testbed continuous query processor, and preliminary experimental results are reported.
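As a hedged illustration of why tighter adherence parameters mean less memory (this is not the paper's algorithm), consider an equijoin under a hypothetical k-arrival constraint: matching tuples on the two streams are known to arrive within k tuples of each other, so each side only needs to buffer its k most recent tuples.

```python
from collections import deque

def k_constrained_join(r_stream, s_stream, k, key=lambda t: t[0]):
    """Equijoin of two interleaved streams under an assumed k-arrival
    constraint: join state is bounded by O(k) instead of the full streams."""
    r_buf, s_buf = deque(maxlen=k), deque(maxlen=k)  # old tuples fall off
    for r, s in zip(r_stream, s_stream):
        r_buf.append(r)
        s_buf.append(s)
        # Probe each new tuple against the other side's bounded buffer.
        yield from ((r, s2) for s2 in s_buf if key(s2) == key(r))
        yield from ((r2, s) for r2 in r_buf if key(r2) == key(s) and r2 is not r)
```

If the streams violate the constraint, matches more than k tuples apart are simply missed, which mirrors the graceful degradation to approximate answers described above.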
A Survey of Synopsis Construction in Data Streams
"... The large volume of data streams poses unique space and time constraints on the computation process. Many query processing, database operations, and mining algorithms require efficient execution which can be difficult to achieve with a fast data stream. In many cases, it may be acceptable to generat ..."
Abstract - Cited by 15 (2 self)
The large volume of data streams poses unique space and time constraints on the computation process. Many query processing, database operations, and mining algorithms require efficient execution, which can be difficult to achieve with a fast data stream. In many cases, it may be acceptable to generate approximate solutions for such problems. In recent years a number of synopsis structures have been developed, which can be used in conjunction with a variety of mining and query processing techniques in data stream processing. Key synopsis methods include sampling, wavelets, sketches, and histograms. In this chapter, we provide a survey of the key synopsis techniques, and the mining techniques supported by such methods. We discuss the challenges and tradeoffs associated with using different kinds of techniques, and the important research directions for synopsis construction.
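Of the synopsis methods the survey names, sampling is the simplest to make concrete. A minimal sketch of classic reservoir sampling, which maintains a uniform random sample of fixed size k over an unbounded stream in O(k) memory (the fixed seed is only for reproducibility):

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, n)        # item survives with prob k/(n+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```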
XML stream processing using tree-edit distance embeddings
- ACM Transactions on Database Systems (TODS), 2005
"... We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tr ..."
Abstract - Cited by 14 (0 self)
We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tree-edit distance metrics into an L1 vector space while guaranteeing a (worst-case) upper bound of O(log^2 n log* n) on the distance distortion between any data trees with at most n nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and (2) approximate the result of tree-edit-distance similarity joins over continuous XML document streams. Experimental results from an empirical study with both synthetic and real-life XML data trees validate our approach, demonstrating that the average-case behavior of our embedding techniques is much better than what would be predicted from our theoretical worst-case distortion bounds. To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model.
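Spelled out, the distortion guarantee in this abstract takes the standard metric-embedding form (the notation V for the embedding map is an assumption here; the paper's exact one-sided constants may differ):

```latex
% For any data trees T_1, T_2 with at most n nodes, where d is the
% tree-edit distance and V the embedding into an L1 vector space:
d(T_1, T_2) \;\le\; \lVert V(T_1) - V(T_2) \rVert_1
           \;\le\; O\!\left(\log^2 n \, \log^* n\right) \cdot d(T_1, T_2).
```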
Approximation techniques for spatial data
2004
"... Spatial Database Management Systems (SDBMS), e.g., Geographical Information Systems, that manage spatial objects such as points, lines, and hyper-rectangles, often have very high query processing costs. Accurate selectivity estimation during query optimization therefore is crucially important for fi ..."
Abstract - Cited by 11 (0 self)
Spatial Database Management Systems (SDBMS), e.g., Geographical Information Systems, that manage spatial objects such as points, lines, and hyper-rectangles, often have very high query processing costs. Accurate selectivity estimation during query optimization is therefore crucially important for finding good query plans, especially when spatial joins are involved. Selectivity estimation has been studied for relational database systems, but to date has received little attention in SDBMS. In this paper, we introduce novel methods that permit high-quality selectivity estimation for spatial joins and range queries. Our techniques can be constructed in a single scan over the input and handle inserts and deletes to the database incrementally, hence they can also be used for processing of streaming spatial data. In contrast to previous approaches, our techniques return approximate results that come with provable probabilistic quality guarantees. We present a detailed analysis and experimentally demonstrate the efficacy of the proposed techniques.
Sliding Window Query Processing over Data Streams
2006
"... I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public. ii Database management systems (DBMSs) have been used suc ..."
Abstract - Cited by 11 (0 self)
Database management systems (DBMSs) have been used successfully in traditional business applications that require persistent data storage and an efficient querying mechanism. Typically, it is assumed that the data are static, unless explicitly modified or deleted by a user or application. Database queries are executed when issued and their answers reflect the current state of the data. However, emerging applications, such as sensor networks, real-time Internet traffic analysis, and on-line financial trading, require support for processing of unbounded data streams. The fundamental assumption of a data stream management system (DSMS) is that new data are generated continually, making it infeasible to store a stream in its entirety. At best, a sliding window of recently arrived data may be maintained, meaning that old data must be removed as time goes on. Furthermore, as the contents of the sliding windows evolve over time, it makes ...
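The sliding-window model this thesis studies is easy to make concrete. Below is a minimal sketch of a time-based window aggregate (the class name and API are illustrative, not from the thesis): tuples older than the window length are evicted as time advances, so state stays bounded by what fits inside the window.

```python
from collections import deque

class SlidingWindowSum:
    """Sum over the last `window` time units of a stream."""
    def __init__(self, window):
        self.window = window
        self.buf = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def insert(self, ts, value):
        self.buf.append((ts, value))
        self.total += value
        self._evict(ts)

    def _evict(self, now):
        # Remove tuples that have slid out of the window.
        while self.buf and self.buf[0][0] <= now - self.window:
            _, v = self.buf.popleft()
            self.total -= v

    def query(self, now):
        self._evict(now)
        return self.total
```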
CR-precis: A Deterministic Summary Structure for Update Streams
- In Proc. Int’l Symp. on Algorithms, Probabilistic and Experimental Methodologies (ESCAPE), LNCS 4614, 2007
"... Abstract. We present the CR-precis structure, that is a general-purpose, deterministic and sub-linear data structure for summarizing update data streams. The CR-precis structure yields the first deterministic sub-linear space/time algorithms for update streams for answering a variety of fundamental ..."
Abstract - Cited by 11 (4 self)
We present the CR-precis structure, a general-purpose, deterministic, and sub-linear data structure for summarizing update data streams. The CR-precis structure yields the first deterministic sub-linear space/time algorithms on update streams for answering a variety of fundamental stream queries, such as (a) point queries, (b) range queries, (c) finding approximate frequent items, (d) finding approximate quantiles, (e) finding approximate hierarchical heavy hitters, (f) estimating inner products, and (g) near-optimal B-bucket histograms.
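A hedged reconstruction of the structure as the abstract describes it (the parameter choices and query procedure below are mine, not the paper's): one counter table per small prime modulus, each update added at index item mod p, and a point query answered by the minimum over tables. Determinism comes from fixed prime moduli instead of random hash functions.

```python
def first_primes(t, start=2):
    """First t primes >= start, by trial division."""
    primes, n = [], start
    while len(primes) < t:
        if all(n % p for p in range(2, int(n ** 0.5) + 1)):
            primes.append(n)
        n += 1
    return primes

class CRPrecis:
    def __init__(self, t=8, min_prime=50):
        self.primes = first_primes(t, min_prime)
        self.tables = [[0] * p for p in self.primes]

    def update(self, item, delta=1):
        for p, table in zip(self.primes, self.tables):
            table[item % p] += delta

    def point_query(self, item):
        # For non-negative frequencies, every counter overestimates f(item),
        # so the minimum is the tightest deterministic upper bound.
        return min(t[item % p] for p, t in zip(self.primes, self.tables))
```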
Approximate Continuous Querying over Distributed Streams
2008
"... While traditional database systems optimize for performance on one-shot query processing, emerging largescale monitoring applications require continuous tracking of complex data-analysis queries over collections of physically-distributed streams. Thus, effective solutions have to be simultaneously s ..."
Abstract - Cited by 11 (5 self)
While traditional database systems optimize for performance on one-shot query processing, emerging large-scale monitoring applications require continuous tracking of complex data-analysis queries over collections of physically-distributed streams. Thus, effective solutions have to be simultaneously space/time efficient (at each remote monitor site), communication efficient (across the underlying communication network), and provide continuous, guaranteed-quality approximate query answers. In this paper, we propose novel algorithmic solutions for the problem of continuously tracking a broad class of complex aggregate queries in such a distributed-streams setting. Our tracking schemes maintain approximate query answers with provable error guarantees, while simultaneously optimizing the storage space and processing time at each remote site, and the communication cost across the network. In a nutshell, our algorithms rely on tracking general-purpose randomized sketch summaries of local streams at remote sites, along with concise prediction models of local site behavior, in order to produce highly communication- and space/time-efficient solutions. The end result is a powerful approximate query tracking framework that readily incorporates several complex analysis queries (including distributed join and multi-join aggregates, and approximate wavelet representations), thus giving the first known low-overhead tracking solution for such queries in the distributed-streams model. Experiments with real data validate our approach, revealing significant savings over naive solutions as well as our analytical worst-case guarantees.
End-biased Samples for Join Cardinality Estimation
"... We present a new technique for using samples to estimate join cardinalities. This technique, which we term “end-biased samples,” is inspired by recent work in network traffic measurement. It improves on random samples by using coordinated pseudo-random samples and retaining the sampled values in pro ..."
Abstract - Cited by 10 (0 self)
We present a new technique for using samples to estimate join cardinalities. This technique, which we term “end-biased samples,” is inspired by recent work in network traffic measurement. It improves on random samples by using coordinated pseudo-random samples and retaining the sampled values in proportion to their frequency. We show that end-biased samples always provide more accurate estimates than random samples with the same sample size. The comparison with histograms is more interesting: while end-biased histograms are somewhat better than end-biased samples for uncorrelated data sets, end-biased samples dominate by a large margin when the data is correlated. Finally, we compare end-biased samples to the recently proposed “skimmed sketches” and show that neither dominates the other, and that each has different and compelling strengths and weaknesses. These results suggest that end-biased samples may be a useful addition to the repertoire of techniques used for data summarization.
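A hedged sketch of the two ingredients the abstract names, coordination and frequency-biased retention (the threshold T and the estimator below are my reading, not necessarily the paper's exact scheme): a value's inclusion is decided by a hash of the value itself, so both tables sample the same values, and values with frequency at least T are always kept.

```python
import zlib

def u(value):
    """Deterministic 'random' number in [0, 1) shared by both tables."""
    return zlib.crc32(str(value).encode()) / 2**32

def end_biased_sample(freqs, T):
    """freqs: dict value -> frequency. Keep v if f >= T or u(v) < f/T."""
    return {v: f for v, f in freqs.items() if f >= T or u(v) < f / T}

def estimate_join_size(s1, s2, T):
    """Estimate sum_v f1(v)*f2(v) from two coordinated end-biased samples."""
    est = 0.0
    for v in s1.keys() & s2.keys():
        f1, f2 = s1[v], s2[v]
        # Coordination means v survives in both samples exactly when
        # u(v) < min(1, f1/T, f2/T); divide by that inclusion probability.
        p = min(1.0, f1 / T, f2 / T)
        est += f1 * f2 / p
    return est
```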
Pseudo-random number generation for sketch-based estimations
- ACM Transactions on Database Systems (TODS)
"... The exact computation of aggregate queries, like the size of join of two relations, usually requires large amounts of memory – constrained in data-streaming – or communication – constrained in distributed computation – and large processing times. In this situation, approximation techniques with prov ..."
Abstract - Cited by 7 (2 self)
The exact computation of aggregate queries, such as the size of the join of two relations, usually requires large amounts of memory (the constrained resource in data streaming) or communication (the constrained resource in distributed computation), as well as large processing times. In this situation, approximation techniques with provable guarantees, like sketches, are one possible solution. The performance of sketches depends crucially on the ability to generate particular pseudo-random numbers. In this paper we investigate, both theoretically and empirically, the problem of generating k-wise independent pseudo-random numbers and, in particular, that of generating 3- and 4-wise independent pseudo-random numbers that are fast range-summable (i.e., they can be summed up in sub-linear time). Our specific contributions are: (a) we provide a thorough comparison of the various pseudo-random number generating schemes; (b) we study, both theoretically and empirically, the fast range-summation property of the 3- and 4-wise independent generating schemes; (c) we provide algorithms for the fast range-summation of two 3-wise independent schemes, BCH and Extended Hamming; (d) we show convincing theoretical and empirical evidence that the Extended Hamming scheme performs as well as any 4-wise independent scheme for estimating the size of the join of two relations using AMS-sketches, even though it is only 3-wise independent. We use this scheme to generate estimators that significantly outperform the state-of-the-art solutions for two problems: the size of spatial joins and selectivity estimation.
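For reference, the textbook construction of a k-wise independent generator that such schemes are measured against (the specific field and the sign extraction below are common conventions, not details from this paper): evaluate a random polynomial of degree k-1 over a prime field.

```python
import random

P = 2**31 - 1  # a Mersenne prime, a common choice of field

def make_kwise_generator(k, rng=random.Random(7)):
    """k-wise independent map x -> g(x) via a random degree-(k-1) polynomial."""
    coeffs = [rng.randrange(P) for _ in range(k)]  # c_0 + c_1 x + ... + c_{k-1} x^{k-1}
    def g(x):
        y = 0
        for c in reversed(coeffs):   # Horner evaluation mod P
            y = (y * x + c) % P
        return y
    return g

# 4-wise independent +1/-1 signs, as used with AMS join-size sketches:
h = make_kwise_generator(4)
sign = lambda x: 1 if h(x) & 1 else -1
```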