Results 1 - 10
of
106
Adaptive Filters for Continuous Queries over Distributed Data Streams
- In SIGMOD
, 2003
"... We consider an environment where distributed data sources continuously stream updates to a centralized processor that monitors continuous queries over the distributed data. Significant communication overhead is incurred in the presence of rapid update streams, and we propose a new technique fo ..."
Abstract
-
Cited by 161 (2 self)
- Add to MetaCart
We consider an environment where distributed data sources continuously stream updates to a centralized processor that monitors continuous queries over the distributed data. Significant communication overhead is incurred in the presence of rapid update streams, and we propose a new technique for reducing the overhead. Users register continuous queries with precision requirements at the central stream processor, which installs filters at remote data sources. The filters adapt to changing conditions to minimize stream rates while guaranteeing that all continuous queries still receive the updates necessary to provide answers of adequate precision at all times. Our approach enables applications to trade precision for communication overhead at a fine granularity by individually adjusting the precision constraints of continuous queries over streams in a multi-query workload.
What’s hot and what’s not: Tracking most frequent items dynamically
- In Proceedings of ACM Principles of Database Systems
, 2003
"... Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items ” in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of ..."
Abstract
-
Cited by 139 (12 self)
- Add to MetaCart
Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items ” in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in many applications. We present new methods for dynamically determining the hot items at any time in a relation which is undergoing deletion operations as well as inserts. Our methods maintain small space data structures that monitor the transactions on the relation, and when required, quickly output all hot items, without rescanning the relation in the database. With user-specified probability, all hot items are correctly reported. Our methods rely on ideas from “group testing”. They are simple to implement, and have provable quality, space and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees can not handle deletions, and those that handle deletions can not make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithms are accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles
- In SIGMOD
, 2005
"... While traditional database systems optimize for performance on one-shot queries, emerging large-scale monitoring applications require continuous tracking of complex aggregates and data-distribution summaries over collections of physically-distributed streams. Thus, effective solutions have to be sim ..."
Abstract
-
Cited by 68 (15 self)
- Add to MetaCart
While traditional database systems optimize for performance on one-shot queries, emerging large-scale monitoring applications require continuous tracking of complex aggregates and data-distribution summaries over collections of physically-distributed streams. Thus, effective solutions have to be simultaneously space efficient (at each remote site), communication efficient (across the underlying communication network), and provide continuous, guaranteedquality estimates. In this paper, we propose novel algorithmic solutions for the problem of continuously tracking complex holistic aggregates in such a distributed-streams setting — our primary focus is on approximate quantile summaries, but our approach is more broadly applicable and can handle other holistic-aggregate functions (e.g., “heavy-hitters ” queries). We present the first known distributed-tracking schemes for maintaining accurate quantile estimates with provable approximation guarantees, while simultaneously optimizing the storage space at each remote site as well as the communication cost across the network. In a nutshell, our algorithms employ a combination of local tracking at remote sites and simple prediction models for local site behavior in order to produce highly communication- and space-efficient solutions. We perform extensive experiments with real and synthetic data to explore the various tradeoffs and understand the role of prediction models in our schemes. The results clearly validate our approach, revealing significant savings over naive solutions as well as our analytical worst-case guarantees. 1.
Finding (recently) frequent items in distributed data streams
"... We consider the problem of maintaining frequency counts for items occurring frequently in the union of multiple distributed data streams. Naïve methods of combining approximate frequency counts from multiple nodes tend to result in excessively large data structures that are costly to transfer among ..."
Abstract
-
Cited by 56 (2 self)
- Add to MetaCart
We consider the problem of maintaining frequency counts for items occurring frequently in the union of multiple distributed data streams. Naïve methods of combining approximate frequency counts from multiple nodes tend to result in excessively large data structures that are costly to transfer among nodes. To minimize communication requirements, the degree of precision maintained by each node while counting item frequencies must be managed carefully. We introduce the concept of a precision gradient for managing precision when nodes are arranged in a hierarchical communication structure. We then study the optimization problem of how to set the precision gradient so as to minimize communication, and provide optimal solutions that minimize worst-case communication load over all possible inputs. We then introduce a variant designed to perform well in practice, with input data that does not conform to worst-case characteristics. We verify the effectiveness of our approach empirically using real-world data, and show that our methods incur substantially less communication than naïve approaches while providing the same error guarantees on answers. 1.
Efficient Top-K Query Calculation in Distributed Networks
- In PODC
, 2004
"... This paper presents a new algorithm to answer top-k queries (e.g. “find the k objects with the highest aggregate values”) in a distributed network. Existing algorithms such as the Threshold Algorithm [FLN01] consume an excessive amount of bandwidth when the number of nodes, m, is high. We propose a ..."
Abstract
-
Cited by 46 (0 self)
- Add to MetaCart
This paper presents a new algorithm to answer top-k queries (e.g. “find the k objects with the highest aggregate values”) in a distributed network. Existing algorithms such as the Threshold Algorithm [FLN01] consume an excessive amount of bandwidth when the number of nodes, m, is high. We propose a new algorithm called “Three-Phase Uniform Threshold” (TPUT). TPUT reduces network bandwidth consumption by pruning away ineligible objects, and terminates in three round-trips regardless of data input. The paper presents two sets of results about TPUT. First, trace-driven simulations show that, depending on the size of the network, TPUT reduces network traffic by one to two orders of magnitude compared to existing algorithms. Second, TPUT is proven to be instance-optimal on data series that satisfy a lower bound on the slope of decreases in values. In particular, analysis shows that by using a pruning parameter α < 1, TPUT achieves a qualitative reduction in network traffic, for example, lowering the optimality ratio from O(m ∗ m) to O(m ∗ √ m) for data series following Zipf distribution. 1
A geometric approach to monitoring threshold functions over distributed data streams
- In ACM SIGMOD
, 2006
"... Monitoring data streams in a distributed system is the focus of much research in recent years. Most of the proposed schemes, however, deal with monitoring simple aggregated values, such as the frequency of appearance of items in the streams. More involved challenges, such as the important task of fe ..."
Abstract
-
Cited by 40 (5 self)
- Add to MetaCart
Monitoring data streams in a distributed system is the focus of much research in recent years. Most of the proposed schemes, however, deal with monitoring simple aggregated values, such as the frequency of appearance of items in the streams. More involved challenges, such as the important task of feature selection (e.g., by monitoring the information gain of various features), still require very high communication overhead using naive, centralized algorithms. We present a novel geometric approach by which an arbitrary global monitoring task can be split into a set of constraints applied locally on each of the streams. The constraints are used to locally filter out data increments that do not affect the monitoring outcome, thus avoiding unnecessary communication. As a result, our approach enables monitoring of arbitrary threshold functions over distributed data streams in an efficient manner. We present experimental results on real-world data which demonstrate that our algorithms are highly scalable, and considerably reduce communication load in comparison to centralized algorithms. 1.
Network-Aware Query Processing for Stream-based Applications
, 2004
"... This paper investigates the benefits of network awareness when processing queries in widelydistributed environments such as the Internet. ..."
Abstract
-
Cited by 36 (0 self)
- Add to MetaCart
This paper investigates the benefits of network awareness when processing queries in widelydistributed environments such as the Internet.
Continuous monitoring of top-k queries over sliding windows
- In SIGMOD
, 2006
"... Given a dataset P and a preference function f, atop-k query retrieves the k tuples in P with the highest scores according to f. Even though the problem is well-studied in conventional databases, the existing methods are inapplicable to highly dynamic environments involving numerous longrunning queri ..."
Abstract
-
Cited by 35 (3 self)
- Add to MetaCart
Given a dataset P and a preference function f, atop-k query retrieves the k tuples in P with the highest scores according to f. Even though the problem is well-studied in conventional databases, the existing methods are inapplicable to highly dynamic environments involving numerous longrunning queries. This paper studies continuous monitoring of top-k queries over a fixed-size window W of the most recent data. The window size can be expressed either in terms of the number of active tuples or time units. We propose a general methodology for top-k monitoring that restricts processing to the sub-domains of the workspace that influence the result of some query. To cope with high stream rates and provide fast answers in an on-line fashion, the data in W reside in main memory. The valid records are indexed by a grid structure, which also maintains book-keeping information. We present two processing techniques: the first one computes the new answer of a query whenever some of the current top-k points expire; the second one partially precomputes the future changes in the result, achieving better running time at the expense of slightly higher space requirements. We analyze the performance of both algorithms and evaluate their efficiency through extensive experiments. Finally, we extend the proposed framework to other query types and a different data stream model. 1.
Efficient Computation of Frequent and Top-k Elements in Data Streams
- IN ICDT
, 2005
"... We propose an approximate integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and top-k elements with tight guarantees on errors. For ..."
Abstract
-
Cited by 35 (6 self)
- Add to MetaCart
We propose an approximate integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and top-k elements with tight guarantees on errors. For general data distributions, our top-k algorithm returns k elements that have roughly the highest frequencies; and it uses limited space for calculating frequent elements. For realistic Zipfian data, the space requirement of the proposed algorithm for solving the exact frequent elements problem decreases dramatically with the parameter of the distribution; and for top-k queries, the analysis ensures that only the top-k elements, in the correct order, are reported. The experiments, using real and synthetic data sets, show space reductions with no loss in accuracy. Having proved the effectiveness of the proposed approach through both analysis and experiments, we extend it to be able to answer continuous queries about frequent and top-k elements. Although the problems of incremental reporting of frequent and top-k elements are useful in many applications, to the best of our knowledge, no solution has been proposed.
A sampling-based approach to optimizing top-k queries in sensor networks
- In ICDE
, 2006
"... Wireless sensor networks generate a vast amount of data. This data, however, must be sparingly extracted to conserve energy, usually the most precious resource in battery-powered sensors. When approximation is acceptable, a model-driven approach to query processing is effective in saving energy by a ..."
Abstract
-
Cited by 31 (3 self)
- Add to MetaCart
Wireless sensor networks generate a vast amount of data. This data, however, must be sparingly extracted to conserve energy, usually the most precious resource in battery-powered sensors. When approximation is acceptable, a model-driven approach to query processing is effective in saving energy by avoiding contacting nodes whose values can be predicted or are unlikely to be in the result set. However, to optimize queries such as topk, reasoning directly with models of joint probability distributions can be prohibitively expensive. Instead of using models explicitly, we propose to use samples of past sensor readings. Not only are such samples simple to maintain, but they are also computationally efficient to use in query optimization. With these samples, we can formulate the problem of optimizing approximate topk queries under an energy constraint as a linear program. We demonstrate the power and flexibility of our sampling-based approach by developing a series of top-k query planning algorithms with linear programming, which are capable of efficiently producing plans with better performance and novel features. We show that our approach is both theoretically sound and practically effective on simulated and real-world datasets. 1

