Results 1-10 of 157
Taming the Underlying Challenges of Reliable Multihop Routing in Sensor Networks
In SenSys, 2003
Cited by 663 (20 self)
The dynamic and lossy nature of wireless communication poses major challenges to reliable, self-organizing multihop networks. These non-ideal characteristics are more problematic with the primitive, low-power radio transceivers found in sensor networks, and raise new issues that routing protocols must address. Link connectivity statistics should be captured dynamically through an efficient yet adaptive link estimator, and routing decisions should exploit such connectivity statistics to achieve reliability. Link status and routing information must be maintained in a neighborhood table with constant space regardless of cell density. We study and evaluate link estimator, neighborhood table management, and reliable routing protocol techniques. We focus on a many-to-one, periodic data collection workload. We narrow the design space through evaluations ranging from large-scale, high-level simulations to 50-node, in-depth empirical experiments. The most effective solution uses a simple time-averaged EWMA estimator, frequency-based table management, and cost-based routing.
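As a rough illustration of the time-averaged EWMA estimator mentioned in the abstract above, the sketch below averages packet reception over a fixed window and then smooths the per-window success rates exponentially. The class name, the `alpha` and `window` parameters, and their default values are assumptions made for this example; the abstract does not specify them.

```python
class EwmaLinkEstimator:
    """Hypothetical windowed EWMA estimator of link reception quality."""

    def __init__(self, alpha=0.6, window=30):
        self.alpha = alpha      # weight given to the previous estimate (assumed value)
        self.window = window    # expected packets per averaging window (assumed value)
        self.received = 0       # packets received in the current window
        self.expected = 0       # packets expected in the current window
        self.estimate = None    # no estimate until the first window completes

    def record(self, was_received):
        """Record one expected packet; return the current estimate (or None)."""
        self.expected += 1
        if was_received:
            self.received += 1
        if self.expected == self.window:
            sample = self.received / self.expected
            if self.estimate is None:
                self.estimate = sample
            else:
                self.estimate = (self.alpha * self.estimate
                                 + (1 - self.alpha) * sample)
            # Start a fresh window.
            self.received = 0
            self.expected = 0
        return self.estimate
```

A larger `alpha` makes the estimate more stable but slower to react to a link that degrades; a routing layer would compare such estimates across neighbors when choosing a parent.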
What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically
2003
Cited by 174 (13 self)
Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items” in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in networking applications. We present a new algorithm for dynamically determining the hot items at any time in a relation that is undergoing deletion operations as well as inserts. Our algorithm maintains a small-space data structure that monitors the transactions on the relation and, when required, quickly outputs all hot items without rescanning the relation in the database. With user-specified probability, it is able to report all hot items. Our algorithm relies on the idea of “group testing”, is simple to implement, and has provable quality, space, and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithm is remarkably accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
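To give a feel for how group testing can survive deletions, the sketch below handles only the simplest special case: a single item that makes up a strict majority of the relation. It is a simplified, deterministic illustration and not the paper's full randomized algorithm. It keeps one counter for the total size and one counter per bit position; the class name, the `bits` parameter, and the assumption that items are non-negative integers below `2**bits` are all introduced here for the example.

```python
class MajorityTracker:
    """Hypothetical tracker of a strict-majority item under inserts and deletes."""

    def __init__(self, bits=16):
        self.bits = bits
        self.total = 0                  # current number of items
        self.bit_counts = [0] * bits    # items whose j-th bit is 1

    def _update(self, item, delta):
        self.total += delta
        for j in range(self.bits):
            if (item >> j) & 1:
                self.bit_counts[j] += delta

    def insert(self, item):
        self._update(item, 1)

    def delete(self, item):
        # Assumes the item is currently present in the relation.
        self._update(item, -1)

    def candidate(self):
        """Return the majority item if one exists; otherwise the value is arbitrary."""
        if self.total <= 0:
            return None
        value = 0
        for j in range(self.bits):
            # Each bit position is one "test": the group with bit j set
            # holds more than half the items only if the majority item is in it.
            if 2 * self.bit_counts[j] > self.total:
                value |= 1 << j
        return value
```

Because every counter is updated symmetrically, a deletion simply undoes an insertion, which is what counter-per-item summaries that discard small counts cannot do. The returned candidate is only meaningful when a strict majority item actually exists.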
A Simple Algorithm For Finding Frequent Elements In Streams And Bags
2003
Cited by 148 (0 self)
We present a simple, exact algorithm for identifying in a multiset the items with frequency more than a threshold θ. The algorithm requires two passes, linear time, and space 1/θ. The first pass is an online algorithm, generalizing a well-known algorithm for finding a majority element, for identifying a set of at most 1/θ items that includes, possibly among others, all items with frequency greater than θ.
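The two-pass structure described above can be sketched as follows. This is a minimal illustration in the spirit of the abstract (a counter-based generalization of majority voting, followed by an exact recount), not a transcription of the paper's pseudocode. The function name and the choice of `floor(1/θ)` counters are assumptions for this example, and θ is interpreted as a fraction of the input length.

```python
import math
from collections import Counter


def frequent_items(items, theta):
    """Return {item: count} for items occurring more than theta * len(items) times.

    `items` must be a sequence, because it is scanned twice.
    """
    if not 0 < theta < 1:
        raise ValueError("theta must be in (0, 1)")
    max_counters = math.floor(1 / theta)

    # Pass 1: keep at most max_counters candidates. When a new item arrives
    # and the table is full, decrement every counter and drop those at zero.
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < max_counters:
            counters[x] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]

    # Pass 2: count the surviving candidates exactly and filter out
    # false positives.
    exact = Counter(x for x in items if x in counters)
    threshold = theta * len(items)
    return {x: c for x, c in exact.items() if c > threshold}
```

The first pass can keep items that are not actually frequent, which is why the second pass is needed for an exact answer. With θ = 1/2 and one counter the first pass reduces to the classic majority-vote procedure.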
Issues in Data Stream Management
2003
Cited by 137 (6 self)
Traditional databases store sets of relatively static records with no predefined notion of time, unless timestamp attributes are explicitly added. While this model adequately represents commercial catalogues or repositories of personal information, many current and emerging applications require support for online analysis of rapidly changing data streams. Limitations of traditional DBMSs in supporting streaming applications have been recognized, prompting research to augment existing technologies and build new systems to manage streaming data. The purpose of this paper is to review recent work in data stream management systems, with an emphasis on application requirements, data models, continuous query languages, and query evaluation.
Mining Frequent Patterns in Data Streams at Multiple Time Granularities
2002
Cited by 103 (5 self)
Although frequent-pattern mining has been widely studied and used, it is challenging to extend it to data streams. Compared to mining from a static transaction data set, the streaming case has far more information to track and far greater complexity to manage. Infrequent items can become frequent later on and hence cannot be ignored. The storage structure needs to be dynamically adjusted to reflect the evolution of itemset frequencies over time.
Approximate Counts and Quantiles over Sliding Windows
Proc. of ACM PODS Symp., 2004
Cited by 87 (1 self)
We consider the problem of maintaining approximate counts and quantiles over fixed- and variable-size sliding windows in limited space. For quantiles, we present deterministic algorithms whose space requirements are O((1/ε) log(1/ε) log N) and O(
Finding (recently) frequent items in distributed data streams
Cited by 66 (2 self)
We consider the problem of maintaining frequency counts for items occurring frequently in the union of multiple distributed data streams. Naïve methods of combining approximate frequency counts from multiple nodes tend to result in excessively large data structures that are costly to transfer among nodes. To minimize communication requirements, the degree of precision maintained by each node while counting item frequencies must be managed carefully. We introduce the concept of a precision gradient for managing precision when nodes are arranged in a hierarchical communication structure. We then study the optimization problem of how to set the precision gradient so as to minimize communication, and provide optimal solutions that minimize worst-case communication load over all possible inputs. We then introduce a variant designed to perform well in practice, with input data that does not conform to worst-case characteristics. We verify the effectiveness of our approach empirically using real-world data, and show that our methods incur substantially less communication than naïve approaches while providing the same error guarantees on answers.
Space-Code Bloom Filter for Efficient Per-Flow Traffic Measurement
In Proc. IEEE INFOCOM, 2004
Cited by 63 (2 self)
Per-flow traffic measurement is critical for usage accounting, traffic engineering, and anomaly detection. Previous methodologies are either based on random sampling (e.g., Cisco's NetFlow), which is inaccurate, or only account for the "elephants". We introduce a novel technique for measuring per-flow traffic approximately, for all flows regardless of their sizes, at very high speed (say, OC-768). The core of this technique is a novel data structure called Space-Code Bloom Filter (SCBF). An SCBF is an approximate representation of a multiset; each element in this multiset...
Streaming and sublinear approximation of entropy and information distances
In ACM-SIAM Symposium on Discrete Algorithms, 2006
Cited by 55 (13 self)
In most algorithmic applications which compare two distributions, information-theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear-time property testing algorithms for entropy and various information-theoretic distances. Batu et al. posed the problem of property testing with respect to the Jensen-Shannon distance. We present optimal algorithms for estimating bounded, symmetric f-divergences (including the Jensen-Shannon divergence and the Hellinger distance) between distributions in various property testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant-factor approximation scheme. An integral part of the algorithm is an interesting use of an F0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation tradeoff curve. Our results have interesting structural implications that connect sublinear-time and space-constrained algorithms. The mediating model is the random-order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property testing algorithm in the combined oracle model for calculating permutation-invariant functions can be simulated in the random-order model in a single pass. This addresses a question raised by Feigenbaum et al. regarding the relationship between property testing and stream algorithms. Further, we give a polylog-space PTAS for estimating the entropy of a one-pass random-order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model.
Fast and approximate stream mining of quantiles and frequencies using graphics processors
In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. ACM, 2005
Cited by 53 (13 self)
We present algorithms for fast quantile and frequency estimation in large data streams using graphics processors (GPUs). We exploit the high computation power and memory bandwidth of graphics processors and present a new sorting algorithm that performs rasterization operations on the GPUs. We use sorting as the main computational component for histogram approximation and construction of approximate quantile and frequency summaries. Our algorithms for numerical statistics computation on data streams are deterministic, applicable to fixed- or variable-sized sliding windows, and use a limited memory footprint. We use the GPU as a co-processor and minimize the data transmission between the CPU and GPU by taking into account the low bus bandwidth. We implemented our algorithms on a PC with an NVIDIA GeForce FX 6800 Ultra GPU and a 3.4 GHz Pentium IV CPU and applied them to large data streams consisting of more than 100 million values. We also compared the performance of our GPU-based algorithms with optimized implementations of prior CPU-based algorithms. Overall, our results demonstrate that the graphics processors available on a commodity computer system are efficient stream processors and useful co-processors for mining data streams.