Approximate aggregation techniques for sensor databases
In ICDE, 2004
Cited by 234 (5 self)
In the emerging area of sensor-based systems, a significant challenge is to develop scalable, fault-tolerant methods to extract useful information from the data the sensors collect. An approach to this data management problem is the use of sensor database systems, exemplified by TinyDB and Cougar, which allow users to perform aggregation queries such as MIN, COUNT and AVG on a sensor network. Due to power and range constraints, centralized approaches are generally impractical, so most systems use in-network aggregation to reduce network traffic. Also, aggregation strategies must provide fault-tolerance to address the issues of packet loss and node failures inherent in such a system. An unfortunate consequence of standard methods is that they typically introduce duplicate values, which must be accounted for to compute aggregates correctly. Another consequence of loss in the network is that exact aggregation is not possible in general. With this in mind, we investigate the use of approximate in-network aggregation using small sketches. Our contributions are as follows: 1) we generalize well known duplicate-insensitive sketches for approximating COUNT to handle SUM (and by extension, AVG and other aggregates), 2) we present and analyze methods for using sketches to produce accurate results with low communication and computation overhead (even on low-powered CPUs with little storage and no floating point operations), and 3) we present an extensive experimental validation of our methods.
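To make the idea of a duplicate-insensitive COUNT sketch concrete, the following is a minimal illustration in the style of the classic Flajolet-Martin bitmap sketch. It is an assumption that this is representative of the "well known" sketches the abstract refers to; the abstract does not name them, and this is not the paper's SUM generalization. The names `make_sketch`, `merge`, `estimate`, and the parameters `NUM_BITS` and `num_maps` are hypothetical choices made for this example.

```python
import hashlib

NUM_BITS = 32      # width of each bitmap (assumed for this example)
PHI = 0.77351      # Flajolet-Martin bias-correction constant


def _hash(item, seed):
    """Deterministic 64-bit hash of (seed, item)."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def _lowest_set_bit(x):
    """Index of the least-significant 1 bit, capped at NUM_BITS - 1."""
    if x == 0:
        return NUM_BITS - 1
    return min((x & -x).bit_length() - 1, NUM_BITS - 1)


def make_sketch(items, num_maps=64):
    """Build num_maps bitmaps; each item sets one bit in every bitmap."""
    sketch = [0] * num_maps
    for item in items:
        for m in range(num_maps):
            sketch[m] |= 1 << _lowest_set_bit(_hash(item, m))
    return sketch


def merge(a, b):
    """Combine two sketches with bitwise OR, so duplicates have no effect."""
    return [x | y for x, y in zip(a, b)]


def estimate(sketch):
    """Estimate the distinct count from the mean position of the lowest 0 bit."""
    total = 0
    for bitmap in sketch:
        r = 0
        while bitmap & (1 << r):
            r += 1
        total += r
    return 2 ** (total / len(sketch)) / PHI
```

Because merging is a bitwise OR, a reading that reaches the sink along several paths, or is retransmitted after a loss, leaves the sketch unchanged. That property is what makes this kind of summary usable with fault-tolerant multipath routing.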
SIA: Secure Information Aggregation in Sensor Networks
2003
Cited by 175 (11 self)
Sensor networks promise viable solutions to many monitoring problems. However, the practical deployment of sensor networks faces many challenges imposed by real-world demands. Sensor nodes often have limited computation and communication resources and battery power. Moreover, in many applications sensors are deployed in open environments, and hence are vulnerable to physical attacks, potentially compromising the sensor's cryptographic keys. One of the basic and indispensable functionalities of sensor networks is the ability to answer queries over the data acquired by the sensors. The resource constraints and security issues make designing mechanisms for information aggregation in large sensor networks particularly challenging.
Tributaries and deltas: Efficient and robust aggregation in sensor network streams
In SIGMOD, 2005
Cited by 89 (2 self)
Existing energy-efficient approaches to in-network aggregation in sensor networks can be classified into two categories, tree-based and multipath-based, with each having unique strengths and weaknesses. In this paper, we introduce Tributary-Delta, a novel approach that combines the advantages of the tree and multipath approaches by running them simultaneously in different regions of the network. We present schemes for adjusting the regions in response to changes in network conditions, and show how many useful aggregates can be readily computed within this new framework. We then show how a difficult aggregate for this context, finding frequent items, can be efficiently computed within the framework. To this end, we devise the first algorithm for frequent items (and for quantiles) that provably minimizes the worst case total communication for non-regular trees. In addition, we give a multipath algorithm for frequent items that is considerably more accurate than previous approaches. These algorithms form the basis for our efficient Tributary-Delta frequent items algorithm. Through extensive simulation with real-world and synthetic data, we show the significant advantages of our techniques. For example, in computing Count under realistic loss rates, our techniques reduce answer error by up to a factor of 3 compared to any previous technique.
Optimal space lower bounds for all frequency moments
In SODA, 2004
Cited by 60 (12 self)
We prove that any one-pass streaming algorithm which (ε, δ)-approximates the kth frequency moment Fk, for any real k ≠ 1 and any ε = Ω(1/√m), must use Ω(1/ε²) bits of space, where m is the size of the universe. This is optimal in terms of ε, resolves the open questions of Bar-Yossef et al in [3, 4], and extends the Ω(1/ε²) lower bound for F0 in [11] to much smaller ε by applying novel techniques. Along the way we lower bound the one-way communication complexity of approximating the Hamming distance and the number of bipartite graphs with minimum/maximum degree constraints.
1 Introduction
Computing statistics on massive data sets is increasingly important these days. Advances in communication and storage technology enable large bodies of raw data to be generated daily, and consequently, there is a rising demand to process this data efficiently. Since it is impractical for an algorithm to store even a small fraction of the data stream, its performance is typically measured by the amount of space it uses. In many scenarios, such as internet routing, once a stream element is examined it is lost forever unless explicitly saved by the processing algorithm. This, along with the sheer size of the data, makes multiple passes over the data infeasible. In this paper we restrict our attention to one-pass streaming algorithms and we investigate their space complexity. Let a =
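As a reference point for the quantity being approximated in the entry above, the kth frequency moment of a stream is the sum of the kth powers of the item frequencies. The sketch below computes it exactly by storing every distinct item, which is the space cost that streaming algorithms try to avoid. The function name `frequency_moment` is a hypothetical choice for this illustration, and nothing here reflects the paper's lower-bound argument.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact F_k: sum over distinct items of (frequency ** k).

    Uses memory proportional to the number of distinct items, so it is a
    correctness baseline rather than a small-space streaming algorithm.
    """
    counts = Counter(stream)
    return sum(c ** k for c in counts.values())
```

With this definition, `frequency_moment(stream, 0)` is the number of distinct items and `frequency_moment(stream, 1)` is the stream length, which is why k = 1 is trivial to compute with a single counter.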
Distributed streams algorithms for sliding windows
In Proc. ACM Symp. on Parallel Algorithms and Architectures (SPAA), 2002
Cited by 57 (11 self)
Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate functions over a “sliding window” of the N most recent data items in one or more streams. Our results include: 1. For a single stream, we present the first ε-approximation scheme for the number of 1’s in a sliding window that is optimal in both worst case time and space. We also present the first ε-approximation scheme for the sum of integers in [0..R] in a sliding window that is optimal in both worst case time and space (assuming R is at most polynomial in N). Both algorithms are deterministic and use only logarithmic memory words. 2. In contrast, we show that any deterministic algorithm that estimates, to within a small constant relative error, the number of 1’s (or the sum of integers) in a sliding window on the union of distributed streams requires Ω(N) space.
Dremel: Interactive Analysis of Web-Scale Datasets
Cited by 56 (1 self)
Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.
Streaming and sublinear approximation of entropy and information distances
In ACM-SIAM Symposium on Discrete Algorithms, 2006
Cited by 55 (13 self)
In most algorithmic applications which compare two distributions, information theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear time property testing algorithms for entropy and various information theoretic distances. Batu et al posed the problem of property testing with respect to the Jensen-Shannon distance. We present optimal algorithms for estimating bounded, symmetric f-divergences (including the Jensen-Shannon divergence and the Hellinger distance) between distributions in various property testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant factor approximation scheme. An integral part of the algorithm is an interesting use of an F0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation tradeoff curve. Our results have interesting structural implications that connect sublinear time and space constrained algorithms. The mediating model is the random order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property testing algorithm in the combined oracle model for calculating permutation-invariant functions can be simulated in the random order model in a single pass. This addresses a question raised by Feigenbaum et al regarding the relationship between property testing and stream algorithms. Further, we give a polylog-space PTAS for estimating the entropy of a one-pass random order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model.
A near-optimal algorithm for computing the entropy of a stream
In ACM-SIAM Symposium on Discrete Algorithms, 2007
Cited by 54 (20 self)
We describe a simple algorithm for approximating the empirical entropy of a stream of m values in a single pass, using O(ε⁻² log(δ⁻¹) log m) words of space. Our algorithm is based upon a novel extension of a method introduced by Alon, Matias, and Szegedy [1]. We show a space lower bound of Ω(ε⁻² / log(ε⁻¹)), meaning that our algorithm is near optimal in terms of its dependency on ε. This improves over previous work on this problem [8, 13, 17, 5]. We show that generalizing to kth order entropy requires close to linear space for all k ≥ 1, and give additive approximations using our algorithm. Lastly, we show how to compute a multiplicative approximation to the entropy of a random walk on an undirected graph.
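For orientation, the empirical entropy that the entry above approximates can be computed exactly when the full frequency table fits in memory. The sketch below is that exact baseline, not the single-pass small-space algorithm from the paper. The function name `empirical_entropy` is hypothetical, and the use of base-2 logarithms is an assumption made for this example.

```python
import math
from collections import Counter


def empirical_entropy(stream):
    """Exact empirical entropy: sum over items of (c/m) * log2(m/c).

    Here c is an item's frequency and m is the stream length. Memory grows
    with the number of distinct items, unlike a streaming approximation.
    """
    counts = Counter(stream)
    m = sum(counts.values())
    return sum((c / m) * math.log2(m / c) for c in counts.values())
```

A stream in which every value is identical has entropy 0, and a stream split evenly between two values has entropy 1 bit, which gives a quick sanity check for any approximate implementation.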
New Streaming Algorithms for Fast Detection of Superspreaders
In Proceedings of Network and Distributed System Security Symposium (NDSS), 2005
Cited by 50 (2 self)
High-speed monitoring of Internet traffic is an important and challenging problem, with applications to real-time attack detection and mitigation, traffic engineering, etc. However, packet-level monitoring requires fast streaming algorithms that use very little memory and little communication among collaborating network monitoring points. In this paper, we consider the problem of detecting superspreaders, which are sources that connect to a large number of distinct destinations. We propose new streaming algorithms for detecting superspreaders and prove guarantees on their accuracy and memory requirements. We also show experimental results on real network traces. Our algorithms are substantially more efficient (both theoretically and experimentally) than previous approaches. We also extend our algorithms to identify superspreaders in a distributed setting, with sliding windows, and when deletions are allowed in the stream (which lets us identify sources that make a large number of failed connections to distinct destinations). More generally, our algorithms are applicable to any problem that can be formulated as follows: given a stream of (x, y) pairs, find all the x’s that are paired with a large number of distinct y’s. We call this the heavy distinct-hitters problem. There are many network security applications of this general problem. This paper discusses these applications and, for concreteness, focuses on the superspreader problem.
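The heavy distinct-hitters formulation in the entry above can be stated as a short exact baseline: group the stream by x and count distinct y values. This version stores every distinct (x, y) pair, which is the memory cost that the paper's streaming algorithms are designed to avoid, so it is a specification of the problem rather than the proposed method. The function name `heavy_distinct_hitters` and the threshold parameter `k` are hypothetical.

```python
from collections import defaultdict


def heavy_distinct_hitters(pairs, k):
    """Return every x that is paired with more than k distinct y values.

    Exact but memory-hungry: keeps one set of destinations per source.
    """
    seen = defaultdict(set)
    for x, y in pairs:
        seen[x].add(y)          # repeated (x, y) pairs are counted once
    return {x for x, ys in seen.items() if len(ys) > k}
```

For the superspreader case, x would be a source address and y a destination address, so a source that contacts the same destination many times is not flagged, while one that contacts many different destinations is.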
Computing separable functions via gossip
In Proceedings of the Twenty-Fifth Annual ACM Symposium on Principles of Distributed Computing (PODC), 2006
Cited by 48 (6 self)
Motivated by applications to sensor, peer-to-peer, and ad-hoc networks, we study the problem of computing functions of values at the nodes in a network in a totally distributed manner. In particular, we consider separable functions, which can be written as linear combinations or products of functions of individual variables. The main contribution of this paper is the design of a distributed algorithm for computing separable functions based on properties of exponential random variables. We bound the running time of our algorithm in terms of the running time of an information spreading algorithm used as a subroutine by the algorithm. Since we are interested in totally distributed algorithms, we consider a randomized gossip mechanism for information spreading as the subroutine. Combining these algorithms yields a complete and simple distributed algorithm for computing separable functions. The second contribution of this paper is a characterization of the information spreading time of the gossip algorithm, and therefore the computation time for separable functions, in terms of the conductance of an appropriate stochastic matrix. Specifically, we find that for a class of graphs with small spectral gap, this time is of a smaller order than the time required to compute averages for a known iterative gossip scheme [4].
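One property of exponential random variables that fits the description above is that the minimum of independent exponentials with rates x_1, ..., x_n is itself exponential with rate x_1 + ... + x_n. The sketch below uses that fact to estimate a sum from minima alone. It is an assumption that this is the property the paper relies on, and the gossip step is replaced here by a direct `min` over all nodes. The function name `estimate_sum` and the parameters `r` and `seed` are hypothetical.

```python
import random


def estimate_sum(values, r=2000, seed=0):
    """Estimate sum(values) for positive values using minima of exponentials.

    Each node i draws r independent Exp(rate=values[i]) samples. For each of
    the r rounds only the network-wide minimum is needed, and a minimum can
    be spread by gossip. Here min() stands in for that spreading step.
    """
    rng = random.Random(seed)
    minima = [min(rng.expovariate(x) for x in values) for _ in range(r)]
    # The r minima are i.i.d. Exp(rate=S) with S = sum(values), so their sum
    # is Gamma(r, S) distributed and (r - 1) / sum is an unbiased estimate of S.
    return (r - 1) / sum(minima)
```

Taking a minimum is idempotent and order-independent, so nodes can exchange and combine partial minima in any order without double counting. Increasing `r` tightens the estimate at the cost of more values per message.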