Approximate aggregation techniques for sensor databases
In ICDE, 2004
Cited by 192 (5 self)
In the emerging area of sensor-based systems, a significant challenge is to develop scalable, fault-tolerant methods to extract useful information from the data the sensors collect. An approach to this data management problem is the use of sensor database systems, exemplified by TinyDB and Cougar, which allow users to perform aggregation queries such as MIN, COUNT and AVG on a sensor network. Due to power and range constraints, centralized approaches are generally impractical, so most systems use in-network aggregation to reduce network traffic. Also, aggregation strategies must provide fault-tolerance to address the issues of packet loss and node failures inherent in such a system. An unfortunate consequence of standard methods is that they typically introduce duplicate values, which must be accounted for to compute aggregates correctly. Another consequence of loss in the network is that exact aggregation is not possible in general. With this in mind, we investigate the use of approximate in-network aggregation using small sketches. Our contributions are as follows: 1) we generalize well-known duplicate-insensitive sketches for approximating COUNT to handle SUM (and by extension, AVG and other aggregates), 2) we present and analyze methods for using sketches to produce accurate results with low communication and computation overhead (even on low-powered CPUs with little storage and no floating point operations), and 3) we present an extensive experimental validation of our methods.
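To make "duplicate-insensitive" concrete, here is a minimal sketch of a Flajolet-Martin style COUNT sketch, which is one well-known example of this family. It is an illustration under stated assumptions, not the paper's construction: the names `new_sketch`, `insert`, `merge`, and `estimate`, the 32-bit width, the 64 independent bitmaps, and the SHA-256 hash are all choices made here, and the SUM generalization described in the abstract is not shown.

```python
import hashlib

NUM_BITS = 32    # bits per bitmap (assumed width)
PHI = 0.77351    # Flajolet-Martin correction constant

def _rho(item, seed):
    """Index of the lowest set bit in a seeded hash of item."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    if h == 0:
        return NUM_BITS - 1
    return min((h & -h).bit_length() - 1, NUM_BITS - 1)

def new_sketch(copies=64):
    """One bitmap per hash seed; more copies reduce the variance."""
    return [0] * copies

def insert(sketch, item):
    """Set one bit per bitmap; re-inserting the same item changes nothing."""
    return [bm | (1 << _rho(item, seed)) for seed, bm in enumerate(sketch)]

def merge(a, b):
    """Bitwise OR: idempotent, commutative, and associative, so the order
    and multiplicity in which partial sketches arrive do not matter."""
    return [x | y for x, y in zip(a, b)]

def _lowest_zero_bit(bitmap):
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r

def estimate(sketch):
    """Estimate the number of distinct inserted items."""
    mean_r = sum(_lowest_zero_bit(bm) for bm in sketch) / len(sketch)
    return 2 ** mean_r / PHI
```

Because `merge` is a bitwise OR, a reading that reaches the base station along several paths is counted once, which is the property multi-path aggregation relies on.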
SIA: Secure Information Aggregation in Sensor Networks
2003
Cited by 140 (11 self)
Sensor networks promise viable solutions to many monitoring problems. However, the practical deployment of sensor networks faces many challenges imposed by real-world demands. Sensor nodes often have limited computation and communication resources and battery power. Moreover, in many applications sensors are deployed in open environments, and hence are vulnerable to physical attacks, potentially compromising the sensor's cryptographic keys. One of the basic and indispensable functionalities of sensor networks is the ability to answer queries over the data acquired by the sensors. The resource constraints and security issues make designing mechanisms for information aggregation in large sensor networks particularly challenging.
Tributaries and deltas: Efficient and robust aggregation in sensor network streams
In SIGMOD, 2005
Cited by 71 (2 self)
Existing energy-efficient approaches to in-network aggregation in sensor networks can be classified into two categories, tree-based and multi-path-based, with each having unique strengths and weaknesses. In this paper, we introduce Tributary-Delta, a novel approach that combines the advantages of the tree and multi-path approaches by running them simultaneously in different regions of the network. We present schemes for adjusting the regions in response to changes in network conditions, and show how many useful aggregates can be readily computed within this new framework. We then show how a difficult aggregate for this context, finding frequent items, can be efficiently computed within the framework. To this end, we devise the first algorithm for frequent items (and for quantiles) that provably minimizes the worst-case total communication for non-regular trees. In addition, we give a multi-path algorithm for frequent items that is considerably more accurate than previous approaches. These algorithms form the basis for our efficient Tributary-Delta frequent items algorithm. Through extensive simulation with real-world and synthetic data, we show the significant advantages of our techniques. For example, in computing Count under realistic loss rates, our techniques reduce answer error by up to a factor of 3 compared to any previous technique.
Distributed streams algorithms for sliding windows
In Proc. ACM Symp. on Parallel Algorithms and Architectures (SPAA), 2002
Cited by 48 (10 self)
Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate functions over a “sliding window” of the N most recent data items in one or more streams. Our results include:
1. For a single stream, we present the first ε-approximation scheme for the number of 1’s in a sliding window that is optimal in both worst-case time and space. We also present the first ε-approximation scheme for the sum of integers in [0..R] in a sliding window that is optimal in both worst-case time and space (assuming R is at most polynomial in N). Both algorithms are deterministic and use only logarithmic memory words.
2. In contrast, we show that any deterministic algorithm that estimates, to within a small constant relative error, the number of 1’s (or the sum of integers) in a sliding window on the union of distributed streams requires Ω(N) space.
New Streaming Algorithms for Fast Detection of Superspreaders
In Proceedings of the Network and Distributed System Security Symposium (NDSS), 2005
Cited by 45 (2 self)
High-speed monitoring of Internet traffic is an important and challenging problem, with applications to real-time attack detection and mitigation, traffic engineering, etc. However, packet-level monitoring requires fast streaming algorithms that use very little memory and little communication among collaborating network monitoring points. In this paper, we consider the problem of detecting superspreaders, which are sources that connect to a large number of distinct destinations. We propose new streaming algorithms for detecting superspreaders and prove guarantees on their accuracy and memory requirements. We also show experimental results on real network traces. Our algorithms are substantially more efficient (both theoretically and experimentally) than previous approaches. We also extend our algorithms to identify superspreaders in a distributed setting, with sliding windows, and when deletions are allowed in the stream (which lets us identify sources that make a large number of failed connections to distinct destinations). More generally, our algorithms are applicable to any problem that can be formulated as follows: given a stream of (x, y) pairs, find all the x’s that are paired with a large number of distinct y’s. We call this the heavy distinct-hitters problem. There are many network security applications of this general problem. This paper discusses these applications and, for concreteness, focuses on the superspreader problem.
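The heavy distinct-hitters problem stated in this abstract can be pinned down with an exact baseline. The sketch below is only a reference definition, not the paper's streaming algorithm: it stores every distinct (x, y) pair, which is exactly the memory cost a streaming method is meant to avoid. The function name and the `threshold` parameter are assumptions made for illustration.

```python
from collections import defaultdict

def heavy_distinct_hitters(pairs, threshold):
    """Return every x paired with at least `threshold` distinct y values.

    Exact but memory-hungry: keeps one set of y values per x.
    """
    seen = defaultdict(set)
    for x, y in pairs:
        seen[x].add(y)  # repeated (x, y) pairs are absorbed by the set
    return {x for x, ys in seen.items() if len(ys) >= threshold}
```

For the superspreader case, x would be a source address and y a destination address; a source that contacts one destination many times is not reported.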
Optimal space lower bounds for all frequency moments
In SODA, 2004
Cited by 42 (10 self)
We prove that any one-pass streaming algorithm which (ε, δ)-approximates the kth frequency moment Fk, for any real k ≠ 1 and any ε = Ω(1/√m), must use Ω(1/ε²) bits of space, where m is the size of the universe. This is optimal in terms of ε, resolves the open questions of Bar-Yossef et al. in [3, 4], and extends the Ω(1/ε²) lower bound for F0 in [11] to much smaller ε by applying novel techniques. Along the way we lower bound the one-way communication complexity of approximating the Hamming distance and the number of bipartite graphs with minimum/maximum degree constraints.
1 Introduction
Computing statistics on massive data sets is increasingly important these days. Advances in communication and storage technology enable large bodies of raw data to be generated daily, and consequently, there is a rising demand to process this data efficiently. Since it is impractical for an algorithm to store even a small fraction of the data stream, its performance is typically measured by the amount of space it uses. In many scenarios, such as internet routing, once a stream element is examined it is lost forever unless explicitly saved by the processing algorithm. This, along with the sheer size of the data, makes multiple passes over the data infeasible. In this paper we restrict our attention to one-pass streaming algorithms and we investigate their space complexity. Let a =
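For reference, the kth frequency moment of a stream is the sum, over the distinct items, of each item's occurrence count raised to the power k, so F0 is the number of distinct items and F1 is the stream length. The exact computation below is a baseline for that definition only; it uses space linear in the number of distinct items, which is the cost the bounds in this entry are about. The function name is an assumption.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment: sum of count**k over distinct items."""
    counts = Counter(stream)
    if k == 0:
        return len(counts)  # F0: number of distinct items
    return sum(c ** k for c in counts.values())
```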
A near-optimal algorithm for computing the entropy of a stream
In ACM-SIAM Symposium on Discrete Algorithms, 2007
Cited by 36 (17 self)
We describe a simple algorithm for approximating the empirical entropy of a stream of m values in a single pass, using O(ε⁻² log(δ⁻¹) log m) words of space. Our algorithm is based upon a novel extension of a method introduced by Alon, Matias, and Szegedy [1]. We show a space lower bound of Ω(ε⁻² / log(ε⁻¹)), meaning that our algorithm is near optimal in terms of its dependency on ε. This improves over previous work on this problem [8, 13, 17, 5]. We show that generalizing to kth order entropy requires close to linear space for all k ≥ 1, and give additive approximations using our algorithm. Lastly, we show how to compute a multiplicative approximation to the entropy of a random walk on an undirected graph.
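The quantity being approximated, the empirical entropy of a stream of m values, is the sum over distinct items of (m_i/m) log(m/m_i), where m_i is the number of occurrences of item i. The exact computation below is a baseline for that definition and not the paper's single-pass algorithm; the function name and the choice of base-2 logarithms are assumptions.

```python
import math
from collections import Counter

def empirical_entropy(stream):
    """Exact empirical entropy in bits; needs one counter per distinct item."""
    counts = Counter(stream)
    m = sum(counts.values())
    return sum((c / m) * math.log2(m / c) for c in counts.values())
```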
Streaming and sublinear approximation of entropy and information distances
In ACM-SIAM Symposium on Discrete Algorithms, 2006
Cited by 33 (9 self)
In most algorithmic applications which compare two distributions, information theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear time property testing algorithms for entropy and various information theoretic distances. Batu et al. posed the problem of property testing with respect to the Jensen-Shannon distance. We present optimal algorithms for estimating bounded, symmetric f-divergences (including the Jensen-Shannon divergence and the Hellinger distance) between distributions in various property testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant factor approximation scheme. An integral part of the algorithm is an interesting use of an F0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation tradeoff curve. Our results have interesting structural implications that connect sublinear time and space constrained algorithms. The mediating model is the random order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property testing algorithm in the combined oracle model for calculating a permutation-invariant function can be simulated in the random order model in a single pass. This addresses a question raised by Feigenbaum et al. regarding the relationship between property testing and stream algorithms. Further, we give a polylog-space PTAS for estimating the entropy of a one-pass random order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model.
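The two distances named in this abstract can be computed exactly when both distributions are fully known, which gives a reference point for what the estimation algorithms target. The sketch below assumes distributions given as equal-length lists of probabilities, base-2 logarithms, and the Hellinger normalization that keeps the distance in [0, 1]; other normalizations are in use, and the function names are illustrative.

```python
import math

def _kl(p, q):
    """Kullback-Leibler divergence in bits, treating 0 * log 0 as 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: average KL divergence to the midpoint."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, mid) + 0.5 * _kl(q, mid)

def hellinger(p, q):
    """Hellinger distance, normalized to lie in [0, 1]."""
    total = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(0.5 * total)
```

Both functions are symmetric in their arguments and bounded, unlike the KL divergence they are built from.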
Tight lower bounds for the distinct elements problem
In FOCS, 2003
Cited by 31 (9 self)
We prove strong lower bounds for the space complexity of (ε, δ)-approximating the number of distinct elements F0 in a data stream. Let m be the size of the universe from which the stream elements are drawn. We show that any one-pass streaming algorithm for (ε, δ)-approximating F0 must use Ω(1/ε²) space when ε = Ω(m^(−1/(9+k))), for any k > 0, improving upon the known lower bound of Ω(1/ε) for this range of ε. This lower bound is tight up to a factor of log log m. Our lower bound is derived from a reduction from the one-way communication complexity of approximating a boolean function in Euclidean space. The reduction makes use of a low-distortion embedding from an ℓ2 to an ℓ1 norm.
Algorithms for Distributed Functional Monitoring
2008
Cited by 25 (11 self)
We study what we call functional monitoring problems. We have k players each tracking their inputs, say player i tracking a multiset Ai(t) up until time t, and communicating with a central coordinator. The coordinator’s task is to monitor a given function f computed over the union of the inputs ∪iAi(t), continuously at all times t. The goal is to minimize the number of bits communicated between the players and the coordinator. A simple example is when f is the sum, and the coordinator is required to alert when the sum of a distributed set of values exceeds a given threshold τ. Of interest is the approximate version where the coordinator outputs 1 if f ≥ τ and 0 if f ≤ (1 − ε)τ. This defines the (k, f, τ, ε) distributed, functional monitoring problem. Functional monitoring problems are fundamental in distributed systems, in particular sensor networks, where we must minimize communication; they also connect to problems in communication complexity, communication theory, and signal processing. Yet few formal bounds are known for functional monitoring. We give upper and lower bounds for the (k, f, τ, ε) problem for some of the basic f’s. In particular, we study frequency moments (F0, F1, F2). For F0 and F1, we obtain continuously monitoring algorithms with costs almost the same as their one-shot computation algorithms. However, for F2 the monitoring problem seems much harder. We give a carefully constructed multi-round algorithm that uses “sketch summaries” at multiple levels of detail and solves the (k, F2, τ, ε) problem with communication Õ(k²/ε + (√k/ε)³). Since frequency moment estimation is central to other problems, our results have immediate applications to histograms, wavelet computations, and others. Our algorithmic techniques are likely to be useful for other functional monitoring problems as well.
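The sum example in this abstract can be illustrated with a simple batching protocol: each player reports to the coordinator only after accumulating a fixed block of new items, and the coordinator alerts once the reported total exceeds (1 − ε)τ. This is a textbook-style baseline sketched under assumptions, not one of the paper's algorithms; the class names, the block size ⌊ετ/k⌋, and the use of unit-weight items are choices made here.

```python
from fractions import Fraction

class Coordinator:
    def __init__(self, k, tau, eps):
        eps = Fraction(eps)
        # Each site withholds fewer than `block` items, so the k sites
        # together withhold fewer than eps * tau items at any time.
        self.block = max(1, int(eps * tau / k))
        self.limit = (1 - eps) * tau
        self.reported = 0
        self.messages = 0

    def signal(self):
        """Called by a site each time it has seen `block` new items."""
        self.reported += self.block
        self.messages += 1

    def alert(self):
        """True whenever the sum is >= tau; False whenever it is <= (1 - eps) * tau."""
        return self.reported > self.limit

class Site:
    def __init__(self, coordinator):
        self.coordinator = coordinator
        self.pending = 0

    def observe(self, count=1):
        """Record `count` new unit items, signalling once per full block."""
        self.pending += count
        while self.pending >= self.coordinator.block:
            self.pending -= self.coordinator.block
            self.coordinator.signal()
```

The reported total never exceeds the true sum and trails it by less than ετ, which is why a strict comparison against (1 − ε)τ satisfies both output conditions. The message count is at most the true sum divided by the block size, rather than one message per item.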

