Results 1 - 10 of 19
Optimal tracking of distributed heavy hitters and quantiles
- In PODS, 2009
"... We consider the the problem of tracking heavy hitters and quantiles in the distributed streaming model. The heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let A be a multiset of elements, drawn from the universe U = {1,..., u}. For a given 0 ≤ φ ≤ 1, ..."
Abstract
-
Cited by 24 (9 self)
- Add to MetaCart
(Show Context)
We consider the problem of tracking heavy hitters and quantiles in the distributed streaming model. The heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let A be a multiset of elements, drawn from the universe U = {1, ..., u}. For a given 0 ≤ φ ≤ 1, the φ-heavy hitters are those elements of A whose frequency in A is at least φ|A|; the φ-quantile of A is an element x of U such that at most φ|A| elements of A are smaller than x and at most (1 − φ)|A| elements of A are greater than x. Suppose the elements of A are received at k remote sites over time, and each of the sites has a two-way communication channel to a designated coordinator, whose goal is to track the set of φ-heavy hitters and the φ-quantile of A approximately at all times with minimum communication. We give tracking algorithms with worst-case communication cost O((k/ε) log n) for both problems, where n is the total number of items in A and ε is the approximation error. This substantially improves upon the previously known algorithms. We also give matching lower bounds on the communication costs for both problems, showing that our algorithms are optimal. We also consider a more general version of the problem where we simultaneously track the φ-quantiles for all 0 ≤ φ ≤ 1.
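The definitions above are easy to state in code. The following minimal sketch (the function names are ours, not the paper's) computes the exact φ-heavy hitters and a φ-quantile of a multiset, i.e., the offline statistics that the distributed protocol tracks approximately:

```python
from collections import Counter

def heavy_hitters(A, phi):
    """phi-heavy hitters of multiset A: elements with frequency >= phi * |A|."""
    counts = Counter(A)
    threshold = phi * len(A)
    return {x for x, c in counts.items() if c >= threshold}

def quantile(A, phi):
    """A phi-quantile of A: an element x with at most phi*|A| elements of A
    smaller than x and at most (1 - phi)*|A| elements greater than x."""
    s = sorted(A)
    return s[min(int(phi * len(s)), len(s) - 1)]

A = [1, 2, 2, 3, 3, 3, 5, 5, 5, 5]
print(heavy_hitters(A, 0.3))  # {3, 5}: each appears in >= 30% of A
print(quantile(A, 0.5))       # 3, a valid median under the definition
```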
Mergeable Summaries
"... We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means t ..."
Abstract
-
Cited by 22 (7 self)
- Add to MetaCart
(Show Context)
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means that the summaries can be merged in the same way as other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the data sets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^{3/2}(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations and ε-kernels. We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
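As a concrete illustration of the merge operation for one of the summaries discussed, here is a sketch of merging two Misra-Gries (MG) heavy-hitter summaries, each kept as a plain dict of at most k counters (the dict representation and function name are our assumptions; the paper is what establishes that a merge of this form preserves the error guarantee):

```python
def mg_merge(s1, s2, k):
    """Merge two Misra-Gries summaries with at most k counters each
    into one summary with at most k counters."""
    merged = dict(s1)
    for item, count in s2.items():
        merged[item] = merged.get(item, 0) + count
    if len(merged) <= k:
        return merged
    # Subtract the (k+1)-st largest counter value from every counter and
    # drop counters that fall to zero or below; at most k survive.
    pivot = sorted(merged.values(), reverse=True)[k]
    return {x: c - pivot for x, c in merged.items() if c > pivot}
```

With k = 1/ε counters, the summary size matches the O(1/ε) bound stated in the abstract.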
Robust shape fitting via peeling and grating coresets
- In Proc. 17th ACM-SIAM Sympos. Discrete Algorithms, 2006
"... Let P be a set of n points in R d. A subset S of P is called a (k, ε)-kernel if for every direction, the direction width of S ε-approximates that of P, when k “outliers ” can be ignored in that direction. We show that a (k, ε)-kernel of P of size O(k/ε (d−1)/2) can be computed in time O(n+k 2 /ε d−1 ..."
Abstract
-
Cited by 16 (3 self)
- Add to MetaCart
(Show Context)
Let P be a set of n points in R^d. A subset S of P is called a (k, ε)-kernel if for every direction, the directional width of S ε-approximates that of P when k “outliers” can be ignored in that direction. We show that a (k, ε)-kernel of P of size O(k/ε^{(d−1)/2}) can be computed in time O(n + k^2/ε^{d−1}). The new algorithm works by repeatedly “peeling” away (0, ε)-kernels from the point set. We also present a simple ε-approximation algorithm for fitting various shapes through a set of points with at most k outliers. The algorithm is incremental and works by repeatedly “grating” critical points into a working set, until the working set provides the required approximation. We prove that the size of the working set is independent of n, and thus results in a simple and practical, near-linear ε-approximation algorithm for shape fitting with outliers in low dimensions. We demonstrate the practicality of our algorithms by showing their empirical performance on various inputs and problems.
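To make the “(k, ε)” part of the definition concrete, the following hypothetical helper (not the paper's peeling algorithm) computes the directional width of P in a direction u when k outliers may be discarded, under the usual reading that the k ignored points are split between the two extremes so as to minimize the remaining width:

```python
import numpy as np

def width_with_outliers(P, u, k):
    """Directional width of P (an (n, d) numpy array) in direction u,
    ignoring the k outliers whose removal shrinks the width most."""
    proj = np.sort(P @ u)
    n = len(proj)
    # Drop i points from the low end and k - i from the high end,
    # taking the split that minimizes the remaining extent.
    return min(proj[n - 1 - (k - i)] - proj[i] for i in range(k + 1))
```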
An almost space-optimal streaming algorithm for coresets in fixed dimensions
- Algorithmica
"... We present a new streaming algorithm for maintaining an ε-kernel of a point set in Rd using O((1/ε(d−1)/2) log(1/ε)) space. The space used by our algorithm is optimal up to a small logarithmic factor. This significantly improves (for any fixed dimension d> 3) the best previous algorithm for this ..."
Abstract
-
Cited by 11 (4 self)
- Add to MetaCart
(Show Context)
We present a new streaming algorithm for maintaining an ε-kernel of a point set in R^d using O((1/ε^{(d−1)/2}) log(1/ε)) space. The space used by our algorithm is optimal up to a small logarithmic factor. This significantly improves (for any fixed dimension d > 3) the best previous algorithm for this problem, which uses O(1/ε^{d−3/2}) space, presented by Agarwal and Yu. Our algorithm immediately improves the space complexity of the previous streaming algorithms for a number of fundamental geometric optimization problems in fixed dimensions, including width, minimum-volume bounding box, minimum-radius enclosing cylinder, minimum-width enclosing annulus, etc.
Streaming algorithms for extent problems in high dimensions
- In SODA ’10: Proc. Twenty-First ACM-SIAM Symposium on Discrete Algorithms, 2010
"... We develop (single-pass) streaming algorithms for maintaining extent measures of a stream S of n points in R d. We focus on designing streaming algorithms whose working space is polynomial in d (poly(d)) and sublinear in n. For the problems of computing diameter, width and minimum enclosing ball of ..."
Abstract
-
Cited by 10 (0 self)
- Add to MetaCart
(Show Context)
We develop (single-pass) streaming algorithms for maintaining extent measures of a stream S of n points in R^d. We focus on designing streaming algorithms whose working space is polynomial in d (poly(d)) and sublinear in n. For the problems of computing the diameter, width, and minimum enclosing ball of S, we obtain lower bounds on the worst-case approximation ratio of any streaming algorithm that uses poly(d) space. On the positive side, we introduce the notion of a blurred ball cover and use it for answering approximate farthest-point queries and maintaining an approximate minimum enclosing ball and diameter of S. We describe a streaming algorithm for maintaining a blurred ball cover whose working space is linear in d and independent of n.
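For intuition on why poly(d)-space bounds are the interesting regime here, note that even a trivial O(d)-space algorithm already 2-approximates the diameter: fix the first point q of the stream and track the farthest distance r seen from it; by the triangle inequality the true diameter lies in [r, 2r]. A folklore baseline sketch (not the paper's blurred-ball-cover algorithm):

```python
import math

class DiameterApprox:
    """One-pass 2-approximation of the diameter of a point stream in R^d
    using O(d) space: r <= diam(S) <= 2r for the maintained r."""
    def __init__(self):
        self.q = None   # first point of the stream
        self.r = 0.0    # farthest distance from q seen so far

    def insert(self, p):
        if self.q is None:
            self.q = p
        else:
            self.r = max(self.r, math.dist(self.q, p))

    def estimate(self):
        return self.r   # true diameter is in [r, 2r]
```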
Streaming Algorithms for Line Simplification
"... We study the following variant of the well-known line-simplification problem: we are getting a possibly infinite sequence of points p0, p1, p2,... defining a polygonal path, and as we receive the points we wish to maintain a simplification of the path seen so far. We study this problem in a streamin ..."
Abstract
-
Cited by 9 (1 self)
- Add to MetaCart
(Show Context)
We study the following variant of the well-known line-simplification problem: we receive a possibly infinite sequence of points p0, p1, p2, ... defining a polygonal path, and as we receive the points we wish to maintain a simplification of the path seen so far. We study this problem in a streaming setting, where we only have a limited amount of storage, so that we cannot store all the points. We analyze the competitive ratio of our algorithms, allowing resource augmentation: we let our algorithm maintain a simplification with 2k (internal) points and compare the error of our simplification to the error of the optimal simplification with k points. We obtain algorithms with O(1) competitive ratio for three cases: convex paths where the error is measured using the Hausdorff distance, xy-monotone paths where the error is measured using the Hausdorff distance, and general paths where the error is measured using the Fréchet distance. In the first case the algorithm needs O(k) additional storage, and in the latter two cases the algorithm needs O(k^2) additional storage.
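A generic greedy variant of this setup is easy to sketch (a sketch only; the paper's algorithms and error analyses are more refined): keep the endpoints plus at most 2k internal points, and whenever the budget is exceeded, evict the internal point that is cheapest to remove, measured by its distance to the segment joining its neighbors.

```python
import math

def point_segment_dist(p, a, b):
    """Euclidean distance from 2D point p to segment ab (tuples)."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    length2 = dx * dx + dy * dy
    if length2 == 0:
        return math.dist(p, a)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length2))
    return math.dist(p, (ax + t * dx, ay + t * dy))

def simplify_stream(points, k):
    """Greedy streaming simplification keeping the two endpoints plus
    at most 2k internal points: when the budget is exceeded, drop the
    internal point whose removal introduces the least error."""
    path = []
    for p in points:
        path.append(p)
        while len(path) > 2 * k + 2:
            i = min(range(1, len(path) - 1),
                    key=lambda j: point_segment_dist(path[j], path[j - 1], path[j + 1]))
            del path[i]
    return path
```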
Streaming and Dynamic Algorithms for Minimum Enclosing Balls in High Dimensions
"... Abstract. At SODA’10, Agarwal and Sharathkumar presented a streaming algorithm for approximating the minimum enclosing ball of a set of points in d-dimensional Euclidean space. Their algorithm requires one pass, uses O(d) space, and was shown to have approximation factor at most (1 + √ 3)/2 + ε ≈ 1. ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
(Show Context)
At SODA ’10, Agarwal and Sharathkumar presented a streaming algorithm for approximating the minimum enclosing ball of a set of points in d-dimensional Euclidean space. Their algorithm requires one pass, uses O(d) space, and was shown to have approximation factor at most (1 + √3)/2 + ε ≈ 1.3661. We prove that the same algorithm has approximation factor less than 1.22, which brings us much closer to the (1 + √2)/2 ≈ 1.207 lower bound given by Agarwal and Sharathkumar. We also apply this technique to the dynamic version of the minimum enclosing ball problem (in the non-streaming setting). We give an O(dn)-space data structure that can maintain a 1.22-approximate minimum enclosing ball in O(d log n) expected amortized time per insertion or deletion.
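For comparison, the much simpler streaming update rule of Zarrabi-Zadeh and Chan also uses O(d) space but only guarantees a 3/2-approximation; the Agarwal-Sharathkumar algorithm analyzed above improves on that factor. A sketch of the simpler rule from our side (not code from this paper), assuming points arrive as numpy vectors:

```python
import numpy as np

class SimpleMEB:
    """Zarrabi-Zadeh and Chan's one-pass update for an approximate
    minimum enclosing ball in O(d) space (3/2-approximation)."""
    def __init__(self):
        self.c = None   # current center
        self.r = 0.0    # current radius

    def insert(self, p):
        p = np.asarray(p, dtype=float)
        if self.c is None:
            self.c = p
            return
        d = np.linalg.norm(p - self.c)
        if d > self.r:
            # Grow to the smallest ball containing the old ball and p:
            # the center slides toward p and the radius becomes (r + d)/2.
            delta = (d - self.r) / 2.0
            self.c = self.c + delta * (p - self.c) / d
            self.r += delta
```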
Small and Stable Descriptors of Distributions for Geometric Statistical Problems
2009
"... This thesis explores how to sparsely represent distributions of points for geometric statistical problems. A coreset C is a small summary of a point set P such that if a certain statistic is computed on P and C, then the difference in the results is guaranteed to be bounded by a parameter ε. Two exa ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
This thesis explores how to sparsely represent distributions of points for geometric statistical problems. A coreset C is a small summary of a point set P such that if a certain statistic is computed on P and on C, then the difference in the results is guaranteed to be bounded by a parameter ε. Two examples of coresets are ε-samples and ε-kernels. An ε-sample can estimate the density of a point set in any range from a geometric family of ranges (e.g., disks, axis-aligned rectangles). An ε-kernel approximates the width of a point set in all directions. Both coresets have size that depends only on ε, the error parameter, not on the size of the original data set. We demonstrate several improvements to these coresets and how they are useful for geometric statistical problems. We reduce the size of ε-samples for density queries in axis-aligned rectangles to nearly the square root of the size needed when the queries are with respect to more general families of shapes, such as disks. We also show how to construct ε-samples of probability distributions. We show how to maintain “stable” ε-kernels, that is, if the point set P changes by a single point, then the ε-kernel changes by at most a constant number of points.
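The ε-sample guarantee described here is simple to demonstrate empirically: a uniform random sample of suitable size preserves the density of every range in the family to within ε. A quick sketch for axis-aligned rectangles (the sample size below is illustrative, not a bound from the thesis):

```python
import random

def density(points, rect):
    """Fraction of 2D points inside rect = (x_lo, y_lo, x_hi, y_hi)."""
    x_lo, y_lo, x_hi, y_hi = rect
    inside = sum(1 for (x, y) in points
                 if x_lo <= x <= x_hi and y_lo <= y <= y_hi)
    return inside / len(points)

# A uniform random sample is the simplest (probabilistic) eps-sample:
# the density of any fixed rectangle is preserved to within eps with
# high probability once the sample is large enough.
P = [(random.random(), random.random()) for _ in range(100_000)]
S = random.sample(P, 2_000)
R = (0.2, 0.2, 0.7, 0.7)
print(abs(density(P, R) - density(S, R)))  # small, on the order of eps
```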
Stability of ε-Kernels
2009
"... Given a set P of n points in R d, an ε-kernel K ⊆ P approximates the directional width of P in every direction within a relative (1 − ε) factor. In this paper we study the stability of ε-kernels under dynamic insertion and deletion of points to P and by changing the approximation factor ε. In the fi ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
Given a set P of n points in R^d, an ε-kernel K ⊆ P approximates the directional width of P in every direction within a relative (1 − ε) factor. In this paper we study the stability of ε-kernels under dynamic insertion and deletion of points to P and under changes to the approximation factor ε. In the first case, we say an algorithm for dynamically maintaining an ε-kernel is stable if at most O(1) points change in K as one point is inserted or deleted from P. We describe an algorithm to maintain an ε-kernel of size O(1/ε^{(d−1)/2}) in O(1/ε^{(d−1)/2} + log n) time per update. Not only does our algorithm maintain a stable ε-kernel, its update time is faster than that of any known algorithm that maintains an ε-kernel of size O(1/ε^{(d−1)/2}). Next, we show that if there is an ε-kernel of P of size κ, which may be dramatically less than O(1/ε^{(d−1)/2}), then there is an (ε/2)-kernel of P of size O(min{1/ε^{(d−1)/2}, κ^⌊d/2⌋ log^{d−2}(1/ε)}). Moreover, there exists a point set P in R^d and a parameter ε > 0 such that if every ε-kernel of P has size at least κ, then any (ε/2)-kernel of P has size Ω(κ^⌊d/2⌋).
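A small sketch of the underlying definition may help: the directional width of P in direction u is max_p ⟨p, u⟩ − min_p ⟨p, u⟩, and K is an ε-kernel if its width is at least (1 − ε) times that of P in every direction. The checker below spot-tests this on random directions (a verification sketch under our own naming, not a construction from the paper):

```python
import numpy as np

def directional_width(P, u):
    """Width of point set P (an (n, d) numpy array) in unit direction u."""
    proj = P @ u
    return proj.max() - proj.min()

def is_eps_kernel(K, P, eps, trials=1000):
    """Spot-check the eps-kernel property on random unit directions:
    width_K(u) >= (1 - eps) * width_P(u) should hold for every u."""
    rng = np.random.default_rng(0)
    d = P.shape[1]
    for _ in range(trials):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        if directional_width(K, u) < (1 - eps) * directional_width(P, u):
            return False
    return True
```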