## Mergeable Summaries


### BibTeX

@MISC{Agarwal_mergeablesummaries,

author = {Pankaj K. Agarwal and Jeff M. Phillips and Graham Cormode and Zhewei Wei and Zengfeng Huang and Ke Yi},

title = {Mergeable Summaries},

year = {}

}


### Abstract

We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means that the summaries can be merged in a way like other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the data sets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^{3/2}(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations and ε-kernels. We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.

Supported by NSF under grants CNS-05-40347, IIS-07-
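To make the heavy-hitters claim concrete, here is a minimal sketch of how two MG (Misra-Gries) summaries with at most k counters might be merged. The abstract does not spell out the procedure, so the rule used here is an assumption: add the counters, subtract the (k+1)-th largest value, and drop anything that is no longer positive. The names `mg_update`, `mg_merge`, and `k` are hypothetical; choosing k on the order of 1/ε corresponds to the O(1/ε) size quoted above.

```python
from collections import Counter


def mg_update(summary: dict, item, k: int) -> None:
    """Process one stream item, keeping at most k counters (Misra-Gries)."""
    if item in summary:
        summary[item] += 1
    elif len(summary) < k:
        summary[item] = 1
    else:
        # Summary is full: decrement every counter and drop those reaching zero.
        for key in list(summary):
            summary[key] -= 1
            if summary[key] == 0:
                del summary[key]


def mg_merge(a: dict, b: dict, k: int) -> dict:
    """Merge two k-counter summaries into one k-counter summary (assumed rule)."""
    merged = Counter(a)
    merged.update(b)  # Counter.update adds counts item by item
    if len(merged) <= k:
        return dict(merged)
    # Value of the (k+1)-th largest counter; at most k counters exceed it.
    threshold = sorted(merged.values(), reverse=True)[k]
    return {x: c - threshold for x, c in merged.items() if c > threshold}
```

Under this rule every estimate is an underestimate, and the total undercount per item stays within (n minus the sum of the surviving counters) divided by k + 1, where n is the combined stream length. That is the sense in which the error guarantee survives a merge in this sketch.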

### Citations

1139 | TAG: a Tiny AGgregation Service for Ad-Hoc Sensor Networks
- Madden, Franklin, et al.
- 2002
Citation Context ...ee rooted at the base station. Each sensor holds some data and the goal of data aggregation is to compute a summary of all the data. Nearly all data aggregation algorithms follow a bottom-up approach =-=[29]-=-: Starting from the leaves, the aggregation propagates upwards to the root. When a node receives the summaries from its children, it merges these with its own summary, and forwards the result to its p... |

963 | On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications
- Vapnik, Chervonenkis
- 1971
Citation Context ...ng range queries, on multidimensional data. For a range space (D, R) of VC-dimension ν, a random sample of O((1/ε^2)(ν + log(1/δ))) points from D is an ε-approximation with probability at least 1 − δ =-=[28, 42]-=-. Random samples are easily mergeable, but they are far from optimal. It is known that, if R is the set of ranges induced by d-dimensional axis-aligned rectangles, there is an ε-approximation of size ... |

710 | The space complexity of approximating the frequency moments
- Alon, Matias, et al.
- 1996
Citation Context ...e summary size should be at least as large. Some summaries are known to be mergeable. For example, all sketches that are linear functions of D are trivially mergeable. These include the F2 AMS sketch =-=[4]-=-, the CountMin sketch [15], the ℓ1 sketch [17, 37], among many others. Summaries that maintain the maximum or top-k values can also be easily merged, most notably summaries for estimating the number o... |

311 | An improved data stream summary: the count-min sketch and its applications
- Cormode, Muthukrishnan
Citation Context ...at least as large. Some summaries are known to be mergeable. For example, all sketches that are linear functions of D are trivially mergeable. These include the F2 AMS sketch [4], the CountMin sketch =-=[15]-=-, the ℓ1 sketch [17, 37], among many others. Summaries that maintain the maximum or top-k values can also be easily merged, most notably summaries for estimating the number of distinct elements [6, 26... |

265 | Clustering data streams
- Guha, Mishra, et al.
- 2000
Citation Context ...mergeable summaries. In some cases, it may be possible to adapt existing solutions from the streaming literature to this setting. For example, consider the problem of k-median clustering. Guha et al. =-=[24]-=- show that clustering the union of cluster centers from disjoint parts of the input gives a guaranteed approximation to the overall clustering. In our terminology, this means that clusterings can be m... |

264 | Stable distributions, pseudorandom generators, embeddings, and data stream computation - Indyk - 2000 |

183 | Space-efficient online computation of quantile summaries
- Greenwald, Khanna
- 2001
Citation Context ...igest [39] has size O((1/ε) log u); although not a linear sketch, it is still mergeable. Neither approach scales well when log u is large. The most popular quantile summary technique is the GK summary =-=[21]-=-, which guarantees a size of O((1/ε) log(εn)). A merging algorithm has been previously designed, but the error could increase to 2ε when two ε-summaries are merged [22]. ε-approximations. Let (D, R) be ... |

147 | Medians and beyond: new aggregation techniques for sensor networks
- Shrivastava, Buragohain, et al.
- 2004
Citation Context ...quency estimation summary) can be organized into a hierarchy to solve the quantile problem, yielding a linear sketch of size O((1/ε) log^2 u log((log n)/ε)) after adjusting parameters [15]. The q-digest =-=[39]-=- has size O((1/ε) log u); although not a linear sketch, it is still mergeable. Neither approach scales well when log u is large. The most popular quantile summary technique is the GK summary [21], whic... |

146 | Counting distinct elements in a data stream
- Bar-Yossef, Jayram, et al.
- 2002
Citation Context ...h [15], the ℓ1 sketch [17, 37], among many others. Summaries that maintain the maximum or top-k values can also be easily merged, most notably summaries for estimating the number of distinct elements =-=[6, 26]-=-. However, several fundamental problems have summaries that are based on other techniques, and are not known to be mergeable (or have unsatisfactory bounds). This paper focuses on summaries for severa... |

143 | Sharper bounds for Gaussian and empirical processes. Annals of Probability 22 - Talagrand - 1994 |

115 | Approximate medians and other quantiles in one pass and with limited memory
- Manku, Rajagopalan, et al.
- 1998
Citation Context ... meet the requirements on this error. An immediate observation is that the GK algorithm [21] (along with other deterministic techniques for streaming computation of quantiles which require more space =-=[32]-=-) meets these requirements, and is therefore one-way mergeable. The merging is fast, since it takes time linear in the summary size to extract an approximate distribution, and near-linear to insert in... |

105 | How to summarize the universe: Dynamic maintenance of quantiles
- Gilbert, Kotidis, et al.
Citation Context ...ms which are quantiles for multiples of ε′. Therefore a quantile summary is automatically a frequency estimation summary, but not vice versa. Quite a number of quantile summaries have been designed =-=[16, 18, 17, 30, 23, 12]-=-, but all the mergeable ones either have dependency on log u (thus work only in the bounded-universe case). The Count-Min sketch (more generally, any frequency estimation summary) can be organized int... |

103 | The Discrepancy Method: Randomness and Complexity - Chazelle - 2000 |

98 | Approximating extent measures of points
- Agarwal, Har-Peled, et al.
- 2004
Citation Context ...has gone into bounding the discrepancy, which governs the increase in error at each step. We are unaware of any mergeable ε-approximations of o(1/ε^2) size. ε-kernels. Finally, we consider ε-kernels =-=[1]-=- which are summaries for approximating the convex shape of a point set P. Specifically, they are a specific type of coreset that approximates the width of P within a relative (1 + ε)-factor in any di... |

95 | Tributaries and deltas: Efficient and robust aggregation in sensor network streams
- Manjhi, Nath, et al.
Citation Context ...r to the root. These motivating scenarios are by no means new. However, results to this date have yielded rather weak results. Specifically, in many cases, the error increases as more merges are done =-=[13, 22, 30, 31]-=-. To obtain any overall guarantee, it is necessary to bound the number of rounds of merging operations so that the error parameter ε can be scaled accordingly. Consequently, this weaker form of mergea... |

92 | On linear-time deterministic algorithms for optimization problems in fixed dimension
- Chazelle, Matousek
- 1996
Citation Context ...r to the root. These motivating scenarios are by no means new. However, results to this date have yielded rather weak results. Specifically, in many cases, the error increases as more merges are done =-=[13, 22, 30, 31]-=-. To obtain any overall guarantee, it is necessary to bound the number of rounds of merging operations so that the error parameter ε can be scaled accordingly. Consequently, this weaker form of mergea... |

88 | An approximate L1-difference algorithm for massive data streams
- Feigenbaum, Kannan, et al.
- 1999
Citation Context ...ome summaries are known to be mergeable. For example, all sketches that are linear functions of D are trivially mergeable. These include the F2 AMS sketch [4], the CountMin sketch [15], the ℓ1 sketch =-=[17, 37]-=-, among many others. Summaries that maintain the maximum or top-k values can also be easily merged, most notably summaries for estimating the number of distinct elements [6, 26]. However, several fund... |

83 | Finding repeated elements
- Misra, Gries
- 1982
Citation Context ...og^2 u log n). In some cases log u is large, for example when the elements are strings or user-defined types, so we seek to avoid such factors. The counter-based summaries, most notably the MG summary =-=[36]-=- and the SpaceSaving summary [35], have been reported [14] to give the best results for both the frequency estimation and the heavy hitters problem (in the streaming model). They are deterministic, si... |

82 | The Discrepancy Method
- Chazelle
- 2000
Citation Context ...accumulates during each reduction step of the process. In particular, the reduction step is handled by a low-discrepancy coloring, and an intense line of work (see books of Matousek [34] and Chazelle =-=[12]-=-) has gone into bounding the discrepancy, which governs the increase in error at each step. We are unaware of any mergeable ε-approximations of o(1/ε^2) size. ε-kernels. Finally, we consider ε-kernel... |

78 | Efficiently approximating the minimum-volume bounding box of a point set in three dimensions
- Barequet, Har-Peled
- 2001
Citation Context ...kernel of A(P) [1]. • Let I = [−1, 1]^d and β_d = 2^d d^{5/2} d!. There exists an O(d^2 |P|) size algorithm to construct an affine transform A such that A(P) ⊂ I and A(P) is β_d-fat with respect to I =-=[7, 25]-=-. • Place a grid G_ε on I so that each grid cell has width ε/β_d. For each grid cell g ∈ G_ε, place one point (if it exists) from g ∩ A(P) in K. Then K is an ε-kernel of A(P). Clearly the same holds if... |

73 | Power-conserving computation of order-statistics over sensor networks
- Greenwald, Khanna
Citation Context ...r to the root. These motivating scenarios are by no means new. However, results to this date have yielded rather weak results. Specifically, in many cases, the error increases as more merges are done =-=[13, 22, 30, 31]-=-. To obtain any overall guarantee, it is necessary to bound the number of rounds of merging operations so that the error parameter ε can be scaled accordingly. Consequently, this weaker form of mergea... |

66 | Finding (recently) frequent items in distributed data streams
- Manjhi, Shkapenyuk, et al.
- 2005

65 | Faster core-set constructions and data-stream algorithms in fixed dimensions. Computational Geometry: Theory and Applications
- Chan
- 2006
Citation Context ...ically, they are a specific type of coreset that approximates the width of P within a relative (1 + ε)-factor in any direction. These summaries have been extensively studied in computational geometry =-=[2, 9, 10, 43]-=- as they can be used to approximate many other geometric properties of a point set having to do with its convex shape, including diameter, minimum enclosing annulus, and minimum enclosing cylinder. In... |

62 | Geometric approximations via coresets
- Agarwal, Har-Peled, et al.
- 2005
Citation Context ...summaries of these subsets. However, for all existing analysis, the error increases on each merge step; hence these techniques are not known to be mergeable. ε-kernels. Finally, we consider ε-kernels =-=[2, 1]-=- which are summaries for approximating the convex shape of a point set P. Specifically, they are a specific type of coreset that approximates the width of P within a relative (1 + ε)-factor in any di... |

54 | Graph distances in the streaming model: the value of space - Feigenbaum, Kannan, et al. - 2005 |

48 | Approximations and optimal geometric divide-and-conquer
- Matoušek
- 1991
Citation Context ...must have size linear in n in the comparison model. On the other hand, in this section we give a randomized mergeable quantile summary of size O((1/ε) log^{1.5}(1/ε)). The idea is to use the merge-reduce algorithm =-=[13, 33]-=- for constructing deterministic ε-approximations of range spaces, but randomize it in a way so that error is preserved. Same-weight merges. We first consider a restricted merging model where each merge... |

45 | Geometric Discrepancy. An illustrated guide
- Matoušek
- 1999
Citation Context ...ε)) [27], and an ε-approximation of size O((1/ε) log^{2d}(1/ε)) [38] can be computed efficiently. More generally, an ε-approximation of size O(1/ε^{2ν/(ν+1)}) exists for a range space of VC-dimension ν =-=[34]-=-. Furthermore, such an ε-approximation can be constructed using Bansal’s algorithm [5]; see also [11, 34]. These algorithms for constructing ε-approximations are not known to be mergeable. Although th... |

42 | Improved bounds on the sample complexity of learning
- Li, Long, et al.
- 2000
Citation Context ...ng range queries, on multidimensional data. For a range space (D, R) of VC-dimension ν, a random sample of O((1/ε^2)(ν + log(1/δ))) points from D is an ε-approximation with probability at least 1 − δ =-=[28, 42]-=-. Random samples are easily mergeable, but they are far from optimal. It is known that, if R is the set of ranges induced by d-dimensional axis-aligned rectangles, there is an ε-approximation of size ... |

40 | Finding frequent items in data streams
- Cormode, Hadjieleftheriou
- 2008
Citation Context ...hen the elements are strings or user-defined types, so we seek to avoid such factors. The counter-based summaries, most notably the MG summary [36] and the SpaceSaving summary [35], have been reported =-=[14]-=- to give the best results for both the frequency estimation and the heavy hitters problem (in the streaming model). They are deterministic, simple, and have the optimal size O(1/ε). They also work in ... |

34 | An integrated efficient solution for computing frequent and top-k elements in data streams
- Metwally, Agrawal, et al.
- 2006
Citation Context ...u is large, for example when the elements are strings or user-defined types, so we seek to avoid such factors. The counter-based summaries, most notably the MG summary [36] and the SpaceSaving summary =-=[35]-=-, have been reported [14] to give the best results for both the frequency estimation and the heavy hitters problem (in the streaming model). They are deterministic, simple, and have the optimal size O... |

30 | An optimal algorithm for the distinct elements problem
- Kane, Nelson, et al.
Citation Context ...h [15], the ℓ1 sketch [17, 37], among many others. Summaries that maintain the maximum or top-k values can also be easily merged, most notably summaries for estimating the number of distinct elements =-=[6, 26]-=-. However, several fundamental problems have summaries that are based on other techniques, and are not known to be mergeable (or have unsatisfactory bounds). This paper focuses on summaries for severa... |

27 | Range counting over multidimensional data streams
- Suri, Toth, et al.
- 2004
Citation Context ...1/ε) log(εn) [21], (1/ε) log u [39], (1/ε) log(εn) (§3.1, restricted merging); quantiles (randomized): 1/ε, 1/ε · log^{3/2}(1/ε) (§3.3); ε-approximations (rectangles): (1/ε) log^{2d}(1/ε), (1/ε) log^{2d+1}(1/ε) =-=[40]-=-, (1/ε) log^{2d+3/2}(1/ε) (§4); ε-approximations (range spaces) (VC-dim ν): 1/ε^{2ν/(ν+1)}, 1/ε^{2ν/(ν+1)} log^{ν+1}(1/ε) [40], 1/ε^{2ν/(ν+1)} log^{3/2}(1/ε) (§4); ε-kernels: 1/ε^{(d−1)/2}, 1/ε^{(d−1)/2} log(1/ε) [44], 1/ε^{(d−1)/2} (§5... |

27 | Practical methods for shape fitting and kinetic data structures using core sets
- Yu, Agarwal, et al.
- 2004
Citation Context ...ically, they are a specific type of coreset that approximates the width of P within a relative (1 + ε)-factor in any direction. These summaries have been extensively studied in computational geometry =-=[2, 9, 10, 43]-=- as they can be used to approximate many other geometric properties of a point set having to do with its convex shape, including diameter, minimum enclosing annulus, and minimum enclosing cylinder. In... |

21 | On distributing symmetric streaming computations
- Feldman, Muthukrishnan, et al.
Citation Context ... operation by giving two specific applications. Motivating Scenario 1: Distributed Computation. The need for a merging operation arises in the MUD (Massive Unordered Distributed) model of computation =-=[18]-=-, which describes large-scale distributed programming paradigms like MapReduce and Sawzall. In this model, the input data is broken into an arbitrary number of pieces, each of which is potentially han... |

21 | Fast moment estimation in data streams in optimal space - Kane, Nelson, et al. - 2011 |

19 | Geometric approximation algorithms
- Har-Peled
- 2011
Citation Context ...kernel of A(P) [1]. • Let I = [−1, 1]^d and β_d = 2^d d^{5/2} d!. There exists an O(d^2 |P|) size algorithm to construct an affine transform A such that A(P) ⊂ I and A(P) is β_d-fat with respect to I =-=[7, 25]-=-. • Place a grid G_ε on I so that each grid cell has width ε/β_d. For each grid cell g ∈ G_ε, place one point (if it exists) from g ∩ A(P) in K. Then K is an ε-kernel of A(P). Clearly the same holds if... |


16 | A space-optimal data-stream algorithm for coresets in the plane, Proc. 23rd Annual Sympos
- Agarwal, Yu
Citation Context ...losing cylinder. In the static setting in R^d, ε-kernels of size O(1/ε^{(d−1)/2}) [9, 43] can always be constructed, which is optimal. In the streaming setting, several algorithms have been developed =-=[1, 3, 9]-=- ultimately yielding an algorithm using O((1/ε^{(d−1)/2}) log(1/ε)) space [44]. However, ε-kernels, including those maintained by streaming algorithms, are not mergeable. Combining two ε-kernels will i... |

13 | Constructive algorithms for discrepancy minimization
- Bansal
Citation Context ...iciently. More generally, an ε-approximation of size O(1/ε^{2ν/(ν+1)}) exists for a range space of VC-dimension ν [34]. Furthermore, such an ε-approximation can be constructed using Bansal’s algorithm =-=[5]-=-; see also [11, 34]. These algorithms for constructing ε-approximations are not known to be mergeable. Although they proceed by partitioning D into small subsets, constructing ε-approximations of each... |

13 | Analyzing graph structure via linear measurements - Ahn, Guha, et al. - 2012 |

11 | CR-precis: A deterministic summary structure for update data streams
- Ganguly, Majumder
- 2007
Citation Context ...space increases to O((1/ε) log u log((log u)/ε)) from the extra sketches with adjusted parameters. The Count-Min sketch is randomized; while there is also a deterministic linear sketch for the problem =-=[19]-=-, its size is O((1/ε^2) log^2 u log n). In some cases log u is large, for example when the elements are strings or user-defined types, so we seek to avoid such factors. The counter-based summaries, most ... |

11 | Tight upper bounds for the discrepancy of half-spaces - Matoušek - 1995 |

10 | Space-optimal heavy hitters with strong error bounds
- Berinde, Cormode, et al.
- 2009
Citation Context ... also work in the comparison model. However, only recently were they shown to support a weaker model of mergeability, where the error is bounded provided the merge is always “into” a single summary =-=[8]-=-. Some merging algorithms for these summaries have been previously proposed, but the error increases after each merging step [30, 31]. Quantile summaries. For the quantile problem we assume that the e... |

9 | Algorithms for ε-approximations of terrains
- Phillips
Citation Context ...at, if R is the set of ranges induced by d-dimensional axis-aligned rectangles, there is an ε-approximation of size O((1/ε) log^{d+1/2}(1/ε)) [27], and an ε-approximation of size O((1/ε) log^{2d}(1/ε)) =-=[38]-=- can be computed efficiently. More generally, an ε-approximation of size O(1/ε^{2ν/(ν+1)}) exists for a range space of VC-dimension ν [34]. Furthermore, such an ε-approximation can be constructed using... |

8 | An almost space-optimal streaming algorithm for coresets in fixed dimensions
- Zarrabi-Zadeh
- 2008
Citation Context ...[9, 43] can always be constructed, which is optimal. In the streaming setting, several algorithms have been developed [1, 3, 9] ultimately yielding an algorithm using O((1/ε^{(d−1)/2}) log(1/ε)) space =-=[44]-=-. However, ε-kernels, including those maintained by streaming algorithms, are not mergeable. Combining two ε-kernels will in general double the error or double the size. 1.3 Our results In this paper ... |

7 | Dynamic coresets
- Chan
Citation Context ...ically, they are a specific type of coreset that approximates the width of P within a relative (1 + ε)-factor in any direction. These summaries have been extensively studied in computational geometry =-=[2, 9, 10, 43]-=- as they can be used to approximate many other geometric properties of a point set having to do with its convex shape, including diameter, minimum enclosing annulus, and minimum enclosing cylinder. In... |

7 | Fast Manhattan sketches in data streams
- Nelson, Woodruff
- 2010
Citation Context ...ome summaries are known to be mergeable. For example, all sketches that are linear functions of D are trivially mergeable. These include the F2 AMS sketch [4], the CountMin sketch [15], the ℓ1 sketch =-=[17, 37]-=-, among many others. Summaries that maintain the maximum or top-k values can also be easily merged, most notably summaries for estimating the number of distinct elements [6, 26]. However, several fund... |

6 | On range searching in the group model and combinatorial discrepancy
- Larsen
Citation Context ...ly mergeable, but they are far from optimal. It is known that, if R is the set of ranges induced by d-dimensional axis-aligned rectangles, there is an ε-approximation of size O((1/ε) log^{d+1/2}(1/ε)) =-=[27]-=-, and an ε-approximation of size O((1/ε) log^{2d}(1/ε)) [38] can be computed efficiently. More generally, an ε-approximation of size O(1/ε^{2ν/(ν+1)}) exists for a range space of VC-dimension ν [34]. Fu... |

5 | Tight hardness results for minimizing discrepancy - Charikar, Newman, Nikolov - 2011 |

4 | Tight results for clustering and summarizing data streams
- Guha
Citation Context ...time, we may think of this as a one-way merge algorithm. Similarly, results on k-center clustering on the stream can generate a mergeable summary of size O((k/ε) log(1/ε)) that provides a 2 + ε guarantee =-=[23]-=-. In the graph setting, a simple technique for finding a t-spanner is to eject any edges which complete a cycle of length t. Given two such spanners, we can merge by simply applying the same rule as... |