## Sharing aggregate computation for distributed queries (2007)

Venue: SIGMOD

Citations: 19 (0 self)

### BibTeX

@INPROCEEDINGS{Huebsch07sharingaggregate,
    author = {Ryan Huebsch and Minos Garofalakis and Joseph M. Hellerstein and Ion Stoica},
    title = {Sharing aggregate computation for distributed queries},
    booktitle = {SIGMOD},
    year = {2007}
}

### Abstract

An emerging challenge in modern distributed querying is to efficiently process multiple continuous aggregation queries simultaneously. Processing each query independently may be infeasible, so multi-query optimizations are critical for sharing work across queries. The challenge is to identify overlapping computations that may not be obvious in the queries themselves. In this paper, we reveal new opportunities for sharing work in the context of distributed aggregation queries that vary in their selection predicates. We identify settings in which a large set of q such queries can be answered by executing k ≪ q different queries. The k queries are revealed by analyzing a boolean matrix capturing the connection between data and the queries that they satisfy, in a manner akin to familiar techniques like Gaussian elimination. Indeed, we identify a class of linear aggregate functions (including SUM, COUNT and AVERAGE), and show that the sharing potential for such queries can be optimally recovered using standard matrix decompositions from computational linear algebra. For some other typical aggregation functions (including MIN and MAX) we find that optimal sharing maps to the NP-hard set basis problem. However, for those scenarios, we present a family of heuristic algorithms and demonstrate that they perform well for moderate-sized matrices. We also present a dynamic distributed system architecture to exploit sharing opportunities, and experimentally evaluate the benefits of our techniques via a novel, flexible random workload generator we develop for this setting.

Categories and Subject Descriptors: H.2.4 [Systems]: Distributed databases
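
The sharing idea for linear aggregates can be illustrated with a small sketch. It is not the paper's implementation: it swaps in plain Gaussian elimination over exact fractions to pick a row basis, where the paper refers to standard decompositions such as LU, QR, or SVD. The matrix `F`, the per-node values in `node_psrs`, and the helper names `factor_rows` and `dot` are all hypothetical. Rows of `F` are queries, columns are data fragments, and an entry is 1 when the fragment satisfies the query's predicate. Here q = 5 SUM queries can be answered from k = 3 shared queries.

```python
from fractions import Fraction


def factor_rows(F):
    """Pick independent rows of F and express every row over them.

    Returns (basis, coeffs): basis lists row indices of F, and
    F[i] == sum(coeffs[i][j] * F[basis[j]] for j in range(len(basis))).
    """
    reduced = []  # (pivot column, eliminated vector, its expression over basis rows)
    basis, combos = [], []
    for i, row in enumerate(F):
        vec = [Fraction(x) for x in row]
        combo = {}  # invariant: row == vec + sum(combo[j] * F[basis[j]])
        for pivot, rvec, rcombo in reduced:
            if vec[pivot] != 0:
                factor = vec[pivot] / rvec[pivot]
                vec = [v - factor * r for v, r in zip(vec, rvec)]
                for j, c in rcombo.items():
                    combo[j] = combo.get(j, Fraction(0)) + factor * c
        if any(vec):
            # The row is independent, so it becomes a new shared query.
            new = len(basis)
            basis.append(i)
            pivot = next(c for c, v in enumerate(vec) if v != 0)
            rcombo = {j: -c for j, c in combo.items()}
            rcombo[new] = Fraction(1)
            reduced.append((pivot, vec, rcombo))
            combo = {new: Fraction(1)}
        combos.append(combo)
    k = len(basis)
    coeffs = [[combo.get(j, Fraction(0)) for j in range(k)] for combo in combos]
    return basis, coeffs


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical workload: 5 queries over 4 fragments.
F = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],  # equals row 0 + row 1
    [0, 1, 1, 0],
    [1, 0, 0, 1],  # equals row 0 + row 1 - row 3
]
# Hypothetical per-fragment partial SUMs held at two nodes.
node_psrs = [[3, 5, 2, 7], [1, 0, 4, 6]]

basis, coeffs = factor_rows(F)

# Each node ships only k values, one per shared query.
partials = [[dot(F[b], psr) for b in basis] for psr in node_psrs]
# The coordinator adds the partials and reconstructs all q answers.
shared = [sum(col) for col in zip(*partials)]
answers = [dot(c, shared) for c in coeffs]

# Reference: evaluate every query directly on the global fragment sums.
direct = [dot(row, [sum(col) for col in zip(*node_psrs)]) for row in F]

print(basis)                      # prints [0, 1, 3]
print([int(a) for a in answers])  # prints [9, 19, 28, 11, 17]
```

The last query is recovered with a negative coefficient (row 0 + row 1 - row 3). Subtraction is available for SUM but has no counterpart for MIN or MAX, which is consistent with the abstract's statement that those aggregates lead to the harder set basis problem.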

### Citations

11201 | Computers and Intractability: A Guide to the Theory of NP-Completeness
- Garey, Johnson
- 1979
Citation Context ...QR, or SVD decompositions). Unfortunately, duplicate-insensitive aggregates (e.g., MIN, MAX) result in a dramatic increase in problem complexity, since the problem maps to the NP-hard Set-Basis Problem [7] (known to be inapproximable to within any constant factor [13]); thus, we propose a novel efficient heuristic technique that, as our empirical results demonstrate, performs well in practice. We also ...

1131 | TAG: a tiny aggregation service for ad-hoc sensor networks
- Madden, Franklin, et al.
Citation Context ...ion queries also allow for effective, in-network processing that can drastically reduce communication overheads by “pushing” the aggregate function computation down to individual nodes in the network [14]. Another crucial requirement for large-scale distributed monitoring platforms is the ability to scale in both the volume of the underlying data streams and the number of simultaneous long-running que...

506 | Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods
- Barrett, Berry, et al.
- 1994
Citation Context ... A_i into A′_i and then to reconstruct A′ into the query answers at the coordinator node. These factorization methods and their implementations are well studied in the numerical computing literature [1]. We now present formulations for utilizing these factoring methods. An LU algorithm factors F into a lower triangular matrix L and an upper triangular matrix U such that L × U = F. In the decomposit...

451 | The design of an acquisitional query processor for sensor networks
- Madden, Franklin, et al.
- 2003
Citation Context ... is crucial to minimize the communication overhead that monitoring imposes on the underlying infrastructure, e.g., to limit the burden on the production network [5] or to maximize sensor battery life [16]. In most monitoring scenarios, the naive “warehousing solution” of simply collecting a large, distributed data set at a centralized site for query processing and result dissemination is prohibitively...

386 | On the Hardness of Approximating Minimization Problems
- Lund, Yannakakis
- 1994
Citation Context ...ve aggregates (e.g., MIN, MAX) result in a dramatic increase in problem complexity, since the problem maps to the NP-hard Set-Basis Problem [7] (known to be inapproximable to within any constant factor [13]); thus, we propose a novel efficient heuristic technique that, as our empirical results demonstrate, performs well in practice. We also give an analysis of the sharing benefits. • Implementation Deta...

270 | Gigascope: A stream database for network applications
- Cranor, Johnson, et al.
- 2003
Citation Context ... distributed nature of such systems, it is crucial to minimize the communication overhead that monitoring imposes on the underlying infrastructure, e.g., to limit the burden on the production network [5] or to maximize sensor battery life [16]. In most monitoring scenarios, the naive “warehousing solution” of simply collecting a large, distributed data set at a centralized site for query processing a...

161 | A scalable distributed information management system
- Yalagandula, Dahlin
- 2004
Citation Context ...case of a single distributed aggregation query, efficient in-network execution strategies have been proposed by several recent papers and research prototypes (including, for instance, TAG [14], SDIMS [20], and PIER [9]). The key idea in these techniques is to perform the aggregate computation over a dynamic tree in an overlay network. Aggregation occurs over a dynamic tree, with each node combining th...

107 | Multiple query optimization
- Sellis
- 1986
Citation Context ...simulation numbers, we also demonstrate the communication savings of our methodology in a full implementation in the PIER distributed query engine [9], running on an experimental cluster. Prior Work. [17] and similar work focus on select/project/join queries. Contrastingly, our work only addresses aggregation. For the case of a single distributed aggregation query, efficient in-network execution strat...

90 | Spectral Bloom filters
- Cohen, Matias
- 2003
Citation Context ...ample aggregate functions for each category: duplicate sensitive, non-linear: k-MAX, k-MIN; duplicate sensitive, linear: SUM, COUNT, AVERAGE; duplicate insensitive, non-linear: MIN, MAX, BLOOM FILTER, logical AND/OR; duplicate insensitive, linear: Spectral Bloom filters [2], Set expressions with updates [6]. The intuition for why k-MAX and k-MIN (the multi-set of the top k highest/lowest datums) are non-linear is analogous to that of MAX and MIN. k-MAX/MIN are also dupli...

84 | Holistic aggregates in a networked world: distributed tracking of approximate quantiles
- Cormode, Garofalakis, et al.
- 2005
Citation Context ... streaming has demonstrated that, with appropriate PSR definitions and combination techniques, in-network aggregation ideas can be extended to fairly complex aggregates, such as approximate quantiles [4, 8], and approximate histograms and join aggregates [3]. None of these earlier papers considers the case of multiple distributed aggregation queries, essentially assuming that such queries are processed ...

76 | The architecture of PIER: An Internet-scale query processor
- Huebsch, Chun, et al.
- 2005
Citation Context ...de range of workloads. In addition to the analytical simulation numbers, we also demonstrate the communication savings of our methodology in a full implementation in the PIER distributed query engine [9], running on an experimental cluster. Prior Work. [17] and similar work focus on select/project/join queries. Contrastingly, our work only addresses aggregation. For the case of a single distributed a...

72 | Power-conserving computation of order-statistics over sensor networks
- Greenwald, Khanna
Citation Context ... streaming has demonstrated that, with appropriate PSR definitions and combination techniques, in-network aggregation ideas can be extended to fairly complex aggregates, such as approximate quantiles [4, 8], and approximate histograms and join aggregates [3]. None of these earlier papers considers the case of multiple distributed aggregation queries, essentially assuming that such queries are processed ...

68 | Sketching streams through the net: Distributed approximate query tracking
- Cormode, Garofalakis
- 2005
Citation Context ...definitions and combination techniques, in-network aggregation ideas can be extended to fairly complex aggregates, such as approximate quantiles [4, 8], and approximate histograms and join aggregates [3]. None of these earlier papers considers the case of multiple distributed aggregation queries, essentially assuming that such queries are processed individually, modulo perhaps some simple routing opt...

60 | Processing Set Expressions over Continuous Update Streams
- Ganguly, Garofalakis, et al.
- 2003
Citation Context ... category: duplicate sensitive, non-linear: k-MAX, k-MIN; duplicate sensitive, linear: SUM, COUNT, AVERAGE; duplicate insensitive, non-linear: MIN, MAX, BLOOM FILTER, logical AND/OR; duplicate insensitive, linear: Spectral Bloom filters [2], Set expressions with updates [6]. The intuition for why k-MAX and k-MIN (the multi-set of the top k highest/lowest datums) are non-linear is analogous to that of MAX and MIN. k-MAX/MIN are also duplicate sensitive since evaluating ea...

41 | Multi-query optimization for sensor networks
- Trigoni, Yao, et al.
- 2005
Citation Context ...single site. In the distributed setting, network communication is the typical bottleneck, and hence minimizing the network traffic becomes an important optimization concern. In an independent effort, [19] has proposed a distributed solution for the sub-problem we term “linear aggregates” in this paper. Their scheme is based on heuristics tailored to power-constrained sensornets where the query workloa...

39 | On-the-fly sharing for streamed aggregation
- Krishnamurthy, Wu, et al.
- 2006
Citation Context ...ion involved when tracking (1) several GROUP-BY aggregates (differing in their grouping attributes) [21], or (2) several windowed aggregates (differing in their window sizes and/or selection predicates) [10, 11], over a continuous data stream observed at a single site. In the distributed setting, network communication is the typical bottleneck, and hence minimizing the network traffic becomes an important op...

27 | No pane, no gain: efficient evaluation of sliding-window aggregates over data streams
- Li, Maier, et al.
Citation Context ...ion involved when tracking (1) several GROUP-BY aggregates (differing in their grouping attributes) [21], or (2) several windowed aggregates (differing in their window sizes and/or selection predicates) [10, 11], over a continuous data stream observed at a single site. In the distributed setting, network communication is the typical bottleneck, and hence minimizing the network traffic becomes an important op...

27 | Multiple aggregations over data streams
- Zhang, Koudas, et al.
- 2005
Citation Context ...utions for the centralized version of the problem, where the goal is to minimize the amount of computation involved when tracking (1) several GROUP-BY aggregates (differing in their grouping attributes) [21], or (2) several windowed aggregates (differing in their window sizes and/or selection predicates) [10, 11], over a continuous data stream observed at a single site. In the distributed setting, networ...

17 | The set basis problem is NP-complete
- Stockmeyer
- 1975
Citation Context ... set. Our problem is the same, where S = rows of F and B = rows of F′. The set of possible basis sets is 2^(2^n) where n is the number of elements in ∪S. This problem was proved NP-Hard by Stockmeyer [18], and was later shown to be inapproximable to within any constant factor [13]. To our knowledge, ours is the first heuristic approximation algorithm for the general problem. In [12] Lubiw shows that t...

4 | The Boolean Basis Problem and How to Cover Some Polygons by Rectangles
- Lubiw
- 1990
Citation Context ...Hard by Stockmeyer [18], and was later shown to be inapproximable to within any constant factor [13]. To our knowledge, ours is the first heuristic approximation algorithm for the general problem. In [12] Lubiw shows that the problem can be solved for some limited classes of F matrices, but these do not apply in our domain. As with the general decomposition problem in Section 3, the search space of ou...

4 | Continuously Adaptive Continuous Queries
- Madden
Citation Context ...e technique presented in [10]. Each tuple is locally evaluated against each query’s predicates to determine on-the-fly which fragment the tuple belongs to. We can use techniques such as group filters [15] to efficiently evaluate the predicates. Once the fragment is determined, the tuple is added to the fragment’s corresponding local PSR in A_i. In the second phase, decomposition, each node will locally...