## Grouping and Duplicate Elimination: Benefits of Early Aggregation (1997)

Citations: 12 (1 self)

### BibTeX

    @techreport{Larson97groupingand,
      author      = {Per-Åke Larson},
      title       = {Grouping and Duplicate Elimination: Benefits of Early Aggregation},
      institution = {},
      year        = {1997}
    }

### Abstract

Early aggregation is a technique for speeding up the processing of GROUP BY queries by reducing the amount of intermediate data transferred between main memory and disk. It can also be applied to duplicate elimination, because duplicate elimination is equivalent to grouping with no aggregation functions. This paper describes six different algorithms for grouping and aggregation, shows how to incorporate early aggregation in each of them, and analyzes the resulting reduction in intermediate data. In addition to the grouping algorithm used, the reduction depends on several factors: the number of groups, the skew in the group size distribution, the input size, and the amount of main memory available. All six algorithms considered benefit from early aggregation, with grouping by hash partitioning producing the least amount of intermediate data. If the group size distribution is skewed, the overall reduction can be very significant, even with a modest amount of additional main memory.
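The early-aggregation idea summarized in the abstract can be sketched in miniature: fold each incoming record into an in-memory hash table of partial aggregates, so that the intermediate data passed along grows with the number of resident groups rather than with the input size. The function name, the `capacity` parameter, and the sum aggregate below are illustrative assumptions, not details from the paper.

```python
from collections import defaultdict

def hash_group_sum(records, capacity):
    """Grouping with early aggregation, sketched in memory.

    Records whose key is already resident are folded into the hash
    table immediately (early aggregation), so they produce no new
    intermediate record. When the table is full, overflow records
    are passed through unaggregated -- here a list stands in for
    runs or partitions written to disk.
    """
    table = {}
    spilled = []  # stand-in for intermediate data written to disk
    for key, value in records:
        if key in table:
            table[key] += value           # absorbed: no intermediate record
        elif len(table) < capacity:
            table[key] = value            # new resident group
        else:
            spilled.append((key, value))  # table full: pass through
    # final pass: merge the spilled records with the resident aggregates
    merged = defaultdict(int, table)
    for key, value in spilled:
        merged[key] += value
    return dict(merged), len(spilled)
```

With a skewed key distribution most records hit a resident group, so the number of spilled intermediate records stays far below the input size, which mirrors the abstract's observation that skew makes early aggregation especially effective.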

### Citations

2324 | The Art of Computer Programming
- Knuth
- 1973

Citation Context: ...sume that the labels are assigned so that $p_1 \ge p_2 \ge \cdots \ge p_D$. For the numerical results reported in this paper, we model the group size distribution with a generalized Zipf distribution [4]. The distribution function is defined by $p_i = \frac{1}{c}(1/i)^{\alpha}$, $i = 1, 2, \ldots, D$, where $\alpha$ is a positive constant and $c = \sum_{i=1}^{D} (1/i)^{\alpha}$. Setting $\alpha = 1$ gives the traditional Zipf distribution, and...
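The generalized Zipf distribution described in the context above can be computed directly: weight group $i$ by $(1/i)^{\alpha}$ and normalize over the $D$ groups, so $\alpha = 1$ recovers the traditional Zipf distribution and $\alpha = 0$ the uniform one. The function name below is illustrative.

```python
def zipf_probabilities(D, alpha):
    """Generalized Zipf distribution over D groups:
    p_i = (1/i)**alpha / c, with c = sum of (1/i)**alpha for i = 1..D.
    Probabilities come out non-increasing in i, matching the labeling
    convention p_1 >= p_2 >= ... >= p_D.
    """
    weights = [(1.0 / i) ** alpha for i in range(1, D + 1)]
    c = sum(weights)  # normalizing constant
    return [w / c for w in weights]
```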

689 | Query Evaluation Techniques for Large Databases
- Graefe
- 1993

Citation Context: ...are assumed to be of the same size, the only merge pattern considered is balanced two-way merge, and duplicate elimination during run formation is not considered. This analysis is also summarized in [3]. Parallel database systems running on shared-nothing systems normally perform aggregation in two steps: each node first performs grouping and aggregation on its local data and then ships the result t...

76 | Duplicate record elimination in large data files
- Bitton, DeWitt
- 1983

Citation Context: ...rested in large-scale grouping and aggregation requiring external storage. The processing cost is then dominated by the cost of I/O, and the CPU time can be largely ignored. D. Bitton and D. J. DeWitt [2] analyzed the benefits of early duplicate elimination during run merging in external merge sort. Their analysis is based on several simplifying assumptions: all groups are assumed to be of the same si...
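The early duplicate elimination during run merging attributed to Bitton and DeWitt can be sketched with sorted in-memory lists standing in for sorted runs on disk: equal keys collapse as soon as they meet in the merge, so later merge passes carry fewer records. Using `heapq.merge` here is an illustrative shortcut, not the mechanism analyzed in the cited paper.

```python
import heapq

def merge_runs_dedup(runs):
    """Merge already-sorted runs, dropping duplicates as they meet.

    Each run must itself be sorted. Because the merged stream is
    sorted, duplicate keys arrive adjacently, so a single comparison
    with the last emitted key eliminates them early.
    """
    out = []
    for key in heapq.merge(*runs):
        if not out or out[-1] != key:
            out.append(key)
    return out
```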

47 | Adaptive parallel aggregation algorithms
- Shatdal, Naughton
- 1995

Citation Context: ...egation in two steps: each node first performs grouping and aggregation on its local data and then ships the result to one or more nodes where the partial results are integrated. Shatdal and Naughton [6] pointed out that if the input is large and the duplication factor is low (few records per group), then the first step may do a lot of work for a relatively small reduction in output size. If so, it i...
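The adaptive idea attributed to Shatdal and Naughton can be sketched as a per-node decision: aggregate locally only when the data appear duplicated enough for the first step to pay off. The sampling rule and the `sample_size` and `max_distinct_ratio` parameters below are illustrative assumptions, not the cited paper's actual test.

```python
from collections import Counter

def local_step(records, sample_size=100, max_distinct_ratio=0.5):
    """One node's first aggregation step, chosen adaptively.

    If a prefix sample looks mostly distinct (low duplication factor),
    local aggregation would barely shrink the output, so the raw
    records are shipped directly to the final-aggregation nodes.
    Otherwise the node aggregates locally and ships partial sums.
    """
    sample = records[:sample_size]
    distinct_ratio = len({k for k, _ in sample}) / max(len(sample), 1)
    if distinct_ratio > max_distinct_ratio:
        return list(records)              # skip local aggregation
    partial = Counter()
    for key, value in records:
        partial[key] += value
    return list(partial.items())          # ship partial aggregates
```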

26 | Sorting and searching in multisets
- Munro, Spira
- 1976

Citation Context: ...nt, even when using a modest amount of main memory. 2 Previous Work Early aggregation is not a new idea. Most published work deals with duplicate elimination, typically in main memory. Munro and Spira [5] gave a computational bound for the number of comparisons required to sort a multiset with early duplicate removal. Several algorithms, based on various sorting algorithms, e.g., quick sort, hash sort...

14 | Quicksort for Equal Keys
- Wegner
- 1985

Citation Context: ...hree-way comparisons. Teuhola and Wegner [7] gave a duplicate elimination algorithm based on hashing with early duplicate removal, which requires linear time on the average and O(1) extra space. Wegner [8] gave a quick sort algorithm for the run formation phase and analyzed its computational complexity. However, we are mainly interested in large-scale grouping and aggregation requiring external storage...

3 | Minimal space, average linear time duplicate deletion
- Teuhola, Wegner
- 1991

Citation Context: ...sort, have been proposed for duplicate elimination. Abdelguerfi and Sood [1] gave the computational complexity of the merge sort method based on the number of three-way comparisons. Teuhola and Wegner [7] gave a duplicate elimination algorithm based on hashing with early duplicate removal, which requires linear time on the average and O(1) extra space. Wegner [8] gave a quick sort algorithm for the run...

3 | Data Reduction Through Early Grouping
- Yan, Larson
- 1994

Citation Context: ...tively small reduction in output size. If so, it is better to skip local aggregation and simply send the input tuples directly to the nodes performing the final aggregation. A paper by Yan and Larson [9] contains some early results (based on a simulation study) of the benefits of applying early aggregation to grouping by sorting. 3 Preliminaries In this section we derive three functions that will be ...

2 | Computational complexity of sorting and joining relations with duplicates
- Abdelguerfi, Sood
- 1991

Citation Context: ...et with early duplicate removal. Several algorithms, based on various sorting algorithms, e.g., quick sort, hash sort and merge sort, have been proposed for duplicate elimination. Abdelguerfi and Sood [1] gave the computational complexity of the merge sort method based on the number of three-way comparisons. Teuhola and Wegner [7] gave a duplicate elimination algorithm based on hashing with early dupli...