## Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets

Citations: | 169 - 2 self |

### BibTeX

@MISC{Vitter_approximatecomputation,

author = {Jeffrey Scott Vitter and Min Wang},

title = {Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets},

year = {}

}

### Abstract

Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, approximate answers to OLAP aggregation queries. In this paper, we present a novel method that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner. We construct a compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon a multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy. We present two I/O-efficient algorithms to construct the compact data cube for the important case of sparse high-dimensional arrays, which often arise in practice. The traditional histogram methods are infeasible for the massive high-dimensional data sets in OLAP applications. Previously developed wavelet techniques are efficient only for dense data. Our on-line query processing algorithm is very fast and capable of refining answers as the user demands more accuracy. Experiments on real data show that our method provides significantly more accurate results for typical OLAP aggregation queries than other efficient approximation techniques such as random sampling.
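The general idea of a wavelet-based compact representation can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a one-dimensional, in-memory array whose length is a power of two, uses the orthonormal Haar basis, and answers a range-sum by reconstructing the array, whereas the abstract describes I/O-efficient construction over sparse multidimensional data. The function names (`haar_decompose`, `haar_reconstruct`, `keep_largest`, `approx_range_sum`) and the sample data are hypothetical.

```python
import math


def haar_decompose(values):
    """Orthonormal 1-D Haar transform; len(values) must be a power of two."""
    coeffs = [float(v) for v in values]
    n = len(coeffs)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    while n > 1:
        half = n // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        # Scaled averages go first, details after them; recurse on the averages.
        coeffs[:n] = averages + details
        n = half
    return coeffs


def haar_reconstruct(coeffs):
    """Inverse of haar_decompose."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged.append((a + d) / math.sqrt(2))
            merged.append((a - d) / math.sqrt(2))
        out[:2 * n] = merged
        n *= 2
    return out


def keep_largest(coeffs, c):
    """Zero out all but the c coefficients of largest absolute value."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:c])
    return [v if i in kept else 0.0 for i, v in enumerate(coeffs)]


def approx_range_sum(coeffs, lo, hi):
    """Approximate Sum(lo : hi), inclusive, from (possibly thresholded) coefficients."""
    return sum(haar_reconstruct(coeffs)[lo:hi + 1])


# Hypothetical sparse array: mostly zeros, as in the sparse case the abstract targets.
data = [0, 0, 5, 0, 0, 0, 0, 3]
compact = keep_largest(haar_decompose(data), 3)  # keep 3 of 8 coefficients
print(approx_range_sum(compact, 0, 3), sum(data[0:4]))
```

Keeping more coefficients moves the approximate answer toward the exact one, which mirrors the abstract's point that answers can be refined as the user demands more accuracy.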

### Citations

697 | Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-total
- Gray, Bosworth, et al.
- 1996
Citation Context ...ion, these queries have the form $\mathrm{Sum}(l_1 : h_1, \ldots, l_{d'} : h_{d'}, \mathrm{all}_{d'+1}, \ldots, \mathrm{all}_d)$. For brevity, we simply write $\mathrm{Sum}(l_1 : h_1, \ldots, l_{d'} : h_{d'})$. The popular data cube operator [GBLP96] can be viewed as computing the special case of all range-sums with singleton ranges, $\mathrm{Sum}(l_1 : h_1, \ldots, l_{d'} : h_{d'})$, in which $0 \le l_i = h_i < |D_i|$, for $1 \le i \le d'$. In traditional approaches of a... |

537 | The input/output complexity of sorting and related problems - Aggarwal, Vitter - 1988 |

463 | Implementing Data Cubes Efficiently - Harinarayan, Rajaraman, et al. - 1996 |

320 | External Memory Algorithms and Data Structures: Dealing with Massive Data
- Vitter
- 2001
Citation Context ...eded. However, it may no longer be desirable to do the transposition via the distribution approach of (5); instead we can do the transposition by sorting, which uses $O\left(\frac{N_z}{B} \log_{M/B} \frac{N_z}{B}\right)$ I/Os. (See [Vit99] for a proof in the I/O model that transposition is equivalent to sorting.) If all the processed hyperplanes individually fit into internal memory, the resulting I/O bound for Algorithm I will be O(... |

312 | Online aggregation
- Hellerstein, Haas, et al.
- 1997
Citation Context ...arios in which a user may prefer an approximate answer in a few seconds over an exact answer that requires tens of minutes or more to compute. An example is a drill-down query sequence in data mining [HHW97]. Another consideration is that sometimes the base data are remote and unavailable, so that an exact answer is not an option until the data again become available [FJS97]. In developing our wavelet-ba... |

237 | Improved histograms for selectivity estimation of range predicates - Poosala, Ioannidis, et al. - 1996 |

234 | Algorithms for parallel memory I: Two level memories - Vitter, Shriver - 1994 |

212 | Wavelet-based histograms for selectivity estimations
- Matias, Vitter, et al.
- 1998
Citation Context ...ons of a (possibly multidimensional) array of values are needed, such as query optimization [SAC+79], parallel join load balancing [PI96], and approximate query processors [BDF+97]. Matias et al. [MVW98] first explored the use of wavelet-based techniques to construct analogs of histograms in databases. Their experiments show that wavelet-based approximation methods can offer substantial improvements ... |

206 | On the computation of multidimensional aggregates - Agarwal, Agrawal, et al. - 1996 |

205 | New sampling-based summary statistics for improving approximate query answers
- Gibbons, Matias
- 1998
Citation Context ...proximate answer is given in the obvious way: If the answer of the query using a sample of size $t$ is $s$, the approximate answer is $s \times N_z / t$. The new sampling-based summary statistics proposed in [GM98] cannot be applied here to any advantage since our raw data do not contain duplicate tuples. We chose not to do any comparisons with traditional histogram methods [PIHS96, PI97], because as we mention... |

198 | Selectivity estimation without the attribute value independence assumption - Poosala, Ioannidis - 1997 |

168 | An array-based algorithm for simultaneous multidimensional aggregates - Zhao, Deshpande, et al. - 1997 |

116 | An Overview of Wavelet Based Multiresolution Analysis - Jawerth, Sweldens - 1994 |

103 | Wavelets for Computer Graphics
- Stollnitz, DeRose, et al.
- 1996
Citation Context ...$j \le d\}$. That is, if we want to minimize the average absolute error in approximating all the individual cells in $S$, the best choice is to keep the $C$ largest (in absolute value) wavelet coefficients [SDS96]. But our goal here is to approximate $d'$-dimensional range-sum queries, where usually $d' \le d$. If a coefficient $c_i$ is more likely to contribute to a query than another coefficient $c_j$, we would like ... |

92 | Data Cube Approximation and Histograms via Wavelets
- Vitter, Wang, et al.
- 1998
Citation Context ...lowing four important issues in this paper: 1. I/O-efficiency of the compact data cube construction, especially when the underlying multidimensional array is very sparse. Our earlier wavelet approach [VWI98] requires a dense storage representation during the construction of the compact data cube, which is infeasible for very large sparse data sets. Histogram techniques [PI97, PIHS96] usually require exce... |

59 | Range queries in OLAP data cubes
- Ho, Agrawal, et al.
- 1997
Citation Context ...t class of aggregation queries are the so-called (general) range-sum queries, which are defined by applying the Sum operation over a selected contiguous range in the domains of some of the attributes [HAMS97]. A range-sum query can generally be formulated as follows: $\mathrm{Sum}(l_1 : h_1, \ldots, l_d : h_d) = \sum_{l_1 \le i_1 \le h_1} \cdots \sum_{l_d \le i_d \le h_d} S(i_1, \ldots, i_d)$. An interesting subset of the gener... |

36 | Recovering information from summary data
- Faloutsos, Jagadish, et al.
- 1997
Citation Context ...query sequence in data mining [HHW97]. Another consideration is that sometimes the base data are remote and unavailable, so that an exact answer is not an option until the data again become available [FJS97]. In developing our wavelet-based techniques to approximately answer OLAP range-sum queries, we resolve the following four important issues in this paper: 1. I/O-efficiency of the compact data cube co... |

34 | A transparent parallel I/O environment - Vengroff - 1994 |

34 | I/O-efficient scientific computation using TPIE - Vengroff, Vitter - 1996 |

28 | Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing
- Poosala, Ioannidis
- 1996
Citation Context ...used in a variety of important applications where quick approximations of a (possibly multidimensional) array of values are needed, such as query optimization [SAC+79], parallel join load balancing [PI96], and approximate query processors [BDF+97]. Matias et al. [MVW98] first explored the use of wavelet-based techniques to construct analogs of histograms in databases. Their experiments show that wav... |

17 | TPIE User Manual and Reference - Vengroff - 1995 |

6 | The OLAP Report
- Pendse
Citation Context ...recomputed data cube. As we all know, the size of the precomputed data cube is much larger than that of the underlying raw data, especially when S is high-dimensional (e.g., more than six dimensions) [PC98]. In some applications there may be 100 dimensions! Even in moderately sized scenarios, there are usually many tables (in a ROLAP system) or multidimensional arrays (in a MOLAP system), and most of th... |

2 | Census bureau databases. The online data are available on the web at http://www.census.gov
- Bureau
Citation Context ...construction algorithm, we report the running time of our algorithm on tunable synthetic datasets. We obtained our real-world data from the U.S. Census Bureau using their Data Extraction System (DES) [Bur]. Our data source is the Current Population Survey (CPS) and our extracted file is the March Questionnaire Supplement - Person Data File. The file contains 372 attributes, from which we chose 11. Our ... |