Results 1 - 10
of
24
Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets
"... Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, a ..."
Abstract
-
Cited by 154 (2 self)
- Add to MetaCart
Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, approximate answers to OLAP aggregation queries. In this paper, we present anovel method that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner. We construct a compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon a multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy. We present two I/O-efficient algorithms to construct the compact data cube for the important case of sparse high-dimensional arrays, which often arise in practice. The traditional histogram methods are infeasible for the massive high-dimensional data sets in OLAP applications. Previously developed wavelet techniques are efficient only for dense data. Our on-line query processing algorithm is very fast and capable of refining answers as the user demands more accuracy. Experiments on real data show that our method provides significantly more accurate results for typical OLAP aggregation queries than other efficient approximation techniques such as random sampling.
Data-Streams and Histograms
, 2001
"... Histograms have been used widely to capture data distribution, to represent the data by a small number of step functions. Dynamic programming algorithms which provide optimal construction of these histograms exist, albeit running in quadratic time and linear space. In this paper we provide linear ti ..."
Abstract
-
Cited by 121 (8 self)
- Add to MetaCart
Histograms have been used widely to capture data distribution, to represent the data by a small number of step functions. Dynamic programming algorithms which provide optimal construction of these histograms exist, albeit running in quadratic time and linear space. In this paper we provide linear time construction of 1 + epsilon approximation of optimal histograms, running in polylogarithmic space. Our results extend to the context of data-streams, and in fact generalize to give 1 + epsilon approximation of several problems in data-streams which require partitioning the index set into intervals. The only assumptions required are that the cost of an interval is monotonic under inclusion (larger interval has larger cost) and that the cost can be computed or approximated in small space. This exhibits a nice class of problems for which we can have near optimal data-stream algorithms.
How to Summarize the Universe: Dynamic Maintenance of Quantiles
- In VLDB
, 2002
"... Order statistics, i.e., quantiles, are frequently used in databases both at the database server as well as the application level. For example, they are useful in selectivity estimation during query optimization, in partitioning large relations, in estimating query result sizes when building us ..."
Abstract
-
Cited by 89 (12 self)
- Add to MetaCart
Order statistics, i.e., quantiles, are frequently used in databases both at the database server as well as the application level. For example, they are useful in selectivity estimation during query optimization, in partitioning large relations, in estimating query result sizes when building user interfaces, and in characterizing the data distribution of evolving datasets in the process of data mining.
The History of Histograms (abridged)
- PROC. OF VLDB CONFERENCE
, 2003
"... The history of histograms is long and rich, full of detailed information in every step. It includes the course of histograms in diFFerent scientific fields, the successes and failures of histograms in approximating and compressing information, their adoption by industry, and solutions that hav ..."
Abstract
-
Cited by 67 (0 self)
- Add to MetaCart
The history of histograms is long and rich, full of detailed information in every step. It includes the course of histograms in diFFerent scientific fields, the successes and failures of histograms in approximating and compressing information, their adoption by industry, and solutions that have been given on a great variety of histogram-related problems. In this paper and in the same spirit of the histogram techniques themselves, we compress their entire history (including their "future history" as currently anticipated) in the given/fixed space budget, mostly recording details for the periods, events, and results with the highest (personally-biased) interest. In a limited set of experiments, the semantic distance between the compressed and the full form of the history was found relatively small!
Optimization techniques for queries with expensive methods
- ACM Transactions on Database Systems (TODS
, 1998
"... Object-Relational database management systems allow knowledgeable users to de ne new data types, as well as new methods (operators) for the types. This exibility produces an attendant complexity, which must be handled in new ways for an Object-Relational database management system to be e cient. In ..."
Abstract
-
Cited by 53 (3 self)
- Add to MetaCart
Object-Relational database management systems allow knowledgeable users to de ne new data types, as well as new methods (operators) for the types. This exibility produces an attendant complexity, which must be handled in new ways for an Object-Relational database management system to be e cient. In this paper we study techniques for optimizing queries that contain time-consuming methods. The focus of traditional query optimizers has been on the choice of join methods and orders; selections have been handled by \pushdown " rules. These rules apply selections in an arbitrary order before as many joins as possible, using the assumption that selection takes no time. However, users of Object-Relational systems can embed complex methods in selections. Thus selections may take signi cant amounts of time, and the query optimization model must be enhanced. In this paper, we carefully de ne a query cost framework that incorporates both selectivity and cost estimates for selections. We develop an algorithm called Predicate Migration, and prove that it produces optimal plans for queries with expensive methods. We then describe our implementation of Predicate Migration in the commercial Object-Relational database management system Illustra, and discuss practical issues that a ect our earlier assumptions. We compare Predicate Migration to a variety of simpler optimization techniques, and demonstrate that Predicate Migration is the best general solution to date. The alternative techniques we presentmaybe useful for constrained workloads.
Parallel query scheduling and optimization with time- and space-shared resources
- In Proceedings of the 23rd VLDB Conference
, 1997
"... ..."
Frequency-adaptive join for Shared Nothing machines
- Parallel and Distributed Computing Practices
, 1999
"... Although many skew-handling algorithms have been proposed for simple join operations, they remain generally inecient in the case of theta-join and in the case of multi-join. A new method for self-balancing equi-join operations on shared-nothing (SN) machines is proposed here. It offers deterministic ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
Although many skew-handling algorithms have been proposed for simple join operations, they remain generally inecient in the case of theta-join and in the case of multi-join. A new method for self-balancing equi-join operations on shared-nothing (SN) machines is proposed here. It offers deterministic and near-perfect load balancing through exible control of communications in intra-transaction parallelism. The new algorithm mixes a balanced data-distribution strategy with pure hash-join. Its predictably low join-product- and attribute-value skews make it suitable for repeated use in multi-join operations. Its tradeoff between balancing overhead and speedup is analyzed in the BSP (Bulk-synchronous parallel) computing model. The scalable model predicts a negligible join product skew and a near-linear speed-up in any combination of selectivity, skew and number of processors. This prediction is confirmed by a series of tests.
A skew-insensitive algorithm for join and multi-join operations on Shared Nothing machines
, 2000
"... Join is an expensive and frequently used operation whose parallelization is highly desirable. However effectiveness of parallel joins depends on the ability to evenly divide load among processors. Data skew can have a disastrous effect on performance. Although many skew-handling algorithms have been ..."
Abstract
-
Cited by 6 (5 self)
- Add to MetaCart
Join is an expensive and frequently used operation whose parallelization is highly desirable. However effectiveness of parallel joins depends on the ability to evenly divide load among processors. Data skew can have a disastrous effect on performance. Although many skew-handling algorithms have been proposed they remain generally inefficient in the case of multi-joins due to join product skew, costly and unnecessary redistribution and communication costs. A parallel join algorithm called fa join has been introduced in an earlier paper with deterministic and near-perfect balancing properties. Despite its advantages, fa join is sensitive to the correlation of the attribute value distributions in both relations. We present here an improved version of the algorithm called Sfa join with a symmetric treatment of both relations. Its predictably low join-product and attribute-value skew makes it suitable for repeated use in multi-join operations. Its performance is analyzed theoretically and expe...
An Efficient Scalable Parallel View Maintenance Algorithm for Shared Nothing Multi-Processor Machines
, 1999
"... . The problem of maintenance of materialized views has been the object of increased research activity recently mainly because of applications related to data warehousing. Many sequential view maintenance algorithms are developed in the literature. If the view is defined by a relational expression ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
. The problem of maintenance of materialized views has been the object of increased research activity recently mainly because of applications related to data warehousing. Many sequential view maintenance algorithms are developed in the literature. If the view is defined by a relational expression involving join operators, the cost of re-evaluating the view even incrementally may be unacceptable. Moreover, when views are materialized, parallelism can greatly increase processing power as necessary for view maintenance. In this paper, we present a new parallel join algorithm by partial duplication of data and a new parallel view maintenance algorithm where views can involve multi-joins. The performances of these algorithms are analyzed using the scalable and portable BSP 1 cost model which predicts a near-linear speedup. Key words: PDBMS 2 , materialized view maintenance, parallel incremental algorithm, data-warehouse, multi-joins, data skew, dynamic load balancing. 1 Int...

