Results 1 - 10
of
69
Selection of Views to Materialize in a Data Warehouse
, 1997
"... . A data warehouse stores materialized views of data from one or more sources, with the purpose of efficiently implementing decisionsupport or OLAP queries. One of the most important decisions in designing a data warehouse is the selection of materialized views to be maintained at the warehouse. The ..."
Abstract
-
Cited by 161 (5 self)
- Add to MetaCart
. A data warehouse stores materialized views of data from one or more sources, with the purpose of efficiently implementing decisionsupport or OLAP queries. One of the most important decisions in designing a data warehouse is the selection of materialized views to be maintained at the warehouse. The goal is to select an appropriate set of views that minimizes total query response time and the cost of maintaining the selected views, given a limited amount of resource, e.g., materialization time, storage space etc. In this article, we develop a theoretical framework for the general problem of selection of views in a data warehouse. We present competitive polynomial-time heuristics for selection of views to optimize total query response time, for some important special cases of the general data warehouse scenario, viz.: (i) an AND view graph, where each query/view has a unique evaluation, and (ii) an OR view graph, in which any view can be computed from any one of its related views, e.g.,...
Materialized View Selection in a Multidimensional Database
, 1997
"... A multidimensional database is a data repository that supports the efficient execution of complex business decision queries. Query response can be significantly improved by storing an appropriate set of materialized views. These views are selected from the multidimensional lattice whose elements rep ..."
Abstract
-
Cited by 114 (1 self)
- Add to MetaCart
A multidimensional database is a data repository that supports the efficient execution of complex business decision queries. Query response can be significantly improved by storing an appropriate set of materialized views. These views are selected from the multidimensional lattice whose elements represent the solution space of the problem. Several techniques have been proposed in the past to perform the selection of materialized views for databases with a reduced number of dimensions. When the number and complexity of dimensions increase, the proposed techniques do not scale well. The technique we are proposing reduces the solution space by considering only the relevant elements of the multidimensional lattice. An additional statistical analysis allows a further reduction of the solution space. 1 Introduction A multidimensional database (MDDB) is a data repository that provides an integrated environment for decision support queries that require complex aggregations on huge amounts of...
An efficient cost-driven index selection tool for Microsoft SQL Server
, 1997
"... In this paper we describe novel techniques that make it possible to build an industrial-strength tool for automating the choice of indexes in the physical design of a SQL database. The tool takes as input a workload of SQL queries, and suggests a set of suitable indexes. We ensure that the indexes c ..."
Abstract
-
Cited by 103 (16 self)
- Add to MetaCart
In this paper we describe novel techniques that make it possible to build an industrial-strength tool for automating the choice of indexes in the physical design of a SQL database. The tool takes as input a workload of SQL queries, and suggests a set of suitable indexes. We ensure that the indexes chosen are effective in reducing the cost of the workload by keeping the index selection tool and the query optimizer "in step". The number of index sets that must be evaluated to find the optimal configuration is very large. We reduce the complexity of this problem using three techniques. First, we remove a large number of spurious indexes from consideration by taking into account both query syntax and cost information. Second, we introduce optimizations that make it possible to cheaply evaluate the “goodness ” of an index set. Third, we describe an iterative approach to handle the complexity arising from multicolumn indexes. The tool has been implemented on Microsoft SQL Server 7.0. We performed extensive experiments over a range of workloads, including TPC-D. The results indicate that the tool is efficient and its choices are close to optimal. 1.
Efficient and Extensible Algorithms for Multi Query Optimization
, 2000
"... Complex queries are becoming commonplace, with the growing use of decision support systems. These complex queries often have a lot of common sub-expressions, either within a single query, or across multiple such queries run as a batch. Multi-query optimization aims at exploiting common subexpressi ..."
Abstract
-
Cited by 99 (4 self)
- Add to MetaCart
Complex queries are becoming commonplace, with the growing use of decision support systems. These complex queries often have a lot of common sub-expressions, either within a single query, or across multiple such queries run as a batch. Multi-query optimization aims at exploiting common subexpressions to reduce evaluation cost. Multi-query optimization has hither-to been viewed as impractical, since earlier algorithms were exhaustive, and explore a doubly exponential search space. In this paper we demonstrate that multi-query optimization using heuristics is practical, and provides significant benefits. We propose three cost-based heuristic algorithms: Volcano-SH and Volcano-RU, which are based on simple modifications to the Volcano search strategy, and a greedy heuristic. Our greedy heuristic incorporates novel optimizations that improve efficiency greatly. Our algorithms are designed to be easily added to existing optimizers. We present a performance study comparing the algo...
Practical Lineage Tracing in Data Warehouses
- In ICDE
, 1999
"... We consider the view data lineage problem in a warehousing environment: For a given data item in a materialized warehouse view, we want to identify the set of source data items that produced the view item. We formalize the problem and present a lineage tracing algorithm for relational views with agg ..."
Abstract
-
Cited by 73 (6 self)
- Add to MetaCart
We consider the view data lineage problem in a warehousing environment: For a given data item in a materialized warehouse view, we want to identify the set of source data items that produced the view item. We formalize the problem and present a lineage tracing algorithm for relational views with aggregation. Based on our tracing algorithm, we propose a number of schemes for storing auxiliary views that enable consistent and efficient lineage tracing in a multisource data warehouse. We report on a performance study of the various schemes, identifying which schemes perform best in which settings. Based on our results, we have implemented a lineage tracing package in the WHIPS data warehousing system prototype at Stanford. With this package, users can select view tuples of interest, then efficiently "drill down" to examine the source data that produced them. 1 Introduction Data warehousing systems collect data from multiple distributed sources, integrate the information as materialized v...
Rewriting Aggregate Queries Using Views
, 1999
"... We investigate the problem of rewriting queries with aggregate operators using views that mayormay not contain aggregate operators. A rewriting of a query is a second query that uses view predicates such that evaluating first the views and then the rewriting yields the same result as evaluating the ..."
Abstract
-
Cited by 70 (7 self)
- Add to MetaCart
We investigate the problem of rewriting queries with aggregate operators using views that mayormay not contain aggregate operators. A rewriting of a query is a second query that uses view predicates such that evaluating first the views and then the rewriting yields the same result as evaluating the original query. In this sense, the original query and the rewriting are equivalent modulo the view definitions. The queries and views we consider correspond to unnested SQL queries, possibly with union, that employ the operators min, max, count, and sum. Our approach is based on syntactic characterizations of the equivalence of aggregate queries. One contribution of this paper are characterizations of the equivalence of disjunctive aggregate queries, which generalize our previous results for the conjunctive case. For each operator α, we introduce several types of queries using views as candidates for rewritings. We unfold such a candidate by replacing each occurrence of a view predicate with ...
Selection of Views to Materialize Under a Maintenance Cost Constraint
, 1999
"... . A data warehouse stores materialized views derived from one or more sources for the purpose of efficiently implementing decisionsupport or OLAP queries. One of the most important decisions in designing a data warehouse is the selection of materialized views to be maintained at the warehouse. The g ..."
Abstract
-
Cited by 61 (0 self)
- Add to MetaCart
. A data warehouse stores materialized views derived from one or more sources for the purpose of efficiently implementing decisionsupport or OLAP queries. One of the most important decisions in designing a data warehouse is the selection of materialized views to be maintained at the warehouse. The goal is to select an appropriate set of views that minimizes total query response time and/or the cost of maintaining the selected views, given a limited amount of resource such as materialization time, storage space, or total view maintenance time. In this article, we develop algorithms to select a set of views to materialize in a data warehouse in order to minimize the total query response time under the constraint of a given total view maintenance time. As the above maintenance-cost view-selection problem is extremely intractable, we tackle some special cases and design approximation algorithms. First, we design an approximation greedy algorithm for the maintenance-cost view-selection prob...
Data Warehouse Configuration
, 1997
"... In the data warehousing approach to the integration of data from multiple information sources, selected information is extracted in advance and stored in a repository. A data warehouse (DW) can therefore be seen as a set of materialized views defined over the sources. When a query is posed, it is ev ..."
Abstract
-
Cited by 60 (4 self)
- Add to MetaCart
In the data warehousing approach to the integration of data from multiple information sources, selected information is extracted in advance and stored in a repository. A data warehouse (DW) can therefore be seen as a set of materialized views defined over the sources. When a query is posed, it is evaluated locally, using the materialized views, without accessing the original information sources. The applications using DWs require high query performance. This requirement is in conflict with the need to maintain in the DW updated information. The DW configuration problem is the problem of selecting a set of views to materialize in the DW that answers all the queries of interest while minimizing the total query evaluation and view maintenance cost. In this paper we provide a theoretical framework for this problem in terms of the relational model. We develop a method for dealing with it by formulating it as a state space optimization problem and then solving it using an exh...
Design and Evaluation of Alternative Selection Placement Strategies in Optimizing Continuous Queries
- In ICDE
, 2002
"... selection placement strategies for optimizing a very large number of continuous queries in an Internet environment. Two grouping strategies, PushDown and PullUp, in which selections are either pushed below, or pulled above, joins are proposed and investigated. While our earlier research has demonstr ..."
Abstract
-
Cited by 52 (1 self)
- Add to MetaCart
selection placement strategies for optimizing a very large number of continuous queries in an Internet environment. Two grouping strategies, PushDown and PullUp, in which selections are either pushed below, or pulled above, joins are proposed and investigated. While our earlier research has demonstrated that the incremental group optimization can significantly outperform an ungrouped approach, the results from this paper show that different incremental group optimization strategies can have significantly different performance characteristics. Surprisingly, in our studies, PullUp, in which selections are pulled above joins, is often better and achieves an average 10-fold performance improvement over PushDown (occasionally 100 times faster). Furthermore, a revised algorithm of PullUp, termed filtered PullUp is proposed that is able to further reduce the cost of PullUp by 75% when the union of the selection predicates is selective. Detailed cost models, which consider several special parameters, including (1) characteristics of queries to be grouped, and (2) characteristics of data changes, are presented in this paper. Preliminary experiments using an implementation of both strategies show that our models are fairly accurate in predicting the results obtained from the implementation of these techniques in the Niagara system. This work can serve as the basis for building a cost-based incremental group query optimizer to choose a better grouping strategy.
K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources
, 2000
"... The integration of heterogeneous data sources and software systems is a major issue in the biomedical community and several approaches have been explored: linking databases, "on-the-fly" integration through views, and integration through warehousing. In this paper we report on our experiences with t ..."
Abstract
-
Cited by 52 (4 self)
- Add to MetaCart
The integration of heterogeneous data sources and software systems is a major issue in the biomedical community and several approaches have been explored: linking databases, "on-the-fly" integration through views, and integration through warehousing. In this paper we report on our experiences with two systems that were developed at the University of Pennsylvania: an integration system called K2, which has primarily been used to provide views over multiple external data sources and software systems; and a data warehouse called GUS which downloads, cleans, integrates and annotates data from multiple external data sources. Although the view and warehouse approaches each have their advantages, there is no clear "winner". Therefore, users must consider how the data is to be used, what the performance guarantees must be, and how much programmer time and expertise is available to choose the best strategy for a particular application.

