## Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data (2001)

Venue: | In SIGMOD |

Citations: | 66 - 12 self |

### BibTeX

@INPROCEEDINGS{Deshpande01independenceis,

author = {Amol Deshpande},

title = {Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data},

booktitle = {In SIGMOD},

year = {2001}

}

### Abstract

Approximating the joint data distribution of a multi-dimensional data set through a compact and accurate histogram synopsis is a fundamental problem arising in numerous practical scenarios, including query optimization and approximate query answering. Existing solutions either rely on simplistic independence assumptions or try to directly approximate the full joint data distribution over the complete set of attributes. Unfortunately, both approaches are doomed to fail for high-dimensional data sets with complex correlation patterns between attributes. In this paper, we propose a novel approach to histogram-based synopses that employs the solid foundation of statistical interaction models to explicitly identify and exploit the statistical characteristics of the data. Abstractly, our key idea is to break the synopsis into (1) a statistical interaction model that accurately captures significant correlation and independence patterns in data, and (2) a collection of histograms on low-dimensional marginals that, based on the model, can provide accurate approximations of the overall joint data distribution. Extensive experimental results with several real-life data sets verify the effectiveness of our approach. An important aspect of our general, model-based methodology is that it can be used to enhance the performance of other synopsis techniques that are based on data-space partitioning (e.g., wavelets) by providing an effective tool to deal with the “dimensionality curse”.
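
The decomposition described in the abstract can be illustrated with a minimal sketch; it is not the paper's implementation. It assumes a three-attribute data set and an interaction model in which X2 and X3 are independent given X1, so the joint count is approximated from the (X1, X2) and (X1, X3) marginals. Exact marginal counts stand in for the bucketized histograms a real synopsis would store, and the function names and toy data are hypothetical.

```python
from collections import Counter


def build_marginals(rows):
    """Count the (X1, X2), (X1, X3) and X1 marginals of 3-attribute rows."""
    m12 = Counter((x1, x2) for x1, x2, _ in rows)
    m13 = Counter((x1, x3) for x1, _, x3 in rows)
    m1 = Counter(x1 for x1, _, _ in rows)
    return m12, m13, m1


def estimate_count(m12, m13, m1, x1, x2, x3):
    """Estimate the joint count as n(x1, x2) * n(x1, x3) / n(x1)."""
    if m1[x1] == 0:
        return 0.0
    return m12[(x1, x2)] * m13[(x1, x3)] / m1[x1]


# Hypothetical toy data in which X2 and X3 are independent given X1.
rows = (
    [(0, 0, 0)] * 6 + [(0, 0, 1)] * 2 + [(0, 1, 0)] * 3 + [(0, 1, 1)] * 1
    + [(1, 0, 0)] * 1 + [(1, 0, 1)] * 4 + [(1, 1, 0)] * 2 + [(1, 1, 1)] * 8
)

m12, m13, m1 = build_marginals(rows)
print(estimate_count(m12, m13, m1, 0, 0, 0))  # 6.0, the true count
```

Because the toy data satisfies the assumed conditional independence exactly, the two small marginals reproduce every joint count. On real data the estimate is only as good as the chosen model and the histogram approximation of each marginal.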

### Citations

8984 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 2001
Citation Context ... the issue of finding the optimal order for multiplying the relevant histograms for obtaining a given marginal. This problem is similar in spirit to the well-known matrix-chain multiplication problem =-=[5]-=- and, in fact, reduces to that problem in the case of a simple path model graph and two-variable range predicates. Optimizing the multiplication operations for general model graphs and range predic... |
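
For reference, the matrix-chain multiplication problem mentioned in this context has a textbook dynamic-programming solution, sketched below. This is the classical algorithm, not the paper's histogram-multiplication procedure; the function name and the `dims` convention (matrix i has shape `dims[i]` by `dims[i + 1]`) are assumptions made for the illustration.

```python
def matrix_chain_order(dims):
    """Return (minimum scalar multiplications, parenthesization) for a chain.

    Matrix i (0-based) has shape dims[i] x dims[i + 1]; at least one matrix
    is expected, so len(dims) >= 2.
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]   # cost[i][j]: best cost for Ai..Aj
    split = [[0] * n for _ in range(n)]  # split[i][j]: best split point k
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):
                c = (cost[i][k] + cost[k + 1][j]
                     + dims[i] * dims[k + 1] * dims[j + 1])
                if c < cost[i][j]:
                    cost[i][j] = c
                    split[i][j] = k

    def render(i, j):
        # Rebuild the optimal parenthesization from the split table.
        if i == j:
            return f"A{i + 1}"
        k = split[i][j]
        return f"({render(i, k)} x {render(k + 1, j)})"

    return cost[0][n - 1], render(0, n - 1)


print(matrix_chain_order([10, 30, 5, 60]))  # (4500, '((A1 x A2) x A3)')
```

The table has one entry per contiguous sub-chain, so the running time is cubic in the number of matrices.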

7441 | Probabilistic Reasoning in Intelligent Systems
- Pearl
- 1988
Citation Context ...ractice of Markov networks, the interested reader is referred to [20].) It is easy to see that the number and complexity of possible decomposable models grows explosively with the number of variables =-=[6, 15, 17]-=-. Figure 1(b) shows a somewhat more complex example of a 5-dimensional decomposable model, namely [123][124][15]. From the Markov graph, it is easy to read, for example, that var... |

465 | Graphical models in applied multivariate statistics
- Whittaker
- 1990
Citation Context ... in this paper, we propose DEPENDENCY-BASED (DB) histograms, a novel approach to building histogram synopses for high-dimensional data that uses the solid foundation of statistical interaction models =-=[1, 4, 20]-=- to explore the spectrum of possibilities between the existing “fully-independent” and “fully-correlated” approaches. Abstractly, our key technical idea is to break the synopsis of a high-dimensional ... |

248 | Improved histograms for selectivity estimation of range predicates
- Poosala
- 1996
Citation Context ...compact synopsis structures that approximate a joint frequency distribution with reasonable accuracy in limited space is critical for numerous applications, including query optimization and profiling =-=[19]-=-. Histogram-based synopses for approximating one-dimensional data distributions have been extensively studied in the research literature [13, 19], and have been adopted by several commercial database ... |

213 | Discrete multivariate analysis
- Bishop, Fienberg, et al.
- 1975
Citation Context ... in this paper, we propose DEPENDENCY-BASED (DB) histograms, a novel approach to building histogram synopses for high-dimensional data that uses the solid foundation of statistical interaction models =-=[1, 4, 20]-=- to explore the spectrum of possibilities between the existing “fully-independent” and “fully-correlated” approaches. Abstractly, our key technical idea is to break the synopsis of a high-dimensional ... |

212 | Selectivity Estimation Without the Attribute Value Independence Assumption
- Poosala, Ioannidis
- 1997
Citation Context ...ery answering. Cost-based query optimizers employ such synopses to obtain accurate estimates of intermediate result sizes that are, in turn, needed to evaluate the quality of different execution plans =-=[11, 18]-=-. Similarly, query profilers and approximate query processors require compact data synopses in order to provide users with fast, useful feedback on their original query [3, 19]. Such query feedback (t... |

189 | Approximate query processing using wavelets
- Chakrabarti, Garofalakis, et al.
- 2000
Citation Context ...ferent execution plans [11, 18]. Similarly, query profilers and approximate query processors require compact data synopses in order to provide users with fast, useful feedback on their original query =-=[3, 19]-=-. Such query feedback (typically, in the form of an approximate answer) allows OLAP and data-mining users to identify the truly interesting regions of a data set and, thus, focus their explorations qui... |

136 | Resource Allocation Problems: Algorithmic Approaches
- Ibaraki, Katoh
- 1988
Citation Context ...can now be stated as follows: minimize the total error (the sum of the ERR values of the individual clique histograms), subject to the constraint that their total space does not exceed the available budget. This is essentially a discrete resource allocation problem with a separable objective function =-=[12]-=-, which can be solved optimally in pseudo-polynomial time using dynamic programming. More specifically, let the subproblem value denote the minimum achievable total error for a prefix of the clique histograms wh... |
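
A separable discrete resource allocation problem of this kind can be solved with a simple dynamic program over the histograms and the budget. The sketch below is an illustration under stated assumptions rather than the paper's algorithm: `err[i][b]` is a hypothetical, precomputed error of histogram i when it receives b units of space, and `allocate_buckets` is an invented name.

```python
def allocate_buckets(err, budget):
    """Split `budget` units of space among histograms to minimize total error.

    err[i][b] is the (hypothetical) error of histogram i when given b units,
    for b = 0 .. len(err[i]) - 1. Returns (total error, allocation list).
    """
    inf = float("inf")
    # best[s]: minimum total error of the histograms processed so far
    # when they use exactly s units of space.
    best = [0.0] + [inf] * budget
    choices = []
    for table in err:
        new = [inf] * (budget + 1)
        pick = [0] * (budget + 1)
        for s in range(budget + 1):
            for b in range(min(s, len(table) - 1) + 1):
                cand = best[s - b] + table[b]
                if cand < new[s]:
                    new[s], pick[s] = cand, b
        best = new
        choices.append(pick)

    # The constraint is "at most budget", so take the best total over all s.
    s = min(range(budget + 1), key=lambda t: best[t])
    total = best[s]
    alloc = []
    for pick in reversed(choices):  # walk back through the recorded choices
        alloc.append(pick[s])
        s -= pick[s]
    alloc.reverse()
    return total, alloc


print(allocate_buckets([[10, 6, 3, 2], [8, 3, 2, 1]], 3))  # (6.0, [2, 1])
```

The running time is proportional to the number of histograms times the square of the budget, which is pseudo-polynomial because it depends on the numeric value of the budget rather than on its encoding length.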

119 | An Introduction to Chordal Graphs and Clique Trees, in Graph Theory and Sparse...
- Blair, Peyton
Citation Context ...eoretic properties of the Markov network representation of a model; therefore, we use the same symbol to refer to both the interaction model and its Markov graph in the remainder of this paper. [Figure 1 panels: (a) Markov networks over X1, X2, X3 for the models [1][2][3], [1][23], and [12][13]; (b) Markov network over X1, ..., X5 for the model [123][124][15]; (c) cliques {X1 X2 X4}, {X1 X2 X3}, {X1 X5}.] Figure 1: Markov networks for (a) three simple 3-dimensional decomposabl... |

84 | Optimal Junction Trees
- Jensen, Jensen
- 1994
Citation Context ...posable, since it corresponds to a non-chordal Markov graph (a cycle). A compact and particularly useful representation of chordal graphs is provided by junction trees (also known as clique trees) =-=[2, 14]-=-. Briefly, given a chordal graph, a junction tree is a tree structure defined over the cliques (i.e., generators) of the graph, characterized by the following clique-intersec... |

79 | Approximating Multi-Dimensional Aggregate Range Queries Over Real Attributes
- Gunopulos, Kollios, et al.
- 2000
Citation Context ...ery answering. Cost-based query optimizers employ such synopses to obtain accurate estimates of intermediate result sizes that are, in turn, needed to evaluate the quality of different execution plans =-=[11, 18]-=-. Similarly, query profilers and approximate query processors require compact data synopses in order to provide users with fast, useful feedback on their original query [3, 19]. Such query feedback (t... |

63 | Discrete Optimization Via Marginal Analysis
- Fox
- 1966
Citation Context ...g algorithm for this problem as well as a cheaper greedy heuristic based on the concept of marginal gains that is, in fact, optimal when the histogram error functions follow a diminishing-returns law =-=[10]-=-. DB-Histogram Usage. Efficiently estimating the selectivity of a range query predicate over (a subset of) the data attributes using DB histograms requires new techniques that effectively utilize th... |
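
A greedy heuristic based on marginal gains can be sketched as follows. This is a generic illustration of the idea, not the paper's procedure: each unit of space goes to the histogram whose error would drop the most, and `err[i][b]` is again a hypothetical table giving the error of histogram i with b units. The name `greedy_allocate` is invented for the example.

```python
def greedy_allocate(err, budget):
    """Hand out space one unit at a time by largest marginal error reduction.

    err[i][b] is the (hypothetical) error of histogram i when given b units.
    Returns (total error, allocation list).
    """
    alloc = [0] * len(err)
    for _ in range(budget):
        best_i, best_gain = None, 0.0
        for i, table in enumerate(err):
            b = alloc[i]
            if b + 1 < len(table):
                gain = table[b] - table[b + 1]  # error saved by one more unit
                if gain > best_gain:
                    best_i, best_gain = i, gain
        if best_i is None:
            break  # no histogram benefits from additional space
        alloc[best_i] += 1
    total = sum(table[b] for table, b in zip(err, alloc))
    return total, alloc


# Both error tables below have diminishing returns: each extra unit
# saves no more than the previous one did.
print(greedy_allocate([[10, 6, 3, 2], [8, 3, 2, 1]], 3))  # (6, [2, 1])
```

On tables with diminishing returns, such as the ones in this example, the greedy choice reaches the minimum total error; when that property fails, the greedy result can be worse than an exact dynamic-programming solution.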

58 | Optimal histograms for limiting worst-case error propagation in the size of the join results
- Ioannidis, Christodoulakis
- 1993
Citation Context ...ine how well-posed/semantically-correct their query is, allowing them to make an informed decision on whether they would like to invest more time and resources to execute it to completion. Histograms =-=[13, 19]-=- constitute a very general class of synopsis structures that offer several advantages, including (1) they are typically built off-line and stored in the DBMS catalog, thus incurring almost no run-time... |

47 | On rectangular partitions in two dimensions: Algorithms, complexity and applications
- Muthukrishnan, Poosala, et al.
- 1999
Citation Context ...st need of partitioning until the space budget (i.e., number of available buckets) is exhausted. Grid Histograms are based on a simple generalization of rectangular array partitionings (e.g., =-=[16]-=-) to higher dimensionalities. We use a simple greedy algorithm for building grid histograms that, at each step, partitions the entire data distribution along the dimension that is in most need of part... |

28 | Approximating Discrete Probability Distributions with Decomposable Models
- Malvestuto
- 1991
Citation Context ...ny, is also specified to vanish. A simple notation used for describing hierarchical log-linear models is to list their maximal variable interaction components (also known as the model generators =-=[15]-=-) in square brackets. For instance, the saturated 3-dimensional log-linear model can be denoted as [123] and the full-independence model is simply [1][2][3]. More interesting correlation... |

22 | Relaxing the uniformity and independence assumptions using the concept of fractal dimension
- Faloutsos, Kamel
- 1997
Citation Context ...ucts of one-dimensional marginals. Unfortunately, experience with real-life data proves that the full-independence model is almost always invalid and can lead to gross approximation errors in practice =-=[9, 18]-=-. As a consequence, more recent work has proposed algorithms for building multi-dimensional histogram synopses that try to directly approximate the joint data distribution of a multi-attribute data se... |

3 | Log-Linear Models and Logistic Regression. Springer-Verlag
- Christensen
- 1997
Citation Context ... in this paper, we propose DEPENDENCY-BASED (DB) histograms, a novel approach to building histogram synopses for high-dimensional data that uses the solid foundation of statistical interaction models =-=[1, 4, 20]-=- to explore the spectrum of possibilities between the existing “fully-independent” and “fully-correlated” approaches. Abstractly, our key technical idea is to break the synopsis of a high-dimensional ... |