## Evaluating Query Result Significance in Databases via Randomizations

Citations: | 2 - 0 self |

### BibTeX

@MISC{Ojala_evaluatingquery,
    author = {Markus Ojala and Gemma C. Garriga and Aristides Gionis and Heikki Mannila},
    title = {Evaluating Query Result Significance in Databases via Randomizations},
    year = {}
}

### Abstract

Many sorts of structured data are commonly stored in a multi-relational format of interrelated tables. Under this relational model, exploratory data analysis can be done by using relational queries. As an example, in the Internet Movie Database (IMDb) a query can be used to check whether the average rank of action movies is higher than the average rank of drama movies. We consider the problem of assessing whether the results returned by such a query are statistically significant or just a random artifact of the structure in the data. Our approach is based on randomizing the tables occurring in the queries and repeating the original query on the randomized tables. It turns out that there is no unique way of randomizing multi-relational data. We propose several randomization techniques, study their properties, and show how to find out which queries or hypotheses about our data result in statistically significant information and which tables in the database convey most of the structure in the query. We give results on real and generated data and show how the significance of some queries varies between different randomizations.

### Citations

2873 | Controlling the false discovery rate: a practical and powerful approach to multiple testing
- Benjamini, Hochberg
- 1995
Citation Context ...for selecting a list of rejected null hypotheses, especially in exploratory data analysis. For example, the Benjamini-Hochberg method is a simple way to keep the FDR below the chosen threshold α [1]. In this paper, we will not use any correction for multiple comparisons to keep the experimental results simple and easily interpretable. The main contribution of this paper resides in the new approac...
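
The Benjamini-Hochberg procedure mentioned in this excerpt can be illustrated with a minimal sketch. The function name `benjamini_hochberg` and the example p-values are assumptions made here for illustration; the paper itself states that it applies no multiple-comparison correction, so this is not part of its method.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Standard step-up rule: find the largest rank k (1-based, ascending
    p-values) with p_(k) <= k * alpha / m, then reject hypotheses 1..k.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from eight queries.
decisions = benjamini_hochberg(
    [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205], alpha=0.05
)
print(decisions)
```

Because the rule is step-up, a small p-value can fail its own threshold and still be rejected when a larger one passes further down the sorted list.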

811 | Bootstrap Methods: Another Look at the Jackknife
- Efron
Citation Context ...ch to measuring p-values for patterns, see [20]. A related work that studies permutations on networks and how this affects significance of patterns is [14]. Sub-sampling methods such as bootstrapping [9] use randomization to study the properties of the underlying distribution instead of testing the data against some null-model. Finally, database theory studies mainly query processing and optimization...

566 | A simple sequentially rejective multiple test procedure
- Holm
- 1979
Citation Context ...proach: the probability of making a Type II error, i.e., the error of failing to reject a null hypothesis when it is not true, is high. The extended Holm-Bonferroni method alleviates this problem slightly [12]. Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. [Figure residue: two tables with rows Romance, Drama, History and columns 30, 60. (a) GM · MD · DA: Romance 1 0; Drama 1 1; History 0 1. (b) GM ∗ MD ∗ DA: Romance 2 0; Drama 3 2; History 0 2.] Fi...
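
For comparison with plain Bonferroni, the following is a minimal sketch of the standard Holm step-down procedure. It assumes the usual textbook formulation; the "extended" variant named in the excerpt is not specified there, and the function name `holm_bonferroni` and the example values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Step-down rule: compare the i-th smallest p-value (i = 0, 1, ...)
    with alpha / (m - i) and stop at the first one that fails.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # every larger p-value is retained as well
    return rejected


# Hypothetical p-values: plain Bonferroni (threshold alpha / 3) would keep
# the third hypothesis, while the step-down thresholds reject all three.
print(holm_bonferroni([0.005, 0.015, 0.04], alpha=0.05))
```

The thresholds grow as hypotheses are rejected, which is why the procedure has a lower Type II error rate than Bonferroni while still controlling the family-wise error rate.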

495 | Beyond market baskets: generalizing association rules to correlations
- Brin, Motwani, et al.
- 1997
Citation Context .... George, 60)} [Figure 1: A toy example of a multi-relational database with three binary relations: movies classified by genres, GM; movies directed by directors, MD; and ages of directors, DA.] et al. [18] considered measuring the significance of rules via the chi-squared test, and from there many other papers followed; see e.g. [19] for a comprehensive survey. More recently, the approach of defining ra...

187 | Selecting the right interestingness measure for association patterns
- Tan, Kumar, et al.
Citation Context ...M; movies directed by directors, MD; and ages of directors, DA. et al. [18] considered measuring the significance of rules via the chi-squared test, and from there many other papers followed; see e.g. [19] for a comprehensive survey. More recently, the approach of defining randomization tests to assess data mining results was introduced for binary data [10], and for real-valued data [15]. Abstracting a...

150 | Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses
- Good
- 2000

73 | Combinatorial properties of matrices of zeros and ones
- Ryser
- 1957
Citation Context ...s one 1 in each column, then sw(A) = cp(A); if A has one 1 in each row, then sw(A) = rp(A). Proof. Note that sw(I), for identity matrix I, can produce any permutation matrix with uniform distribution [17]. Thus, the boolean product sw(I) · A produces all permutations for the rows of A and similarly, the boolean product A · sw(I) produces all permutations of the columns of A. If A has exactly one 1 in ...

70 | Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs
- Kashtan, Itzkovitz, et al.
- 2007
Citation Context ...s on real-valued data. For another type of approach to measuring p-values for patterns, see [20]. A related work that studies permutations on networks and how this affects significance of patterns is [14]. Sub-sampling methods such as bootstrapping [9] use randomization to study the properties of the underlying distribution instead of testing the data against some null-model. Finally, database theory ...

42 | Discovering Significant Patterns
- Webb
- 2007
Citation Context ...erns: the papers [7, 10] deal with randomizations on binary data, and the work in [15] studies randomizations on real-valued data. For another type of approach to measuring p-values for patterns, see [20]. A related work that studies permutations on networks and how this affects significance of patterns is [14]. Sub-sampling methods such as bootstrapping [9] use randomization to study the properties o...

38 | Generalized Monte Carlo significance tests
- Besag, Clifford
- 1989
Citation Context ...relation A, a local swap represents a flip between two independent edges. [Figure: a local swap between two independent edges on nodes i, j, k, l.] A sequence of swaps is performed until the data mixes sufficiently in a Markov chain approach [2, 3], and therefore, a random sample of A is obtained. We use ten times the number of ones in the matrix as the number of swaps, which suffices for the convergence of the chain [10]. We denote the set of ...

33 | Assessing data mining results via swap randomization
- Gionis, Mannila, et al.
Citation Context ...rom there many other papers followed; see e.g. [19] for a comprehensive survey. More recently, the approach of defining randomization tests to assess data mining results was introduced for binary data [10], and for real-valued data [15]. Abstracting a bit from the question of how significant patterns are in the data, we introduce here the statistical testing framework to databases and the exploratory t...

28 | Sequential Monte Carlo p-values
- Besag, Clifford
- 1991
Citation Context ... as evaluating the query. The time and space consumption of the methods scales linearly in the size of the relation. The sequential approach for calculating the empirical p-value by Besag and Clifford [4] can be used to determine a sufficient number of randomized samples in large-scale applications. For example, 30 samples are usually sufficient to determine the significance of the resu...
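
A minimal sketch of a sequential Monte Carlo p-value in the style of Besag and Clifford is given below. It assumes a one-sided test in which large statistic values count as extreme; the function name `sequential_p_value`, the callback `draw_statistic`, and the default values of `h` and `max_samples` are hypothetical choices for illustration, not values taken from the paper.

```python
import random


def sequential_p_value(observed, draw_statistic, h=10, max_samples=999):
    """Sequential empirical p-value (one-sided, large values are extreme).

    Randomized statistics are drawn one at a time. Sampling stops early once
    h of them are at least as large as the observed value; otherwise all
    max_samples are drawn and the usual empirical p-value is returned.
    """
    exceed = 0
    for drawn in range(1, max_samples + 1):
        if draw_statistic() >= observed:
            exceed += 1
            if exceed == h:
                # Early stop: the result is clearly not significant.
                return h / drawn
    return (exceed + 1) / (max_samples + 1)


# Hypothetical usage: the randomized statistic is standard normal noise.
rng = random.Random(0)
print(sequential_p_value(0.1, lambda: rng.gauss(0.0, 1.0)))
```

The early stop is what saves work in large-scale settings: queries whose result is clearly insignificant need only a few randomized samples, while the full budget is spent only on promising ones.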

22 | Database Management Systems. McGraw-Hill Higher Education
- Ramakrishnan, Gehrke
- 2003
Citation Context ...S ⊆ {A_1, ..., A_n}. The result of a query is denoted by q(⋊⋉S). We say that S is the set of relations occurring in the query. A query can be described with the operators of projection and selection [16], applied to a join ⋊⋉S. Projection is a unary operator π_X(⋊⋉S) that restricts tuples of ⋊⋉S to attributes in X. Selection is a unary operator σ_ϕ(⋊⋉S) where ϕ is a propositional formula. The operator ...
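
The join, selection, and projection operators described in this excerpt can be sketched on toy relations. The relation names GM, MD, and DA and the pair (George, 60) come from the excerpts on this page; every other tuple, and the helper names `natural_join`, `select`, and `project`, are assumptions made for illustration.

```python
def natural_join(r, s):
    """Join two relations (lists of dicts) on their shared attribute names."""
    joined = []
    for t in r:
        for u in s:
            shared = set(t) & set(u)
            if all(t[a] == u[a] for a in shared):
                joined.append({**t, **u})
    return joined


def select(relation, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(relation, attributes):
    """Projection: restrict tuples to the given attributes, dropping duplicates."""
    seen, result = set(), []
    for t in relation:
        key = tuple(t[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


# Hypothetical toy data in the spirit of the genre/movie/director example.
GM = [{"genre": "Drama", "movie": "m1"},
      {"genre": "Romance", "movie": "m1"},
      {"genre": "History", "movie": "m2"}]
MD = [{"movie": "m1", "director": "Ann"},
      {"movie": "m2", "director": "George"}]
DA = [{"director": "Ann", "age": 30},
      {"director": "George", "age": 60}]

join_S = natural_join(natural_join(GM, MD), DA)
older = select(join_S, lambda t: t["age"] >= 60)
genres_of_older_directors = project(older, ["genre"])
print(genres_of_older_directors)
```

Representing tuples as dictionaries keeps the sketch short; a real database engine would evaluate the same query with indexed joins rather than nested loops.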

12 | An application of Markov chain Monte Carlo to community ecology
- Cobb, Chen
- 2003
Citation Context ...he binary table presentation of A, e.g., as seen in Figure 2. The running times and space consumptions of the methods are linear in the size of the relation A. (1) Swap randomization of A, as used in [7, 10], produces random samples of A that preserve the row and column sums. The algorithm starts from the original dataset A and performs local swaps interchanging a pair of 1’s with a pair of 0’s preservin...
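
Swap randomization as described in this excerpt can be sketched as follows. This is a minimal illustration, assuming the binary relation is given as a list of 0/1 rows; the function name `swap_randomize` is hypothetical, and the loop counts attempted swaps (ten times the number of ones by default), which is an assumption about how the swap budget is interpreted.

```python
import random


def swap_randomize(matrix, num_swaps=None, rng=None):
    """Return a randomized copy of a 0/1 matrix with the same row and column sums."""
    rng = rng or random.Random()
    a = [row[:] for row in matrix]  # work on a copy, leave the input intact
    ones = [(i, j) for i, row in enumerate(a) for j, v in enumerate(row) if v == 1]
    if len(ones) < 2:
        return a
    if num_swaps is None:
        num_swaps = 10 * len(ones)  # attempted swaps, not successful ones
    for _ in range(num_swaps):
        p, q = rng.sample(range(len(ones)), 2)
        (i, j), (k, l) = ones[p], ones[q]
        # A swap is valid only if the two opposite corners are both 0.
        if i != k and j != l and a[i][l] == 0 and a[k][j] == 0:
            a[i][j] = a[k][l] = 0
            a[i][l] = a[k][j] = 1
            ones[p], ones[q] = (i, l), (k, j)
    return a


# Hypothetical 4x4 binary relation.
A = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1],
     [1, 0, 0, 1]]
print(swap_randomize(A, rng=random.Random(0)))
```

Each accepted swap moves one 1 within row i and one within row k while exchanging their columns, so every row sum and column sum is unchanged by construction.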

8 | Type inference for datalog and its application to query optimisation
- Moor, Sereni, et al.
- 2008
Citation Context ... the properties of the underlying distribution instead of testing the data against some null-model. Finally, database theory studies mainly query processing and optimization in different complex data [8, 13]. (Section 8, Conclusions and Future Work) We have addressed the problem of assessing the significance of queries made for the exploratory analysis of relational databases. Each query, together with the associat...

7 | Markov chain Monte Carlo methods for statistical inference. http://www.ims.nus.edu.sg/Programs/mcmc/files/besag_tl.pdf
- Besag
Citation Context ...relation A, a local swap represents a flip between two independent edges. [Figure: a local swap between two independent edges on nodes i, j, k, l.] A sequence of swaps is performed until the data mixes sufficiently in a Markov chain approach [2, 3], and therefore, a random sample of A is obtained. We use ten times the number of ones in the matrix as the number of swaps, which suffices for the convergence of the chain [10]. We denote the set of ...

5 | Randomization methods for assessing data analysis results on real-valued matrices
- Ojala, Vuokko, et al.
- 2009
Citation Context ...lowed; see e.g. [19] for a comprehensive survey. More recently, the approach of defining randomization tests to assess data mining results was introduced for binary data [10], and for real-valued data [15]. Abstracting a bit from the question of how significant patterns are in the data, we introduce here the statistical testing framework to databases and the exploratory task of querying the relations o...

3 | Query evaluation with soft-key constraints
- Jha, Rastogi, et al.
- 2008
Citation Context ... the properties of the underlying distribution instead of testing the data against some null-model. Finally, database theory studies mainly query processing and optimization in different complex data [8, 13]. (Section 8, Conclusions and Future Work) We have addressed the problem of assessing the significance of queries made for the exploratory analysis of relational databases. Each query, together with the associat...

2 | Sequential Monte Carlo methods for statistical analysis of tables
- Chen, Diaconis, et al.
Citation Context ...Let R̂_A = {⋊⋉T ∪ Â | Â ∈ Â and T = S\A} 4: Compute the p-value using the random samples R̂_A 5: end for. Alternatively, we could apply a traditional permutation test on the contingency table of paths [6], shown in Figure 4(b). This table gives the number of paths between the Genre and Age, as required by q1. The hypothesis related to our queries under those permutation tests would never be significan...
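
The loop in this algorithm fragment (randomize one relation A, keep the others fixed, re-evaluate the query, then compute a p-value) can be sketched roughly as follows. The excerpt is truncated, so this is only an interpretation: the function names, the dictionary representation of the database, the one-sided test direction, and the add-one form of the empirical p-value are all assumptions.

```python
def empirical_p_value(original_value, randomized_values):
    """One-sided empirical p-value with the common add-one correction."""
    at_least_as_extreme = sum(1 for v in randomized_values if v >= original_value)
    return (at_least_as_extreme + 1) / (len(randomized_values) + 1)


def significance_per_relation(relations, query, randomize, num_samples=100):
    """For each relation, randomize it alone and compare query results.

    relations: dict mapping a relation name to its data
    query:     function from such a dict to a number
    randomize: function returning a randomized copy of one relation
    """
    original = query(relations)
    p_values = {}
    for name in relations:
        results = []
        for _ in range(num_samples):
            randomized = dict(relations)  # all other relations stay fixed
            randomized[name] = randomize(relations[name])
            results.append(query(randomized))
        p_values[name] = empirical_p_value(original, results)
    return p_values


# Hypothetical stand-ins: two aligned numeric columns and a reversal "randomizer".
toy = {"A": [1, 2, 3], "B": [1, 2, 3]}
toy_query = lambda r: sum(x * y for x, y in zip(r["A"], r["B"]))
print(significance_per_relation(toy, toy_query, lambda rel: rel[::-1], num_samples=9))
```

Reporting one p-value per randomized relation is what allows a comparison of which table carries most of the structure behind a query result.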