Results 1-10 of 83
Beyond Market Baskets: Generalizing Association Rules To Dependence Rules
, 1998
Cited by 490 (7 self)
One of the more well-studied problems in data mining is the search for association rules in market basket data. Association rules are intended to identify patterns of the type: “A customer purchasing item A often also purchases item B.” Motivated partly by the goal of generalizing beyond market basket data and partly by the goal of ironing out some problems in the definition of association rules, we develop the notion of dependence rules that identify statistical dependence in both the presence and absence of items in itemsets. We propose measuring significance of dependence via the chi-squared test for independence from classical statistics. This leads to a measure that is upward-closed in the itemset lattice, enabling us to reduce the mining problem to the search for a border between dependent and independent itemsets in the lattice. We develop pruning strategies based on the closure property and thereby devise an efficient algorithm for discovering dependence rules. We demonstrate our algorithm’s effectiveness by testing it on census data, text data (wherein we seek term dependence), and synthetic data.
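As a rough illustration of the chi-squared test for independence that this abstract mentions, the sketch below computes the Pearson statistic for a 2x2 presence/absence table of two items. It is not the paper's mining algorithm: the function name, the basket counts, and the choice of a 5% significance level are all assumptions made for the example.

```python
def chi_squared_2x2(table):
    """Pearson chi-squared statistic for a 2x2 table [[a, b], [c, d]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the two items were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


# Standard chi-squared critical value for 1 degree of freedom at the 5% level.
CRITICAL_95_DF1 = 3.841

# Hypothetical counts: rows = item A present/absent, columns = item B present/absent.
dependent = [[40, 10], [10, 40]]
independent = [[25, 25], [25, 25]]

for name, table in (("dependent", dependent), ("independent", independent)):
    stat = chi_squared_2x2(table)
    print(name, stat, "reject independence" if stat > CRITICAL_95_DF1 else "no evidence")
```

Because the table includes the "absent" cells, the statistic reacts to dependence in both the presence and the absence of items, which is the property the abstract emphasizes.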
Algebraic Algorithms for Sampling from Conditional Distributions
 Annals of Statistics
, 1995
Cited by 182 (15 self)
We construct Markov chain algorithms for sampling from discrete exponential families conditional on a sufficient statistic. Examples include generating tables with fixed row and column sums and higher dimensional analogs. The algorithms involve finding bases for associated polynomial ideals and so an excursion into computational algebraic geometry.
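For the simplest case named in this abstract, two-way tables with fixed row and column sums, a Markov chain can be sketched with the familiar move that adds +1/-1 on the corners of a randomly chosen 2x2 subtable. The code below is only an illustration under stated assumptions: it targets the uniform distribution over tables rather than a general exponential family, and the function names, starting table, and seed are hypothetical.

```python
import random


def swap_step(table, rng):
    """Propose a +1/-1 move on a random 2x2 subtable; keep it only if no cell goes negative."""
    rows, cols = len(table), len(table[0])
    i1, i2 = rng.sample(range(rows), 2)
    j1, j2 = rng.sample(range(cols), 2)
    sign = rng.choice((1, -1))
    # The move leaves every row sum and every column sum unchanged.
    proposal = {
        (i1, j1): table[i1][j1] + sign,
        (i2, j2): table[i2][j2] + sign,
        (i1, j2): table[i1][j2] - sign,
        (i2, j1): table[i2][j1] - sign,
    }
    if min(proposal.values()) < 0:
        return  # rejected: the chain stays at the current table
    for (i, j), value in proposal.items():
        table[i][j] = value


def run_chain(table, steps, seed=0):
    """Run the chain from a copy of `table` and return the final state."""
    rng = random.Random(seed)
    current = [list(row) for row in table]
    for _ in range(steps):
        swap_step(current, rng)
    return current


start = [[3, 1, 2], [0, 4, 1], [2, 2, 5]]  # hypothetical starting table
print(run_chain(start, steps=2000))
```

The proposal is symmetric, so accepting every non-negative proposal leaves the uniform distribution stationary; a non-uniform target would need a Metropolis acceptance ratio, which is omitted here.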
Discovering significant patterns
, 2007
Cited by 42 (3 self)
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
Similarity of position frequency matrices for transcription factor binding sites
 Bioinformatics
, 2005
Cited by 29 (5 self)
Motivation: Transcription-factor binding sites in promoter sequences of higher eukaryotes are commonly modeled using position frequency matrices. The ability to compare position frequency matrices representing binding sites is especially important for de novo sequence motif discovery, where it is desirable to compare putative matrices to one another and to known matrices. Results: We describe a position frequency matrix similarity quantification method based on product-multinomial distributions, demonstrate its ability to identify position frequency matrix similarity and show that it has a better false positive to false negative ratio compared to existing methods. We group transcription factor binding site frequency matrices from two libraries into matrix families, and identify the matrices that are common and unique to these libraries. We identify similarities and differences between the skeletal-muscle-specific and non-muscle-specific frequency matrices for the binding sites of Mef2, Myf, Sp1, SRF and TEF of Wasserman and Fickett (1998). We further identify known frequency matrices and matrix families that are strongly similar to the matrices given by Wasserman and Fickett. We provide methodology and tools to compare and query libraries of frequency matrices for transcription factor binding sites. Availability: Software is available to use over the web at
Review Enrichment or depletion of a GO category within a class of genes: which test?
Cited by 28 (1 self)
Motivation: A number of available program packages determine the significant enrichments and/or depletions of GO categories among a class of genes of interest. Whereas a correct formulation of the problem leads to a single exact null distribution, these GO tools use a large variety of statistical tests whose denominations often do not clarify the underlying p-value computations. Summary: We review the different formulations of the problem and the tests they lead to: the binomial, chi-square, equality of two probabilities, Fisher’s exact, and hypergeometric tests. We clarify the relationships existing between these tests, in particular the equivalence between the hypergeometric test and Fisher’s exact test. We recall that the other tests are valid only for large samples, the test of equality of two probabilities and the chi-square test being equivalent. We discuss the appropriateness of one- and two-sided p-values, as well as some discreteness and conservatism issues.
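The equivalence between the hypergeometric test and Fisher’s exact test noted in this abstract can be checked numerically for the one-sided (enrichment) case. The sketch below is an illustration only, not code from the review: the function names and the gene counts are hypothetical, and exact fractions are used so that the two p-values can be compared for equality.

```python
from fractions import Fraction
from math import comb, factorial


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K belong to the category."""
    hits = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(K, n) + 1))
    return Fraction(hits, comb(N, n))


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher p-value for [[a, b], [c, d]]: tables with top-left cell >= a, margins fixed."""
    r1, r2, c1 = a + b, c + d, a + c
    c2 = b + d
    n = r1 + r2
    margins = factorial(r1) * factorial(r2) * factorial(c1) * factorial(c2)
    total = Fraction(0)
    for x in range(a, min(r1, c1) + 1):
        # Remaining cells are determined by x and the fixed margins.
        y, z = r1 - x, c1 - x
        w = r2 - z
        cells = factorial(x) * factorial(y) * factorial(z) * factorial(w)
        total += Fraction(margins, factorial(n) * cells)
    return total


# Hypothetical counts: 20 genes, 8 in the GO category, 6 in the class of interest, 4 in both.
N, K, n, k = 20, 8, 6, 4
p_hyper = hypergeom_upper_tail(k, N, K, n)
p_fisher = fisher_one_sided(k, n - k, K - k, N - K - n + k)
print(p_hyper, p_fisher, p_hyper == p_fisher)
```

The two functions sum the same table probabilities written in two different forms, which is why the comparison holds exactly rather than only up to rounding.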
Sequential importance sampling for multiway tables
 Annals of Statistics
, 2005
Cited by 21 (3 self)
We describe an algorithm for the sequential sampling of entries in multiway contingency tables with given constraints. The algorithm can be used for computations in exact conditional inference. To justify the algorithm, a theory relates sampling values at each step to properties of the associated toric ideal using computational commutative algebra. In particular, the property of interval cell counts at each step is related to exponents on lead indeterminates of a lexicographic Gröbner basis. Also, the approximation of integer programming by linear programming for sampling is related to initial terms of a toric ideal. We apply the algorithm to examples of contingency tables which appear in the social and medical sciences. The numerical results demonstrate that the theory is applicable and that the algorithm performs well.
Statistical Techniques for Language Recognition: An Introduction and Guide for Cryptanalysts
 Cryptologia
, 1993
Cited by 11 (2 self)
We explain how to apply statistical techniques to solve several language-recognition problems that arise in cryptanalysis and other domains. Language recognition is important in cryptanalysis because, among other applications, an exhaustive key search of any cryptosystem from ciphertext alone requires a test that recognizes valid plaintext. Written for cryptanalysts, this guide should also be helpful to others as an introduction to statistical inference on Markov chains. Modeling language as a finite stationary Markov process, we adapt a statistical model of pattern recognition to language recognition. Within this framework we consider four well-defined language-recognition problems: 1) recognizing a known language, 2) distinguishing a known language from uniform noise, 3) distinguishing unknown 0th-order noise from unknown 1st-order language, and 4) detecting non-uniform unknown language. For the second problem we give a most powerful test based on the Neyman-Pearson Lemma. For the oth...
DWE: discriminating word enumerator
 Bioinformatics
, 2005
Cited by 9 (3 self)
Motivation: Tissue-specific transcription-factor binding sites give insight into tissue-specific transcription regulation. Results: We describe a word-counting-based tool for de novo tissue-specific transcription-factor binding site discovery using expression information in addition to sequence information. We incorporate tissue-specific gene expression through gene classification to positive expression and repressed expression. We present a direct statistical approach to find over-represented transcription-factor binding sites in a foreground promoter sequence set against a background promoter sequence set. Our approach naturally extends to synergistic transcription factor binding site search. We find putative transcription factor binding sites that are over-represented in the proximal promoters of liver-specific genes relative to proximal promoters of liver-independent genes. Our results indicate that binding sites for hepatocyte nuclear factors (especially HNF1 and HNF4) and CCAAT/enhancer-binding protein (C/EBPβ) are the most over-represented in proximal promoters of liver-specific genes. Our results suggest that HNF4 has strong synergistic relationships with hepatocyte nuclear factors HNF1, HNF4 and HNF3β and with C/EBPβ. Availability: Programs are available for use over the web
Efficient exact p-value computation for small sample, sparse, and surprising categorical data
 J. of Comp. Bio
, 2004
Cited by 9 (1 self)
A major obstacle in applying various hypothesis testing procedures to datasets in bioinformatics is the computation of ensuing p-values. In this paper, we define a generic branch-and-bound approach to efficient exact p-value computation and enumerate the required conditions for successful application. Explicit procedures are developed for the entire Cressie–Read family of statistics, which includes the widely used Pearson and likelihood ratio statistics in a one-way frequency table goodness-of-fit test. This new formulation constitutes a first practical exact improvement over the exhaustive enumeration performed by existing statistical software. The general techniques we develop to exploit the convexity of many statistics are also shown to carry over to contingency table tests, suggesting that they are readily extendible to other tests and test statistics of interest. Our empirical results demonstrate a speedup of orders of magnitude over the exhaustive computation, significantly extending the practical range for performing exact tests. We also show that the relative speedup gain increases as the null hypothesis becomes sparser, that computation precision increases with increase in speedup, and that computation time is very moderately affected by the magnitude of the computed p-value. These qualities make our algorithm especially appealing in the regimes of small samples, sparse null distributions, and rare events, compared to the alternative asymptotic approximations and Monte Carlo samplers. We discuss several established bioinformatics applications, where small sample size, small expected counts in one or more categories (sparseness), and very small p-values do occur. Our computational framework could be applied in these, and similar cases, to improve performance. Key words: p-value, exact tests, branch and bound, real extension, categorical data.
Data augmentation in multiway contingency tables with fixed marginal totals
 JOURNAL OF STATISTICAL PLANNING AND INFERENCE
, 2006