Results 1–10 of 13
Discovering significant patterns, 2007
Cited by 41 (3 self)
Abstract
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of Type I error, that is, of finding patterns that appear, due to chance alone, to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data, and, when applied to real-world data, result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
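As a rough illustration of the adjustment this abstract describes: a minimal Bonferroni-style sketch that bounds the experimentwise error by dividing the significance level by the size of the search space. The function name and numbers are invented for illustration; the paper's actual procedures are more refined.

```python
def bonferroni_significant(p_values, search_space_size, alpha=0.05):
    """Return indices of patterns whose p-value survives a Bonferroni-style
    adjustment by the size of the pattern search space. This bounds the
    experimentwise (family-wise) error rate at alpha."""
    critical = alpha / search_space_size
    return [i for i, p in enumerate(p_values) if p <= critical]

# Toy example: 4 tested patterns drawn from a search space of 1000 candidates.
hits = bonferroni_significant([1e-6, 0.01, 3e-5, 0.2], search_space_size=1000)
```

Note that the adjustment uses the number of patterns *considered* (the whole search space), not just the number reported, which is why the risk is described as extreme.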
Mining quantitative correlated patterns using an information-theoretic approach. In KDD, 2006
Cited by 11 (1 self)
Abstract
Existing research on mining quantitative databases mainly focuses on mining associations. However, mining associations is too expensive to be practical in many cases. In this paper, we study mining correlations from quantitative databases and show that it is a more effective approach than mining associations. We propose a new notion of Quantitative Correlated Patterns (QCPs), which is founded on two formal concepts, mutual information and all-confidence. We first devise a normalization on mutual information and apply it to QCP mining to capture the dependency between the attributes. We further adopt all-confidence as a quality measure to control, at a finer granularity, the dependency between the attributes with specific quantitative intervals. We also propose a supervised method to combine the consecutive intervals of the quantitative attributes based on mutual information, such that the interval combining is guided by the dependency between the attributes. We develop an algorithm, QCoMine, to efficiently mine QCPs by utilizing normalized mutual information and all-confidence to perform a two-level pruning. Our experiments verify the efficiency of QCoMine and the quality of the QCPs.
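Of the two measures named above, all-confidence has a standard definition: the support of an itemset divided by the maximum support of any single item in it. A minimal sketch under that standard definition (transactions as sets of items; the paper's mutual-information normalization is omitted):

```python
def all_confidence(transactions, itemset):
    """all-confidence(X) = supp(X) / max_{i in X} supp({i}).
    High values mean every item in X strongly implies the rest,
    which is why it works as a fine-grained dependency filter."""
    supp = sum(1 for t in transactions if itemset <= t)
    max_item_supp = max(sum(1 for t in transactions if i in t) for i in itemset)
    return supp / max_item_supp

# Toy transaction database.
db = [{"a", "b"}, {"a", "b"}, {"a"}, {"b", "c"}]
score = all_confidence(db, {"a", "b"})
```

Because both counts share the same denominator (the number of transactions), raw counts can be divided directly.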
An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
Cited by 6 (0 self)
Abstract
As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s∗ for a dataset, such that the number of itemsets with support at least s∗ represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate. Our methodology hinges on a Poisson approximation to the distribution of the number of frequent itemsets.
Self-Sufficient Itemsets: An Approach to Screening Potentially Interesting Associations Between Items
Cited by 5 (0 self)
Abstract
Self-sufficient itemsets are those whose frequency cannot be explained solely by the frequency of either their subsets or of their supersets. We argue that itemsets that are not self-sufficient will often be of little interest to the data analyst, as their frequency should be expected once that of the itemsets on which their frequency depends is known. We present statistical tests for statistically sound discovery of self-sufficient itemsets, and computational techniques that allow those tests to be applied as a post-processing step for any itemset discovery algorithm. We also present a measure for assessing the degree of potential interest in an itemset that complements these statistical measures.
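Tests of this kind typically reduce to an exact test of independence on a 2×2 contingency table: does the itemset occur more often than independence of a subset pair would predict? Below is a stdlib-only sketch of a one-sided Fisher exact test; the paper's actual partition-based testing procedure is more involved, and the function name is illustrative.

```python
import math

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value P(count >= a) for the 2x2 table
    [[a, b], [c, d]] with fixed margins -- e.g. a = transactions containing
    both halves of an itemset partition, under the null of independence."""
    def table_prob(a, b, c, d):
        # Hypergeometric probability of this exact table given the margins.
        return (math.comb(a + b, a) * math.comb(c + d, c)
                / math.comb(a + b + c + d, a + c))
    p = 0.0
    # Sum over the observed table and all more-extreme ones (larger a).
    while b >= 0 and c >= 0:
        p += table_prob(a, b, c, d)
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return p
```

A small p-value indicates the joint frequency is not explained by the parts alone, which is the intuition behind self-sufficiency.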
Geng L.: A Unified Framework for Utility-Based Measures for Mining Itemsets. Second International Workshop on Utility-Based Data Mining, 2006
Cited by 5 (0 self)
Abstract
A pattern is of utility to a person if its use by that person contributes to reaching a goal. Utility-based measures use the utilities of the patterns to reflect the user's goals. In this paper, we first review utility-based measures for itemset mining. Then, we present a unified framework for incorporating several utility-based measures into the data mining process by defining a unified utility function. Next, within this framework, we summarize the mathematical properties of utility-based measures that allow the time and space costs of the itemset mining algorithm to be reduced.
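One common instantiation of such a utility function, borrowed from high-utility itemset mining rather than taken from this paper, multiplies purchased quantity (internal utility) by per-item profit (external utility). All names and data here are illustrative:

```python
def itemset_utility(transactions, profits, itemset):
    """Total utility of an itemset: for every transaction containing the
    whole itemset, sum quantity * per-item profit over its items.
    transactions: list of dicts mapping item -> purchased quantity."""
    total = 0
    for t in transactions:
        if itemset <= t.keys():  # the transaction contains every item
            total += sum(t[i] * profits[i] for i in itemset)
    return total

# Toy data: quantities per transaction and profit per unit of each item.
db = [{"a": 2, "b": 1}, {"a": 1}, {"b": 3}]
u = itemset_utility(db, {"a": 5, "b": 2}, {"a", "b"})
```

A unified framework as described above abstracts over the choice of such functions and studies which mathematical properties (e.g. anti-monotonicity) permit pruning.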
Garriga: Multiple Hypothesis Testing in Pattern Discovery, 2009
Cited by 4 (1 self)
Abstract
The problem of multiple hypothesis testing arises when more than one hypothesis must be tested simultaneously for statistical significance. This is a very common situation in many data mining applications. For instance, simultaneously assessing the significance of all frequent itemsets of a single dataset entails a host of hypotheses, one for each itemset. A multiple hypothesis testing method is needed to control the number of false positives (Type I error). Our contribution in this paper is to extend the multiple hypothesis testing framework to be used with a generic data mining algorithm. We provide a method that provably controls the family-wise error rate (FWER, the probability of at least one false positive) in the strong sense. We evaluate the performance of our solution on both real and generated data. The results show that our method controls the FWER while maintaining the power of the test.
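A standard procedure that controls the FWER in the strong sense is Holm's step-down method, shown here as a minimal sketch (this is a textbook baseline, not the paper's generic-algorithm extension; the function name is invented):

```python
def holm_rejections(p_values, alpha=0.05):
    """Holm step-down: sort p-values ascending and compare the k-th smallest
    against alpha / (m - k). Controls FWER in the strong sense and is
    uniformly more powerful than plain Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = set()
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            rejected.add(i)
        else:
            break  # step-down: stop at the first failure
    return rejected
```

In the pattern-discovery setting, each p-value would correspond to one candidate itemset, so m is typically very large.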
An Information-Theoretic Approach to Quantitative Association Rule Mining, 2007
Cited by 3 (0 self)
Abstract
Quantitative Association Rule (QAR) mining has been recognized as an influential research problem over the last decade due to the popularity of quantitative databases and the usefulness of association rules in real life. Unlike Boolean Association Rules (BARs), which only consider Boolean attributes, QARs consist of quantitative attributes, which contain much richer information than Boolean attributes. However, the combination of these quantitative attributes and their value intervals always gives rise to an explosively large number of itemsets, thereby severely degrading the mining efficiency. In this paper, we propose an information-theoretic approach to avoid generating unrewarding combinations of both the attributes and their value intervals in the mining process. We study the mutual information between the attributes in a quantitative database and devise a normalization on the mutual information to make it applicable in the context of QAR mining. To indicate the strong informative relationships among the ...
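The mutual information referred to above has a standard definition for two discretized attributes; a minimal sketch from a joint probability table (the paper's normalization step is omitted, and the representation as a nested list is an assumption for illustration):

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y].
    Zero iff the two discretized attributes are independent, so it can
    screen out unrewarding attribute combinations before interval mining."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi
```

For example, a perfectly dependent pair of binary attributes yields 1 bit, while an independent pair yields 0.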
Distribution rules with numeric attributes of interest
Cited by 1 (1 self)
Abstract
In this paper we introduce distribution rules, a kind of association rule with a distribution on the consequent. Distribution rules are related to quantitative association rules but can be seen as a more fundamental concept, useful for learning distributions. We formalize the main concepts and indicate applications to tasks such as frequent pattern discovery, subgroup discovery and forecasting. An efficient algorithm for the generation of distribution rules is described. We also provide interest measures, visualization techniques and an evaluation.
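A distribution rule's consequent can be read as the empirical distribution of a numeric attribute of interest restricted to records matching the antecedent. A toy sketch of that idea (record layout and names are invented for illustration, not the paper's formalism):

```python
def rule_distribution(records, antecedent, attribute):
    """Consequent of a distribution rule antecedent -> Dist(attribute):
    the values of the numeric attribute over records whose item set
    contains the antecedent, returned sorted as an empirical sample."""
    return sorted(r[attribute] for r in records if antecedent <= r["items"])

# Toy records: each has an item set and a numeric attribute "x".
data = [
    {"items": {"a"}, "x": 1.0},
    {"items": {"a", "b"}, "x": 2.0},
    {"items": {"b"}, "x": 3.0},
]
sample = rule_distribution(data, {"a"}, "x")
```

Interest measures then compare this conditional sample against the attribute's overall distribution.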
Technical Note: Layered Critical Values: A Powerful Direct-Adjustment Approach to Discovering Significant Patterns, 2008
Abstract
Standard pattern discovery techniques, such as association rules, suffer an extreme risk of finding very large numbers of spurious patterns for many knowledge discovery tasks. The direct-adjustment approach to controlling this risk applies a statistical test during the discovery process, using a critical value adjusted to take account of the size of the search space. However, a problem with the direct-adjustment strategy is that it may discard numerous true patterns. This paper investigates the assignment of different critical values to different areas of the search space as an approach to alleviating this problem, using a variant of a technique originally developed for other purposes. This approach is shown to be effective at increasing the number of discoveries while still maintaining strict control over the risk of false discoveries.
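One simple way to assign different critical values to different areas of the search space is to layer by pattern size: split the overall significance level across size layers, then Bonferroni-adjust within each layer by that layer's candidate count. This is a hedged sketch of the layering idea only; the paper's actual allocation scheme differs.

```python
from math import comb

def layered_critical_values(n_items, max_size, alpha=0.05):
    """Assign a per-layer critical value: alpha split equally over the
    max_size pattern-size layers, then divided within layer k by the number
    of size-k itemsets over n_items. Guarantees the total error budget
    across all layers still sums to alpha."""
    per_layer = alpha / max_size
    return {k: per_layer / comb(n_items, k) for k in range(1, max_size + 1)}

# Toy search space: 4 items, patterns of size 1 and 2.
cv = layered_critical_values(4, 2)
```

Smaller layers get larger critical values than a global Bonferroni correction would allow, which is the source of the extra power.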
A Rigorous Statistical Approach for Identifying Significant Itemsets
Abstract
As advances in technology allow for the collection, storage, and mining of vast amounts of data, the task of screening and assessing the significance of the discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s for a dataset, such that the family of frequent itemsets with respect to s embodies a substantial deviation from what would be expected in a random dataset; hence these itemsets can be flagged as significant. Our methodology hinges on a Poisson approximation of the distribution of the number of frequent itemsets of a given size, which is the main theoretical result of the paper. A crucial feature of our approach is that, unlike previous work, it takes into account the entire dataset rather than individual discoveries, and is hence able to distinguish between significant observations and random fluctuations in the data, resulting in fewer false discoveries. Extensive experiments are reported that substantiate the effectiveness of our methodology.
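Under a Poisson approximation of this kind, the significance of observing at least k frequent itemsets when lam are expected in a random dataset is a simple tail probability. A toy illustration of that final step only (not the paper's estimator for lam, which is its main technical content):

```python
import math

def poisson_tail(k, lam):
    """P(N >= k) for N ~ Poisson(lam): the chance a comparable random
    dataset would produce at least k frequent itemsets of a given size
    when lam are expected. Small values flag the observed count as a
    substantial deviation from randomness."""
    return 1.0 - sum(math.exp(-lam) * lam**i / math.factorial(i)
                     for i in range(k))
```

A threshold s would then be chosen so that the observed itemset count at support s makes this tail probability negligibly small.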