Discovering significant patterns
2007
Cited by 25 (3 self)
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
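To see why the number of patterns considered matters, the following is a minimal sketch of the arithmetic, assuming m independent tests each run at significance level alpha. Independence is a simplifying assumption, and the function name is illustrative rather than taken from the paper. Under that assumption the chance of at least one type-1 error is 1 - (1 - alpha)^m.

```python
def experimentwise_error(alpha: float, m: int) -> float:
    """Probability of at least one false positive among m independent
    tests, each performed at significance level alpha (assumed setup)."""
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    # The risk grows quickly with the number of patterns tested.
    for m in (1, 10, 100, 1000):
        print(m, round(experimentwise_error(0.05, m), 4))
```

Even a modest search space pushes this probability close to 1, which is the motivation for bounding the experimentwise error rather than the per-test error.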
Mining quantitative correlated patterns using an information-theoretic approach
In KDD, 2006
Cited by 10 (1 self)
Existing research on mining quantitative databases mainly focuses on mining associations. However, mining associations is too expensive to be practical in many cases. In this paper, we study mining correlations from quantitative databases and show that it is a more effective approach than mining associations. We propose a new notion of Quantitative Correlated Patterns (QCPs), which is founded on two formal concepts, mutual information and all-confidence. We first devise a normalization on mutual information and apply it to QCP mining to capture the dependency between the attributes. We further adopt all-confidence as a quality measure to control, at a finer granularity, the dependency between the attributes with specific quantitative intervals. We also propose a supervised method to combine the consecutive intervals of the quantitative attributes based on mutual information, such that the interval combining is guided by the dependency between the attributes. We develop an algorithm, QCoMine, to efficiently mine QCPs by utilizing normalized mutual information and all-confidence to perform a two-level pruning. Our experiments verify the efficiency of QCoMine and the quality of the QCPs.
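The two concepts named in this abstract can be illustrated with their textbook definitions. The sketch below assumes plain boolean transactions and discrete attribute values; it does not reproduce the paper's normalization of mutual information, its interval combining, or QCoMine itself, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import FrozenSet, Hashable, Sequence


def support(itemset: FrozenSet[str], transactions: Sequence[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def all_confidence(itemset: FrozenSet[str], transactions: Sequence[FrozenSet[str]]) -> float:
    """supp(X) divided by the largest support of any single item in X."""
    top = max(support(frozenset([item]), transactions) for item in itemset)
    return support(itemset, transactions) / top if top > 0 else 0.0


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Empirical mutual information (in bits) of two discrete attributes."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


if __name__ == "__main__":
    tx = [frozenset("ab"), frozenset("ab"), frozenset("a"), frozenset("bc")]
    print(all_confidence(frozenset("ab"), tx))
    print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
```

Mutual information works at the level of whole attributes, while all-confidence scores one specific itemset, which matches the coarse and fine granularity described in the abstract.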
An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
Cited by 3 (0 self)
As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s* for a dataset, such that the number of itemsets with support at least s* represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate. Our methodology hinges on a Poisson approximation to the ...
Geng L.: A Unified framework for Utility based Measures for Mining Itemsets
Second International Workshop on Utility-Based Data Mining, 2006
Cited by 2 (0 self)
A pattern is of utility to a person if its use by that person contributes to reaching a goal. Utility based measures use the utilities of the patterns to reflect the user’s goals. In this paper, we first review utility based measures for itemset mining. Then, we present a unified framework for incorporating several utility based measures into the data mining process by defining a unified utility function. Next, within this framework, we summarize the mathematical properties of utility based measures that will allow the time and space costs of the itemset mining algorithm to be reduced.
Garriga Multiple Hypothesis Testing in Pattern Discovery
2009
Cited by 2 (1 self)
The problem of multiple hypothesis testing arises when there is more than one hypothesis to be tested simultaneously for statistical significance. This is a very common situation in many data mining applications. For instance, assessing simultaneously the significance of all frequent itemsets of a single dataset entails a host of hypotheses, one for each itemset. A multiple hypothesis testing method is needed to control the number of false positives (Type I error). Our contribution in this paper is to extend the multiple hypothesis framework to be used with a generic data mining algorithm. We provide a method that provably controls the family-wise error rate (FWER, the probability of at least one false positive) in the strong sense. We evaluate the performance of our solution on both real and generated data. The results show that our method controls the FWER while maintaining the power of the test.
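As a point of reference for what FWER control looks like in practice, the sketch below implements two standard procedures, Bonferroni and Holm, both of which control the FWER in the strong sense. They are shown only as familiar baselines; they are not the method proposed in this paper, and the function names and p-values are illustrative assumptions.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Reject hypothesis i when p_i <= alpha / m, with m the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down procedure: compare the k-th smallest p-value
    (k = 0, 1, ...) with alpha / (m - k) and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value is retained as well
    return reject


if __name__ == "__main__":
    # Hypothetical p-values, e.g. one per candidate itemset.
    p = [0.001, 0.013, 0.02, 0.2]
    print(bonferroni(p))
    print(holm(p))
```

Holm never rejects fewer hypotheses than Bonferroni at the same alpha, which is one simple way to retain power while keeping the same FWER guarantee.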
Self-Sufficient Itemsets: An Approach to Screening Potentially Interesting Associations Between Items
Cited by 2 (0 self)
Self-sufficient itemsets are those whose frequency cannot be explained solely by the frequency of either their subsets or of their supersets. We argue that itemsets that are not self-sufficient will often be of little interest to the data analyst, as their frequency should be expected once that of the itemsets on which their frequency depends is known. We present statistical tests for statistically sound discovery of self-sufficient itemsets, and computational techniques that allow those tests to be applied as a post-processing step for any itemset discovery algorithm. We also present a measure for assessing the degree of potential interest in an itemset that complements these statistical measures.
Distribution rules with numeric attributes of interest
Cited by 1 (1 self)
In this paper we introduce distribution rules, a kind of association rule with a distribution on the consequent. Distribution rules are related to quantitative association rules but can be seen as a more fundamental concept, useful for learning distributions. We formalize the main concepts and indicate applications to tasks such as frequent pattern discovery, subgroup discovery and forecasting. An efficient algorithm for the generation of distribution rules is described. We also provide interest measures, visualization techniques and evaluation.
An Information-Theoretic Approach to Quantitative Association Rule Mining
Quantitative Association Rule (QAR) mining has been recognized as an influential research problem over the last decade due to the popularity of quantitative databases and the usefulness of association rules in real life. Unlike Boolean Association Rules (BARs), which only consider boolean attributes, QARs consist of quantitative attributes which contain much richer information than the boolean attributes. However, the combination of these quantitative attributes and their value intervals always gives rise to the generation of an explosively large number of itemsets, thereby severely degrading the mining efficiency. In this paper, we propose an information-theoretic approach to avoid unrewarding combinations of both the attributes and their value intervals being generated in the mining process. We study the mutual information between the attributes in a quantitative database and devise a normalization on the mutual information to make it applicable in the context of QAR mining. To indicate the strong informative relationships among the ...
Technical Note: Layered Critical Values: A Powerful Direct-Adjustment Approach to Discovering Significant Patterns
2008
Standard pattern discovery techniques, such as association rules, suffer an extreme risk of finding very large numbers of spurious patterns for many knowledge discovery tasks. The direct-adjustment approach to controlling this risk applies a statistical test during the discovery process, using a critical value adjusted to take account of the size of the search space. However, a problem with the direct-adjustment strategy is that it may discard numerous true patterns. This paper investigates the assignment of different critical values to different areas of the search space as an approach to alleviating this problem, using a variant of a technique originally developed for other purposes. This approach is shown to be effective at increasing the number of discoveries while still maintaining strict control over the risk of false discoveries.
A Rigorous Statistical Approach for Identifying Significant Itemsets
As advances in technology allow for the collection, storage, and mining of vast amounts of data, the task of screening and assessing the significance of the discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s for a dataset, such that the family of frequent itemsets with respect to s embodies a substantial deviation from what would be expected in a random dataset, hence these itemsets can be flagged as significant. Our methodology hinges on a Poisson approximation of the distribution of the number of frequent itemsets of a given size, which is the main theoretical result of the paper. A crucial feature of our approach is that, unlike previous work, it takes into account the entire dataset rather than individual discoveries, hence it is able to distinguish between significant observations and random fluctuations in data, thus resulting in fewer false discoveries. Extensive experiments are reported that substantiate the effectiveness of our methodology.

