Results 1 - 10
of
35
Discovering significant patterns
, 2007
"... Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patter ..."
Abstract
-
Cited by 25 (3 self)
- Add to MetaCart
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
Interestingness of Frequent Itemsets Using Bayesian Networks as Background Knowledge
- In Proceedings of the SIGKDD Conference on Knowledge Discovery and Data Mining
, 2004
"... ..."
Local and Global Methods in Data Mining: Basic Techniques and Open Problems
- In ICALP 2002, 29th International Colloquium on Automata, Languages, and Programming, Malaga
, 2002
"... Data mining has in recent years emerged as an interesting area in the boundary between algorithms, probabilistic modeling, statistics, and databases. Data mining research can be divided into global approaches, which try to model the whole data, and local methods, which try to find useful patterns oc ..."
Abstract
-
Cited by 21 (2 self)
- Add to MetaCart
Data mining has in recent years emerged as an interesting area in the boundary between algorithms, probabilistic modeling, statistics, and databases. Data mining research can be divided into global approaches, which try to model the whole data, and local methods, which try to find useful patterns occurring in the data. We discuss briefly some simple local and global techniques, review two attempts at combining the approaches, and list open problems with an algorithmic flavor.
Bandits for taxonomies: A modelbased approach
- In In Proc. of the SIAM International Conference on Data Mining
, 2007
"... We consider a novel problem of learning an optimal matching, in an online fashion, between two feature spaces that are organized as taxonomies. We formulate this as a multi-armed bandit problem where the arms of the bandit are dependent due to the structure induced by the taxonomies. We then propose ..."
Abstract
-
Cited by 14 (4 self)
- Add to MetaCart
We consider a novel problem of learning an optimal matching, in an online fashion, between two feature spaces that are organized as taxonomies. We formulate this as a multi-armed bandit problem where the arms of the bandit are dependent due to the structure induced by the taxonomies. We then propose a multi-stage hierarchical allocation scheme that improves the explore/exploit properties of the classical multiarmed bandit policies in this scenario. In particular, our scheme uses the taxonomy structure and performs shrinkage estimation in a Bayesian framework to exploit dependencies among the arms, thereby enhancing exploration without losing efficiency on short term exploitation. We prove that our scheme asymptotically converges to the optimal matching. We conduct extensive experiments on real data to illustrate the efficacy of our scheme in practice. 1
Exploiting A Support-based Upper Bound of Pearson's Correlation Coefficient for Efficiently Identifying Strongly Correlated Pairs
, 2004
"... Given a user-specified minimum correlation threshold # and a market basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold #. However, when the number of items and transactions are large, the computation cost ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
Given a user-specified minimum correlation threshold # and a market basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold #. However, when the number of items and transactions are large, the computation cost of this query can be very high. In this paper, we identify an upper bound of Pearson's correlation coe#cient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coe#cient but also exhibits a special monotone property which allows pruning of many item pairs even without computing their upper bounds. A Two-step All-strong-Pairs corrElation queRy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning is independent or improves when the number of items is increased in data sets with common Zipf or linear rank-support distributions. Experimental results from synthetic and real data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives.
Mining Statistically Important Equivalence Classes and Delta-Discriminative Emerging Patterns
, 2007
"... The support-confidence framework is the most common measure used in itemset mining algorithms, for its antimonotonicity that effectively simplifies the search lattice. This computational convenience brings both quality and statistical flaws to the results as observed by many previous studies. In thi ..."
Abstract
-
Cited by 13 (1 self)
- Add to MetaCart
The support-confidence framework is the most common measure used in itemset mining algorithms, for its antimonotonicity that effectively simplifies the search lattice. This computational convenience brings both quality and statistical flaws to the results as observed by many previous studies. In this paper, we introduce a novel algorithm that produces itemsets with ranked statistical merits under sophisticated test statistics such as chi-square, risk ratio, odds ratio, etc. Our algorithm is based on the concept of equivalence classes. An equivalence class is a set of frequent itemsets that always occur together in the same set of transactions. Therefore, itemsets within an equivalence class all share the same level of statistical significance regardless of the variety of test statistics. As an equivalence class can be uniquely determined and concisely represented by a closed pattern and a set of generators, we just mine closed patterns and generators, taking a simultaneous depth-first search scheme. This parallel approach has not been exploited by any prior work. We evaluate our algorithm on two aspects. In general, we compare to LCM and FPclose which are the best algorithms tailored for mining only closed patterns. In particular, we compare to epMiner which is the most recent algorithm for mining a type of relative risk patterns, known as minimal emerging patterns. Experimental results show that our algorithm is faster than all of them, sometimes even multiple orders of magnitude faster. These statistically ranked patterns and the efficiency have a high potential for reallife applications, especially in biomedical and financial fields where classical test statistics are of dominant interest.
Is Pushing Constraints Deeply into the Mining Algorithms really what we want? - An Alternative Approach for Association Rule Mining
, 2002
"... The common approach to exploit mining constraints is to push them deeply into the mining algorithms. In our paper we argue that this approach is based on an understanding of KDD that is no longer up-to-date. In fact, today KDD is seen as a human centered, highly interactive and iterative process. Bl ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
The common approach to exploit mining constraints is to push them deeply into the mining algorithms. In our paper we argue that this approach is based on an understanding of KDD that is no longer up-to-date. In fact, today KDD is seen as a human centered, highly interactive and iterative process. Blindly enforcing constraints already during the mining runs neglects the process character of KDD and therefore is no longer state of the art. Constraints can make a single algorithm run faster but in fact we are still far from response times that would allow true interactivity in KDD. In addition we pay the price of repeated mining runs and moreover risk reducing data mining to some kind of hypothesis testing. Taking all the above into consideration we propose to do exactly the contrary of constrained mining: We accept an initial (nearly) unconstrained and costly mining run. But instead of a sequence of subsequent and still expensive constrained mining runs we answer all further mining queries from this initial result set. Whereas this is straight forward for constraints that can be implemented as filters on the result set, things get more complicated when we restrict the underlying mining data. Actually in practice such constraints are very important, e.g. the generation of rules for certain days of the week, for families, singles, male or female customers etc. We show how to postpone such rowrestriction constraints on the transactions from rule generation to rule retrieval from the initial result set.
A Comparison of Statistical and Machine Learning Algorithms on the Task of Link Completion
- In KDD Workshop on Link Analysis for Detecting Complex Behavior
, 2003
"... Link data, consisting of a collection of subsets of entities, can be an important source of information for a variety of fields including the social sciences, biology, criminology, and business intelligence. However, these links may be incomplete, containing one or more unknown members. We consider ..."
Abstract
-
Cited by 12 (3 self)
- Add to MetaCart
Link data, consisting of a collection of subsets of entities, can be an important source of information for a variety of fields including the social sciences, biology, criminology, and business intelligence. However, these links may be incomplete, containing one or more unknown members. We consider the problem of link completion, identifying which entities are the most likely missing members of a link given the previously observed links. We concentrate on the case of one missing entity. We compare a variety of recently developed along with standard machine learning and strawman algorithms adjusted to suit the task. The algorithms were tested extensively on a simulated and a range of real-world data sets.
Screening and Interpreting Multi-item Associations Based On Log-Linear Modeling
- In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, 2003
"... Association rules have received a lot of attention in the data mining community since their introduction. The classical approach to find rules whose items enjoy high support (appear in a lot of the transactions in the data set) is, however, filled with shortcomings. It has been shown that support ca ..."
Abstract
-
Cited by 11 (8 self)
- Add to MetaCart
Association rules have received a lot of attention in the data mining community since their introduction. The classical approach to find rules whose items enjoy high support (appear in a lot of the transactions in the data set) is, however, filled with shortcomings. It has been shown that support can be misleading as an indicator of how interesting the rule is. Alternative measures, such as lift, have been proposed. More recently, a paper by DuMouchel et al. proposed the use of all-two-factor loglinear models to discover sets of items that cannot be explained by pairwise associations between the items involved. This approach, however, has its limitations, since it stops short of considering higher order interactions (other than pairwise) among the items. In this paper, we propose a method that examines the parameters of the fitted loglinear models to find all the significant association patterns among the items. Since fitting loglinear models for large data sets can be computationally prohibitive, we apply graph-theoretical results to divide the original set of items into components (sets of items) that are statistically independent from each other. We then apply loglinear modeling to each of the components and find the interesting associations among items in them. The technique is experimentally evaluated with a real data set (insurance data) and a series of synthetic data sets. The results show that the technique is e#ective in finding interesting associations among the items involved.
A computational environment for mining association rules and frequent item sets
- JOURNAL OF STATISTICAL SOFTWARE
, 2005
"... Mining frequent itemsets and association rules is a popular and well researched approach to discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for a ..."
Abstract
-
Cited by 7 (5 self)
- Add to MetaCart
Mining frequent itemsets and association rules is a popular and well researched approach to discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular C implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.

