Results 1–10 of 45
Selecting the right objective measure for association analysis
Information Systems
Cited by 61 (6 self)
Abstract. Objective measures such as support, confidence, interest factor, correlation, and entropy are often used to evaluate the interestingness of association patterns. However, in many situations, these measures may provide conflicting information about the interestingness of a pattern. Data mining practitioners also tend to apply an objective measure without realizing that there may be better alternatives available for their application. In this paper, we describe several key properties one should examine in order to select the right measure for a given application. A comparative study of these properties is made using twenty-one measures that were originally developed in diverse fields such as statistics, social science, machine learning, and data mining. We show that, depending on its properties, each measure is useful for some applications but not for others. We also demonstrate two scenarios in which many existing measures become consistent with each other, namely, when support-based pruning and a technique known as table standardization are applied. Finally, we present an algorithm for selecting a small set of patterns such that domain experts can find a measure that best fits their requirements by ranking this small set of patterns.
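The conflict the abstract describes is easy to reproduce. The sketch below (Python; the contingency counts are invented for illustration) computes four of the named measures for a rule A → B from its 2x2 contingency counts, and constructs two patterns that confidence and interest factor (lift) rank in opposite order:

```python
# Four common objective measures computed from the 2x2 contingency
# counts of a rule A -> B. The counts are made up to show two patterns
# that different measures rank in opposite order.
from math import sqrt

def measures(n_ab, n_a, n_b, n):
    """n_ab: transactions with A and B; n_a, n_b: marginals; n: total."""
    support = n_ab / n
    confidence = n_ab / n_a
    lift = confidence / (n_b / n)              # a.k.a. interest factor
    pa, pb, pab = n_a / n, n_b / n, n_ab / n
    # phi: Pearson's correlation coefficient for binary variables
    phi = (pab - pa * pb) / sqrt(pa * (1 - pa) * pb * (1 - pb))
    return support, confidence, lift, phi

# Pattern 1: high confidence, but B is so frequent that lift stays below 1.
s1 = measures(n_ab=80, n_a=100, n_b=900, n=1000)
# Pattern 2: lower confidence, yet strong positive dependence (lift >> 1).
s2 = measures(n_ab=30, n_a=100, n_b=50, n=1000)

print(f"pattern 1: conf={s1[1]:.2f} lift={s1[2]:.2f} phi={s1[3]:.2f}")
print(f"pattern 2: conf={s2[1]:.2f} lift={s2[2]:.2f} phi={s2[3]:.2f}")
```

Confidence ranks pattern 1 first; lift and correlation rank pattern 2 first, which is the kind of disagreement the paper's property analysis is meant to resolve.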
Discovering significant patterns
2007
Cited by 41 (3 self)
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that, by chance alone, appear to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices, which allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data, and, when applied to real-world data, produce large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
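As a rough illustration of the statistical practice the abstract appeals to, the sketch below applies a Bonferroni-style adjustment that bounds the experimentwise error: each pattern's p-value is compared against alpha divided by the number of patterns in the search space, not just the number reported. The p-values and search-space size are invented, and the paper itself examines several more refined choices:

```python
# Hedged sketch of an experimentwise (family-wise) error control:
# to bound the risk at alpha over a search space of m potential patterns,
# accept a pattern only if its p-value is at most alpha / m.
def bonferroni_accept(p_values, search_space_size, alpha=0.05):
    """Return indices of patterns that remain significant after adjustment."""
    threshold = alpha / search_space_size
    return [i for i, p in enumerate(p_values) if p <= threshold]

# Five candidate patterns drawn from a search space of 10,000 potential ones
# (all numbers invented for illustration).
ps = [1e-7, 3e-6, 2e-4, 0.01, 0.04]
print(bonferroni_accept(ps, search_space_size=10_000))  # -> [0, 1]
```

Dividing by the full search-space size is what makes the control valid even though only a few patterns are ever reported.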
Interestingness of Frequent Itemsets Using Bayesian Networks as Background Knowledge
In Proceedings of the SIGKDD Conference on Knowledge Discovery and Data Mining
2004
Local and Global Methods in Data Mining: Basic Techniques and Open Problems
In ICALP 2002, 29th International Colloquium on Automata, Languages, and Programming, Malaga
Cited by 23 (2 self)
Data mining has in recent years emerged as an interesting area on the boundary between algorithms, probabilistic modeling, statistics, and databases. Data mining research can be divided into global approaches, which try to model the whole data, and local methods, which try to find useful patterns occurring in the data. We briefly discuss some simple local and global techniques, review two attempts at combining the approaches, and list open problems with an algorithmic flavor.
Mining Statistically Important Equivalence Classes and Delta-Discriminative Emerging Patterns
2007
Cited by 23 (2 self)
The support-confidence framework is the measure most commonly used in itemset mining algorithms, owing to its anti-monotonicity, which effectively simplifies the search lattice. This computational convenience brings both quality and statistical flaws to the results, as observed by many previous studies. In this paper, we introduce a novel algorithm that produces itemsets with ranked statistical merits under sophisticated test statistics such as chi-square, risk ratio, and odds ratio. Our algorithm is based on the concept of equivalence classes. An equivalence class is a set of frequent itemsets that always occur together in the same set of transactions. Therefore, itemsets within an equivalence class all share the same level of statistical significance regardless of the choice of test statistic. As an equivalence class can be uniquely determined and concisely represented by a closed pattern and a set of generators, we mine only closed patterns and generators, using a simultaneous depth-first search scheme. This parallel approach has not been exploited by any prior work. We evaluate our algorithm on two aspects. In general, we compare it to LCM and FPclose, the best algorithms tailored for mining only closed patterns. In particular, we compare it to epMiner, the most recent algorithm for mining a type of relative-risk pattern known as minimal emerging patterns. Experimental results show that our algorithm is faster than all of them, sometimes by multiple orders of magnitude. These statistically ranked patterns, and the efficiency with which they are mined, hold high potential for real-life applications, especially in biomedical and financial fields, where classical test statistics are of dominant interest.
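The equivalence-class observation can be demonstrated in a few lines. In this sketch (toy transactions and item names, all invented), itemsets with identical transaction-id sets necessarily yield identical contingency tables, so any test statistic computed from those tables, chi-square, odds ratio, or risk ratio, agrees across the class:

```python
# Toy demonstration: itemsets in the same equivalence class occur in
# exactly the same transactions, so every contingency-table statistic
# is identical across the class.
transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"c"}, {"a", "b", "d"},
]

def tidset(itemset):
    """Ids of the transactions containing every item of the itemset."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)

# {a}, {b}, and {a, b} always co-occur here: one equivalence class.
assert tidset({"a"}) == tidset({"b"}) == tidset({"a", "b"})
# The closed pattern is the maximal member ({a, b} in this toy data);
# {a} and {b} are its generators, so mining those two representations
# recovers the whole class without enumerating every member.
```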
Exploiting A Support-based Upper Bound of Pearson's Correlation Coefficient for Efficiently Identifying Strongly Correlated Pairs
2004
Cited by 20 (1 self)
Given a userspecified minimum correlation threshold # and a market basket database with N items and T transactions, an allstrongpairs correlation query finds all item pairs with correlations above the threshold #. However, when the number of items and transactions are large, the computation cost of this query can be very high. In this paper, we identify an upper bound of Pearson's correlation coe#cient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coe#cient but also exhibits a special monotone property which allows pruning of many item pairs even without computing their upper bounds. A Twostep AllstrongPairs corrElation queRy (TAPER) algorithm is proposed to exploit these properties in a filterandrefine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning is independent or improves when the number of items is increased in data sets with common Zipf or linear ranksupport distributions. Experimental results from synthetic and real data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than bruteforce alternatives.
Bandits for taxonomies: A model-based approach
In Proc. of the SIAM International Conference on Data Mining
2007
"... We consider a novel problem of learning an optimal matching, in an online fashion, between two feature spaces that are organized as taxonomies. We formulate this as a multiarmed bandit problem where the arms of the bandit are dependent due to the structure induced by the taxonomies. We then propose ..."
Abstract

Cited by 17 (5 self)
 Add to MetaCart
We consider a novel problem of learning an optimal matching, in an online fashion, between two feature spaces that are organized as taxonomies. We formulate this as a multi-armed bandit problem where the arms of the bandit are dependent due to the structure induced by the taxonomies. We then propose a multi-stage hierarchical allocation scheme that improves the explore/exploit properties of classical multi-armed bandit policies in this scenario. In particular, our scheme uses the taxonomy structure and performs shrinkage estimation in a Bayesian framework to exploit dependencies among the arms, thereby enhancing exploration without losing efficiency on short-term exploitation. We prove that our scheme asymptotically converges to the optimal matching. We conduct extensive experiments on real data to illustrate the efficacy of our scheme in practice.
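A heavily simplified sketch of the underlying idea follows. This is not the paper's exact scheme: the taxonomy, arm names, and the 0.5 shrinkage weight are all invented. It shows Thompson sampling where each arm's Beta posterior borrows strength from its parent category's aggregate counts, so evidence about one arm softly informs its siblings:

```python
# Simplified taxonomy-aware Thompson sampling (illustrative only):
# each arm's Beta draw mixes its own successes/failures with a shrunk
# copy of its parent category's totals.
import random

random.seed(0)

taxonomy = {"sports": ["football", "tennis"], "news": ["politics"]}
stats = {arm: [0, 0] for arms in taxonomy.values() for arm in arms}  # [successes, failures]

def choose_arm():
    best, best_draw = None, -1.0
    for parent, arms in taxonomy.items():
        ps = sum(stats[a][0] for a in arms)   # parent-level successes
        pf = sum(stats[a][1] for a in arms)   # parent-level failures
        for a in arms:
            s, f = stats[a]
            # Beta posterior whose prior is shrunk toward the parent's rate.
            draw = random.betavariate(1 + s + 0.5 * ps, 1 + f + 0.5 * pf)
            if draw > best_draw:
                best, best_draw = a, draw
    return best

arm = choose_arm()
stats[arm][0] += 1  # pretend the pull succeeded, then update the counts
```

Successes under "football" raise the parent-level counts for "sports", which in turn nudges future draws for "tennis" upward: exploration is shared across the branch rather than per arm.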
A Comparison of Statistical and Machine Learning Algorithms on the Task of Link Completion
In KDD Workshop on Link Analysis for Detecting Complex Behavior
2003
Cited by 16 (3 self)
Link data, consisting of a collection of subsets of entities, can be an important source of information for a variety of fields, including the social sciences, biology, criminology, and business intelligence. However, these links may be incomplete, containing one or more unknown members. We consider the problem of link completion: identifying which entities are the most likely missing members of a link, given the previously observed links. We concentrate on the case of one missing entity. We compare a variety of recently developed algorithms, along with standard machine learning and strawman algorithms, each adjusted to suit the task. The algorithms were tested extensively on a simulated data set and a range of real-world data sets.
Is Pushing Constraints Deeply into the Mining Algorithms Really What We Want? An Alternative Approach for Association Rule Mining
2002
Cited by 14 (0 self)
The common approach to exploiting mining constraints is to push them deeply into the mining algorithms. In this paper we argue that this approach is based on an understanding of KDD that is no longer up-to-date. In fact, today KDD is seen as a human-centered, highly interactive and iterative process. Blindly enforcing constraints during the mining runs neglects the process character of KDD and is therefore no longer state of the art. Constraints can make a single algorithm run faster, but we are still far from response times that would allow true interactivity in KDD. In addition, we pay the price of repeated mining runs and, moreover, risk reducing data mining to some kind of hypothesis testing. Taking all this into consideration, we propose to do exactly the contrary of constrained mining: we accept an initial (nearly) unconstrained and costly mining run, but instead of a sequence of subsequent and still expensive constrained mining runs, we answer all further mining queries from this initial result set. Whereas this is straightforward for constraints that can be implemented as filters on the result set, things get more complicated when we restrict the underlying mining data. In practice such constraints are very important, e.g., generating rules for certain days of the week, or for families, singles, or male or female customers. We show how to postpone such row-restriction constraints on the transactions from rule generation to rule retrieval from the initial result set.
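The "filter the initial result set" half of the proposal is the easy part, and a sketch makes the contrast with re-mining concrete. The rule set and field names below are invented; note that this covers only result-set constraints, whereas the row-restriction constraints the abstract highlights require the retrieval machinery the paper develops:

```python
# Toy illustration: mine (nearly) unconstrained once, then answer later
# constrained queries as cheap filters over the stored rule set rather
# than launching a new mining run per query.
rules = [
    {"antecedent": {"beer"}, "consequent": {"chips"}, "support": 0.04, "confidence": 0.7},
    {"antecedent": {"milk"}, "consequent": {"bread"}, "support": 0.10, "confidence": 0.5},
    {"antecedent": {"milk", "eggs"}, "consequent": {"bread"}, "support": 0.02, "confidence": 0.8},
]

def query(rules, min_support=0.0, min_confidence=0.0, must_contain=frozenset()):
    """Filter the initial result set instead of re-running the miner."""
    return [r for r in rules
            if r["support"] >= min_support
            and r["confidence"] >= min_confidence
            and must_contain <= (r["antecedent"] | r["consequent"])]

print(len(query(rules, min_confidence=0.6, must_contain={"bread"})))  # -> 1
```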
TAPER: A two-step approach for all-strong-pairs correlation query in large databases
IEEE TKDE
Cited by 14 (2 self)
Abstract—Given a user-specified minimum correlation threshold and a market-basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold. However, when the numbers of items and transactions are large, the computation cost of this query can be very high. The goal of this paper is to provide computationally efficient algorithms to answer the all-strong-pairs correlation query. Indeed, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient, but also exhibits special monotone properties which allow pruning of many item pairs even without computing their upper bounds. A Two-step All-strong-Pairs corrElation queRy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning are independent of, or improve with, the number of items in data sets with Zipf-like or linear rank-support distributions. Experimental results from synthetic and real-world data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives. Finally, we demonstrate that the algorithmic ideas developed in the TAPER algorithm can be extended to efficiently compute negative correlation and uncentered Pearson's correlation coefficient. Index Terms—Association analysis, data mining, Pearson's correlation coefficient, statistical computing.
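The two-step structure of the query can be sketched end to end on toy data. Everything below is illustrative: the transactions, the threshold, and the support-based bound formula (which follows from supp(A,B) ≤ min(supp(A), supp(B))). The real TAPER algorithm additionally sorts items by support and exploits the bound's monotonicity to prune whole ranges of pairs without even evaluating their bounds:

```python
# Illustrative filter-and-refine loop over a toy 0/1 transaction matrix:
# the filter step discards pairs from supports alone; the refine step
# computes the exact phi coefficient only for surviving pairs.
from itertools import combinations
from math import sqrt

rows = [(1, 1, 0)] * 7 + [(1, 0, 0), (0, 1, 1), (0, 0, 0)]  # toy data
n = len(rows)
supp = [sum(r[i] for r in rows) / n for i in range(3)]

def upper_bound(sa, sb):
    """Support-based bound on phi, maximised at supp(A,B) = min(sa, sb)."""
    sa, sb = max(sa, sb), min(sa, sb)
    return sqrt(sb / sa) * sqrt((1 - sa) / (1 - sb))

def exact_phi(i, j):
    sab = sum(r[i] & r[j] for r in rows) / n
    num = sab - supp[i] * supp[j]
    return num / sqrt(supp[i] * (1 - supp[i]) * supp[j] * (1 - supp[j]))

theta = 0.3
strong = []
for i, j in combinations(range(3), 2):
    if upper_bound(supp[i], supp[j]) < theta:
        continue                      # filter: pruned without scanning the data
    if exact_phi(i, j) >= theta:      # refine: exact computation on survivors
        strong.append((i, j))

print(strong)  # -> [(0, 1)]
```

Here the rare third item is pruned against both others at the filter step, so only one of the three pairs ever requires a pass over the transactions.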