Results 1–10 of 61
Selecting the right objective measure for association analysis
 Information Systems
Abstract

Cited by 72 (6 self)
Objective measures such as support, confidence, interest factor, correlation, and entropy are often used to evaluate the interestingness of association patterns. However, in many situations, these measures may provide conflicting information about the interestingness of a pattern. Data mining practitioners also tend to apply an objective measure without realizing that there may be better alternatives available for their application. In this paper, we describe several key properties one should examine in order to select the right measure for a given application. A comparative study of these properties is made using twenty-one measures that were originally developed in diverse fields such as statistics, social science, machine learning, and data mining. We show that, depending on its properties, each measure is useful for some applications but not for others. We also demonstrate two scenarios in which many existing measures become consistent with each other, namely, when support-based pruning and a technique known as table standardization are applied. Finally, we present an algorithm for selecting a small set of patterns such that domain experts can find a measure that best fits their requirements by ranking this small set of patterns.
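A minimal sketch of three of the objective measures the abstract names (support, confidence, and interest factor, also called lift), computed from the co-occurrence counts of a rule A → B. The function name and the example counts are illustrative, not from the paper.

```python
# Illustrative sketch: support, confidence, and interest factor (lift)
# for a rule A -> B, from raw transaction counts.
def measures(n_ab, n_a, n_b, n):
    """n_ab: transactions containing A and B; n_a: containing A;
    n_b: containing B; n: total transactions."""
    support = n_ab / n
    confidence = n_ab / n_a
    # Interest factor (lift): observed co-occurrence over the value
    # expected under independence; 1.0 means A and B are independent.
    interest = (n_ab * n) / (n_a * n_b)
    return support, confidence, interest

s, c, i = measures(n_ab=20, n_a=40, n_b=50, n=100)
# support = 0.2, confidence = 0.5, interest = 1.0 (independent)
```

The conflict the abstract describes shows up already here: a rule can have high confidence while its interest factor says the items are independent.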
Discovering significant patterns
, 2007
Abstract

Cited by 46 (3 self)
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear, due to chance alone, to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data, and that when applied to real-world data they produce large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
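One well-established practice of the kind the abstract refers to is a Bonferroni correction: dividing the significance level by the number of patterns evaluated keeps the experimentwise error rate below α. A generic illustration, not the paper's exact procedure:

```python
# Hedged sketch: Bonferroni adjustment over a batch of candidate patterns.
def significant_patterns(p_values, alpha=0.05):
    """Return indices of patterns whose p-value survives Bonferroni."""
    threshold = alpha / len(p_values)  # adjusted per-test significance level
    return [i for i, p in enumerate(p_values) if p <= threshold]

# With 1000 candidate patterns at alpha = 0.05, each individual test
# must reach p <= 0.00005 to be reported.
hits = significant_patterns([0.00001, 0.03, 0.00004] + [0.5] * 997)
# -> [0, 2]
```

This illustrates the abstract's point about power: the more patterns the search considers, the harder each one must work to survive the correction.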
Assessing data mining results via swap randomization
 ACM Transactions on Knowledge Discovery from Data
Abstract

Cited by 37 (6 self)
The problem of assessing the significance of data mining results on high-dimensional 0–1 data sets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done by, e.g., chi-square tests, or many other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are more difficult to apply to sets of patterns or other complex results of data mining. In this paper, we consider a simple randomization technique that deals with this shortcoming. The approach consists of producing random datasets that have the same row and column margins as the given dataset, computing the results of interest on the randomized instances, and comparing them against the results on the actual data. This randomization technique can be used to assess the results of many different types of data mining algorithms, such as frequent sets, clustering, and rankings. To generate random datasets with given margins, we use variations of a Markov chain approach, which is based on a simple swap operation. We give theoretical results on the efficiency of different randomization methods, and apply the swap randomization method to several well-known datasets. Our results indicate that for some datasets the structure discovered by the data mining algorithms is a random artifact, while for other datasets the discovered structure conveys meaningful information.
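A hedged sketch of the swap operation the abstract describes: pick two rows and two columns forming a 2×2 "checkerboard" and flip it, which leaves every row and column sum unchanged. Details such as mixing time and the number of swaps needed are simplified relative to the paper.

```python
import random

# Sketch of margin-preserving swap randomization on a 0-1 matrix.
def swap_randomize(matrix, n_swaps, seed=0):
    m = [row[:] for row in matrix]          # work on a copy
    rng = random.Random(seed)
    rows, cols = len(m), len(m[0])
    for _ in range(n_swaps):
        r1, r2 = rng.randrange(rows), rng.randrange(rows)
        c1, c2 = rng.randrange(cols), rng.randrange(cols)
        # A swappable pair looks like [[1, 0], [0, 1]]; flipping it
        # changes the matrix but not any row or column sum.
        if m[r1][c1] == m[r2][c2] == 1 and m[r1][c2] == m[r2][c1] == 0:
            m[r1][c1] = m[r2][c2] = 0
            m[r1][c2] = m[r2][c1] = 1
    return m

data = [[1, 0, 1],
        [0, 1, 0],
        [1, 1, 0]]
shuffled = swap_randomize(data, n_swaps=100)
# Row and column margins are unchanged:
assert [sum(r) for r in shuffled] == [sum(r) for r in data]
assert [sum(c) for c in zip(*shuffled)] == [sum(c) for c in zip(*data)]
```

A mining result is then judged significant only if it stands out against the results recomputed on many such randomized copies.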
Interestingness of Frequent Itemsets Using Bayesian Networks as Background Knowledge
 In Proceedings of the SIGKDD Conference on Knowledge Discovery and Data Mining
, 2004
Local and global methods in data mining: basic techniques and open problems
 In Automata, Languages, and Programming
, 2002
Exploiting a Support-based Upper Bound of Pearson's Correlation Coefficient for Efficiently Identifying Strongly Correlated Pairs
, 2004
Abstract

Cited by 25 (2 self)
Given a user-specified minimum correlation threshold θ and a market basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the number of items and transactions is large, the computation cost of this query can be very high. In this paper, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient but also exhibits a special monotone property which allows pruning of many item pairs even without computing their upper bounds. A Two-step All-strong-Pairs corrElation queRy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning are independent of, or improve with, the number of items in data sets with common Zipf or linear rank-support distributions. Experimental results from synthetic and real data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives.
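A hedged sketch of the bound the abstract describes. `phi()` is Pearson's correlation specialized to 0/1 variables; `upper_bound()` is the support-only bound obtained by substituting supp(AB) ≤ min(supp(A), supp(B)) into the phi formula. This is reconstructed from the abstract's description, so treat the exact form as an assumption rather than the paper's verbatim formula.

```python
from math import sqrt

def phi(p_ab, p_a, p_b):
    """Pearson's correlation of two binary items from their supports."""
    return (p_ab - p_a * p_b) / sqrt(p_a * p_b * (1 - p_a) * (1 - p_b))

def upper_bound(p_a, p_b):
    """Support-based upper bound on phi; needs no pair support p_ab."""
    lo, hi = min(p_a, p_b), max(p_a, p_b)
    return sqrt(lo / hi) * sqrt((1 - hi) / (1 - lo))

# Filter-and-refine: if the cheap bound is already below the threshold,
# the pair is pruned without ever counting its co-occurrences.
theta = 0.9
assert upper_bound(0.6, 0.1) < theta                  # pruned
assert phi(0.08, 0.6, 0.1) < upper_bound(0.6, 0.1)    # bound holds
```

The bound depends only on the two individual supports, which is what makes the two-step filter cheap: the expensive pair support is computed only for pairs that survive it.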
Mining Statistically Important Equivalence Classes and Delta-Discriminative Emerging Patterns
, 2007
Abstract

Cited by 23 (2 self)
The support-confidence framework is the most commonly used in itemset mining algorithms, owing to an anti-monotonicity that effectively simplifies the search lattice. This computational convenience brings both quality and statistical flaws to the results, as observed by many previous studies. In this paper, we introduce a novel algorithm that produces itemsets with ranked statistical merits under sophisticated test statistics such as chi-square, risk ratio, odds ratio, etc. Our algorithm is based on the concept of equivalence classes. An equivalence class is a set of frequent itemsets that always occur together in the same set of transactions. Therefore, itemsets within an equivalence class all share the same level of statistical significance regardless of the variety of test statistics. As an equivalence class can be uniquely determined and concisely represented by a closed pattern and a set of generators, we mine only closed patterns and generators, using a simultaneous depth-first search scheme. This parallel approach has not been exploited by any prior work. We evaluate our algorithm on two aspects. In general, we compare it to LCM and FPclose, the best algorithms tailored for mining only closed patterns. In particular, we compare it to epMiner, the most recent algorithm for mining a type of relative-risk patterns known as minimal emerging patterns. Experimental results show that our algorithm is faster than all of them, sometimes by multiple orders of magnitude. These statistically ranked patterns and this efficiency have high potential for real-life applications, especially in biomedical and financial fields where classical test statistics are of dominant interest.
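A toy illustration of the equivalence-class idea from the abstract: itemsets that occur in exactly the same set of transactions (the same "tidset") share every support-based statistic, so one representative per class suffices. This is a brute-force grouping for intuition, not the paper's depth-first mining of closed patterns and generators.

```python
from itertools import combinations

# Group itemsets (up to max_size) by the exact set of transaction ids
# in which they occur; each group is one equivalence class.
def equivalence_classes(transactions, max_size=2):
    classes = {}  # tidset -> list of itemsets sharing that tidset
    items = sorted({i for t in transactions for i in t})
    for size in range(1, max_size + 1):
        for itemset in combinations(items, size):
            tids = frozenset(
                tid for tid, t in enumerate(transactions)
                if set(itemset) <= t
            )
            if tids:  # skip itemsets that never occur
                classes.setdefault(tids, []).append(itemset)
    return classes

db = [{"a", "b", "c"}, {"a", "b"}, {"c"}]
cls = equivalence_classes(db)
# ('a',), ('b',), and ('a','b') all occur in exactly transactions
# {0, 1}, so they form one class and share every test statistic.
```

Any test statistic computed from occurrence counts (chi-square, risk ratio, odds ratio) is therefore evaluated once per class rather than once per itemset.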
Bandits for taxonomies: A model-based approach
 In Proc. of the SIAM International Conference on Data Mining
, 2007
Abstract

Cited by 18 (5 self)
We consider a novel problem of learning an optimal matching, in an online fashion, between two feature spaces that are organized as taxonomies. We formulate this as a multi-armed bandit problem where the arms of the bandit are dependent due to the structure induced by the taxonomies. We then propose a multi-stage hierarchical allocation scheme that improves the explore/exploit properties of the classical multi-armed bandit policies in this scenario. In particular, our scheme uses the taxonomy structure and performs shrinkage estimation in a Bayesian framework to exploit dependencies among the arms, thereby enhancing exploration without losing efficiency on short-term exploitation. We prove that our scheme asymptotically converges to the optimal matching. We conduct extensive experiments on real data to illustrate the efficacy of our scheme in practice.
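A hedged sketch of a classical multi-armed bandit policy (UCB1), the kind of baseline explore/exploit scheme the abstract says its taxonomy-aware hierarchical allocation improves on; the Bayesian shrinkage and taxonomy structure from the paper are not shown. Rewards are deterministic arm means here purely to keep the illustration reproducible.

```python
from math import log, sqrt

# UCB1: play each arm once, then repeatedly pick the arm with the
# highest empirical mean plus an exploration bonus that shrinks as
# the arm accumulates plays.
def ucb1(arm_means, rounds):
    n = len(arm_means)
    counts, sums = [0] * n, [0.0] * n
    for t in range(rounds):
        if t < n:
            arm = t  # initialization: one pull per arm
        else:
            arm = max(range(n), key=lambda a:
                      sums[a] / counts[a] + sqrt(2 * log(t) / counts[a]))
        counts[arm] += 1
        sums[arm] += arm_means[arm]  # deterministic reward for the demo
    return counts

counts = ucb1([0.2, 0.8, 0.5], rounds=2000)
assert counts[1] == max(counts)  # the best arm dominates the pulls
```

The paper's contribution is to make this exploration cheaper when arms share taxonomy structure, so that information from one arm shrinks the uncertainty of its siblings.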
TAPER: A two-step approach for all-strong-pairs correlation query in large databases
 IEEE TKDE
Abstract

Cited by 18 (3 self)
Given a user-specified minimum correlation threshold and a market-basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold. However, when the number of items and transactions is large, the computation cost of this query can be very high. The goal of this paper is to provide computationally efficient algorithms to answer the all-strong-pairs correlation query. Indeed, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient, but also exhibits special monotone properties which allow pruning of many item pairs even without computing their upper bounds. A Two-step All-strong-Pairs corrElation queRy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning are independent of, or improve with, the number of items in data sets with Zipf-like or linear rank-support distributions. Experimental results from synthetic and real-world data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives. Finally, we demonstrate that the algorithmic ideas developed in the TAPER algorithm can be extended to efficiently compute negative correlation and the uncentered Pearson's correlation coefficient.
Index Terms: Association analysis, data mining, Pearson's correlation coefficient, statistical computing.
A Comparison of Statistical and Machine Learning Algorithms on the Task of Link Completion
 In KDD Workshop on Link Analysis for Detecting Complex Behavior
, 2003
Abstract

Cited by 16 (3 self)
Link data, consisting of a collection of subsets of entities, can be an important source of information for a variety of fields including the social sciences, biology, criminology, and business intelligence. However, these links may be incomplete, containing one or more unknown members. We consider the problem of link completion: identifying which entities are the most likely missing members of a link, given the previously observed links. We concentrate on the case of one missing entity. We compare a variety of recently developed algorithms, along with standard machine learning and strawman algorithms adjusted to suit the task. The algorithms were tested extensively on a simulated data set and a range of real-world data sets.
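A hedged sketch of the link-completion task itself: score each candidate entity by how often it co-occurred with the partial link's known members in previously observed links. This is a strawman-style baseline of the sort the abstract compares against, not any specific algorithm from the paper.

```python
from collections import Counter

# Rank candidate missing members of a partial link by co-occurrence
# with its known members in the observed link history.
def complete_link(observed_links, partial_link):
    scores = Counter()
    for link in observed_links:
        overlap = len(partial_link & link)
        if overlap:
            for entity in link - partial_link:
                scores[entity] += overlap  # credit shared membership
    return scores.most_common()

history = [{"ann", "bob", "cat"}, {"ann", "bob", "dan"}, {"cat", "dan"}]
ranking = complete_link(history, partial_link={"ann", "bob"})
# "cat" and "dan" each appeared once in a link containing both known
# members, so both score 2 and tie at the top of the ranking.
```

Stronger entries in the comparison would replace this raw count with a learned model of link membership, but the input/output contract is the same.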