Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining
 Journal of Machine Learning Research
Cited by 25 (0 self)
This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task, and by exploring the apparent differences between the approaches. It also shows that various rule learning heuristics used in CSM, EPM and SD algorithms all aim at optimizing a trade-off between rule coverage and precision. The commonalities (and differences) between the approaches are showcased on a selection of best-known variants of CSM, EPM and SD algorithms. The paper also provides a critical survey of existing supervised descriptive rule discovery visualization methods.
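As a rough illustration of the coverage/precision trade-off mentioned in this abstract, the sketch below computes weighted relative accuracy (WRAcc), a heuristic commonly used in subgroup discovery. The abstract itself does not name a specific heuristic, so the choice of WRAcc, the function name `wracc` and its parameter names are assumptions made for illustration only.

```python
def wracc(n_covered_pos, n_covered, n_pos, n_total):
    """Weighted relative accuracy of a rule (illustrative sketch).

    n_covered_pos: examples covered by the rule that belong to the target class
    n_covered:     all examples covered by the rule
    n_pos:         all examples of the target class
    n_total:       all examples
    """
    if n_covered == 0 or n_total == 0:
        return 0.0  # a rule that covers nothing carries no information
    coverage = n_covered / n_total          # how general the rule is
    precision = n_covered_pos / n_covered   # how accurate the rule is
    prior = n_pos / n_total                 # default precision of the class
    # High scores need both broad coverage and precision above the prior.
    return coverage * (precision - prior)


# A rule covering 40 of 100 examples, 30 of them positive, with 50 positives overall.
score = wracc(n_covered_pos=30, n_covered=40, n_pos=50, n_total=100)
```

A rule with perfect precision but tiny coverage, or with large coverage but precision equal to the class prior, both score close to zero under this measure.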
Mining Statistically Important Equivalence Classes and Delta-Discriminative Emerging Patterns
, 2007
Cited by 23 (2 self)
The support-confidence framework is the most common measure used in itemset mining algorithms, for its anti-monotonicity that effectively simplifies the search lattice. This computational convenience brings both quality and statistical flaws to the results as observed by many previous studies. In this paper, we introduce a novel algorithm that produces itemsets with ranked statistical merits under sophisticated test statistics such as chi-square, risk ratio, odds ratio, etc. Our algorithm is based on the concept of equivalence classes. An equivalence class is a set of frequent itemsets that always occur together in the same set of transactions. Therefore, itemsets within an equivalence class all share the same level of statistical significance regardless of the variety of test statistics. As an equivalence class can be uniquely determined and concisely represented by a closed pattern and a set of generators, we just mine closed patterns and generators, taking a simultaneous depth-first search scheme. This parallel approach has not been exploited by any prior work. We evaluate our algorithm on two aspects. In general, we compare to LCM and FPclose, which are the best algorithms tailored for mining only closed patterns. In particular, we compare to epMiner, which is the most recent algorithm for mining a type of relative risk patterns, known as minimal emerging patterns. Experimental results show that our algorithm is faster than all of them, sometimes even multiple orders of magnitude faster. These statistically ranked patterns and the efficiency have a high potential for real-life applications, especially in biomedical and financial fields where classical test statistics are of dominant interest.
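To make the notion of an equivalence class concrete, here is a minimal brute-force sketch that groups itemsets by the set of transactions they occur in, then reports the closed pattern (the largest member) and the generators (the minimal members) of each class. It enumerates every itemset, so it is only meant for toy data and is not the depth-first algorithm the abstract describes; the function name `equivalence_classes` and the sample transactions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def equivalence_classes(transactions, min_support=1):
    """Map each transaction-id set to (closed pattern, generators)."""
    items = sorted(set().union(*transactions))
    classes = defaultdict(list)
    # Brute force: enumerate every non-empty itemset and record where it occurs.
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            tids = frozenset(
                i for i, t in enumerate(transactions) if itemset <= t
            )
            if len(tids) >= min_support:
                classes[tids].append(itemset)
    result = {}
    for tids, members in classes.items():
        # The closed pattern is the unique largest itemset in the class.
        closed = max(members, key=len)
        # Generators are members with no proper subset in the same class.
        generators = [m for m in members if not any(o < m for o in members)]
        result[tids] = (closed, generators)
    return result


transactions = [{"a", "b", "c"}, {"a", "b"}, {"c"}]
classes = equivalence_classes(transactions)
```

In this toy data, `{a}`, `{b}` and `{a, b}` all occur in exactly transactions 0 and 1, so they form one class with closed pattern `{a, b}` and generators `{a}` and `{b}`. Any statistic computed from the occurrence set alone is identical for all three.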
Mining minimal distinguishing subsequence patterns with gap constraints
 In ICDM
, 2005
Cited by 19 (3 self)
Discovering contrasts between collections of data is an important task in data mining. In this paper, we introduce a new type of contrast pattern, called a Minimal Distinguishing Subsequence (MDS). An MDS is a minimal subsequence that occurs frequently in one class of sequences and infrequently in sequences of another class. It is a natural way of representing strong and succinct contrast information between two sequential datasets and can be useful in applications such as protein comparison, document comparison and building sequential classification models. Mining MDS patterns is a challenging task and is significantly different from mining contrasts between relational/transactional data. One particularly important type of constraint that can be integrated into the mining process is the maximum gap constraint. We present an efficient algorithm, called ConSGapMiner, to mine all MDSs according to a maximum gap constraint. It employs highly efficient bitset and boolean operations for powerful gap-based pruning within a prefix growth framework. A performance evaluation with both sparse and dense datasets demonstrates the scalability of ConSGapMiner and shows its ability to mine patterns from high-dimensional datasets at low supports.
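The maximum gap constraint can be illustrated with a small sketch that tests whether a pattern occurs as a subsequence of a sequence such that at most `max_gap` elements lie between consecutive matched positions. This reading of the gap, along with the names `occurs_with_max_gap` and `support`, is an assumption made for illustration; the sketch uses plain Python sets rather than the bitset operations the abstract mentions.

```python
def occurs_with_max_gap(pattern, sequence, max_gap):
    """True if pattern is a subsequence of sequence under the gap constraint."""
    if not pattern:
        return True
    # Positions where the first pattern element can be matched.
    ends = {i for i, x in enumerate(sequence) if x == pattern[0]}
    for symbol in pattern[1:]:
        # Keep every position j that extends some earlier match ending at i
        # with at most max_gap elements strictly between i and j.
        ends = {
            j
            for j, x in enumerate(sequence)
            if x == symbol and any(0 < j - i <= max_gap + 1 for i in ends)
        }
        if not ends:
            return False
    return bool(ends)


def support(pattern, sequences, max_gap):
    """Fraction of sequences that contain the pattern under the constraint."""
    hits = sum(occurs_with_max_gap(pattern, s, max_gap) for s in sequences)
    return hits / len(sequences)


positive = ["axxb", "ab", "cab"]
negative = ["ba", "axxxxb", "ccc"]
pos_support = support("ab", positive, max_gap=2)
neg_support = support("ab", negative, max_gap=2)
```

Tracking all possible end positions, instead of greedily taking the first match, matters here: in `"axxbab"` with `max_gap=0` the first `a` cannot be extended, but the second one can. A distinguishing pattern would be one whose support is high in one class and low in the other.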
Fast mining of high dimensional expressive contrast patterns using zero-suppressed binary decision diagrams
 In KDD
, 2006
Cited by 18 (3 self)
Patterns of contrast are a very important way of comparing multidimensional datasets. Such patterns are able to capture regions of high difference between two classes of data, and are useful for human experts and the construction of classifiers. However, mining such patterns is particularly challenging when the number of dimensions is large. This paper describes a new technique for mining several varieties of contrast pattern, based on the use of Zero-Suppressed Binary Decision Diagrams (ZBDDs), a powerful data structure for manipulating sparse data. We study the mining of both simple contrast patterns, such as emerging patterns, and more novel and complex contrasts, which we call disjunctive emerging patterns. A performance study demonstrates that our ZBDD technique is highly scalable, substantially improves on state-of-the-art mining for emerging patterns, and can be effective for discovering complex contrasts from datasets with thousands of attributes.
Group SAX: Extending the notion of contrast sets to time series and multimedia data
 In PKDD
, 2006
Cited by 11 (2 self)
In this work, we take the traditional notion of contrast sets and extend it to other data types, in particular time series and, by extension, images. In the traditional sense, contrast-set mining identifies attributes, values and instances that differ significantly across groups, and helps users understand the differences between groups of data. We reformulate the notion of contrast sets for time series data, and define it to be the key pattern(s) that are maximally different from the other set of data. We propose a fast and exact algorithm to find the contrast sets, and demonstrate its utility in several diverse domains, ranging from industry to anthropology. We show that our algorithm achieves 3 orders of magnitude speedup over the brute-force algorithm, while producing exact solutions.
A Systematic Approach for Optimizing Complex Mining Tasks on Multiple Databases
Cited by 10 (6 self)
... and iterative process. In order to support this process, one of the long-term goals of data mining research has been to build a Knowledge Discovery and Data Mining System (KDDMS). Along this line, much research has been done to provide database support for mining operations.
A Statistically Sound Alternative Approach to Mining Contrast Sets
 In Proceedings of the 4th Australasian Data Mining Conference (AusDM
, 2005
Cited by 9 (4 self)
One of the fundamental tasks of data analysis in many disciplines is to identify the significant differences between classes or groups. Contrast sets have previously been proposed as a useful tool for describing these differences. A contrast set is a conjunction of (association rule-like) attribute-value pairs for which the conjunction is true for some group. The intuition is that comparing the support for a contrast set across groups may provide some insight into the fundamental differences between the groups. In this paper, we compare two contrast set mining methods that rely on different statistical philosophies: the well-known STUCCO approach, and CIGAR, our proposed alternative approach. We survey and discuss the statistical measures underlying the two methods using an informal tutorial approach. Experimental results show that both methodologies are statistically sound, representing valid alternative solutions to the problem of identifying potentially interesting contrast sets.
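The idea of comparing the support of a contrast set across groups can be sketched as follows. The code treats a contrast set as a dictionary of attribute-value pairs and computes, per group, the fraction of rows in which the whole conjunction holds. It covers only this support comparison and does not implement the statistical tests of STUCCO or CIGAR; the function names and the toy records are hypothetical.

```python
def group_supports(rows, groups, contrast_set):
    """Support of a conjunction of attribute-value pairs within each group."""
    totals, hits = {}, {}
    for row, group in zip(rows, groups):
        totals[group] = totals.get(group, 0) + 1
        # The contrast set holds for a row when every pair matches.
        if all(row.get(attr) == value for attr, value in contrast_set.items()):
            hits[group] = hits.get(group, 0) + 1
    return {g: hits.get(g, 0) / totals[g] for g in totals}


def max_support_difference(supports):
    """Largest difference in support between any two groups."""
    values = list(supports.values())
    return max(values) - min(values)


rows = [
    {"degree": "PhD", "sector": "public"},
    {"degree": "BSc", "sector": "public"},
    {"degree": "PhD", "sector": "private"},
    {"degree": "PhD", "sector": "private"},
]
groups = ["A", "A", "B", "B"]
supports = group_supports(rows, groups, {"degree": "PhD"})
difference = max_support_difference(supports)
```

Here the candidate contrast set `degree = PhD` holds for half of group A and all of group B. Whether such a difference counts as significant is exactly where the two methods compared in the abstract diverge.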
Statistically Sound Exploratory Rule Discovery
Cited by 4 (2 self)
Association rule discovery and other exploratory rule discovery techniques explore large search spaces of potential rules to find those that appear interesting by some user-selected criterion of interestingness. Due to the large number of rules considered, they suffer from an extreme risk of type-1 error, finding rules that appear to satisfy the interestingness criteria on the sample data only due to chance. One method for avoiding this risk is to apply a correction for multiple comparisons during the rule discovery process. While this may result in statistically sound rule discovery with tight control over the risk of type-1 error, it introduces extreme risk of type-2 error, rejecting rules that do in fact satisfy the interestingness criteria. This paper proposes a technique to overcome this problem by using holdout data for statistical evaluation. Experiments demonstrate that traditional association rule discovery can result in large numbers of rules that are rejected when subjected to statistical evaluation on holdout data. They also reveal that modification of the rule discovery process to anticipate subsequent statistical evaluation can increase the number of rules that satisfy an interestingness criterion that are accepted by statistical evaluation on holdout data.
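A minimal sketch of holdout evaluation follows, assuming that rules are first discovered on an exploratory split and that each surviving candidate is then summarized on the holdout split as a 2x2 contingency table. The choice of a one-sided Fisher exact test with a Bonferroni correction over the number of candidates is an assumption for illustration, as the abstract does not specify the test or the correction; the names `fisher_exact_greater` and `accept_on_holdout` are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the table [[a, b], [c, d]].

    a: holdout cases matching antecedent and consequent
    b: matching antecedent only
    c: matching consequent only
    d: matching neither
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c
    upper = min(row1, col1)
    # Hypergeometric tail: probability of a count of at least `a` by chance.
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return tail / comb(n, col1)


def accept_on_holdout(candidates, alpha=0.05):
    """Keep the rules whose holdout p-value passes a Bonferroni threshold.

    candidates: list of (rule_name, (a, b, c, d)) measured on holdout data.
    """
    if not candidates:
        return []
    # Correct only for the rules actually tested, not the whole search space.
    threshold = alpha / len(candidates)
    return [
        name
        for name, table in candidates
        if fisher_exact_greater(*table) <= threshold
    ]


candidates = [("rule_1", (10, 0, 0, 10)), ("rule_2", (5, 5, 5, 5))]
accepted = accept_on_holdout(candidates)
```

Because the correction divides by the number of candidate rules rather than by the size of the full search space, the threshold is far less strict, which is one way to read the reduction in type-2 error that the abstract claims.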
Contrasting the Contrast Sets: An Alternative Approach
Cited by 3 (0 self)
The need to identify significant differences between contrasting groups or classes is ubiquitous and thus has been the focus of many statisticians and data miners. Contrast sets, conjunctions of attribute-value pairs significantly more frequent in one group than another, were proposed to describe such differences, which led to the introduction of a new data mining technique: contrast-set mining. A number of attempts have been made in this regard by various authors; however, no clear picture seems to have emerged. In this paper, we try to address the problem of finding meaningful contrast sets by using association rule based analysis. We present the results of our experiments for interesting contrast sets and compare these results with those obtained from the well-known algorithm for contrast sets, STUCCO.
Discovering substantial distinctions among incremental biclusters
 in SDM
, 2009
Cited by 3 (3 self)
A fundamental task of data analysis is comprehending what distinguishes clusters found within the data. We present the problem of mining distinguishing sets, which seeks to find sets of objects or attributes that induce the most change among the incremental biclusters of a binary dataset. Unlike emerging patterns and contrast sets, which only focus on statistical differences between support of itemsets, our approach considers distinctions in both the attribute space and the object space. Viewing the lattice of biclusters formed within a data set as a weighted directed graph, we mine the most significant distinguishing sets by growing a maximal cost spanning tree of the lattice. In this paper we present a weighting function for measuring distinction among biclusters in the lattice and the novel MIDS algorithm. MIDS simultaneously enumerates biclusters, constructs the bicluster lattice, and computes the distinguishing sets. The efficient computational performance of MIDS is exhibited in a performance test on real-world and benchmark data sets. The utility of distinguishing sets is also demonstrated with experiments on synthetic and real data.