Results 1  10
of
97
Mining significant graph patterns by leap search
 in SIGMOD ’08
"... With everincreasing amounts of graph data from disparate sources, there has been a strong need for exploiting significant graph patterns with userspecified objective functions. Most objective functions are not antimonotonic, which could fail all of frequencycentric graph mining algorithms. In thi ..."
Abstract

Cited by 66 (17 self)
 Add to MetaCart
(Show Context)
With everincreasing amounts of graph data from disparate sources, there has been a strong need for exploiting significant graph patterns with userspecified objective functions. Most objective functions are not antimonotonic, which could fail all of frequencycentric graph mining algorithms. In this paper, we give the first comprehensive study on general mining method aiming to find most significant patterns directly. Our new mining framework, called LEAP(Descending Leap Mine), is developed to exploit the correlation between structural similarity and significance similarity in a way that the most significant pattern could be identified quickly by searching dissimilar graph patterns. Two novel concepts, structural leap search and frequency descending mining, are proposed to support leap search in graph pattern space. Our new mining method revealed that the widely adopted branchandbound search in data mining literature is indeed not the best, thus sketching a new picture on scalable graph pattern discovery. Empirical results show that LEAP achieves orders of magnitude speedup in comparison with the stateoftheart method. Furthermore, graph classifiers built on mined patterns outperform the uptodate graph kernel method in terms of efficiency and accuracy, demonstrating the high promise of such patterns.
Discovering significant patterns
, 2007
"... Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some userspecified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type1 error, that is, of finding patter ..."
Abstract

Cited by 56 (4 self)
 Add to MetaCart
(Show Context)
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some userspecified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying wellestablished statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to realworld data result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
On Detecting Differences Between Groups
, 2003
"... Understanding the differences between contrasting groups is a fundamental task in data analysis. This realization has led to the development of a new special purpose data mining technique, contrastset mining. We undertook a study with a retail collaborator to compare contrastset mining with existi ..."
Abstract

Cited by 50 (2 self)
 Add to MetaCart
(Show Context)
Understanding the differences between contrasting groups is a fundamental task in data analysis. This realization has led to the development of a new special purpose data mining technique, contrastset mining. We undertook a study with a retail collaborator to compare contrastset mining with existing rulediscovery techniques. To our surprise we observed that straightforward application of an existing commercial rulediscovery system, Magnum Opus, could successfully perform the contrastsetmining task. This led to the realization that contrastset mining is a special case of the more general rulediscovery task. We present the results of our study together with a proof of this conclusion.
Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining
 Journal of Machine Learning Research
"... This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use ..."
Abstract

Cited by 48 (0 self)
 Add to MetaCart
This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task and by exploring the apparent differences between the approaches. It also shows that various rule learning heuristics used in CSM, EPM and SD algorithms all aim at optimizing a trade off between rule coverage and precision. The commonalities (and differences) between the approaches are showcased on a selection of best known variants of CSM, EPM and SD algorithms. The paper also provides a critical survey of existing supervised descriptive rule discovery visualization methods.
Mining Statistically Important Equivalence Classes and DeltaDiscriminative Emerging Patterns
, 2007
"... The supportconfidence framework is the most common measure used in itemset mining algorithms, for its antimonotonicity that effectively simplifies the search lattice. This computational convenience brings both quality and statistical flaws to the results as observed by many previous studies. In thi ..."
Abstract

Cited by 31 (2 self)
 Add to MetaCart
The supportconfidence framework is the most common measure used in itemset mining algorithms, for its antimonotonicity that effectively simplifies the search lattice. This computational convenience brings both quality and statistical flaws to the results as observed by many previous studies. In this paper, we introduce a novel algorithm that produces itemsets with ranked statistical merits under sophisticated test statistics such as chisquare, risk ratio, odds ratio, etc. Our algorithm is based on the concept of equivalence classes. An equivalence class is a set of frequent itemsets that always occur together in the same set of transactions. Therefore, itemsets within an equivalence class all share the same level of statistical significance regardless of the variety of test statistics. As an equivalence class can be uniquely determined and concisely represented by a closed pattern and a set of generators, we just mine closed patterns and generators, taking a simultaneous depthfirst search scheme. This parallel approach has not been exploited by any prior work. We evaluate our algorithm on two aspects. In general, we compare to LCM and FPclose which are the best algorithms tailored for mining only closed patterns. In particular, we compare to epMiner which is the most recent algorithm for mining a type of relative risk patterns, known as minimal emerging patterns. Experimental results show that our algorithm is faster than all of them, sometimes even multiple orders of magnitude faster. These statistically ranked patterns and the efficiency have a high potential for reallife applications, especially in biomedical and financial fields where classical test statistics are of dominant interest.
Mining minimal distinguishing subsequence patterns with gap constraints
 In ICDM
, 2005
"... Discovering contrasts between collections of data is an important task in data mining. In this paper, we introduce a new type of contrast pattern, called a Minimal Distinguishing Subsequence (MDS). An MDS is a minimal subsequence that occurs frequently in one class of sequences and infrequently in s ..."
Abstract

Cited by 26 (3 self)
 Add to MetaCart
(Show Context)
Discovering contrasts between collections of data is an important task in data mining. In this paper, we introduce a new type of contrast pattern, called a Minimal Distinguishing Subsequence (MDS). An MDS is a minimal subsequence that occurs frequently in one class of sequences and infrequently in sequences of another class. It is a natural way of representing strong and succinct contrast information between two sequential datasets and can be useful in applications such as protein comparison, document comparison and building sequential classification models. Mining MDS patterns is a challenging task and is significantly different from mining contrasts between relational/transactional data. One particularly important type of constraint that can be integrated into the mining process is the maximum gap constraint. We present an efficient algorithm called ConSGapMiner, to mine all MDSs according to a maximum gap constraint. It employs highly efficient bitset and boolean operations, for powerful gap based pruning within a prefix growth framework. A performance evaluation with both sparse and dense datasets, demonstrates the scalability of ConSGapMiner and shows its ability to mine patterns from high dimensional datasets at low supports. 1.
Clustering and Sequential Pattern Mining of Online Collaborative Learning Data
 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
"... Group work is widespread in education. The growing use of online tools supporting group work generates huge amounts of data. We aim to exploit this data to support mirroring: presenting useful highlevel views of information about the group, together with desired patterns characterizing the behavio ..."
Abstract

Cited by 25 (2 self)
 Add to MetaCart
Group work is widespread in education. The growing use of online tools supporting group work generates huge amounts of data. We aim to exploit this data to support mirroring: presenting useful highlevel views of information about the group, together with desired patterns characterizing the behaviour of strong groups. The goal is to enable the groups and their facilitators to see relevant aspects of the group’s operation and provide feedback if these are more likely to be associated with positive or negative outcomes and where the problems are. We explore how useful mirror information can be extracted via a theorydriven approach and a range of clustering and sequential pattern mining. The context is a senior software development project where students use the collaboration tool TRAC. We extract patterns distinguishing the better from the weaker groups and get insights in the success factors. The results point to the importance of leadership and group interaction, and give promising indications if they are occurring. Patterns indicating good individual practices were also identified. We found that some key measures can be mined from early data. The results are promising for advising groups at the start and early identification of effective and poor practices, in time for remediation.
Fast mining of high dimensional expressive contrast patterns using zerosuppressed binary decision diagrams
 In KDD
, 2006
"... Patterns of contrast are a very important way of comparing multidimensional datasets. Such patterns are able to capture regions of high difference between two classes of data, and are useful for human experts and the construction of classifiers. However, mining such patterns is particularly challeng ..."
Abstract

Cited by 24 (5 self)
 Add to MetaCart
(Show Context)
Patterns of contrast are a very important way of comparing multidimensional datasets. Such patterns are able to capture regions of high difference between two classes of data, and are useful for human experts and the construction of classifiers. However, mining such patterns is particularly challenging when the number of dimensions is large. This paper describes a new technique for mining several varieties of contrast pattern, based on the use of ZeroSuppressed Binary Decision Diagrams (ZBDDs), a powerful data structure for manipulating sparse data. We study the mining of both simple contrast patterns, such as emerging patterns, and more novel and complex contrasts, which we call disjunctive emerging patterns. A performance study demonstrates our ZBDD technique is highly scalable, substantially improves on state of the art mining for emerging patterns and can be effective for discovering complex contrasts from datasets with thousands of attributes.
Closed Sets for Labeled Data ⋆
"... Abstract. Closed sets are being successfully applied in the context of compacted data representation for association rule learning. However, their use is mainly descriptive. This paper shows that, when considering labeled data, closed sets can be adapted for prediction and discrimination purposes by ..."
Abstract

Cited by 23 (0 self)
 Add to MetaCart
(Show Context)
Abstract. Closed sets are being successfully applied in the context of compacted data representation for association rule learning. However, their use is mainly descriptive. This paper shows that, when considering labeled data, closed sets can be adapted for prediction and discrimination purposes by conveniently contrasting covering properties on positive and negative examples. We formally justify that these sets characterize the space of relevant combinations of features for discriminating the target class. In practice, identifying relevant/irrelevant combinations of features through closed sets is useful in many applications. Here we apply it to compacting emerging patterns and essential rules and to learn descriptions for subgroup discovery. 1
Mining interesting contrast rules for a webbased educational system
 In International Conference on Machine Learning and Applications (ICMLA
, 2004
"... Webbased educational technologies allow educators to study how students learn (descriptive studies) and which learning strategies are most effective (causal/predictive studies). Since webbased educational systems collect vast amounts of student profile data, data mining and knowledge discovery tec ..."
Abstract

Cited by 22 (2 self)
 Add to MetaCart
(Show Context)
Webbased educational technologies allow educators to study how students learn (descriptive studies) and which learning strategies are most effective (causal/predictive studies). Since webbased educational systems collect vast amounts of student profile data, data mining and knowledge discovery techniques can be applied to find interesting relationships between attributes of students, assessments, and the solution strategies adopted by students. This paper focuses on the discovery of interesting contrast rules, which are sets of conjunctive rules describing interesting characteristics of different segments of a population. In the context of webbased educational systems, contrast rules help to identify attributes characterizing patterns of performance disparity between various groups of students. We propose a general formulation of contrast rules as well as a framework for finding such patterns. We apply this technique to an online educational system developed at Michigan State University called LONCAPA.