From Local Patterns to Global Models: The LeGo Approach to Data Mining
Cited by 19 (5 self)
Abstract. In this paper we present LeGo, a generic framework that utilizes existing local pattern mining techniques for global modeling in a variety of diverse data mining tasks. In the spirit of well-known KDD process models, our work identifies different phases within the data mining step, each of which is formulated in terms of different formal constraints. It starts with a phase of mining patterns that are individually promising. Later phases establish the context given by the global data mining task by selecting groups of diverse and highly informative patterns, which are finally combined into one or more global models that address the overall data mining task(s). The paper discusses the connection to various learning techniques, and illustrates that our framework is broad enough to cover and leverage frequent pattern mining, subgroup discovery, pattern teams, multiview learning, and several other popular algorithms. The Safarii learning toolbox serves as a proof-of-concept of its high potential for practical data mining applications. Finally, we point out several challenging open research questions that naturally emerge in a constraint-based local-to-global pattern mining, selection, and combination framework.
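The local-to-global pipeline the abstract describes (mine individually promising patterns, then select a diverse group for the global task) can be sketched as follows. This is a toy illustration, not the Safarii implementation; all function names and the greedy coverage-based selection are ours:

```python
from itertools import combinations

def mine_local(db, min_support):
    """Phase 1: mine individually promising patterns (here: frequent itemsets)."""
    items = sorted({i for t in db for i in t})
    patterns = []
    for k in (1, 2):  # toy: only consider itemsets up to size 2
        for cand in combinations(items, k):
            support = sum(set(cand) <= t for t in db)
            if support >= min_support:
                patterns.append(frozenset(cand))
    return patterns

def select_diverse(db, patterns, k):
    """Phase 2: greedily pick patterns whose covers add new transactions."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(patterns,
                   key=lambda p: len({i for i, t in enumerate(db) if p <= t} - covered),
                   default=None)
        if best is None:
            break
        chosen.append(best)
        covered |= {i for i, t in enumerate(db) if best <= t}
        patterns = [p for p in patterns if p != best]
    return chosen

db = [frozenset("abc"), frozenset("ab"), frozenset("cd"), frozenset("ad")]
local = mine_local(db, min_support=2)   # phase 1: candidate patterns
team = select_diverse(db, local, k=2)   # phase 2: a small, diverse "pattern team"
```

Phase 3 of the framework would then combine the selected patterns into a global model, e.g. by using them as features.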
Directly Mining Descriptive Patterns
SIAM SDM, 2012
Cited by 15 (4 self)
Mining small, useful, and high-quality sets of patterns has recently become an important topic in data mining. The standard approach is to first mine many candidates, and then to select a good subset. However, the pattern explosion generates such enormous amounts of candidates that by post-processing it is virtually impossible to analyse dense or large databases in any detail. We introduce Slim, an anytime algorithm for mining high-quality sets of itemsets directly from data. We use MDL to identify the best set of itemsets as that set that describes the data best. To approximate this optimum, we iteratively use the current solution to determine which itemset would provide the most gain, estimating quality using an accurate heuristic. Without requiring a pre-mined candidate collection, Slim is parameter-free in both theory and practice. Experiments show we mine high-quality pattern sets; while evaluating orders of magnitude fewer candidates than our closest competitor, Krimp, we obtain much better compression ratios, closely approximating the locally optimal strategy. Classification experiments independently verify that we characterise data very well.
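The MDL idea behind Slim and Krimp, that the best set of itemsets is the one that compresses the data best, can be illustrated with a minimal sketch. This toy only computes the data cost L(D|CT) from Shannon-optimal code lengths over greedy covers; the real algorithms also charge for the code table itself and use a far more careful cover order and gain estimate:

```python
import math
from collections import Counter

def cover(transaction, code_table):
    """Greedily cover a transaction with code-table itemsets (listed longest first)."""
    remaining, used = set(transaction), []
    for itemset in code_table:
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    return used

def encoded_length(db, code_table):
    """Total Shannon-optimal code length of the database under the code table."""
    usage = Counter()
    for t in db:
        usage.update(cover(t, code_table))
    total = sum(usage.values())
    return sum(-u * math.log2(u / total) for u in usage.values())

# Items a and b always co-occur, so the itemset {a, b} should pay off.
db = [frozenset("ab")] * 4 + [frozenset("c")] * 2
singletons = [frozenset(x) for x in "abc"]
base = encoded_length(db, singletons)
better = encoded_length(db, [frozenset("ab")] + singletons)
```

Adding the candidate itemset {a, b} to the code table shortens the encoding, which is exactly the gain signal such algorithms optimise.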
Finding good itemsets by packing data
In ICDM, 2008
Cited by 12 (7 self)
The problem of selecting small groups of itemsets that represent the data well has recently gained a lot of attention. We approach the problem by searching for the itemsets that compress the data efficiently. As a compression technique we use decision trees combined with a refined version of MDL. More formally, assuming that the items are ordered, we create a decision tree for each item that may only depend on the previous items. Our approach allows us to find complex interactions between the attributes, not just co-occurrences of 1s. Further, we present a link between the itemsets and the decision trees and use this link to export the itemsets from the decision trees. In this paper we present two algorithms. The first one is a simple greedy approach that builds a family of itemsets directly from data. The second one, given a collection of candidate itemsets, selects a small subset of these itemsets. Our experiments show that these approaches result in compact and high-quality descriptions of the data.
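The core intuition, that predicting an item from the previous items compresses it better than encoding it in isolation, can be shown with a one-level "decision tree" (a single split). This is our simplification, not the paper's refined MDL coding:

```python
import math
from collections import Counter

def entropy_bits(column):
    """Empirical entropy (bits per row) of a binary column."""
    n = len(column)
    return sum(-c / n * math.log2(c / n) for c in Counter(column).values() if c)

def conditional_bits(column, given):
    """Entropy of `column` after splitting rows on `given` (a one-level tree)."""
    n = len(column)
    bits = 0.0
    for v in set(given):
        sub = [x for x, g in zip(column, given) if g == v]
        bits += len(sub) / n * entropy_bits(sub)
    return bits

# Item b nearly copies item a, so a tree that splits on a compresses b well.
a = [1, 1, 1, 1, 0, 0, 0, 0]
b = [1, 1, 1, 0, 0, 0, 0, 1]
marginal = entropy_bits(b)        # cost of encoding b alone
split = conditional_bits(b, a)    # cost after conditioning on a
```

A full tree on several previous items captures the "complex interactions" the abstract mentions; the split that saves the most bits corresponds to an interesting itemset.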
Low-Entropy Set Selection
Cited by 8 (4 self)
Most pattern discovery algorithms easily generate very large numbers of patterns, making the results impossible to understand and hard to use. Recently, the problem of instead selecting a small subset of informative patterns from a large collection of patterns has attracted a lot of interest. In this paper we present a succinct way of representing data on the basis of itemsets that identify strong interactions. This new approach, LESS, provides a more powerful and more general technique for data description than existing approaches. Low-entropy sets consider the data symmetrically and as such identify strong interactions between attributes, not just between items that are present. Selection of these patterns is executed using the MDL criterion. This results in only a handful of sets that together form a compact lossless description of the data. By using entropy-based elements for the data description, we can successfully apply the maximum likelihood principle to locally cover the data optimally. Further, it allows for a fast, natural and well-performing heuristic. Based on these approaches we present two algorithms that provide high-quality descriptions of the data in terms of strongly interacting variables. Experiments on these methods show that high-quality results are mined: very small pattern sets are returned that are easily interpretable and understandable descriptions of the data, and can be straightforwardly visualized. Swap randomization experiments and high compression ratios show that they capture the structure of the data well.
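A low-entropy set is a group of attributes whose joint 0/1 distribution is highly skewed, so it treats presence and absence symmetrically. A minimal sketch of scoring and ranking attribute sets by joint entropy (the ranking step only; LESS's MDL selection and covering are not reproduced here):

```python
import math
from itertools import combinations
from collections import Counter

def joint_entropy(rows, attrs):
    """Entropy of the joint 0/1 distribution over a set of attributes."""
    n = len(rows)
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    return sum(-c / n * math.log2(c / n) for c in counts.values())

def low_entropy_sets(rows, k, top):
    """Rank all k-attribute sets by joint entropy, lowest (most skewed) first."""
    attrs = range(len(rows[0]))
    scored = sorted(combinations(attrs, k), key=lambda s: joint_entropy(rows, s))
    return scored[:top]

# Attributes 0 and 1 always agree (including shared 0s); attribute 2 is noise.
rows = [(0, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]
strongest = low_entropy_sets(rows, k=2, top=1)
```

Note that a frequent-itemset miner would miss the interaction between attributes 0 and 1 in the rows where both are 0; the symmetric entropy score captures it.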
One in a million: picking the right patterns
Knowledge and Information Systems, 2008
Cited by 5 (0 self)
Constrained pattern mining extracts patterns based on their individual merit. Usually this results in far more patterns than a human expert or a machine learning technique could make use of. Often different patterns or combinations of patterns cover a similar subset of the examples, thus being redundant and not carrying any new information. To remove the redundant information contained in such pattern sets, we propose two general heuristic algorithms, Bouncer and Picker, for selecting a small subset of patterns. We identify several selection techniques for use in this general algorithm and evaluate those on several data sets. The results show that both techniques succeed in severely reducing the number of patterns, while at the same time apparently retaining much of the original information. Additionally, the experiments show that reducing the pattern set indeed improves the quality of classification results. Both results show that the developed solutions are very well suited for the goals we aim at.
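One simple selection technique in this spirit is to drop any pattern whose example cover is too similar to that of an already-kept pattern. The sketch below is our illustration of the idea, not the actual Bouncer or Picker algorithm:

```python
def jaccard(a, b):
    """Overlap of two example covers (sets of example indices)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def pick_patterns(covers, max_overlap):
    """Keep a pattern only if its cover is not too similar to any kept cover.
    Patterns are visited largest-cover first."""
    kept = []
    for name, cov in sorted(covers.items(), key=lambda kv: -len(kv[1])):
        if all(jaccard(cov, covers[k]) <= max_overlap for k in kept):
            kept.append(name)
    return kept

covers = {
    "p1": {0, 1, 2, 3},  # broad pattern
    "p2": {0, 1, 2},     # covers nearly the same examples: redundant
    "p3": {4, 5},        # genuinely new information
}
selected = pick_patterns(covers, max_overlap=0.5)
```

The redundant pattern p2 is bounced while the complementary p3 survives, shrinking the set without losing coverage.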
Pattern-Based Classification: A Unifying Perspective
Cited by 5 (1 self)
Abstract. The use of patterns in predictive models is a topic that has received a lot of attention in recent years. Pattern mining can help to obtain models for structured domains, such as graphs and sequences, and has been proposed as a means to obtain more accurate and more interpretable models. Despite the large number of publications devoted to this topic, we believe, however, that an overview of what has been accomplished in this area is missing. This paper presents our perspective on this evolving area. We identify the principles of pattern mining that are important when mining patterns for models and provide an overview of pattern-based classification methods. We categorize these methods along the following dimensions: (1) whether they post-process a precomputed set of patterns or iteratively execute pattern mining algorithms; (2) whether they select patterns model-independently or whether the pattern selection is guided by a model. We summarize the results that have been obtained for each of these methods.
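The first category, post-processing a precomputed pattern set, typically turns each pattern into a binary feature and hands the feature vectors to any standard learner. A minimal sketch with a trivial lookup classifier standing in for the model (all names are ours):

```python
def to_features(transaction, patterns):
    """Binary feature vector: does each mined pattern occur in the example?"""
    return tuple(int(p <= transaction) for p in patterns)

def train(examples, labels, patterns):
    """Post-processing route: precomputed patterns become features for a
    majority-label-per-feature-vector classifier (a stand-in for any model)."""
    table = {}
    for x, y in zip(examples, labels):
        table.setdefault(to_features(x, patterns), []).append(y)
    return {f: max(set(ys), key=ys.count) for f, ys in table.items()}

patterns = [frozenset("ab"), frozenset("c")]  # patterns mined beforehand
X = [frozenset("abx"), frozenset("ab"), frozenset("cy"), frozenset("c")]
y = ["pos", "pos", "neg", "neg"]
model = train(X, y, patterns)
```

In the iterative category, by contrast, the mining step itself would be re-run with feedback from the model, so the pattern set and the classifier are built jointly.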
Constraint-Based Mining of Sets of Cliques Sharing Vertex Properties
In ACNE'10, 2010
Cited by 4 (0 self)
Abstract. We consider data mining methods on large graphs where a set of labels is associated to each vertex. A typical example of such graphs is a social network of collaborating researchers where additional information represents the main publication targets (preferred conferences or journals) for each author. We investigate the extraction of sets of dense subgraphs such that the vertices in all subgraphs of a set share a large enough set of labels. As a first step, we consider here the special case of dense subgraphs that are cliques. We propose a method to compute all maximal homogeneous clique sets that satisfy user-defined constraints on the number of separated cliques, on the size of the cliques, and on the number of labels shared by all the vertices. The empirical validation illustrates the scalability of our approach and provides experimental feedback on two real datasets, more precisely an annotated social network derived from the DBLP database and an enriched biological network of protein-protein interactions. In both cases, we discuss the relevancy of extracted patterns thanks to available domain knowledge.
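The constraints involved (clique structure, minimum size, minimum number of shared labels) can be made concrete with a brute-force enumeration over a toy co-authorship graph. This is purely illustrative; the paper's method is designed to scale, which naive enumeration does not:

```python
from itertools import combinations

def homogeneous_cliques(edges, labels, min_size, min_shared):
    """Enumerate vertex sets that are cliques and share enough labels."""
    vertices = sorted(labels)
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    found = []
    for k in range(min_size, len(vertices) + 1):
        for group in combinations(vertices, k):
            is_clique = all(v in adj[u] for u, v in combinations(group, 2))
            shared = set.intersection(*(labels[v] for v in group))
            if is_clique and len(shared) >= min_shared:
                found.append((group, shared))
    return found

# Toy collaboration graph; labels are each author's publication venues.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
labels = {1: {"KDD"}, 2: {"KDD"}, 3: {"KDD", "ICDM"}, 4: {"ICDM"}}
result = homogeneous_cliques(edges, labels, min_size=3, min_shared=1)
```

Only the triangle {1, 2, 3} qualifies: it is a clique and all three authors share the venue label "KDD".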
The blind men and the elephant: On meeting the problem of multiple truths in data from clustering and pattern mining perspectives
2013
Non-redundant Subgroup Discovery in Large and Complex Data
Cited by 2 (0 self)
Abstract. Large and complex data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with Subgroup Discovery and its generalisation, Exceptional Model Mining. To address this, we introduce subgroup set mining: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these strategies in a beam search, the balance between exploration and exploitation is improved. Experiments clearly show that the proposed methods result in much more diverse subgroup sets than traditional Subgroup Discovery methods.
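A bare-bones beam search for Subgroup Discovery looks roughly as follows, using the standard WRAcc quality measure; the paper's redundancy-eliminating selection strategies are precisely what this naive version lacks, so similar subgroups can crowd the beam:

```python
def wracc(rows, labels, conds):
    """Weighted relative accuracy of the subgroup defined by `conds`,
    a tuple of (attribute_index, required_value) pairs."""
    n = len(rows)
    in_sub = [all(rows[i][a] == v for a, v in conds) for i in range(n)]
    cover = sum(in_sub)
    if cover == 0:
        return 0.0
    pos_rate = sum(labels) / n
    sub_pos = sum(l for l, m in zip(labels, in_sub) if m)
    return cover / n * (sub_pos / cover - pos_rate)

def beam_search(rows, labels, width, depth):
    """Level-wise beam search: extend each kept subgroup by one condition."""
    universe = [(a, v) for a in range(len(rows[0])) for v in (0, 1)]
    beam = [()]  # start from the empty (whole-data) description
    for _ in range(depth):
        cands = {tuple(sorted(set(b) | {c})) for b in beam for c in universe}
        beam = sorted(cands, key=lambda s: -wracc(rows, labels, s))[:width]
    return beam[0]

# Attribute 0 perfectly predicts the binary target.
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = [1, 1, 0, 0]
best = beam_search(rows, labels, width=2, depth=1)
```

Subgroup set mining would replace the plain top-width cut with a diversity-aware selection over the candidate set.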
Characterizing Uncertain Data using Compression
Cited by 2 (1 self)
Motivated by sensor networks, mobility data, biology and life sciences, the area of mining uncertain data has recently received a great deal of attention. While various papers have focused on efficiently mining frequent patterns from uncertain data, the problem of discovering a small set of interesting patterns that provide an accurate and condensed description of a probabilistic database is still unexplored. In this paper we study the problem of discovering characteristic patterns in uncertain data through information-theoretic lenses. Adopting the possible worlds interpretation of probabilistic data and a compression scheme based on the MDL principle, we formalize the problem of mining patterns that compress the database well in expectation. Despite its huge search space, we show that this problem can be accurately approximated. In particular, we devise a sequence of three methods where each new method improves the memory requirements by orders of magnitude compared to its predecessor, while giving up only a little in terms of approximation accuracy. We empirically compare our methods on both synthetic data and real data from the life sciences. Results show that from a probabilistic matrix with more than one million rows and columns, we can extract a small set of meaningful patterns that accurately characterize the data distribution of any probable world.
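Under the possible-worlds model, a basic building block of such "in expectation" quantities is the expected support of a pattern. Assuming independent item probabilities per transaction (a common probabilistic-database model, and our simplification of the paper's setting), it is a simple sum of products:

```python
from math import prod

def expected_support(prob_db, itemset):
    """Expected support over all possible worlds: for each transaction,
    the probability that every item of the pattern is present."""
    return sum(prod(t.get(i, 0.0) for i in itemset) for t in prob_db)

# Each transaction maps items to their presence probabilities.
prob_db = [
    {"a": 0.9, "b": 0.8},
    {"a": 0.5, "b": 0.5, "c": 1.0},
    {"c": 0.4},
]
exp_sup = expected_support(prob_db, {"a", "b"})  # 0.72 + 0.25 + 0.0
```

An MDL-style objective for uncertain data would then weigh such expected occurrence statistics against the description cost of the chosen patterns.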