Discovering significant patterns
, 2007
Cited by 61 (4 self)
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
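The abstract does not name the statistical practices it applies. As an illustration only, the sketch below assumes a Bonferroni correction, one standard way to bound experimentwise error: each of m candidate patterns is tested at level alpha / m. The function name and the p-values are hypothetical.

```python
def bonferroni_filter(p_values, alpha=0.05):
    """Return indices of patterns that pass a Bonferroni-adjusted threshold.

    Assumption: each candidate pattern already has a p-value from some
    per-pattern test; this function only applies the multiple-testing bound.
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # experimentwise error is then at most alpha
    return [i for i, p in enumerate(p_values) if p <= threshold]


# Hypothetical p-values for four candidate patterns.
accepted = bonferroni_filter([0.0001, 0.02, 0.04, 0.0004], alpha=0.05)
print(accepted)
```

With four candidates the per-pattern threshold is 0.0125, so patterns that would pass an uncorrected 0.05 test can still be rejected.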
Extracting redundancy-aware top-k patterns
 In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, 2006
Cited by 36 (3 self)
Observed in many applications, there is a potential need of extracting a small set of frequent patterns having not only high significance but also low redundancy. The significance is usually defined by the context of applications. Previous studies have been concentrating on how to compute top-k significant patterns or how to remove redundancy among patterns separately. There is limited work on finding those top-k patterns which demonstrate high significance and low redundancy simultaneously. In this paper, we study the problem of extracting redundancy-aware top-k patterns from a large collection of frequent patterns. We first examine the evaluation functions for measuring the combined significance of a pattern set and propose the MMS (Maximal Marginal Significance) as the problem formulation. The problem is known as NP-hard. We further present a greedy algorithm which approximates the optimal solution with performance bound O(log k) (with conditions on redundancy), where k is the number of reported patterns. The direct usage of redundancy-aware top-k patterns is illustrated through two real applications: disk block prefetch and document theme extraction. Our method can also be applied to processing redundancy-aware top-k queries in traditional database.
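A greedy selection of this kind might look like the sketch below. The abstract does not give the gain function, so the code assumes a pattern's marginal gain is its significance minus its highest redundancy with any pattern already chosen. The Jaccard-based redundancy measure and the example patterns are hypothetical, and this is not the paper's exact procedure.

```python
def greedy_top_k(significance, redundancy, k):
    """Greedily pick up to k patterns with high significance and low redundancy.

    significance: dict mapping pattern -> score (assumed given).
    redundancy:   function (p, q) -> overlap penalty (assumed given).
    """
    selected = []
    candidates = list(significance)
    while candidates and len(selected) < k:
        def gain(p):
            # Penalise p by its worst-case overlap with the chosen set.
            penalty = max((redundancy(p, q) for q in selected), default=0.0)
            return significance[p] - penalty
        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical example: patterns are strings of single-character items.
sig = {"ab": 0.9, "abc": 0.85, "xy": 0.5}

def jaccard_redundancy(p, q):
    a, b = set(p), set(q)
    return len(a & b) / len(a | b) * min(sig[p], sig[q])

print(greedy_top_k(sig, jaccard_redundancy, k=2))
```

In the example, "abc" scores higher than "xy" on its own but overlaps heavily with "ab", so the second pick is "xy".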
Tell me what I need to know: Succinctly summarizing data with itemsets
 In Proc. KDD
, 2011
Cited by 17 (3 self)
Data analysis is an inherently iterative process. That is, what we know about the data greatly determines our expectations, and hence, what result we would find the most interesting. With this in mind, we introduce a well-founded approach for succinctly summarizing data with a collection of itemsets; using a probabilistic maximum entropy model, we iteratively find the most interesting itemset, and in turn update our model of the data accordingly. As we only include itemsets that are surprising with regard to the current model, the summary is guaranteed to be both descriptive and non-redundant. The algorithm that we present can either mine the top-k most interesting itemsets, or use the Bayesian Information Criterion to automatically identify the model containing only the itemsets most important for describing the data. Or, in other words, it will ‘tell you what you need to know’. Experiments on synthetic and benchmark data show that the discovered summaries are succinct, and correctly identify the key patterns in the data. The models they form attain high likelihoods, and inspection shows that they summarize the data well with increasingly specific, yet non-redundant itemsets.
GraphRank: Statistical Modeling and Mining of Significant Subgraphs in the Feature Space
, 2006
Cited by 17 (1 self)
We propose a technique for evaluating the statistical significance of frequent subgraphs in a database. A graph is represented by a feature vector that is a histogram over a set of basis elements. The set of basis elements is chosen based on domain knowledge and consists generally of vertices, edges, or small graphs. A given subgraph is transformed to a feature vector and the significance of the subgraph is computed by considering the significance of occurrence of the corresponding vector. The probability of occurrence of the vector in a random vector is computed based on the prior probability of the basis elements. This is then used to obtain a probability distribution on the support of the vector in a database of random vectors. The statistical significance of the vector/subgraph is then defined as the p-value of its observed support. We develop efficient methods for computing p-values and lower bounds. A simplified model is further proposed to improve the efficiency. We also address the problem of feature vector mining, a generalization of itemset mining where counts are associated with items and the goal is to find significant subvectors. We present an algorithm that explores closed frequent subvectors to find significant ones. Experimental results show that the proposed techniques are effective, efficient, and useful for ranking frequent subgraphs by their statistical significance.
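The p-value of an observed support can be sketched as follows. The code assumes the random vectors are independent, so the support follows a binomial distribution; the abstract does not state the distribution. Here p is the probability that the vector occurs in one random vector, n is the database size, and the function name is hypothetical.

```python
from math import comb


def support_p_value(observed_support, n, p):
    """P(support >= observed_support) under a Binomial(n, p) null model.

    Assumption: each of the n database vectors contains the pattern
    independently with probability p.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(observed_support, n + 1)
    )


# Hypothetical numbers: a pattern expected in 1% of vectors is seen 8 times in 100.
print(support_p_value(8, 100, 0.01))
```

A smaller p-value means the observed support is harder to explain by chance, which gives a natural ranking criterion.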
Sampling-Based Sequential Subgroup Mining
, 2005
Cited by 13 (6 self)
Subgroup discovery is a learning task that aims at finding interesting rules from classified examples. The search is guided by a utility function, trading off the coverage of rules against their statistical unusualness. One shortcoming of existing approaches is that they do not incorporate prior knowledge. To this end a novel generic sampling strategy is proposed. It allows to turn pattern mining into an iterative process. In each iteration the focus of subgroup discovery lies on those patterns that are unexpected with respect to prior knowledge and previously discovered patterns. The result of this technique is a small diverse set of understandable rules that characterise a specified property of interest. As another contribution this article derives a simple connection between subgroup discovery and classifier induction. For a popular utility function this connection allows to apply any standard rule induction algorithm to the task of subgroup discovery after a step of stratified resampling. The proposed techniques are empirically compared to state of the art subgroup discovery algorithms.
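The stratified resampling step could be sketched as below. The abstract does not say which class proportions the resampling should produce, so equal-sized strata drawn with replacement are an assumption here, and the function and parameter names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_resample(examples, labels, per_class, seed=0):
    """Draw per_class examples (with replacement) from every class.

    Assumption: equal-sized strata; the target proportions are not
    given in the abstract.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    sample = []
    for y, members in by_class.items():
        sample.extend((x, y) for x in rng.choices(members, k=per_class))
    return sample


# Hypothetical skewed data: six negatives, two positives.
data = list(range(8))
labels = [0, 0, 0, 0, 0, 0, 1, 1]
balanced = stratified_resample(data, labels, per_class=4)
print(balanced)
```

The resampled set can then be passed to any standard rule learner, which is the usage the abstract describes.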
Rule Interestingness Analysis Using OLAP Operations
Cited by 12 (3 self)
The problem of interestingness of discovered rules has been investigated by many researchers. The issue is that data mining algorithms often generate too many rules, which makes it very hard for the user to find the interesting ones. Over the years many techniques have been proposed. However, few have made it to real-life applications. Since August 2004, we have been working on a major application for Motorola. The objective is to find causes of cellular phone call failures from a large amount of usage log data. Class association rules have been shown to be suitable for this type of diagnostic data mining application. We were also able to put several existing interestingness methods to the test, which revealed some major shortcomings. One of the main problems is that most existing methods treat rules individually. However, we discovered that users seldom regard a single rule to be interesting by itself. A rule is only interesting in the context of some other rules. Furthermore, in many cases, each individual rule may not be interesting, but a group of them together can represent an important piece of knowledge. This led us to discover a deficiency of the current rule mining paradigm. Using nonzero minimum support and nonzero minimum confidence eliminates a large amount of context information, which makes rule analysis difficult. This paper proposes a novel approach to deal with all of these issues, which casts rule analysis as OLAP operations and general impression mining. This approach enables the user to explore the knowledge space to find useful knowledge easily and systematically. It also provides a natural framework for visualization. As evidence of its effectiveness, our system, called Opportunity Map, based on these ideas has been deployed, and it is in daily use in Motorola for finding actionable knowledge from its engineering and other types of data sets.
Finding good itemsets by packing data
 In ICDM
, 2008
Cited by 12 (7 self)
The problem of selecting small groups of itemsets that represent the data well has recently gained a lot of attention. We approach the problem by searching for the itemsets that compress the data efficiently. As a compression technique we use decision trees combined with a refined version of MDL. More formally, assuming that the items are ordered, we create a decision tree for each item that may only depend on the previous items. Our approach allows us to find complex interactions between the attributes, not just co-occurrences of 1s. Further, we present a link between the itemsets and the decision trees and use this link to export the itemsets from the decision trees. In this paper we present two algorithms. The first one is a simple greedy approach that builds a family of itemsets directly from data. The second one, given a collection of candidate itemsets, selects a small subset of these itemsets. Our experiments show that these approaches result in compact and high quality descriptions of the data.
Classifying without Discriminating
Cited by 11 (5 self)
Classification models usually make predictions on the basis of training data. If the training data is biased towards certain groups or classes of objects, e.g., there is racial discrimination towards black people, the learned model will also show discriminatory behavior towards that particular community. This partial attitude of the learned model may lead to biased outcomes when labeling future unlabeled data objects. Often, however, impartial classification results are desired or even required by law for future data objects in spite of having biased training data. In this paper, we tackle this problem by introducing a new classification scheme for learning unbiased models on biased training data. Our method is based on massaging the dataset by making the least intrusive modifications which lead to an unbiased dataset. On this modified dataset we then learn a non-discriminating classifier. The proposed method has been implemented and experimental results on a credit approval dataset show promising results: in all experiments our method is able to reduce the prejudicial behavior for future classification significantly without losing too much predictive accuracy.
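One way to picture "massaging" is the sketch below. The abstract does not define the bias measure or how records are chosen for modification, so everything here is an assumption. Bias is taken as the difference in positive-label rates between groups. A precomputed `score` field stands in for a ranker, and labels are swapped in pairs (one promotion, one demotion) so the overall class balance stays fixed. All names are hypothetical.

```python
def discrimination(records):
    """Positive rate of the favored group minus that of the protected group."""
    def rate(group):
        return sum(r["label"] for r in group) / len(group)
    protected = [r for r in records if r["protected"]]
    favored = [r for r in records if not r["protected"]]
    return rate(favored) - rate(protected)


def massage(records, n_swaps):
    """Return a copy with up to n_swaps promotion/demotion label pairs flipped.

    Assumption: 'score' ranks how close a record is to the positive class,
    so the flipped labels are those nearest the decision boundary.
    """
    out = [dict(r) for r in records]
    promote = sorted((r for r in out if r["protected"] and r["label"] == 0),
                     key=lambda r: r["score"], reverse=True)
    demote = sorted((r for r in out if not r["protected"] and r["label"] == 1),
                    key=lambda r: r["score"])
    m = min(n_swaps, len(promote), len(demote))
    for r in promote[:m]:
        r["label"] = 1  # promote the highest-scoring protected negatives
    for r in demote[:m]:
        r["label"] = 0  # demote the lowest-scoring favored positives
    return out


def make(protected, label, score):
    return {"protected": protected, "label": label, "score": score}

# Hypothetical toy data: four protected and four favored records.
records = [make(True, 1, 0.9), make(True, 0, 0.6), make(True, 0, 0.3), make(True, 0, 0.1),
           make(False, 1, 0.9), make(False, 1, 0.8), make(False, 1, 0.55), make(False, 0, 0.2)]
print(discrimination(records), discrimination(massage(records, 1)))
```

A classifier would then be trained on the massaged copy; the original records are left untouched.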
Self-Sufficient Itemsets: An Approach to Screening Potentially Interesting Associations Between Items
Cited by 9 (0 self)
Self-sufficient itemsets are those whose frequency cannot be explained solely by the frequency of either their subsets or of their supersets. We argue that itemsets that are not self-sufficient will often be of little interest to the data analyst, as their frequency should be expected once that of the itemsets on which their frequency depends is known. We present statistical tests for statistically sound discovery of self-sufficient itemsets, and computational techniques that allow those tests to be applied as a postprocessing step for any itemset discovery algorithm. We also present a measure for assessing the degree of potential interest in an itemset that complements these statistical measures.
Mampaey M. Using background knowledge to rank itemsets. Data Min Knowl Discov 2010;21:293–309
Cited by 6 (1 self)
Assessing the quality of discovered results is an important open problem in data mining. Such assessment is particularly vital when mining itemsets, since commonly many of the discovered patterns can be easily explained by background knowledge. The simplest approach to screen uninteresting patterns is to compare the observed frequency against the independence model. Since the parameters for the independence model are the column margins, we can view such screening as a way of using the column margins as background knowledge. In this paper we study techniques for more flexible approaches for infusing background knowledge. Namely, we show that we can efficiently use additional knowledge such as row margins, lazarus counts, and bounds of ones. We demonstrate that these statistics describe forms of data that occur in practice and have been studied in data mining. To infuse the information efficiently we use a maximum entropy approach. In its general setting, solving a maximum entropy model is infeasible, but we demonstrate that for our setting it can be solved in polynomial time. Experiments show that more sophisticated models fit the data better and that using more information improves the frequency prediction of itemsets.
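The baseline screening step, comparing an itemset's observed frequency with the independence model, can be sketched as follows. Under independence the expected frequency is the product of the column margins of the items. The function name and the toy transactions are hypothetical, and the richer maximum entropy models from the abstract are not covered.

```python
def observed_vs_independence(transactions, itemset):
    """Return (observed frequency, frequency expected under independence)."""
    n = len(transactions)
    observed = sum(1 for t in transactions if itemset <= t) / n
    expected = 1.0
    for item in itemset:
        # Column margin: fraction of transactions containing this item.
        expected *= sum(1 for t in transactions if item in t) / n
    return observed, expected


# Hypothetical toy database of four transactions.
db = [{"a", "b"}, {"a", "b"}, {"a"}, {"c"}]
obs, exp = observed_vs_independence(db, {"a", "b"})
print(obs, exp)
```

An itemset whose observed frequency is close to the expected one is already explained by the column margins and would be screened out as uninteresting.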