Detecting group differences: Mining contrast sets
- Data Mining and Knowledge Discovery, 2001
Cited by 61 (3 self)
A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g. freshman students in 1993 through 1998. We present the problem of mining contrast sets: conjunctions of attributes and values that differ meaningfully in their distribution across groups. We provide a search algorithm for mining contrast sets with pruning rules that drastically reduce the computational complexity. Once the contrast sets are found, we post-process the results to present a subset that is surprising to the user given what we have already shown. We explicitly control the probability of Type I error (false positives) and guarantee a maximum error rate for the entire analysis by using Bonferroni corrections.
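The abstract does not give the algorithm itself, so the following is only a minimal sketch of the core check it describes: compare the support of one candidate contrast set across two groups and apply a Bonferroni-corrected significance threshold. The function names, the two-group restriction, and the use of a Pearson chi-square test on a 2x2 table are assumptions made for illustration, not details taken from the paper.

```python
import math


def support(rows, contrast_set):
    """Fraction of rows matching every attribute=value pair in contrast_set."""
    if not rows:
        return 0.0
    hits = sum(all(r.get(a) == v for a, v in contrast_set.items()) for r in rows)
    return hits / len(rows)


def chi_square_2x2(hits_a, n_a, hits_b, n_b):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 degree of freedom)."""
    table = [[hits_a, n_a - hits_a], [hits_b, n_b - hits_b]]
    row_totals = [n_a, n_b]
    col_totals = [hits_a + hits_b, (n_a - hits_a) + (n_b - hits_b)]
    total = n_a + n_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def is_significant(p_value, alpha, num_tests):
    """Bonferroni correction: divide the overall alpha by the number of tests run."""
    return p_value < alpha / num_tests


# Hypothetical counts: the candidate set holds for 60 of 100 rows in group A
# and for 30 of 100 rows in group B; 100 candidate sets were tested in total.
stat, p = chi_square_2x2(60, 100, 30, 100)
print(round(stat, 2), is_significant(p, alpha=0.05, num_tests=100))
```

Dividing alpha by the number of tests is the simplest form of the correction; it bounds the chance of any false positive across the whole analysis, at the cost of a stricter threshold for each individual test.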
Interestingness of Frequent Itemsets Using Bayesian Networks as Background Knowledge
- In Proceedings of the SIGKDD Conference on Knowledge Discovery and Data Mining, 2004
Exploring Constraints to Efficiently Mine Emerging Patterns from Large High-dimensional Datasets
, 2000
Cited by 18 (3 self)
Emerging patterns (EPs) were proposed recently to capture changes or differences between datasets: an EP is a multivariate feature whose support increases sharply from a background dataset to a target dataset, and the support ratio is called its growth rate. Interesting long EPs often have low support; mining such EPs from high-dimensional datasets is a great challenge due to the combinatorial explosion of the number of candidates. We propose a Constraint-based EP Miner, ConsEPMiner, that utilizes two types of constraints for effectively pruning the search space: External constraints are user-given minimums on support, growth rate, and growth-rate improvement to confine the resulting EP set. Inherent constraints -- same subset support, top growth rate, and same origin -- are derived from the properties of EPs and datasets, and are solely for pruning the search space and saving computation. ConsEPMiner can efficiently mine all EPs at low support on large high-dimensional datasets, with low minim...
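The growth rate defined above can be sketched as follows. This is not ConsEPMiner: it only computes support and growth rate for a single candidate itemset and checks two of the external constraints (minimum support and minimum growth rate). The function names and the handling of zero background support are assumptions of this sketch.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        return 0.0
    items = frozenset(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def growth_rate(background, target, itemset):
    """Ratio of target support to background support.

    Assumed convention for this sketch: 0 if the itemset appears in neither
    dataset, infinity if it appears only in the target dataset.
    """
    s_bg = support(background, itemset)
    s_tg = support(target, itemset)
    if s_bg == 0.0:
        return 0.0 if s_tg == 0.0 else float("inf")
    return s_tg / s_bg


def is_emerging(background, target, itemset, min_support, min_growth):
    """Check user-given minimums on target support and growth rate."""
    return (support(target, itemset) >= min_support
            and growth_rate(background, target, itemset) >= min_growth)


# Hypothetical toy data.
background = [{"a", "b"}, {"a"}, {"b"}, {"c"}]
target = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"c"}]
print(growth_rate(background, target, {"a", "b"}))
```

A real miner would avoid evaluating every candidate this way; the inherent constraints named in the abstract exist precisely to prune candidates before their support is counted.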
Multivariate Discretization for Set Mining
- Knowledge and Information Systems, 2000
Cited by 14 (0 self)
Many algorithms in data mining can be formulated as a set mining problem where the goal is to find conjunctions (or disjunctions) of terms that meet user-specified constraints. Set mining techniques have been largely designed for categorical or discrete data where variables can only take on a fixed number of values. However, many data sets also contain continuous variables, and a common method of dealing with these is to discretize them by breaking them into ranges. Most discretization methods are univariate and consider only a single feature at a time (sometimes in conjunction with a class variable). We argue that this is a sub-optimal approach for knowledge discovery, as univariate discretization can destroy hidden patterns in data. Discretization should consider the effects on all variables in the analysis, and two regions X and Y should only be in the same interval after discretization if the instances in those regions have similar multivariate distributions (Fx ~ Fy) across all variables and combinations of variables. We present a bottom-up merging algorithm to discretize continuous variables based on this rule. Our experiments indicate that the approach is feasible, that it will not destroy hidden patterns, and that it will generate meaningful intervals.
Automatic Hierarchical E-Mail Classification Using Association Rules
, 2001
Cited by 6 (0 self)
The explosive growth of on-line communication, in particular e-mail communication, makes it necessary to organize the information for faster and easier processing and searching. Storing e-mail messages into hierarchically organized folders, where each folder corresponds to a separate topic, has proven to be very useful. Previous approaches to this problem use Naive Bayes- or TF-IDF-style classifiers that are based on the unrealistic term independence assumption. These methods are also context-insensitive in that the meaning of words is independent of the presence or absence of other words in the same message. It has been shown that text classification methods that deviate from the independence assumption and capture context achieve higher accuracy. In this thesis, we address the problem of term dependence by building an associative classifier called Classification using Cohesion and Multiple Association Rules, or COMAR for short. The problem of context capturing is addressed by looking for phrases in message corpora. Both rules and phrases are generated using an efficient FP-growth-like approach. Since the number of rules and phrases produced can be very large, we propose two new measures, rule cohesion and phrase cohesion, that possess the anti-monotone property, which allows rule and phrase pruning to be pushed deep into the generation process. This approach to pattern pruning proves to be much more efficient than "generate-and-prune" methods. Both unstructured text attributes and semi-structured non-text attributes, such as senders and recipients, are used for the classification. The COMAR classification algorithm uses multiple rules to predict the several highest-probability topics for each message. Different feature selection and rule ranking methods are compared. Our studies show ...
Mining changes of classification by correspondence tracing
- In Proceedings of the 2003 SIAM International Conference on Data Mining (SDM 2003), 2003
Cited by 4 (1 self)
We study the problem of mining changes of classification characteristics as the data changes. Available are an old classifier, representing previous knowledge about classification characteristics, and new data. We want to find the changes of classification characteristics in the new data. An example of such changes is “members with a large family no longer shop frequently, but they used to”. Finding this kind of change holds the key for an organization to adapt to the changed environment and stay ahead of competitors. The challenge is that it is difficult to see what has really changed from comparing the old and new classifiers that could be very large and different. In this paper, we propose a technique to identify such changes. The idea is tracing the characteristics, in the old and new classifiers, that correspond to each other by classifying the same examples. We describe several ways to present changes so that the user can focus on a small number of important ones. We evaluate the proposed method on real-life data sets.
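One way to picture the tracing idea is sketched below, under strong simplifying assumptions that are not taken from the paper: each classifier is an ordered list of hypothetical (name, condition, label) rules, the first matching rule classifies an example, and a "change" is any pair of corresponding rules that assign different labels to the same examples.

```python
from collections import Counter


def first_matching_rule(rules, example):
    """Return the (name, label) of the first rule whose condition holds, else None."""
    for name, condition, label in rules:
        if condition(example):
            return name, label
    return None


def trace_changes(old_rules, new_rules, examples):
    """Count, per (old rule, new rule) pair, the examples they label differently."""
    changed = Counter()
    for example in examples:
        old = first_matching_rule(old_rules, example)
        new = first_matching_rule(new_rules, example)
        if old is None or new is None:
            continue  # example not covered by one of the classifiers
        if old[1] != new[1]:
            changed[(old[0], new[0])] += 1
    # Largest changes first, so a user can focus on a few important ones.
    return changed.most_common()


# Hypothetical rules mirroring the abstract's "large family" example.
old_rules = [
    ("O1", lambda e: e["family_size"] >= 4, "frequent"),
    ("O2", lambda e: True, "infrequent"),
]
new_rules = [
    ("N1", lambda e: e["family_size"] >= 4, "infrequent"),
    ("N2", lambda e: True, "infrequent"),
]
examples = [{"family_size": 5}, {"family_size": 6}, {"family_size": 2}]
print(trace_changes(old_rules, new_rules, examples))
# prints [(('O1', 'N1'), 2)]
```

Here the pair (O1, N1) surfaces because both rules cover the same large-family examples but disagree on the label, which is the kind of correspondence the abstract describes.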
Thesis Proposal
In recent years, database and data mining communities have focused on a new model of data processing, where data arrives in the form of continuous streams. Because it is not feasible to store all data, it is quite challenging to perform the traditional data mining operations, including frequent itemset mining, classification, and clustering, in a streaming environment. Our current

