Discovering significant patterns
2007
Cited by 54 (4 self)
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and, when applied to real-world data, result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
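As a rough illustration of how an upper limit on experimentwise error can be enforced, the sketch below applies a Bonferroni adjustment, one well-established multiple-testing practice. Whether it matches the paper's exact procedure is an assumption; the function name, the rule labels, and the p-values are hypothetical, and the p-values are taken as given rather than computed from data.

```python
def bonferroni_filter(p_values, alpha=0.05, search_space_size=None):
    """Keep only the patterns whose p-value survives a Bonferroni adjustment.

    p_values: dict mapping a pattern label to its unadjusted p-value.
    alpha: desired upper limit on the experimentwise (familywise) error risk.
    search_space_size: number of patterns that were considered in total.
        Assumption: defaults to the number of p-values supplied, which
        understates the adjustment if more patterns were explored.
    """
    m = search_space_size if search_space_size is not None else len(p_values)
    if m <= 0:
        raise ValueError("search space size must be positive")
    threshold = alpha / m  # each test is run at level alpha / m
    return {label: p for label, p in p_values.items() if p <= threshold}


# Hypothetical p-values for three candidate rules.
candidates = {"rule_a": 0.0001, "rule_b": 0.02, "rule_c": 0.04}
accepted = bonferroni_filter(candidates, alpha=0.05)
```

Dividing by the size of the whole search space, rather than by the number of patterns finally reported, is what keeps the bound valid when many patterns were examined and discarded along the way.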
Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining
Journal of Machine Learning Research
Cited by 46 (0 self)
This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task, and by exploring the apparent differences between the approaches. It also shows that various rule learning heuristics used in CSM, EPM and SD algorithms all aim at optimizing a trade-off between rule coverage and precision. The commonalities (and differences) between the approaches are showcased on a selection of best-known variants of CSM, EPM and SD algorithms. The paper also provides a critical survey of existing supervised descriptive rule discovery visualization methods.
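To make the coverage/precision trade-off concrete, the sketch below computes weighted relative accuracy (WRAcc), a heuristic widely used in subgroup discovery. It is offered as one example of such a trade-off, not as the survey's own formulation; the function name and the counts are hypothetical.

```python
def wracc(n_total, n_positive, n_covered, n_covered_positive):
    """Weighted relative accuracy of a rule 'condition -> positive class'.

    coverage  = fraction of all examples the rule condition covers
    precision = fraction of covered examples that are positive
    prior     = fraction of all examples that are positive
    WRAcc     = coverage * (precision - prior)
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_covered == 0:
        return 0.0  # a rule that covers nothing carries no evidence
    coverage = n_covered / n_total
    precision = n_covered_positive / n_covered
    prior = n_positive / n_total
    return coverage * (precision - prior)


# Hypothetical counts: 100 examples, 40 positive; the rule covers 20, 16 of them positive.
score = wracc(n_total=100, n_positive=40, n_covered=20, n_covered_positive=16)
```

A rule with high precision but tiny coverage scores low, as does a rule with broad coverage whose precision barely exceeds the class prior.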
Discovering Associations With Numeric Variables
In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001
Cited by 39 (7 self)
This paper further develops Aumann and Lindell's [3] proposal for a variant of association rules for which the consequent is a numeric variable. It is argued that these rules can discover useful interactions with numeric data that cannot be discovered directly using traditional association rules with discretization. Alternative measures for identifying interesting rules are proposed. Efficient algorithms are presented that enable these rules to be discovered for dense data sets for which application of Aumann and Lindell's algorithm is infeasible.
Pruning Redundant Association Rules Using Maximum Entropy Principle
In Advances in Knowledge Discovery and Data Mining, 6th Pacific-Asia Conference, PAKDD’02, 2002
Cited by 22 (4 self)
Data mining algorithms produce huge sets of rules, practically impossible to analyze manually. It is thus important to develop methods for removing redundant rules from those sets. We present a solution to the problem using the Maximum Entropy approach. The problem of efficiency of Maximum Entropy computations is addressed by using closed form solutions for the most frequent cases. Analytical and experimental evaluation of the proposed technique indicates that it efficiently produces small sets of interesting association rules.
Mining rank-correlated sets of numerical attributes
In KDD’06, 2006
Cited by 20 (2 self)
We study the mining of interesting patterns in the presence of numerical attributes. Instead of the usual discretization methods, we propose the use of rank-based measures to score the similarity of sets of numerical attributes. New support measures for numerical data are introduced, based on extensions of Kendall’s tau, and Spearman’s Footrule and rho. We show how these support measures are related. Furthermore, we introduce a novel type of pattern combining numerical and categorical attributes. We give efficient algorithms to find all frequent patterns for the proposed support measures, and evaluate their performance on real-life datasets.
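The idea of a rank-based support measure can be sketched in the spirit of Kendall's tau: count the pairs of rows that are ordered the same way on every attribute in the set. The definition below is a simplified assumption for illustration and may differ from the paper's exact measures; the function name and the data are hypothetical.

```python
from itertools import combinations


def concordant_pair_support(rows):
    """Fraction of unordered row pairs that are strictly concordant.

    rows: sequence of tuples, one value per attribute in the attribute set.
    A pair is concordant when one row is strictly smaller than the other
    on every attribute (ties count as not concordant in this sketch).
    """
    rows = list(rows)
    if len(rows) < 2:
        raise ValueError("need at least two rows")
    concordant = 0
    total = 0
    for s, t in combinations(rows, 2):
        total += 1
        s_below_t = all(a < b for a, b in zip(s, t))
        t_below_s = all(a > b for a, b in zip(s, t))
        if s_below_t or t_below_s:
            concordant += 1
    return concordant / total


# Hypothetical data with two numerical attributes per row.
data = [(1, 10), (2, 20), (3, 15)]
support_value = concordant_pair_support(data)
```

This brute-force version is quadratic in the number of rows; it only illustrates the measure and says nothing about the efficient algorithms the abstract refers to.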
An Evolutionary Algorithm to Discover Numeric Association Rules
In Proceedings of the ACM Symposium on Applied Computing, SAC’2002, 2002
Cited by 20 (0 self)
Association rules are one of the most used tools to discover relationships among attributes in a database. Nowadays, there are many efficient techniques to obtain these rules, although most of them require that the values of the attributes be discrete. To solve this problem, these techniques discretize the numeric attributes, but this implies a loss of information. In general, these techniques work in two phases: in the first, they try to find the sets of attributes that occur, with a determined frequency, within the database (frequent itemsets), and in the second, they extract the association rules from these sets. In this paper we present a technique to find the frequent itemsets in numeric databases without needing to discretize the attributes. We use an evolutionary algorithm to find the intervals of each attribute that conform a frequent itemset. The evaluation function itself decides the amplitude of these intervals. Finally, we evaluate the tool with synthetic and real databases to check the efficiency of our algorithm.
Relative Unsupervised Discretization for Association Rule Mining
Principles of Data Mining and Knowledge Discovery, D.A. Zighed, H.J. Komorowski and J.M. Zytkow, eds., LNCS 1910, 2000
Cited by 18 (1 self)
The paper describes a new, context-sensitive discretization algorithm that can be used to completely discretize a numeric or mixed numeric-categorical dataset. The method combines aspects of unsupervised (class-blind) and supervised methods. The central idea in the algorithm is what might be called "mutual structure projection" between the (numeric or categorical) attributes. The goal is to discretize a numeric attribute into intervals that correlate as much as possible with patterns in the value distributions of the other attributes. This is achieved by finding points of distribution changes, mapping them onto the target attribute, and subsequently clustering these points; the result is a set of significant split points that define the interval boundaries of the attribute discretization. This process can be performed for each numeric attribute in a dataset, thereby producing discretizations that reflect potentially complex interrelationships among different attributes of the dataset. The algorithm
Mining multidimensional constrained gradients in data cubes
In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB 2001), 2001
Cited by 18 (5 self)
In recent years, there have been growing interests in multidimensional analysis of relational databases, transactional
Mining quantitative correlated patterns using an information-theoretic approach
In KDD, 2006
Cited by 16 (3 self)
Existing research on mining quantitative databases mainly focuses on mining associations. However, mining associations is too expensive to be practical in many cases. In this paper, we study mining correlations from quantitative databases and show that it is a more effective approach than mining associations. We propose a new notion of Quantitative Correlated Patterns (QCPs), which is founded on two formal concepts, mutual information and all-confidence. We first devise a normalization on mutual information and apply it to QCP mining to capture the dependency between the attributes. We further adopt all-confidence as a quality measure to control, at a finer granularity, the dependency between the attributes with specific quantitative intervals. We also propose a supervised method to combine the consecutive intervals of the quantitative attributes based on mutual information, such that the interval combining is guided by the dependency between the attributes. We develop an algorithm, QCoMine, to efficiently mine QCPs by utilizing normalized mutual information and all-confidence to perform a two-level pruning. Our experiments verify the efficiency of QCoMine and the quality of the QCPs.
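For readers unfamiliar with all-confidence, the sketch below computes it in its usual itemset form: the support of the whole set divided by the largest support of any single member. It works on plain categorical items, not on the quantitative intervals QCoMine handles, and the function names and transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def all_confidence(transactions, itemset):
    """Support of the itemset divided by the largest single-item support.

    This equals the smallest confidence of any rule that can be formed
    from the itemset, so a high value means every member implies the rest.
    """
    joint = support(transactions, itemset)
    largest_single = max(support(transactions, {item}) for item in itemset)
    if largest_single == 0:
        return 0.0  # no member ever occurs, so there is nothing to measure
    return joint / largest_single


# Hypothetical transactions over three items.
baskets = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
score_ab = all_confidence(baskets, {"a", "b"})
```

Because the denominator can only stay the same or grow as items are added while the joint support can only shrink, the measure is anti-monotone, which is what makes it usable for pruning.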
Finding Regional Co-location Patterns for Sets of Continuous Variables, under review
Cited by 13 (9 self)
This paper proposes a novel framework for mining regional co-location patterns with respect to sets of continuous variables in spatial datasets. The goal is to identify regions in which multiple continuous variables with values from the wings of their statistical distribution are co-located. A co-location mining framework is introduced that operates in the continuous domain without the need for discretization and which views regional co-location mining as a clustering problem in which an externally given fitness function has to be maximized. Interestingness of co-location patterns is assessed using products of z-scores of the relevant continuous variables. The proposed framework is evaluated by a domain expert in a case study that analyzes chemical concentrations in Texas water wells centering on co-location patterns involving Arsenic. Our approach was able to identify known and unknown regional co-location patterns, and different sets of algorithm parameters lead to the characterization of arsenic distribution at different scales. Moreover, inconsistent co-location sets were found for regions in South Texas and West Texas that can be clearly attributed to geological differences in the two regions, emphasizing the need for regional co-location mining techniques. Moreover, a novel, prototype-based region discovery algorithm named CLEVER is introduced that uses randomized hill climbing, and searches a variable number of clusters and larger neighborhood sizes. Keywords: spatial data mining, regional co-location mining, regional data mining, clustering, finding associations between continuous variables.
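The z-score product used to assess interestingness can be illustrated with a minimal per-object sketch. This is a simplification under stated assumptions: the paper's region-level aggregation and its handling of high and low wings are not reproduced, and the function names, variable names, and measurements are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values using the population mean and standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-scores are undefined for a constant variable")
    return [(v - mu) / sigma for v in values]


def z_score_products(columns):
    """Per-object product of z-scores across the given continuous variables.

    columns: dict mapping a variable name to its list of values, with all
    lists aligned by object. A large positive product means the object is
    extreme in the same direction on all variables (for two variables).
    """
    per_variable = [z_scores(col) for col in columns.values()]
    products = []
    for object_scores in zip(*per_variable):
        product = 1.0
        for z in object_scores:
            product *= z
        products.append(product)
    return products


# Hypothetical measurements of two variables at three sample locations.
measurements = {"var_a": [1.0, 2.0, 3.0], "var_b": [2.0, 4.0, 6.0]}
interest = z_score_products(measurements)
```

Objects near the mean of any variable contribute a product close to zero, so only locations where every variable sits in a wing of its distribution stand out.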