Results 1 - 10 of 23
Discovering significant patterns
2007
Cited by 42 (3 self)
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and, when applied to real-world data, result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
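The abstract above does not name the statistical practices it applies. As an illustration only, one well-established way to bound experimentwise error is the Bonferroni correction, which divides the significance level by the number of patterns tested. The function name, the default `alpha`, and the p-values below are hypothetical and are not taken from the paper.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return indices of patterns that pass a Bonferroni-adjusted threshold.

    Assumption: each candidate pattern already has a p-value from some
    hypothesis test; how those p-values are obtained is outside this sketch.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Testing m patterns at alpha / m keeps the experimentwise error <= alpha.
    threshold = alpha / m
    return [i for i, p in enumerate(p_values) if p <= threshold]


# Hypothetical p-values for four candidate patterns.
print(bonferroni_significant([0.0001, 0.03, 0.2, 0.004]))  # [0, 3]
```

With four patterns the adjusted threshold is 0.0125, so the pattern with p = 0.03 is rejected even though it would pass an uncorrected 0.05 test.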
Logic regression
Journal of Computational and Graphical Statistics, 2003
Cited by 37 (11 self)
Logic regression is an adaptive regression methodology that attempts to construct predictors as Boolean combinations of binary covariates. In many regression problems a model is developed that relates the main effects (the predictors or transformations thereof) to the response, while interactions are usually kept simple (two- to three-way interactions at most). Often, especially when all predictors are binary, the interaction between many predictors may be what causes the differences in response. This issue arises, for example, in the analysis of SNP microarray data or in some data mining problems. In the proposed methodology, given a set of binary predictors we create new predictors such as “X1, X2, X3, and X4 are true,” or “X5 or X6 but not X7 are true.” In more specific terms: we try to fit regression models of the form $g(E[Y]) = b_0 + b_1 L_1 + \cdots + b_n L_n$, where $L_j$ is any Boolean expression of the predictors. The $L_j$ and $b_j$ are estimated simultaneously using a simulated annealing algorithm. This article discusses how to fit logic regression models, how to carry out model selection for these models, and gives some examples.
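To make the model form concrete, the following minimal sketch evaluates the linear predictor b0 + b1·L1 + b2·L2 for the two example Boolean terms quoted in the abstract, assuming an identity link g. The coefficients are made up, the reading of “X5 or X6 but not X7” as (X5 or X6) and not X7 is one interpretation, and the simulated annealing search that estimates the terms and coefficients is not shown.

```python
def logic_regression_predict(x, terms, coefs, intercept):
    """Evaluate b0 + sum_j b_j * L_j(x) for Boolean terms L_j (identity link assumed)."""
    return intercept + sum(b * int(bool(L(x))) for L, b in zip(terms, coefs))


# Boolean terms over a dict of binary covariates (names follow the abstract).
def L1(x):
    return x["X1"] and x["X2"] and x["X3"] and x["X4"]


def L2(x):
    # One reading of "X5 or X6 but not X7 are true".
    return (x["X5"] or x["X6"]) and not x["X7"]


# Hypothetical observation and coefficients, for illustration only.
x = {"X1": True, "X2": True, "X3": True, "X4": True,
     "X5": False, "X6": True, "X7": False}
print(logic_regression_predict(x, [L1, L2], [2.0, -0.5], 1.0))  # 2.5
```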
Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining
Journal of Machine Learning Research
Cited by 26 (0 self)
This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task, and by exploring the apparent differences between the approaches. It also shows that various rule learning heuristics used in CSM, EPM and SD algorithms all aim at optimizing a trade-off between rule coverage and precision. The commonalities (and differences) between the approaches are showcased on a selection of best-known variants of CSM, EPM and SD algorithms. The paper also provides a critical survey of existing supervised descriptive rule discovery visualization methods.
Mining quantitative correlated patterns using an information-theoretic approach
In KDD, 2006
Cited by 11 (1 self)
Existing research on mining quantitative databases mainly focuses on mining associations. However, mining associations is too expensive to be practical in many cases. In this paper, we study mining correlations from quantitative databases and show that it is a more effective approach than mining associations. We propose a new notion of Quantitative Correlated Patterns (QCPs), which is founded on two formal concepts, mutual information and all-confidence. We first devise a normalization on mutual information and apply it to QCP mining to capture the dependency between the attributes. We further adopt all-confidence as a quality measure to control, at a finer granularity, the dependency between the attributes with specific quantitative intervals. We also propose a supervised method to combine the consecutive intervals of the quantitative attributes based on mutual information, such that the interval combining is guided by the dependency between the attributes. We develop an algorithm, QCoMine, to efficiently mine QCPs by utilizing normalized mutual information and all-confidence to perform a two-level pruning. Our experiments verify the efficiency of QCoMine and the quality of the QCPs.
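For readers unfamiliar with all-confidence, the sketch below computes it in its usual form for plain itemsets: the support of the whole itemset divided by the largest support of any single item in it. This is an assumption-laden illustration over binary transactions, not the QCoMine algorithm, and it does not cover the quantitative intervals or the normalized mutual information that the abstract describes. The function name and the transactions are hypothetical.

```python
def all_confidence(transactions, itemset):
    """All-confidence of an itemset: joint support / max single-item support."""
    itemset = set(itemset)
    # Number of transactions containing every item of the itemset.
    joint = sum(1 for t in transactions if itemset <= set(t))
    # Largest number of transactions containing any one item of the itemset.
    max_single = max(sum(1 for t in transactions if item in t) for item in itemset)
    return joint / max_single if max_single else 0.0


# Hypothetical transactions, for illustration only.
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
print(all_confidence(transactions, {"a", "b"}))  # 2 joint / 3 max single
```

A value near 1 means that whenever any item of the set occurs, the whole set tends to occur with it, which is why the measure is useful for pruning weakly dependent patterns.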
Deriving Quantitative Models for Correlation Clusters
In Proc. 12th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, 2006
Cited by 10 (4 self)
Correlation clustering aims at grouping the data set into correlation clusters such that the objects in the same cluster exhibit a certain density and are all associated to a common arbitrarily oriented hyperplane of arbitrary dimensionality. Several algorithms for this task have been proposed recently. However, all algorithms only compute the partitioning of the data into clusters. This is only a first step in the pipeline of advanced data analysis and system modelling. The second (post-clustering) step of deriving a quantitative model for each correlation cluster has not been addressed so far. In this paper, we describe an original approach to handle this second step. We introduce a general method that can extract quantitative information on the linear dependencies within a correlation clustering. Our concepts are independent of the clustering model and can thus be applied as a post-processing step to any correlation clustering algorithm. Furthermore, we show how these quantitative models can be used to predict the probability distribution that an object is created by these models. Our broad experimental evaluation demonstrates the beneficial impact of our method on several applications of significant practical importance.
Analysis of Firewall Policy Rules Using Data Mining Techniques
2006
Cited by 10 (0 self)
The firewall is the de facto core technology of today's network security. However, the management of firewall rules has proven to be complex, error-prone, costly and inefficient for many large-networked organizations. To make firewall policy rules useful and effective, a timely and thorough analysis of network traffic data is often required on a perpetual or periodic basis. This is often economically infeasible or operationally impractical due to the magnitude of data, the complexity of analysis, and its resource requirements. In this paper, we present a set of techniques and algorithms to analyze and manage firewall policy rules: (1) data mining techniques to deduce firewall policy rules by mining the network traffic log with Association Rule Mining (ARM) and Mining firewall Log using Frequency (MLF); (2) Filtering-Rule Generalization (FRG) and a Clustering Algorithm by Gap Analysis (CAGA) to cluster IP addresses or port numbers; (3) the detection of various anomalies, since the mining exposes many hidden anomalies that are not detectable by analyzing the firewall policy rules alone; and (4) a technique to identify decaying and dominant rules, and thus to generate a new set of efficient firewall policy rules.
Discarding insignificant rules during impact rule discovery in large databases
In Proc. SIAM DM, 2005
Cited by 5 (3 self)
Considerable progress has been made on how to reduce the number of spurious exploratory rules with quantitative attributes. However, little has been done for rules with undiscretized quantitative attributes. It is argued that propositional rules cannot effectively describe the interactions between quantitative and qualitative attributes. Aumann and Lindell proposed quantitative association rules to provide a better description of such relationships, together with a rule pruning technique. Since their technique is based on the frequent itemset framework, it is not suitable for rule discovery in large, dense databases. In this paper, an efficient technique for automatically discarding insignificant rules during rule discovery is proposed, based on the OPUS search algorithm. Experiments demonstrate that the algorithm we propose can efficiently remove potentially uninteresting rules even in very large, dense databases.
Real-valued all-dimensions search: Low-overhead rapid searching over subsets of attributes
Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, 2002
Cited by 5 (1 self)
This paper is about searching the combinatorial space of contingency tables during the inner loop of a nonlinear statistical optimization. Examples of this operation in various data analytic communities include searching for nonlinear combinations of attributes that contribute significantly to a regression (Statistics), searching for items to include in a decision list (Machine Learning) and association rule hunting (Data Mining). This paper investigates a new, efficient approach to this class of problems, called RADSEARCH (Real-valued All-Dimensions-tree Search). RADSEARCH finds the global optimum, and this gives us the opportunity to empirically evaluate the question: apart from algorithmic elegance, what does this attention to optimality buy us? We compare RADSEARCH with other recent successful search algorithms such as CN2, PRIM, APriori, OPUS and DenseMiner. Finally, we introduce RADREG, a new regression algorithm for learning real-valued outputs based on RADSEARCHing for high-order interactions.
Self-Sufficient Itemsets: An Approach to Screening Potentially Interesting Associations Between Items
Cited by 5 (0 self)
Self-sufficient itemsets are those whose frequency cannot be explained solely by the frequency of either their subsets or their supersets. We argue that itemsets that are not self-sufficient will often be of little interest to the data analyst, as their frequency should be expected once that of the itemsets on which their frequency depends is known. We present statistical tests for statistically sound discovery of self-sufficient itemsets, and computational techniques that allow those tests to be applied as a post-processing step for any itemset discovery algorithm. We also present a measure for assessing the degree of potential interest in an itemset that complements these statistical measures.
Quantitative and ordinal association rules mining (QAR mining)
In Knowledge-Based Intelligent Information and Engineering Systems, volume 4251 of LNAI, 2006
Cited by 3 (2 self)
Association rules have exhibited an excellent ability to identify interesting association relationships among a set of binary variables describing huge numbers of transactions. Although the rules can be relatively easily generalized to other variable types, the generalization can result in a computationally expensive algorithm generating a prohibitive number of redundant rules of little significance. This danger especially applies to quantitative and ordinal variables. This paper presents and verifies an alternative approach to quantitative and ordinal association rule mining. In this approach, quantitative or ordinal variables are not immediately transformed into a set of binary variables. Instead, it applies simple arithmetic operations in order to construct the cedents and searches for areas of increased association, which are finally decomposed into conjunctions of literals. This scenario outputs rules that do not differ syntactically from classical association rules.