Results 1–10 of 130
Efficiently mining long patterns from databases, 1998
"... We present a patternmining algorithm that scales roughly linearly in the number of maximal patterns embedded in a database irrespective of the length of the longest pattern. In comparison, previous algorithms based on Apriori scale exponentially with longest pattern length. Experiments on real data ..."
Abstract

Cited by 465 (3 self)
 Add to MetaCart
(Show Context)
We present a pattern-mining algorithm that scales roughly linearly in the number of maximal patterns embedded in a database, irrespective of the length of the longest pattern. In comparison, previous algorithms based on Apriori scale exponentially with longest pattern length. Experiments on real data show that when the patterns are long, our algorithm is more efficient by an order of magnitude or more. Since every frequent itemset is a subset of a maximal frequent itemset, MaxMiner's output implicitly and concisely represents all frequent itemsets. MaxMiner is shown to result in two or more orders of magnitude in performance improvements over Apriori on some datasets. On other datasets where the patterns are not so long, the gains are more modest. In practice, MaxMiner is demonstrated to run in time that is roughly linear in the number of maximal frequent itemsets and the size of the database, irrespective of the size of the longest frequent itemset.
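The abstract's central claim, that the maximal frequent itemsets concisely represent all frequent itemsets, can be illustrated with a brute-force sketch. This is not the MaxMiner algorithm itself; the tiny transaction database and threshold are hypothetical:

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Enumerate all frequent itemsets by brute force (exponential; illustration only)."""
    items = sorted({i for t in transactions for i in t})
    frequent = []
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            if sum(1 for t in transactions if set(cand) <= t) >= minsup:
                frequent.append(frozenset(cand))
    return frequent

def maximal_only(frequent):
    """Keep only itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
freq = frequent_itemsets(transactions, minsup=2)
maxi = maximal_only(freq)
# Every frequent itemset is a subset of some maximal frequent itemset,
# so `maxi` implicitly represents all of `freq`.
assert all(any(f <= m for m in maxi) for f in freq)
```

Here six itemsets are frequent but only three are maximal, which is the compression the abstract refers to.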
Efficient Mining of Emerging Patterns: Discovering Trends and Differences, 1999
"... We introduce a new kind of patterns, called emerging patterns (EPs), for knowledge discovery from databases. EPs are defined as itemsets whose supports increase significantly from one dataset to another. EPs can capture emerging trends in timestamped databases, or useful contrasts between data clas ..."
Abstract

Cited by 341 (38 self)
 Add to MetaCart
(Show Context)
We introduce a new kind of pattern, called emerging patterns (EPs), for knowledge discovery from databases. EPs are defined as itemsets whose supports increase significantly from one dataset to another. EPs can capture emerging trends in timestamped databases, or useful contrasts between data classes. EPs have been proven useful: we have used them to build very powerful classifiers, which are more accurate than C4.5 and CBA, for many datasets. We believe that EPs with low to medium support, such as 1% to 20%, can give useful new insights and guidance to experts, even in "well understood" applications. The efficient mining of EPs is a challenging problem, since (i) the Apriori property no longer holds for EPs, and (ii) there are usually too many candidates for high-dimensional databases or for small support thresholds such as 0.5%. Naive algorithms are too costly. To solve this problem, (a) we promote the description of large collections of itemsets using their concise borders (the pa...
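The growth-rate test behind EPs follows directly from the definition and can be sketched in a few lines. The two toy datasets are hypothetical, and the paper's border-based mining machinery is omitted entirely:

```python
def support(itemset, dataset):
    """Fraction of transactions in `dataset` that contain `itemset`."""
    return sum(1 for t in dataset if itemset <= t) / len(dataset)

def growth_rate(itemset, d1, d2):
    """Support ratio from d1 to d2; infinite when the itemset appears only in d2."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0:
        return float("inf") if s2 > 0 else 0.0
    return s2 / s1

d1 = [{"a"}, {"a", "b"}, {"c"}, {"c"}]
d2 = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"b"}]
# {a, b} grows from 25% support in d1 to 75% in d2, a growth rate of 3:
rate = growth_rate(frozenset({"a", "b"}), d1, d2)
```

An EP is then any itemset whose growth rate exceeds a chosen threshold; the hard part, which the abstract highlights, is that this measure is not antimonotone, so Apriori-style pruning does not apply.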
Data privacy through optimal k-anonymization. In ICDE, 2005
"... Data deidentification reconciles the demand for release of data for research purposes and the demand for privacy from individuals. This paper proposes and evaluates an optimization algorithm for the powerful deidentification procedure known as kanonymization. A kanonymized dataset has the proper ..."
Abstract

Cited by 337 (3 self)
 Add to MetaCart
(Show Context)
Data de-identification reconciles the demand for release of data for research purposes and the demand for privacy from individuals. This paper proposes and evaluates an optimization algorithm for the powerful de-identification procedure known as k-anonymization. A k-anonymized dataset has the property that each record is indistinguishable from at least k − 1 others. Even simple restrictions of optimized k-anonymity are NP-hard, leading to significant computational challenges. We present a new approach to exploring the space of possible anonymizations that tames the combinatorics of the problem, and develop data-management strategies to reduce reliance on expensive operations such as sorting. Through experiments on real census data, we show the resulting algorithm can find optimal k-anonymizations under two representative cost measures and a wide range of k. We also show that the algorithm can produce good anonymizations in circumstances where the input data or input parameters preclude finding an optimal solution in reasonable time. Finally, we use the algorithm to explore the effects of different coding approaches and problem variations on anonymization quality and performance. To our knowledge, this is the first result demonstrating optimal k-anonymization of a non-trivial dataset under a general model of the problem.
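While finding an optimal generalization is NP-hard, checking the k-anonymity property itself is straightforward. A minimal sketch, with hypothetical quasi-identifiers and already-generalized values:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True iff every record shares its quasi-identifier values with at least k-1 others."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

records = [
    {"zip": "537**", "age": "20-29", "disease": "flu"},
    {"zip": "537**", "age": "20-29", "disease": "cold"},
    {"zip": "537**", "age": "30-39", "disease": "flu"},
    {"zip": "537**", "age": "30-39", "disease": "flu"},
]
# Each (zip, age) combination occurs twice, so this table is
# 2-anonymous but not 3-anonymous over those quasi-identifiers.
```

The optimization problem the paper attacks is choosing the generalizations (here, the starred zip codes and age ranges) that achieve this property at minimum cost.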
MAFIA: A maximal frequent itemset algorithm for transactional databases. In ICDE, 2001
"... We present a new algorithm for mining maximal frequent itemsets from a transactional database. Our algorithm is especially efficient when the itemsets in the database are very long. The search strategy of our algorithm integrates a depthfirst traversal of the itemset lattice with effective pruning ..."
Abstract

Cited by 312 (3 self)
 Add to MetaCart
(Show Context)
We present a new algorithm for mining maximal frequent itemsets from a transactional database. Our algorithm is especially efficient when the itemsets in the database are very long. The search strategy of our algorithm integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms. Our implementation of the search strategy combines a vertical bitmap representation of the database with an efficient relative bitmap compression scheme. In a thorough experimental analysis of our algorithm on real data, we isolate the effect of the individual components of the algorithm. Our performance numbers show that our algorithm outperforms previous work by a factor of three to five.
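The vertical bitmap idea can be sketched with Python integers standing in for uncompressed bitmaps; the paper's relative bitmap compression and pruning mechanisms are omitted, and the data is hypothetical:

```python
from functools import reduce

def vertical_bitmaps(transactions):
    """Map each item to a bitmap in which bit i is set iff transaction i contains the item."""
    bitmaps = {}
    for i, t in enumerate(transactions):
        for item in t:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << i)
    return bitmaps

def itemset_support(itemset, bitmaps):
    """AND the member bitmaps together; the support is the popcount of the result."""
    combined = reduce(lambda a, b: a & b, (bitmaps[i] for i in itemset))
    return bin(combined).count("1")

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
bm = vertical_bitmaps(transactions)
sup_ab = itemset_support({"a", "b"}, bm)  # {a, b} occurs in transactions 0 and 2
```

Counting support becomes a bitwise AND plus a popcount rather than a scan over transactions, which is what makes the vertical layout attractive for the depth-first search the abstract describes.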
CLOSET+: Searching for the best strategies for mining frequent closed itemsets, 2003
"... Mining frequent closed itemsets provides complete and nonredundant results for frequent pattern analysis. Extensive studies have proposed various strategies for efficient frequent closed itemset mining, such as depthfirst search vs. breadthfirst search, vertical formats vs. horizontal formats, tree ..."
Abstract

Cited by 183 (20 self)
 Add to MetaCart
Mining frequent closed itemsets provides complete and non-redundant results for frequent pattern analysis. Extensive studies have proposed various strategies for efficient frequent closed itemset mining, such as depth-first search vs. breadth-first search, vertical formats vs. horizontal formats, tree structures vs. other data structures, top-down vs. bottom-up traversal, pseudo-projection vs. physical projection of the conditional database, etc. It is the right time to ask "what are the pros and cons of these strategies?" and "how can we pick and integrate the best strategies to achieve higher performance in general cases?" In this study, we answer these questions through a systematic study of the search strategies and develop a winning algorithm, CLOSET+. CLOSET+ integrates the advantages of previously proposed effective strategies as well as some newly developed ones. A thorough performance study on synthetic and real data sets has shown the advantages of the strategies and the improvement of CLOSET+ over existing mining algorithms, including CLOSET, CHARM and OP, in terms of runtime, memory usage and scalability.
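The closedness property that all of these algorithms exploit can be stated in a few lines: an itemset is closed iff no proper superset has the same support, and by support monotonicity it suffices to test single-item extensions. A naive check over hypothetical data (not CLOSET+'s search strategy):

```python
def support(itemset, transactions):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def is_closed(itemset, transactions, all_items):
    """Closed iff no proper superset has the same support; by monotonicity it
    suffices to test single-item extensions."""
    s = support(itemset, transactions)
    return all(support(itemset | {x}, transactions) < s for x in all_items - itemset)

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
items = {"a", "b", "c"}
# {b} is not closed here: b always occurs together with a,
# so {a, b} has the same support and subsumes it.
```

Reporting only closed itemsets is what makes the output non-redundant: every frequent itemset's support is recoverable from its smallest closed superset.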
Constraint-based rule mining in large, dense databases, 1999
"... Constraintbased rule miners find all rules in a given dataset meeting userspecified constraints such as minimum support and confidence. We describe a new algorithm that directly exploits all userspecified constraints including minimum support, minimum confidence, and a new constraint that ensures ..."
Abstract

Cited by 173 (3 self)
 Add to MetaCart
Constraint-based rule miners find all rules in a given dataset meeting user-specified constraints such as minimum support and confidence. We describe a new algorithm that directly exploits all user-specified constraints, including minimum support, minimum confidence, and a new constraint that ensures every mined rule offers a predictive advantage over any of its simplifications. Our algorithm maintains efficiency even at low supports on data that is dense (e.g. relational data). Previous approaches such as Apriori and its variants exploit only the minimum support constraint, and as a result are ineffective on dense data due to a combinatorial explosion of “frequent itemsets”.
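The "predictive advantage over simplifications" constraint can be sketched as a minimum-improvement test on rule confidence. This is a naive check over hypothetical transactions, not the paper's mining algorithm:

```python
from itertools import combinations

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent), estimated from the transactions."""
    matches = [t for t in transactions if antecedent <= t]
    if not matches:
        return 0.0
    return sum(1 for t in matches if consequent <= t) / len(matches)

def has_predictive_advantage(antecedent, consequent, transactions, min_improvement=0.0):
    """True iff the rule's confidence beats every proper simplification of its
    antecedent (including the empty antecedent) by more than min_improvement."""
    conf = confidence(antecedent, consequent, transactions)
    for r in range(len(antecedent)):
        for sub in combinations(antecedent, r):
            if conf - confidence(frozenset(sub), consequent, transactions) <= min_improvement:
                return False
    return True

txs = [{"a", "b", "c"}, {"a", "b", "c"}, {"a"}, {"b"}, {"c"}]
ok = has_predictive_advantage(frozenset({"a", "b"}), frozenset({"c"}), txs)
```

A rule that fails this test is discarded because a shorter rule already predicts the consequent as well, which is the pruning leverage the abstract claims on dense data.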
Embedding Defaults into Terminological Knowledge Representation Formalisms. Journal of Automated Reasoning, 1995
"... We consider the problem of integrating Reiter's default logic into terminological representation systems. It turns out that such an integration is less straightforward than we expected, considering the fact that the terminological language is a decidable sublanguage of firstorder logic. Semant ..."
Abstract

Cited by 156 (8 self)
 Add to MetaCart
(Show Context)
We consider the problem of integrating Reiter's default logic into terminological representation systems. It turns out that such an integration is less straightforward than we expected, considering the fact that the terminological language is a decidable sublanguage of first-order logic. Semantically, one has the unpleasant effect that the consequences of a terminological default theory may be rather unintuitive, and may even vary with the syntactic structure of equivalent concept expressions. This is due to the unsatisfactory treatment of open defaults via Skolemization in Reiter's semantics. On the algorithmic side, we show that this treatment may lead to an undecidable default consequence relation, even though our base language is decidable and we have only finitely many (open) defaults. Because of these problems, we then consider a restricted semantics for open defaults in our terminological default theories: default rules are only applied to individuals that are explicitly presen...
Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 2001
"... A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g. freshman students in 1993 through 1998. We present the problem of mini ..."
Abstract

Cited by 108 (3 self)
 Add to MetaCart
A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g. freshman students in 1993 through 1998. We present the problem of mining contrast sets: conjunctions of attributes and values that differ meaningfully in their distribution across groups. We provide a search algorithm for mining contrast sets with pruning rules that drastically reduce the computational complexity. Once the contrast sets are found, we post-process the results to present a subset that is surprising to the user given what we have already shown. We explicitly control the probability of Type I error (false positives) and guarantee a maximum error rate for the entire analysis by using Bonferroni corrections.
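The support comparison and the Bonferroni step are both simple arithmetic; a minimal sketch with hypothetical groups and a hypothetical candidate count (the significance test itself, a chi-square test in this setting, is omitted):

```python
def group_support(conjunction, group):
    """Fraction of records in the group matching the conjunction of attribute values."""
    return sum(1 for r in group if conjunction <= r) / len(group)

g1 = [{"male", "cs"}, {"male", "math"}, {"female", "cs"}]
g2 = [{"female", "math"}, {"female", "math"}, {"male", "math"}]
# How differently is the candidate contrast set {math} distributed across groups?
deviation = abs(group_support(frozenset({"math"}), g1)
                - group_support(frozenset({"math"}), g2))

# Bonferroni correction: testing m candidate contrast sets at family-wise
# error level alpha means running each individual test at alpha / m.
alpha, m = 0.05, 100
per_test_alpha = alpha / m
```

Dividing alpha across all tests is what lets the analysis bound the probability of even one false positive, at the cost of reduced power as the candidate count m grows.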
Depth First Generation of Long Patterns, 2000
"... In this paper we present an algorithm for mining long patterns in databases. The algorithm finds large itemsets by using depth first search on a lexicographic tree of itemsets. The focus of this paper is to develop CPUefficient algorithms for finding frequent itemsets in the cases when the database ..."
Abstract

Cited by 99 (2 self)
 Add to MetaCart
(Show Context)
In this paper we present an algorithm for mining long patterns in databases. The algorithm finds large itemsets by using depth-first search on a lexicographic tree of itemsets. The focus of this paper is to develop CPU-efficient algorithms for finding frequent itemsets in cases where the database contains patterns that are very wide. We refer to this algorithm as DepthProject; it achieves more than an order of magnitude speedup over the recently proposed MaxMiner algorithm for finding long patterns. These techniques may be quite useful for applications in areas such as computational biology, in which the number of records is relatively small but the itemsets are very long. This necessitates the discovery of patterns using algorithms especially tailored to the nature of such domains.
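A depth-first walk of the lexicographic itemset tree, pruning any branch whose prefix is infrequent, can be sketched as follows. This is only the generic traversal that DepthProject builds on, not its projection-based counting; the data is hypothetical:

```python
def dfs_frequent(transactions, items, minsup, prefix=frozenset(), start=0, out=None):
    """Depth-first traversal of the lexicographic tree of itemsets: a frequent
    prefix is extended only with lexicographically later items, and an
    infrequent candidate prunes its entire subtree."""
    if out is None:
        out = []
    for i in range(start, len(items)):
        cand = prefix | {items[i]}
        if sum(1 for t in transactions if cand <= t) >= minsup:
            out.append(cand)
            dfs_frequent(transactions, items, minsup, cand, i + 1, out)
    return out

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
found = dfs_frequent(transactions, ["a", "b", "c"], minsup=2)
```

Because each node only extends with later items, every itemset is visited exactly once, and the depth-first order is what makes long patterns reachable quickly.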
OPUS: An efficient admissible algorithm for unordered search. Journal of Artificial Intelligence Research, 1995
"... OPUS is a branch and bound search algorithm that enables efficient admissible search through spaces for which the order of search operator application is not significant. The algorithm’s search efficiency is demonstrated with respect to very large machine learning search spaces. The use of admissibl ..."
Abstract

Cited by 91 (14 self)
 Add to MetaCart
(Show Context)
OPUS is a branch-and-bound search algorithm that enables efficient admissible search through spaces for which the order of search operator application is not significant. The algorithm's search efficiency is demonstrated with respect to very large machine learning search spaces. The use of admissible search is of potential value to the machine learning community, as it means that the exact learning biases to be employed for complex learning tasks can be precisely specified and manipulated. OPUS also has potential for application in other areas of artificial intelligence, notably truth maintenance.
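Branch-and-bound with an admissible (optimistic) bound over an unordered space can be illustrated on a toy problem, maximizing the sum of a chosen subset of weights. This is a stand-in illustration, not OPUS itself:

```python
def best_subset(weights):
    """Maximize the sum over subsets of `weights`. Enumerating by index keeps the
    unordered space free of permutation duplicates; the optimistic bound (add every
    remaining positive weight) never underestimates, so pruning on it is admissible:
    no optimal solution is ever cut off."""
    best = [0.0]  # score of the empty subset is the initial incumbent

    def recurse(start, current):
        best[0] = max(best[0], current)
        for i in range(start, len(weights)):
            optimistic = current + sum(w for w in weights[i:] if w > 0)
            if optimistic <= best[0]:
                return  # the bound only shrinks for later i, so stop this level
            recurse(i + 1, current + weights[i])

    recurse(0, 0.0)
    return best[0]
```

The two OPUS-flavored ingredients here are the index ordering, which visits each unordered combination exactly once, and the admissible bound, which guarantees the pruned search still returns the true optimum.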