Results 1  10
of
472
Mining Frequent Patterns without Candidate Generation: A FrequentPattern Tree Approach
 DATA MINING AND KNOWLEDGE DISCOVERY
, 2004
"... Mining frequent patterns in transaction databases, timeseries databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriorilike candidate set generationandtest approach. However, candidate set generation is still co ..."
Abstract

Cited by 1154 (56 self)
 Add to MetaCart
Mining frequent patterns in transaction databases, timeseries databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriorilike candidate set generationandtest approach. However, candidate set generation is still costly, especially when there exist a large number of patterns and/or long patterns. In this study, we propose a novel
frequentpattern tree
(FPtree) structure, which is an extended prefixtree
structure for storing compressed, crucial information about frequent patterns, and develop an efficient FPtree
based mining method, FPgrowth, for mining the complete set of frequent patterns by pattern fragment growth.
Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed,
smaller data structure, FPtree which avoids costly, repeated database scans, (2) our FPtreebased mining adopts
a patternfragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a
partitioningbased, divideandconquer method is used to decompose the mining task into a set of smaller tasks for
mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance
study shows that the FPgrowth method is efficient and scalable for mining both long and short frequent patterns,
and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported
new frequentpattern mining methods
Automatic Subspace Clustering of High Dimensional Data
 Data Mining and Knowledge Discovery
, 2005
"... Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, enduser comprehensibility of the results, nonpresumption of any canonical data distribution, and insensitivity to the or ..."
Abstract

Cited by 562 (12 self)
 Add to MetaCart
Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, enduser comprehensibility of the results, nonpresumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
Beyond Market Baskets: Generalizing Association Rules To Dependence Rules
, 1998
"... One of the more wellstudied problems in data mining is the search for association rules in market basket data. Association rules are intended to identify patterns of the type: “A customer purchasing item A often also purchases item B. Motivated partly by the goal of generalizing beyond market bask ..."
Abstract

Cited by 495 (7 self)
 Add to MetaCart
One of the more wellstudied problems in data mining is the search for association rules in market basket data. Association rules are intended to identify patterns of the type: “A customer purchasing item A often also purchases item B. Motivated partly by the goal of generalizing beyond market basket data and partly by the goal of ironing out some problems in the definition of association rules, we develop the notion of dependence rules that identify statistical dependence in both the presence and absence of items in itemsets. We propose measuring significance of dependence via the chisquared test for independence from classical statistics. This leads to a measure that is upwardclosed in the itemset lattice, enabling us to reduce the mining problem to the search for a border between dependent and independent itemsets in the lattice. We develop pruning strategies based on the closure property and thereby devise an efficient algorithm for discovering dependence rules. We demonstrate our algorithm’s effectiveness by testing it on census data, text data (wherein we seek term dependence), and synthetic data.
Efficiently mining long patterns from databases
, 1998
"... We present a patternmining algorithm that scales roughly linearly in the number of maximal patterns embedded in a database irrespective of the length of the longest pattern. In comparison, previous algorithms based on Apriori scale exponentially with longest pattern length. Experiments on real data ..."
Abstract

Cited by 385 (3 self)
 Add to MetaCart
We present a patternmining algorithm that scales roughly linearly in the number of maximal patterns embedded in a database irrespective of the length of the longest pattern. In comparison, previous algorithms based on Apriori scale exponentially with longest pattern length. Experiments on real data show that when the patterns are long, our algorithm is more efficient by an order of magnimaximal frequent itemset, MaxMiner’s output implicitly and concisely represents all frequent itemsets. MaxMiner is shown to result in two or more orders of magnitude in performance improvements over Apriori on some datasets. On other datasets where the patterns are not so long, the gains are more modest. In practice, MaxMiner is demonstrated to run in time that is roughly linear in the number of maximal frequent itemsets and the size of the database, irrespective of the size of the longest frequent itemset. tude or more. 1.
From data mining to knowledge discovery in databases
 AI Magazine
, 1996
"... ■ Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases ..."
Abstract

Cited by 302 (0 self)
 Add to MetaCart
■ Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular realworld applications, specific datamining techniques, challenges involved in realworld applications of knowledge discovery, and current and future research directions in the field. Across a wide variety of fields, data are
Discovery of frequent episodes in event sequences
 Data Min. Knowl. Discov
, 1997
"... Abstract. Sequences of events describing the behavior and actions of users or systems can be collected in several domains. An episode is a collection of events that occur relatively close to each other in a given partial order. We consider the problem of discovering frequently occurring episodes in ..."
Abstract

Cited by 300 (14 self)
 Add to MetaCart
Abstract. Sequences of events describing the behavior and actions of users or systems can be collected in several domains. An episode is a collection of events that occur relatively close to each other in a given partial order. We consider the problem of discovering frequently occurring episodes in a sequence. Once such episodes are known, one can produce rules for describing or predicting the behavior of the sequence. We give efficient algorithms for the discovery of all frequent episodes from a given class of episodes, and present detailed experimental results. The methods are in use in telecommunication alarm management. Keywords: event sequences, frequent episodes, sequence analysis 1.
SPADE: An efficient algorithm for mining frequent sequences
 Machine Learning
, 2001
"... Abstract. In this paper we present SPADE, a new algorithm for fast discovery of Sequential Patterns. The existing solutions to this problem make repeated database scans, and use complex hash structures which have poor locality. SPADE utilizes combinatorial properties to decompose the original proble ..."
Abstract

Cited by 276 (16 self)
 Add to MetaCart
Abstract. In this paper we present SPADE, a new algorithm for fast discovery of Sequential Patterns. The existing solutions to this problem make repeated database scans, and use complex hash structures which have poor locality. SPADE utilizes combinatorial properties to decompose the original problem into smaller subproblems, that can be independently solved in mainmemory using efficient lattice search techniques, and using simple join operations. All sequences are discovered in only three database scans. Experiments show that SPADE outperforms the best previous algorithm by a factor of two, and by an order of magnitude with some preprocessed data. It also has linear scalability with respect to the number of inputsequences, and a number of other database parameters. Finally, we discuss how the results of sequence mining can be applied in a real application domain.
MAFIA: A maximal frequent itemset algorithm for transactional databases
 In ICDE
, 2001
"... We present a new algorithm for mining maximal frequent itemsets from a transactional database. Our algorithm is especially efficient when the itemsets in the database are very long. The search strategy of our algorithm integrates a depthfirst traversal of the itemset lattice with effective pruning ..."
Abstract

Cited by 242 (3 self)
 Add to MetaCart
We present a new algorithm for mining maximal frequent itemsets from a transactional database. Our algorithm is especially efficient when the itemsets in the database are very long. The search strategy of our algorithm integrates a depthfirst traversal of the itemset lattice with effective pruning mechanisms. Our implementation of the search strategy combines a vertical bitmap representation of the database with an efficient relative bitmap compression schema. In a thorough experimental analysis of our algorithm on real data, we isolate the effect of the individual components of the algorithm. Our performance numbers show that our algorithm outperforms previous work by a factor of three to five. 1
Mining Association Rules with Item Constraints
"... The problem of discovering association rules has received considerable research attention and several fast algorithms for mining association rules have been developed. In practice, users are often interested in a subset of association rules. For example, they may only want rules that contain a speci ..."
Abstract

Cited by 235 (0 self)
 Add to MetaCart
The problem of discovering association rules has received considerable research attention and several fast algorithms for mining association rules have been developed. In practice, users are often interested in a subset of association rules. For example, they may only want rules that contain a specific item or rules that contain children of a specific item in a hierarchy. While such constraints can be applied as a postprocessing step, integrating them into the mining algorithm can dramatically reduce the execution time. We consider the problem of integrating constraints that are boolean expressions over the presence or absence of items into the association discovery algorithm. We present three integrated algorithms for mining association rules with item constraints and discuss their tradeoffs. 1. Introduction The problem of discovering association rules was introduced in (Agrawal, Imielinski, & Swami 1993). Given a set of transactions, where each transaction is a set of literals (call...
A tree projection algorithm for generation of frequent itemsets
 Journal of Parallel and Distributed Computing
, 2000
"... In this paper we propose algorithms for generation of frequent itemsets by successive construction of the nodes of a lexicographic tree of itemsets. We discuss di erent strategies in generation and traversal of the lexicographic tree such as breadth rst search, depth rst search or a combination of ..."
Abstract

Cited by 161 (2 self)
 Add to MetaCart
In this paper we propose algorithms for generation of frequent itemsets by successive construction of the nodes of a lexicographic tree of itemsets. We discuss di erent strategies in generation and traversal of the lexicographic tree such as breadth rst search, depth rst search or a combination of the two. These techniques provide di erent tradeo s in terms of the I/O, memory and computational time requirements. We use the hierarchical structure of the lexicographic tree to successively project transactions at each node of the lexicographic tree, and use matrix counting on this reduced set of transactions for nding frequent itemsets. We tested our algorithm on both real and synthetic data. We provide an implementation of the tree projection method which is up to one order of magnitude faster than other recent techniques in the literature. The algorithm has a well structured data access pattern which provides data locality and reuse of data for multiple levels of the cache. We also discuss methods for parallelization of the