Parallel and Distributed Association Mining: A Survey
IEEE Concurrency, 1999
Cited by 115 (3 self)
This article presents a survey of the state-of-the-art in parallel and distributed association rule mining (ARM) algorithms. This is direly needed given the importance of association rules to data mining, and given the tremendous amount of research it has attracted in recent years. This article provides a taxonomy of the extant association mining methods, characterizing them according to the database format used, search and enumeration techniques utilized, and depending on whether they enumerate all or only maximal patterns, and their complexity in terms of the number of database scans. The survey clearly lists the design space of the parallel and distributed ARM algorithms based on the platform used (distributed or shared-memory), kind of parallelism exploited (task or data), and the load balancing strategy used (static or dynamic). A large number of parallel and distributed ARM methods are reviewed and grouped into related techniques. It is shown that there are a few dominan...
Parallel Algorithms for Discovery of Association Rules
Data Mining and Knowledge Discovery, 1997
Cited by 54 (6 self)
Discovery of association rules is an important data mining task. Several parallel and sequential algorithms have been proposed in the literature to solve this problem. Almost all of these algorithms make repeated passes over the database to determine the set of frequent itemsets (a subset of database items), thus incurring high I/O overhead. In the parallel case, most algorithms perform a sum-reduction at the end of each pass to construct the global counts, also incurring high synchronization cost. In this paper we describe new parallel association mining algorithms. The algorithms use novel itemset clustering techniques to approximate the set of potentially maximal frequent itemsets. Once this set has been identified, the algorithms make use of efficient traversal techniques to generate the frequent itemsets contained in each cluster. We propose two clustering schemes based on equivalence classes and maximal hypergraph cliques, and study two lattice traversal techniques based on bottom-up and hybrid search. We use a vertical database layout to cluster related transactions together. The database is also selectively replicated so that the portion of the database needed for the computation of associations is local to each processor. After the initial setup phase, the algorithms do not need any further communication or synchronization. The algorithms minimize I/O overheads by scanning the local database portion only twice: once in the setup phase and once when processing the itemset clusters. Unlike previous parallel approaches, the algorithms use simple intersection operations to compute frequent itemsets and
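The vertical layout and intersection-based counting mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `to_vertical` and `support` and the toy transactions are assumptions made for illustration, and transaction ids are simply list positions.

```python
from collections import defaultdict

def to_vertical(transactions):
    """Convert a horizontal database (one item set per transaction)
    into a vertical one: item -> set of transaction ids (tidset)."""
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)
    return dict(tidsets)

def support(itemset, tidsets):
    """Support of an itemset is the size of the intersection
    of the tidsets of its items."""
    items = iter(itemset)
    common = set(tidsets[next(items)])
    for item in items:
        common &= tidsets[item]
    return len(common)

# Hypothetical toy database of four transactions.
transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
tidsets = to_vertical(transactions)
support_ac = support(("a", "c"), tidsets)
support_abc = support(("a", "b", "c"), tidsets)
```

Once the tidsets are built, counting an itemset needs no further pass over the transactions, only set intersections, which is the property the abstract contrasts with hash-structure-based counting.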
Distributed Data Mining: Algorithms, Systems, and Applications
2002
Cited by 49 (4 self)
This paper presents a brief overview of the DDM algorithms, systems, applications, and the emerging research directions. The paper is organized as follows. We first present the related research of DDM and illustrate data distribution scenarios. Then DDM algorithms are reviewed. Subsequently, the architectural issues in DDM systems and future directions are discussed.
Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance
In Proceedings of the Second SIAM Conference on Data Mining, 2002
Cited by 27 (10 self)
With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms.
Fast Parallel Association Rule Mining without Candidacy Generation
In ICDM, 2001
Cited by 27 (3 self)
Searching for frequent patterns in transactional databases is considered one of the most important data mining problems. Most current association mining algorithms, whether sequential or parallel, adopt an Apriori-like algorithm that requires multiple full I/O scans of the data set and expensive computation to generate the potential frequent items. Because of these limitations, the recent explosive growth in data collection has made current association rule mining algorithms inadequate for analyzing excessively large transaction sets. In this paper we introduce a new parallel algorithm MLFPT (Multiple Local Frequent Pattern Tree) for parallel mining of frequent patterns, based on FP-growth mining, that uses only two full I/O scans of the database, eliminating the need for generating the candidate items, and distributing the work fairly among processors to achieve near optimum load balance.
Cache-conscious frequent pattern mining on a modern processor
In Proceedings of the International Conference on Very Large Data Bases (VLDB), 2005
Cited by 24 (6 self)
In this paper, we examine the performance of frequent pattern mining algorithms on a modern processor. A detailed performance study reveals that even the best frequent pattern mining implementations, with highly efficient memory managers, still grossly underutilize a modern processor. The primary performance bottlenecks are poor data locality and low instruction-level parallelism (ILP). We propose a cache-conscious prefix tree to address this problem. The resulting tree improves spatial locality and also enhances the benefits from hardware cache line prefetching. Furthermore, the design of this data structure allows the use of a novel tiling strategy to improve temporal locality. The result is an overall speedup of up to 3.2 when compared with state-of-the-art implementations. We then show how these algorithms can be improved further by realizing a non-naive thread-based decomposition that targets simultaneously multithreaded processors. A key aspect of this decomposition is to ensure cache reuse between threads that are co-scheduled at a fine granularity. This optimization affords an additional speedup of 50%, resulting in an overall speedup of up to 4.8. To
A localized algorithm for parallel association mining
In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures, 1997
Cited by 23 (4 self)
Discovery of association rules is an important database mining problem. Mining for association rules involves extracting patterns from large databases and inferring useful rules from them. Several parallel and sequential algorithms have been proposed in the literature to solve this problem. Almost all of these algorithms make repeated passes over the database to determine the commonly occurring patterns or itemsets (sets of items), thus incurring high I/O overhead. In the parallel case, these algorithms do a reduction at the end of each pass to construct the global patterns, thus incurring high synchronization cost. In this paper we describe a new parallel association mining algorithm. Our algorithm is the result of a detailed study of the available parallelism and the properties of associations. The algorithm uses a scheme to cluster related frequent itemsets together, and to partition them among the processors. At the same time it also uses a different database layout which clusters related transactions together, and selectively replicates the database so that the portion of the database needed for the computation of associations is local to each processor. After the initial setup phase, the algorithm eliminates the need for further communication or synchronization. The algorithm further scans the local database partition only three times, thus minimizing I/O overheads. Unlike previous approaches, the algorithm uses simple intersection operations to compute frequent itemsets and does not have to maintain or search complex hash structures. Our experimental testbed is a 32-processor DEC Alpha cluster interconnected by the Memory Channel network. We present results on the performance of our algorithm on various databases, and compare it against a well-known parallel algorithm. Our algorithm outperforms it by more than an order of magnitude.
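The idea of clustering related frequent itemsets and partitioning the clusters among processors can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's scheme: itemsets are grouped by a shared first item, the weight C(n, 2) is an assumed proxy for the work in a class of n members, and the greedy assignment and the names `equivalence_classes` and `assign_classes` are hypothetical.

```python
from collections import defaultdict

def equivalence_classes(frequent_pairs):
    """Group 2-itemsets, given as (first, second) tuples with
    first < second, by their common first item (the prefix)."""
    classes = defaultdict(list)
    for first, second in sorted(frequent_pairs):
        classes[first].append(second)
    return dict(classes)

def assign_classes(classes, num_procs):
    """Greedy partitioning (an assumed heuristic): take the largest
    class first and give it to the currently least-loaded processor."""
    loads = [0] * num_procs
    assignment = [[] for _ in range(num_procs)]
    by_size = sorted(classes.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for prefix, members in by_size:
        n = len(members)
        weight = n * (n - 1) // 2  # assumed work estimate: pairs within the class
        target = loads.index(min(loads))
        loads[target] += weight
        assignment[target].append(prefix)
    return assignment

# Hypothetical frequent 2-itemsets over items a..d.
pairs = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
classes = equivalence_classes(pairs)
assignment = assign_classes(classes, 2)
```

Because each class can be processed without reference to the others, a processor that holds the database portion relevant to its classes can work without further communication, which matches the behaviour the abstract describes after the setup phase.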
Effect of Data Skewness in Parallel Mining of Association Rules
In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 1998
Cited by 22 (1 self)
An efficient parallel algorithm FPM (Fast Parallel Mining) for mining association rules on a shared-nothing parallel system has been proposed. It adopts the count distribution approach and has incorporated two powerful candidate pruning techniques, i.e., distributed pruning and global pruning. It has a simple communication scheme which performs only one round of message exchange in each iteration. We found that the two pruning techniques are very sensitive to data skewness, which describes the degree of nonuniformity of the itemset distribution among the database partitions. Distributed pruning is very effective when data skewness is high. Global pruning is more effective than distributed pruning even for the mild data skewness case. We have implemented the algorithm on an IBM SP2 parallel machine. The performance studies confirm our observation on the relationship between the effectiveness of the two pruning techniques and data skewness. It has also shown that FPM outp...
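The count distribution approach named in this abstract can be sketched in a single process. The sketch makes several assumptions: each list in `partitions` stands in for one processor's local database, the summation loop stands in for the one round of message exchange per iteration, every k-subset of a transaction is counted instead of a pruned candidate set, and FPM's distributed and global pruning are not modelled. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def local_counts(partition, k):
    """Count every k-itemset occurring in one local partition."""
    counts = Counter()
    for transaction in partition:
        for candidate in combinations(sorted(transaction), k):
            counts[candidate] += 1
    return counts

def count_distribution(partitions, k, min_support):
    """Sum the local counts into global counts and keep the itemsets
    that reach the global minimum support."""
    global_counts = Counter()
    for partition in partitions:  # one loop iteration per simulated processor
        global_counts.update(local_counts(partition, k))
    return {c: n for c, n in global_counts.items() if n >= min_support}

# Hypothetical database split across two partitions.
partitions = [
    [{"a", "b"}, {"a", "b", "c"}],
    [{"a", "c"}, {"a", "b"}],
]
frequent = count_distribution(partitions, 2, 3)
```

In this toy run the pair (a, b) is frequent globally although neither partition alone reaches the threshold, which is why the local counts have to be combined before the support test is applied.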
A middleware for developing parallel data mining implementations
In Proceedings of the First SIAM Conference on Data Mining, 2001
Cited by 22 (14 self)
Data mining is an interdisciplinary field, having applications in diverse areas like bioinformatics, medical informatics, scientific data analysis, financial analysis, consumer profiling, etc. In each of these application domains, the amount of data available for analysis has exploded in recent years, making the scalability of data
A High-Performance Distributed Algorithm for Mining Association Rules
2003
Cited by 19 (0 self)
We present a new distributed association rule mining (DARM) algorithm that demonstrates superlinear speedup with the number of computing nodes. The algorithm is the first DARM algorithm to perform a single scan over the database. As such, its performance is unmatched by any previous algorithm. Scale-up experiments over standard synthetic benchmarks demonstrate stable run time regardless of the number of computers. Theoretical analysis reveals a tighter bound on error probability than the one shown in the corresponding sequential algorithm.