Parallel and Distributed Association Mining: A Survey
- IEEE Concurrency, 1999
Cited by 96 (3 self)
This article presents a survey of the state-of-the-art in parallel and distributed association rule mining (ARM) algorithms. This is direly needed given the importance of association rules to data mining, and given the tremendous amount of research it has attracted in recent years. This article provides a taxonomy of the extant association mining methods, characterizing them according to the database format used, search and enumeration techniques utilized, and depending on whether they enumerate all or only maximal patterns, and their complexity in terms of the number of database scans. The survey clearly lists the design space of the parallel and distributed ARM algorithms based on the platform used (distributed or shared-memory), kind of parallelism exploited (task or data), and the load balancing strategy used (static or dynamic). A large number of parallel and distributed ARM methods are reviewed and grouped into related techniques. It is shown that there are a few dominan...
A High-Performance Distributed Algorithm for Mining Association Rules
- 2003
Cited by 17 (0 self)
We present a new distributed association rule mining (D-ARM) algorithm that demonstrates superlinear speedup with the number of computing nodes. The algorithm is the first D-ARM algorithm to perform a single scan over the database. As such, its performance is unmatched by any previous algorithm. Scale-up experiments over standard synthetic benchmarks demonstrate stable run time regardless of the number of computers. Theoretical analysis reveals a tighter bound on error probability than the one shown in the corresponding sequential algorithm.
Association Rules Mining: A Recent Overview
Cited by 6 (0 self)
In this paper, we provide the preliminaries of basic concepts about association rule mining and survey the list of existing association rule mining techniques. Of course, a single article cannot be a complete review of all the algorithms, yet we hope that the references cited will cover the major theoretical issues, guiding the researcher in interesting research directions that have yet to be explored.
Parallel and distributed methods for incremental frequent itemset mining
- IEEE Transactions on Systems, Man and Cybernetics, 2004
Cited by 5 (1 self)
Traditional methods for data mining typically make the assumption that the data is centralized, memory-resident, and static. This assumption is no longer tenable. Such methods waste computational and input/output (I/O) resources when data is dynamic, and they impose excessive communication overhead when data is distributed. Efficient implementation of incremental data mining methods is, thus, becoming crucial for ensuring system scalability and facilitating knowledge discovery when data is dynamic and distributed. In this paper, we address this issue in the context of the important task of frequent itemset mining. We first present an efficient algorithm which dynamically maintains the required information even in the presence of data updates without examining the entire dataset. We then show how to parallelize this incremental algorithm. We also propose a distributed asynchronous algorithm, which imposes minimal communication overhead for mining distributed dynamic datasets. Our distributed approach is capable of generating local models (in which each site has a summary of its own database) as well as the global model of frequent itemsets (in which all sites have a summary of the entire database). This ability permits our approach not only to generate frequent itemsets, but also to generate high-contrast frequent itemsets, which allows one to examine how the data is skewed over different sites. Index Terms: Distributed computing, grid computing, incremental data mining, parallel computing.
Efficient parallel algorithms for mining associations
- Parallel and Distributed Systems, 2000
Data Allocation Algorithm for Parallel Association Rule Discovery
- In Proceedings of the Fifth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2001), Hong Kong
Cited by 3 (0 self)
Association rule discovery techniques have gradually been adapted to parallel systems in order to take advantage of the higher speed and greater storage capacity that they offer. The transition to a distributed memory system requires the partitioning of the database among the processors, a procedure that is generally carried out indiscriminately. However, for some techniques the nature of the database partitioning can have a pronounced impact on execution time and attention will be focused on one such algorithm, Fast Parallel Mining (FPM). A new algorithm, Data Allocation Algorithm (DAA), is presented that uses Principal Component Analysis to improve the data distribution prior to FPM.
Robust and Distributed Top-N Frequent-Pattern Mining With SAP BW Accelerator
Cited by 1 (1 self)
Mining for association rules and frequent patterns is a central activity in data mining. However, most existing algorithms are only moderately suitable for real-world scenarios. Most strategies use parameters like minimum support, for which it can be very difficult to define a suitable value for unknown datasets. Since most untrained users are unable or unwilling to set such technical parameters, we address the problem of replacing the minimum-support parameter with top-n strategies. In our paper, we start by extending a top-n implementation of the ECLAT algorithm to improve its performance by using heuristic search strategy optimizations. Also, real-world datasets are often distributed and modern database architectures are switching from expensive SMPs to cheaper shared-nothing blade servers. Thus, most mining queries require distribution handling. Since partitioning can be forced by user-defined semantics, it is often forbidden to transform the data. Therefore, we developed an adaptive top-n frequent-pattern mining algorithm that simplifies the mining process on real distributions by relaxing some requirements on the results. We first combine the PARTITION and the TPUT algorithms to handle distributed top-n frequent-pattern mining. Then, we extend this new algorithm for distributions with real-world data characteristics. For frequent-pattern mining algorithms, equal distributions are important conditions, and tiny partitions can cause performance bottlenecks. Hence, we implemented an approach called MAST that defines a minimum absolute-support threshold. MAST prunes patterns with low chances of reaching the global top-n result set and high computing costs. In total, our approach simplifies the process of frequent-pattern mining for real customer scenarios and data sets. This may make frequent-pattern mining accessible for very new user groups.
Finally, we present results of our algorithms when run on the SAP NetWeaver BW Accelerator with standard and real business datasets.
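To make the idea of replacing a minimum-support parameter with a top-n request concrete, here is a minimal brute-force sketch in Python. It is not the ECLAT, PARTITION, TPUT, or MAST procedure named in the abstract; the function name `top_n_itemsets`, the `max_size` cap, and the toy transactions are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def top_n_itemsets(transactions, n, max_size=2):
    """Return the n itemsets (up to max_size items) with the highest support count.

    Brute-force illustration only: every small itemset of every transaction
    is counted, so this does not scale like a real miner would.
    """
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # deduplicate and fix an order for stable tuples
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    # Highest count first; ties are broken by the itemset tuple for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

# Hypothetical toy data.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"], ["a"]]
top3 = top_n_itemsets(transactions, 3)
print(top3)
```

The caller asks for "the 3 most frequent patterns" and never has to guess a support threshold; the cost is that, without a threshold, a naive implementation has nothing to prune on.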
Distributed Threshold Querying of General Functions by a Difference of Monotonic Representation
Cited by 1 (0 self)
The goal of a threshold query is to detect all objects whose score exceeds a given threshold. This type of query is used in many settings, such as data mining, event triggering, and top-k selection. Often, threshold queries are performed over distributed data. Given database relations that are distributed over many nodes, an object’s score is computed by aggregating the value of each attribute, applying a given scoring function over the aggregation, and thresholding the function’s value. However, joining all the distributed relations to a central database might incur prohibitive overheads in bandwidth, CPU, and storage accesses. Efficient algorithms required to reduce these costs exist only for monotonic aggregation threshold queries and certain specific scoring functions. We present a novel approach for efficiently performing general distributed threshold queries. To the best of our knowledge, this is the first solution to the problem of performing such queries with general scoring functions. We first present a solution for monotonic functions, and then introduce a technique to solve for other functions by representing them as a difference of monotonic functions. Experiments with real-world data demonstrate the method’s effectiveness in achieving low communication and access costs.
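Why a difference-of-monotonic representation helps can be shown with a small sketch. If f = g - h with g and h both nondecreasing, then for any x in [lo, hi] we have g(lo) - h(hi) <= f(x) <= g(hi) - h(lo), so a threshold test can sometimes be settled from a known range of the aggregate alone. The code below is an illustration of that bound under these assumptions, not the paper's algorithm; the function names and the example f(x) = x - x^2 are hypothetical.

```python
def bounds_from_dm(g, h, lo, hi):
    """Bound f = g - h on [lo, hi], assuming g and h are both nondecreasing there."""
    return g(lo) - h(hi), g(hi) - h(lo)

def classify(g, h, lo, hi, threshold):
    """Decide whether f exceeds the threshold using only the interval [lo, hi]."""
    lower, upper = bounds_from_dm(g, h, lo, hi)
    if lower > threshold:
        return "above"       # every x in the interval scores above the threshold
    if upper <= threshold:
        return "not above"   # no x in the interval can exceed the threshold
    return "undecided"       # the interval must be narrowed before deciding

# Hypothetical example: f(x) = x - x**2 is not monotonic on [0, 1],
# but g(x) = x and h(x) = x**2 are both nondecreasing for x >= 0.
g = lambda x: x
h = lambda x: x * x

print(classify(g, h, 0.4, 0.6, 0.5))
print(classify(g, h, 0.4, 0.6, 0.0))
print(classify(g, h, 0.4, 0.6, 0.2))
```

In a distributed setting the interval would come from partial information about an object's aggregated value; only the "undecided" objects would need further communication.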
On the Complexity of Rule Discovery from Distributed Data
- 2005
Cited by 1 (0 self)
This paper analyses the complexity of rule selection for supervised learning in distributed scenarios. The selection of rules is usually guided by a utility measure such as predictive accuracy or weighted relative accuracy. Other examples are support and confidence, known from association rule mining. A common strategy to tackle rule selection from distributed data is to evaluate rules locally on each dataset.
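As a small illustration of evaluating a rule locally on each dataset, the Python sketch below computes support and confidence for one rule at two sites and then from the summed counts. The site contents, the rule, and the function names are assumptions made for this example; the point is only that local values can differ noticeably from the global ones.

```python
def local_counts(transactions, antecedent, consequent):
    """Return (n, n_antecedent, n_rule) for one site's transactions."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    return len(transactions), n_a, n_both

def support_confidence(n, n_a, n_both):
    """Support = P(antecedent and consequent); confidence = P(consequent | antecedent)."""
    support = n_both / n if n else 0.0
    confidence = n_both / n_a if n_a else 0.0
    return support, confidence

# Two hypothetical sites holding disjoint parts of the data.
site_a = [{"bread", "milk"}, {"bread", "milk"}, {"bread"}, {"milk"}]
site_b = [{"bread"}, {"bread"}, {"bread", "milk"}, {"eggs"}, {"eggs"}, {"eggs"}]

rule = ({"bread"}, {"milk"})  # bread -> milk
counts = [local_counts(site, *rule) for site in (site_a, site_b)]
local = [support_confidence(*c) for c in counts]

# Summing the raw counts (not averaging the ratios) gives the global values.
total = tuple(sum(column) for column in zip(*counts))
global_support, global_confidence = support_confidence(*total)

print(local)
print(global_support, global_confidence)
```

Here the rule has confidence 2/3 at one site and 1/3 at the other, while the global confidence is 1/2, which is why exchanging counts rather than locally computed ratios matters.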
A SURVEY OF ASSOCIATION RULES
Cited by 1 (0 self)
Association rules are one of the most researched areas of data mining and have recently received much attention from the database community. They have proven to be quite useful in the marketing and retail communities as well as other more diverse fields. In this paper we provide an overview of association rule research.

