Results 1-10 of 37
Scalable Algorithms for Association Mining
IEEE Transactions on Knowledge and Data Engineering, 2000
Cited by 182 (22 self)
Association rule discovery has emerged as an important problem in knowledge discovery and data mining. The association mining task consists of identifying the frequent itemsets, and then forming conditional implication rules among them. In this paper we present efficient algorithms for the discovery of frequent itemsets, which forms the compute-intensive phase of the task. The algorithms utilize the structural properties of frequent itemsets to facilitate fast discovery. The items are organized into a subset lattice search space, which is decomposed into small independent chunks or sublattices that can be solved in memory. Efficient lattice traversal techniques are presented, which quickly identify all the long frequent itemsets, and their subsets if required. We also present the effect of using different database layout schemes combined with the proposed decomposition and traversal techniques. We experimentally compare the new algorithms against the previous approaches, obtaining ...
Scalable Parallel Data Mining for Association Rules
1997
Cited by 153 (14 self)
One of the important problems in data mining is discovering association rules from databases of transactions where each transaction consists of a set of items. The most time-consuming operation in this discovery process is the computation of the frequency of the occurrences of interesting subsets of items (called candidates) in the database of transactions. To prune the exponentially large space of candidates, most existing algorithms consider only those candidates that have a user-defined minimum support. Even with the pruning, the task of finding all association rules requires a lot of computation power and time. Parallel computers offer a potential solution to the computation requirement of this task, provided efficient and scalable parallel algorithms can be designed. In this paper, we present two new parallel algorithms for mining association rules. The Intelligent Data Distribution algorithm efficiently uses the aggregate memory of the parallel computer by employing intelligent candi...
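The candidate counting and minimum-support pruning described in this abstract can be illustrated with a minimal serial sketch. This is a generic level-wise search, not the parallel algorithms of the paper; the function name `frequent_itemsets` and its parameters are hypothetical, and `min_support` is assumed to be an absolute transaction count.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: count candidates, keep those meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count every single item.
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    frequent = {s: n for s, n in counts.items() if n >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets into size-k candidates.
        prev = list(frequent)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune any candidate that has an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        # Count how many transactions contain each surviving candidate.
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result
```

The counting pass over the transactions dominates the cost, which is the step the abstract identifies as the most time-consuming.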
Parallel and Distributed Association Mining: A Survey
IEEE Concurrency, 1999
Cited by 116 (3 self)
This article presents a survey of the state-of-the-art in parallel and distributed association rule mining (ARM) algorithms. This is direly needed given the importance of association rules to data mining, and given the tremendous amount of research it has attracted in recent years. This article provides a taxonomy of the extant association mining methods, characterizing them according to the database format used, search and enumeration techniques utilized, and depending on whether they enumerate all or only maximal patterns, and their complexity in terms of the number of database scans. The survey clearly lists the design space of the parallel and distributed ARM algorithms based on the platform used (distributed or shared-memory), kind of parallelism exploited (task or data), and the load balancing strategy used (static or dynamic). A large number of parallel and distributed ARM methods are reviewed and grouped into related techniques. It is shown that there are a few dominan...
A Data-Clustering Algorithm On Distributed Memory Multiprocessors
In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, 2000
Cited by 95 (1 self)
To cluster increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message-passing model. The proposed algorithm exploits the inherent data-parallelism in the k-means algorithm. We analytically show that the speedup and the scaleup of our algorithm approach the optimal as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups, for example, 15.62 on 16 nodes, and essentially linear scaleup in the size of the data set and in the number of clusters desired. For a 2 gigabyte test data set, our implementation drives the 16-node SP2 at more than 1.8 gigaflops.
Keywords: k-means, data mining, massive data sets, message-passing, text mining.
1 Introduction
Data sets measuring in gigabytes and even terabytes are now quite common in data and text minin...
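The data-parallelism mentioned in this abstract can be sketched as follows: each worker assigns its own block of points to the nearest centroid and produces per-cluster sums and counts, and only those small summaries are combined. This is an illustrative single-process simulation, not the authors' implementation; the names `nearest`, `local_stats`, and `kmeans_step` are hypothetical, and the in-process loop stands in for the message-passing reduction that a real deployment would use.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def local_stats(block, centroids):
    """Per-worker step: coordinate sums and point counts for each cluster."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in block:
        j = nearest(point, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    return sums, counts


def kmeans_step(blocks, centroids):
    """One iteration: combine per-block statistics, then recompute centroids."""
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    # Stand-in for the reduction across workers (assumption: done in-process).
    for sums, counts in (local_stats(b, centroids) for b in blocks):
        for j in range(len(centroids)):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    # A cluster that received no points keeps its previous centroid.
    return [
        [s / total_counts[j] for s in total_sums[j]] if total_counts[j] else list(centroids[j])
        for j in range(len(centroids))
    ]
```

Because only the sums and counts cross worker boundaries, the amount of data exchanged per iteration depends on the number of clusters and dimensions rather than on the number of data points.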
Parallel data mining for association rules on shared-memory multiprocessors
In Proc. Supercomputing '96, 1996
Cited by 73 (19 self)
In this paper we present a new parallel algorithm for data mining of association rules on shared-memory multiprocessors. We study the degree of parallelism, synchronization, and data locality issues, and present optimizations for fast frequency computation. Experiments show that a significant improvement of performance is achieved using our proposed optimizations. We also achieved good speedup for the parallel algorithm.
A lot of data mining tasks (e.g. association rules, sequential patterns) use complex pointer-based data structures (e.g. hash trees) that typically suffer from suboptimal data locality. In the multiprocessor case, shared access to these data structures may also result in false sharing. For these tasks it is commonly observed that the recursive data structure is built once and accessed multiple times during each iteration. Furthermore, the access patterns after the build phase are highly ordered. In such cases, locality-sensitive and false-sharing-sensitive memory placement of these structures can enhance performance significantly. We evaluate a set of placement policies for parallel association discovery, and show that simple placement schemes can improve execution time by more than a factor of two. More complex schemes yield additional gains.
Fast Mining of Sequential Patterns in Very Large Databases
1997
Cited by 21 (1 self)
In this paper we present a new algorithm for fast discovery of Sequential Patterns. Given a collection of items, a set of records over those items, and records belonging to a customer, the task is to identify all the commonly occurring sequences of items bought by the customers. An example of a sequential pattern could be that "30% of the people buying Douglas Adams's The Hitchhiker's Guide to the Galaxy bought The Restaurant at the End of the Universe within a month". The existing solutions to this problem make repeated database scans, and use complex hash structures which have poor locality. Our new SPADE algorithm uses only simple join operations, and usually finds all frequent sequences in only three database scans. With the help of extensive experiments, we show that SPADE outperforms the best previous algorithm by more than a factor of 2, and by more than an order of magnitude in the incremental case. It also has excellent scaleup properties with respect to the number of custom...
Parallel sequence mining on shared-memory machines
Journal of Parallel and Distributed Computing, 2001
Cited by 18 (0 self)
We present pSPADE, a parallel algorithm for fast discovery of frequent sequences in large databases. pSPADE decomposes the original search space into smaller suffix-based classes. Each class can be solved in main memory using efficient search techniques and simple join operations. Further, each class can be solved independently on each processor, requiring no synchronization. However, dynamic inter-class and intra-class load balancing must be exploited to ensure that each processor gets an equal amount of work. Experiments on a 12-processor SGI Origin 2000 shared-memory system show good speedup and scaleup results.
InterAct: Virtual Sharing for Interactive Client-Server Applications
 In 4th Workshop on Languages, Compilers, and Runtime Systems for
1998
Cited by 12 (6 self)
We describe InterAct, a framework for interactive client-server applications. InterAct provides an efficient mechanism to support object sharing while facilitating client-controlled consistency. Advantages are twofold: the ability to cache relevant data on the client to help support interactivity, and the ability to extend the computation boundary to the client in order to reduce server load. We examine its performance on the interactive data mining domain, and present some basic results that indicate the flexibility and performance achievable.
1 Introduction
Many applications require interaction among disparate components, often running in different environments. Simulation of interactive virtual environments, interactive speech recognition, interactive vision (object recognition systems), and interactive data mining are some examples. A large number of such applications are client-server in nature and involve processes that exchange data in an irregular fashion. In such applicati...
Efficiently mining approximate models of associations in evolving databases
In Proc. of the 6th Int'l Conf. on Principles and Practices of Data Mining and Knowledge Discovery in Databases, 2002
Cited by 6 (3 self)
Much of the existing work in machine learning and data mining has relied on devising efficient techniques to build accurate models from the data. Research on how the accuracy of a model changes as a function of dynamic updates to the databases is very limited. In this work we show that extracting this information (knowing which aspects of the model are changing, and how they are changing as a function of data updates) can be very effective for interactive data mining purposes, where response time is often more important than model quality as long as model quality is not too far off the best (exact) model. In this paper we consider the problem of generating approximate models within the context of association mining, a key data mining task. We propose a new approach to incrementally generate approximate models of associations in evolving databases. Our approach is able to detect how patterns evolve over time (an interesting result in its own right), and uses this information in generating approximate models with high accuracy at a fraction of the cost of generating the exact model. Extensive experimental evaluation on real databases demonstrates the effectiveness and advantages of the proposed approach.