A tree projection algorithm for generation of frequent itemsets
Journal of Parallel and Distributed Computing, 2000
Cited by 160 (2 self)
In this paper we propose algorithms for generation of frequent itemsets by successive construction of the nodes of a lexicographic tree of itemsets. We discuss different strategies in generation and traversal of the lexicographic tree such as breadth-first search, depth-first search, or a combination of the two. These techniques provide different tradeoffs in terms of the I/O, memory, and computational time requirements. We use the hierarchical structure of the lexicographic tree to successively project transactions at each node of the lexicographic tree, and use matrix counting on this reduced set of transactions for finding frequent itemsets. We tested our algorithm on both real and synthetic data. We provide an implementation of the tree projection method which is up to one order of magnitude faster than other recent techniques in the literature. The algorithm has a well-structured data access pattern which provides data locality and reuse of data for multiple levels of the cache. We also discuss methods for parallelization of the ...
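The projection idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a depth-first traversal only, replaces the matrix counting mentioned in the abstract with plain per-item counters, and the function name `frequent_itemsets` is hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def frequent_itemsets(transactions: Iterable[Set[str]],
                      min_support: int) -> Dict[Tuple[str, ...], int]:
    """Depth-first walk of a lexicographic tree of itemsets.

    Each tree node is an itemset in sorted order. Before descending to a
    child, the transactions are projected: only those containing the new
    item are passed down, so deeper nodes scan fewer transactions.
    """
    result: Dict[Tuple[str, ...], int] = {}

    def extend(prefix: Tuple[str, ...],
               projected: List[frozenset],
               candidates: List[str]) -> None:
        # Count every candidate extension inside the projected set only.
        counts = {item: 0 for item in candidates}
        for t in projected:
            for item in candidates:
                if item in t:
                    counts[item] += 1
        frequent = [i for i in candidates if counts[i] >= min_support]
        for idx, item in enumerate(frequent):
            node = prefix + (item,)
            result[node] = counts[item]
            # Projection step: keep only transactions that contain `item`.
            child_tx = [t for t in projected if item in t]
            # Only lexicographically later frequent items can extend the node.
            extend(node, child_tx, frequent[idx + 1:])

    tx = [frozenset(t) for t in transactions]
    items = sorted({i for t in tx for i in t})
    extend((), tx, items)
    return result
```

Called on four transactions `{a,b,c}`, `{a,b}`, `{a,c}`, `{b,c}` with `min_support=2`, the sketch reports the three single items and the three pairs, and drops `{a,b,c}` because only one transaction supports it.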
Scalable Parallel Data Mining for Association Rules
1997
Cited by 149 (14 self)
One of the important problems in data mining is discovering association rules from databases of transactions where each transaction consists of a set of items. The most time-consuming operation in this discovery process is the computation of the frequency of the occurrences of interesting subsets of items (called candidates) in the database of transactions. To prune the exponentially large space of candidates, most existing algorithms consider only those candidates that have a user-defined minimum support. Even with the pruning, the task of finding all association rules requires a lot of computational power and time. Parallel computers offer a potential solution to the computation requirement of this task, provided efficient and scalable parallel algorithms can be designed. In this paper, we present two new parallel algorithms for mining association rules. The Intelligent Data Distribution algorithm efficiently uses aggregate memory of the parallel computer by employing intelligent candi...
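The counting and pruning step described in this abstract can be sketched in Python as follows. This is a serial illustration under stated assumptions, not either of the paper's parallel algorithms; the helper names `count_candidates` and `prune` are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List


def count_candidates(transactions: Iterable[Iterable[str]],
                     candidates: List[FrozenSet[str]]) -> Counter:
    """Count how many transactions contain each candidate itemset."""
    counts: Counter = Counter()
    for t in transactions:
        items = set(t)
        for c in candidates:
            if c <= items:  # candidate is a subset of the transaction
                counts[c] += 1
    return counts


def prune(counts: Counter, min_support: int) -> Dict[FrozenSet[str], int]:
    """Keep only candidates that meet the user-defined minimum support."""
    return {c: n for c, n in counts.items() if n >= min_support}
```

Because a support count is a sum over transactions, counts computed on disjoint partitions of the database can be added to obtain the global count. The sketch relies only on that property and makes no claim about how the paper distributes candidates or data.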
ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets
In Proc. of the International Parallel Processing Symposium, 1998
Cited by 60 (5 self)
In this paper, we present ScalParC (Scalable Parallel Classifier), a new parallel formulation of a decision-tree-based classification process. Like other state-of-the-art decision tree classifiers such as SPRINT, ScalParC is suited for handling large datasets. We show that the existing parallel formulation of SPRINT is unscalable, whereas ScalParC is shown to be scalable in both runtime and memory requirements. We present the experimental results of classifying up to 6.4 million records on up to 128 processors of a Cray T3D, in order to demonstrate the scalable behavior of ScalParC. A key component of ScalParC is the parallel hash table. The proposed parallel hashing paradigm can be used to parallelize other algorithms that require many concurrent updates to a large hash table.
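One common way to support many concurrent updates to a large distributed hash table is to give each key an owning processor, bucket updates by owner, and exchange the buckets in bulk. The Python sketch below simulates that pattern in a single process. The abstract does not describe ScalParC's hashing scheme, so the ownership rule `key % num_procs`, the integer keys, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Update = Tuple[int, str]  # (key, value); integer keys are an assumption


def owner(key: int, num_procs: int) -> int:
    """Hypothetical ownership rule: each key lives on exactly one processor."""
    return key % num_procs


def bucket_updates(updates: List[Update], num_procs: int) -> List[List[Update]]:
    """Group one processor's pending updates by destination processor."""
    buckets: List[List[Update]] = [[] for _ in range(num_procs)]
    for key, value in updates:
        buckets[owner(key, num_procs)].append((key, value))
    return buckets


def exchange_and_apply(per_proc_updates: List[List[Update]]) -> List[Dict[int, str]]:
    """Simulate an all-to-all exchange, then apply updates on the owners."""
    num_procs = len(per_proc_updates)
    outgoing = [bucket_updates(u, num_procs) for u in per_proc_updates]
    tables: List[Dict[int, str]] = [{} for _ in range(num_procs)]
    for dest in range(num_procs):
        for src in range(num_procs):
            # Each owner applies the bucket sent to it by every source.
            for key, value in outgoing[src][dest]:
                tables[dest][key] = value
    return tables
```

Batching the updates means each processor only ever writes to its own slice of the table, so no locking is needed in this simulation.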
Parallel Formulations of Decision-Tree Classification Algorithms
Data Mining and Knowledge Discovery: An International Journal, 1998
Cited by 32 (1 self)
Classification decision tree algorithms are used extensively for data mining in many domains such as retail target marketing, fraud detection, etc. Highly parallel algorithms for constructing classification decision trees are desirable for dealing with large data sets in a reasonable amount of time. Algorithms for building classification decision trees have a natural concurrency, but are difficult to parallelize due to the inherent dynamic nature of the computation. In this paper, we present parallel formulations of a classification decision tree learning algorithm based on induction. We describe two basic parallel formulations. One is based on the Synchronous Tree Construction Approach and the other is based on the Partitioned Tree Construction Approach. We discuss the advantages and disadvantages of using these methods and propose a hybrid method that employs the good features of these methods. We also provide the analysis of the cost of computation and communication of the proposed hybr...
A Survey of Parallel Search Algorithms for Discrete Optimization Problems
ORSA Journal on Computing, 1993
Cited by 7 (0 self)
Discrete optimization problems (DOPs) arise in various applications such as planning, scheduling, computer-aided design, robotics, game playing, and constraint-directed reasoning. Often, a DOP is formulated in terms of finding a (minimum-cost) solution path in a graph from an initial node to a goal node and solved by graph/tree search methods. Availability of parallel computers has created substantial interest in exploring parallel formulations of these graph and tree search methods. This article provides a survey of various parallel search algorithms such as Backtracking, IDA*, A*, Branch-and-Bound techniques, and Dynamic Programming. It addresses issues related to load balancing, communication costs, scalability, and the phenomenon of speedup anomalies in parallel search.
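As a concrete instance of the minimum-cost path formulation, here is a minimal serial sketch of IDA*, one of the search methods the survey names. It is a textbook version rather than any parallel formulation from the article. The graph encoding (a dict of weighted adjacency lists) and the heuristic table are assumptions, and the heuristic is assumed admissible.

```python
import math
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(neighbor, edge cost)]


def ida_star(graph: Graph, heuristic: Dict[str, float],
             start: str, goal: str) -> Tuple[float, Optional[List[str]]]:
    """Iterative-deepening A*: repeated depth-first searches, each bounded
    by f = g + h, with the bound raised to the smallest f that exceeded it."""

    def search(path: List[str], g: float, bound: float):
        node = path[-1]
        f = g + heuristic[node]
        if f > bound:
            return f, None          # report the overshoot for the next bound
        if node == goal:
            return g, list(path)
        minimum = math.inf
        for nxt, cost in graph.get(node, []):
            if nxt in path:         # avoid cycles along the current path
                continue
            path.append(nxt)
            t, found = search(path, g + cost, bound)
            path.pop()
            if found is not None:
                return t, found
            minimum = min(minimum, t)
        return minimum, None

    bound = heuristic[start]
    while True:
        t, found = search([start], 0.0, bound)
        if found is not None:
            return t, found
        if t == math.inf:           # nothing left to expand: no path exists
            return math.inf, None
        bound = t
```

Each iteration is an independent depth-first search, so memory use stays proportional to the path length.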
Efficient parallel algorithms for mining associations
Parallel and Distributed Systems, 2000
Parallel Formulations of Inductive Classification Learning Algorithm
1996
Cited by 3 (0 self)
One of the important problems in data mining [SAD+93] is classification-rule learning. Classification-rule learning involves finding rules or decision trees that partition given data into predefined classes. For any realistic problem domain of classification-rule learning, the set of possible decision trees is too large to be searched exhaustively. In fact, the computational complexity of finding an optimal classification decision tree is NP-hard. All of the existing algorithms, like C4.5 [Qui93], CDP [AIS93] and SLIQ [MAR96], use local heuristics to handle the computational complexity. The computational complexity of these algorithms ranges from O(AN log N) to O(AN (log N)^2) with N training data items and A attributes. These algorithms are fast enough for application domains where N is relatively small. However, in the data mining domain where millions of records and a large number of attributes are involved, the execution time of these algorithms can become pr...
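To show where a per-attribute N log N cost can come from, the Python sketch below evaluates every binary split of one numeric attribute by sorting it once and sweeping the sorted order. It is a generic local heuristic and not the exact procedure of C4.5, CDP, or SLIQ. Gini impurity is chosen purely for illustration, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def gini(counts: Counter, total: int) -> float:
    """Gini impurity of a class distribution; 0.0 means a pure partition."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_split(values: Sequence[float],
               labels: Sequence[str]) -> Tuple[float, Optional[float]]:
    """Return (weighted impurity, threshold) of the best binary split.

    Sorting costs O(N log N); the sweep afterwards is linear in N, so
    repeating this for A attributes costs O(A N log N) per tree level.
    """
    n = len(values)
    order: List[int] = sorted(range(n), key=lambda i: values[i])
    left: Counter = Counter()
    right: Counter = Counter(labels)
    best: Tuple[float, Optional[float]] = (float("inf"), None)
    for pos in range(n - 1):
        i, nxt = order[pos], order[pos + 1]
        # Move one record from the right partition to the left partition.
        left[labels[i]] += 1
        right[labels[i]] -= 1
        if values[i] == values[nxt]:
            continue  # cannot place a threshold between equal values
        n_left, n_right = pos + 1, n - pos - 1
        score = (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n
        if score < best[0]:
            best = (score, (values[i] + values[nxt]) / 2)
    return best
```

Repeating the sort at every tree node is what makes such heuristics expensive on millions of records, which is the scaling concern the abstract raises.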
Parallel Matrix-Vector Product Using Approximate Hierarchical Methods
In Proceedings of Supercomputing '95, 1995
Cited by 3 (0 self)
Matrix-vector products (matvecs) form the core of iterative methods used for solving dense linear systems. Often, these systems arise in the solution of integral equations used in electromagnetics, heat transfer, and wave propagation. In this paper, we present a parallel approximate method for computing matvecs used in the solution of integral equations. We use this method to compute dense matvecs of hundreds of thousands of elements. The combined speedups obtained from the use of approximate methods and parallel processing represent an improvement of several orders of magnitude over exact matvecs on uniprocessors. We demonstrate that our parallel formulation incurs minimal parallel processing overhead and scales up to a large number of processors. We study the impact of varying the accuracy of the approximate matvec on overall time and on parallel efficiency. This work was supported by IST/BMDO through Army Research Office contract DA/DAAH0493G0080, NSF grant NSG/1RI921694...
Visual Data Mining: Framework and Algorithm Development
1996
Cited by 2 (0 self)
Visual data mining is the use of visualization techniques to allow data miners and analysts to evaluate, monitor, and guide the inputs, products, and process of data mining. It can help introduce user insights, preferences, and biases in the earlier stages of the data mining lifecycle to reduce its overall computational complexity and reduce the set of uninteresting patterns in the product. Even more useful may be the new insights developed by the data miners and analysts concerning the quality and implications of the decisions made by the data mining process. These new insights may facilitate the development of better algorithms and processes for data mining. This paper provides a framework for visual data mining via the loose coupling of databases and visualization systems. The paper applies visual data mining towards designing new algorithms that can learn decision trees by manually refining some of the decisions made by well-known algorithms such as C4.5. Experiments with a set of ben...
Partitioning Algorithms for Simultaneously Balancing Iterative and Direct Methods
2004
Cited by 1 (0 self)
This paper focuses on domain decomposition-based numerical simulations whose subproblems corresponding to the various subdomains are solved using sparse direct factorization methods (e.g., FETI). Effective load balancing of such computations requires that the resulting partitioning simultaneously balances the amount of time required to factor the local subproblem using direct factorization, and the number of elements assigned to each processor. Unfortunately, existing graph-partitioning algorithms cannot be used to load-balance these types of computations, as they can only compute partitionings that simultaneously balance numerous constraints defined a priori on the vertices and optimize different objectives defined locally on the edges. To address this problem, we developed an algorithm that follows a predictor-corrector approach that first computes a high-quality partitioning of the underlying graph, and then modifies it to achieve the desired balancing constraints. During the corrector step we compute a fill-reducing ordering for each partition, and then we modify the initial partitioning and ordering so that our objectives are satisfied. Experimental results show that the proposed algorithm is able to reduce the fill-in of the overweight subdomains and achieve a considerably better balance.