Analyzing Scalability of Parallel Algorithms and Architectures
Journal of Parallel and Distributed Computing, 1994
Cited by 90 (18 self)
The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithm-architecture combination for a problem under different constraints on the growth of the problem size and the number of processors. It may be used to predict the performance of a parallel algorithm and a parallel architecture for a large number of processors from the known performance on fewer processors. For a fixed problem size, it may be used to determine the optimal number of processors to be used and the maximum possible speedup that can be obtained. The objective of this paper is to critically assess the state of the art in the theory of scalability analysis, and motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures. We survey a number of techniques and formalisms t...
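The quantities named in this abstract (speedup, efficiency, and the best processor count for a fixed problem size) can be made concrete with a minimal sketch. The cost model below, serial work W and parallel time W/p + c·log2(p), is an assumption chosen for illustration and is not taken from the paper; the function names are hypothetical.

```python
import math

def parallel_time(work, p, comm_cost=1.0):
    """Assumed model: evenly divided work plus a log-depth combining step."""
    return work / p + comm_cost * math.log2(p)

def speedup(work, p, comm_cost=1.0):
    """Serial time divided by modeled parallel time."""
    return work / parallel_time(work, p, comm_cost)

def efficiency(work, p, comm_cost=1.0):
    """Speedup per processor."""
    return speedup(work, p, comm_cost) / p

def best_processor_count(work, max_p, comm_cost=1.0):
    """Processor count in 1..max_p that minimizes the modeled parallel time."""
    return min(range(1, max_p + 1),
               key=lambda p: parallel_time(work, p, comm_cost))
```

Under this assumed model, efficiency falls as p grows for fixed work, and the modeled parallel time has a minimum at a finite p, which is the "optimal number of processors" for that problem size.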
On the Efficiency of Parallel Backtracking
1992
Cited by 47 (6 self)
It is known that isolated executions of parallel backtrack search exhibit speedup anomalies. In this paper we present analytical models and experimental results on the average case behavior of parallel backtracking. We consider two types of backtrack search algorithms: (i) simple backtracking (which does not use any heuristic information); (ii) heuristic backtracking (which uses heuristics to order and prune search). We present analytical models to compare the average number of nodes visited in sequential and parallel search for each case. For simple backtracking, we show that the average speedup obtained is (i) linear when the distribution of solutions is uniform and (ii) superlinear when the distribution of solutions is nonuniform. For heuristic backtracking, the average speedup obtained is at least linear (i.e., either linear or superlinear), and the speedup obtained on a subset of instances (called difficult instances) is superlinear. We also present experimental results over ...
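A single-instance speedup anomaly of the kind mentioned above can be reproduced with a toy model. This is a sketch under stated assumptions, not the paper's analytical model: the search tree is flattened to a list of leaves, sequential search scans left to right, and each of p processors scans one contiguous chunk at one leaf per step. The function names are hypothetical.

```python
def sequential_steps(leaves):
    """Leaves examined by a left-to-right scan, up to and including the first solution."""
    return leaves.index(True) + 1

def parallel_steps(leaves, p):
    """Steps until any of p processors, each scanning one contiguous chunk, finds a solution."""
    chunk = len(leaves) // p  # assumes len(leaves) is divisible by p
    best = None
    for k in range(p):
        part = leaves[k * chunk:(k + 1) * chunk]
        if True in part:
            steps = part.index(True) + 1
            best = steps if best is None else min(best, steps)
    return best

# One solution, placed near the start of the third chunk (a nonuniform placement).
leaves = [False] * 16
leaves[9] = True
speedup = sequential_steps(leaves) / parallel_steps(leaves, 4)  # 10 / 2
```

Here four processors give a speedup of 5 on this one instance. Moving the solution to index 0 gives a speedup of 1, which is why the paper studies average-case behavior rather than isolated runs.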
Unstructured Tree Search on SIMD Parallel Computers
IEEE Transactions on Parallel and Distributed Systems, 1994
Cited by 35 (14 self)
In this paper, we present new methods for load balancing of unstructured tree computations on large-scale SIMD machines, and analyze the scalability of these and other existing schemes. An efficient formulation of tree search on a SIMD machine comprises two major components: (i) a triggering mechanism, which determines when the search space redistribution must occur to balance the search space over processors; and (ii) a scheme to redistribute the search space. We have devised a new redistribution mechanism and a new triggering mechanism. Either of these can be used in conjunction with triggering and redistribution mechanisms developed by other researchers. We analyze the scalability of these mechanisms, and verify the results experimentally. The analysis and experiments show that our new load balancing methods are highly scalable on SIMD architectures. Their scalability is shown to be no worse than that of the best load balancing schemes on MIMD architectures. We verify our theoretical...
Efficient Support of Location Transparency in Concurrent Object-Oriented Programming Languages
In Supercomputing '95, 1995
Cited by 30 (14 self)
We describe the design of a runtime system for a fine-grained concurrent object-oriented (actor) language and its performance. The runtime system provides considerable flexibility to users; specifically, it supports location transparency, actor creation and dynamic placement, and migration. The runtime system includes an efficient distributed name server, a latency hiding scheme for remote actor creation, and a compiler-controlled intra-node scheduling mechanism for local messages and dynamic load balancing. Our preliminary evaluation results suggest that the efficiency that is lost by the greater flexibility of actors can be restored by an efficient runtime system which provides an open interface that can be used by a compiler to allow optimizations. On several standard algorithms, the performance results for our system are comparable to efficient C implementations. Key Words: Concurrent Object-Oriented Programming, Actors, Location Transparency, Migration 1 Introduction We argue t...
Automated Parallelization of Discrete State-space Generation
Journal of Parallel and Distributed Computing, 1997
Cited by 22 (3 self)
We consider the problem of generating a large state-space in a distributed fashion. Unlike previously proposed solutions that partition the set of reachable states according to a hashing function provided by the user, we explore heuristic methods that completely automate the process. The first step is an initial random walk through the state space to initialize a search tree, duplicated in each processor. Then, the reachability graph is built in a distributed way, using the search tree to assign each newly found state to classes assigned to the available processors. Furthermore, we explore two remapping criteria that attempt to balance memory usage or future workload, respectively. We show how the cost of computing the global snapshot required for remapping will scale up for system sizes in the foreseeable future. An extensive set of results is presented to support our conclusions that remapping is extremely beneficial. 1 Introduction Discrete systems are frequently analyzed by genera...
Scalability of Parallel Algorithms for Matrix Multiplication
In Proc. of Int. Conf. on Parallel Processing, 1991
Cited by 21 (0 self)
A number of parallel formulations of the dense matrix multiplication algorithm have been developed. For an arbitrarily large number of processors, any of these algorithms or their variants can provide near linear speedup for sufficiently large matrix sizes, and none of the algorithms can be clearly claimed to be superior to the others. In this paper we analyze the performance and scalability of a number of parallel formulations of the matrix multiplication algorithm and predict the conditions under which each formulation is better than the others. We present a parallel formulation for hypercube and related architectures that performs better than any of the schemes described in the literature so far for a wide range of matrix sizes and number of processors. The superior performance and the analytical scalability expressions for this algorithm are verified through experiments on the Thinking Machines Corporation's CM-5 parallel computer for up to 512 processors. We show that special har...
Nearest Neighbor Algorithms for Load Balancing in Parallel Computers
1995
Cited by 19 (2 self)
With nearest neighbor load balancing algorithms, a processor makes balancing decisions based on localized workload information and manages workload migrations within its neighborhood. This paper compares a couple of fairly well-known nearest neighbor algorithms, the dimension-exchange (DE, for short) and the diffusion (DF, for short) methods, and their several variants: the average dimension-exchange (ADE), the optimally-tuned dimension-exchange (ODE), the local average diffusion (ADF) and the optimally-tuned diffusion (ODF). The measures of interest are their efficiency in driving any initial workload distribution to a uniform distribution and their ability to control the growth of the variance among the processors' workloads. The comparison is made with respect to both one-port and all-port communication architectures and in consideration of various implementation strategies including synchronous/asynchronous invocation policies and static/dynamic random workload behaviors. It t...
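The basic dimension-exchange idea can be sketched as follows. This is an illustrative sketch, assuming a hypercube topology, arbitrarily divisible load, and equal splitting between each pair of neighbors. It does not model the ADE, ODE, or diffusion variants compared in the paper, and the function name is hypothetical.

```python
def dimension_exchange_sweep(loads):
    """One sweep over all hypercube dimensions; len(loads) must be a power of two.

    In each dimension, every node pairs with the neighbor whose index differs
    in that bit, and the two split their combined load equally.
    """
    n = len(loads)
    loads = list(loads)  # work on a copy
    dim = 0
    while (1 << dim) < n:
        for node in range(n):
            partner = node ^ (1 << dim)
            if node < partner:  # handle each pair once
                avg = (loads[node] + loads[partner]) / 2
                loads[node] = loads[partner] = avg
        dim += 1
    return loads

balanced = dimension_exchange_sweep([8, 0, 0, 0, 0, 0, 0, 0])
```

Under these assumptions a single sweep over the log2(n) dimensions of the hypercube leaves every node with the global average. On other topologies, or with indivisible work units, several sweeps are generally needed.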
Parallel Processing of Discrete Optimization Problems
In Encyclopedia of Microcomputers, 1993
Cited by 19 (6 self)
Discrete optimization problems (DOPs) arise in various applications such as planning, scheduling, computer aided design, robotics, game playing and constraint directed reasoning. Often, a DOP is formulated in terms of finding a (minimum cost) solution path in a graph from an initial node to a goal node and solved by graph/tree search methods such as branch-and-bound and dynamic programming. Availability of parallel computers has created substantial interest in exploring the use of parallel processing for solving discrete optimization problems. This article provides an overview of parallel search algorithms for solving discrete optimization problems.
Scalable Work Stealing
Cited by 17 (1 self)
Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing in order to maintain efficiency. Scalable dynamic load balancing on large clusters is a challenging problem which can be addressed with distributed dynamic load balancing systems. Work stealing is a popular approach to distributed dynamic load balancing; however, its performance on large-scale clusters is not well understood. Prior work on work stealing has largely focused on shared memory machines. In this work we investigate the design and scalability of work stealing on modern distributed memory systems. We demonstrate high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.
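The general work-stealing pattern can be illustrated with a small single-threaded simulation. This is a sketch under stated assumptions and not the system described in the paper: it ignores communication cost and distributed memory entirely, uses a balanced binary task tree rather than an unbalanced one, and all names are hypothetical.

```python
import random
from collections import deque

def simulate_work_stealing(depth, workers, seed=0):
    """Round-based simulation: busy workers run one task, idle workers try one steal."""
    rng = random.Random(seed)
    deques = [deque() for _ in range(workers)]
    deques[0].append(depth)            # root task starts on worker 0
    executed = [0] * workers
    rounds = 0
    while any(deques):
        rounds += 1
        idle = {w for w in range(workers) if not deques[w]}
        for w in range(workers):
            if w in idle:
                continue
            d = deques[w].pop()        # owner takes the newest task (LIFO end)
            executed[w] += 1
            if d > 0:                  # a task of depth d spawns two children
                deques[w].extend([d - 1, d - 1])
        for w in idle:
            victim = rng.randrange(workers)
            if victim != w and deques[victim]:
                # thief takes the oldest task (FIFO end), usually the largest subtree
                deques[w].append(deques[victim].popleft())
    return executed, rounds

executed, rounds = simulate_work_stealing(depth=10, workers=4)
```

Stealing from the opposite end of the deque from the owner is the usual design choice: the oldest task tends to represent the most remaining work, so one steal keeps a thief busy for longer.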
Isoefficiency Function: A Scalability Metric for Parallel Algorithms and Architectures
1993
Cited by 14 (3 self)
This paper provides a tutorial introduction to a performance evaluation metric called the isoefficiency function. Traditional methods for evaluating serial algorithms are inadequate for analyzing the performance of parallel algorithm-architecture combinations. The isoefficiency function has proven useful for evaluating the performance of a wide variety of such combinations. On a sequential computer, the fastest algorithm for solving a given problem is the best algorithm. However, the performance of a parallel algorithm for a specific problem instance on a given number of processors provides only limited information. The time taken by a parallel algorithm to solve a problem instance depends on the problem size, the number of processors used to solve the problem, and machine characteristics such as processor speed, speed of communication channels, type of interconnection network, and routing techniques. An algorithm that yields good performance for a selected problem on a fixed number of processors on a given machine may perform poorly if any of these parameters are changed. Hence, the evaluation of a parallel algorithm on a parallel computer requires a more comprehensive analysis, and the study of scalability aids us in this analysis. The
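For orientation, the isoefficiency relation is usually derived as follows. The symbols are assumptions introduced here, since the abstract defines none of them: $W$ is the problem size measured as serial work, $p$ the number of processors, $T_P$ the parallel run time, $T_o(W,p)$ the total overhead summed over all processors, $S$ the speedup, and $E$ the efficiency.

```latex
\begin{align*}
T_P &= \frac{W + T_o(W,p)}{p}, \\
S   &= \frac{W}{T_P} = \frac{W\,p}{W + T_o(W,p)}, \\
E   &= \frac{S}{p} = \frac{1}{1 + T_o(W,p)/W}, \\
W   &= \frac{E}{1-E}\,T_o(W,p) = K\,T_o(W,p).
\end{align*}
```

The last line states how fast $W$ must grow with $p$ to hold $E$ constant; solving it for $W$ as a function of $p$ gives the isoefficiency function. As a commonly used illustration, assuming unit-cost additions and communication steps, summing $W$ numbers on $p$ processors has $T_o \approx 2p\log p$, which gives an isoefficiency function of $\Theta(p \log p)$.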