Results 1  10
of
17
Scalable Load Balancing Techniques for Parallel Computers
, 1994
"... In this paper we analyze the scalability of a number of load balancing algorithms which can be applied to problems that have the following characteristics : the work done by a processor can be partitioned into independent work pieces; the work pieces are of highly variable sizes; and it is not po ..."
Abstract

Cited by 120 (16 self)
 Add to MetaCart
In this paper we analyze the scalability of a number of load balancing algorithms which can be applied to problems that have the following characteristics : the work done by a processor can be partitioned into independent work pieces; the work pieces are of highly variable sizes; and it is not possible (or very difficult) to estimate the size of total work at a given processor. Such problems require a load balancing scheme that distributes the work dynamically among different processors. Our goal here is to determine the most scalable load balancing schemes for different architectures such as hypercube, mesh and network of workstations. For each of these architectures, we establish lower bounds on the scalability of any possible load balancing scheme. We present the scalability analysis of a number of load balancing schemes that have not been analyzed before. This gives us valuable insights into their relative performance for different problem and architectural characteristi...
Analyzing Scalability of Parallel Algorithms and Architectures
 Journal of Parallel and Distributed Computing
, 1994
"... The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithmarchitecture combination for a problem under different constraints on the growth of ..."
Abstract

Cited by 98 (19 self)
 Add to MetaCart
(Show Context)
The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithmarchitecture combination for a problem under different constraints on the growth of the problem size and the number of processors. It may be used to predict the performance of a parallel algorithm and a parallel architecture for a large number of processors from the known performance on fewer processors. For a fixed problem size, it may be used to determine the optimal number of processors to be used and the maximum possible speedup that can be obtained. The objective of this paper is to critically assess the state of the art in the theory of scalability analysis, and motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures. We survey a number of techniques and formalisms t...
A simple load balancing scheme for task allocation in parallel machines
 In Proc. Symposium on Parallel Algorithms and Architectures (SPAA) (1991), ACM
"... A collection of local workpiles (task queues) and a simple load balancing scheme is well suited for scheduling tasks in shared memory parallel machines. Task scheduling on such machines has usually been done through a single, globally accessible, workpile. The scheme introduced in this paper achieve ..."
Abstract

Cited by 83 (4 self)
 Add to MetaCart
(Show Context)
A collection of local workpiles (task queues) and a simple load balancing scheme is well suited for scheduling tasks in shared memory parallel machines. Task scheduling on such machines has usually been done through a single, globally accessible, workpile. The scheme introduced in this paper achieves a balancing comparable to that of a global workpile, while minimizing the overheads. In many parallel computer architectures, each processor has some memory that it can access more efficiently, and so it is desirable that tasks do not mirgrate frequently. The load balancing is simple and distributed: Whenever a processor accesses its local workpile, it performs a balancing operation with probability inversely proportional to the size of its workpile. The balancing operation consists of examining the workpile of a random processor and exchanging tasks so as to equalize the size of the two workpiles. The probabilistic analysis of the performance of the load balancing scheme proves that each tasks in the system receives its fair share of computation time. Specifically, the expected size of each local task queue is within a small constant factor of the average, i.e. total number of tasks in the system divided by the number of processors. 1
Scalability of parallel algorithms for the allpairs shortest path problem
 in the Proceedings of the International Conference on Parallel Processing
, 1991
"... Abstract This paper uses the isoefficiency metric to analyze the scalability of several parallel algorithms for finding shortest paths between all pairs of nodes in a densely connected graph. Parallel algorithms analyzed in this paper have either been previously presented elsewhere or are small vari ..."
Abstract

Cited by 39 (14 self)
 Add to MetaCart
Abstract This paper uses the isoefficiency metric to analyze the scalability of several parallel algorithms for finding shortest paths between all pairs of nodes in a densely connected graph. Parallel algorithms analyzed in this paper have either been previously presented elsewhere or are small variations of them. Scalability is analyzed with respect to mesh, hypercube and sharedmemory architectures. We demonstrate that isoefficiency functions are a compact and useful predictor of performance. In fact, previous comparative predictions of some of the algorithms based on experimental results are shown to be incorrect whereas isoefficiency functions predict correctly. We find the classic tradeoffs of hardware cost vs. time and memory vs. time to be represented here as tradeoffs of hardware cost vs. scalability and memory vs. scalability.
Scalability of Parallel Algorithms for Matrix Multiplication
 in Proc. of Int. Conf. on Parallel Processing
, 1991
"... A number of parallel formulations of dense matrix multiplication algorithm have been developed. For arbitrarily large number of processors, any of these algorithms or their variants can provide near linear speedup for sufficiently large matrix sizes and none of the algorithms can be clearly claimed ..."
Abstract

Cited by 27 (0 self)
 Add to MetaCart
(Show Context)
A number of parallel formulations of dense matrix multiplication algorithm have been developed. For arbitrarily large number of processors, any of these algorithms or their variants can provide near linear speedup for sufficiently large matrix sizes and none of the algorithms can be clearly claimed to be superior than the others. In this paper we analyze the performance and scalability of a number of parallel formulations of the matrix multiplication algorithm and predict the conditions under which each formulation is better than the others. We present a parallel formulation for hypercube and related architectures that performs better than any of the schemes described in the literature so far for a wide range of matrix sizes and number of processors. The superior performance and the analytical scalability expressions for this algorithm are verified through experiments on the Thinking Machines Corporation's CM5 TM y parallel computer for up to 512 processors. We show that special har...
Scalability of Parallel Sorting on Mesh Multicomputers
, 1991
"... This paper presents two new parallel algorithms QSP1 and QSP2 based on sequential quicksort for sorting data on a mesh multicomputer, and analyzes their scalability using the isoefficiency metric. We show that QSP2 matches the lower bound on the isoefficiency function for mesh multicomputers. The is ..."
Abstract

Cited by 19 (13 self)
 Add to MetaCart
This paper presents two new parallel algorithms QSP1 and QSP2 based on sequential quicksort for sorting data on a mesh multicomputer, and analyzes their scalability using the isoefficiency metric. We show that QSP2 matches the lower bound on the isoefficiency function for mesh multicomputers. The isoefficiency of QSP1 is also fairly close to optimal. Lang et al. and Schnorr et al. have developed parallel sorting algorithms for the mesh architecture that have either optimal (Schnorr) or close to optimal (Lang) runtime complexity for the oneelementperprocessor case. Both QSP1 and QSP2 have worse performance than these algorithms for the oneelementperprocessor case. But QSP1 and QSP2 have better scalability than the scaleddown variants of these algorithms (for the case in which there are more elements than processors). As a result, our new parallel formulations are better than these scaleddown variants in terms of speedup w.r.t the best sequential algorithms. We also present a dif...
Scalable Load Balancing Strategies for Parallel A* Algorithms
 Journal of Parallel and Distributed Computing
, 1994
"... In this paper, we develop load balancing strategies for scalable highperformance parallel A* algorithms suitable for distributedmemory machines. In parallel A* search, inefficiencies such as processor starvation and search of nonessential spaces (search spaces not explored by the sequential algori ..."
Abstract

Cited by 17 (5 self)
 Add to MetaCart
(Show Context)
In this paper, we develop load balancing strategies for scalable highperformance parallel A* algorithms suitable for distributedmemory machines. In parallel A* search, inefficiencies such as processor starvation and search of nonessential spaces (search spaces not explored by the sequential algorithm) grow with the number of processors P used, thus restricting its scalability. To alleviate this effect, we propose a novel parallel startup phase and an efficient dynamic load balancing strategy called the quality equalizing (QE) strategy. Our new parallel startup scheme executes optimally in \Theta(logP ) time and, in addition, achieves good initial load balance. The QE strategy possesses certain unique quantitative and qualitative load balancing properties that enable it to significantly reduce starvation and nonessential work. Consequently, we obtain a highly scalable parallel A* algorithm with an almostlinear speedup. The startup and load balancing schemes were employed in parallel ...
Analysis and Design of Scalable Parallel Algorithms for Scientific Computing
, 1995
"... This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers and the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better ..."
Abstract

Cited by 10 (5 self)
 Add to MetaCart
This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers and the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better parallel algorithms. We present some new highly scalable parallel algorithms for sparse matrix computations that were widely considered to be poorly suitable for large scale parallel computers. We present some laws governing the performance and scalability properties that apply to all parallel systems. We show that our results generalize or extend a range of earlier research results concerning the performance of parallel systems. Our scalability analysis of algorithms such as fast Fourier transform (FFT), dense matrix multiplication, sparse matrixvector multiplication, and the preconditioned conjugate gradient (PCG) provides many interesting insights into their behavior on parallel computer...
Parallel A* Algorithms and their Performance on Hypercube Multiprocessors
, 1993
"... In this paper we develop parallel A* algorithms suitable for distributedmemory machines. In parallel A* algorithms, inefficiencies grow with the number of processors P used, causing performance to drop significantly at lower and intermediate work densities (the ratio of the problem size to P ). To ..."
Abstract

Cited by 9 (3 self)
 Add to MetaCart
In this paper we develop parallel A* algorithms suitable for distributedmemory machines. In parallel A* algorithms, inefficiencies grow with the number of processors P used, causing performance to drop significantly at lower and intermediate work densities (the ratio of the problem size to P ). To alleviate this effect, we propose a novel parallel startup phase and efficient dynamic work distribution strategies, and thus improve the scalability of parallel A* search. We also tackle the problem of duplicate searching by different processors, by using work transfer as a means to partial duplicate pruning. The parallel startup scheme proposed requires only \Theta(logP ) time compared to \Theta(P ) time for sequential startup methods used in the past. Using the Traveling Salesman Problem (TSP) as our test case, we see that our work distribution strategies yield speedup improvements of more than 30% and 15% at lower and intermediate work densities, respectively, while requiring 20% to 45%...