Results 1-10 of 41
Analyzing Scalability of Parallel Algorithms and Architectures
Journal of Parallel and Distributed Computing, 1994
Cited by 90 (18 self)
The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithm-architecture combination for a problem under different constraints on the growth of the problem size and the number of processors. It may be used to predict the performance of a parallel algorithm and a parallel architecture for a large number of processors from the known performance on fewer processors. For a fixed problem size, it may be used to determine the optimal number of processors to be used and the maximum possible speedup that can be obtained. The objective of this paper is to critically assess the state of the art in the theory of scalability analysis, and motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures. We survey a number of techniques and formalisms t...
Scalable Problems and Memory-Bounded Speedup
1992
Cited by 53 (13 self)
In this paper three models of parallel speedup are studied. They are fixed-size speedup, fixed-time speedup and memory-bounded speedup. The latter two consider the relationship between speedup and problem scalability. Two sets of speedup formulations are derived for these three models. One set considers uneven workload allocation and communication overhead and gives more accurate estimation. Another set considers a simplified case and provides a clear picture on the impact of the sequential portion of an application on the possible performance gain from parallel processing. The simplified fixed-size speedup is Amdahl's law. The simplified fixed-time speedup is Gustafson's scaled speedup. The simplified memory-bounded speedup contains both Amdahl's law and Gustafson's scaled speedup as special cases. This study leads to a better understanding of parallel processing.
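The relationship among the three simplified models can be illustrated with a minimal sketch. It uses the commonly quoted textbook forms of the laws, not formulas taken from this abstract. The names `serial_fraction`, `p`, and the workload-growth function `g` are illustrative assumptions: `g(p)` is taken to be the factor by which the parallel workload grows when memory scales with `p` processors.

```python
def amdahl_speedup(serial_fraction, p):
    """Fixed-size speedup: the problem size stays constant as p grows."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


def gustafson_speedup(serial_fraction, p):
    """Fixed-time (scaled) speedup: the parallel work grows linearly with p."""
    return serial_fraction + (1.0 - serial_fraction) * p


def memory_bounded_speedup(serial_fraction, p, g):
    """Memory-bounded speedup with an assumed workload-growth function g(p)."""
    scaled_parallel_work = (1.0 - serial_fraction) * g(p)
    return (serial_fraction + scaled_parallel_work) / (
        serial_fraction + scaled_parallel_work / p
    )


if __name__ == "__main__":
    s, p = 0.1, 10
    print("Amdahl:   ", amdahl_speedup(s, p))
    print("Gustafson:", gustafson_speedup(s, p))
    # g(p) = 1 reproduces the fixed-size case; g(p) = p reproduces the fixed-time case.
    print("g(p) = 1: ", memory_bounded_speedup(s, p, lambda n: 1.0))
    print("g(p) = p: ", memory_bounded_speedup(s, p, lambda n: float(n)))
```

Under these assumptions, choosing `g(p) = 1` recovers the fixed-size formula and `g(p) = p` recovers the fixed-time formula, which is one way to read the "special cases" statement above.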
An Approach to Scalability Study of Shared Memory Parallel Systems
1994
Cited by 33 (18 self)
The overheads in a parallel system that limit its scalability need to be identified and separated in order to enable parallel algorithm design and the development of parallel machines. Such overheads may be broadly classified into two components. The first one is intrinsic to the algorithm and arises due to factors such as the work-imbalance and the serial fraction. The second one is due to the interaction between the algorithm and the architecture and arises due to latency and contention in the network. A top-down approach to scalability study of shared memory parallel systems is proposed in this research. We define the notion of overhead functions associated with the different algorithmic and architectural characteristics to quantify the scalability of parallel systems; we isolate the algorithmic overhead and the overheads due to network latency and contention from the overall execution time of an application; we design and implement an execution-driven simulation platform that incorporates these methods for quantifying the overhead functions; and we use this simulator to study the scalability characteristics of five applications on shared memory platforms with different communication topologies.
The Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines
1994
Cited by 24 (11 self)
This paper discusses the core factorization routines included in the ScaLAPACK library. These routines allow the factorization and solution of a dense system of linear equations via LU, QR, and Cholesky. They are implemented using a block cyclic data distribution, and are built using de facto standard kernels for matrix and vector operations (BLAS and its parallel counterpart PBLAS) and message passing communication (BLACS). In implementing the ScaLAPACK routines, a major objective was to parallelize the corresponding sequential LAPACK using the BLAS, BLACS, and PBLAS as building blocks, leading to straightforward parallel implementations without a significant loss in performance. We present the details of the implementation of the ScaLAPACK factorization routines, as well as performance and scalability results on the Intel iPSC/860, Intel Touchstone Delta, and Intel Paragon systems.
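The block cyclic data distribution mentioned above can be sketched in one dimension. This is an illustration of the general idea, not ScaLAPACK's actual interface: the function name and parameters are hypothetical, indices are 0-based, and the first block is assumed to live on process 0. A two-dimensional layout would apply the same mapping independently to rows and columns.

```python
def block_cyclic_owner(global_index, block_size, num_procs):
    """Map a 0-based global index to (owning process, local index).

    Hypothetical helper: blocks of `block_size` consecutive entries are
    dealt out to processes round-robin, starting with process 0.
    """
    block = global_index // block_size        # which block the entry falls in
    owner = block % num_procs                 # blocks are dealt out cyclically
    local_block = block // num_procs          # blocks this process received earlier
    local_index = local_block * block_size + global_index % block_size
    return owner, local_index


if __name__ == "__main__":
    # 12 entries, blocks of 2, 3 processes.
    for i in range(12):
        print(i, block_cyclic_owner(i, block_size=2, num_procs=3))
```

With blocks of 2 on 3 processes, entries 0-1 go to process 0, 2-3 to process 1, 4-5 to process 2, and entries 6-7 wrap around to process 0 again. The wrap-around is what keeps the load balanced as a factorization works through the matrix.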
Performance Properties of Large Scale Parallel Systems
Department of Computer Science, University of Minnesota, 1993
Cited by 23 (7 self)
There are several metrics that characterize the performance of a parallel system, such as, parallel execution time, speedup and efficiency. A number of properties of these metrics have been studied. For example, it is a well known fact that given a parallel architecture and a problem of a fixed size, the speedup of a parallel algorithm does not continue to increase with increasing number of processors. It usually tends to saturate or peak at a certain limit. Thus it may not be useful to employ more than an optimal number of processors for solving a problem on a parallel computer. This optimal number of processors depends on the problem size, the parallel algorithm and the parallel architecture. In this paper we study the impact of parallel processing overheads and the degree of concurrency of a parallel algorithm on the optimal number of processors to be used when the criterion for optimality is minimizing the parallel execution time. We then study a more general criterion of optimalit...
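The saturation effect described above can be made concrete with a small sketch. It assumes the common convention that parallel time is T_P = (W + T_o) / p, where W is the serial work and T_o is the total overhead summed over all processors. The overhead function `example_overhead` is purely hypothetical and is not the model analyzed in the paper. `max_p` stands in for the degree of concurrency.

```python
def parallel_time(work, p, overhead):
    """T_P = (W + T_o(W, p)) / p under the assumed convention."""
    return (work + overhead(work, p)) / p


def best_processor_count(work, max_p, overhead):
    """Brute-force search for the p in 1..max_p that minimizes T_P."""
    return min(range(1, max_p + 1), key=lambda p: parallel_time(work, p, overhead))


def example_overhead(work, p):
    # Hypothetical total overhead growing as p**2, so T_P = W/p + p.
    return float(p * p)


if __name__ == "__main__":
    work = 10_000
    p_opt = best_processor_count(work, 1000, example_overhead)
    print("time-optimal p:", p_opt)
    print("T_P at optimum:", parallel_time(work, p_opt, example_overhead))
    # A smaller degree of concurrency caps the useful processor count.
    print("capped at 64:  ", best_processor_count(work, 64, example_overhead))
```

With this particular overhead, T_P = W/p + p is minimized near p = sqrt(W). Adding processors beyond that point increases execution time, and a lower degree of concurrency simply caps the search range.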
A Simulation-based Scalability Study of Parallel Systems
Journal of Parallel and Distributed Computing, 1993
Cited by 21 (15 self)
Scalability studies of parallel architectures have used scalar metrics to evaluate their performance. Very often, it is difficult to glean the sources of inefficiency resulting from the mismatch between the algorithmic and architectural requirements using such scalar metrics. Low-level performance studies of the hardware are also inadequate for predicting the scalability of the machine on real applications. We propose a top-down approach to scalability study that alleviates some of these problems. We characterize applications in terms of the frequently occurring kernels, and their interaction with the architecture in terms of overheads in the parallel system. An overhead function is associated with the algorithmic characteristics as well as their interaction with the architectural features. We present a simulation platform called SPASM (Simulator for Parallel Architectural Scalability Measurements) that quantifies these overhead functions. SPASM separates the algorithmic overhead into ...
The Consequences of Fixed Time Performance Measurement
In Proceedings of the 25th Hawaii International Conference on System Sciences: Volume III, 1992
Cited by 19 (2 self)
In measuring performance of parallel computers, the usual method is to choose a problem and test execution time as the processor count is varied. This model underlies definitions of “speedup,” “efficiency,” and arguments against parallel processing such as Ware’s formulation of Amdahl’s law. Fixed time models use problem size as the figure of merit. Analysis and experiments based on fixed time instead of fixed size have yielded surprising consequences: The fixed time method does not reward slower processors with higher speedup; it predicts a new limit to speedup, more optimistic than Amdahl’s; it shows efficiency independent of processor speed and ensemble size; it sometimes gives nonspurious superlinear speedup; it provides a practical means (SLALOM) of comparing computers of widely varying speeds without distortion.
A Domain Decomposition Preconditioner for a Parallel Finite Element Solver on Distributed Unstructured Grids
Parallel Computing, 1995
Cited by 15 (11 self)
We consider a number of practical issues associated with the parallel distributed memory solution of elliptic partial differential equations using unstructured meshes in two dimensions. The first part of the paper describes a parallel mesh generation algorithm which is designed both for efficiency and to produce a well-partitioned, distributed mesh, suitable for the efficient parallel solution of an elliptic p.d.e. The second part of the paper concentrates on parallel domain decomposition preconditioning for the linear algebra problems which arise when solving such a p.d.e. on the unstructured meshes that we generate. It is demonstrated that by allowing the mesh generator and the p.d.e. solver to share a certain coarse grid structure we are able to obtain efficient parallel solutions to a number of large problems. Although the work is presented here in a finite element context, the issues of mesh generation and domain decomposition are not of course strictly dependent upon this particu...
Computer Vision Algorithms on Reconfigurable Logic Arrays
IEEE Trans. on Parallel and Distributed Systems, 1999
Cited by 15 (1 self)
Computer vision algorithms are natural candidates for high performance computing due to their inherent parallelism and intense computational demands. For example, a simple 3 x 3 convolution on a 512 x 512 gray scale image at 30 frames per second requires 67.5 million multiplications and 60 million additions to be performed in one second. Computer vision tasks can be classified into three categories based on their computational complexity and communication complexity: low-level, intermediate-level and high-level. Special-purpose hardware provides better performance compared to a general-purpose hardware for all the three levels of vision tasks. With recent advances in very large scale integration (VLSI) technology, an application specific integrated circuit (ASIC) can provide the best performance in terms of total execution time. However, long design cycle time, high development cost and inflexibility of a dedicated hardware deter design of ASICs. In contrast, field programmable gate arrays (FPGAs) support lower design verification time and easier design adaptability at a lower cost. Hence, FPGAs with an array of reconfigurable logic blocks can be very useful compute elements. FPGA-based custom computing machines are