Empirical Analysis of Overheads in Cluster Environments
- CONCURRENCY: PRACTICE AND EXPERIENCE
, 1995
Cited by 23 (5 self)
In concurrent computing environments that are based on heterogeneous processing elements interconnected by general-purpose networks, several classes of overheads contribute to lowered performance. In an attempt to gain a deeper insight into the exact nature of these overheads, and to develop strategies to alleviate them, we have conducted empirical studies of selected applications representing different classes of concurrent programs. These analyses have identified load imbalance, the parallelism model adopted, communication delay and throughput, and system factors as the primary factors affecting performance in cluster environments. Based on the degree to which these factors affect specific classes of applications, we propose a combination of model selection criteria, partitioning strategies, and software system heuristics to reduce overheads and enhance performance in network based environments. We demonstrate that agenda parallelism and load balancing strategies contribu...
Exploiting Parallelism In Functional Languages: A "Paradigm-Oriented" Approach
- BOOK CHAPTER
, 1995
Cited by 22 (6 self)
Deriving parallelism automatically from functional programs is simple in theory but very few practical implementations have been realised. Programs may contain too little or too much parallelism causing a degradation in performance. Such parallelism could be more efficiently controlled if parallel algorithmic structures (or skeletons) are used in the design of algorithms. A structure captures the behaviour of a parallel programming paradigm and acts as a template in the design of an algorithm. This paper presents some important parallel programming paradigms and defines a structure for each of these paradigms. The iterative transformation paradigm (or geometric parallelism) is discussed in detail and a framework under which programs can be developed and transformed into efficient and portable implementations is presented.
In recent years, there has been a st...
Parallel Processing of Discrete Optimization Problems
- IN ENCYCLOPEDIA OF MICROCOMPUTERS
, 1993
Cited by 19 (6 self)
Discrete optimization problems (DOPs) arise in various applications such as planning, scheduling, computer aided design, robotics, game playing and constraint directed reasoning. Often, a DOP is formulated in terms of finding a (minimum cost) solution path in a graph from an initial node to a goal node and solved by graph/tree search methods such as branch-and-bound and dynamic programming. Availability of parallel computers has created substantial interest in exploring the use of parallel processing for solving discrete optimization problems. This article provides an overview of parallel search algorithms for solving discrete optimization problems.
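As a rough illustration of the graph formulation described above, the following sketch finds a minimum-cost path from an initial node to a goal node with a sequential best-first search. It is not one of the parallel algorithms the article surveys. The function name `min_cost_path`, the adjacency-list format, and the assumption of non-negative edge costs are choices made here for the example.

```python
import heapq


def min_cost_path(graph, start, goal):
    """Sequential best-first search for a minimum-cost path.

    graph maps each node to a list of (neighbor, edge_cost) pairs.
    Edge costs are assumed to be non-negative (an assumption of this sketch).
    Returns (cost, path) or None if the goal is unreachable.
    """
    frontier = [(0, start, [start])]   # priority queue ordered by path cost
    best = {start: 0}                  # cheapest known cost to each node
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue                   # stale entry: a cheaper route was found
        for neighbor, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            # Bound: only expand routes that improve on the best known cost.
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor, path + [neighbor]))
    return None


example = {
    "s": [("a", 1), ("b", 4)],
    "a": [("b", 2), ("g", 6)],
    "b": [("g", 1)],
}
print(min_cost_path(example, "s", "g"))
```

A parallel formulation would typically distribute the frontier across processors, which is where the load-balancing questions in such surveys arise.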
Scalability of Parallel Sorting on Mesh Multicomputers
, 1991
Cited by 18 (12 self)
This paper presents two new parallel algorithms QSP1 and QSP2 based on sequential quicksort for sorting data on a mesh multicomputer, and analyzes their scalability using the isoefficiency metric. We show that QSP2 matches the lower bound on the isoefficiency function for mesh multicomputers. The isoefficiency of QSP1 is also fairly close to optimal. Lang et al. and Schnorr et al. have developed parallel sorting algorithms for the mesh architecture that have either optimal (Schnorr) or close to optimal (Lang) run-time complexity for the one-element-per-processor case. Both QSP1 and QSP2 have worse performance than these algorithms for the one-element-per-processor case. But QSP1 and QSP2 have better scalability than the scaled-down variants of these algorithms (for the case in which there are more elements than processors). As a result, our new parallel formulations are better than these scaled-down variants in terms of speedup w.r.t. the best sequential algorithms. We also present a dif...
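For readers unfamiliar with the metric, the standard isoefficiency definitions can be written as below. They come from the general literature on scalability analysis and are not quoted from this abstract; the symbols W, p, T_p, and T_o are notation assumed here for illustration.

```latex
% W   : problem size (work done by the best sequential algorithm)
% p   : number of processors
% T_p : parallel run time on p processors
% T_o : total overhead summed over all processors
\begin{align*}
  T_o(W, p) &= p\,T_p - W \\
  E &= \frac{W}{p\,T_p} = \frac{1}{1 + T_o(W, p)/W} \\
  W &= \frac{E}{1 - E}\; T_o(W, p)
\end{align*}
```

Holding the efficiency E fixed, the last relation gives the rate at which W must grow with p. A slower-growing isoefficiency function indicates a more scalable algorithm.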
A Unified Infrastructure for Parallel Out-Of-Core Isosurface Extraction and Volume Rendering of Unstructured Grids
- Proc. IEEE Symposium on Parallel and Large-Data Visualization and Graphics
, 2001
Cited by 17 (3 self)
In this paper, we present a unified infrastructure for parallel out-of-core isosurface extraction and volume rendering of large unstructured grids on distributed-memory parallel machines. We parallelize the out-of-core isosurface extraction algorithm of [9] and the out-of-core ZSweep technique [17] for direct volume rendering, using the meta-cell technique as a unified underlying building block.
Advanced Compiler Optimizations for Sparse Computations
- Journal of Parallel and Distributed Computing
, 1995
Cited by 17 (3 self)
Regular data dependence checking on sparse codes usually results in very conservative estimates of actual dependences that will occur at run-time. Clearly, this is caused by the usage of compact data structures that are necessary to exploit sparsity in order to reduce storage requirements and computational time. However, if the compiler is presented with dense code and automatically converts it into code that operates on sparse data structures, then the dependence information obtained by analysis on the original code can be used to exploit potential concurrency in the generated code. In this paper we present synchronization generating and manipulating techniques that are based on this concept.
1 Introduction
Nowadays compiler support usually fails to optimize sparse codes because compact storage formats are used for sparse matrices in order to exploit sparsity with respect to storage requirements and computational time. This exploitation results in complicated code in which, for exam...
An improved Newton iteration for the generalized inverse of a matrix, with applications
- SIAM J. Sci. Stat. Comput
, 1991
Cited by 16 (5 self)
Logarithmic Time Cost Optimal Parallel Sorting is Not Yet Fast in Practice!
- Proc. Supercomputing 90, IEEE
, 1990
Cited by 13 (3 self)
When looking for new and faster parallel sorting algorithms for use in massively parallel systems it is tempting to investigate promising alternatives from the large body of research done on parallel sorting in the field of theoretical computer science. Such "theoretical" algorithms are mainly described for the PRAM (Parallel Random Access Machine) model of computation. This paper shows how this kind of investigation can be done on a simple but versatile environment for programming and measuring of PRAM algorithms. The practical value of Cole's Parallel Merge Sort algorithm has been investigated by comparing it with Batcher's bitonic sorting. The O(log n) time consumption of Cole's algorithm implies that it must be faster than bitonic sorting, which is O(log^2 n) time, if n is large enough. However, we have found that bitonic sorting is faster as long as n is less than 1.2 x 10^21, i.e. more than 1 Giga Tera items! Consequently, Cole's logarithmic time sorting algorithm is not fast in practice.
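To make the O(log^2 n) figure for bitonic sorting concrete, here is a minimal sequential simulation of Batcher's bitonic sorting network. It is a sketch written for this listing, not the PRAM implementation measured in the paper. The function name `bitonic_sort` and the stage counter are assumptions introduced here, and the input length is assumed to be a power of two.

```python
def bitonic_sort(values):
    """Simulate Batcher's bitonic sorting network on a list.

    The length must be a power of two (an assumption of this sketch).
    Returns (sorted_list, stages), where stages counts the parallel
    compare-exchange steps. Each step works on disjoint pairs, so with
    n/2 processors a step could run in constant time.
    """
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    stages = 0
    k = 2
    while k <= n:                # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:            # distance between the compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:  # handle each pair once
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            stages += 1
            j //= 2
        k *= 2
    return a, stages


data = [7, 3, 15, 0, 9, 12, 1, 8, 4, 14, 2, 11, 6, 13, 5, 10]
result, stages = bitonic_sort(data)
print(result, stages)
```

For n = 2^m the network has m(m+1)/2 compare-exchange stages, which is the source of the log^2 n term. Each stage is very simple, which is consistent with the small constant factor the abstract reports for bitonic sorting.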
Program Speedup in a Heterogeneous Computing Network
- Journal of Parallel and Distributed Computing
, 1994
Cited by 13 (0 self)
Program speedup is an important measure of the performance of an algorithm on a parallel machine. Of particular importance is the near linear or superlinear speedup exhibited by the most performance-efficient algorithms for a given system. We describe network and program models for heterogeneous networks, define notions of speedup and superlinear speedup, and observe that speedup consists of both heterogeneous and parallel components. We also consider the case of linear tasks, give a lower bound for the speedup, and show that there is no theoretical upper limit on heterogeneous speedup.
1 Introduction
Program speedup is a widely used measure of the performance of an algorithm on a multiprocessor or multicomputer. Although there are differing notions of the definition of speedup [7], speedup measurements are still almost universally quoted as proof of system efficiency. A common definition of speedup is that if machine M1 can solve problem P in time T1, and machine M2 solves the sa...

