Results 11 – 20 of 191
Fast Parallel Sorting under LogP: from theory to practice
, 1994
Abstract

Cited by 27 (4 self)
The LogP model characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). In this paper, we analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort) under LogP. We develop implementations of these algorithms in a parallel extension to C and compare the actual performance on a CM5 of 32 to 512 processors with that predicted by LogP using parameter values for this machine. Our experience was that the model served as a valuable guide throughout the development of the fast parallel sorts and revealed subtle defects in the implementations. The final observed performance matches closely with the prediction across a broad range of problem and machine sizes.
1 INTRODUCTION. Fast sorting is important in a wide variety of practical applications, is interesting to study from a theoretical viewpoint, and offers a wealth of novel parallel solutions ...
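The four LogP parameters compose into simple additive cost formulas. A minimal sketch of LogP-style accounting for a burst of fixed-size messages (the helper name and parameter values are illustrative, not taken from the paper):

```python
def logp_send_time(n_msgs, L, o, g):
    """Time for one processor to push n_msgs fixed-size messages to a peer
    under LogP: injections are spaced by max(g, o); the last message lands
    L time units after injection and costs o to receive."""
    if n_msgs == 0:
        return 0
    inject = o + (n_msgs - 1) * max(g, o)   # sender-side injection
    return inject + L + o                   # wire latency + receiver overhead
```

Under this model a single message costs the familiar 2o + L, and long bursts are limited by max(g, o) per message; terms of this shape are the building blocks of the sorting analyses the abstract describes.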
Practical Parallel Algorithms for Dynamic Data Redistribution, Median Finding, and Selection (Extended Abstract)
, 1996
Abstract

Cited by 26 (10 self)
David A. Bader and Joseph JáJá, Institute for Advanced Computer Studies and Department of Electrical Engineering, University of Maryland, College Park, MD 20742. Email: {dbader, joseph}@umiacs.umd.edu
A common statistical problem is that of finding the median element in a set of data. This paper presents a fast and portable parallel algorithm for finding the median given a set of elements distributed across a parallel machine. In fact, our algorithm solves the general selection problem that requires the determination of the element of rank i, for an arbitrarily given integer i. Practical algorithms needed by our selection algorithm for the dynamic redistribution of data are also discussed. Our general framework is a distributed memory programming model enhanced by a set of communication primitives. We use efficient techniques for distributing, coalescing, and load balancing data as well as efficient combinations of task and data parallelism. The algorithms have been coded in Split-C and run on a variety of platforms, including the Thinking Machines CM5, IBM SP1 and SP2, Cray Research T3D, Meiko Scientific CS-2, Intel Paragon, and workstation clusters. Our experimental results illustrate the scalability and efficiency of our algorithms across different platforms and improve upon all the related experimental results known to the authors.
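The flavor of distributed selection can be caricatured on one machine by treating each processor's local data as a list and exchanging only a pivot plus per-list counts each round. A rough sketch (the function and the median-of-local-medians pivot choice are illustrative, not the paper's exact algorithm):

```python
def distributed_select(chunks, i):
    """Rank-i selection (0-based) over data partitioned into `chunks`, each
    chunk standing in for one processor's local data. Per round, only a
    pivot and per-chunk counts would need to cross the network."""
    while True:
        # Each "processor" reports its local median; pivot = median of those.
        medians = [sorted(c)[len(c) // 2] for c in chunks if c]
        pivot = sorted(medians)[len(medians) // 2]
        lt = sum(sum(1 for x in c if x < pivot) for c in chunks)
        eq = sum(sum(1 for x in c if x == pivot) for c in chunks)
        if i < lt:                                   # answer is below pivot
            chunks = [[x for x in c if x < pivot] for c in chunks]
        elif i < lt + eq:                            # pivot has rank i
            return pivot
        else:                                        # answer is above pivot
            chunks = [[x for x in c if x > pivot] for c in chunks]
            i -= lt + eq
```

Each round discards at least the pivot, so the loop terminates; the real algorithm's contribution is doing the counting and redistribution steps efficiently on actual machines.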
Optimal Communication Algorithms On Star Graphs Using Spanning Tree Constructions
 Journal of Parallel and Distributed Computing
, 1993
Abstract

Cited by 25 (4 self)
In this paper we consider three fundamental communication problems on the star interconnection network: the problem of simultaneous broadcasting of the same message from every node to all other nodes, or multinode broadcasting, the problem of a single node sending distinct messages to each one of the other nodes, or single node scattering, and finally the problem of each node sending distinct messages to every other node, or total exchange. All of these problems are studied under two different assumptions: the assumption that each node can transmit a message of fixed length to one of its neighbors and simultaneously it can receive a message of fixed length from one of its neighbors (not necessarily the same one) at each time step, or single link availability (SLA), and the assumption that each node can exchange messages of fixed length with all of its neighbors at each time step, or multiple link availability (MLA). In both cases communication is assumed to be bidirectional. The cases ...
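For readers unfamiliar with the topology: the star graph S_n has the n! permutations of n symbols as nodes, with an edge wherever two permutations differ by swapping the first symbol with the symbol in some other position. A small sketch of this adjacency (the helper name is ours):

```python
def star_neighbors(perm):
    """Neighbors of node `perm` in the star graph S_n: swap the symbol in
    position 0 with the symbol in position i, for each i = 1 .. n-1.
    Every node therefore has degree n-1."""
    out = []
    for i in range(1, len(perm)):
        q = list(perm)
        q[0], q[i] = q[i], q[0]
        out.append(tuple(q))
    return out
```

For example, S_3 has 3! = 6 nodes of degree 2 and is a single 6-cycle; spanning trees of such graphs are what the paper's broadcast and scatter schedules are built on.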
Software Cache Coherence for Large Scale Multiprocessors
 In Proceedings of the First International Symposium on High Performance Computer Architecture
, 1995
Abstract

Cited by 24 (9 self)
Shared memory is an appealing abstraction for parallel programming. It must be implemented with caches in order to perform well, however, and caches require a coherence mechanism to ensure that processors reference current data. Hardware coherence mechanisms for large-scale machines are complex and costly, but existing software mechanisms for message-passing machines have not provided a performance-competitive solution. We claim that an intermediate hardware option, memory-mapped network interfaces that support a global physical address space, can provide most of the performance benefits of hardware cache coherence. We present a software coherence protocol that runs on this class of machines and greatly narrows the performance gap between hardware and software coherence. We compare the performance of the protocol to that of existing software and hardware alternatives and evaluate the tradeoffs among various cache-write policies. We also observe that simple program changes can greatly improve performance. For the programs in our test suite and with the changes in place, software coherence is often faster and never more than 55% slower than hardware coherence.
Performance Properties of Large Scale Parallel Systems
 Department of Computer Science, University of Minnesota
, 1993
Abstract

Cited by 23 (7 self)
There are several metrics that characterize the performance of a parallel system, such as parallel execution time, speedup, and efficiency. A number of properties of these metrics have been studied. For example, it is a well known fact that given a parallel architecture and a problem of a fixed size, the speedup of a parallel algorithm does not continue to increase with increasing number of processors. It usually tends to saturate or peak at a certain limit. Thus it may not be useful to employ more than an optimal number of processors for solving a problem on a parallel computer. This optimal number of processors depends on the problem size, the parallel algorithm and the parallel architecture. In this paper we study the impact of parallel processing overheads and the degree of concurrency of a parallel algorithm on the optimal number of processors to be used when the criterion for optimality is minimizing the parallel execution time. We then study a more general criterion of optimality ...
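The saturation effect described above is easy to reproduce with a toy model T_p = W/p + T_o(p): ideal work splitting plus a processor-dependent overhead. A sketch under an assumed linear-overhead model (all names and constants are illustrative, not the paper's):

```python
def parallel_time(p, work, overhead):
    """Toy model: T_p = W/p + T_o(p), perfect split plus total overhead."""
    return work / p + overhead(p)

def optimal_p(work, overhead, p_max):
    """Processor count minimizing T_p, by direct search (ties -> smaller p)."""
    return min(range(1, p_max + 1),
               key=lambda p: (parallel_time(p, work, overhead), p))
```

With W = 1000 and overhead(p) = p, T_p is minimized near sqrt(1000), roughly p = 32; beyond that point adding processors actually lengthens the run, which is exactly the peak-then-decline behavior the abstract refers to.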
Performance Evaluation of the CM5 Interconnection Network
 In Proceedings of CompCon Spring'93
, 1993
Abstract

Cited by 23 (0 self)
This paper presents performance characteristics of the CM5 data network with respect to bandwidth and latency. We present the maximum effective network bandwidth as a function of (1) message size, (2) varying data path through different levels of the network hierarchy, and (3) system load. We also present latency as a function of message size and network hierarchy. In the second part of our evaluation, we present the maximum effective network bandwidth as a function of data flow pattern (2d and 3d mesh, stencils, ring, tree, and hypercube) using several variant embedding schemes. Measurements were obtained using CMMD V 1.3.1, Thinking Machines Corporation's message-passing library, and CMNF V 1.1, a fast message-passing library developed by the Minnesota Supercomputer Center (MSC) for the Army High Performance Computing Research Center (AHPCRC). Using CMNF we were able to observe measurements which were very close to actual hardware capabilities.
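Effective bandwidth as a function of message size typically follows a startup-plus-per-byte cost model. A hedged sketch with illustrative constants (these are not measured CM5 values):

```python
def effective_bandwidth(m, t_s=50e-6, t_w=0.1e-6):
    """Bytes per second delivered for an m-byte message under the linear
    model T(m) = t_s + m * t_w (startup time plus per-byte time).
    Constants are illustrative, not measurements."""
    return m / (t_s + m * t_w)
```

Small messages are startup-dominated, so measured bandwidth climbs with message size and approaches the asymptote 1/t_w; curves of this shape are what such network evaluations typically report.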
Scalability of Parallel Algorithms for Matrix Multiplication
 in Proc. of Int. Conf. on Parallel Processing
, 1991
Abstract

Cited by 22 (0 self)
A number of parallel formulations of the dense matrix multiplication algorithm have been developed. For an arbitrarily large number of processors, any of these algorithms or their variants can provide near linear speedup for sufficiently large matrix sizes, and none of the algorithms can be clearly claimed to be superior to the others. In this paper we analyze the performance and scalability of a number of parallel formulations of the matrix multiplication algorithm and predict the conditions under which each formulation is better than the others. We present a parallel formulation for hypercube and related architectures that performs better than any of the schemes described in the literature so far for a wide range of matrix sizes and number of processors. The superior performance and the analytical scalability expressions for this algorithm are verified through experiments on the Thinking Machines Corporation's CM5™ parallel computer for up to 512 processors. We show that special hardware ...
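A textbook-style cost model for one classic 2D formulation (Cannon-type block shifts on a sqrt(p) x sqrt(p) grid) illustrates the kind of scalability analysis involved; the exact expression and constants here are illustrative, not the paper's:

```python
import math

def cannon_time(n, p, t_c=1.0, t_s=50.0, t_w=1.0):
    """Modeled time for an n x n matrix multiply on p processors in a
    sqrt(p) x sqrt(p) grid with Cannon-style shifts: n^3/p computation
    plus ~2*sqrt(p) shift steps, each moving an (n^2/p)-word block."""
    comp = n ** 3 * t_c / p
    comm = 2 * math.sqrt(p) * (t_s + t_w * n * n / p)
    return comp + comm

def efficiency(n, p):
    """Speedup per processor relative to the n^3 serial time."""
    return (n ** 3) / (p * cannon_time(n, p))
```

At a fixed processor count, efficiency rises as the matrices grow, since communication scales as n^2 while computation scales as n^3; comparing such expressions across formulations is how one predicts which scheme wins for a given (n, p) regime.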
Towards Parallel Programming by Transformation: The FAN Skeleton Framework
, 2001
Abstract

Cited by 21 (10 self)
A Functional Abstract Notation (FAN) is proposed for the specification and design of parallel algorithms by means of skeletons: high-level patterns with parallel semantics. The main weakness of the current programming systems based on skeletons is that the user is still responsible for finding the most appropriate skeleton composition for a given application and a given parallel architecture. We describe a transformational framework for the development of skeletal programs which is aimed at filling this gap. The framework makes use of transformation rules which are semantic equivalences among skeleton compositions. For a given problem, an initial, possibly inefficient skeleton specification is refined by applying a sequence of transformations. Transformations are guided by a set of performance prediction models which forecast the behavior of each skeleton and the performance benefits of different rules. The design process is supported by a graphical tool which locates applicable transformations and provides performance estimates, thereby helping the programmer in navigating through the program refinement space. We give an overview of the FAN framework and exemplify its use with performance-directed program derivations for simple case studies. Our experience can be viewed as a first feasibility study of methods and tools for transformational, performance-directed parallel programming using skeletons.
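One archetypal transformation rule of the kind such frameworks apply is map fusion, map f . map g = map (f . g), which turns two data-parallel passes into one. A minimal sketch (FAN is a functional notation; Python merely stands in here, and the names are ours):

```python
def compose(f, g):
    """Function composition: (f . g)(x) = f(g(x))."""
    return lambda x: f(g(x))

def map_skel(f, xs):
    """The map skeleton: apply f to every element (data-parallel semantics)."""
    return [f(x) for x in xs]

inc = lambda x: x + 1
sq = lambda x: x * x

xs = [1, 2, 3, 4]
two_pass = map_skel(sq, map_skel(inc, xs))   # two traversals
fused = map_skel(compose(sq, inc), xs)       # one traversal, same result
assert two_pass == fused == [4, 9, 16, 25]
```

Because the two sides are semantically equivalent, a transformation engine is free to pick whichever form a cost model predicts is cheaper on the target machine.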
Noncommittal Barrier Synchronization
 Parallel Computing
, 1995
Abstract

Cited by 20 (8 self)
Barrier synchronization is a fundamental operation in parallel computation. In many contexts, at the point a processor enters a barrier it knows that it has already processed all work required of it prior to the synchronization. This paper treats the alternative case, when a processor cannot enter a barrier with the assurance that it has already performed all necessary pre-synchronization computation. The problem arises when the number of pre-synchronization messages to be received by a processor is unknown, for example, in a parallel discrete simulation or any other computation that is largely driven by an unpredictable exchange of messages. We describe an optimistic O(log² P) barrier algorithm for such problems, study its performance on a large-scale parallel system, and consider extensions to general associative reductions, as well as associative parallel prefix computations.
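A conventional (committed) log-depth building block in this space is the dissemination barrier. A sketch of its partner schedule (this is the standard algorithm, not the paper's optimistic variant, and the names are ours):

```python
def dissemination_rounds(P):
    """Partner schedule for a dissemination barrier on P processors: in
    round k, processor i signals processor (i + 2**k) mod P. After
    ceil(log2 P) rounds every processor has, transitively, heard from
    all others, so all may safely proceed."""
    rounds, k = [], 0
    while (1 << k) < P:
        rounds.append([(i, (i + (1 << k)) % P) for i in range(P)])
        k += 1
    return rounds
```

Each sweep costs O(log P) rounds; one way to read the paper's O(log² P) bound is as repeatedly running log-depth combining sweeps until the message counts stabilize, though that reading is ours rather than the abstract's.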