Results 1-10 of 57
Multi-Protocol Active Messages on a Cluster of SMPs
 Proc. Supercomputing '97, IEEE CS
, 1997
"... Clusters of multiprocessors, or Clumps, promise to be the supercomputers of the future, but obtaining high performance on these architectures requires an understanding of interactions between the multiple levels of interconnection. In this paper,wepresent the rst multiprotocol implementation of a l ..."
Abstract

Cited by 75 (2 self)
Clusters of multiprocessors, or Clumps, promise to be the supercomputers of the future, but obtaining high performance on these architectures requires an understanding of interactions between the multiple levels of interconnection. In this paper, we present the first multi-protocol implementation of a lightweight message layer, a version of Active Messages II, running on a cluster of Sun Enterprise 5000 servers connected with Myrinet. This research brings together several pieces of high-performance interconnection technology: bus backplanes for symmetric multiprocessors, low-latency networks for connections between machines, and simple, user-level primitives for communication. The paper describes the shared-memory message-passing protocol and analyzes the multi-protocol implementation with both microbenchmarks and Split-C applications. Three aspects of the communication layer are critical to performance: the overhead of cache-coherence mechanisms, the method of managing concurrent access, and the cost of accessing state with the slower protocol. Through the use of an adaptive polling strategy, the multi-protocol implementation limits performance interactions between the protocols, delivering up to 160 MB/s of bandwidth with 3.6 microsecond end-to-end latency. Applications within an SMP benefit from this fast communication, running up to 75% faster than on a network of uniprocessor workstations. Applications running on the entire Clump are limited by the balance of NICs to processors in our system, and are typically slower than on the NOW. These results illustrate several potential pitfalls for the Clumps architecture.
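The "adaptive polling strategy" mentioned above can be sketched roughly as follows: poll the cheap shared-memory queue on every call, and the expensive network (NIC) queue only every so often, backing off when the network is idle. This is an illustrative toy (class and field names are invented, not the paper's API), not the actual Active Messages implementation:

```python
from collections import deque

class AdaptivePoller:
    """Toy adaptive poller: the cheap shared-memory queue is checked on
    every call, the expensive network queue only every `interval` calls.
    The interval shrinks to 1 when network messages arrive and doubles
    (up to a cap) when the network is idle, limiting the extent to which
    the slow protocol's polling cost taxes the fast one."""

    def __init__(self, max_interval=16):
        self.smp_queue = deque()   # messages from processors on this SMP
        self.net_queue = deque()   # messages arriving via the NIC
        self.max_interval = max_interval
        self.interval = 1          # current network polling interval
        self.ticks = 0

    def poll(self):
        delivered = []
        if self.smp_queue:                      # always poll shared memory
            delivered.append(self.smp_queue.popleft())
        self.ticks += 1
        if self.ticks >= self.interval:         # occasionally poll the NIC
            self.ticks = 0
            if self.net_queue:
                delivered.append(self.net_queue.popleft())
                self.interval = 1               # traffic seen: poll eagerly
            else:
                self.interval = min(self.interval * 2, self.max_interval)
        return delivered
```

The back-off keeps an SMP-local communication pattern from paying the NIC polling cost on every operation, which is the kind of protocol interaction the abstract says must be limited.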
Parallel Execution of Prolog Programs: A Survey
"... Since the early days of logic programming, researchers in the field realized the potential for exploitation of parallelism present in the execution of logic programs. Their highlevel nature, the presence of nondeterminism, and their referential transparency, among other characteristics, make logic ..."
Abstract

Cited by 70 (23 self)
Since the early days of logic programming, researchers in the field realized the potential for exploitation of parallelism present in the execution of logic programs. Their high-level nature, the presence of non-determinism, and their referential transparency, among other characteristics, make logic programs interesting candidates for obtaining speedups through parallel execution. At the same time, the fact that the typical applications of logic programming frequently involve irregular computations, make heavy use of dynamic data structures with logical variables, and involve search and speculation, makes the techniques used in the corresponding parallelizing compilers and runtime systems potentially interesting even outside the field. The objective of this paper is to provide a comprehensive survey of the issues arising in parallel execution of logic programming languages along with the most relevant approaches explored to date in the field. Focus is mostly given to the challenges emerging from the parallel execution of Prolog programs. The paper describes the major techniques used for shared-memory implementation of Or-parallelism, And-parallelism, and combinations of the two. We also explore some related issues, such as memory ...
A Fast, Parallel Spanning Tree Algorithm for Symmetric Multiprocessors (SMPs) (Extended Abstract)
, 2004
"... Our study in this paper focuses on implementing parallel spanning tree algorithms on SMPs. Spanning tree is an important problem in the sense that it is the building block for many other parallel graph algorithms and also because it is representative of a large class of irregular combinatorial probl ..."
Abstract

Cited by 45 (13 self)
Our study in this paper focuses on implementing parallel spanning tree algorithms on SMPs. Spanning tree is an important problem in the sense that it is the building block for many other parallel graph algorithms and also because it is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but often have no known efficient parallel implementations. In this paper we present a new randomized algorithm and implementation with superior performance that for the first time achieves parallel speedup on arbitrary graphs (both regular and irregular topologies) when compared with the best sequential implementation for finding a spanning tree. This new algorithm uses several techniques to give an expected running time that scales linearly with the number p of processors for suitably large inputs (n > p^2). As the spanning tree problem is notoriously hard for any parallel implementation to achieve reasonable speedup, our study may shed new light on implementing PRAM algorithms for shared-memory parallel computers. The main results of this paper are: (1) a new and practical spanning tree algorithm for symmetric multiprocessors that exhibits parallel speedups on graphs with regular and irregular topologies; and (2) an experimental study of parallel spanning tree algorithms that reveals the superior performance of our new approach compared with the previous algorithms. The source code for these algorithms is freely available from our web site hpc.ece.unm.edu.
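The "best sequential implementation" that parallel spanning-tree codes are typically measured against is essentially a single graph traversal; a minimal sketch of that baseline (the representation and function name are our own, not the paper's):

```python
from collections import deque

def spanning_tree(n, adj):
    """BFS spanning tree of a connected graph with vertices 0..n-1.
    `adj` maps each vertex to its neighbour list. Returns parent[],
    with parent[root] == root; the edges (v, parent[v]) for v != root
    form the spanning tree. Runs in O(n + m) time, the sequential bar
    a parallel implementation must beat to show genuine speedup."""
    parent = [-1] * n
    parent[0] = 0                 # vertex 0 is the root
    q = deque([0])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if parent[v] == -1:   # first visit: (u, v) is a tree edge
                parent[v] = u
                q.append(v)
    return parent
```

The difficulty the abstract alludes to is that this traversal is so cheap and cache-friendly that naive parallelizations, with their synchronization and contention overheads, routinely lose to it.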
Bandwidth-efficient Collective Communication for Clustered Wide Area Systems
 In Proc. International Parallel and Distributed Processing Symposium (IPDPS 2000), Cancun
, 1999
"... Metacomputing infrastructures couple multiple clusters (or MPPs) via widearea networks and thus allow parallel programs to run on geographically distributed resources. A major problem in programming such widearea parallel applications is the difference in communication costs inside and between clu ..."
Abstract

Cited by 34 (3 self)
Metacomputing infrastructures couple multiple clusters (or MPPs) via wide-area networks and thus allow parallel programs to run on geographically distributed resources. A major problem in programming such wide-area parallel applications is the difference in communication costs inside and between clusters. Latency and bandwidth of WANs often are orders of magnitude worse than those of local networks. Our MagPIe library eases wide-area parallel programming by providing an efficient implementation of MPI's collective communication operations. MagPIe exploits the hierarchical structure of clustered wide-area systems and minimizes the communication overhead over the WAN links. In this paper, we present improved algorithms for collective communication that achieve shorter completion times by simultaneously using the aggregate bandwidth of the available wide-area links. Our new algorithms split messages into multiple segments that are sent in parallel over different WAN links, thus resulting ...
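The core segmenting idea, splitting one large message into roughly equal pieces so that several WAN links carry it concurrently, can be sketched in a few lines (this is an illustration of the idea only, not MagPIe's actual interface):

```python
def split_segments(message, num_links):
    """Split `message` (bytes) into `num_links` roughly equal segments,
    one per wide-area link, so the links' aggregate bandwidth is used
    in parallel; the receiver reassembles them by concatenation. With a
    balanced split, transfer time drops from len/bw to about
    len/(num_links * bw), ignoring latency."""
    n = len(message)
    base, extra = divmod(n, num_links)
    segments, start = [], 0
    for i in range(num_links):
        size = base + (1 if i < extra else 0)  # spread the remainder
        segments.append(message[start:start + size])
        start += size
    return segments
```

For example, a 10-byte message over 3 links yields segments of 4, 3, and 3 bytes, and joining them restores the original message.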
Fast Shared-Memory Algorithms for Computing the Minimum Spanning Forest of Sparse Graphs
, 2006
"... ..."
Idempotent work stealing
 in PPoPP, 2009
"... Load balancing is a technique which allows efficient parallelization of irregular workloads, and a key component of many applications and parallelizing runtimes. Workstealing is a popular technique for implementing load balancing, where each parallel thread maintains its own work set of items and o ..."
Abstract

Cited by 31 (6 self)
Load balancing is a technique which allows efficient parallelization of irregular workloads, and a key component of many applications and parallelizing runtimes. Work stealing is a popular technique for implementing load balancing, where each parallel thread maintains its own work set of items and occasionally steals items from the sets of other threads. The conventional semantics of work stealing guarantee that each inserted task is eventually extracted exactly once. However, correctness of a wide class of applications allows for relaxed semantics, because either: (i) the application already explicitly checks that no work is repeated, or (ii) the application can tolerate repeated work. In this paper, we introduce idempotent work stealing, and present several new algorithms that exploit the relaxed semantics to deliver better performance. The semantics of the new algorithms guarantee that each inserted task is eventually extracted at least once, instead of exactly once. On mainstream processors, algorithms for conventional work stealing require special atomic instructions or store-load memory ordering fence instructions in the owner's critical path operations. In general, these instructions are substantially slower than regular memory access instructions. By exploiting the relaxed semantics, our algorithms avoid these instructions in the owner's operations. We evaluated our algorithms using common graph problems and microbenchmarks and compared them to well-known conventional work-stealing algorithms, the Cilk THE and Chase-Lev algorithms. We found that our best algorithm (with LIFO extraction) outperforms existing algorithms in nearly all cases, and often by significant margins.
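The "at least once" contract can be made concrete with a toy model: the work set may hand out the same task more than once (as a racy steal could), and the application tolerates this by tracking completed tasks and skipping repeats. The class and method names below are illustrative, not the paper's algorithms:

```python
class IdempotentScheduler:
    """Toy model of at-least-once extraction. `pending` plays the role
    of the owner's LIFO work set; duplicates in it model a task that was
    both stolen and extracted locally. The application-level `done` set
    makes repeated execution harmless, which is exactly the property
    that lets the work-stealing structure drop the expensive fences."""

    def __init__(self, tasks):
        self.pending = list(tasks)   # LIFO work set (may hold duplicates)
        self.done = set()

    def extract(self):
        # May return an already-processed task: at-least-once semantics.
        return self.pending.pop() if self.pending else None

    def run(self, work_fn):
        results = []
        while (task := self.extract()) is not None:
            if task in self.done:    # repeated extraction: tolerate & skip
                continue
            self.done.add(task)
            results.append(work_fn(task))
        return results
```

Note that this sketch shows only the semantics; the paper's contribution is concurrent data structures whose owner-side put/take avoid atomic instructions and store-load fences while still guaranteeing no task is lost.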
Evaluating Arithmetic Expressions Using Tree Contraction: A Fast and Scalable Parallel Implementation for Symmetric Multiprocessors (SMPs)
 Proc. 9th Int’l Conf. on High Performance Computing (HiPC 2002), volume 2552 of Lecture Notes in Computer Science
, 2002
"... The ability to provide uniform sharedmemory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform sharedmemory algorithm from a PRAM algorithm and present the res ..."
Abstract

Cited by 26 (7 self)
The ability to provide uniform shared-memory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform shared-memory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors and across the entire range of instance sizes tested. This linear speedup with the number of processors is one of the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is for evaluating arithmetic expression trees using the algorithmic techniques of list ranking and tree contraction; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of shared-memory parallel algorithms.
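List ranking, one of the two building blocks named above, is the textbook PRAM pointer-jumping algorithm; a compact sketch where each "parallel" round is simulated by whole-array passes (the representation is our own):

```python
def list_rank(succ):
    """Pointer-jumping list ranking: succ[i] is i's successor, with the
    tail pointing to itself. Returns rank[i] = distance from i to the
    tail. Each of the O(log n) rounds doubles how far every pointer
    reaches; on a PRAM each round is one synchronous parallel step."""
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    nxt = list(succ)
    rounds = max(1, (n - 1).bit_length())       # ceil(log2 n) rounds
    for _ in range(rounds):
        # Synchronous update: read old rank/nxt, write fresh arrays.
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]   # jump twice as far
    return rank
```

The paper's point is that translating such a PRAM algorithm into an SMP program that actually runs fast requires care (memory layout, synchronization granularity), which is what its design techniques address.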
A Programming Methodology for Dual-tier Multicomputers
 IEEE Transactions on Software Engineering
, 1999
"... Hierarchicallyorganized ensembles of shared memory multiprocessors possess a richer and more complex model of locality than previous generation multicomputers with single processor nodes. These dualtier computers introduce many new degrees of freedom into the programmer 's performance model. ..."
Abstract

Cited by 25 (7 self)
Hierarchically organized ensembles of shared-memory multiprocessors possess a richer and more complex model of locality than previous-generation multicomputers with single-processor nodes. These dual-tier computers introduce many new degrees of freedom into the programmer's performance model. We present a methodology for implementing block-structured numerical applications on dual-tier computers, and a runtime infrastructure, called KeLP2, that implements the methodology. KeLP2 supports two levels of locality and parallelism via hierarchical SPMD control flow, runtime geometric metadata, and asynchronous collective communication. It effectively overlaps communication in cases where non-blocking point-to-point message passing can fail to tolerate communication latency, either due to an incomplete implementation or because the point-to-point model is inappropriate. KeLP's abstractions hide considerable detail without sacrificing performance, and dual-tier applications written in KeLP ...
Optimization Rules for Programming with Collective Operations
 IPPS/SPDP'99. 13th Int. Parallel Processing Symp. & 10th Symp. on Parallel and Distributed Processing
, 1999
"... We study how several collective operations like broadcast, reduction, scan, etc. can be composed efficiently in complex parallel programs. Our specific contributions are: (1) a formal framework for reasoning about collective operations; (2) a set of optimization rules which save communications by fu ..."
Abstract

Cited by 24 (7 self)
We study how several collective operations, such as broadcast, reduction, and scan, can be composed efficiently in complex parallel programs. Our specific contributions are: (1) a formal framework for reasoning about collective operations; (2) a set of optimization rules which save communication by fusing several collective operations into one; (3) performance estimates, which guide the application of optimization rules depending on the machine characteristics; and (4) a simple case study with the first results of machine experiments.
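One classic fusion rule of this general kind (a standard MPI identity, cited here as an illustration rather than as one of the paper's specific rules) is that a reduction followed by a broadcast of the result is semantically an allreduce, so two collective phases collapse into one. A toy simulation, modeling the per-process state as a plain Python list:

```python
from functools import reduce

def reduce_then_broadcast(values, op):
    """Unfused program: phase 1 reduces every process's value to a
    single result; phase 2 broadcasts that result back to everyone."""
    result = reduce(op, values)        # collective phase 1: reduction
    return [result] * len(values)      # collective phase 2: broadcast

def fused_allreduce(values, op):
    """Fused program: one allreduce phase yields the same final state
    on every process, saving a full round of communication."""
    result = reduce(op, values)        # single collective phase
    return [result] * len(values)
```

A rule system like the paper's would rewrite the first form into the second whenever the performance estimate (e.g. one tree traversal instead of two) predicts a win on the target machine.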
Using PRAM Algorithms on a Uniform-Memory-Access Shared-Memory Architecture
 Proc. 5th Int’l Workshop on Algorithm Engineering (WAE 2001), volume 2141 of Lecture Notes in Computer Science
, 2001
"... The ability to provide uniform sharedmemory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform sharedmemory algorithm from a PRAM algorithm and present the res ..."
Abstract

Cited by 20 (11 self)
The ability to provide uniform shared-memory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform shared-memory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors (from 1 to 64) and across the entire range of instance sizes tested. This linear speedup with the number of processors is, to our knowledge, the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is a graph decomposition algorithm that also requires the computation of a spanning tree; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of shared-memory parallel algorithms.