Results 1 – 10 of 55
Multi-Protocol Active Messages on a Cluster of SMPs
 Proc. Supercomputing '97, IEEE CS
, 1997
Abstract

Cited by 73 (2 self)
Clusters of multiprocessors, or Clumps, promise to be the supercomputers of the future, but obtaining high performance on these architectures requires an understanding of interactions between the multiple levels of interconnection. In this paper, we present the first multi-protocol implementation of a lightweight message layer, a version of Active Messages-II, running on a cluster of Sun Enterprise 5000 servers connected with Myrinet. This research brings together several pieces of high-performance interconnection technology: bus backplanes for symmetric multiprocessors, low-latency networks for connections between machines, and simple, user-level primitives for communication. The paper describes the shared-memory message-passing protocol and analyzes the multi-protocol implementation with both microbenchmarks and Split-C applications. Three aspects of the communication layer are critical to performance: the overhead of cache-coherence mechanisms, the method of managing concurrent access, and the cost of accessing state with the slower protocol. Through the use of an adaptive polling strategy, the multi-protocol implementation limits performance interactions between the protocols, delivering up to 160 MB/s of bandwidth with 3.6-microsecond end-to-end latency. Applications within an SMP benefit from this fast communication, running up to 75% faster than on a network of uniprocessor workstations. Applications running on the entire Clump are limited by the balance of NICs to processors in our system, and are typically slower than on the NOW. These results illustrate several potential pitfalls for the Clumps architecture.
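The adaptive polling strategy mentioned in this abstract can be illustrated with a small sketch. The snippet above does not state the actual polling policy, so the exponential back-off below, along with the class and field names, is an assumption chosen to show the general idea: poll the fast shared-memory path on every call, and touch the slower network path only at a widening interval while it stays idle, so the slow protocol stops stealing cycles from the fast one.

```python
from collections import deque

class AdaptivePoller:
    """Polls a fast (shared-memory) and a slow (network) message queue.

    The slow queue is checked only every `interval` polls; the interval
    backs off exponentially while the slow queue stays idle and resets
    to 1 as soon as a slow-protocol message arrives. The policy and all
    names here are illustrative, not taken from the paper.
    """

    def __init__(self, max_interval=64):
        self.fast = deque()   # stands in for the SMP shared-memory protocol
        self.slow = deque()   # stands in for the network (Myrinet) protocol
        self.interval = 1
        self.max_interval = max_interval
        self.ticks = 0
        self.slow_polls = 0   # counts how often the slow path was touched

    def poll(self):
        """Return the next message, preferring the fast protocol."""
        if self.fast:
            return self.fast.popleft()
        self.ticks += 1
        if self.ticks >= self.interval:
            self.ticks = 0
            self.slow_polls += 1
            if self.slow:
                self.interval = 1            # traffic seen: poll eagerly
                return self.slow.popleft()
            # idle: back off so the slow protocol costs almost nothing
            self.interval = min(self.interval * 2, self.max_interval)
        return None
```

With this policy, 100 polls on an idle slow queue touch the slow path only a handful of times, which is the interaction-limiting effect the abstract describes.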
Parallel Execution of Prolog Programs: A Survey
Abstract

Cited by 65 (24 self)
Since the early days of logic programming, researchers in the field realized the potential for exploitation of parallelism present in the execution of logic programs. Their high-level nature, the presence of nondeterminism, and their referential transparency, among other characteristics, make logic programs interesting candidates for obtaining speedups through parallel execution. At the same time, the fact that the typical applications of logic programming frequently involve irregular computations, make heavy use of dynamic data structures with logical variables, and involve search and speculation, makes the techniques used in the corresponding parallelizing compilers and runtime systems potentially interesting even outside the field. The objective of this paper is to provide a comprehensive survey of the issues arising in parallel execution of logic programming languages along with the most relevant approaches explored to date in the field. Focus is mostly given to the challenges emerging from the parallel execution of Prolog programs. The paper describes the major techniques used for shared-memory implementation of Or-parallelism, And-parallelism, and combinations of the two. We also explore some related issues, such as memory ...
A Fast, Parallel Spanning Tree Algorithm for Symmetric Multiprocessors (SMPs) (Extended Abstract)
, 2004
Abstract

Cited by 34 (13 self)
Our study in this paper focuses on implementing parallel spanning tree algorithms on SMPs. Spanning tree is an important problem in the sense that it is the building block for many other parallel graph algorithms and also because it is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but often have no known efficient parallel implementations. In this paper we present a new randomized algorithm and implementation with superior performance that for the first time achieves parallel speedup on arbitrary graphs (both regular and irregular topologies) when compared with the best sequential implementation for finding a spanning tree. This new algorithm uses several techniques to give an expected running time that scales linearly with the number p of processors for suitably large inputs (n > p^2). As the spanning tree problem is notoriously hard for any parallel implementation to achieve reasonable speedup, our study may shed new light on implementing PRAM algorithms for shared-memory parallel computers. The main results of this paper are: 1. a new and practical spanning tree algorithm for symmetric multiprocessors that exhibits parallel speedups on graphs with regular and irregular topologies; and 2. an experimental study of parallel spanning tree algorithms that reveals the superior performance of our new approach compared with the previous algorithms. The source code for these algorithms is freely available from our web site hpc.ece.unm.edu.
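The abstract does not reproduce the randomized algorithm itself, but the PRAM style it builds on can be illustrated with the classic graft-and-shortcut scheme (Shiloach-Vishkin style), simulated sequentially here. This is a textbook sketch, not the paper's algorithm; function and variable names are illustrative.

```python
def spanning_forest(n, edges):
    """Graft-and-shortcut spanning forest over vertices 0..n-1.

    Sequentially simulates the PRAM rounds: in each round every edge
    tries to graft one component root onto a lower-numbered root, then
    every vertex pointer-jumps to its current root. Parent pointers only
    ever decrease, so no cycle can form, and each accepted graft merges
    two distinct components and records one tree edge.
    """
    parent = list(range(n))
    tree = []
    changed = True
    while changed:
        changed = False
        # graft phase: hook the higher root onto the lower one
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru != rv:
                hi, lo = max(ru, rv), min(ru, rv)
                if parent[hi] == hi:          # only graft actual roots
                    parent[hi] = lo
                    tree.append((u, v))
                    changed = True
        # shortcut phase: compress every vertex to its root
        for u in range(n):
            while parent[u] != parent[parent[u]]:
                parent[u] = parent[parent[u]]
    return tree
```

On an SMP both phases would run as parallel loops over edges and vertices; the sequential version keeps the same round structure.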
Bandwidth-efficient Collective Communication for Clustered Wide Area Systems
 In Proc. International Parallel and Distributed Processing Symposium (IPDPS 2000), Cancun
, 1999
Abstract

Cited by 31 (3 self)
Metacomputing infrastructures couple multiple clusters (or MPPs) via wide-area networks and thus allow parallel programs to run on geographically distributed resources. A major problem in programming such wide-area parallel applications is the difference in communication costs inside and between clusters. Latency and bandwidth of WANs often are orders of magnitude worse than those of local networks. Our MagPIe library eases wide-area parallel programming by providing an efficient implementation of MPI's collective communication operations. MagPIe exploits the hierarchical structure of clustered wide-area systems and minimizes the communication overhead over the WAN links. In this paper, we present improved algorithms for collective communication that achieve shorter completion times by simultaneously using the aggregate bandwidth of the available wide-area links. Our new algorithms split messages into multiple segments that are sent in parallel over different WAN links, thus resulting ...
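The message-splitting idea can be shown in a few lines. The segmentation below is an illustrative sketch, not MagPIe's code; with k wide-area links of similar bandwidth, sending one segment per link divides the wide-area transfer time by roughly k.

```python
def split_for_wan(message: bytes, links: int) -> list[bytes]:
    """Split a message into near-equal segments, one per wide-area link.

    With k links of equal bandwidth b, the wide-area transfer time drops
    from len(message)/b to roughly len(message)/(k*b). This is just the
    segmentation step, written for illustration.
    """
    base, extra = divmod(len(message), links)
    segments, offset = [], 0
    for i in range(links):
        size = base + (1 if i < extra else 0)  # spread the remainder
        segments.append(message[offset:offset + size])
        offset += size
    return segments

def reassemble(segments: list[bytes]) -> bytes:
    """Receiver side: concatenate segments in link order."""
    return b"".join(segments)
```

In a real collective, each segment would be handed to a different inter-cluster connection and the receiving cluster's root would reassemble before the local (fast) broadcast.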
A Programming Methodology for Dual-Tier Multicomputers
 IEEE Transactions on Software Engineering
, 1999
Abstract

Cited by 25 (7 self)
Hierarchically organized ensembles of shared-memory multiprocessors possess a richer and more complex model of locality than previous-generation multicomputers with single-processor nodes. These dual-tier computers introduce many new degrees of freedom into the programmer's performance model. We present a methodology for implementing block-structured numerical applications on dual-tier computers, and a run-time infrastructure, called KeLP2, that implements the methodology. KeLP2 supports two levels of locality and parallelism via hierarchical SPMD control flow, run-time geometric metadata, and asynchronous collective communication. It effectively overlaps communication in cases where non-blocking point-to-point message passing can fail to tolerate communication latency, either due to an incomplete implementation or because the point-to-point model is inappropriate. KeLP's abstractions hide considerable detail without sacrificing performance, and dual-tier applications written in KeLP ...
Optimization Rules for Programming with Collective Operations
 IPPS/SPDP'99. 13th Int. Parallel Processing Symp. & 10th Symp. on Parallel and Distributed Processing
, 1999
Abstract

Cited by 24 (7 self)
We study how several collective operations like broadcast, reduction, scan, etc. can be composed efficiently in complex parallel programs. Our specific contributions are: (1) a formal framework for reasoning about collective operations; (2) a set of optimization rules which save communications by fusing several collective operations into one; (3) performance estimates, which guide the application of optimization rules depending on the machine characteristics; (4) a simple case study with the first results of machine experiments.
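Fusion rules of the kind itemized here can be sketched as rewrites over a sequence of collectives. The rules below are standard MPI identities (e.g., a reduce whose result is then broadcast is equivalent to an allreduce); the greedy rewriter and the latency-only cost model are illustrative assumptions, not the paper's formal framework.

```python
import math

# Standard MPI identities: each adjacent pair fuses into one collective,
# saving a communication round.
FUSION_RULES = {
    ("reduce", "bcast"): "allreduce",
    ("gather", "bcast"): "allgather",
    ("reduce", "scatter"): "reduce_scatter",
}

def fuse(program):
    """Greedily apply fusion rules left to right over a collective sequence."""
    out = []
    for op in program:
        if out and (out[-1], op) in FUSION_RULES:
            out[-1] = FUSION_RULES[(out[-1], op)]
        else:
            out.append(op)
    return out

def latency_cost(program, p, alpha=1.0):
    """Toy estimate: each tree-based collective costs ceil(log2 p) startups
    of latency alpha. A real model would also account for bandwidth and
    per-machine characteristics, as the paper's performance estimates do."""
    return len(program) * math.ceil(math.log2(p)) * alpha
```

The cost model is what would guide whether to apply a rule on a given machine: here fusing `reduce; bcast` on 8 processes halves the latency estimate.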
Fast Shared-Memory Algorithms for Computing the Minimum Spanning Forest of Sparse Graphs
, 2006
Evaluating Arithmetic Expressions Using Tree Contraction: A Fast and Scalable Parallel Implementation for Symmetric Multiprocessors (SMPs)
 Proc. 9th Int’l Conf. on High Performance Computing (HiPC 2002), volume 2552 of Lecture Notes in Computer Science
, 2002
Abstract

Cited by 22 (7 self)
The ability to provide uniform shared-memory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform shared-memory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors and across the entire range of instance sizes tested. This linear speedup with the number of processors is one of the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is for evaluating arithmetic expression trees using the algorithmic techniques of list ranking and tree contraction; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of shared-memory parallel algorithms.
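List ranking, one of the two techniques named above, has a compact PRAM formulation (Wyllie's pointer-jumping algorithm) that can be simulated sequentially. The sketch below is the textbook algorithm, not the paper's SMP implementation; on an SMP each round's inner loop would be a parallel loop.

```python
def list_rank(succ):
    """Rank each node of a linked list by pointer jumping (Wyllie's PRAM
    algorithm, simulated sequentially). succ[i] is the successor of node
    i, with succ[i] == i marking the tail. Returns each node's distance
    from the tail after O(log n) rounds of doubling.
    """
    n = len(succ)
    succ = list(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    done = False
    while not done:
        done = True
        new_rank, new_succ = list(rank), list(succ)
        for i in range(n):               # one "parallel" round
            if succ[i] != succ[succ[i]]:
                done = False
            new_rank[i] = rank[i] + rank[succ[i]]   # accumulate jumped-over ranks
            new_succ[i] = succ[succ[i]]             # jump to the successor's successor
        rank, succ = new_rank, new_succ
    return rank
```

For the list 0 → 1 → 2 → 3 this yields ranks [3, 2, 1, 0]; the double-buffered `new_rank`/`new_succ` arrays mimic the synchronous PRAM update that a shared-memory version must preserve.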
Using PRAM Algorithms on a Uniform-Memory-Access Shared-Memory Architecture
 Proc. 5th Int’l Workshop on Algorithm Engineering (WAE 2001), volume 2141 of Lecture Notes in Computer Science
, 2001
Abstract

Cited by 20 (11 self)
The ability to provide uniform shared-memory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform shared-memory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors (from 1 to 64) and across the entire range of instance sizes tested. This linear speedup with the number of processors is, to our knowledge, the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is a graph decomposition algorithm that also requires the computation of a spanning tree; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of shared-memory parallel algorithms.
Runtime Support for Multi-Tier Programming of Block-Structured Applications on SMP Clusters
 International Scientific Computing in Object-Oriented Parallel Environments Conference (ISCOPE ’97)
, 1997
Abstract

Cited by 16 (4 self)
We present a small set of programming abstractions to simplify efficient implementations for block-structured scientific calculations on SMP clusters. We have implemented these abstractions in KeLP 2.0, a C++ class library. KeLP 2.0 provides hierarchical SPMD control flow to manage two levels of parallelism and locality. Additionally, to tolerate slow inter-node communication, KeLP 2.0 combines inspector/executor communication analysis with overlap of communication and computation. We illustrate how these programming abstractions hide the low-level details of thread management, scheduling, synchronization, and message passing, but allow the programmer to express efficient algorithms with intuitive geometric primitives.

1 Introduction

Multi-tier parallel computers, such as clusters of symmetric multiprocessors (SMPs), have emerged as important platforms for high-performance computing [1]. A multi-tier computer, with several levels of locality and parallelism, presents a more c...
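The inspector/executor idea mentioned in this abstract can be sketched independently of KeLP: an inspector pass runs once to build a communication schedule, and an executor reuses that schedule on every time step, issuing one bulk transfer per peer instead of one message per element. Names and data layout below are illustrative assumptions, not KeLP's API.

```python
def build_schedule(needed_indices, owner):
    """Inspector: run once to discover which values an iteration needs
    from each owning process. owner[i] names the process holding index i;
    the schedule maps each process to the indices to fetch from it."""
    schedule = {}
    for i in needed_indices:
        schedule.setdefault(owner[i], []).append(i)
    return schedule

def execute(schedule, fetch):
    """Executor: reuse the schedule every time step. fetch(proc, indices)
    stands in for one bulk message exchange with a remote process."""
    values = {}
    for proc, indices in schedule.items():
        values.update(zip(indices, fetch(proc, indices)))
    return values
```

The payoff is that the (expensive) index analysis in `build_schedule` is amortized over many time steps, while `execute` is the only part on the critical path.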