Results 1 - 10 of 36
Diskless Checkpointing
, 1997
Cited by 91 (3 self)
Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.
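The abstract above does not say how the in-memory checkpoints are protected against a failure. One common way to do this without stable storage, assumed here purely for illustration, is to keep an XOR parity of all checkpoints in the memory of an extra processor. The sketch below is a minimal single-failure example under that assumption; the function names and the sample states are hypothetical and are not taken from the paper.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints: list[bytes]) -> bytes:
    """Parity buffer that a (hypothetical) extra processor keeps in memory."""
    return reduce(xor_bytes, checkpoints)


def recover(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single lost checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Three application processes, each holding an equal-sized in-memory checkpoint.
states = [b"proc0-state", b"proc1-state", b"proc2-state"]
parity = encode_parity(states)

# Simulate the loss of process 1 and rebuild its checkpoint from the rest.
lost = 1
survivors = [s for i, s in enumerate(states) if i != lost]
rebuilt = recover(survivors, parity)
```

This toy version tolerates exactly one lost checkpoint and requires all checkpoints to have the same length; a real system would pad buffers and coordinate the checkpoint across processes.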
SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems
- ACM Trans. Mathematical Software
, 2003
Cited by 68 (14 self)
We present the main algorithmic features in the software package SuperLU_DIST, a distributed-memory sparse direct solver for large sets of linear equations. We give in detail our parallelization strategies, with a focus on scalability issues, and demonstrate the software’s parallel performance and scalability on current machines. The solver is based on sparse Gaussian elimination, with an innovative static pivoting strategy proposed earlier by the authors. The main advantage of static pivoting over classical partial pivoting is that it permits a priori determination of data structures and communication patterns, which lets us exploit techniques used in parallel sparse Cholesky algorithms to better parallelize both LU decomposition and triangular solution on large-scale distributed machines.
Send-receive considered harmful: Myths and realities of message passing
- ACM Transactions on Programming Languages and Systems
Cited by 28 (1 self)
During the software crisis of the 1960s, Dijkstra’s famous thesis “goto considered harmful” paved the way for structured programming. This short communication suggests that many current difficulties of parallel programming based on message passing are caused by poorly structured communication, which is a consequence of using low-level send-receive primitives. We argue that, like goto in sequential programs, send-receive should be avoided as far as possible and replaced by collective operations in the setting of message passing. We dispute some widely held opinions about the apparent superiority of pairwise communication over collective communication and present substantial theoretical and empirical evidence to the contrary in the context of MPI (Message Passing Interface).
Pipelining broadcasts on heterogeneous platforms
, 2005
Cited by 27 (14 self)
In this paper, we consider the communications involved in the execution of a complex application deployed on a heterogeneous platform. Such applications extensively use macrocommunication schemes, for example, to broadcast data items. Rather than aiming to minimize the execution time of a single broadcast, we focus on the steady-state operation. We assume that there is a large number of messages to be broadcast in pipeline fashion, and we aim to maximize the throughput, i.e., the (rational) number of messages which can be broadcast every time-step. We target heterogeneous platforms, modeled by a graph where resources have different communication and computation speeds. Achieving the best throughput may well require that the target platform be used in its entirety: we show that neither spanning trees nor DAGs are as powerful as general graphs. We show how to compute the best throughput using linear programming, and how to exhibit a periodic schedule, first when restricting to a DAG, and then when using a general graph. The polynomial compactness of the description comes from the decomposition of the schedule into several broadcast trees that are used concurrently to reach the best throughput. It is important to point out that a concrete scheduling algorithm based upon the steady-state operation is asymptotically optimal, in the class of all possible schedules (not only periodic solutions).
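To make the notion of steady-state throughput concrete, the sketch below computes the throughput of a single broadcast tree. It is not the paper's linear program. It assumes a simple one-port model in which a node forwards each message to its children one at a time, so the period of the pipeline is set by the node with the largest total sending time. The platform, the edge costs, and the function name are hypothetical.

```python
from fractions import Fraction


def tree_throughput(tree: dict[str, dict[str, int]]) -> Fraction:
    """Messages per time-step for one broadcast tree under a one-port send model.

    `tree` maps each forwarding node to {child: time to send one message}.
    In steady state every forwarding node must send each message to all of
    its children in turn, so the busiest sender fixes the period.
    """
    period = max(
        sum(Fraction(cost) for cost in children.values())
        for children in tree.values()
        if children
    )
    return 1 / period


# Hypothetical platform: P0 is the source, P1 relays to P3.
tree = {
    "P0": {"P1": 1, "P2": 2},  # P0 is busy 1 + 2 = 3 time units per message
    "P1": {"P3": 2},           # P1 is busy 2 time units per message
}
throughput = tree_throughput(tree)
print(throughput)  # 1/3
```

Using exact fractions mirrors the abstract's remark that the throughput is a rational number of messages per time-step. According to the abstract, combining several trees concurrently can exceed what any single tree achieves, which this single-tree sketch does not attempt.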
Realizing common communication patterns in partitioned optical passive stars (POPS) networks
- IEEE Transactions on Computers
, 1998
Cited by 15 (0 self)
We consider the problem of realizing several common communication structures in the all-optical Partitioned Optical Passive Stars (POPS) topology. We show that, often, the obvious or “natural” method of implementing a communication pattern in the POPS does not efficiently utilize its communication capabilities. We present techniques which distribute the communication load uniformly in the POPS for four of the most common communication patterns (all-to-all personalized, global reduction operations, ring, and torus). We prove that these techniques provide optimal performance in the sense that they minimize the time required to deliver the messages from each node to its neighbors. Index Terms: Optical interconnections, passive stars, embedding, all-to-all communications, reduction operations, multiplexing.
A measurement-based simulation study of processor co-allocation in multicluster systems
- SCHEDULING STRATEGIES FOR PARALLEL PROCESSING
, 2003
Cited by 15 (3 self)
In systems consisting of multiple clusters of processors interconnected by relatively slow connections such as our Distributed ASCI Supercomputer (DAS), applications may benefit from the availability of processors in multiple clusters. However, the performance of single-application multicluster execution may be degraded due to the slow wide-area links. In addition, scheduling policies for such systems have to deal with more restrictions than schedulers for single clusters in that every component of a job has to fit in separate clusters. In this paper we present a measurement study of the total runtime of two applications, and of the communication time of one of them, both on single clusters and on multicluster systems. In addition, we perform simulations of several multicluster scheduling policies based on our measurement results. Our results show that in spite of the fact that inter-cluster communication is much slower than intra-cluster communication, the performance of multicluster operation can be very reasonable compared to single-cluster execution.
Parallel Biological Sequence Comparison using Prefix Computations
- Journal of Parallel and Distributed Computing
, 2003
Broadcast trees for heterogeneous platforms
- 19th International Parallel and Distributed Processing Symposium (IPDPS’05)
, 2005
Cited by 13 (2 self)
Laboratoire de l'Informatique du Parallélisme, École Normale Supérieure de Lyon, Unité Mixte de Recherche CNRS-INRIA-ENS LYON-UCBL n° 5668
Determining the Execution Time Distribution for a Data Parallel Program in a Heterogeneous Computing Environment
, 1997
Cited by 12 (10 self)
this paper. Section 2 presents the basic assumptions and a brief overview of the proposed approach. Methods for computing the execution time distribution of a single code block in either SIMD or SPMD mode are discussed in Section 3. The methods for computing the execution time distribution for the entire program executed in SPMD, SIMD, and mixed-mode are introduced in Sections 4, 5, and 6, respectively. Section 7 presents a hypothetical numerical example and an application study to demonstrate the effect of mode selections on the distribution of total execution time. The Appendix reviews the basic probability theory and notation used here.
Customizable Parallel Execution of Scientific Stream
, 2005
Cited by 11 (5 self)
Scientific applications require processing high-volume on-line streams of numerical data from instruments and simulations. We present an extensible stream database system that allows scalable and flexible continuous queries on such streams. Application-dependent streams and query functions are defined through an object-relational model. Distributed execution plans for continuous queries are described as high-level data flow distribution templates.

