CLIP: A Checkpointing Tool for Message-Passing Parallel Programs
, 1997
Cited by 60 (9 self)
Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpointer: CLIP. CLIP is a user-level library that provides semitransparent checkpointing for parallel programs on the Intel Paragon multicomputer. It is publicly available to Paragon users at no cost. Conceptually, checkpointing a multicomputer is quite straightforward. However, when creating an actual tool for checkpointing a complex machine like the Paragon, many more issues arise that require careful design decisions to be made. Sometimes ease-of-use must be sacrificed for efficiency and/or correctness. This paper details what these decisions are, and how they were made in CLIP. We also present performance data when checkpointing several long-running Paragon applications with CLIP. The bottom line is that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer like the Paragon with very good performance.
Predicting Performance on SMPs. A Case Study: The SGI Power Challenge
- IN PROCEEDINGS OF THE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2000)
, 2000
Cited by 9 (4 self)
In this work we study the issue of performance prediction on the SGI-Power Challenge, a typical representative of the class of shared-memory Symmetric MultiProcessors. On such a platform, the cost of memory accesses varies depending on their locality and on contention among processors. By running a carefully designed suite of microbenchmarks, we provide quantitative evidence that the interaction with the memory hierarchy affects performance far more substantially than other phenomena related to contention. We also fit three cost functions based on variants of the BSP model, which do not account for the hierarchy, and a newly defined function F, expressed in terms of hardware counters, which captures both memory hierarchy and contention effects. We test the accuracy of all the functions on both synthetic and application benchmarks showing that, unlike the other functions, F achieves an excellent level of predictivity in all cases. Although hardware counters are only available at run time, we give evidence that function F can still be employed as a prediction tool by extrapolating values of the counters from pilot runs on small input sizes.
Assessing the Performance of the New IBM SP2 Communication Subsystem
, 1996
Cited by 7 (0 self)
IBM has recently launched an upgrade of the communication subsystem of its SP2 parallel computer. This change affects the communication hardware (high-performance switch and interface adapters) as well as the communication software (MPI implementation). In order to characterize to what extent the execution times of parallel applications will be affected by these changes, a collection of benchmarks has been run on an SP2 with the old communication subsystem and on the same machine after being upgraded. These benchmarks include point-to-point and collective communication tests as well as complete parallel applications. The performance indicators are the latency and throughput exhibited by the basic communication tests, and the execution time in the case of real applications.
Keywords: Communication subsystem, message passing networks, massively parallel computers, performance evaluation, IBM SP2.
The average availability of parallel checkpointing systems and its importance in selecting runtime parameters
- IN 29TH INTERNATIONAL SYMPOSIUM ON FAULT-TOLERANT COMPUTING
, 1999
Cited by 7 (1 self)
Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none address the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we briefly present a performance model for long-running parallel computations that execute with checkpointing enabled. We then discuss how it is relevant to today’s parallel computing environments and software, and present case studies of using the model to select runtime parameters.
CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems
Cited by 6 (0 self)
Considerable work has been done on providing fault tolerance capabilities for different software components on large-scale high-end computing systems. Thus far, however, these fault-tolerant components have worked insularly and independently, and information about faults is rarely shared. Such lack of system-wide fault tolerance is emerging as one of the biggest problems on leadership-class systems. In this paper, we propose a coordinated infrastructure, named CIFTS, that enables system software components to share fault information with each other and adapt to faults in a holistic manner. Central to the CIFTS infrastructure is a Fault Tolerance Backplane (FTB) that enables fault notification and awareness throughout the software stack, including fault-aware libraries, middleware, and applications. We present details of the CIFTS infrastructure and the interface specification that has allowed various software programs, including MPICH2, MVAPICH, Open MPI, and PVFS, to plug into the CIFTS infrastructure. Further, through a detailed evaluation we demonstrate the nonintrusive low-overhead capability of CIFTS that lets applications run with minimal performance degradation.
Utilizing Home Node Prediction to Improve the Performance of Software Distributed Shared Memory
- In Proc. of the 18th International Parallel and Distributed Processing Symposium (IPDPS’04)
, 2004
Cited by 1 (0 self)
Many researchers use a home-based lazy release consistent protocol (HLRC) to provide a simple, effective, and scalable way to build software distributed shared memory (DSM) systems. However, the performance of HLRC is notoriously sensitive to the initial page distribution among home nodes. This paper proposes an adaptive HLRC protocol in which the home page designation is able to change according to the observed application sharing pattern. Our system differs from HLRC and other adaptive derivatives in the following respects. First, the number of home nodes for each shared page can be varied, as opposed to having only a single home node. Second, we use prediction in a novel way to dynamically change the location of home nodes according to different memory access patterns. The home node of each shared page is able to propagate, perish, and migrate. An online home predictor determines whether the current node should remain a home node or drop from the current set of home nodes for a given page. Finally, all decisions concerning home node group membership are made locally, eliminating the costly global decision-making communication present in many other systems. Performance evaluations using six well-known DSM benchmarks show that our adaptive protocol outperforms conventional HLRC by up to 60%.
Presented at ATM99 DEVELOPMENTS Conf, Mar 3 - Apr 1.1999, Rennes, France
We investigate the configuration and performance of remote commodity computing clusters. This is the dynamic pooling of separate clusters into a single large remote cluster via existing LANs or even the Internet. We discuss the configuration and setup of these remote clusters, as well as the networks, since these clusters are on separate Ethernet subnets and separate ATM switches. We show that pure switching networks add little additional overhead to remote cluster computing applications, whereas routers can have a significant impact.
Workload Characterization of CFD Applications Using Partial Differential Equation Solvers
Workload characterization is used for modeling and evaluating computing systems at different levels of detail. We present workload characterization for a class of Computational Fluid Dynamics (CFD) applications that solve Partial Differential Equations (PDEs). This workload characterization focuses on three high performance computing platforms: SGI Origin2000, IBM SP-2, and a cluster of Intel Pentium Pro based PCs. We execute extensive measurement-based experiments on these platforms to gather statistics of system resource usage, which lead to a quantitative workload characterization. Our workload characterization approach yields a coarse-grain resource utilization behavior that is being applied for performance modeling and evaluation of distributed high performance metacomputing systems. In addition, this study enhances our understanding of interactions between PDE solver workloads and high performance computing platforms and is useful for tuning applications belonging to this class.
http://www.fz-juelich.de/nic-series/volume38 Gb Ethernet Protocols for Clusters: An OpenMPI, TIPC, GAMMA Case Study
Performance Evaluation of Gigabit Ethernet and Myrinet for System-Area-Networks
, 2005
Low-latency, high-bandwidth networking is essential for cluster computing and System-Area-Networks (SAN). The performance of a SAN-optimized interconnect, Myrinet, is compared with gigabit Ethernet running TCP/IP. Though Myrinet has lower latencies and higher throughput than gigabit Ethernet, it is found that an efficient implementation of a message passing interface library over TCP/IP achieves performance very close to Myrinet. These observations suggest that gigabit and the upcoming 10 gigabit Ethernet can serve as a cost-effective alternative to specialized interconnects if the efficiency of protocol implementations can be improved. This paper also surveys some recent work on increasing the efficiency of TCP/IP and other efficient protocol implementations that benefit all applications using the ubiquitous sockets interface over Ethernet without using specialized libraries such as GM over Myrinet.

