A Communication Framework for Fault-tolerant Parallel Execution
Abstract. PC grids represent massive computation capacity at a low cost, but are challenging to employ for parallel computing because of variable and unpredictable performance and availability. A communicating parallel program must employ checkpoint-restart and/or process redundancy to make continuous forward progress in such an unreliable environment. A communication model based on one-sided Put/Get calls, pioneered by the Linda system, is a good match as processes can execute their communication operations independently and asynchronously. However, Linda and its many variants are not designed for communicating processes that are replicated or independently restarted from checkpoints. The key problem is that a single logical operation that impacts the global program state may be executed by different instances of the same process at different times, leading to semantic inconsistency. This paper presents the design, execution model, implementation, and validation of a communication layer for robust execution on volatile nodes. The research leads to a practical way to employ idle PCs for latency-tolerant parallel computing applications.
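The inconsistency described in this abstract can be made concrete with a small sketch. The Python below is an illustration only and is not taken from the paper: the class name `ReplicaSafeSpace`, the `(process_id, op_seq)` tagging scheme, and the logging strategy are all assumptions chosen to show one plausible way a shared space could keep replicated Put/Get calls consistent.

```python
class ReplicaSafeSpace:
    """Toy in-memory shared space (hypothetical, not the paper's design).

    Every logical operation is identified by (process_id, op_seq), so two
    replicas of the same process issue the same identifier for the same call.
    """

    def __init__(self):
        self._data = {}         # key -> current value
        self._put_log = set()   # identifiers of puts that were already applied
        self._get_log = {}      # identifier -> value returned the first time

    def put(self, process_id, op_seq, key, value):
        op = (process_id, op_seq)
        if op in self._put_log:
            return False        # a replica already executed this logical put
        self._put_log.add(op)
        self._data[key] = value
        return True

    def get(self, process_id, op_seq, key):
        op = (process_id, op_seq)
        if op not in self._get_log:
            # First replica to reach this get fixes the answer for all others.
            self._get_log[op] = self._data.get(key)
        return self._get_log[op]


space = ReplicaSafeSpace()
space.put(0, 1, "x", 10)            # fast replica of process 0, operation 1
v_fast = space.get(1, 1, "x")       # fast replica of process 1 reads x
space.put(0, 2, "x", 20)            # process 0 moves on and overwrites x
repeat = space.put(0, 1, "x", 10)   # slow replica of process 0 repeats operation 1
v_slow = space.get(1, 1, "x")       # slow replica of process 1 repeats its read
v_next = space.get(1, 2, "x")       # a later logical read sees the newer value
```

Without the two logs, the repeated put would roll `x` back to an older value and the slow replica of process 1 would read something different from its fast twin. With them, both replicas of a process observe the same result for the same logical operation.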
A High-Level Interpreted MPI Library for Parallel Computing in Volunteer Environments
Abstract. Idle desktops have been successfully used to run sequential and master-slave task parallel codes on a large scale in the context of volunteer computing. However, execution of message-passing parallel programs in such environments is challenging because the pool of nodes executing an application may be heterogeneous in architecture and operating system, may include widely distributed nodes across security domains, and may lose nodes frequently and without warning. The VolPEx (Parallel Execution on Volatile Nodes) tool set is building MPI support in such environments based on selective use of process redundancy and message logging. However, addressing this challenge requires tradeoffs between performance, portability, and usability. The paper introduces a robust MPI library that is designed to be highly portable across heterogeneous architectures and operating systems. This VolpexPyMPI library is built with Python, works with Linux and Windows platforms, and accepts user-level MPI programs written in C or FORTRAN. The performance of VolpexPyMPI is compared with a traditional C-based implementation of MPI. The paper examines in detail the tradeoffs of these usability-focused and performance-focused approaches.
Influence of the Progress Engine on the Performance of Asynchronous Communication Libraries
2010
Abstract
This technical report performs an in-depth performance comparison of two MPI libraries, namely VolpexMPI and Open MPI. The analysis is motivated by some unexpected results in which VolpexMPI shows better performance than Open MPI, despite some architectural decisions in the library that should lead to a performance degradation on a dedicated compute cluster. Our analysis indicates that general-purpose high-performance computing communication libraries are optimized for high-speed network interconnects such as InfiniBand, which, due to their low latency and high bandwidth, require an aggressive approach in pushing data into the network. This approach is, however, not necessarily optimal for a Gigabit Ethernet network. Specifically, the progress function of the communication library is called more often than necessary to saturate the Gigabit Ethernet network, which consequently introduces an overhead.
Application Resilience with Process Failures
Abstract. The notion of resiliency is concerned with constructing mission-critical applications that are able to operate through a wide variety of failures, errors, and malicious attacks. A number of approaches have been proposed in the literature based on fault tolerance achieved through replication of resources. In general, these approaches provide graceful degradation of performance to the point of failure but do not guarantee progress in the presence of multiple cascading and recurrent attacks. Our approach is to dynamically replicate message-passing processes, detect inconsistencies in their behavior, and restore the level of fault tolerance as a computation proceeds. This paper describes a novel operating system technology for resilient message-passing applications that is automated, scalable, and transparent. The technology provides mechanisms for process replication, multicast messaging, and process failure detection. We demonstrate resilience to failures and benchmark the performance impact using a distributed exemplar representative of applications constructed using domain decomposition.
Estimation of MPI Application Performance on Volunteer Environments
Abstract. Emerging MPI libraries, such as VolpexMPI and P2P MPI, allow message-passing parallel programs to execute effectively in heterogeneous volunteer environments despite frequent failures. However, the performance of message-passing codes varies widely in a volunteer environment, depending on the application characteristics and the computation and communication characteristics of the nodes and the interconnection network. This paper has the dual goal of developing and validating a tool chain to estimate performance of MPI codes in a volunteer environment and analyzing the suitability of the class of computations represented by NAS benchmarks for volunteer computing. The framework is deployed to estimate performance in a variety of possible volunteer configurations, including some based on the measured parameters of a campus volunteer pool. The results show slowdowns by factors between 2 and 10 for different NAS benchmark codes for execution on a realistic volunteer campus pool as compared to dedicated clusters.
A Robust Communication Framework for Parallel Execution on Volunteer PC Grids
Abstract. Volunteer PC grids represent massive computation capacity at a low cost, but are challenging to employ for parallel computing because of variable and unpredictable performance and availability. A communicating parallel program must employ explicit redundancy, or implicit redundancy with uncoordinated checkpoint-restart, to make continuous forward progress in such an unreliable environment. A communication model based on one-sided Put/Get calls to an abstract global shared space is a good match as processes can execute their communication operations independently and asynchronously. However, no existing system is designed for redundant communicating processes. The key problem is that a single logical operation that impacts the global program state may be executed by different instances of the same process at different times, leading to semantic inconsistency. This paper presents the design, execution model, implementation, and usage of Volpex, a communication layer for robust execution on volunteer PC grids. The research leads to a practical way to employ idle PCs for latency-tolerant parallel computing applications.

