Plfs: A checkpoint filesystem for parallel applications
, 2009
Cited by 23 (6 self)
Parallel applications running across thousands of processors must protect themselves from inevitable system failures. Many applications insulate themselves from failures by checkpointing. For many applications, checkpointing into a shared single file is most convenient. With such an approach, writes are often small and not aligned with file system boundaries. Unfortunately for these applications, this preferred data layout results in pathologically poor performance from the underlying file system, which is optimized for large, aligned writes to non-shared files. To address this fundamental mismatch, we have developed a virtual parallel log-structured file system, PLFS. PLFS remaps an application’s preferred data layout into one which is optimized for the underlying file system. Through testing on PanFS, Lustre, and GPFS, we have seen that this layer of indirection and reorganization can reduce checkpoint time by an order of magnitude for several important benchmarks and real applications without any application modification.
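The remapping idea in this abstract can be illustrated with a toy in-memory model. This is only a sketch under assumptions: the class name `LogStructuredFile`, its methods, and the index layout are hypothetical and do not reflect PLFS's actual API or on-disk format. Each writer appends to its own log, and an index records where each logical extent of the shared file landed.

```python
class LogStructuredFile:
    """Toy model (hypothetical, not PLFS itself): one append-only log per writer."""

    def __init__(self):
        self.logs = {}    # writer id -> bytearray holding that writer's appended data
        self.index = []   # (logical_offset, length, writer, log_offset), in write order

    def write(self, writer, logical_offset, data):
        # Every write becomes a sequential append to the writer's private log.
        log = self.logs.setdefault(writer, bytearray())
        self.index.append((logical_offset, len(data), writer, len(log)))
        log.extend(data)

    def read(self, offset, length):
        # Rebuild the logical view by replaying the index; later writes win.
        out = bytearray(length)  # unwritten regions read back as zero bytes
        for lo, n, writer, po in self.index:
            start = max(lo, offset)
            end = min(lo + n, offset + length)
            if start < end:
                src = self.logs[writer][po + (start - lo): po + (end - lo)]
                out[start - offset:end - offset] = src
        return bytes(out)


# Two writers issue small, interleaved (strided) writes to one logical file.
f = LogStructuredFile()
f.write(0, 0, b"AA")
f.write(1, 2, b"BB")
f.write(0, 4, b"CC")
f.write(1, 6, b"DD")
logical_view = f.read(0, 8)
```

In this model the strided pattern seen by the application turns into purely sequential appends per writer, at the cost of an index lookup on reads.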
An Evaluation of Checkpoint Recovery for Massively Multiplayer Online Games
Cited by 5 (2 self)
Massively multiplayer online games (MMOs) have emerged as an exciting new class of applications for database technology. MMOs simulate long-lived, interactive virtual worlds, which proceed by applying updates in frames or ticks, typically at 30 or 60 Hz. In order to sustain the resulting high update rates of such games, game state is kept entirely in main memory by the game servers. Nevertheless, durability in MMOs is usually achieved by a standard DBMS implementing ARIES-style recovery. This architecture limits scalability, forcing MMO developers to either invest in high-end hardware or to over-partition their virtual worlds. In this paper, we evaluate the applicability of existing checkpoint recovery techniques developed for main-memory DBMS to MMO workloads. Our thorough experimental evaluation uses a detailed simulation model fed with update traces generated synthetically and from a prototype game server. Based on our results, we recommend that MMO developers adopt a copy-on-update scheme with a double-backup disk organization to checkpoint game state. This scheme outperforms alternatives in terms of the latency introduced in the game as well as the time necessary to recover after a crash.
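The recommended combination can be sketched roughly as follows. This is a minimal in-memory illustration under assumptions, not the paper's implementation: the class `CopyOnUpdateStore` and its methods are hypothetical, the two backup slots are plain dictionaries standing in for disk files, and the set of keys is assumed fixed. While a checkpoint is in progress, the first update to an object saves its old value, so the checkpoint reflects the state at the moment it began. Backups alternate between two slots, so one complete backup always survives.

```python
class CopyOnUpdateStore:
    """Hypothetical sketch of copy-on-update with two alternating backups."""

    def __init__(self, state):
        self.state = dict(state)      # live game state (fixed key set assumed)
        self.backups = [None, None]   # two backup slots standing in for disk files
        self.latest = None            # slot index of the last completed backup
        self.saved = None             # pre-update values; None when no checkpoint runs

    def begin_checkpoint(self):
        self.saved = {}

    def update(self, key, value):
        # Copy the old value only on the first update during an active checkpoint.
        if self.saved is not None and key not in self.saved:
            self.saved[key] = self.state[key]
        self.state[key] = value

    def finish_checkpoint(self):
        # The snapshot is the live state with saved pre-update values restored.
        snapshot = dict(self.state)
        snapshot.update(self.saved)
        target = 0 if self.latest != 0 else 1   # never overwrite the newest backup
        self.backups[target] = snapshot
        self.latest = target
        self.saved = None

    def recover(self):
        return dict(self.backups[self.latest])


store = CopyOnUpdateStore({"hp": 100, "x": 0})
store.begin_checkpoint()
store.update("hp", 90)   # old value 100 is copied aside
store.update("hp", 80)   # no second copy needed
store.finish_checkpoint()
first_recovery = store.recover()
```

The game loop keeps applying updates while the checkpoint is taken, and a crash in the middle of writing one slot leaves the other slot intact.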
Improving the availability of supercomputer job input data using temporal replication, submitted for publication
Cited by 4 (3 self)
Supercomputers are stepping into the Peta-scale and Exascale era, wherein handling hundreds of concurrent system failures is an urgent challenge. In particular, storage system failures have been identified as a major source of service interruptions in supercomputers. RAID solutions alone cannot provide sufficient storage protection as (1) average disk recovery time is projected to grow, making RAID groups increasingly vulnerable to additional failures during data reconstruction, and (2) disk-level data protection cannot mask higher-level faults, such as software/hardware failures of entire I/O nodes. This paper presents a complementary approach based on the observation that files in the supercomputer scratch space are typically accessed by batch jobs, whose execution can be anticipated. Therefore, we propose to transparently, selectively, and temporarily replicate "active" job input data, by coordinating the parallel file system with the batch job scheduler. We have implemented the temporal replication scheme in the popular Lustre parallel file system and evaluated it with both real-cluster experiments and trace-driven simulations. Our results show that temporal replication allows for fast online data reconstruction, with a reasonably low overall space and I/O bandwidth overhead.
Application resilience: Making progress in spite of failure
- In The Workshop on Resilience held in conjunction with the IEEE International Conference on Cluster Computing and the Grid (CCGRID 2008)
, 2008
Cited by 2 (2 self)
While measures such as raw compute performance and system capacity continue to be important factors for evaluating cluster performance, such issues as system reliability and application resilience have become increasingly important as cluster sizes rapidly grow. Although efforts to directly improve fault-tolerance are important, it is also essential to accept that application failures will inevitably occur and to ensure that progress is made despite these failures. Application monitoring frameworks are central to providing application resilience. As such, the central theme of this paper is to address the impact that application monitoring detection latency has on the overall system performance. We find that immediate fault detection is not necessary in order to obtain substantial improvement in performance. This conclusion is significant because it implies that less complex, highly portable, and predominantly less expensive failure detection schemes would provide adequate application resilience.
Efficient Exploratory Testing of Concurrent Systems
, 2011
Cited by 2 (2 self)
In our experience, exploratory testing has reached a level of maturity that makes it a practical and often the most cost-effective approach to testing. Notably, previous work has demonstrated that exploratory testing is capable of finding bugs even in well-tested systems [4, 17, 24, 23]. However, the number of bugs found gives little indication of the efficiency of a testing approach. To drive testing efficiency, this paper focuses on techniques for measuring and maximizing the coverage achieved by exploratory testing. In particular, this paper describes the design, implementation, and evaluation of Eta, a framework for exploratory testing of multithreaded components of a large-scale cluster management system at Google. For simple tests (with millions to billions of possible executions), Eta achieves complete coverage one to two orders of magnitude faster than random testing. For complex tests, Eta adopts a state space reduction technique to avoid the need to explore over 85% of executions and harnesses parallel processing to explore multiple test executions concurrently, achieving a throughput increase of up to 17.5×.
On-the-fly Recovery of Job Input Data in Supercomputers
Cited by 1 (1 self)
Storage system failure is a serious concern as we approach Petascale computing. Even at today’s sub-Petascale levels, I/O failure is the leading cause of downtimes and job failures. We contribute a novel, on-the-fly recovery framework for job input data into supercomputer parallel file systems. The framework exploits key traits of the HPC I/O workload to reconstruct lost input data during job execution from remote, immutable copies. Each reconstructed data stripe is made immediately accessible in the client request order due to the delayed metadata update and fine-granular locking, while unrelated access to the same file remains unaffected. We have implemented the recovery component within the Lustre parallel file system, thus building a novel application-transparent online recovery solution. Our solution is integrated into Lustre’s two-level locking scheme using a two-phase blocking protocol. Combining parametric and simulation studies, our experiments demonstrate a significant improvement in HPC center serviceability and user job turnaround time.
Checkpointing vs. Migration for Post-Petascale Machines
, 2009
Cited by 1 (1 self)
We craft a few scenarios for the execution of sequential and parallel jobs on future generation machines. Checkpointing or migration: which technique to choose?
Information and Infrastructure Integrity Initiative
Mitigating the impact of computer failure is possible if accurate failure predictions are provided. Resources, applications, and services can be scheduled around predicted failure and limit the impact. Such strategies are especially important for multi-computer systems, such as compute clusters, that experience a higher rate of failure due to the large number of components. However, providing accurate predictions with sufficient lead time remains a challenging problem. This paper describes a new spectrum-kernel Support Vector Machine (SVM) approach to predict failure events based on system log files. These files contain messages that represent a change of system state. While a single message in the file may not be sufficient for predicting failure, a sequence or pattern of messages may be. The approach described in this paper uses a sliding window (sub-sequence) of messages to predict the likelihood of failure. A frequency representation of the observed message sub-sequences is then used as input to the SVM. The SVM then associates the messages with a class of failed or non-failed system. Experimental results using actual system log files from a Linux-based compute cluster indicate the proposed spectrum-kernel SVM approach has promise and can predict hard disk failure with an accuracy of 73% two days in advance.
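The sliding-window frequency representation can be sketched as below. This is an illustration under assumptions rather than the paper's code: the function names, the window length `k`, and the message labels are hypothetical, and the kernel shown is the usual spectrum-kernel form (an inner product of sub-sequence counts), which the abstract does not spell out.

```python
from collections import Counter


def spectrum_features(messages, k):
    """Count every length-k sub-sequence seen by a window sliding over the log."""
    return Counter(tuple(messages[i:i + k]) for i in range(len(messages) - k + 1))


def spectrum_kernel(a, b, k):
    """Inner product of the two sub-sequence count vectors."""
    fa = spectrum_features(a, k)
    fb = spectrum_features(b, k)
    return sum(count * fb[sub] for sub, count in fa.items())


# Hypothetical message types standing in for parsed system log entries.
log_a = ["io_err", "retry", "io_err", "retry", "fail"]
log_b = ["io_err", "retry", "ok"]

features_a = spectrum_features(log_a, 2)
similarity = spectrum_kernel(log_a, log_b, 2)
```

A kernel of this shape could be passed to an SVM library that accepts precomputed kernel matrices. The choice of `k` and the mapping from raw log lines to message types are left open here.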
On-the-fly Recovery of Job Input Data in Supercomputers
- In 37th International Conference on Parallel Processing
Storage system failure is a serious concern as we approach Petascale computing. Even at today’s sub-Petascale levels, I/O failure is the leading cause of downtimes and job failures. We contribute a novel, on-the-fly recovery framework for job input data into supercomputer parallel file systems. The framework exploits key traits of the HPC I/O workload to reconstruct lost input data during job execution from remote, immutable copies. Each reconstructed data stripe is made immediately accessible in the client request order due to the delayed metadata update and fine-granular locking, while unrelated access to the same file remains unaffected. We have implemented the recovery component within the Lustre parallel file system, thus building a novel application-transparent online recovery solution. Our solution is integrated into Lustre’s two-level locking scheme using a two-phase blocking protocol. Combining parametric and simulation studies, our experiments demonstrate a significant improvement in HPC center serviceability and user job turnaround time.
HPC I/O middleware file formats
"... ...And eat it too: High read performance in write-optimized ..."

