Results 1 – 4 of 4
High Performance Linpack Benchmark: A Fault Tolerant Implementation without Checkpointing
In Proceedings of the 25th ACM International Conference on Supercomputing (ICS 2011). ACM
Abstract

Cited by 23 (1 self)
The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For long-running applications using a large number of processors, it is essential that fault tolerance be used to prevent a total loss of all finished computations after a failure. While checkpointing has long been useful for tolerating failures, it often introduces considerable overhead, especially when applications modify a large amount of memory between checkpoints and the number of processors is large. In this paper, we propose an algorithm-based recovery scheme for the High Performance Linpack benchmark (which modifies a large amount of memory in each iteration) to tolerate fail-stop failures without checkpointing. It was proved by Huang and Abraham that a checksum added to a matrix will be maintained after the matrix is factored. We demonstrate that, for the right-looking LU factorization algorithm, the checksum is maintained at each step of the computation. Based on this checksum relationship maintained at each step in the middle of the computation, we demonstrate that fail-stop process failures in High Performance Linpack can be tolerated without checkpointing. Because no periodic checkpointing is necessary during computation and no rollback is necessary during recovery, the proposed recovery scheme is highly scalable and has good potential to scale to extreme-scale computing and beyond. Experimental results on the supercomputer Jaguar demonstrate that the fault tolerance overhead introduced by the proposed recovery scheme is negligible.
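The checksum invariant this abstract relies on can be sketched in a few lines of Python. This is an illustrative toy, not the paper's HPL implementation: a checksum row holding column sums is appended to a small dense matrix, right-looking elimination without pivoting is applied to the encoded matrix, and at every step the checksum row still equals the column sums of the not-yet-finished rows, which is what makes a lost row recoverable without rollback. The function name and the no-pivoting simplification are assumptions for clarity.

```python
# Toy sketch of the Huang-Abraham checksum invariant under right-looking
# elimination (no pivoting). Not the paper's method, only the idea behind it.

def lu_with_checksum(A):
    """Append a checksum row (column sums) and run right-looking
    elimination on the encoded matrix, asserting at every step that
    the checksum row equals the column sums of the trailing rows."""
    n = len(A)
    M = [row[:] for row in A]
    # extra "processor row" holding the column sums of A
    M.append([sum(A[i][j] for i in range(n)) for j in range(n)])
    for k in range(n):
        for i in range(k + 1, n + 1):          # include the checksum row
            m = M[i][k] / M[k][k]
            for j in range(k, n):
                M[i][j] -= m * M[k][j]
        # invariant: checksum row == sum of the trailing rows k+1 .. n-1,
        # so any single lost row equals checksum minus the surviving rows
        for j in range(k + 1, n):
            expected = sum(M[i][j] for i in range(k + 1, n))
            assert abs(M[n][j] - expected) < 1e-9
    return M

A = [[4.0, 3.0, 2.0], [2.0, 4.0, 1.0], [1.0, 2.0, 3.0]]
lu_with_checksum(A)  # runs without tripping any invariant check
```

Because the invariant holds at every step rather than only at the end, recovery needs no checkpoint: the failed process's data is reconstructed from the checksum and the surviving processes' current data.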
Algorithmic Cholesky factorization fault recovery
In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010
Abstract

Cited by 5 (1 self)
Abstract—Modeling and analysis of large-scale scientific systems often use linear least squares regression, frequently employing Cholesky factorization to solve the resulting set of linear equations. With large matrices, this will often be performed on high performance clusters containing many processors. Assuming a constant failure rate per processor, the probability of a failure occurring during the execution increases linearly with additional processors. Fault tolerant methods attempt to reduce the expected execution time by allowing recovery from failure. This paper presents an analysis and implementation of a fault tolerant Cholesky factorization algorithm that does not require checkpointing for recovery from fail-stop failures. Rather, the algorithm uses redundant data held on an additional set of processors. This differs from previous work on algorithmic methods in that it addresses fail-stop failures rather than fail-continue cases. The implementation and experimentation using ScaLAPACK demonstrate that this method's overhead decreases relative to overall runtime as the matrix size increases, and it thus shows promise for reducing the expected runtime of Cholesky factorizations on very large matrices.
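The "redundant data on an additional set of processors" idea can be illustrated with a minimal sketch, assuming simple sum checksums. This is not the paper's ScaLAPACK implementation; the functions and the block layout are made up for illustration. One extra block holds the elementwise sum of the data blocks, so after a fail-stop failure the lost block is the checksum minus the surviving blocks, with no rollback required.

```python
# Minimal sketch (illustrative, not the paper's implementation) of
# fail-stop recovery from a redundant checksum block.

def encode(blocks):
    """Append one redundant block holding the elementwise sum,
    as if it lived on an extra set of processes."""
    n = len(blocks[0])
    return blocks + [[sum(b[j] for b in blocks) for j in range(n)]]

def recover(blocks, lost):
    """Reconstruct the block at index `lost` from the checksum block
    (last entry) and the surviving blocks."""
    n = len(blocks[0])
    survivors = [b for i, b in enumerate(blocks[:-1]) if i != lost]
    return [blocks[-1][j] - sum(b[j] for b in survivors) for j in range(n)]

data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # blocks on three processes
enc = encode(data)
assert recover(enc, 1) == [3.0, 4.0]           # process 1 fails; its data returns
```

In the actual algorithm the checksum relationship must also be preserved through the factorization steps, which is what distinguishes algorithm-based fault tolerance from plain replication.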
Finding Optimally Conditioned Matrices using Evolutionary Computation
Abstract
In this work, evolutionary computation approaches are used to find optimally conditioned matrices. A matrix is defined to be optimally conditioned if the maximum condition number over all of its square submatrices is minimized. The evolutionary computation is shown to outperform existing approaches, as well as to produce matrices whose condition numbers are near the global optimum.
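The objective being minimized can be stated concretely. The sketch below, assuming the 2-norm condition number, evaluates it by brute force over every square submatrix; the function name is made up, and this enumeration is exponential in the matrix size, so it is feasible only for tiny matrices and shown purely to pin down the definition.

```python
# Brute-force evaluation of the objective: the maximum 2-norm condition
# number over all square submatrices. Illustrative only; exponential cost.
from itertools import combinations

import numpy as np

def max_submatrix_cond(A):
    m, n = A.shape
    worst = 0.0
    for k in range(1, min(m, n) + 1):
        for rows in combinations(range(m), k):
            for cols in combinations(range(n), k):
                sub = A[np.ix_(rows, cols)]       # k-by-k submatrix
                worst = max(worst, np.linalg.cond(sub))
    return worst

A = np.array([[2.0, 1.0], [1.0, 2.0]])
worst = max_submatrix_cond(A)  # dominated by the full matrix here
```

An evolutionary search, as in the paper, would use a value like this as (part of) its fitness function rather than enumerate submatrices of large candidates directly.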
A Class of Real Expander Codes Based on Projective Geometrically Constructed Ramanujan Graphs
, 2011
Abstract
Quite recently, codes over the real field have been gaining momentum in terms of research and applications. In high-performance computing, these codes are being explored to provide fault tolerance under node failures. In this paper, we propose novel real cycle codes based on expander graphs. The requisite graphs are Ramanujan graphs constructed using incidence matrices of the appropriate projective-geometric objects. The proposed codes are elegant in terms of reduced-complexity encoding and very simple erasure correction, and they are guaranteed to correct three erasures. In addition to building the codes from sound existing principles, the paper presents simulation results and justifies the codes' useful properties.