Results 1 - 4 of 4
High Performance Linpack Benchmark: A Fault Tolerant Implementation without Checkpointing
- In Proceedings of the 25th ACM International Conference on Supercomputing (ICS 2011). ACM, 2011
"... The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For long running applications using a large number of processors, it is essential that fault tolerance be used to prevent a to ..."
Abstract - Cited by 23 (1 self)
The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For long running applications using a large number of processors, it is essential that fault tolerance be used to prevent a total loss of all finished computations after a failure. While checkpointing has been very useful to tolerate failures for a long time, it often introduces a considerable overhead, especially when applications modify a large amount of memory between checkpoints and the number of processors is large. In this paper, we propose an algorithm-based recovery scheme for the High Performance Linpack benchmark (which modifies a large amount of memory in each iteration) to tolerate fail-stop failures without checkpointing. It was proved by Huang and Abraham that a checksum added to a matrix will be maintained after the matrix is factored. We demonstrate that, for the right-looking LU factorization algorithm, the checksum is maintained at each step of the computation. Based on this checksum relationship maintained at each step in the middle of the computation, we demonstrate that fail-stop process failures in High Performance Linpack can be tolerated without checkpointing. Because no periodic checkpoint is necessary during computation and no rollback is necessary during recovery, the proposed recovery scheme is highly scalable and has good potential to scale to extreme scale computing and beyond. Experimental results on the supercomputer Jaguar demonstrate that the fault tolerance overhead introduced by the proposed recovery scheme is negligible.
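The per-step checksum invariant the abstract describes is easy to check numerically. A minimal NumPy sketch, assuming unpivoted, dense, single-process right-looking LU with one appended checksum column (the paper's actual setting is HPL's pivoted, block-cyclic distributed factorization):

import numpy as np

def lu_checksum_demo(A):
    # Unpivoted right-looking LU on A with one appended row-checksum
    # column. After every trailing update, the checksum column of the
    # Schur complement must still equal its row sums; this is the
    # per-step invariant that recovery without checkpointing relies on.
    n = A.shape[0]
    Ac = np.hstack([A, A @ np.ones((n, 1))])
    for k in range(n):
        Ac[k+1:, k] /= Ac[k, k]                               # L panel
        Ac[k+1:, k+1:] -= np.outer(Ac[k+1:, k], Ac[k, k+1:])  # trailing update
        assert np.allclose(Ac[k+1:, n], Ac[k+1:, k+1:n].sum(axis=1))
    return Ac

lu_checksum_demo(np.random.rand(6, 6) + 6 * np.eye(6))  # diagonally dominant, so no pivoting needed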
Algorithmic Cholesky factorization fault recovery
- In IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010
"... Abstract—Modeling and analysis of large scale scientific systems often use linear least squares regression, frequently employing Cholesky factorization to solve the resulting set of linear equations. With large matrices, this often will be performed in high performance clusters containing many proce ..."
Abstract - Cited by 5 (1 self)
Modeling and analysis of large scale scientific systems often use linear least squares regression, frequently employing Cholesky factorization to solve the resulting set of linear equations. With large matrices, this often will be performed in high performance clusters containing many processors. Assuming a constant failure rate per processor, the probability of a failure occurring during the execution increases linearly with additional processors. Fault tolerant methods attempt to reduce the expected execution time by allowing recovery from failure. This paper presents an analysis and implementation of a fault tolerant Cholesky factorization algorithm that does not require checkpointing for recovery from fail-stop failures. Rather, this algorithm uses redundant data added in an additional set of processors. This differs from previous works with algorithmic methods as it addresses fail-stop failures rather than fail-continue cases. The implementation and experimentation using ScaLAPACK demonstrate that this method has decreasing overhead in relation to overall runtime as the matrix size increases, and thus shows promise to reduce the expected runtime for Cholesky factorizations on very large matrices.
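The recovery principle (redundant data kept on an extra set of processors) can be sketched generically. A minimal NumPy illustration assuming one panel per process and a single sum-checksum process; the names and layout are illustrative, not the paper's ScaLAPACK implementation:

import numpy as np

# Each of p processes owns one m-by-nb panel; an additional process
# holds their elementwise sum. After a single fail-stop failure, the
# lost panel is the checksum minus the surviving panels.
p, m, nb = 4, 8, 2
panels = [np.random.rand(m, nb) for _ in range(p)]
checksum = sum(panels)                     # held by the extra process

failed = 2                                 # process 2 fail-stops
survivors = [P for i, P in enumerate(panels) if i != failed]
recovered = checksum - sum(survivors)
assert np.allclose(recovered, panels[failed])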
Finding Optimally Conditioned Matrices using Evolutionary Computation
"... In this work, evolutionary computation approaches are used to find optimally conditioned matrices. A matrix is defined to be optimally conditioned if the maximum condition of all square submatrices is minimized. The evolutionary computation is shown to outperform existing approaches, as well to prod ..."
Abstract
In this work, evolutionary computation approaches are used to find optimally conditioned matrices. A matrix is defined to be optimally conditioned if the maximum condition number over all of its square submatrices is minimized. The evolutionary computation is shown to outperform existing approaches, as well as to produce matrices whose condition numbers are near the global optimum.
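A minimal NumPy sketch of the search problem, assuming a simple (1+1) evolution strategy with Gaussian mutation (the paper's actual operators and parameters are not reproduced here):

import itertools
import numpy as np

def worst_condition(M):
    # Largest 2-norm condition number over all square submatrices of M
    # (exhaustive enumeration, feasible only for small M).
    m, n = M.shape
    return max(
        np.linalg.cond(M[np.ix_(r, c)])
        for k in range(1, min(m, n) + 1)
        for r in itertools.combinations(range(m), k)
        for c in itertools.combinations(range(n), k))

rng = np.random.default_rng(0)
best = rng.standard_normal((3, 4))
best_fit = worst_condition(best)
for _ in range(500):                       # (1+1)-ES: keep the child only if it is better
    child = best + 0.1 * rng.standard_normal(best.shape)
    fit = worst_condition(child)
    if fit < best_fit:
        best, best_fit = child, fit
print(best_fit)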
A Class of Real Expander Codes Based on Projective-Geometrically Constructed Ramanujan Graphs
- 2011
"... Quite recently, codes based on real field are gaining momentum in terms of research and applications. In high-performance computing, these codes are being explored to provide fault tolerance under node failures. In this paper, we propose novel real cycle codes based on expander graphs. The requisite ..."
Abstract
Quite recently, codes over the real field have been gaining momentum in terms of research and applications. In high-performance computing, these codes are being explored to provide fault tolerance under node failures. In this paper, we propose novel real cycle codes based on expander graphs. The requisite graphs are Ramanujan graphs constructed using incidence matrices of the appropriate projective-geometric objects. The proposed codes are elegant in terms of reduced-complexity encoding and very simple erasure correction. Further, the codes are guaranteed to correct three erasures. Apart from building the codes on sound existing principles, the paper also presents the necessary simulation results and justification of the useful properties.
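Erasure correction for a real-field code can be sketched as follows, with a random parity-check matrix standing in for the paper's Ramanujan-graph construction (illustrative only; the actual construction gives low-complexity decoding that this generic least-squares solve does not):

import numpy as np

rng = np.random.default_rng(1)
n, m = 10, 4                          # code length and number of parity checks
H = rng.standard_normal((m, n))       # real parity-check matrix: codewords satisfy H @ c == 0
c = np.linalg.svd(H)[2][-1]           # any null-space vector of H is a codeword

erased = [1, 4, 7]                    # three erasures at known positions
kept = [i for i in range(n) if i not in erased]
x = np.linalg.lstsq(H[:, erased], -H[:, kept] @ c[kept], rcond=None)[0]
assert np.allclose(x, c[erased])      # erased symbols recovered exactly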