Results 1 - 10 of 16
Scalable Temporal Order Analysis for Large Scale Debugging
- Proceedings of the 2009 International Conference for High Performance Computing, Networking, Storage and Analysis
, 2009
"... We present a scalable temporal order analysis technique that sup-ports debugging of large scale applications by classifying MPI tasks based on their logical program execution order. Our approach combines static analysis techniques with dynamic analysis to de-termine this temporal order scalably. It ..."
Abstract
-
Cited by 14 (3 self)
- Add to MetaCart
(Show Context)
We present a scalable temporal order analysis technique that supports debugging of large scale applications by classifying MPI tasks based on their logical program execution order. Our approach combines static analysis techniques with dynamic analysis to determine this temporal order scalably. It uses scalable stack trace analysis techniques to guide selection of critical program execution points in anomalous application runs. Our novel temporal ordering engine then leverages this information along with the application's static control structure to apply data flow analysis techniques to determine key application data such as loop control variables. We then use lightweight techniques to gather the dynamic data that determines the temporal order of the MPI tasks. Our evaluation, which extends the Stack Trace Analysis Tool (STAT), demonstrates that this temporal order analysis technique can isolate bugs in benchmark codes with injected faults as well as a real world hang case with AMG2006.
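The core idea can be sketched concretely (an illustration only, not the tool's implementation; the program points, variable names, and sample data below are hypothetical): each rank reports the program point where it is stopped and the values of the loop control variables that dominate that point, and ranks are then grouped and ordered by this logical progress.

# Sketch: order MPI ranks by logical progress, assuming each rank reports
# (source_location, loop_counter_values) at the point where it is stopped.
# The point ordering and all sample data are hypothetical illustrations.
from collections import defaultdict

# Static control structure: program points in their logical execution order
# (in the real tool this comes from static analysis of the application).
POINT_ORDER = {"init": 0, "loop_body": 1, "reduce": 2, "finalize": 3}

def progress_key(sample):
    """Map one rank's sample to a sortable progress key."""
    point, loop_counters = sample
    return (POINT_ORDER[point], tuple(loop_counters))

def classify_ranks(samples):
    """Group ranks with identical progress and order the groups."""
    groups = defaultdict(list)
    for rank, sample in samples.items():
        groups[progress_key(sample)].append(rank)
    return sorted(groups.items())          # least-progressed group first

if __name__ == "__main__":
    # Hypothetical stop-point data for 6 ranks of a hung run.
    samples = {0: ("loop_body", [41]), 1: ("loop_body", [42]),
               2: ("loop_body", [42]), 3: ("reduce", [42]),
               4: ("reduce", [42]),    5: ("loop_body", [41])}
    for key, ranks in classify_ranks(samples):
        print(key, "->", sorted(ranks))
    # Ranks 0 and 5 are one iteration behind: a likely place to look for the hang.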
Diagnosing performance bottlenecks in emerging petascale applications
- In SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
, 2009
"... Cutting-edge science and engineering applications require petascale computing. It is, however, a significant challenge to use petascale computing platforms effectively. Consequently, there is a critical need for performance tools that enable scientists to understand impediments to performance on eme ..."
Abstract
-
Cited by 11 (4 self)
- Add to MetaCart
(Show Context)
Cutting-edge science and engineering applications require petascale computing. It is, however, a significant challenge to use petascale computing platforms effectively. Consequently, there is a critical need for performance tools that enable scientists to understand impediments to performance on emerging petascale systems. In this paper, we describe HPCToolkit—a suite of multi-platform tools that supports sampling-based analysis of application performance on emerging petascale platforms. HPCToolkit uses sampling to pinpoint and quantify both scaling and node performance bottlenecks. We study several emerging petascale applications on the Cray XT and IBM BlueGene/P platforms and use HPCToolkit to identify specific source lines — in their full calling context — associated with performance bottlenecks in these codes. Such information is exactly what application developers need to know to improve their applications to take full advantage of the power of petascale systems.
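As a rough illustration of sampling-based performance analysis in general (not HPCToolkit itself, which samples timers and hardware counters in optimized native code and attributes costs to full calling contexts), a profiler can periodically interrupt execution and record where the program is; hot source lines accumulate the most samples. The sketch below is Unix-only and purely didactic.

# Toy illustration of sampling-based profiling (the general technique, not
# HPCToolkit's implementation). Unix-only: uses ITIMER_PROF / SIGPROF.
import signal, collections

samples = collections.Counter()

def _on_sample(signum, frame):
    # Attribute the sample to the function and line executing when the timer fired.
    samples[(frame.f_code.co_name, frame.f_lineno)] += 1

def profile(fn, interval=0.001):
    signal.signal(signal.SIGPROF, _on_sample)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        fn()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0)
    return samples.most_common(5)

def workload():
    total = 0
    for i in range(2_000_000):
        total += i * i          # hot line: should dominate the samples
    return total

if __name__ == "__main__":
    for (func, line), count in profile(workload):
        print(f"{func}:{line}  {count} samples")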
Addressing Failures in Exascale Computing
"... The current approach to resilience for large high-performance computing (HPC) machines is based on global application checkpoint/restart. The state of each application is checkpointed periodically; if the application fails, then it is restarted from the last checkpoint. Preserving this approach is h ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
The current approach to resilience for large high-performance computing (HPC) machines is based on global application checkpoint/restart. The state of each application is checkpointed periodically; if the application fails, then it is restarted from the last checkpoint. Preserving this approach is highly desirable because it requires no change in application software. The success of this method depends crucially on the following assumptions:
1. The time to checkpoint is much shorter than the mean time before failure (MTBF).
2. The time to restart (which includes the time to restore the system to a consistent state) is much shorter than the MTBF.
3. The checkpoint is correct—errors that could corrupt the checkpointed state are detected before the checkpoint is committed.
4. Committed output data is correct (output is committed when it is read).
It is not clear that these assumptions are currently satisfied. In particular, can one ignore silent data corruptions (SDCs)? It is clear that satisfying these assumptions will be harder in the future for the following reasons:
• MTBF is decreasing faster than disk checkpoint time.
• MTBF is decreasing faster than recovery time—especially recovery from global system failures.
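The pressure on these assumptions can be made concrete with a back-of-the-envelope calculation that is not from the report: using Young's well-known approximation for the optimal checkpoint interval, the fraction of machine time lost to writing checkpoints and to recomputing lost work grows sharply once checkpoint time stops being small relative to the MTBF. The checkpoint times and MTBF values below are illustrative assumptions.

# Back-of-the-envelope sketch (illustrative numbers, not from the paper):
# Young's approximation gives an optimal checkpoint interval
#   tau ~= sqrt(2 * checkpoint_time * MTBF),
# and the fraction of time lost is roughly
#   checkpoint_time / tau   (checkpoint overhead)
# + tau / (2 * MTBF)        (expected recomputation per failure).
import math

def lost_fraction(checkpoint_minutes, mtbf_hours):
    delta = checkpoint_minutes / 60.0          # hours
    mtbf = float(mtbf_hours)
    tau = math.sqrt(2.0 * delta * mtbf)        # Young's optimal interval (hours)
    return delta / tau + tau / (2.0 * mtbf)

if __name__ == "__main__":
    for ckpt_min, mtbf_h in [(10, 24), (10, 4), (20, 1)]:
        frac = lost_fraction(ckpt_min, mtbf_h)
        print(f"checkpoint {ckpt_min} min, MTBF {mtbf_h} h -> "
              f"{frac:.0%} of machine time lost")
    # As MTBF shrinks toward the checkpoint time, the lost fraction becomes
    # unacceptable, which is the scaling concern the abstract raises.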
Vrisha: using scaling properties of parallel programs for bug detection and localization
- In Proceedings of the 20th international symposium on High performance distributed computing
, 2011
"... Detecting and isolating bugs that arise in parallel programs is a tedious and a challenging task. An especially subtle class of bugs are those that are scale-dependent: while smallscale test cases may not exhibit the bug, the bug arises in large-scale production runs, and can change the result or pe ..."
Abstract
-
Cited by 4 (4 self)
- Add to MetaCart
(Show Context)
Detecting and isolating bugs that arise in parallel programs is a tedious and challenging task. An especially subtle class of bugs are those that are scale-dependent: while small-scale test cases may not exhibit the bug, the bug arises in large-scale production runs, and can change the result or performance of an application. A popular approach to finding bugs is statistical bug detection, where abnormal behavior is detected through comparison with bug-free behavior. Unfortunately, for scale-dependent bugs, there may not be bug-free runs at large scales and therefore traditional statistical techniques are not viable. In this paper, we propose Vrisha, a statistical approach to detecting and localizing scale-dependent bugs. Vrisha detects bugs in large-scale programs by building models of behavior based on bug-free behavior at small scales. These models are constructed using kernel canonical correlation analysis (KCCA) and exploit scale-determined properties, whose values are predictably dependent on application scale. We use Vrisha to detect and diagnose two bugs caused by errors in popular MPI libraries and show that our techniques can be implemented with low overhead and low false-positive rates.
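A stripped-down sketch of the approach (using an ordinary least-squares fit per feature as a stand-in for the paper's KCCA; the feature and all numbers are invented): fit how an observed feature grows with scale from small, bug-free runs, then flag a large-scale run whose observation deviates from the extrapolated trend.

# Simplified stand-in for the Vrisha idea: model feature growth vs. scale from
# small bug-free runs, then flag large-scale runs that break the trend.
# (Per-feature least-squares fit, not the paper's KCCA; data are made up.)
import numpy as np

# Training runs: (number of processes, observed feature, e.g. messages per rank)
train_scales   = np.array([4, 8, 16, 32, 64])
train_messages = np.array([12, 24, 48, 96, 192])      # grows linearly with scale

# Fit feature = a * scale + b on the small-scale, bug-free runs.
a, b = np.polyfit(train_scales, train_messages, deg=1)

def check(scale, observed, tolerance=0.25):
    expected = a * scale + b
    deviation = abs(observed - expected) / expected
    return deviation, deviation > tolerance

if __name__ == "__main__":
    # At 1024 processes, the second (buggy) run sends far fewer messages
    # than the extrapolated trend predicts.
    for scale, observed in [(1024, 3072), (1024, 1400)]:
        dev, is_bug = check(scale, observed)
        print(f"scale={scale} observed={observed} deviation={dev:.0%} "
              f"{'ANOMALY' if is_bug else 'ok'}")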
Group File Operations for Scalable Tools and Middleware
- 16th International Conference on High Performance Computing (HiPC)
, 2009
"... Abstract. Group file operations are a new, intuitive idiom for tools and middleware- including parallel debuggers and runtimes, performance measurement and steering, and distributed resource management- that require scal-able operations on large groups of distributed files. The id-iom provides new s ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
Group file operations are a new, intuitive idiom for tools and middleware, including parallel debuggers and runtimes, performance measurement and steering, and distributed resource management, that require scalable operations on large groups of distributed files. The idiom provides new semantics for using file groups in standard file operations to eliminate costly iteration. A file-based idiom promotes conciseness and portability, and eases adoption. With explicit semantics for aggregation of group results, the idiom addresses a key scalability barrier. We have designed TBON-FS, a new distributed file system that provides scalable group file operations by leveraging tree-based overlay networks (TBONs) for scalable communication and data aggregation. We integrated group file operations into several tools: parallel versions of common utilities including cp, grep, rsync, tail, and top, and the Ganglia Distributed Monitoring System. Our experience verifies that the group file operation idiom is intuitive, easily adopted, and enables a wide variety of tools to run efficiently at scale.
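The idiom can be illustrated without the distributed file system underneath (gread and the reduction below are hypothetical stand-ins, not the TBON-FS interface): a group of files is opened once, one operation is applied to every member, and the results are aggregated into a single answer instead of being iterated over by the client.

# Illustration of the group-file-operation idiom over ordinary local files.
# FileGroup/gread are hypothetical stand-ins, not the TBON-FS API; in TBON-FS
# the per-member reads and the aggregation happen scalably inside a TBON.
import glob

class FileGroup:
    def __init__(self, pattern):
        self.paths = glob.glob(pattern)         # e.g. one status file per node

    def gread(self, reduce_fn):
        """Apply one read to every member and return a single aggregate."""
        results = []
        for path in self.paths:
            with open(path) as f:
                results.append(f.read())
        return reduce_fn(results)

if __name__ == "__main__":
    # Hypothetical per-node load files; the reduction returns the maximum load.
    group = FileGroup("/tmp/demo_nodes/*/loadavg")
    if group.paths:
        busiest = group.gread(lambda texts: max(float(t.split()[0]) for t in texts))
        print("highest 1-minute load across the group:", busiest)
    else:
        print("no files matched; create /tmp/demo_nodes/<node>/loadavg to try it")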
WuKong: Automatically Detecting and Localizing Bugs that Manifest at Large System Scales
"... A key challenge in developing large scale applications is finding bugs that are latent at the small scales of testing, but manifest themselves when the application is deployed at a large scale. Here, we ascribe a dual meaning to “large scale”—it could mean a large number of executing processes or ap ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
A key challenge in developing large-scale applications is finding bugs that are latent at the small scales of testing, but manifest themselves when the application is deployed at a large scale. Here, we ascribe a dual meaning to “large scale”—it could mean a large number of executing processes or applications ingesting large amounts of input data (or both). Traditional statistical techniques fail to detect or diagnose such kinds of bugs because no error-free run is available at the large deployment scales for training purposes. Prior work used scaling models to detect anomalous behavior at large scales without training on correct behavior at that scale. However, that work cannot localize bugs automatically, i.e., cannot pinpoint the region of code responsible for the error. In this paper, we resolve that shortcoming by making the following three contributions: (i) we develop an automatic diagnosis technique, based on feature reconstruction; (ii) we design a heuristic to effectively prune the large feature space; and (iii) we demonstrate that our system scales well, in terms of both accuracy and overhead. We validate our design through a large-scale fault-injection study and two case studies of real-world bugs, finding that our system can effectively localize bugs in 92.5% of the cases, dramatically easing the challenge of finding bugs in large-scale programs.
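Extending the detection sketch above, localization can be illustrated as follows (a simplification, not the paper's models; feature names and numbers are invented): fit a per-feature model of expected counts versus scale from small bug-free runs, then rank features at the large scale by relative reconstruction error; the worst-reconstructed feature (for example, a particular call site) points at the buggy region of code.

# Simplified sketch of reconstruction-based localization: predict each
# per-feature count from scale using small-scale training runs, then rank
# features at large scale by relative reconstruction error.
# Feature names and all numbers are made up for illustration.
import numpy as np

train_scales = np.array([4, 8, 16, 32])
train_features = {                      # feature -> observed counts per run
    "branch_A_taken": np.array([40, 80, 160, 320]),
    "call_site_B":    np.array([10, 20, 40, 80]),
    "loop_C_iters":   np.array([400, 800, 1600, 3200]),
}

models = {name: np.polyfit(train_scales, counts, deg=1)
          for name, counts in train_features.items()}

def localize(scale, observed):
    errors = {}
    for name, (a, b) in models.items():
        expected = a * scale + b
        errors[name] = abs(observed[name] - expected) / max(expected, 1e-9)
    return sorted(errors.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Buggy 1024-process run: call_site_B fires far less often than predicted,
    # so it is ranked first as the likely bug location.
    observed = {"branch_A_taken": 10_200, "call_site_B": 600, "loop_C_iters": 102_500}
    for name, err in localize(1024, observed):
        print(f"{name:16s} reconstruction error {err:.0%}")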
Abhranta: Locating bugs that manifest at large system scales
- In HotDep ’12
"... Abstract A key challenge in developing large scale applications (both in system size and in input size) is finding bugs that are latent at the small scales of testing, only manifesting when a program is deployed at large scales. Traditional statistical techniques fail because no error-free run is a ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
A key challenge in developing large-scale applications (both in system size and in input size) is finding bugs that are latent at the small scales of testing, only manifesting when a program is deployed at large scales. Traditional statistical techniques fail because no error-free run is available at deployment scales for training purposes. Prior work used scaling models to detect anomalous behavior at large scales without being trained on correct behavior at that scale. However, that work cannot localize bugs automatically. In this paper, we extend that work with an automatic diagnosis technique, based on feature reconstruction, and validate our design through case studies with two real bugs from an MPI library and a DHT-based file sharing application.
Scalable Performance Analysis of ExaScale MPI Programs through Signature-Based Clustering Algorithms
"... Extreme-scale computing poses a number of challenges to application performance. Developers need to study appli-cation behavior by collecting detailed information with the help of tracing toolsets to determine shortcomings. But not only applications are“scalability challenged”, current tracing tools ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
Extreme-scale computing poses a number of challenges to application performance. Developers need to study application behavior by collecting detailed information with the help of tracing toolsets to determine shortcomings. But applications are not the only ones that are "scalability challenged": current tracing toolsets also fall short of exascale requirements for low background overhead, since trace collection for each execution entity is becoming infeasible. One effective solution is to cluster processes with the same behavior into groups. Instead of collecting performance information from each individual node, this information can be collected from just a set of representative nodes. This work contributes a fast, scalable, signature-based clustering algorithm that clusters processes exhibiting similar execution behavior. In contrast to prior work based on statistical clustering, our approach produces precise results with almost no loss of events or accuracy. The proposed algorithm combines low overhead at the clustering level with log(P) time complexity, and it splits the merge process to make tracing suitable for extreme-scale computing. Overall, this multi-level, precise, signature-based clustering further generalizes to a novel multi-metric clustering technique with unprecedentedly low overhead.
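The clustering idea can be sketched as follows (a simplification, not the paper's algorithm): summarize each process by a signature, such as quantized per-call-site event counts, group processes whose signatures match, and trace only one representative per group.

# Sketch of signature-based clustering of processes (not the paper's algorithm):
# each process is summarized by a signature (rounded per-call-site event counts);
# processes with equal signatures form one cluster, and only one representative
# per cluster would need to be traced in detail.
from collections import defaultdict

def signature(event_counts, bucket=10):
    """Quantize per-call-site counts so near-identical behavior maps together."""
    return tuple(sorted((site, count // bucket) for site, count in event_counts.items()))

def cluster(per_process_counts):
    clusters = defaultdict(list)
    for rank, counts in per_process_counts.items():
        clusters[signature(counts)].append(rank)
    # One representative (lowest rank) per behavioral cluster.
    return {min(ranks): sorted(ranks) for ranks in clusters.values()}

if __name__ == "__main__":
    # Made-up event counts: ranks 0-2 behave alike, rank 3 is an outlier.
    counts = {0: {"MPI_Send": 100, "MPI_Recv": 101},
              1: {"MPI_Send": 103, "MPI_Recv": 104},
              2: {"MPI_Send": 101, "MPI_Recv": 100},
              3: {"MPI_Send": 950, "MPI_Recv": 12}}
    for rep, members in cluster(counts).items():
        print(f"representative rank {rep} covers {members}")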
MRNET: A SCALABLE INFRASTRUCTURE FOR DEVELOPMENT OF PARALLEL TOOLS AND APPLICATIONS
"... ABSTRACT: MRNet is a customizable, high-throughput communication software system for parallel tools and applications. It reduces the cost of these tools ’ activities by incorporating a tree-based overlay network (TBON) of processes between the tool’s front-end and back-ends. MRNet was recently porte ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
MRNet is a customizable, high-throughput communication software system for parallel tools and applications. It reduces the cost of these tools' activities by incorporating a tree-based overlay network (TBON) of processes between the tool's front-end and back-ends. MRNet was recently ported and released for Cray XT systems. In this paper we describe the main features that make MRNet well-suited as a general facility for building scalable parallel tools. We present our experiences with MRNet and examples of its use.
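The TBON concept itself (not MRNet's actual C++ API of streams, filters, and packets) reduces to aggregating values up a tree so the front-end receives one result regardless of the number of back-ends, as in this sketch:

# Conceptual sketch of TBON-style aggregation (not the MRNet API): each internal
# tree node reduces its children's values, so the front-end sees one aggregate.
def tbon_reduce(node, leaf_values, reduce_fn, fanout=4):
    """node: (lo, hi) range of back-end indices handled by this subtree."""
    lo, hi = node
    if hi - lo <= fanout:                       # leaf-level node: local reduce
        return reduce_fn(leaf_values[lo:hi])
    step = (hi - lo + fanout - 1) // fanout
    children = [(i, min(i + step, hi)) for i in range(lo, hi, step)]
    # Each child subtree sends only its partial aggregate upward.
    partials = [tbon_reduce(c, leaf_values, reduce_fn, fanout) for c in children]
    return reduce_fn(partials)

if __name__ == "__main__":
    # 1000 hypothetical back-ends each report a sample count; the front-end
    # receives a single maximum after log-depth aggregation.
    counts = [i % 97 for i in range(1000)]
    print("max over all back-ends:", tbon_reduce((0, 1000), counts, max))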
Overcoming Extreme-Scale Reproducibility Challenges Through a Unified, Targeted, and Multilevel Toolset
"... ABSTRACT Reproducibility, the ability to repeat program executions with the same numerical result or code behavior, is crucial for computational science and engineering applications. However, non-determinism in concurrency scheduling often hampers achieving this ability on high performance computin ..."
Abstract
- Add to MetaCart
(Show Context)
Reproducibility, the ability to repeat program executions with the same numerical result or code behavior, is crucial for computational science and engineering applications. However, non-determinism in concurrency scheduling often hampers achieving this ability on high performance computing (HPC) systems. To aid in managing the adverse effects of non-determinism, prior work has provided techniques to achieve bit-precise reproducibility, but most of them focus only on small-scale parallelism. While scalable techniques have recently emerged, they are disparate and target special purposes, e.g., single-schedule domains. On current systems with O(10^6) compute cores and future ones with O(10^9), any technique that does not embrace a unified, targeted, and multilevel approach will fall short of providing reproducibility. In this paper, we argue for a common toolset that embodies this approach, where programmers select and compose complementary tools and can effectively, yet scalably, analyze, control, and eliminate sources of non-determinism at scale. This allows users to gain reproducibility only to the levels demanded by specific code development needs. We present our research agenda and ongoing work toward this goal.