Results 1 - 10 of 18
ScalaBenchGen: Auto-generation of communication benchmarks traces
In IPDPS, 2012
Cited by 7 (3 self)
Benchmarks are essential for evaluating HPC hardware and software for petascale machines and beyond. But benchmark creation is a tedious manual process. As a result, benchmarks tend to lag behind the development of complex scientific codes. Our work automates the creation of communication benchmarks. Given an MPI application, we utilize ScalaTrace, a lossless and scalable framework, to trace communication operations and execution time while abstracting away the computations. A single trace file that reflects the behavior of all nodes is subsequently expanded to C source code by a novel code generator. The resulting benchmark code is compact, portable, human-readable, and accurately reflects the original application’s communication characteristics and performance. Experimental results demonstrate that the generated benchmark source code preserves both the communication patterns and the run-time behavior of the original application. Such automatically generated benchmarks not only shorten the transition from application development to benchmark extraction but also facilitate code obfuscation, which is essential for benchmark extraction from commercial and restricted applications.
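To make the trace-to-source idea concrete, here is a minimal sketch of a generator that turns a toy list of recorded MPI events into a compilable C benchmark skeleton, substituting sleeps for the abstracted-away computation. The trace record format and the emitted code shape are assumptions for illustration, not ScalaBenchGen's actual generator or output.

```python
# Minimal sketch only: the trace format and the generated C shape are assumed
# for illustration; this is not ScalaBenchGen's actual generator.

# Toy trace: (operation, parameters, seconds of computation preceding the call).
trace = [
    ("MPI_Barrier", {}, 0.0),
    ("MPI_Send", {"bytes": 4096, "dest": "(rank + 1) % size"}, 0.0021),
    ("MPI_Recv", {"bytes": 4096, "source": "(rank - 1 + size) % size"}, 0.0005),
]

def emit_benchmark(trace):
    """Emit a self-contained C program that replays the traced communication,
    sleeping through the compute gaps to preserve run-time behavior."""
    lines = [
        "#include <mpi.h>",
        "#include <unistd.h>",
        "int main(int argc, char **argv) {",
        "  int rank, size;",
        "  MPI_Init(&argc, &argv);",
        "  MPI_Comm_rank(MPI_COMM_WORLD, &rank);",
        "  MPI_Comm_size(MPI_COMM_WORLD, &size);",
        "  char buf[1 << 16];",
        "  MPI_Status st;",
    ]
    for op, args, compute_s in trace:
        if compute_s > 0:  # model the abstracted computation as a delay
            lines.append(f"  usleep({int(compute_s * 1e6)});")
        if op == "MPI_Barrier":
            lines.append("  MPI_Barrier(MPI_COMM_WORLD);")
        elif op == "MPI_Send":
            lines.append(f"  MPI_Send(buf, {args['bytes']}, MPI_BYTE, "
                         f"{args['dest']}, 0, MPI_COMM_WORLD);")
        elif op == "MPI_Recv":
            lines.append(f"  MPI_Recv(buf, {args['bytes']}, MPI_BYTE, "
                         f"{args['source']}, 0, MPI_COMM_WORLD, &st);")
    lines += ["  MPI_Finalize();", "  return 0;", "}"]
    return "\n".join(lines)

print(emit_benchmark(trace))
```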
Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes
Cited by 6 (4 self)
Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made—a point where remediation can be difficult. However, creating analytical performance models that would allow such issues to be pinpointed earlier is so laborious that application developers attempt it at most for a few selected kernels, running the risk of missing harmful bottlenecks. In this paper, we show how both coverage and speed of this scalability analysis can be substantially improved. Generating an empirical performance model automatically for each part of a parallel program, we can easily identify those parts that will reduce performance at larger core counts. Using a climate simulation as an example, we demonstrate that scalability bugs are not confined to those routines usually chosen as kernels.
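The core loop of this kind of analysis is easy to sketch: fit each program part's measured runtime against a few candidate growth terms in the process count and flag the parts whose best-fitting term grows fast. The candidate terms, regions, and measurements below are made up for illustration; they are not the paper's actual model search.

```python
# Minimal sketch of per-region empirical scaling models: fit runtime against
# candidate terms in the process count p and report the best-fitting term.
# Regions, measurements, and candidate terms are illustrative only.
import math

candidates = {
    "constant": lambda p: 1.0,
    "log p":    lambda p: math.log2(p),
    "p":        lambda p: float(p),
    "p log p":  lambda p: p * math.log2(p),
    "p^2":      lambda p: float(p) ** 2,
}

def fit(term, ps, ts):
    """Least-squares fit t ~ a + b*term(p); return (a, b, residual)."""
    xs = [term(p) for p in ps]
    n = len(ps)
    mx, mt = sum(xs) / n, sum(ts) / n
    sxx = sum((x - mx) ** 2 for x in xs) or 1e-12
    b = sum((x - mx) * (t - mt) for x, t in zip(xs, ts)) / sxx
    a = mt - b * mx
    res = sum((a + b * x - t) ** 2 for x, t in zip(xs, ts))
    return a, b, res

# Runtimes (seconds) of two regions measured at small core counts.
measurements = {
    "solver":   {32: 1.00, 64: 1.05, 128: 1.11, 256: 1.18},   # ~log p: benign
    "exchange": {32: 0.10, 64: 0.21, 128: 0.44, 256: 0.90},   # ~p: scalability bug
}

for region, data in measurements.items():
    ps, ts = zip(*sorted(data.items()))
    name, (a, b, _) = min(((n, fit(f, ps, ts)) for n, f in candidates.items()),
                          key=lambda kv: kv[1][2])
    print(f"{region}: best term '{name}', predicted at p=16384: "
          f"{a + b * candidates[name](16384):.2f}s")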
Automatic Generation of Executable Communication Specifications from Parallel Applications
Cited by 5 (5 self)
Portable parallel benchmarks are widely used and highly effective for (a) the evaluation, analysis and procurement of high-performance computing (HPC) systems and (b) quantifying the potential benefits of porting applications for new hardware platforms. Yet, past techniques to synthetically parameterize hand-coded HPC benchmarks prove insufficient for today’s rapidly evolving scientific codes, particularly when subject to multi-scale science modeling or when utilizing domain-specific libraries. To address these problems, this work contributes novel methods to automatically generate highly portable and customizable communication benchmarks from HPC applications. We utilize ScalaTrace, a lossless, yet scalable, parallel application tracing framework, to collect selected aspects of the run-time behavior of HPC applications, including communication operations and execution time, while abstracting …
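As a rough illustration of what an "executable communication specification" could look like, the sketch below replays a compact, rank-parameterized event list with real MPI calls via mpi4py. The specification format and the events are assumptions chosen for this example, not the format the paper generates.

```python
# Minimal sketch only: the spec format is an assumption, not the paper's.
# Run with, e.g.: mpirun -np 4 python replay_spec.py
import numpy as np
from mpi4py import MPI

# One compact, rank-parameterized event list covers all processes.
spec = [
    {"op": "barrier"},
    {"op": "sendrecv", "bytes": 4096,
     "dest": lambda r, s: (r + 1) % s, "src": lambda r, s: (r - 1) % s},
    {"op": "allreduce"},
]

def replay(spec):
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    sbuf = np.zeros(1 << 16, dtype="b")          # scratch send buffer
    rbuf = np.zeros(1 << 16, dtype="b")          # scratch receive buffer
    for ev in spec:
        if ev["op"] == "barrier":
            comm.Barrier()
        elif ev["op"] == "sendrecv":
            n = ev["bytes"]
            comm.Sendrecv(sbuf[:n], dest=ev["dest"](rank, size),
                          recvbuf=rbuf[:n], source=ev["src"](rank, size))
        elif ev["op"] == "allreduce":
            out = np.empty(1, dtype="d")
            comm.Allreduce(np.ones(1, dtype="d"), out, op=MPI.SUM)

if __name__ == "__main__":
    replay(spec)
```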
Vrisha: using scaling properties of parallel programs for bug detection and localization
In Proceedings of the 20th International Symposium on High Performance Distributed Computing, 2011
Cited by 4 (4 self)
Detecting and isolating bugs that arise in parallel programs is a tedious and challenging task. An especially subtle class of bugs are those that are scale-dependent: while small-scale test cases may not exhibit the bug, the bug arises in large-scale production runs, and can change the result or performance of an application. A popular approach to finding bugs is statistical bug detection, where abnormal behavior is detected through comparison with bug-free behavior. Unfortunately, for scale-dependent bugs, there may not be bug-free runs at large scales, and therefore traditional statistical techniques are not viable. In this paper, we propose Vrisha, a statistical approach to detecting and localizing scale-dependent bugs. Vrisha detects bugs in large-scale programs by building models of behavior based on bug-free behavior at small scales. These models are constructed using kernel canonical correlation analysis (KCCA) and exploit scale-determined properties, whose values are predictably dependent on application scale. We use Vrisha to detect and diagnose two bugs caused by errors in popular MPI libraries and show that our techniques can be implemented with low overhead and low false-positive rates.
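A stripped-down version of the detection idea can be sketched with plain canonical correlation analysis (scikit-learn's CCA) standing in for the kernel CCA the paper uses: correlate scale-determined properties with observed behavior on bug-free small-scale runs, then flag large-scale runs whose behavior falls off that correlation. The features, runs, and scoring rule below are fabricated for illustration.

```python
# Minimal sketch of the Vrisha modeling idea, with plain CCA in place of KCCA.
# All data, features, and the discrepancy rule are illustrative assumptions.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

def observe(nprocs, buggy=False):
    """Fake behavioral features of one run: [messages sent, bytes sent]."""
    msgs = 4 * nprocs + rng.normal(0, 2)
    byts = 1024 * nprocs + rng.normal(0, 200)
    if buggy:                       # scale-dependent bug: traffic blows up
        byts *= 4
    return [msgs, byts]

# Train only on bug-free runs at small scales.
scales = np.array([[8], [16], [32], [64], [128]], dtype=float)
behavior = np.array([observe(int(p)) for (p,) in scales])

cca = CCA(n_components=1).fit(scales, behavior)
u, v = cca.transform(scales, behavior)
a = float((u[:, 0] @ v[:, 0]) / (u[:, 0] @ u[:, 0]))   # align score scales

def discrepancy(nprocs, feats):
    """How far observed behavior falls from the scale-predicted behavior."""
    un, vn = cca.transform([[nprocs]], [feats])
    return abs(float(vn[0, 0] - a * un[0, 0]))

baseline = max(discrepancy(int(p), f) for (p,), f in zip(scales, behavior))
for buggy in (False, True):
    d = discrepancy(1024, observe(1024, buggy))
    print(f"p=1024 buggy={buggy}: discrepancy={d:.2f} "
          f"(training max {baseline:.2f})")
# The buggy run's inflated traffic should yield a discrepancy far above both
# the training baseline and the bug-free large-scale run.
```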
Elastic and Scalable Tracing and Accurate Replay of Non-Deterministic Events
Cited by 4 (3 self)
SCALATRACE represents the state of the art in parallel application tracing for high performance computing (HPC). This paper presents SCALATRACE II, a next-generation tracer that delivers even higher trace compression capability, even when events are not always regular. In this work, we contribute a spectrum of novel compression and replay techniques that are fundamentally different from our past approaches. SCALATRACE II features a redesigned low-level encoding scheme of trace data such that data elements are elastic and self-explanatory. With this new encoding scheme, trace compression is enhanced by introducing innovative intra-node and inter-node trace compression algorithms that guarantee high compression rates in a loop-structure-agnostic fashion. In practice, the improved compression scheme is particularly efficient for scientific codes that demonstrate inconsistent behavior across time steps and nodes. A novel approach is further contributed to probabilistically replay sequences of non-deterministic events. To assess the compression efficacy of SCALATRACE II, we conduct experiments not only with computational kernels but also with a real-world application, the Parallel Ocean Program (POP). Compared to the first-generation SCALATRACE, we observe key improvements in trace compression for benchmarks with inconsistent time-step behavior and diverging task-level behavior while retaining timing accuracy even under probabilistic replay.
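The intra-node side of such compression can be sketched with a generic greedy scheme that collapses immediately repeated event subsequences into counted loops without any knowledge of the source code's loop structure. This is only a toy stand-in for the elastic encoding described above, and the event stream is made up.

```python
# Minimal sketch of loop-structure-agnostic intra-node trace compression:
# greedily collapse immediately repeated subsequences into (body, count) loops.
# Not ScalaTrace II's actual elastic encoding.

def compress(events, max_window=8):
    out = []
    i = 0
    n = len(events)
    while i < n:
        best = None
        for w in range(1, max_window + 1):          # candidate body length
            body = events[i:i + w]
            reps = 1
            while events[i + reps * w:i + (reps + 1) * w] == body:
                reps += 1
            if reps > 1 and (best is None or reps * w > best[0] * best[1]):
                best = (w, reps)
        if best:
            w, reps = best
            out.append(("loop", reps, compress(events[i:i + w], max_window)))
            i += w * reps
        else:
            out.append(("event", events[i]))
            i += 1
    return out

# A rank's event stream with an irregular tail that a fixed loop-nest
# assumption would not capture.
stream = ["Isend", "Irecv", "Waitall"] * 5 + ["Allreduce", "Isend", "Allreduce"]
for item in compress(stream):
    print(item)
```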
Inferring Large-scale Computation Behavior via Trace . . .
Cited by 2 (0 self)
Understanding large-scale application behavior is critical for effectively utilizing existing HPC resources and making design decisions for upcoming systems. In this work we present a methodology for characterizing an MPI application’s large-scale computation behavior and system requirements using information about the behavior of that application at a series of smaller core counts. The methodology finds the best statistical fit from among a set of canonical functions in terms of how a set of features that are important for both performance and energy (cache hit rates, floating point intensity, ILP, etc.) change across a series of small core counts. The statistical models for each of these application features can then be utilized to generate an extrapolated trace of the application at scale. The fidelity of the fully extrapolated traces is evaluated by building performance models from both the extrapolated trace and an actual trace and comparing the application performance predicted using each. For two full-scale HPC applications, SPECFEM3D and UH3D, the extrapolated traces had absolute relative errors of less than 5%.
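A minimal sketch of the fitting step: for each feature measured at small core counts, try a few canonical shapes, keep the one with the highest R², and evaluate it at the target scale. The shapes and feature values below are illustrative, not the paper's actual canonical set.

```python
# Minimal sketch of feature extrapolation via canonical-function fitting.
# Shapes and measurements are illustrative assumptions.
import numpy as np

shapes = {
    "a + b*log2(p)": lambda p: np.column_stack([np.ones_like(p), np.log2(p)]),
    "a + b*p":       lambda p: np.column_stack([np.ones_like(p), p]),
    "a + b/p":       lambda p: np.column_stack([np.ones_like(p), 1.0 / p]),
}

def best_fit(p, y):
    scored = []
    for name, design in shapes.items():
        X = design(p)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        pred = X @ coef
        r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
        scored.append((r2, name, coef))
    return max(scored)              # highest R^2 wins

p = np.array([64.0, 128, 256, 512, 1024])
features = {                        # fake measurements at small core counts
    "L2 hit rate":   np.array([0.92, 0.90, 0.89, 0.885, 0.882]),
    "FP ops / inst": np.array([0.41, 0.40, 0.41, 0.40, 0.41]),
}

target = 65536.0
for name, y in features.items():
    r2, shape, coef = best_fit(p, y)
    X = shapes[shape](np.array([target]))
    print(f"{name}: {shape} (R^2={r2:.3f}) -> at p={int(target)}: "
          f"{(X @ coef)[0]:.3f}")
```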
PEMOGEN: Automatic Adaptive Performance Modeling during Program Runtime
Cited by 1 (1 self)
Traditional means of gathering performance data are tracing, which is limited by the available storage, and profiling, which has limited accuracy. Performance modeling is often used to interpret the tracing data and generate performance predictions. We aim to complement the traditional data collection mechanisms with online performance modeling, a method that generates performance models while the application is running. This allows us to greatly reduce the storage overhead while still producing accurate predictions. We present PEMOGEN, our compilation and modeling framework that automatically instruments applications to generate performance models during program execution. We demonstrate the ability of PEMOGEN to both reduce storage cost and improve the prediction accuracy compared to traditional techniques such as least squares fitting. With our tool, we automatically detect 3,370 kernels from fifteen NAS and Mantevo applications and model their execution time with a median coefficient of variation (R̄²) of 0.81. These automatically generated performance models can be used to quickly assess the scaling and potential bottlenecks with regards to any input parameter and the number of processes of a parallel application.
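The storage argument for online modeling can be sketched with an incremental least-squares fit that keeps only constant-size running sums (the normal equations) per kernel instead of the full measurement trace. The basis terms and timings below are illustrative; this is not PEMOGEN's actual adaptive fitting scheme.

```python
# Minimal sketch of online performance modeling: constant-size state per
# kernel, updated at every observation, solved on demand for predictions.
# Basis terms and "measurements" are illustrative assumptions.
import math
import numpy as np

class OnlineModel:
    """Incremental least squares for t ~ c0 + c1*n + c2*n*log2(n)."""
    def __init__(self):
        self.k = 3
        self.XtX = np.zeros((self.k, self.k))   # running normal equations:
        self.Xty = np.zeros(self.k)             # no trace is stored

    def _basis(self, n):
        return np.array([1.0, n, n * math.log2(max(n, 2))])

    def observe(self, n, t):                    # called at each kernel exit
        x = self._basis(n)
        self.XtX += np.outer(x, x)
        self.Xty += x * t

    def predict(self, n):
        coef = np.linalg.lstsq(self.XtX, self.Xty, rcond=None)[0]
        return float(self._basis(n) @ coef)

# Simulate a kernel whose time grows like n*log(n), observed during one run.
model = OnlineModel()
for n in [1_000, 2_000, 5_000, 10_000, 20_000, 50_000]:
    t = 2e-7 * n * math.log2(n) + 1e-3          # "measured" kernel time
    model.observe(n, t)

print(f"predicted time at n=1e7: {model.predict(10_000_000):.2f} s")
```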
Scalable Performance Analysis of ExaScale MPI Programs through Signature-Based Clustering Algorithms
Cited by 1 (1 self)
Extreme-scale computing poses a number of challenges to application performance. Developers need to study application behavior by collecting detailed information with the help of tracing toolsets to determine shortcomings. But applications are not the only ones that are “scalability challenged”: current tracing toolsets also fall short of exascale requirements for low background overhead, since trace collection for each execution entity is becoming infeasible. One effective solution is to cluster processes with the same behavior into groups. Instead of collecting performance information from each individual node, this information can be collected from just a set of representative nodes. This work contributes a fast, scalable, signature-based clustering algorithm that clusters processes exhibiting similar execution behavior. In contrast to prior work based on statistical clustering, our approach produces precise results with almost no loss of events or accuracy. The proposed algorithm combines low overhead at the clustering level with log(P) time complexity, and it splits the merge process to make tracing suitable for extreme-scale computing. Overall, this multi-level precise clustering based on signatures further generalizes to a novel multi-metric clustering technique with unprecedentedly low overhead.
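A minimal sketch of signature-based clustering: reduce each rank's event stream to a small signature (here just counts of MPI operation types) and let ranks with identical signatures share one representative, so only representatives need full tracing. The signature definition and event streams are made up; the paper's algorithm additionally merges clusters hierarchically in log(P) steps.

```python
# Minimal sketch of signature-based process clustering; the signature and the
# per-rank event streams are illustrative assumptions.
from collections import Counter

def signature(events):
    """Order-insensitive summary of one rank's behavior."""
    return tuple(sorted(Counter(events).items()))

# Fake per-rank event streams: ranks 0..6 do halo exchange, rank 7 does I/O.
ranks = {r: ["Isend", "Irecv", "Waitall"] * 10 + ["Allreduce"] for r in range(7)}
ranks[7] = ["File_write_all"] * 10 + ["Allreduce"]

clusters = {}                       # signature -> list of ranks
for rank, events in ranks.items():
    clusters.setdefault(signature(events), []).append(rank)

for sig, members in clusters.items():
    print(f"representative rank {members[0]} stands for {len(members)} rank(s)")
```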
Scalable communication . . .
Performance analysis and prediction for parallel applications is important for the design and development of scientific applications, and for the construction and procurement of high-performance computing (HPC) systems. As one of the most important approaches, application tracing is widely used for this purpose because it can provide the computation and communication details of an application. Recent progress in communication tracing has tremendously improved the scalability of tracing tools and reduced the size of the trace file, and thereby opened up novel opportunities for trace-based performance analysis for parallel applications. This work focuses on domain-specific trace compression methodology and puts forth fundamentally new approaches to improve communication tracing techniques. Facilitated by the advances in this area, novel algorithms are further designed to address the hard problem of performance analysis, prediction, and benchmarking at scale. Specifically, this work makes the following contributions: 1. This work contributes ScalaExtrap, a fundamentally novel performance modeling scheme and tool. With ScalaExtrap, we synthetically generate the application trace for large …
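As a rough illustration of trace extrapolation, the sketch below takes one compressed communication event observed at several small process counts, fits how each parameter varies with p (constant, or roughly c/p under strong scaling), and synthesizes the event for a larger run. Real trace extrapolation such as ScalaExtrap also reconstructs the communication topology; the per-field fits and numbers here are illustrative assumptions only.

```python
# Minimal sketch of extrapolating a compressed communication event to a
# larger process count; the fits and numbers are illustrative assumptions.

# One compressed event (nearest-neighbor exchange) observed at small scales.
observed = {
    16: {"dest_offset": 1, "bytes": 65536, "loop_count": 100},
    32: {"dest_offset": 1, "bytes": 32768, "loop_count": 100},
    64: {"dest_offset": 1, "bytes": 16384, "loop_count": 100},
}

def extrapolate(observed, p_target):
    """Synthesize the event's parameters for a larger process count."""
    ps = sorted(observed)
    synth = {}
    for field in observed[ps[0]]:
        ys = [observed[p][field] for p in ps]
        if len(set(ys)) == 1:            # scale-invariant parameter: keep it
            synth[field] = ys[0]
        else:                            # assume ~ c/p (strong-scaling split)
            c = sum(y * p for y, p in zip(ys, ps)) / len(ps)
            synth[field] = round(c / p_target)
    return synth

print("synthesized event at p=4096:", extrapolate(observed, 4096))
```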