Results 1 - 10 of 48
Parallel I/O Performance: From Events to Ensembles
"... Abstract—Parallel I/O is fast becoming a bottleneck to the research agendas of many users of extreme scale parallel computers. The principle cause of this is the concurrency explosion of high-end computation, coupled with the complexity of providing parallel file systems that perform reliably at suc ..."
Abstract
-
Cited by 20 (2 self)
- Add to MetaCart
(Show Context)
Parallel I/O is fast becoming a bottleneck to the research agendas of many users of extreme-scale parallel computers. The principal cause of this is the concurrency explosion of high-end computation, coupled with the complexity of providing parallel file systems that perform reliably at such scales. More than just being a bottleneck, parallel I/O performance at scale is notoriously variable, being influenced by numerous factors inside and outside the application, which makes it extremely difficult to isolate cause and effect for performance events. In this paper, we propose a statistical approach to understanding I/O performance that moves from the analysis of performance events to the exploration of performance ensembles. Using this methodology, we examine two I/O-intensive scientific computations from cosmology and climate science, and demonstrate that our approach can identify application and middleware performance deficiencies, resulting in more than 4× runtime improvement for both examined applications.
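The abstract gives no code, but the shift it describes, from inspecting individual I/O events to comparing statistical ensembles of them, can be illustrated with a short sketch. Everything here is a hypothetical stand-in (synthetic lognormal durations, a 1.5x tail threshold), not the paper's actual methodology:

    # Sketch: ensemble statistics over per-operation I/O durations,
    # rather than inspection of individual events. Synthetic data only.
    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-in for measured write() durations (seconds) from
    # 4 runs x 1024 processes; real data would come from I/O traces.
    runs = [rng.lognormal(mean=-4, sigma=0.5, size=1024) for _ in range(4)]

    ensemble = np.concatenate(runs)
    p50, p95, p99 = np.percentile(ensemble, [50, 95, 99])
    print(f"median={p50:.4f}s  p95={p95:.4f}s  p99={p99:.4f}s")

    # Flag runs whose upper tail departs from the pooled ensemble: a
    # hint of interference or a middleware deficiency rather than
    # inherent application behavior.
    for i, r in enumerate(runs):
        if np.percentile(r, 95) > 1.5 * p95:
            print(f"run {i}: heavy tail relative to ensemble")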
Scalable I/O Tracing and Analysis
"... As supercomputer performance approached and then surpassed the petaflop level, I/O performance has become a major performance bottleneck for many scientific applications. Several tools exist to collect I/O traces to assist in the analysis of I/O performance problems. However, these tools either prod ..."
Abstract
-
Cited by 12 (2 self)
- Add to MetaCart
(Show Context)
As supercomputer performance has approached and then surpassed the petaflop level, I/O performance has become a major bottleneck for many scientific applications. Several tools exist to collect I/O traces to assist in the analysis of I/O performance problems. However, these tools either produce extremely large trace files that complicate performance analysis, or sacrifice accuracy to collect high-level statistical information. We propose a multi-level trace generator tool, ScalaIOTrace, that collects traces at several levels in the HPC I/O stack. ScalaIOTrace features aggressive trace compression that generates trace files of near-constant size for regular I/O patterns and orders of magnitude smaller than flat traces for less regular ones. This enables the collection of I/O and communication traces of applications running on thousands of processors. Our contributions also include automated trace analysis, which collects selected statistics of I/O calls by parsing the compressed trace on the fly, and time-accurate replay of communication events with MPI-IO calls. We evaluated our approach with the Parallel Ocean Program (POP) climate simulation and the FLASH parallel I/O benchmark. POP uses NetCDF as an I/O library, while FLASH I/O uses the parallel HDF5 I/O library, which internally maps onto MPI-IO. We collected MPI-IO and low-level POSIX I/O traces to study application I/O behavior. Our results show constant-size trace files of only 145 KB irrespective of the number of nodes for the FLASH I/O benchmark, which exhibits regular I/O and communication patterns. For POP, we observe up to two orders of magnitude reduction in trace file size compared to flat traces. The statistics gathered reveal insight into the number of I/O and communication calls issued in POP and FLASH I/O. Such concise traces are unprecedented for isolated I/O and combined I/O plus communication tracing.
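ScalaIOTrace's compression is built on recognizing repetition in the event stream; the paper includes no code, but the core intuition, that a regular SPMD I/O pattern compresses to a size independent of iteration count, fits in a toy sketch. The event names and the period-finding scheme below are illustrative assumptions, not the tool's actual algorithm:

    # Toy illustration: a perfectly regular event stream stored as
    # (pattern, repeat count) has near-constant size regardless of how
    # many iterations the application executes.
    def compress_periodic(events):
        """Find the shortest repeating prefix; store it once with a count."""
        n = len(events)
        for p in range(1, n + 1):
            if n % p == 0 and events[:p] * (n // p) == events:
                return events[:p], n // p
        return events, 1

    # Hypothetical per-iteration pattern from an SPMD code:
    pattern = [("MPI_File_write_at", 4096), ("MPI_Barrier", 0)]
    trace = pattern * 10_000                 # 20,000 raw events

    body, count = compress_periodic(trace)
    print(f"{len(trace)} events -> {len(body)} pattern entries x {count}")
    # 10,000 iterations cost the same storage as 10: compressed size
    # depends on the pattern, not on the iteration count.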
Probabilistic Communication and I/O Tracing with Deterministic Replay at Scale
In ICPP, 2011
"... With today’s petascale supercomputers, applications often exhibit low efficiency, such as poor communication and I/O performance, that can be diagnosed by analysis tools. However, these tools either produce extremely large trace files that complicate performance analysis, or sacrifice accuracy to co ..."
Abstract
-
Cited by 7 (5 self)
- Add to MetaCart
(Show Context)
With today’s petascale supercomputers, applications often exhibit low efficiency, such as poor communication and I/O performance, that can be diagnosed by analysis tools. However, these tools either produce extremely large trace files that complicate performance analysis, or sacrifice accuracy to collect high-level statistical information using crude averaging. This work contributes Scala-H-Trace, which features more aggressive trace compression than any previous approach, particularly for applications that do not show strict regularity in SPMD behavior. Scala-H-Trace uses histograms expressing the probabilistic distribution of arbitrary communication and I/O parameters to capture variations. Yet, where other tools fail to scale, Scala-H-Trace guarantees trace files of near-constant size, even for variable communication and I/O patterns, producing trace files orders of magnitude smaller than prior approaches. We demonstrate the ability to collect traces of applications running on thousands of processors, with the potential to scale well beyond this level. We further present the first approach to deterministically replay such probabilistic traces (a) without deadlocks and (b) in a manner closely resembling the original applications. Our results show either near-constant trace sizes or only sublinear increases in trace file size irrespective of the number of nodes utilized. Even with the aggressively compressed histogram-based traces, our replay times are within 12% to 15% of the runtime of the original codes. Such concise traces that closely resemble the behavior of production-style codes, and our approach of deterministic replay of probabilistic traces, are unprecedented.
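The histogram idea is easy to make concrete. The sketch below is an assumed simplification, not Scala-H-Trace code: a fixed-size histogram stands in for tens of thousands of recorded message sizes, and a seeded generator makes every replay draw the same value sequence, which is one simple way to keep a probabilistic trace deterministic:

    # Capture side: a 16-bin histogram replaces 50,000 raw parameter
    # values (synthetic message sizes here).
    import numpy as np

    observed = np.random.default_rng(1).integers(1_000, 9_000, size=50_000)
    counts, edges = np.histogram(observed, bins=16)

    # Replay side: a fixed seed makes every replay produce the same
    # draws, so replays are repeatable despite the probabilistic trace.
    replay_rng = np.random.default_rng(42)
    bins = replay_rng.choice(len(counts), size=10, p=counts / counts.sum())
    sizes = edges[bins] + replay_rng.random(10) * (edges[bins + 1] - edges[bins])
    print(sizes.astype(int))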
Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes
"... Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made—a point where remediation can be difficult. However, creati ..."
Abstract
-
Cited by 6 (4 self)
- Add to MetaCart
(Show Context)
Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made—a point where remediation can be difficult. However, creating analytical performance models that would allow such issues to be pinpointed earlier is so laborious that application developers attempt it at most for a few selected kernels, running the risk of missing harmful bottlenecks. In this paper, we show how both coverage and speed of this scalability analysis can be substantially improved. Generating an empirical performance model automatically for each part of a parallel program, we can easily identify those parts that will reduce performance at larger core counts. Using a climate simulation as an example, we demonstrate that scalability bugs are not confined to those routines usually chosen as kernels.
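The paper's tool searches a whole space of candidate model terms; as a minimal stand-in, the sketch below fits a single assumed form t(p) = a + b*p^c to per-region measurements and extrapolates it, which is enough to show how a region with benign small-scale behavior can dominate at large core counts. Region names and timings are invented:

    # Fit one candidate scaling model per code region and extrapolate.
    # (Assumption: a single model form; the paper generates models
    # automatically from a richer search space.)
    import numpy as np
    from scipy.optimize import curve_fit

    def model(p, a, b, c):
        return a + b * np.power(p, c)

    procs = np.array([64, 128, 256, 512, 1024], dtype=float)
    regions = {                               # hypothetical runtimes (s)
        "solver":   np.array([1.9, 2.0, 2.1, 2.2, 2.3]),
        "exchange": np.array([0.1, 0.2, 0.41, 0.83, 1.7]),
    }

    for name, t in regions.items():
        (a, b, c), _ = curve_fit(model, procs, t,
                                 p0=[1.0, 0.01, 1.0], maxfev=10_000)
        t_pred = model(16_384.0, a, b, c)     # a scale never measured
        print(f"{name}: exponent c={c:.2f}, predicted t(16384)={t_pred:.1f}s")
    # "exchange" is harmless at 1,024 cores yet dominates at 16,384:
    # exactly the kind of latent scalability bug the approach exposes.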
Scalable Identification of Load Imbalance in Parallel Executions Using Call Path Profiles
In SC ’10: Proc. of the 2010 ACM/IEEE Conference on Supercomputing, 2010
"... Abstract—Applications must scale well to make efficient use of today’s class of petascale computers, which contain hundreds of thousands of processor cores. Inefficiencies that do not even appear in modest-scale executions can become major bottlenecks in large-scale executions. Because scaling probl ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
(Show Context)
Applications must scale well to make efficient use of today’s class of petascale computers, which contain hundreds of thousands of processor cores. Inefficiencies that do not even appear in modest-scale executions can become major bottlenecks in large-scale executions. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of scaling problems. Load imbalance is one of the most common scaling problems. To provide actionable insight into load imbalance, we present post-mortem parallel analysis techniques for pinpointing and quantifying load imbalance in the context of call path profiles of parallel programs. We show how to identify load imbalance in its static and dynamic context by using only low-overhead asynchronous call path profiling to locate regions of code responsible for communication wait time in SPMD executions. We describe the implementation of these techniques within HPCTOOLKIT.
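A simple form of the imbalance metric can be sketched in a few lines. The (max - mean) / max formula and the invented call paths below are assumptions for illustration; HPCTOOLKIT's actual analysis attributes imbalance within full calling contexts:

    # Post-mortem imbalance per call path: the excess of the maximum
    # cost over the mean bounds the time recoverable by perfect balance.
    import numpy as np

    profile = {   # hypothetical exclusive costs (s) across 8 processes
        "main>solve>halo_exchange": np.array([3.0, 3.1, 2.9, 3.0, 3.2, 3.1, 3.0, 2.9]),
        "main>solve>compute_flux":  np.array([5.0, 5.1, 9.8, 5.0, 5.2, 5.1, 5.0, 4.9]),
    }

    for path, cost in profile.items():
        imbalance = (cost.max() - cost.mean()) / cost.max()
        print(f"{path}: imbalance={imbalance:.0%}")
    # One overloaded rank in compute_flux; in an SPMD code this surfaces
    # as communication wait time at the next synchronization point.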
Collective Mind: Towards Practical and Collaborative Auto-tuning
2014
"... Empirical auto-tuning and machine learning techniques have been showing high potential to improve execution time, power consumption, code size, reliability and other important metrics of various applications for more than two decades. However, they are still far from widespread production use due t ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
Empirical auto-tuning and machine learning techniques have shown high potential to improve execution time, power consumption, code size, reliability, and other important metrics of various applications for more than two decades. However, they are still far from widespread production use due to the lack of native support for auto-tuning in an ever-changing and complex software and hardware stack, large and multi-dimensional optimization spaces, excessively long exploration times, and the lack of unified mechanisms for preserving and sharing optimization knowledge and research material. We present a possible collaborative approach to solving the above problems using the Collective Mind knowledge management system. In contrast with the previous cTuning framework, this modular infrastructure allows users to preserve and share through …
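Collective Mind defines its own repository format and API, which the toy loop below does not reproduce; it only illustrates the underlying discipline the abstract argues for, namely empirical search plus preservation of every measurement so results can be shared and reused:

    # Toy auto-tuning loop that preserves all measurements in a shared
    # JSON file (illustrative only; not the Collective Mind API).
    import json, time, pathlib

    def kernel(n, block):
        """Stand-in for a tunable computation: blocked summation."""
        total = 0
        for start in range(0, n, block):
            total += sum(range(start, min(start + block, n)))
        return total

    repo = pathlib.Path("tuning_results.json")
    records = json.loads(repo.read_text()) if repo.exists() else []

    for block in (256, 1024, 4096, 16384):
        t0 = time.perf_counter()
        kernel(1_000_000, block)
        records.append({"block": block, "sec": time.perf_counter() - t0})

    repo.write_text(json.dumps(records, indent=2))  # knowledge preserved
    print(min(records, key=lambda r: r["sec"]))     # best configuration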
Visualizing Large-scale Parallel Communication Traces Using a Particle Animation Technique
"... Large-scale scientific simulations require execution on parallel computing systems in order to yield useful results in a reasonable time frame. But parallel execution adds communication overhead. The impact that this overhead has on performance may be difficult to gauge, as parallel application beha ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Large-scale scientific simulations require execution on parallel computing systems in order to yield useful results in a reasonable time frame. But parallel execution adds communication overhead. The impact that this overhead has on performance may be difficult to gauge, as parallel application behaviors are typically harder to understand than their sequential counterparts. We introduce an animation-based interactive visualization technique for the analysis of communication patterns occurring in parallel application execution. Our method has the advantages of illustrating the dynamic communication patterns in the system as well as providing a static image of MPI (Message Passing Interface) utilization history. We also devise a data streaming mechanism that allows for the exploration of very large data sets. We demonstrate the effectiveness of our approach scaling up to 16 thousand processes using a series of trace data sets of ScaLAPACK matrix operation functions.
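The mechanics of the animation can be guessed at from the abstract: each message becomes a particle travelling from sender to receiver between its send and receive timestamps, so a frame at time t shows all in-flight communication at once. The record layout and interpolation below are assumptions, not the paper's implementation:

    # Hypothetical trace records: (src_rank, dst_rank, t_send, t_recv)
    messages = [(0, 3, 0.0, 1.0), (2, 1, 0.2, 0.6), (3, 0, 0.5, 1.5)]

    def frame(t):
        """Particle positions (in rank coordinates) in flight at time t."""
        out = []
        for src, dst, ts, tr in messages:
            if ts <= t <= tr:
                alpha = (t - ts) / (tr - ts)           # progress 0..1
                out.append(src + alpha * (dst - src))  # linear interpolation
        return out

    for t in (0.25, 0.55, 1.2):
        print(t, [f"{pos:.2f}" for pos in frame(t)])
    # Streaming variant: read records lazily and keep only those whose
    # time interval overlaps the current frame, bounding memory use.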
Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC
In Proceedings of the Workshop on Productivity and Performance (PROPER), 2010
"... Abstract. Applicationperformance iscritical inhigh-performance computing (HPC), however, it is not considered in a systematic way in the HPC software development process. Integrated performance models could improve this situation. Advanced analytic performance modeling and performance analysis tools ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
(Show Context)
Application performance is critical in high-performance computing (HPC); however, it is not considered in a systematic way in the HPC software development process. Integrated performance models could improve this situation. Advanced analytic performance modeling and performance analysis tools exist in isolation but have similar goals and could benefit mutually. We find that existing analysis tools could be extended to support analytic performance modeling, and that performance models could be used to improve the understanding of real application performance artifacts. We show a simple example of how a tool could support developers of analytic performance models. Finally, we propose to implement a strategy for integrated tool-supported performance modeling during the whole software development process.
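One way a tool could confront an analytic model with measurements, as the abstract proposes, is to flag regions where the two diverge. The model form (latency plus bandwidth term) and all numbers below are hypothetical:

    # Compare measured times against an analytic prediction and flag
    # mismatches worth investigating.
    model = lambda n: 2e-6 + n / 1.0e9      # latency + n bytes at 1 GB/s

    measured = {1_024: 3.1e-6, 65_536: 6.9e-5, 1_048_576: 2.4e-3}

    for n, t in measured.items():
        ratio = t / model(n)
        verdict = "OK" if ratio < 1.5 else "model/measurement mismatch"
        print(f"n={n:>8}: predicted={model(n):.2e}s measured={t:.2e}s  {verdict}")
    # A mismatch points either at a real performance artifact or at a
    # term the analytic model still fails to capture; both are useful.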
Performance Analysis of Sweep3D on Blue Gene/P with the Scalasca Toolset
In Proc. 24th Int’l Parallel & Distributed Processing Symposium, Workshop on Large-Scale Parallel Processing (IPDPS–LSPP), 2010
"... Abstract—In studying the scalability of the Scalasca performance analysis toolset to several hundred thousand MPI processes on IBM Blue Gene/P, we investigated a progressive execution performance deterioration of the well-known ASCI Sweep3D compact application. Scalasca runtime summarization analysi ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
(Show Context)
In studying the scalability of the Scalasca performance analysis toolset to several hundred thousand MPI processes on IBM Blue Gene/P, we investigated a progressive execution performance deterioration of the well-known ASCI Sweep3D compact application. Scalasca runtime summarization analysis quantified MPI communication time that correlated with computational imbalance, and automated trace analysis confirmed growing amounts of MPI waiting time. Further instrumentation, measurement, and analysis pinpointed a conditional section of highly imbalanced computation which amplified waiting times inherent in the associated wavefront communication and seriously degraded overall execution efficiency at very large scales. By employing effective data collation, management, and graphical presentation, Scalasca was thereby able to demonstrate performance measurements and analyses with 294,912 processes for the first time. Keywords: parallel performance measurement & analysis; MPI; scalability of applications & tools.
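The amplification effect the study describes can be reproduced with a toy wavefront model (my illustration, not Scalasca output): because rank i cannot start a sweep step before rank i-1 has finished it, a single imbalanced computation delays every rank behind it:

    # 1-D wavefront pipeline with unit-cost steps and one slow cell.
    ranks, steps = 8, 4
    compute = [[1.0] * steps for _ in range(ranks)]
    compute[3][1] = 5.0                     # one imbalanced computation

    finish = [[0.0] * steps for _ in range(ranks)]
    for s in range(steps):
        for i in range(ranks):
            upstream = finish[i - 1][s] if i > 0 else 0.0
            previous = finish[i][s - 1] if s > 0 else 0.0
            finish[i][s] = max(upstream, previous) + compute[i][s]

    ideal = steps + ranks - 1               # balanced pipeline length: 11
    print(f"ideal={ideal}, actual={finish[-1][-1]}")
    # The 4.0 extra units all appear downstream as MPI waiting time,
    # even though only a single cell did extra work.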
Further Improving the Scalability of the Scalasca Toolset
"... Abstract. Scalasca is an open-source toolset that can be used to analyze the performance behavior of parallel applications and to identify opportunities for optimization. Target applications include simulation codes from science and engineering based on the parallel programming interfaces MPI and/or ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
Scalasca is an open-source toolset that can be used to analyze the performance behavior of parallel applications and to identify opportunities for optimization. Target applications include simulation codes from science and engineering based on the parallel programming interfaces MPI and/or OpenMP. Scalasca, which has been specifically designed for use on large-scale machines such as IBM Blue Gene and Cray XT, integrates runtime summaries suitable for obtaining a performance overview with in-depth studies of concurrent behavior via event tracing. Although Scalasca has already been used successfully with codes running on 294,912 cores on a 72-rack Blue Gene/P system, the current software design shows scalability limitations that adversely affect the user experience and that will present a serious obstacle on the way to mastering larger scales in the future. In this paper, we outline how to address the two most important ones, namely the unification of local identifiers at measurement finalization and the collation and display of analysis reports.
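Identifier unification, the first of the two bottlenecks, is simple to state even though making it scale is the hard part. The sketch below shows the basic idea under assumed data (Scalasca's actual scheme is hierarchical and far more scalable): each process names objects with dense local IDs, and finalization builds one global table plus a per-process local-to-global mapping:

    # Merge per-process local identifier tables into one global table.
    local_tables = {
        0: ["main", "solve", "exchange"],     # rank 0: local IDs 0..2
        1: ["main", "exchange", "io_write"],  # rank 1: local IDs 0..2
    }

    global_ids, mappings = {}, {}
    for rank, names in local_tables.items():
        mapping = []
        for name in names:
            if name not in global_ids:
                global_ids[name] = len(global_ids)
            mapping.append(global_ids[name])
        mappings[rank] = mapping              # local ID i -> mapping[i]

    print(global_ids)  # {'main': 0, 'solve': 1, 'exchange': 2, 'io_write': 3}
    print(mappings)    # {0: [0, 1, 2], 1: [0, 2, 3]}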