SD3: A scalable approach to dynamic data-dependence profiling
, 2010
Cited by 5 (1 self)
As multicore processors are deployed in mainstream computing, the need for software tools to help parallelize programs is increasing dramatically. Data-dependence profiling is an important technique to exploit parallelism in programs. More specifically, manual or automatic parallelization can use the outcomes of data-dependence profiling to guide where to parallelize in a program. However, state-of-the-art data-dependence profiling techniques are not scalable as they suffer from two major issues when profiling large and long-running applications: (1) runtime overhead and (2) memory overhead. Existing data-dependence profilers are either unable to profile large-scale applications or only report very limited information. In this paper, we propose a scalable approach to data-dependence profiling that addresses both runtime and memory overhead in a single framework. Our technique, called SD3, reduces the runtime overhead by parallelizing the dependence profiling step itself. To reduce the memory overhead, we compress memory accesses that exhibit stride patterns and compute data dependences directly in a compressed format. We demonstrate that SD3 reduces the runtime overhead when profiling SPEC 2006 by factors of 4.1× and 9.7× on eight cores and 32 cores, respectively. For the memory overhead, we successfully profile SPEC 2006 with the reference input, while the previous approaches fail even with the train input. In some cases, we observe more than a 20× improvement in memory consumption and a 16× speedup in profiling time when 32 cores are used. Keywords: profiling, data dependence, parallel programming, program analysis, compression, parallelization.
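The stride-compression idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual data structures or dependence algorithm, so everything here is an assumption made for illustration: the `Stride` record, the greedy `compress` routine, and the `overlaps` check are hypothetical names, and the overlap test simply walks the shorter run and does a constant-time membership test against the longer one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stride:
    """Hypothetical compressed record: base, base+step, ..., `count` addresses."""
    base: int
    step: int   # 0 means the same address repeated
    count: int

    def contains(self, addr):
        # Constant-time membership test on the compressed form.
        if self.step == 0:
            return addr == self.base
        offset = addr - self.base
        return offset % self.step == 0 and 0 <= offset // self.step < self.count


def compress(addresses):
    """Greedily split an address trace into maximal fixed-stride runs."""
    runs, i, n = [], 0, len(addresses)
    while i < n:
        if i + 1 == n:
            runs.append(Stride(addresses[i], 0, 1))
            break
        step = addresses[i + 1] - addresses[i]
        j = i + 1
        while j + 1 < n and addresses[j + 1] - addresses[j] == step:
            j += 1
        runs.append(Stride(addresses[i], step, j - i + 1))
        i = j + 1
    return runs


def overlaps(a, b):
    """True if two runs share an address (a possible dependence).

    Simplification: enumerates the shorter run only; the longer run is
    never expanded.
    """
    small, large = (a, b) if a.count <= b.count else (b, a)
    return any(large.contains(small.base + k * small.step)
               for k in range(small.count))


writes = compress(list(range(0, 40, 4)))      # one run: base 0, step 4
reads_odd = compress(list(range(2, 42, 4)))   # interleaved, never touches writes
reads_hit = compress([36, 44, 52])            # shares address 36 with writes
```

In this toy trace ten write addresses collapse into a single record, and the dependence check runs on the records without rebuilding the full address list.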
Kremlin: Rethinking and Rebooting gprof for the Multicore Age
Cited by 5 (3 self)
Many recent parallelization tools lower the barrier for parallelizing a program, but overlook one of the first questions that a programmer needs to answer: which parts of the program should I spend time parallelizing? This paper examines Kremlin, an automatic tool that, given a serial version of a program, will make recommendations to the user as to what regions (e.g. loops or functions) of the program to attack first. Kremlin introduces a novel hierarchical critical path analysis and develops a new metric for estimating the potential of parallelizing a region: self-parallelism. We further introduce the concept of a parallelism planner, which provides a ranked order of specific regions to the programmer that are likely to have the largest performance impact when parallelized. Kremlin supports multiple planner personalities, which allow the planner to more effectively target a particular programming environment or class of machine. We demonstrate the effectiveness of one such personality, an OpenMP planner, by comparing versions of programs that are parallelized according to Kremlin’s plan against third-party manually parallelized versions. The results show that Kremlin’s OpenMP planner is highly effective, producing plans whose performance is typically comparable to, and sometimes much better than, manual parallelization. At the same time, these plans would require that the user parallelize significantly fewer regions of the program.
Precise Calling Context Encoding
- In ACM International Conference on Software Engineering
, 2010
Cited by 3 (0 self)
Calling contexts are very important for a wide range of applications such as profiling, debugging, and event logging. Most applications perform expensive stack walking to recover contexts. The resulting contexts are often explicitly represented as a sequence of call sites and hence bulky. We propose a technique to encode the current calling context of any point during an execution. In particular, an acyclic call path is encoded into one number through only integer additions. Recursive call paths are divided into acyclic subsequences and encoded independently. We leverage stack depth in a safe way to optimize encoding: if a calling context can be safely and uniquely identified by its stack depth, we do not perform encoding. We propose an algorithm to seamlessly fuse encoding and stack depth based identification. The algorithm is safe because different contexts are guaranteed to have different IDs. It also ensures contexts can be faithfully decoded. Our experiments show that our technique incurs negligible overhead (1.89% on average). For most medium-sized programs, it can encode all contexts with just one number. For large programs, we are able to encode most calling contexts to a few numbers.
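One way to encode an acyclic call path as a single number using only additions is a path-numbering scheme over the call graph, sketched below. This is an illustrative assumption, not necessarily the paper's exact algorithm: the function names (`annotate`, `encode`, `decode`) are hypothetical, the graph must be acyclic with every function reachable from the root, and each caller/callee pair is assumed to have a single call site. Recursion and the stack-depth optimization mentioned in the abstract are not modeled.

```python
from collections import defaultdict


def annotate(call_graph, root):
    """Assign each call edge an integer increment (hypothetical helper).

    call_graph maps caller -> list of callees and must be acyclic.
    """
    preds = defaultdict(list)
    for caller, callees in call_graph.items():
        for callee in callees:
            preds[callee].append(caller)

    # Topological order: every caller is placed before its callees.
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for p in preds[node]:
            visit(p)
        order.append(node)

    for node in list(call_graph) + list(preds):
        visit(node)

    num, inc = {}, {}          # num[f]: number of distinct contexts reaching f
    for node in order:
        if node == root:
            num[node] = 1
            continue
        total = 0
        for p in preds[node]:
            inc[(p, node)] = total   # added to the ID when p calls node
            total += num[p]
        num[node] = total
    return preds, num, inc


def encode(path, inc):
    """Context ID of a call path: the sum of its edge increments."""
    return sum(inc[edge] for edge in zip(path, path[1:]))


def decode(node, cid, root, preds, num, inc):
    """Recover the call path from the current function and its context ID."""
    path = [node]
    while node != root:
        for p in preds[node]:
            low = inc[(p, node)]
            if low <= cid < low + num[p]:
                cid -= low
                node = p
                break
        else:
            raise ValueError("invalid context id")
        path.append(node)
    return path[::-1]


graph = {"main": ["A", "B"], "A": ["C", "D"], "B": ["C"], "C": ["D"]}
preds, num, inc = annotate(graph, "main")
paths_to_d = [["main", "A", "D"], ["main", "A", "C", "D"], ["main", "B", "C", "D"]]
ids = [encode(p, inc) for p in paths_to_d]
```

At run time an instrumented program would add the edge increment before a call and subtract it after the return, so the running sum always identifies the current context.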
LAMPVIEW: A LOOP-AWARE TOOLSET FOR FACILITATING PARALLELIZATION
Cited by 1 (0 self)
Continual growth in the number of transistors per unit area, coupled with diminishing returns from traditional microarchitectural and clock frequency improvements, has led processor manufacturers to place multiple cores on a single chip. However, only multi-threaded code can fully take advantage of the new multicore processors; legacy single-threaded code does not benefit. Many approaches to parallelization have been explored, including both manual and automatic techniques. Unfortunately, research in this area is impeded by the innate difficulty of exploring code by hand for new possible parallelization schemes. Whether it is a researcher attempting to discover possible automatic techniques or a programmer attempting manual parallelization, the benefits of good dependence information are substantial. This thesis provides a profiling and analysis toolset aimed at easing a programmer's or researcher's effort in finding parallelism. The toolset, the Loop-Aware Memory Profile Viewing System (LAMPView), is developed in three parts. The first part is a multi-frontend, multi-target compiler pass written to instrument the code with calls to the Loop-Aware Memory Profiling (LAMP) library. The compile-time
Kismet: Parallel Speedup Estimates for Serial Programs
Cited by 1 (0 self)
Software engineers now face the difficult task of refactoring serial programs for parallel execution on multicore processors. Currently, they are offered little guidance as to how much benefit may come from this task, or how close they are to the best possible parallelization. This paper presents Kismet, a tool that creates parallel speedup estimates for unparallelized serial programs. Kismet differs from previous approaches in that it does not require any manual analysis or modification of the program. This difference allows quick analysis of many programs, avoiding wasted engineering effort on those that are fundamentally limited. To accomplish this task, Kismet builds upon the hierarchical critical path analysis (HCPA) technique, a recently developed dynamic analysis that localizes parallelism to each of the potentially nested regions in the target program. It then uses a parallel execution time model to compute an approximate upper bound for performance, modeling constraints that stem from both hardware parameters and internal program structure. Our evaluation applies Kismet to eight high-parallelism NAS Parallel Benchmarks running on a 32-core AMD multicore system, five low-parallelism SpecInt benchmarks, and six medium-parallelism benchmarks running on the fine-grained MIT Raw processor. The results are compelling. Kismet is able to significantly improve the accuracy of parallel speedup estimates relative to prior work based on critical path analysis.
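For context, the plain critical path analysis that this abstract uses as its baseline can be sketched as follows. This is the classical work/critical-path bound only, not Kismet's hierarchical analysis or its execution time model, which the abstract does not specify. The task names, costs, and helper names (`critical_path`, `speedup_bound`) are hypothetical.

```python
def critical_path(work, deps):
    """Length of the longest dependence chain.

    work: task -> cost; deps: task -> list of prerequisite tasks (acyclic).
    """
    finish = {}

    def finish_time(task):
        if task not in finish:
            earliest_start = max(
                (finish_time(d) for d in deps.get(task, [])), default=0
            )
            finish[task] = earliest_start + work[task]
        return finish[task]

    return max(finish_time(task) for task in work)


def speedup_bound(work, deps, cores):
    """Classical upper bound: limited by core count and by work / critical path."""
    total = sum(work.values())
    return min(cores, total / critical_path(work, deps))


# Hypothetical diamond-shaped dependence graph: a -> (b, c) -> d
work = {"a": 2, "b": 3, "c": 3, "d": 2}
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
bound_32 = speedup_bound(work, deps, 32)
```

In this example only `b` and `c` can overlap, so the bound stays well below the core count regardless of how many cores are available.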
Facilitating program parallelisation: a profiling-based approach
, 2011
The advance of multi-core architectures signals the end of universal speed-up of software over time. To continue exploiting hardware developments, effort must be invested in producing software that can be split up to run on multiple cores or processors. Many solutions have been proposed to address this issue, ranging from explicit to implicit parallelism, but consensus has yet to be reached on the best way to tackle such a problem. In this thesis we propose a profiling-based interactive approach to program parallelisation. Profilers gather dependence information on a program, which is then used to automatically parallelise the program at source-level. The programmer can then examine the resulting parallel program, and using critical path information from the profiler, identify and refactor parallelism bottlenecks to enable further parallelism. We argue that this is an efficient and effective method of parallelising general sequential programs. Our first contribution is a comprehensive analysis of limits of parallelism in several benchmark programs, performed by constructing Dynamic Dependence Graphs (DDGs) from execution traces. We show that average available parallelism is often high, but realising it would require various changes in compilation, language or computation models. As
Parkour: Parallel Speedup Estimates for Serial Programs
We present Parkour, a tool that creates parallel speedup estimates for unparallelized serial programs. Unlike previous approaches, it does not require any prior human analysis or modification of the program. Parkour automatically quantifies the parallelism of a given program and provides an approximate upper bound for performance, modeling fundamental parallelization constraints. For the evaluation, Parkour is applied to three benchmarks from the NAS Parallel benchmark suite running on a 32-core AMD multicore system, and three benchmarks running on the fine-grained MIT Raw processor. The results are compelling. Parkour is able to significantly improve the accuracy of parallel speedup estimates relative to the critical path analysis technique that it extends.
Predicting Potential Speedup of Serial Code via Lightweight Profiling and Emulations with Memory Performance Model
We present Parallel Prophet, which projects potential parallel speedup from an annotated serial program before actual parallelization. Programmers want to see how much speedup could be obtained prior to investing time and effort to write parallel code. With Parallel Prophet, programmers simply insert annotations that describe the parallel behavior of the serial program. Parallel Prophet then uses lightweight interval profiling and dynamic emulations to predict potential performance benefit. Parallel Prophet models many realistic features of parallel programs: unbalanced workload, multiple critical sections, nested and recursive parallelism, and specific thread schedulings and paradigms, which are hard to model in previous approaches. Furthermore, Parallel Prophet predicts speedup saturation resulting from memory and caches by monitoring cache hit ratio and bandwidth consumption in a serial program. We achieve very small runtime overhead, a slowdown of approximately 1.2 to 10 times, with moderate memory consumption. We demonstrate the effectiveness of Parallel Prophet in eight benchmarks in the OmpSCR and NAS Parallel benchmarks by comparing our predictions with actual parallelized code. Our simple memory model also identifies performance limitations resulting from memory system contention.
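The general idea of predicting speedup by emulating a schedule over profiled intervals can be sketched as below. This is a toy model under stated assumptions, not Parallel Prophet's emulator: it covers one annotated loop with dynamic scheduling, and ignores critical sections, nesting, and the memory model. The function names and the per-iteration timings are hypothetical, and the iteration list is assumed to be non-empty.

```python
import heapq


def emulate_dynamic(iter_times, threads):
    """Emulated parallel time when each idle thread grabs the next iteration."""
    ready = [0.0] * threads          # time at which each thread becomes idle
    heapq.heapify(ready)
    for duration in iter_times:
        start = heapq.heappop(ready)  # earliest-available thread takes the work
        heapq.heappush(ready, start + duration)
    return max(ready)


def predicted_speedup(iter_times, threads):
    """Serial time divided by emulated parallel time."""
    return sum(iter_times) / emulate_dynamic(iter_times, threads)


# Hypothetical profile of an unbalanced loop: one long iteration, eight short ones.
profile = [8, 1, 1, 1, 1, 1, 1, 1, 1]
speedup_2 = predicted_speedup(profile, 2)
speedup_4 = predicted_speedup(profile, 4)
```

With this profile the single long iteration dominates, so adding threads beyond two does not raise the predicted speedup, which is the kind of workload-imbalance effect the abstract refers to.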

