Results 1 - 10
of
25
The Common Case Transactional Behavior of Multithreaded Programs
- In Proceedings of the 12th International Conference on High-Performance Computer Architecture
, 2006
"... Transactional memory (TM) provides an easy-to-use and high-performance parallel programming model for the upcoming chip-multiprocessor systems. Several researchers have proposed alternative hardware and software TM implementations. However, the lack of transaction-based programs makes it difficult t ..."
Abstract
-
Cited by 34 (6 self)
- Add to MetaCart
Transactional memory (TM) provides an easy-to-use and high-performance parallel programming model for the upcoming chip-multiprocessor systems. Several researchers have proposed alternative hardware and software TM implementations. However, the lack of transaction-based programs makes it difficult to understand the merits of each proposal and to tune future TM implementations to the common case behavior of real application. This work addresses this problem by analyzing the common case transactional behavior for 35 multithreaded programs from a wide range of application domains. We identify transactions within the source code by mapping existing primitives for parallelism and synchronization management to transaction boundaries. The analysis covers basic characteristics such as transaction length, distribution of readset and write-set size, and the frequency of nesting and I/O operations. The measured characteristics provide key insights into the design of efficient TM systems for both nonblocking synchronization and speculative parallelization. 1.
Improving region selection in dynamic optimization systems
- In MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
, 2005
"... The performance of a dynamic optimization system depends heavily on the code it selects to optimize. Many current systems follow the design of HP Dynamo and select a single interprocedural path, or trace, as the unit of code optimization and code caching. Though this approach to region selection has ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
The performance of a dynamic optimization system depends heavily on the code it selects to optimize. Many current systems follow the design of HP Dynamo and select a single interprocedural path, or trace, as the unit of code optimization and code caching. Though this approach to region selection has worked well in practice, we show that it is possible to adapt this basic approach to produce regions with greater locality, less needless code duplication, and fewer profiling counters. In particular, we propose two new region-selection algorithms and evaluate them against Dynamo’s selection mechanism, Next-Executing Tail (NET). Our first algorithm, Last-Executed Iteration (LEI), identifies cyclic paths of execution better than NET, improving locality of execution while reducing the size of the code cache. Our second algorithm allows overlapping traces of similar execution frequency to be combined into a single large region. This second technique can be applied to both NET and LEI, and we find that it significantly improves metrics of locality and memory overhead for each. 1
LeakSurvivor: Towards Safely Tolerating Memory Leaks for Garbage-Collected Languages
- In USENIX Annual Technical Conference
, 2008
"... Continuous memory leaks severely hurt program performance and software availability for garbage-collected programs. This paper presents a safe method, called LeakSurvivor, to tolerate continuous memory leaks at runtime for garbage-collected programs. Our main idea is to periodically swap out the “Po ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
Continuous memory leaks severely hurt program performance and software availability for garbage-collected programs. This paper presents a safe method, called LeakSurvivor, to tolerate continuous memory leaks at runtime for garbage-collected programs. Our main idea is to periodically swap out the “Potentially Leaked ” (PL) memory objects identified by leak detectors from the virtual memory to disks. As a result, the virtual memory space occupied by the PL objects can be reclaimed by garbage collectors and available for future uses. If a swapped-out PL object is accesses later, LeakSurvivor will restore it from disks to the memory for correct program execution. Furthermore, LeakSurvivor helps developers to prune false positives. We have built the prototype of LeakSurvivor on top of Jikes RVM 2.4.2, a high performance Java-in-Java virtual machine developed by IBM. We conduct the experiments with three Java applications including Eclipse, SPECjbb2000 and Jigsaw. Among them, Eclipse and Jigsaw contain memory leaks introduced by their developers, while SPECjbb2000 contain a memory leak injected by us. Our results show that LeakSurvivor effectively tolerates memory leaks for two applications (Eclipse and SPECjbb2000), i.e., no cumulative performance degradation and no software failures when facing continuous memory leaks at runtime. For Jigsaw, LeakSurvivor extends the program lifetime by two times and improves the performance by 46 % compared with native runs. Furthermore, when there are no memory leaks, LeakSurvivor imposes small runtime overhead, i.e., 2.5 % over the leak detector and 23.7 % over the native runs. 1
The design and implementation Ocelot’s dynamic binary translator from PTX to multi-core x86
, 2009
"... Abstract—Ocelot is a dynamic compilation framework designed to map the explicitly parallel PTX execution model used by NVIDIA CUDA applications onto diverse many-core architectures. Ocelot includes a dynamic binary translator from PTX to many-core processors that leverages the LLVM code generator to ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
Abstract—Ocelot is a dynamic compilation framework designed to map the explicitly parallel PTX execution model used by NVIDIA CUDA applications onto diverse many-core architectures. Ocelot includes a dynamic binary translator from PTX to many-core processors that leverages the LLVM code generator to target x86. The binary translator is able to execute CUDA applications without recompilation and Ocelot can in fact dynamically switch between execution on an NVIDIA GPU and a many-core CPU. It has been validated against over 100 applications taken from the CUDA SDK [1], the UIUC Parboil benchmarks [2], the Virginia Rodinia benchmarks [3], the GPU-VSIPL signal and image processing library [4], and several domain specific applications. This paper presents a detailed description of the implementation of our binary translator highlighting design decisions and trade-offs, and showcasing their effect on application performance. We explore several code transformations that are applicable only when translating explicitly parallel applications and suggest additional optimization passes that may be useful to this class of applications. We expect this study to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures. I.
A Comprehensive Framework for Testing Database-Centric Applications
, 2007
"... This dissertation was presented by ..."
Hera-JVM: Abstracting Processor Heterogeneity Behind a Virtual Machine
"... Heterogeneous multi-core processors, such as the Cell processor, can deliver exceptional performance, however, they are notoriously difficult to program effectively. We present Hera-JVM, a runtime system which hides a processor’s heterogeneity behind a homogeneous virtual machine interface. Prelimin ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
Heterogeneous multi-core processors, such as the Cell processor, can deliver exceptional performance, however, they are notoriously difficult to program effectively. We present Hera-JVM, a runtime system which hides a processor’s heterogeneity behind a homogeneous virtual machine interface. Preliminary results of three benchmarks running under Hera-JVM are presented. These results suggest a set of application behaviour characteristics that the runtime system should take into account when placing threads on different core types. 1
Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Systems
"... This paper introduces a new abstraction to accelerate the readbarriers and write-barriers used by language runtime systems. We exploit the fact that, dynamically, many barrier executions perform checks but no real work—e.g., in generational garbage collection (GC), frequent checks are needed to dete ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
This paper introduces a new abstraction to accelerate the readbarriers and write-barriers used by language runtime systems. We exploit the fact that, dynamically, many barrier executions perform checks but no real work—e.g., in generational garbage collection (GC), frequent checks are needed to detect the creation of intergenerational references, even though such references occur rarely in many workloads. We introduce a form of dynamic filtering that identifies redundant checks by (i) recording checks that have recently been executed, and (ii) detecting when a barrier is repeating one of these checks. We show how this technique can be applied to a variety of algorithms for GC, transactional memory, and language-based security. By supporting dynamic filtering in the instruction set, we show that the fast-paths of these barriers can be streamlined, reducing the impact on the quality of surrounding code. We show how we accelerate the barriers used for generational GC and transactional memory in the Bartok research compiler. With a 2048-entry filter, dynamic filtering eliminates almost all the overhead of the GC write-barriers. Dynamic filtering eliminates around half the overhead of STM over a non-synchronized baseline—even when used with an STM that is already designed for low overhead, and which employs static analyses to avoid redundant operations.
Concept Assignment as a Debugging Technique for Code Generators
, 2005
"... Code generators are notoriously difficult to debug, yet their correctness is crucial. This paper demonstrates that concept assignment can be performed in an entirely syntax-directed manner for code generators and other highly structured program modules. Anomalies in this concept assignment indicate ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
Code generators are notoriously difficult to debug, yet their correctness is crucial. This paper demonstrates that concept assignment can be performed in an entirely syntax-directed manner for code generators and other highly structured program modules. Anomalies in this concept assignment indicate the possible existence of bugs. These insights enable the formulation of a new debugging technique for code generators. This paper describes the procedure, a practical implementation, and results from the application of this debugging technique to an experimental code generator.
ByCounter: Portable Runtime Counting of Bytecode Instructions and Method Invocations
"... For bytecode-based applications, runtime instruction counts can be used as a platform-independent application execution metric, and also can serve as the basis for bytecode-based performance prediction. However, different instruction types have different execution durations, so they must be counted ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
For bytecode-based applications, runtime instruction counts can be used as a platform-independent application execution metric, and also can serve as the basis for bytecode-based performance prediction. However, different instruction types have different execution durations, so they must be counted separately, and method invocations should be identified and counted because of their substantial contribution to the total application performance. For Java bytecode, most JVMs and profilers do not provide such functionality at all, and existing bytecode analysis frameworks require expensive JVM instrumentation for instruction-level counting. In this paper, we present ByCounter, a lightweight approach for exact runtime counting of executed bytecode instructions and method invocations. ByCounter significantly reduces total counting costs by instrumenting only the application bytecode and not the JVM, and it can be used without modifications on any JVM. We evaluate the presented approach by successfully applying it to multiple Java applications on different JVMs, and discuss the runtime costs of applying ByCounter to these cases. Key words: Java, bytecode, counting, portable, fine-grained

