LLVM: A compilation framework for lifelong program analysis & transformation
, 2004
Cited by 229 (12 self)
... a compiler framework designed to support transparent, lifelong program analysis and transformation for arbitrary programs, by providing high-level information to compiler transformations at compile-time, link-time, run-time, and in idle time between runs. LLVM defines a common, low-level code representation in Static Single Assignment (SSA) form, with several novel features: a simple, language-independent type-system that exposes the primitives commonly used to implement high-level language features; an instruction for typed address arithmetic; and a simple mechanism that can be used to implement the exception handling features of high-level languages (and setjmp/longjmp in C) uniformly and efficiently. The LLVM compiler framework and code representation together provide a combination of key capabilities that are important for practical, lifelong analysis and transformation of programs. To our knowledge, no existing compilation approach provides all these capabilities. We describe the design of the LLVM representation and compiler framework, and evaluate the design in three ways: (a) the size and effectiveness of the representation, including the type information it provides; (b) compiler performance for several interprocedural problems; and (c) illustrative examples of the benefits LLVM provides for several challenging compiler problems.
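As a rough illustration of what "typed address arithmetic" means, the sketch below computes a byte offset by walking a list of indices through a nested type description. It is a toy model, not LLVM's instruction or API: the `Int`, `Array`, and `Struct` classes and the `element_offset` function are hypothetical names, and the layout is assumed to be packed (no alignment padding).

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Int:
    size: int  # size in bytes


@dataclass
class Array:
    elem: "Type"
    count: int


@dataclass
class Struct:
    fields: List["Type"]


Type = Union[Int, Array, Struct]


def size_of(t: Type) -> int:
    """Size in bytes, assuming a packed layout with no padding."""
    if isinstance(t, Int):
        return t.size
    if isinstance(t, Array):
        return size_of(t.elem) * t.count
    return sum(size_of(f) for f in t.fields)


def element_offset(t: Type, indices: List[int]) -> int:
    """Byte offset reached by applying each index to the current type."""
    offset = 0
    for i in indices:
        if isinstance(t, Array):
            offset += i * size_of(t.elem)  # skip i whole elements
            t = t.elem
        elif isinstance(t, Struct):
            offset += sum(size_of(f) for f in t.fields[:i])  # skip earlier fields
            t = t.fields[i]
        else:
            raise TypeError("cannot index into a scalar type")
    return offset


# A struct holding a 4-byte integer followed by eight 2-byte integers.
record = Struct([Int(4), Array(Int(2), 8)])
print(element_offset(record, [1, 3]))  # 4 + 3 * 2 = 10
```

Because the offset is derived from the type rather than written as a raw constant, an analysis can still see which field or element an address refers to.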
Relational Profiling: Enabling Thread-Level Parallelism in Virtual Machines
, 2000
Cited by 28 (1 self)
Virtual machine service threads can perform many tasks in parallel with program execution such as garbage collection, dynamic compilation, and profile collection and analysis. Hardware-assisted profiling is essential for providing service threads with needed information in a flexible and efficient way. A relational profiling architecture (RPA) is proposed for meeting this goal.
LLVA: A Low-level Virtual Instruction Set Architecture
- In MICRO-36
, 2003
Cited by 20 (3 self)
A virtual instruction set architecture (V-ISA) implemented via a processor-specific software translation layer can provide great flexibility to processor designers. Recent examples such as Crusoe and DAISY, however, have used existing hardware instruction sets as virtual ISAs, which complicates translation and optimization. In fact, there has been little research on specific designs for a virtual ISA for processors. This paper proposes a novel virtual ISA (LLVA) and a translation strategy for implementing it on arbitrary hardware. The instruction set is typed, uses an infinite virtual register set in Static Single Assignment form, and provides explicit control-flow and dataflow information, and yet uses low-level operations closely matched to traditional hardware. It includes novel mechanisms to allow more flexible optimization of native code, including a flexible exception model and minor constraints on self-modifying code. We propose a translation strategy that enables offline translation and transparent offline caching of native code and profile information, while remaining completely OS-independent. It also supports optimizations directly on the representation at install-time, runtime, and offline between executions. We show experimentally that the virtual ISA is compact, it is closely matched to ordinary hardware instruction sets, and permits very fast code generation, yet has enough high-level information to permit sophisticated program analyses and optimizations.
Optimizations and Oracle Parallelism with Dynamic Translation
- In Proc. 32nd International Symposium on Microarchitecture
, 1999
Cited by 18 (4 self)
We describe several optimizations which can be employed in a dynamic binary translation (DBT) system, where low compilation/translation overhead is essential. These optimizations achieve a high degree of ILP, sometimes even surpassing a static compiler employing more sophisticated and more time-consuming algorithms [9]. We present results in which we employ these optimizations in a dynamic binary translation system capable of computing oracle parallelism.
Macroscopic Data Structure Analysis and Optimization
, 2005
Cited by 11 (2 self)
Providing high performance for pointer-intensive programs on modern architectures is an increasingly difficult problem for compilers. Pointer-intensive programs are often bound by memory latency and cache performance, but traditional approaches to these problems usually fail: Pointer-intensive programs are often highly irregular and the compiler has little control over the layout of heap-allocated objects. This thesis presents a new class of techniques named “Macroscopic Data Structure Analyses and Optimizations”, which is a new approach to the problem of analyzing and optimizing pointer-intensive programs. Instead of analyzing individual load/store operations or structure definitions, this approach identifies, analyzes, and transforms entire memory structures as a unit. The foundation of the approach is an analysis named Data Structure Analysis and a transformation named Automatic Pool Allocation. Data Structure Analysis is a context-sensitive pointer analysis which identifies data structures on the heap and their important properties (such as type safety). Automatic Pool Allocation uses the results of Data Structure Analysis to segregate dynamically allocated objects on the heap, giving control over the layout of the data structure in memory to the compiler. Based on these two foundation techniques, this thesis describes several performance improving
Binary translation and architecture convergence issues for IBM System/390
- In Proc. of the International Conference on Supercomputing 2000, Santa Fe, NM
, 2000
Cited by 10 (8 self)
We describe the design issues in an implementation of the ESA/390 architecture based on binary translation to a very long instruction word (VLIW) processor. During binary translation, complex ESA/390 instructions are decomposed into instruction “primitives” which are then scheduled onto a wide-issue machine. The aim is to achieve high instruction-level parallelism due to the increased scheduling and optimization opportunities which can be exploited by binary translation software, combined with the efficiency of long instruction word architectures. A further aim is to study the feasibility of a common execution platform for different instruction set architectures, such as ESA/390, RS/6000, AS/400 and the Java Virtual Machine, so that multiple systems can be built around a common execution platform.
Hardware support for spin management in overcommitted virtual machines
- In Proc. of 15th PACT
, 2006
Cited by 9 (6 self)
Multiprocessor operating systems (OSs) pose several unique and conflicting challenges to System Virtual Machines (System VMs). For example, most existing system VMs resort to gang scheduling a guest OS’s virtual processors (VCPUs) to avoid OS synchronization overhead. However, gang scheduling is infeasible for some application domains, and is inflexible in other domains. In an overcommitted environment, an individual guest OS has more VCPUs than available physical processors (PCPUs), precluding the use of gang scheduling. In such an environment, we demonstrate a more than two-fold increase in runtime when transparently virtualizing a chip multiprocessor’s cores. To combat this problem, we propose a hardware technique to detect several cases when a VCPU is not performing useful work, and suggest preempting that VCPU to run a different, more productive VCPU. Our technique can dramatically reduce cycles wasted on OS synchronization, without requiring any semantic information from the software. We then present a case study, typical of server consolidation, to demonstrate the potential of more flexible scheduling policies enabled by our technique. We propose one such policy that logically partitions the CMP cores between guest VMs. This policy increases throughput by 10–25% for consolidated server workloads due to improved cache locality and core utilization, and substantially improves performance isolation in private caches.
Concurrent garbage collection using hardware-assisted profiling
- In Proceedings of the International Symposium on Memory Management
, 2000
Cited by 9 (1 self)
In the presence of on-chip multithreading, a Virtual Machine (VM) implementation can readily take advantage of service threads for enhancing performance by performing tasks such as profile collection and analysis, dynamic optimization, and garbage collection concurrently with program execution. In this context, a hardware-assisted profiling mechanism is proposed. The Relational Profiling Architecture (RPA) is designed from the top down. RPA is based on a relational model similar to the relational database model. Instructions selected for profiling produce a record of information. A simple query engine examines these records for patterns, and performs simple actions on matching records. The power and flexibility of RPA are demonstrated by developing a concurrent generational garbage collector for Java. Detailed execution-driven simulations show that this collector has an average runtime overhead of approximately 0.6%. The short pauses in the application required for synchronization with the garbage collector are at most 54 microseconds, given a 1 GHz clock frequency.
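The record-and-query model described in this abstract can be pictured in software as follows. This is only a sketch under stated assumptions: RPA is a hardware mechanism, and the record fields (`opcode`, `addr`), the `Query` class, the opcode constants, and the region boundary are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical record layout: one row of fields per profiled instruction.
Record = Dict[str, int]

LOAD, STORE = 0, 1            # assumed opcode encodings
OLD_REGION_START = 0x8000     # assumed heap boundary, for illustration only


@dataclass
class Query:
    """A pattern over one record, paired with a simple action."""
    matches: Callable[[Record], bool]
    action: Callable[[Record], None]


def run_queries(records: List[Record], queries: List[Query]) -> None:
    """Check every record against every query and act on each match."""
    for record in records:
        for query in queries:
            if query.matches(record):
                query.action(record)


# Example: remember the addresses of stores into the "old" region, the
# kind of event a generational collector would want to be told about.
remembered: List[int] = []
store_into_old = Query(
    matches=lambda r: r["opcode"] == STORE and r["addr"] >= OLD_REGION_START,
    action=lambda r: remembered.append(r["addr"]),
)

profile = [
    {"opcode": LOAD, "addr": 0x8000},
    {"opcode": STORE, "addr": 0x1000},
    {"opcode": STORE, "addr": 0x8004},
]
run_queries(profile, [store_into_old])
print(remembered)  # only the third record matches
```

Keeping the pattern and the action separate is what lets one engine serve different service threads: each thread installs its own queries over the same stream of records.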
Tuning garbage collection for reducing memory system energy in an embedded Java environment
- ACM Transactions on Embedded Computing Systems
, 2002
Cited by 6 (0 self)
Java has been widely adopted as one of the software platforms for the seamless integration of diverse computing devices. Over the last year, there has been great momentum in adopting Java technology in devices such as cellphones, PDAs, and pagers where optimizing energy consumption is critical. Since, traditionally, the Java virtual machine (JVM), the cornerstone of Java technology, is tuned for performance, taking into account energy consumption requires reevaluation, and possibly redesign of the virtual machine. This motivates us to tune specific components of the virtual machine for a battery-operated architecture. As embedded JVMs are designed to run for long periods of time on limited-memory embedded systems, creating and managing Java objects is of critical importance. The garbage collector (GC) is an important part of the JVM responsible for the automatic reclamation of unused memory. This article shows that the GC is not only important for limited-memory systems but also for energy-constrained architectures. This article focuses on tuning the GC to reduce energy consumption in a multibanked memory architecture. Tuning the GC is important not because it consumes a sizeable portion of overall energy during execution, but because it influences the energy consumed in the memory during application execution. In particular, we present a GC-controlled leakage energy optimization technique
Concurrent Garbage Collection Using Program Slices on Multithreaded Processors
, 2000
Cited by 4 (0 self)
We investigate reference counting in the context of a multithreaded architecture by exploiting two observations: (1) reference-counting can be performed by a transformed program slice of the mutator that isolates heap references, and (2) hardware trends indicate that microprocessors in the near future will be able to execute multiple concurrent threads on a single chip. We generate a reference-counting collector as a transformed program slice of an application and then execute this slice in parallel with the application as a "run-behind" thread. Preliminary measurements of collector overheads are quite encouraging, showing a 25% to 53% space overhead to transfer garbage collection to a separate thread.
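The "run-behind" idea can be sketched with an ordinary software thread. Note the substitution: the abstract derives the collector as a transformed program slice of the application, whereas this sketch has the mutator write an explicit log of pointer stores that a second thread replays. The `collector` and `run_demo` functions and the log format are hypothetical, and freeing an object does not cascade to the objects it references.

```python
import queue
import threading


def collector(log, counts, freed):
    """Run-behind thread: replay logged pointer stores and adjust counts."""
    while True:
        event = log.get()
        if event is None:  # sentinel: the mutator has finished
            break
        old, new = event
        # Increment before decrementing so that overwriting a slot with
        # the value it already holds never drops a count to zero.
        if new is not None:
            counts[new] = counts.get(new, 0) + 1
        if old is not None:
            counts[old] -= 1
            if counts[old] == 0:
                freed.append(old)


def run_demo():
    log = queue.Queue()
    counts, freed = {}, []
    worker = threading.Thread(target=collector, args=(log, counts, freed))
    worker.start()
    # Mutator side: each pointer store is logged as (old target, new target).
    log.put((None, "A"))  # x = A
    log.put((None, "B"))  # y = B
    log.put(("A", "B"))   # x = B, dropping the last reference to A
    log.put(None)
    worker.join()
    return counts, freed


counts, freed = run_demo()
print(counts, freed)
```

The mutator only appends to the log and never touches a count, so all counting work moves to the second thread; the log itself is extra space that the mutator would not otherwise need.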

