Instruction based memory distance analysis and its application to optimization
In Proceedings of the 14th International Conference on Parallel Architectures and Compilation, 2005
Cited by 18 (2 self)
Feedback-directed optimization has become an increasingly important tool in designing and building optimizing compilers, as it provides a means to analyze complex program behavior that is not possible using traditional static analysis. Feedback-directed optimization offers the compiler opportunities to analyze and optimize the memory behavior of programs even when traditional array-based analysis is not applicable. As a result, both floating-point and integer programs can benefit from memory hierarchy optimization. In this paper, we examine the notion of memory distance as it is applied to the instruction space of a program and to feedback-directed optimization. Memory distance is defined as a dynamic distance in terms of memory references between two accesses to the same memory location. We use memory distance to predict the miss rates of instructions in a program. Using the miss rates, we then identify the program’s critical instructions, the set of high-miss instructions whose cumulative misses account for 95% of the L2 cache misses in the program, in both integer and floating-point programs. Our experiments show that distance analysis can effectively identify critical instructions in both integer and floating-point programs. Additionally, we apply memory-distance analysis to memory disambiguation in out-of-order issue processors, using those distances to determine when a load may be speculated ahead of a preceding store. Our experiments show that memory-distance-based disambiguation on average achieves within [...] of the performance gain of the store set technique, which requires a hardware table.
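The two definitions in this abstract can be made concrete with a small sketch. It is not the authors' tool. It assumes one common reading of memory distance (the number of distinct addresses referenced between two accesses to the same address), and the function names `reuse_distances` and `critical_instructions`, as well as the per-instruction miss counts, are hypothetical.

```python
def reuse_distances(trace):
    """For each access, count the distinct addresses touched since the
    previous access to the same address (None for a first access).

    Assumption: distance is measured in distinct addresses; the abstract
    only says "in terms of memory references".
    """
    last_seen = {}   # address -> index of its most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            between = trace[last_seen[addr] + 1:i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


def critical_instructions(misses_by_pc, coverage=0.95):
    """Pick instructions in descending miss order until their cumulative
    misses reach `coverage` of all misses (95% in the abstract)."""
    total = sum(misses_by_pc.values())
    chosen, covered = [], 0
    for pc, misses in sorted(misses_by_pc.items(), key=lambda kv: -kv[1]):
        if covered >= coverage * total:
            break
        chosen.append(pc)
        covered += misses
    return chosen
```

For the trace `["a", "b", "c", "a", "b", "b"]`, the second access to `a` has distance 2 and the final access to `b` has distance 0. A short distance suggests the location is likely still cached, which is the intuition behind predicting per-instruction miss rates from distances.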
Software-Controlled Multithreading Using Informing Memory Operations
In International Symposium on High-Performance Computer Architecture, 1998
Cited by 14 (0 self)
Memory latency is becoming an increasingly important performance bottleneck, especially in multiprocessors. One technique for tolerating memory latency is multithreading, whereby we switch between threads upon expensive cache misses. In contrast with previous work on multithreading, we explore a new approach that is software-controlled rather than hardware-controlled. To implement software-controlled multithreading, we use informing memory operations to quickly trap upon cache misses to a miss handler which performs the actual thread switching in software. Our experimental results demonstrate that software-controlled multithreading can result in significant performance gains on a shared-memory multiprocessor, with the majority of applications speeding up by 10% or more, and one application speeding up by 16%. In addition, we find that by selectively applying a register partitioning optimization to reduce the thread-switching overhead, we can increase the overall speedups to as much as ...
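The switch-on-miss idea can be illustrated with a toy simulation. This is only an analogy under stated assumptions, not the mechanism from the paper: Python generators stand in for threads, `yield` stands in for the trap to the miss handler, and the cache is an unbounded set. The names `worker` and `run` are hypothetical.

```python
from collections import deque


def worker(name, accesses, cache, log):
    """A simulated thread: on a cache miss it records the miss and yields,
    which plays the role of the software miss handler switching threads."""
    for addr in accesses:
        if addr not in cache:
            cache.add(addr)                    # the miss brings the line in
            log.append((name, "miss", addr))
            yield                              # give up the processor
        else:
            log.append((name, "hit", addr))


def run(threads):
    """Round-robin scheduler: a thread runs until its next miss, then goes
    to the back of the ready queue."""
    ready = deque(threads)
    while ready:
        thread = ready.popleft()
        try:
            next(thread)
            ready.append(thread)               # switched out on a miss
        except StopIteration:
            pass                               # thread finished
```

Running `run([worker("A", [1, 1, 2], cache, log), worker("B", [1, 3], cache, log)])` with an empty `cache` and `log` interleaves the two threads so that B executes while A waits on its first miss.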
An object-aware memory architecture
2005
Cited by 8 (0 self)
Despite its dominance, object-oriented computation has received scant attention from the architecture community. We propose a novel memory architecture that supports objects and garbage collection (GC). Our architecture is co-designed with a Java Virtual Machine to improve the functionality and efficiency of heap memory management. The architecture is based on an address space for objects accessed using object IDs mapped by a translator to physical addresses. To support this, the system includes object-addressed caches, a hardware GC barrier to allow in-cache GC of objects, and an exposed cache structure cooperatively managed by the JVM. These extend a conventional architecture, without compromising compatibility or performance for legacy binaries. Our innovations enable various improvements such as: a novel technique for parallel and concurrent garbage collection, without requiring any global synchronization; an in-cache garbage collector, which never accesses main memory; concurrent compaction of objects; and elimination of most GC store barrier overhead. We compare the behavior of our system against that of a conventional generational garbage collector, both with and without an explicit allocate-in-cache operation. Explicit allocation eliminates many write misses; our scheme additionally trades L2 misses for in-cache operations, and provides the mapping indirection required for concurrent compaction.
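The role of the mapping indirection in compaction can be sketched in software. This is a minimal illustration, not the proposed hardware: the class `ObjectTranslator`, the function `compact`, and the dictionary-based table are all hypothetical. The point it shows is that when references hold object IDs rather than addresses, moving an object only requires updating one table entry.

```python
class ObjectTranslator:
    """Maps object IDs to physical addresses (a software stand-in for the
    translator described in the abstract)."""

    def __init__(self):
        self.table = {}     # object ID -> physical address
        self.next_id = 1

    def allocate(self, address):
        oid = self.next_id
        self.next_id += 1
        self.table[oid] = address
        return oid

    def translate(self, oid, offset=0):
        return self.table[oid] + offset

    def relocate(self, oid, new_address):
        # References hold IDs, so no pointer in the heap needs fixing up.
        self.table[oid] = new_address


def compact(translator, live_sizes, base=0):
    """Slide live objects toward `base` in address order; returns the first
    free address after compaction. `live_sizes` maps object ID -> size."""
    cursor = base
    for oid in sorted(live_sizes, key=translator.translate):
        translator.relocate(oid, cursor)
        cursor += live_sizes[oid]
    return cursor
```

Because only the table changes, code holding an object ID keeps working across a move, which is why such indirection is convenient for compacting while the program runs.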
Software prefetching for mark-sweep garbage collection: hardware analysis and software redesign
In ASPLOS-XI: Proceedings of the 11th international conference on Architectural, 2004
Cited by 5 (1 self)
Tracing garbage collectors traverse references from live program variables, transitively tracing out the closure of live objects. Memory accesses incurred during tracing are essentially random: a given object may contain references to any other object. Since application heaps are typically much larger than hardware caches, tracing results in many cache misses. Technology trends will make cache misses more important, so tracing is a prime target for prefetching. Simulation of Java benchmarks running with the Boehm-Demers-Weiser mark-sweep garbage collector for a projected hardware platform reveals high tracing overhead (up to 65% of elapsed time), and that cache misses are a problem. Applying Boehm’s default prefetching strategy yields improvements in execution time (16% on average with incremental/generational collection for GC-intensive benchmarks), but analysis shows that his strategy suffers from significant timing problems: prefetches that occur too early or too late relative to their matching loads. This analysis drives development of a new prefetching strategy that yields up to three times the performance improvement of Boehm’s strategy for GC-intensive benchmarks (27% average speedup), and achieves performance close to that of perfect timing (i.e., few misses for tracing accesses) on some benchmarks. Validating these simulation results with live runs on current hardware produces an average speedup of 6% for the new strategy on GC-intensive benchmarks with a GC configuration that tightly controls heap growth. In contrast, Boehm’s default prefetching strategy is ineffective on this platform.
Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Systems
Cited by 3 (0 self)
This paper introduces a new abstraction to accelerate the read-barriers and write-barriers used by language runtime systems. We exploit the fact that, dynamically, many barrier executions perform checks but no real work: for example, in generational garbage collection (GC), frequent checks are needed to detect the creation of intergenerational references, even though such references occur rarely in many workloads. We introduce a form of dynamic filtering that identifies redundant checks by (i) recording checks that have recently been executed, and (ii) detecting when a barrier is repeating one of these checks. We show how this technique can be applied to a variety of algorithms for GC, transactional memory, and language-based security. By supporting dynamic filtering in the instruction set, we show that the fast-paths of these barriers can be streamlined, reducing the impact on the quality of surrounding code. We show how we accelerate the barriers used for generational GC and transactional memory in the Bartok research compiler. With a 2048-entry filter, dynamic filtering eliminates almost all the overhead of the GC write-barriers. Dynamic filtering eliminates around half the overhead of STM over a non-synchronized baseline, even when used with an STM that is already designed for low overhead, and which employs static analyses to avoid redundant operations.
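The record-and-detect scheme in steps (i) and (ii) can be sketched as a small software model. This is an illustration under assumptions, not the instruction-set extension from the paper: the direct-mapped organization, the modulo indexing, and the names `CheckFilter` and `WriteBarrier` are hypothetical. The barrier modeled here logs an old-generation object the first time it is written, so repeating the check for the same object is redundant.

```python
class CheckFilter:
    """Fixed-size table of recently executed checks (assumed direct-mapped)."""

    def __init__(self, entries=2048):
        self.slots = [None] * entries

    def seen(self, key):
        """Return True if `key` was recorded recently; otherwise record it,
        evicting whatever shared its slot, and return False."""
        idx = key % len(self.slots)
        if self.slots[idx] == key:
            return True
        self.slots[idx] = key
        return False

    def clear(self):
        # Assumed to be needed whenever recorded checks stop being valid,
        # for example after a collection empties the remembered set.
        self.slots = [None] * len(self.slots)


class WriteBarrier:
    """Object-logging write barrier with a filtered fast path."""

    def __init__(self, check_filter, old_objects):
        self.filter = check_filter
        self.old_objects = old_objects   # addresses in the old generation
        self.remembered = set()          # old objects that have been written
        self.slow_checks = 0

    def on_store(self, obj_addr):
        if self.filter.seen(obj_addr):
            return                       # same check ran recently: skip it
        self.slow_checks += 1            # slow path: the real check
        if obj_addr in self.old_objects:
            self.remembered.add(obj_addr)
```

A filter hit is only safe to skip because the slow path is idempotent per object; an eviction merely causes the check to be repeated, never skipped incorrectly.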
Cell GC: using the cell synergistic processor as a garbage collection coprocessor
In VEE ’08: Proceedings of the fourth ACM SIGPLAN/SIGOPS international conference on Virtual execution environments, 2008
Cited by 2 (0 self)
In recent years, scaling of single-core superscalar processor performance has slowed due to complexity and power considerations. To improve program performance, designs are increasingly adopting chip multiprocessing with homogeneous or heterogeneous CMPs. By trading off features from a modern aggressive superscalar core, CMPs often offer better scaling characteristics in terms of aggregate performance, complexity and power, but often require additional software investment to rewrite, retune or recompile programs to take advantage of the new designs. The Cell Broadband Engine is a modern example of a heterogeneous CMP with coprocessors (accelerators) which can be found in supercomputers (Roadrunner), blade servers (IBM QS20/21), and video game consoles (SCEI PS3). A Cell BE processor has a host Power RISC processor (PPE) and eight Synergistic Processor Elements (SPE), each consisting of a Synergistic
Braids and Fibers: Language Constructs with Architectural Support for Adaptive Response to Memory Latencies
As processor speeds continue to increase at a much higher exponential rate than DRAM speeds, memory latencies will soon exceed
Synchronization Coherence: A Transparent Hardware Mechanism for Cache Coherence and Fine-Grained Synchronization
The quest to improve performance forces designers to explore finer-grained multiprocessor machines. Ever-increasing chip densities based on CMOS improvements fuel research in highly parallel chip multiprocessors with 100s of processing elements. With such increasing levels of parallelism, synchronization is set to become a major performance bottleneck and efficient support for synchronization an important design criterion. Previous research has shown that integrating support for fine-grained synchronization can have significant performance benefits compared to traditional coarse-grained synchronization. Not much progress has been made in supporting fine-grained synchronization transparently to processor nodes: a key reason perhaps why wide adoption has not followed. In this paper, we propose a novel approach called Synchronization Coherence that can provide transparent fine-grained synchronization and caching in a multiprocessor machine and single-chip multiprocessor. Our approach merges fine-grained synchronization mechanisms with traditional cache coherence protocols. It reduces network utilization as well as synchronization-related processing overheads while adding minimal hardware complexity as compared to cache coherence mechanisms or previously reported fine-grained synchronization techniques. In addition to its benefit of making synchronization transparent to processor nodes, for the applications studied, it provides up to 23% improvement in performance and up to 24% improvement in energy efficiency with no L2 caches compared to previous

