Results 1 - 10
of
34
Automatic thread extraction with decoupled software pipelining
- In Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture
, 2005
"... {ottoni, ram, astoler, august}@princeton.edu Abstract Until recently, a steadily rising clock rate and otheruniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance fora wide range of applications. Current difficulties in maintaining this trend ..."
Abstract
-
Cited by 59 (10 self)
- Add to MetaCart
{ottoni, ram, astoler, august}@princeton.edu Abstract Until recently, a steadily rising clock rate and otheruniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance fora wide range of applications. Current difficulties in maintaining this trend have lead microprocessor manufacturersto add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have notsucceeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improve-ment for a large class of existing codes. To find useful work for chip multiprocessors, we proposean automatic approach to thread extraction, called Decoupled Software Pipelining (DSWP). DSWP exploits the fine-grained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Useof the non-speculative and truly decoupled threads produced by DSWP can increase execution efficiency and pro-vide significant latency tolerance, mitigating design complexity by reducing inter-core communication and per-coreresource requirements. Using our initial fully automatic compiler implementation and a validated processor model,we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety ofcodes. We then explore simple opportunities missed by our initial compiler implementation which suggest a promisingfuture for this approach. 1
The Advantages of Machine-Dependent Global Optimization
- In Proceedings of the 1994 Conference on Programming Languages and Systems Architectures
, 1994
"... machine designers have long recognized this dilemma. In 1972, Newey, Poole, and Waite [Newe79] observed that `Most problems will suggest a number of specialized operations which could possibly be implemented quite efficiently on certain hardware. The designer must balance the convenience and utility ..."
Abstract
-
Cited by 41 (17 self)
- Add to MetaCart
machine designers have long recognized this dilemma. In 1972, Newey, Poole, and Waite [Newe79] observed that `Most problems will suggest a number of specialized operations which could possibly be implemented quite efficiently on certain hardware. The designer must balance the convenience and utility of these operations against the increased difficulty of implementing an abstract machine with a rich and varied instruction set.' Fortunately, applying all code improvements to the LIL removes efficiency considerations as HIL design issue. The abstract machine need only contain a set of features roughly equivalent to the intersection of the operations included in typical target machines. The result is a small, simple abstract machine. Such abstract machines are termed `intersection' machines. The analogy between union/intersection abstract machines and CISC/RISC architectures is obvious. void daxpy(n, da, dx, incx, dy, incy) int n, incx, incy; double da, dx[], dy[]; { if (n <= 0) return; i...
Access Order and Memory-Conscious Cache Utilization
- In Proceedings of the First Annual Symposium on High Performance Computer Architecture
, 1995
"... As processor speeds increase relative to memory speeds, memory bandwidth is rapidly becoming the limiting performance factor for many applications. Several approaches to bridging this performance gap have been suggested. This paper examines one approach, access ordering, and pushes its limits to det ..."
Abstract
-
Cited by 39 (11 self)
- Add to MetaCart
As processor speeds increase relative to memory speeds, memory bandwidth is rapidly becoming the limiting performance factor for many applications. Several approaches to bridging this performance gap have been suggested. This paper examines one approach, access ordering, and pushes its limits to determine bounds on memory performance. We present several access-ordering schemes, and compare their performance, developing analytic models and partially validating these with benchmark timings on the Intel i860XR. 1. Introduction Processor speeds are increasing much faster than memory speeds, thus memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly scientific computations. Proposed solutions range from software prefetching [4, 16, 27] and iteration space tiling [5, 8, 9, 18, 32, 38], to address transformations [12, 13], unusual memory systems [3, 10, 33, 36], and prefetching or non-blocking caches [1, 6, 34]. Here we take one technique, ...
Access order and effective bandwidth for streams on a direct rambus memory
- Proc. HPCA-5
, 1999
"... Processor speeds are increasing rapidly, and memory speeds are not keeping up. Streaming computations (such as multi-media or scientific applications) are among those whose performance is most limited by the memory bottleneck. Rambus hopes to bridge the processor/memory performance gap with a recent ..."
Abstract
-
Cited by 33 (5 self)
- Add to MetaCart
Processor speeds are increasing rapidly, and memory speeds are not keeping up. Streaming computations (such as multi-media or scientific applications) are among those whose performance is most limited by the memory bottleneck. Rambus hopes to bridge the processor/memory performance gap with a recently introduced DRAM that can deliver up to 1.6Gbytes/sec. We analyze the performance of these interesting new memory devices on the inner loops of streaming computations, both for traditional memory controllers that treat all DRAM transactions as random cacheline accesses, and for controllers augmented with streaming hardware. For our benchmarks, we find that accessing unit-stride streams in cacheline bursts in the natural order of the computation exploits from 44-76 % of the peak bandwidth of a memory system composed of a single Direct RDRAM device, and that accessing streams via a streaming mechanism with a simple access ordering scheme can improve performance by factors of 1.18 to 2.25. 1.
Stream Scheduling
- in Proceedings of the 3rd Workshop on Media and Streaming Processors
, 2001
"... Media applications can be cast as a set of computation kernels operating on sequential data streams. This representation exposes the structure of an application to the hardware as well as to the compiler tools. Stream scheduling is a compiler technology that applies knowledge of the high-level struc ..."
Abstract
-
Cited by 29 (10 self)
- Add to MetaCart
Media applications can be cast as a set of computation kernels operating on sequential data streams. This representation exposes the structure of an application to the hardware as well as to the compiler tools. Stream scheduling is a compiler technology that applies knowledge of the high-level structure of a program in order to target stream hardware as well as to perform high-level optimizations on the application. The basic methods used by stream scheduling are discussed in this paper. In addition, results on the Imagine stream processor are presented. These show that stream programs executed using an automated stream scheduler require on average 98% of the execution time that hand-optimized assembly versions of the same programs require.
Bounds on Memory Bandwidth in Streamed Computations
- LECTURE NOTES IN COMPUTER SCIENCE 966: EUROPAR'95 PARALLEL PROCESSING
, 1995
"... The growing disparity between processor and memory speeds has caused memory bandwidth to become the performance bottleneck for many applications. In particular, this performance gap severely impacts stream-orientated computations such as (de)compression, encryption, text searching, and scientific ( ..."
Abstract
-
Cited by 24 (13 self)
- Add to MetaCart
The growing disparity between processor and memory speeds has caused memory bandwidth to become the performance bottleneck for many applications. In particular, this performance gap severely impacts stream-orientated computations such as (de)compression, encryption, text searching, and scientific (vector) processing. This paper looks at streaming computations and derives analytic upper bounds on the bandwidth attainable from a class of access reordering schemes. We compare these bounds to the simulated performance of a particular dynamic access ordering scheme, the Stream Memory Controller (SMC). We are building the SMC, and where possible we relate our analytic bounds and simulation data to the simulation performance of the hardware. The results suggest that the SMC can deliver nearly the full attainable bandwidth with relatively modest hardware costs.
Dynamic Access Ordering for Streamed Computations
, 2000
"... Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes traditional caching ..."
Abstract
-
Cited by 24 (3 self)
- Add to MetaCart
Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes traditional caching schemes effective, they have predictable access patterns. Since most modern DRAM components support modes that make it possible to perform some access sequences faster than others, the predictability of the stream accesses makes it possible to reorder them to get better memory performance. We describe a Stream Memory Controller (SMC) system that combines compile-time detection of streams with execution-time selection of the access order and issue. The SMC effectively prefetches read-streams, buffers write-streams, and reorders the accesses to exploit the existing memory bandwidth as much as possible. Unlike most other hardware prefetching or stream buffer designs, this system does no...
MISC: A Multiple Instruction Stream Computer
- In Proc of the 25th. Ann. Symp. on Microarchitecture
, 1992
"... This paper describes a single chip Multiple Instruction Stream Computer (MISC) capable of extracting instruction level parallelism from a broad spectrum of programs. The MISC architecture uses multiple asynchronous processing elements to separate a program into streams that can be executed in parall ..."
Abstract
-
Cited by 23 (3 self)
- Add to MetaCart
This paper describes a single chip Multiple Instruction Stream Computer (MISC) capable of extracting instruction level parallelism from a broad spectrum of programs. The MISC architecture uses multiple asynchronous processing elements to separate a program into streams that can be executed in parallel, and integrates a conflict-free message passing system into the lowest level of the processor design to facilitate low latency intraMISC communication. This approach allows for increased machine parallelism with minimal code expansion, and provides an alternative approach to single instruction stream multi-issue machines such as SuperScalar and VLIW. 1. The MISC Design The MISC processor, a direct descendant of the PIPE project [CGKP87, GHLP85] will exploit both the instruction and data parallelism available in a task by combining the capabilities of traditional data parallel architectures with those found in machines designed to exploit instruction level parallelism. Unlike the two pro...
Design and evaluation of dynamic access ordering hardware
- In Proc. International Conference on Supercomputing
, 1996
"... Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes caches effective, the ..."
Abstract
-
Cited by 19 (4 self)
- Add to MetaCart
Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes caches effective, they have predictable access patterns. Since most modern DRAM components support modes that make it possible to perform some access sequences faster than others, the predictability of the stream accesses makes it possible to reorder them to get better memory performance. We describe and evaluate a Stream Memory Controller system that combines compile-time detection of streams with execution-time selection of the access order and issue. The technique is practical to implement, using existing compiler technology and requiring only a modest amount of special-purpose hardware. With our prototype system, we have observed performance improvements by factors of 13 over normal caching. 1.
Adaptive And Integrated Data Cache Prefetching For Shared-Memory Multiprocessors
, 1995
"... ... yield a better overall scheme. We give a detailed description of the compiler analysis necessary for integrated prefetching. The performance of integrated prefetching is compared to software and hardware prefetching, and we show the effect of adapting the scheduling of prefetches at compile ti ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
... yield a better overall scheme. We give a detailed description of the compiler analysis necessary for integrated prefetching. The performance of integrated prefetching is compared to software and hardware prefetching, and we show the effect of adapting the scheduling of prefetches at compile time. Finally, we discuss approaches that combine integrated prefetching with the adaptive hardware prefetching technique.

