Results 1 - 9 of 9
Memory Access Coalescing: A Technique for Eliminating Redundant Memory Accesses
- ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1994
Cited by 36 (3 self)
As microprocessor speeds increase, memory bandwidth is increasingly the performance bottleneck for microprocessors. This has occurred because innovation and technological improvements in processor design have outpaced advances in memory design. Most attempts at addressing this problem have involved hardware solutions. Unfortunately, these solutions do little to help the situation with respect to current microprocessors. In previous work, we developed, implemented, and evaluated an algorithm that exploited the ability of newer machines with wide buses to load/store multiple floating-point operands in a single memory reference. This paper describes a general code improvement algorithm that transforms code to better exploit the available memory bandwidth on existing microprocessors as well as wide-bus machines. Where possible and advantageous, the algorithm coalesces narrow memory references into wide ones. An interesting characteristic of the algorithm is that some decisions about the ap...
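As a rough illustration of what coalescing narrow references into wide ones can look like, the sketch below sums a byte array twice: once with one 8-bit load per element, and once with a single 32-bit load standing in for four 8-bit loads. It is a hand-written example under stated assumptions, not the paper's algorithm; the function names `sum_bytes_narrow` and `sum_bytes_coalesced` and the choice of a 32-bit word are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline: one narrow (8-bit) memory reference per element. */
uint32_t sum_bytes_narrow(const uint8_t *a, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Coalesced sketch (assumed 32-bit word): one wide load replaces four
 * narrow loads; the bytes are then extracted in registers with shifts
 * and masks. Summation is order-independent, so byte order within the
 * word does not affect the result. */
uint32_t sum_bytes_coalesced(const uint8_t *a, size_t n) {
    uint32_t sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t w;
        memcpy(&w, a + i, sizeof w);   /* alignment-safe wide load */
        sum += (w & 0xFF) + ((w >> 8) & 0xFF)
             + ((w >> 16) & 0xFF) + (w >> 24);
    }
    for (; i < n; i++)                 /* leftover elements, narrow loads */
        sum += a[i];
    return sum;
}
```

Both functions return the same value for any input; whether the coalesced form is actually faster depends on the target machine, which is why the abstract says the transformation is applied only where possible and advantageous.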
Dynamic Access Ordering for Streamed Computations
- 2000
Cited by 24 (3 self)
Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes traditional caching schemes effective, they have predictable access patterns. Since most modern DRAM components support modes that make it possible to perform some access sequences faster than others, the predictability of the stream accesses makes it possible to reorder them to get better memory performance. We describe a Stream Memory Controller (SMC) system that combines compile-time detection of streams with execution-time selection of the access order and issue. The SMC effectively prefetches read-streams, buffers write-streams, and reorders the accesses to exploit the existing memory bandwidth as much as possible. Unlike most other hardware prefetching or stream buffer designs, this system does no...
Design and evaluation of dynamic access ordering hardware
- In Proc. International Conference on Supercomputing, 1996
Cited by 19 (4 self)
Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes caches effective, they have predictable access patterns. Since most modern DRAM components support modes that make it possible to perform some access sequences faster than others, the predictability of the stream accesses makes it possible to reorder them to get better memory performance. We describe and evaluate a Stream Memory Controller system that combines compile-time detection of streams with execution-time selection of the access order and issue. The technique is practical to implement, using existing compiler technology and requiring only a modest amount of special-purpose hardware. With our prototype system, we have observed performance improvements by factors of 13 over normal caching.
Improving Instruction-level Parallelism by Loop Unrolling and Dynamic Memory Disambiguation
- In Proceedings of the 28th Annual International Symposium on Microarchitecture, 1995
Cited by 13 (1 self)
Exploitation of instruction-level parallelism is an effective mechanism for improving the performance of modern superscalar/VLIW processors. Various software techniques can be applied to increase instruction-level parallelism. This paper describes and evaluates a software technique, dynamic memory disambiguation, that permits loops containing loads and stores to be scheduled more aggressively, thereby exposing more instruction-level parallelism. The results of our evaluation show that when dynamic memory disambiguation is applied in conjunction with loop unrolling, register renaming, and static memory disambiguation, the ILP of memory-intensive benchmarks can be increased by as much as 300 percent over loops where dynamic memory disambiguation is not performed. Our measurements also indicate that for the programs that benefit the most from these optimizations, the register usage does not exceed the number of registers on most high-performance processors. Keywords: loop unrolling, dyn...
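One way to picture dynamic memory disambiguation is as a run-time overlap check that selects between two versions of a loop. The sketch below is written at the source level for readability and assumes a simple element-wise loop; the names `ranges_overlap` and `add_one` are hypothetical, and the paper's own transformation operates inside a compiler rather than on C source.

```c
#include <stddef.h>
#include <stdint.h>

/* Run-time check: do the n-element ranges at dst and src overlap? */
static int ranges_overlap(const int *dst, const int *src, size_t n) {
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    uintptr_t bytes = (uintptr_t)(n * sizeof(int));
    return d < s + bytes && s < d + bytes;
}

/* dst[i] = src[i] + 1, with the loop version selected at run time. */
void add_one(int *dst, const int *src, size_t n) {
    size_t i = 0;
    if (!ranges_overlap(dst, src, n)) {
        /* No aliasing: all four loads may be issued before any store,
         * which gives a scheduler more freedom in the unrolled body. */
        for (; i + 4 <= n; i += 4) {
            int a = src[i], b = src[i + 1], c = src[i + 2], d = src[i + 3];
            dst[i] = a + 1;
            dst[i + 1] = b + 1;
            dst[i + 2] = c + 1;
            dst[i + 3] = d + 1;
        }
    }
    /* Safe version: original load/store order. It handles the aliased
     * case in full and the leftover iterations of the fast path. */
    for (; i < n; i++)
        dst[i] = src[i] + 1;
}
```

The check costs a few instructions per loop entry, so the approach pays off only when the aggressively scheduled body runs for enough iterations to amortize it.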
An Aggressive Approach to Loop Unrolling
- Proc. Compiler Construction '96, 1995
Cited by 9 (1 self)
A well-known code transformation for improving the execution performance of a program is loop unrolling. The most obvious benefit of unrolling a loop is that the transformed loop usually, but not always, requires fewer instruction executions than the original loop. The reduction in instruction executions comes from two sources: the number of branch instructions executed is reduced, and the index variable is modified fewer times. In addition, for architectures with features designed to exploit instruction-level parallelism, loop unrolling can expose greater levels of instruction-level parallelism. Loop unrolling is an effective code transformation, often improving the execution performance of programs that spend much of their execution time in loops by ten to thirty percent. Possibly because of the effectiveness of a simple application of loop unrolling, it has not been studied as extensively as other code improvements such as register allocation or common subexpression elimination. The r...
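The two sources of savings named above (fewer branches, fewer index updates) can be seen in a small hand-unrolled loop. This is a generic sketch with an assumed unroll factor of 4 and hypothetical function names, not the aggressive unrolling scheme the paper goes on to describe.

```c
#include <stddef.h>

/* Rolled loop: one branch and one index update per element. */
long sum_rolled(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by an assumed factor of 4: one branch and one index update
 * per four elements, followed by a cleanup loop for the remainder. */
long sum_unrolled4(const int *a, size_t n) {
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += (long)a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

The cleanup loop is what keeps the transformation correct when the trip count is not a multiple of the unroll factor; it is also one reason an unrolled loop does not always execute fewer instructions than the original.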
Target-specific Global Code Improvement: Principles and Applications
- 1994
Cited by 8 (0 self)
This article describes the key principles behind the design and implementation of a global code improver that has been used to construct several high-quality compilers and other program transformation and analysis tools. The code improver, called vpo, employs a paradigm of compilation that has proven to be flexible and adaptable: all code improving transformations are performed on a target-specific representation of the program. The aggressive use of this paradigm yields a code improver with several valuable properties. Four properties stand out. First, vpo is language and compiler independent. That is, it has been used to implement compilers for several different computer languages. For the C programming language, it has been used with several front ends, each of which generates a different intermediate language. Second, because all code improvements are applied to a single low-level intermediate representation, phase ordering problems are minimized. Third, vpo is easily retargeted and handles a wide variety of architectures. In particular, vpo's structure allows new architectures and new implementations of existing architectures to be accommodated quickly and easily. Fourth and finally, because of its flexible structure, vpo has several other interesting uses in addition to its primary use in an optimizing compiler. This article describes the principles that have driven the design of vpo and the implications of these principles on vpo's implementation. The article concludes with a brief description of vpo's use as a back end with front ends for several different languages, and its use as a key component
Compiling for Efficient Memory Utilization
- 1996
Cited by 1 (0 self)
this paper is thus to try to call attention to this work. 2. Access Ordering
Data-Specific Optimizations
- 1996
Optimizing compilers capable of producing efficient code play an important role in improving the performance of computer systems. A comprehensive suite of code optimizations applied to unoptimized code can often yield a significant reduction in execution time. However, the ability of a compiler to apply code optimizations is often limited due to the unavailability or inaccuracy of compile-time information about the operands in a source program. In the absence of the requisite information, the compiler forgoes an opportunity to apply an optimization so that the semantics of the program are not jeopardized. This thesis proposes and analyses a set of code optimization techniques called data-specific optimizations, which use information available at program execution time. These techniques are practical to implement and provide a substantial performance increase without any additional hardware. This thesis describes data-specific optimizations related to three areas: 1) optimization to minimize control dependences (loop unrolling), 2) optimization to exploit memory bandwidth (memory-access coalescing), and 3) optimization to exploit instruction-level parallelism. The thesis shows that aggressive loop unrolling can improve the performance of programs by approximately 10 percent. It also shows that when loop unrolling is used to facilitate memory-access coalescing, the memory loads and stores in a program can be reduced by as much as 225%. Also, when loop unrolling is applied in conjunction with dynamic memory disambiguation, the instruction-level parallelism in loops can be increased by as much as 300 percent.
Semantics Commentary
primary-expression syntax 6.5.1 Primary expressions primary-expression: identifier constant string-literal ( expression ) 975 compound expression

