Cache-Conscious Data Placement
- In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, 1998
Cited by 131 (3 self)
As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction cache performance by mapping code with temporal locality to different cache blocks in the virtual address space, eliminating cache conflicts. These code placement techniques can be applied directly to the problem of placing data for improved data cache performance. In this paper we present a general framework for Cache Conscious Data Placement. This is a compiler-directed approach that creates an address placement for the stack (local variables), global variables, heap objects, and constants in order to reduce data cache misses. The placement of data objects is guided by a temporal relationship graph between objects generated via profiling. Our results show that profile-driven data placement reduces the data miss rate by 24% on average. 1 Introduction Much effort has b...
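The conflict-elimination idea in this abstract can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the paper's placement algorithm: it assumes a direct-mapped cache with 32-byte lines and 256 sets (8 KB), and the addresses and the names `cache_set` and `count_misses` are hypothetical. Two objects that are accessed alternately and map to the same set evict each other on every access; moving one of them by a single cache line removes the conflict.

```python
def cache_set(address, line_size=32, num_sets=256):
    """Return the set index of a byte address in a direct-mapped cache."""
    return (address // line_size) % num_sets


def count_misses(trace, line_size=32, num_sets=256):
    """Count misses for a list of byte addresses in a direct-mapped cache."""
    resident = {}  # set index -> block number currently held in that set
    misses = 0
    for address in trace:
        block = address // line_size
        index = block % num_sets
        if resident.get(index) != block:
            misses += 1
            resident[index] = block
    return misses


CACHE_BYTES = 32 * 256                 # assumed cache size: 8 KB
a = 0x10000                            # hypothetical address of object A
b_conflicting = a + CACHE_BYTES        # object B placed one cache size away: same set as A
b_moved = a + CACHE_BYTES + 32         # object B shifted by one line: neighbouring set

# A and B are accessed alternately, i.e. they are temporally related.
misses_before = count_misses([a, b_conflicting] * 100)
misses_after = count_misses([a, b_moved] * 100)
print(misses_before, misses_after)
```

With the conflicting placement every access misses; after the move only the two cold misses remain.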
Static Branch Frequency and Program Profile Analysis
- In 27th International Symposium on Microarchitecture, 1994
Cited by 62 (1 self)
Program profiles identify frequently executed portions of a program, which are the places at which optimizations offer programmers and compilers the greatest benefit. Compilers, however, infrequently exploit program profiles, because profiling a program requires a programmer to instrument and run the program. An attractive alternative is for the compiler to statically estimate program profiles. This paper presents several new techniques for static branch prediction and profiling. The first technique combines multiple predictions of a branch's outcome into a prediction of the probability that the branch is taken. Another technique uses these predictions to estimate the relative execution frequency (i.e., profile) of basic blocks and control-flow edges within a procedure. A third algorithm uses local frequency estimates to predict the global frequency of calls, procedure invocations, and basic block and control-flow edge executions. Experiments on the SPEC92 integer benchmarks and Uni...
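The first technique, combining several predictions into one taken-probability, can be sketched as follows. The abstract does not say which combination rule is used; this sketch assumes the Dempster-Shafer rule for two-outcome evidence, and the heuristic probabilities 0.8 and 0.7 are made-up illustrative values. The names `combine` and `branch_probability` are hypothetical.

```python
from functools import reduce


def combine(p1, p2):
    """Combine two independent estimates that a branch is taken.

    Assumed rule (Dempster-Shafer for two outcomes):
        p = p1*p2 / (p1*p2 + (1 - p1)*(1 - p2))
    """
    taken = p1 * p2
    total = taken + (1.0 - p1) * (1.0 - p2)
    if total == 0.0:
        raise ValueError("estimates are in total conflict (one is 0, the other 1)")
    return taken / total


def branch_probability(estimates):
    """Fold all estimates together, starting from 0.5 (no evidence either way)."""
    return reduce(combine, estimates, 0.5)


# Two hypothetical heuristics both lean towards "taken".
p = branch_probability([0.8, 0.7])
print(round(p, 4))
```

Starting from 0.5 is a convenient choice because 0.5 is neutral under this rule: combining it with any estimate returns that estimate unchanged, and agreeing heuristics reinforce each other.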
Optimizing Instruction Cache Performance for Operating System Intensive Workloads
- IEEE Transactions on Computers, 1995
Cited by 61 (11 self)
High instruction cache hit rates are key to high performance. One known technique to improve the hit rate of caches is to use an optimizing compiler to minimize cache interference via an improved layout of the code. This technique, however, has been applied to application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patterns than applications. Therefore, it is unknown how well existing optimizations perform for systems code and whether better optimizations can be found. We address this problem in this paper. This paper characterizes in detail the locality patterns of the operating system code and shows that there is substantial locality. Unfortunately, caches are not able to extract much of it: rarely-executed special-case code disrupts spatial locality, loops with few iterations that call routines make loop locality hard to exploit, and plenty of loop-less code hampers temporal locality. As a result, interference...
SPAID: Software Prefetching in Pointer- and Call-Intensive Environments
- In Proceedings of the 28th Annual International Symposium on Microarchitecture, 1995
Cited by 58 (3 self)
Software prefetching, typically in the context of numeric or loop-intensive benchmarks, has been proposed as one remedy for the performance bottleneck imposed on computer systems by the cost of servicing cache misses. This paper proposes a new heuristic--SPAID--for utilizing prefetch instructions in pointer- and call-intensive environments. We use trace-driven cache simulation of a number of pointer- and call-intensive benchmarks to evaluate the benefits and implementation trade-offs of SPAID. Our results indicate that a significant proportion of the cost of data cache misses can be eliminated or reduced with SPAID without unduly increasing memory traffic. 1. Introduction It is well known that processor clock speeds are increasing exponentially over time, while memory speeds are not increasing nearly as rapidly [RD94]. The computing industry has reached the point where system performance is dominated by the cost of servicing cache misses. To address this problem, several instruction s...
Instruction Cache Effects of Different Code Reordering Algorithms
- 1994
Cited by 4 (0 self)
While scientific programs written in procedural languages like C and Fortran tend to have good instruction cache behavior, improving instruction cache performance continues to be an important issue for commonly used applications, such as compilers and document previewers, and for applications written using object-oriented languages. This paper explores several code reordering algorithms that aim to improve instruction cache hit rates. We show the effects of different levels of aggressiveness in applying the standard depth-first algorithm and the effects of different enhancements to this standard. We also show that code reordering becomes more important as cache line size increases. 1 Introduction Modern processors rely on small on-chip caches to bridge the gap between the speed of the processor and the speed of memory. Typically, on-chip caches are split: one cache is for instructions and another one is for data. The instruction cache tends to have a better hit rate than the data cache be...
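A depth-first code reordering can be sketched as below. The abstract does not define its "standard depth-first algorithm", so this is an assumption: procedures are laid out in the order of a depth-first walk over a profiled call graph, visiting the most frequently called callee first so that hot caller/callee pairs end up adjacent in memory. The call graph, its call counts, and the name `depth_first_layout` are hypothetical.

```python
def depth_first_layout(call_graph, root):
    """Return a procedure layout order from a depth-first walk of the call graph.

    call_graph maps a procedure name to {callee name: profiled call count}.
    Callees are visited hottest first (ties broken by name).
    """
    order = []
    visited = set()

    def visit(proc):
        if proc in visited:
            return
        visited.add(proc)
        order.append(proc)
        callees = call_graph.get(proc, {})
        for callee in sorted(callees, key=lambda c: (-callees[c], c)):
            visit(callee)

    visit(root)
    return order


# Hypothetical profile: main calls parse often, and parse calls lex very often.
graph = {
    "main": {"parse": 900, "log": 50, "report": 10},
    "parse": {"lex": 5000, "log": 20},
    "report": {"log": 5},
    "lex": {},
    "log": {},
}
layout = depth_first_layout(graph, "main")
print(layout)
```

Procedures never reached from the root do not appear in the result; a real tool would have to append them somewhere, which this sketch leaves out.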
Hardware And Software Mechanisms For Reducing Load Latency
- 1996
Cited by 2 (1 self)
As processor demands quickly outpace memory, the performance of load instructions becomes an increasingly critical component to good system performance. This thesis contributes four novel load latency reduction techniques, each targeting a different component of load latency: address calculation, data cache access, address translation, and data cache misses. The contributed techniques are as follows:
• Fast Address Calculation employs a stateless set index predictor to allow address calculation to overlap with data cache access. The design eliminates the latency of address calculation for many loads.
• Zero-Cycle Loads combine fast address calculation with an early-issue mechanism to produce pipeline designs capable of hiding the latency of many loads that hit in the data cache.
• High-Bandwidth Address Translation develops address translation mechanisms with better latency and area characteristics than a multi-ported TLB. The new designs provide multiple-issue processors with ...

