Optimizing Instruction Cache Performance for Operating System Intensive Workloads
- IEEE Transactions on Computers
, 1995
Cited by 61 (11 self)
High instruction cache hit rates are key to high performance. One known technique to improve the hit rate of caches is to use an optimizing compiler to minimize cache interference via an improved layout of the code. This technique, however, has been applied to application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patterns than applications. Therefore, it is unknown how well existing optimizations perform for systems code and whether better optimizations can be found. We address this problem in this paper. This paper characterizes in detail the locality patterns of the operating system code and shows that there is substantial locality. Unfortunately, caches are not able to extract much of it: rarely-executed special-case code disrupts spatial locality, loops with few iterations that call routines make loop locality hard to exploit, and plenty of loop-less code hampers temporal locality. As a result, interference...
SPAID: Software Prefetching in Pointer- and Call-Intensive Environments
- In Proceedings of the 28th annual international symposium on Microarchitecture
, 1995
Cited by 58 (3 self)
Software prefetching, typically in the context of numeric or loop-intensive benchmarks, has been proposed as one remedy for the performance bottleneck imposed on computer systems by the cost of servicing cache misses. This paper proposes a new heuristic--SPAID--for utilizing prefetch instructions in pointer- and call-intensive environments. We use trace-driven cache simulation of a number of pointer- and call-intensive benchmarks to evaluate the benefits and implementation trade-offs of SPAID. Our results indicate that a significant proportion of the cost of data cache misses can be eliminated or reduced with SPAID without unduly increasing memory traffic.
1. Introduction
It is well known that processor clock speeds are increasing exponentially over time, while memory speeds are not increasing nearly as rapidly [RD94]. The computing industry has reached the point where system performance is dominated by the cost of servicing cache misses. To address this problem, several instruction s...
Analysis of techniques to improve protocol processing latency
- In Proceedings of the ACM SIGCOMM 1996 Conference
, 1996
Efficient Memory Simulation in SimICS
, 1995
Cited by 39 (6 self)
We describe novel techniques used for efficient simulation of memory in SimICS, an instruction level simulator developed at SICS. The design has focused on efficiently supporting the simulation of multiprocessors, analyzing complex memory hierarchies and running large binaries with a mixture of system-level and user-level code. A software caching
Protocol Latency: MIPS and Reality
, 1995
Cited by 22 (0 self)
This paper describes several techniques designed to improve protocol latency, and reports on their effectiveness when measured on a modern RISC processor---the DEC Alpha. We found that memory bandwidth---which has long been known to dominate network throughput---is also a key factor in protocol latency. The techniques are designed to increase the effectiveness of the instruction-cache and result in reduced processor stall rates.
Department of Computer Science, The University of Arizona, Tucson, AZ 85721
This work supported in part by ARPA Contract DABT63-91-C-0030, by Digital Equipment Corporation.
1 Introduction
Communication latency is often just as important as throughput in distributed systems [TL93], and for this reason, researchers have analyzed the latency characteristics of common network protocols like TCP/IP [KP93, CJRS89, Jac93]. These studies have shown that, despite the rich functionality offered by TCP/IP, the processing overheads are actually quite low. This paper rev...
Dynamic Optimization through the use of Automatic Runtime Specialization
, 1999
Cited by 14 (2 self)
Profile-driven optimizations and dynamic optimization through specialization have taken optimizations to a new level. By using actual runtime data, optimizers can generate code that is specially tuned for the task at hand. However, most existing compilers that perform these optimizations require separate test runs to gather profile information, and/or user annotations in the code. In this thesis, I describe runtime optimizations that a dynamic compiler can perform automatically --- without user annotations --- by utilizing realtime performance data. I describe the implementation of the dynamic optimizations in the framework of a Java Virtual Machine and give performance results.
Accurate Data Distribution Into Blocks May Boost Cache Performance
, 1996
Cited by 3 (0 self)
Applications often under-utilize cache space, generating unnecessarily high cache miss ratios. Data distribution is a software technique which could improve cache miss rates for any type of application. There is great potential to exploit: data distribution can reduce capacity misses as well as conflict misses.
Microarchitectural and Compile-Time Optimizations for Performance Improvement of Procedural and Object-Oriented Languages
- Northeastern University
, 2000
Cited by 2 (0 self)
Applications, and their associated programming models, have had a profound influence on computer architecture evolution. Programs developed in procedural languages (e.g., C and Fortran) have traditionally served this role. The popularity of the Object Oriented Programming (OOP) paradigm has been growing rapidly, especially through the use of languages such as C++ and Java. OOP languages support the concepts of data encapsulation, polymorphism and inheritance, which promise to increase code reuse and result in more reliable code. Applications developed in object oriented languages exhibit different execution behavior compared to their procedural language counterparts. We focus our work on two primary differences encountered as we move to applications developed in OO languages: i) the increased number of procedures and their higher calling frequencies, and ii) the increased use of indirect branches. Equipped with a set of C and C++ benchmark applications, we propose microarchitectural m...
The Effect of Program Optimization on Trace Cache Efficiency
- Proceedings of International Conference on Parallel
, 1999
Cited by 1 (0 self)
Trace cache, an instruction fetch technique that reduces taken branch penalties by storing and fetching program instructions in dynamic execution order, dramatically improves instruction fetch bandwidth. Similarly, program transformations like loop unrolling, procedure inlining, feedback-directed program restructuring, and profile-directed feedback can improve instruction fetch bandwidth by changing the static structure and ordering of a program's basic blocks. We examine the interaction of these compile-time and run-time techniques in the context of a high-quality production compiler that implements such transformations and a cycle-accurate simulation model of a wide issue superscalar processor. Not surprisingly, we find that the relative benefit of adding trace cache declines with increasing optimization level, and vice versa. Furthermore, we find that certain optimizations that improve performance on a processor model without trace cache can actually degrade performance on a processor...

