Program Optimization for Instruction Caches
, 1989
Cited by 138 (2 self)
This paper presents an optimization algorithm for reducing instruction cache misses. The algorithm uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and full knowledge of the future. For best results, the cache should have a mechanism for excluding certain instructions designated by the compiler. This paper first presents a reduced form of the algorithm. This form is shown to produce an optimal miss rate for programs without conditionals and with a tree call graph, assuming basic blocks can be reordered at will. If conditionals are allowed, but there are no loops within conditionals, the algorithm does as well as an optimal cache for the worst case execution of the program consistent with the profile information. Next, the algorithm is extended with heuristics for general programs. The effectiveness of these heuristics is demonstrated with empirical results for a set of 10 programs for various cache sizes. The improvement depends on cache size. For a 512 word cache, miss rates for a direct-mapped instruction cache are halved. For an 8K word cache, miss rates fall by over 75%. Over a wide range of cache sizes the algorithm is as effective as increasing the cache size by a factor of 3. For 512 words, the algorithm generates only 32% more misses than an optimal cache. Optimized programs on a direct-mapped cache have lower miss rates than unoptimized programs on set-associative caches of the same size.
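The effect that code position has on a direct-mapped cache can be illustrated with a small simulation. The sketch below is not the paper's algorithm; it only assumes a hypothetical cache with one word per line, two hypothetical procedures `A` and `B` of four words each, and a trace in which they are called alternately. All names (`direct_mapped_misses`, `call_trace`, the layouts) are made up for illustration.

```python
def direct_mapped_misses(trace, num_lines):
    """Count misses for a direct-mapped cache with one word per line.

    Each address maps to line (addr % num_lines); a line holds one address.
    """
    lines = [None] * num_lines
    misses = 0
    for addr in trace:
        idx = addr % num_lines
        if lines[idx] != addr:
            misses += 1
            lines[idx] = addr
    return misses


def call_trace(layout, calls, sizes):
    """Expand a sequence of procedure calls into a word-address trace.

    layout maps a procedure name to its (assumed) base address,
    sizes maps a procedure name to its length in words.
    """
    trace = []
    for name in calls:
        base = layout[name]
        trace.extend(range(base, base + sizes[name]))
    return trace


sizes = {"A": 4, "B": 4}
calls = ["A", "B"] * 10          # A and B alternate 10 times each

# A and B map to the same four cache lines and evict each other on every call.
conflicting = call_trace({"A": 0, "B": 8}, calls, sizes)
# A and B map to disjoint cache lines, so only the first touches miss.
disjoint = call_trace({"A": 0, "B": 4}, calls, sizes)

print(direct_mapped_misses(conflicting, 8))  # 80
print(direct_mapped_misses(disjoint, 8))     # 8
```

With an 8-line cache, moving `B` by four words removes every conflict miss in this toy trace, which is the kind of gain that profile-guided repositioning aims for.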
Competitive Paging With Locality of Reference
- Journal of Computer and System Sciences
, 1991
Cited by 117 (3 self)
The Sleator-Tarjan competitive analysis of paging [Comm. of the ACM; 28:202-208, 1985] gives us the ability to make strong theoretical statements about the performance of paging algorithms without making probabilistic assumptions on the input. Nevertheless practitioners voice reservations about the model, citing its inability to discern between LRU and FIFO (algorithms whose performances differ markedly in practice), and the fact that the theoretical competitiveness of LRU is much larger than observed in practice. In addition, we would like to address the following important question: given some knowledge of a program's reference pattern, can we use it to improve paging performance on that program?
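To make the LRU versus FIFO distinction concrete, the following minimal sketch counts page faults for both policies on one reference string. It is only an illustration of the two eviction rules, not of the paper's analysis; the function names and the reference string are assumptions chosen for the example.

```python
from collections import OrderedDict, deque


def count_faults_lru(refs, k):
    """Page faults for LRU with k frames: evict the least recently used page."""
    cache = OrderedDict()            # keys ordered from least to most recently used
    faults = 0
    for page in refs:
        if page in cache:
            cache.move_to_end(page)  # a hit refreshes recency
        else:
            faults += 1
            if len(cache) == k:
                cache.popitem(last=False)
            cache[page] = True
    return faults


def count_faults_fifo(refs, k):
    """Page faults for FIFO with k frames: evict the page loaded earliest."""
    queue = deque()                  # load order, oldest on the left
    resident = set()
    faults = 0
    for page in refs:
        if page not in resident:     # hits do not change the queue
            faults += 1
            if len(resident) == k:
                resident.discard(queue.popleft())
            queue.append(page)
            resident.add(page)
    return faults


refs = [1, 2, 3, 1, 4, 1, 5, 1, 2]   # page 1 is reused often (locality)
print(count_faults_lru(refs, 3))     # 6
print(count_faults_fifo(refs, 3))    # 7
```

On this string LRU keeps the frequently reused page 1 resident, while FIFO evicts it once and pays an extra fault. Competitive analysis assigns both policies the same worst-case ratio, so it cannot express this difference.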
Reducing Branch Costs via Branch Alignment
- In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1994
Cited by 80 (13 self)
Several researchers have proposed algorithms for basic block reordering. We call these branch alignment algorithms. The primary emphasis of these algorithms has been on improving instruction cache locality, and the few studies concerned with branch prediction reported small or minimal improvements. As wide-issue architectures become increasingly popular the importance of reducing branch costs will increase, and branch alignment is one mechanism which can effectively reduce these costs. In this paper, we propose an improved branch alignment algorithm that takes into consideration the architectural cost model and the branch prediction architecture when performing the basic block reordering. We show that branch alignment algorithms can improve a broad range of static and dynamic branch prediction architectures. We also show that a program's performance can be improved by approximately 5% even when using recently proposed, highly accurate branch prediction architectures. The programs are compi...
SPAID: Software Prefetching in Pointer- and Call-Intensive Environments
- In Proceedings of the 28th annual international symposium on Microarchitecture
, 1995
Cited by 58 (3 self)
Software prefetching, typically in the context of numeric or loop-intensive benchmarks, has been proposed as one remedy for the performance bottleneck imposed on computer systems by the cost of servicing cache misses. This paper proposes a new heuristic--SPAID--for utilizing prefetch instructions in pointer- and call-intensive environments. We use trace-driven cache simulation of a number of pointer- and call-intensive benchmarks to evaluate the benefits and implementation trade-offs of SPAID. Our results indicate that a significant proportion of the cost of data cache misses can be eliminated or reduced with SPAID without unduly increasing memory traffic. 1. Introduction It is well known that processor clock speeds are increasing exponentially over time, while memory speeds are not increasing nearly as rapidly [RD94]. The computing industry has reached the point where system performance is dominated by the cost of servicing cache misses. To address this problem, several instruction s...
Procedure Placement Using Temporal-Ordering Information
- ACM Transactions on Programming Languages and Systems
, 1997
Code Layout Optimizations for Transaction Processing Workloads
- In Proc. 28th Annual Int. Symp. Computer Architecture
, 2001
Cited by 18 (4 self)
Commercial applications such as databases and Web servers constitute the most important market segment for high-performance servers. Among these applications, on-line transaction processing (OLTP) workloads provide a challenging set of requirements for system designs since they often exhibit inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates. A number of recent studies have characterized the behavior of commercial workloads and proposed architectural features to improve their performance. However, there has been little research on the impact of software and compiler-level optimizations for improving the behavior of such workloads. This paper provides a detailed study of profile-driven compiler optimizations to improve the code layout in commercial workloads with
A Study of Program Behavior to Establish Temporal Locality at the Function Level
, 2001
Cited by 7 (1 self)
The trend in computer architecture is that processor speeds are increasing rapidly compared to memory access times and the relatively stagnant disk speed. Computer software, on the other hand, is characterized by growing program sizes and sophisticated functionality. The combination of these factors has resulted in a processor-memory bottleneck, which is worsening with time. While program behavior has been studied at the page and cache levels, and locality at the page, cache, and block levels has been exploited, comparatively little work has exploited locality at the level of functions, and no prior work has studied program behavior at this level.
Code Placement using Temporal Profile Information
, 1998
Cited by 3 (0 self)
Instruction cache performance is important to instruction fetch efficiency and overall processor performance. The layout of an executable has a substantial effect on the cache miss rate and the instruction working set size during execution. This means that the performance of an executable can be improved significantly by applying a code-placement algorithm that minimizes instruction cache conflicts and improves spatial locality. We describe an algorithm for procedure placement, one type of code-placement algorithm, that significantly differs from previous approaches in the type of information used to drive the placement algorithm. In particular, we gather temporal ordering information that summarizes the interleaving of procedures in a program trace. Our algorithm uses this information along with cache configuration and procedure size information to better estimate the conflict cost of a potential procedure ordering. It optimizes the procedure placement for single- and multi-level caches. In addition to reducing instruction cache conflicts, the algorithm simultaneously minimizes the instruction working set size of the program. We compare the performance of our algorithm with a particularly successful procedure-placement algorithm and show noticeable improvements in the instruction cache behavior, while maintaining the same instruction working set size.
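One simple way to picture "temporal ordering information that summarizes the interleaving of procedures" is to count, for each pair of procedures, how often one runs between two consecutive executions of the other. The sketch below is a simplified illustration under that assumption and should not be read as the paper's actual data structure or cost estimate; `interleaving_weights` and the sample trace are hypothetical.

```python
from collections import Counter


def interleaving_weights(trace):
    """For each unordered pair {p, q}, count how many times q appears between
    two consecutive occurrences of p (or p between two occurrences of q).

    A high weight suggests that mapping p and q to the same cache lines
    would cause repeated conflict misses.
    """
    last_seen = {}                   # procedure -> index of its previous occurrence
    weights = Counter()
    for i, proc in enumerate(trace):
        if proc in last_seen:
            # Distinct procedures executed since proc last ran.
            between = set(trace[last_seen[proc] + 1 : i])
            for other in between:
                weights[frozenset((proc, other))] += 1
        last_seen[proc] = i
    return weights


trace = ["A", "B", "A", "C", "A", "B"]
weights = interleaving_weights(trace)
print(weights[frozenset(("A", "B"))])  # 2
print(weights[frozenset(("A", "C"))])  # 1
print(weights[frozenset(("B", "C"))])  # 1
```

A placement heuristic could then try to keep high-weight pairs such as `A` and `B` on non-overlapping cache lines; how such weights are combined with cache configuration and procedure sizes is not specified here.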
The Camino Compiler Infrastructure
- SIGARCH Comput. Archit. News
, 2005
Cited by 3 (3 self)
This paper introduces the Camino Compiler Infrastructure. Camino implements several types of profiling, including basic block counts, edge profiling, interprocedural path profiling, and a special technique that allows using a SimPoint-like methodology to do efficient and precise fine-grained power behavior characterization. It also supports a growing set of code placement optimizations such as branch alignment and pattern history table partitioning. In its current implementation, Camino works as a post-processor for the GNU Compiler Collection (GCC). The goal of Camino is to serve as a testbed for various low-level performance optimizations as well as power and energy optimizations. It currently supports the x86 instruction set.
Removing the Memory Limitations of Sensor Networks with Flash-Based Virtual Memory
- In Proceedings of the European Conference on Computer Systems (EuroSys)
, 2007
Cited by 3 (1 self)
Virtual memory has been successfully used in different domains to extend the amount of memory available to applications. We have adapted this mechanism to sensor networks, where, traditionally, RAM is a severely constrained resource. In this paper we show that the overhead of virtual memory can be significantly reduced with compile-time optimizations to make it usable in practice, even with the resource limitations present in sensor networks. Our approach, ViMem, creates an efficient memory layout based on variable access traces obtained from simulation tools. This layout is optimized to the memory access patterns of the application and to the specific properties of the sensor network hardware. Our implementation is based on TinyOS. It includes a pre-compiler for nesC code that translates virtual memory accesses into calls of ViMem’s runtime component. ViMem uses flash memory as secondary storage. In order to evaluate our system we have modified nontrivial existing applications to make use of virtual memory. We show that its runtime overhead is small even for large data sizes.

