Results 1 -
5 of
5
Software trace cache
- Proceedings of the 13th Intl. Conference on Supercomputing
, 1999
"... Abstract—This paper explores the use of compiler optimizations which optimize the layout of instructions in memory. The target is to enable the code to make better use of the underlying hardware resources regardless of the specific details of the processor/ architecture in order to increase fetch pe ..."
Abstract
-
Cited by 36 (9 self)
- Add to MetaCart
Abstract—This paper explores the use of compiler optimizations which optimize the layout of instructions in memory. The target is to enable the code to make better use of the underlying hardware resources regardless of the specific details of the processor/ architecture in order to increase fetch performance. The Software Trace Cache (STC) is a code layout algorithm with a broader target than previous layout optimizations. We target not only an improvement in the instruction cache hit rate, but also an increase in the effective fetch width of the fetch engine. The STC algorithm organizes basic blocks into chains trying to make sequentially executed basic blocks reside in consecutive memory positions, then maps the basic block chains in memory to minimize conflict misses in the important sections of the program. We evaluate and analyze in detail the impact of the STC, and code layout optimizations in general, on the three main aspects of fetch performance: the instruction cache hit rate, the effective fetch width, and the branch prediction accuracy. Our results show that layout optimized codes have some special characteristics that make them more amenable for highperformance instruction fetch: They have a very high rate of not-taken branches and execute long chains of sequential instructions; also, they make very effective use of instruction cache lines, mapping only useful instructions which will execute close in time, increasing both spatial and temporal locality. Index Terms—Pipeline processors, instruction fetch, compiler optimizations, branch prediction, trace cache. 1
Code Layout Optimizations for Transaction Processing Workloads
- IN PROC. 28TH ANNUAL INT. SYMP. COMPUTER ARCHITECTURE
, 2001
"... Commercial applications such as databases and Web servers constitute the most important market segment for high-performance servers. Among these applications, on-line transaction processing (OLTP) workloads provide a challenging set of requirements for system designs since they often exhibit ineffic ..."
Abstract
-
Cited by 18 (4 self)
- Add to MetaCart
Commercial applications such as databases and Web servers constitute the most important market segment for high-performance servers. Among these applications, on-line transaction processing (OLTP) workloads provide a challenging set of requirements for system designs since they often exhibit inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates. A number of recent studies have characterized the behavior of commercial workloads and proposed architectural features to improve their performance. However, there has been little research on the impact of software and compiler-level optimizations for improving the behavior of such workloads. This paper provides a detailed study of profile-driven compiler optimizations to improve the code layout in commercial workloads with
Compiling for Instruction Cache Performance on a Multithreaded Architecture
, 2002
"... Instruction cache aware compilation seeks to lay out a program in memory in such a way that cache conflicts between procedures are minimized. It does this through profile-driven knowledge of procedure invocation patterns. On a multithreaded architecture, however, more conflicts may arise between thr ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
Instruction cache aware compilation seeks to lay out a program in memory in such a way that cache conflicts between procedures are minimized. It does this through profile-driven knowledge of procedure invocation patterns. On a multithreaded architecture, however, more conflicts may arise between threads than between procedures on the same thread. This research examines opportunities for the compiler to optimize instruction cache layout on a multithreaded architecture. We examine scenarios where (1) the compiler has knowledge about multiple programs that will be or are likely to be co-scheduled, and where (2) the compiler has no knowledge at compile time of which applications will be co-scheduled. We present solutions for both environments.
Optimization of instruction fetch for decision support workloads
- Proceedings of the Intl. Conference on Parallel Processing
, 1999
"... Instruction fetch bandwidth is feared to be a major limiting factor to the performance of future wide-issue aggressive superscalars. In this paper, we focus on Database applications running Decision Support workloads. We characterize the locality patterns of ia database kernel and find frequently ex ..."
Abstract
-
Cited by 4 (4 self)
- Add to MetaCart
Instruction fetch bandwidth is feared to be a major limiting factor to the performance of future wide-issue aggressive superscalars. In this paper, we focus on Database applications running Decision Support workloads. We characterize the locality patterns of ia database kernel and find frequently executed paths. Using this information, we propose an algorithm to lay out the basic blocks for improved I-fetch. Our results show a miss reduction of 60-98 % for realistic I-cache sizes and a doubling of the number of instructions executed between taken branches. As a consequence, we increase the fetch bandwith provided by an aggressive sequential fetch unit from 5.8 for the original code to 10.6 using our proposed layout. Our software scheme combines well with hardware schemes like a Trace Cache providing up to 12.1 instruction per cycle, suggesting that commercial workloads may be amenable to the aggressive I-fetch of future superscalars. 1
Code Reordering of Decision Support Systems for optimized Instruction Fetch
- Proc. of the Intl. Conference on Parallel Processing
, 1998
"... Instruction fetch bandwidth is feared to be a major limiting factor to the performance of future wide-issue aggressive superscalars. Consequently, it is crucial to develop techniques to increase the number of useful instructions per cycle provided to the processor. Unfortunately, most of the past wo ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Instruction fetch bandwidth is feared to be a major limiting factor to the performance of future wide-issue aggressive superscalars. Consequently, it is crucial to develop techniques to increase the number of useful instructions per cycle provided to the processor. Unfortunately, most of the past work in this area has largely focused on engineering workloads, rather than on the more challenging, badly-behaved popular commercial workloads. In this paper, we focus on Database applications running Decision Support workloads. We characterize the locality patterns of database kernel code and find frequently executed paths. Using this information, we propose an algorithm to lay out the basic blocks of the database kernel for improved I-fetch. Finally, we evaluate the scheme via simulations. Our results show a miss reduction of 60-98% for realistic I-cache sizes and a doubling of the number of instructions executed between taken branches. As a consequence we increase the fetch bandwith provid...

