Results 1 - 10 of 27
Dead-block prediction and dead-block correlating prefetchers
- In Proceedings of the 28th International Symposium on Computer Architecture, 2001
"... laia @ ecn.purdue, edu Effective data prefetching requires accurate mechanisms to predict both "which " cache blocks to prefetch and "when " to prefetch them. This paper proposes the Dead-Block Predictors (DBPs), trace-based predictors that accu-rately identify &a ..."
Abstract - Cited by 102 (9 self)
Effective data prefetching requires accurate mechanisms to predict both "which" cache blocks to prefetch and "when" to prefetch them. This paper proposes the Dead-Block Predictors (DBPs), trace-based predictors that accurately identify "when" an L1 data cache block becomes evictable, or "dead". Predicting a dead block significantly enhances prefetching lookahead and opportunity, and enables placing data directly into L1, obviating the need for auxiliary prefetch buffers. This paper also proposes Dead-Block Correlating Prefetchers (DBCPs), which use address correlation to predict "which" subsequent block to prefetch when a block becomes evictable. A DBCP enables effective data prefetching in a wide spectrum of pointer-intensive, integer, and floating-point applications. We use cycle-accurate simulation of an out-of-order superscalar processor and memory-intensive benchmarks to show that: (1) dead-block prediction enhances prefetching lookahead by at least an order of magnitude compared to previous techniques, (2) a DBP can predict dead blocks with an average coverage of 90% while mispredicting only 4% of the time, (3) a DBCP offers an address prediction coverage of 86% while mispredicting only 3% of the time, and (4) DBCPs improve performance by 62% on average and 282% at best in the benchmarks we studied.
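To make the mechanism concrete, here is a minimal C++ sketch of trace-based dead-block prediction in the spirit of the abstract. The PC-folding hash, the table organization, the 2-bit confidence counters, and names such as DeadBlockPredictor are illustrative assumptions, not the paper's exact design.

#include <cstdint>
#include <unordered_map>

struct Block {
    uint64_t tag = 0;
    uint32_t trace = 0;   // running hash of the PCs that touched this block
    bool valid = false;
};

class DeadBlockPredictor {
    // trace signature -> saturating confidence that the block is now dead
    std::unordered_map<uint32_t, uint8_t> table_;
public:
    // Fold each referencing PC into the block's trace signature.
    static uint32_t update_trace(uint32_t trace, uint64_t pc) {
        return (trace << 3) ^ static_cast<uint32_t>(pc >> 2);
    }
    // Called on every access: does the current trace match one that
    // historically ended the block's live time?
    bool predict_dead(uint32_t trace) const {
        auto it = table_.find(trace);
        return it != table_.end() && it->second >= 2;
    }
    // Called on eviction: the trace seen at eviction is a "death" trace.
    void train_on_evict(uint32_t trace) {
        uint8_t& c = table_[trace];
        if (c < 3) ++c;
    }
    // Called when a predicted-dead block is referenced again: misprediction.
    void train_on_reuse(uint32_t trace) {
        auto it = table_.find(trace);
        if (it != table_.end() && it->second > 0) --it->second;
    }
};

A dead prediction is what gives the prefetcher its lookahead: the block's frame can be refilled as soon as the death trace completes, rather than at the next miss.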
Guided Region Prefetching: A Cooperative Hardware/Software Approach
- In Proceedings of the 30th International Symposium on Computer Architecture, 2003
"... Despite large caches, main-memory access latencies still cause significant performance losses in many applications. Numerous hardware and software prefetching schemes have been proposed to tolerate these latencies. Software prefetching typically provides better prefetch accuracy than hardware, but i ..."
Abstract - Cited by 65 (9 self)
Despite large caches, main-memory access latencies still cause significant performance losses in many applications. Numerous hardware and software prefetching schemes have been proposed to tolerate these latencies. Software prefetching typically provides better prefetch accuracy than hardware, but is limited by prefetch instruction overheads and the compiler's limited ability to schedule prefetches sufficiently far in advance to cover level-two cache miss latencies. Hardware prefetching can be effective at hiding these large latencies, but generates many useless prefetches and consumes considerable memory bandwidth. In this paper, we propose a cooperative hardware-software prefetching scheme called Guided Region Prefetching (GRP), which uses compiler-generated hints encoded in load instructions to regulate an aggressive hardware prefetching engine. We compare GRP against a sophisticated pure hardware stride prefetcher and a scheduled region prefetching (SRP) engine. SRP and GRP show the best performance, with respective gains of 22% and 21% over no prefetching, but SRP incurs 180% extra memory traffic, nearly tripling bandwidth requirements. GRP achieves performance close to SRP with a mere eighth of the extra prefetching traffic, a 23% increase over no prefetching. The GRP hardware-software collaboration thus combines the accuracy of compiler-based program analysis with the performance potential of aggressive hardware prefetching, bringing the performance gap versus a perfect L2 cache under 20%.
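A hedged sketch of the hint-gating idea follows. The hint encoding, the 512-byte region, and the function names are assumptions made for illustration; in the paper the hints travel inside load instructions, which a simulator would decode before invoking something like this.

#include <cstdint>
#include <vector>

enum class Hint : uint8_t { None, Spatial, Pointer };  // assumed encoding

constexpr uint64_t kLine = 64;
constexpr int kRegionLines = 8;  // assumed region size (512 bytes)

// On a miss, prefetch the rest of the region only if the compiler marked
// this load as spatially regular; unhinted loads generate no extra traffic,
// which is how the hints regulate the aggressive hardware engine.
std::vector<uint64_t> region_prefetches(uint64_t miss_addr, Hint h) {
    std::vector<uint64_t> out;
    if (h != Hint::Spatial) return out;
    uint64_t base = miss_addr & ~(kLine * kRegionLines - 1);
    for (int i = 0; i < kRegionLines; ++i) {
        uint64_t a = base + i * kLine;
        if (a != (miss_addr & ~(kLine - 1))) out.push_back(a);  // skip demand line
    }
    return out;
}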
Timekeeping in the Memory System: Predicting and Optimizing Memory Behavior, 2002
"... Techniques for analyzing and improving memory referencing behavior continue to be important for achieving good overall program performance due to the ever-increasing performance gap between processors and main memory. This paper offers a fresh perspective on the problem of predicting and optimizing ..."
Abstract - Cited by 55 (4 self)
Techniques for analyzing and improving memory referencing behavior continue to be important for achieving good overall program performance due to the ever-increasing performance gap between processors and main memory. This paper offers a fresh perspective on the problem of predicting and optimizing memory behavior. Namely, we show quantitatively the extent to which detailed timing characteristics of past memory reference events are strongly predictive of future program reference behavior. We propose a family of timekeeping techniques that optimize behavior based on observations about particular cache time durations, such as the cache access interval or the cache dead time. Timekeeping techniques can be used to build small, simple, and high-accuracy (often 90% or more) predictors for identifying conflict misses, for predicting dead blocks, and even for estimating the time at which the next reference to a cache frame will occur and the address that will be accessed. Based on these predictors, we demonstrate two new and complementary time-based hardware structures: (1) a time-based victim cache that improves performance by only storing conflict miss lines with likely reuse, and (2) a time-based prefetching technique that homes in on the right address to prefetch, and the right time to schedule the prefetch. Our victim cache technique improves performance over previous proposals by better selections of what to place in the victim cache. Our prefetching technique outperforms similar prior hardware prefetching proposals, despite being orders of magnitude smaller. Overall, these techniques improve performance by more than 11% across the SPEC2000 benchmark suite.
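The dead-time idea can be approximated with per-frame timestamps, as in this minimal sketch: a block is predicted dead once its current idle time exceeds the live time observed over its previous generation. The 2x margin and the field names are illustrative assumptions, not the paper's tuned thresholds.

#include <cstdint>

struct FrameTiming {
    uint64_t last_access = 0;   // cycle of the most recent access
    uint64_t fill_time = 0;     // cycle this block was brought in
    uint64_t prev_live = 0;     // live time of the previous generation
};

bool predicted_dead(const FrameTiming& f, uint64_t now) {
    uint64_t idle = now - f.last_access;
    return f.prev_live != 0 && idle > 2 * f.prev_live;  // assumed margin
}

void on_access(FrameTiming& f, uint64_t now) { f.last_access = now; }

void on_refill(FrameTiming& f, uint64_t now) {
    f.prev_live = f.last_access - f.fill_time;  // live time that just ended
    f.fill_time = f.last_access = now;
}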
TCP: Tag Correlating Prefetchers, 2003
"... Although caches for decades have been the backbone of the memory system, the speed gap between CPU and main memory suggests their augmentation with prefetching mechanisms. Recently, sophisticated hardware correlating prefetching mechanisms have been proposed, in some cases coupled with some form of ..."
Abstract - Cited by 26 (0 self)
Although caches have been the backbone of the memory system for decades, the speed gap between CPU and main memory suggests augmenting them with prefetching mechanisms. Recently, sophisticated hardware correlating prefetching mechanisms have been proposed, in some cases coupled with some form of dead-block prediction. In many proposals, however, correlating prefetchers demand a significant investment in hardware. In this paper we show that correlating prefetchers that work with tags instead of cache-line addresses are significantly more resource-efficient, providing equal or better performance than previous proposals. We support this claim by showing that per-set tag sequences exhibit highly repetitive patterns both within a set and across different sets. Because a single tag sequence can capture multiple address sequences spread over different cache sets, significant space savings can be achieved. We propose a tag-based prefetcher called a Tag Correlating Prefetcher (TCP). Even with very small history tables, TCP outperforms address-based correlating prefetchers many times larger. In addition, we show that such a prefetcher can yield most of its performance benefits if placed at the L2 level of an aggressive out-of-order processor. Only if one wants prefetching all the way up to L1 is dead-block prediction required. Finally, we draw parallels between the two-level structure of TCP and similar structures for branch prediction mechanisms; these parallels raise interesting opportunities for improving correlating memory prefetchers by harnessing lessons already learned for correlating branch predictors.
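The space saving comes from correlating on tags, which recur across sets. A minimal C++ sketch under stated assumptions: a history of the last two tags per set indexes one shared table of predicted next tags; the table sizing and hash are illustrative.

#include <cstdint>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

class TagCorrelatingPrefetcher {
    std::vector<std::pair<uint64_t, uint64_t>> hist_;  // per-set last two tags
    std::unordered_map<uint64_t, uint64_t> table_;     // tag pair -> next tag
    static uint64_t key(uint64_t a, uint64_t b) { return a * 1000003 + b; }
public:
    explicit TagCorrelatingPrefetcher(size_t sets) : hist_(sets) {}
    // On a miss in `set` with tag `t`: train on the old pair, then predict
    // from the new pair. One learned tag pattern serves every set it recurs
    // in, which is where the savings over address correlation come from.
    std::optional<uint64_t> miss(size_t set, uint64_t t) {
        auto& [t2, t1] = hist_[set];
        table_[key(t2, t1)] = t;        // learn: (t2, t1) -> t
        t2 = t1; t1 = t;
        auto it = table_.find(key(t2, t1));
        if (it == table_.end()) return std::nullopt;
        return it->second;              // prefetch {set, predicted tag}
    }
};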
Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems, 2008
"... Linked data structure (LDS) accesses are critical to the performance of many large scale applications. Techniques have been proposed to prefetch such accesses. Unfortunately, many LDS prefetching techniques 1) generate a large number of useless prefetches, thereby degrading performance and bandwidth ..."
Abstract - Cited by 22 (4 self)
Linked data structure (LDS) accesses are critical to the performance of many large-scale applications. Techniques have been proposed to prefetch such accesses. Unfortunately, many LDS prefetching techniques 1) generate a large number of useless prefetches, thereby degrading performance and bandwidth efficiency, 2) require significant hardware or storage cost, or 3) when employed together with stream-based prefetchers, cause significant resource contention in the memory system. As a result, existing processors do not employ LDS prefetchers even though they commonly employ stream-based prefetchers. This paper proposes a low-cost hardware/software cooperative technique that enables bandwidth-efficient prefetching of linked data structures. Our solution has two new components: 1) a compiler-guided prefetch filtering mechanism that informs the hardware about which pointer addresses to prefetch, and 2) a coordinated prefetcher throttling mechanism that uses run-time feedback to manage the interference between multiple prefetchers (LDS and stream-based) in a hybrid prefetching system. Evaluations show that the proposed solution improves average performance by 22.5% while decreasing memory bandwidth consumption by 25% over a baseline system that employs an effective stream prefetcher on a set of memory- and pointer-intensive applications. We compare our proposal to three different LDS/correlation prefetching techniques and find that it provides significantly better performance on both single-core and multi-core systems, while requiring less hardware cost.
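A hedged sketch of the two components follows: a compiler-set flag filters which pointer loads may issue LDS prefetches, and simple accuracy feedback throttles the LDS engine when it pollutes. The interval structure, thresholds, and degree range are illustrative assumptions, not the paper's calibrated values.

#include <cstdint>

struct LdsThrottle {
    uint32_t issued = 0, useful = 0;
    int degree = 2;                       // current LDS prefetch degree
    void interval_end() {                 // assumed to run every N cycles
        if (issued == 0) return;
        double acc = double(useful) / issued;
        if (acc < 0.4 && degree > 0) --degree;        // back off when inaccurate
        else if (acc > 0.75 && degree < 4) ++degree;  // ramp up when accurate
        issued = useful = 0;
    }
};

// compiler_hint is assumed to be set only for loads the compiler identified
// as feeding a linked-data-structure traversal; everything else is filtered.
bool may_prefetch_lds(bool compiler_hint, const LdsThrottle& t) {
    return compiler_hint && t.degree > 0;
}

A symmetric throttle for the stream prefetcher, fed by the same interval feedback, would complete the coordination between the two engines.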
Compiler Orchestrated Prefetching via Speculation and Predication
- In ASPLOS-XI: Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, 2004
"... This paper introduces a compiler-orchestrated prefetching system as a unified framework geared toward ameliorating the gap between processing speeds and memory access latencies. We focus the scope of the optimization on specific subsets of the program dependence graph that succinctly characterize th ..."
Abstract - Cited by 20 (3 self)
This paper introduces a compiler-orchestrated prefetching system as a unified framework geared toward ameliorating the gap between processing speeds and memory access latencies. We focus the scope of the optimization on specific subsets of the program dependence graph that succinctly characterize the memory access pattern of both regular array-based applications and irregular pointer-intensive programs. We illustrate how program-embedded precomputation via speculative execution can accurately predict and effectively prefetch future memory references with negligible overhead. The proposed techniques reduce the total running time of seven SPEC benchmarks and two OLDEN benchmarks by 27% on an Itanium 2 processor. The improvements are in addition to several state-of-the-art optimizations, including software pipelining and data prefetching. In addition, we use cycle-accurate simulations to identify important and lightweight architectural innovations that further mitigate the memory system bottleneck. In particular, we focus on the notoriously challenging class of pointer-chasing applications, and demonstrate how they may benefit from a novel scheme of sentineled prefetching. Our results for twelve SPEC benchmarks demonstrate that 45% of the processor stalls caused by the memory system are avoidable. The techniques in this paper can effectively mask long memory latencies with little instruction overhead, and can readily contribute to the performance of processors today.
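For a pointer chase, program-embedded precomputation amounts to running a small slice of the loop ahead of the main computation. A minimal C++ sketch, with the lookahead depth of 4 as an illustrative assumption; the null checks play the role of predication guarding the speculative loads, and __builtin_prefetch is the GCC/Clang intrinsic standing in for the compiler-inserted prefetch.

struct Node { int payload; Node* next; };

int sum_list(Node* head) {
    int total = 0;
    for (Node* p = head; p != nullptr; p = p->next) {
        // Speculative run-ahead slice: chase a few links past p and
        // prefetch the node found there. The null tests predicate the
        // speculative dereferences so the slice can never fault.
        Node* q = p;
        for (int i = 0; i < 4 && q != nullptr; ++i) q = q->next;
        if (q != nullptr) __builtin_prefetch(q);
        total += p->payload;
    }
    return total;
}

A real compiler transformation would amortize the slice across iterations rather than re-chasing four links each time; the redundant inner loop is kept here only to make the speculative slice explicit.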
Branch History Guided Instruction Prefetching
- In Proceedings of the Seventh International Conference on High Performance Computer Architecture (HPCA), 2001
"... Instruction cache misses stall the fetch stage of the processor pipeline and hence affect instruction supply to the processor. Instruction prefetching has been proposed as a mechanism to reduce instruction cache (I-cache) misses. However, a prefetch is effective only if accurate and initiated suffic ..."
Abstract - Cited by 14 (1 self)
Instruction cache misses stall the fetch stage of the processor pipeline and hence affect instruction supply to the processor. Instruction prefetching has been proposed as a mechanism to reduce instruction cache (I-cache) misses. However, a prefetch is effective only if it is accurate and initiated sufficiently early to cover the miss penalty. This paper presents a new hardware-based instruction prefetching mechanism, Branch History Guided Prefetching (BHGP), to improve the timeliness of instruction prefetches. BHGP correlates the execution of a branch instruction with I-cache misses and uses branch instructions to trigger prefetches of instructions that occur (N - 1) branches later in the program execution, for a given N > 1. Evaluations on commercial applications, Windows NT applications, and some CPU2000 applications show an average reduction of 66% in miss rate over all applications. BHGP improved the IPC by 12 to 14% for the CPU2000 applications studied; on average 80% of the BHGP prefetches...
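A minimal C++ sketch of the correlation: a FIFO of the last N - 1 retired branch PCs associates each I-cache miss with the branch seen N - 1 branches earlier, and hitting that branch again triggers the prefetch. N = 4 and the single-address table entries are illustrative assumptions.

#include <cstdint>
#include <deque>
#include <optional>
#include <unordered_map>

class BHGP {
    static constexpr size_t kHist = 3;              // N - 1 with assumed N = 4
    std::deque<uint64_t> recent_branches_;          // last N - 1 branch PCs
    std::unordered_map<uint64_t, uint64_t> table_;  // branch PC -> miss line
public:
    // On a retired branch: look up a prefetch trigger and record history.
    std::optional<uint64_t> on_branch(uint64_t branch_pc) {
        recent_branches_.push_back(branch_pc);
        if (recent_branches_.size() > kHist) recent_branches_.pop_front();
        auto it = table_.find(branch_pc);
        if (it == table_.end()) return std::nullopt;
        return it->second;  // I-cache line to prefetch now
    }
    // On an I-cache miss: correlate it with the branch N - 1 branches back,
    // so the prefetch fires that much earlier on the next occurrence.
    void on_icache_miss(uint64_t miss_line) {
        if (recent_branches_.size() == kHist)
            table_[recent_branches_.front()] = miss_line;
    }
};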
Temporal memory streaming, 2007
"... While device scaling has led to continued processor performance improvement, scaling trends in DRAM technology have favored improving density over access latency. As a result, processors in modern servers spend much of execution time stalled on long-latency memory accesses. The conventional approach ..."
Abstract - Cited by 7 (1 self)
While device scaling has led to continued processor performance improvement, scaling trends in DRAM technology have favored improving density over access latency. As a result, processors in modern servers spend much of execution time stalled on long-latency memory accesses. The conventional approach to latency tolerance, enlarging the on-chip cache hierarchy as transistor budgets scale, is providing diminishing returns because today's multi-megabyte caches already capture available locality. Commercial server applications present a particular challenge for memory system design because current prefetching/streaming approaches are often ineffective on the irregular data structures and dependent miss chains characteristic of these applications. To further improve server performance, architects must design mechanisms that issue memory requests earlier and with greater parallelism in the face of complex access patterns. Despite their complexity, commercial applications nonetheless execute repetitive code sequences, which give rise to recurring data structure traversals. As a result, memory addresses are temporally correlated: addresses accessed near one another in time often recur together. By recording temporally-correlated cache miss addresses and using the recorded information to predict ...
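The abstract is truncated on this page, but the recording-and-replay idea it describes can be sketched as follows, under stated assumptions: a global log of miss addresses, an index from each address to its most recent log position, and a fixed replay depth of 4; all sizes are illustrative, and a hardware design would bound the log.

#include <cstdint>
#include <unordered_map>
#include <vector>

class TemporalStreamer {
    std::vector<uint64_t> log_;                   // miss-address history
    std::unordered_map<uint64_t, size_t> index_;  // addr -> last log position
public:
    std::vector<uint64_t> on_miss(uint64_t addr) {
        std::vector<uint64_t> stream;
        auto it = index_.find(addr);
        if (it != index_.end()) {
            // Replay what followed this address the last time it missed.
            for (size_t i = it->second + 1;
                 i < log_.size() && stream.size() < 4; ++i)
                stream.push_back(log_[i]);
        }
        index_[addr] = log_.size();
        log_.push_back(addr);
        return stream;  // candidate prefetch addresses, issued in order
    }
};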
Dynamic memory optimization using pool allocation and prefetching
- SIGARCH Comput. Archit. News
"... Heap memory allocation plays an important role in modern applications. Conventional heap allocators, however, generally ignore the underlying memory hierarchy of the system, favoring instead a low runtime overhead and fast response times. Unfortunately, with little concern for the memory hierarchy, ..."
Abstract - Cited by 5 (0 self)
Heap memory allocation plays an important role in modern applications. Conventional heap allocators, however, generally ignore the underlying memory hierarchy of the system, favoring instead a low runtime overhead and fast response times. Unfortunately, with little concern for the memory hierarchy, the data layout may exhibit poor spatial locality and degrade cache performance. In this paper, we describe a dynamic heap allocation scheme called pool allocation. The strategy aims to improve cache performance by inspecting memory allocation requests and allocating memory from appropriate heap pools as dictated by the requesting context. The advantages are twofold. First, by pooling together data with a common context, we expect to improve spatial locality, as data fetched to the caches will contain fewer items from different contexts. If the allocation patterns are closely matched to the traversal patterns, the end result is faster memory performance. Second, by pooling heap objects, we expect access patterns to exhibit more regularity, thus creating more opportunities for data prefetching. Our dynamic memory optimizer exploits the increased regularity to insert prefetch instructions at runtime. The optimizations are implemented in DynamoRIO, a dynamic optimization framework. We evaluate the work using various benchmarks, and measure a 17% speedup over gcc -O3 on an Athlon MP, and a 13% speedup on a Pentium 4.
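A minimal C++ sketch of context-keyed pooling: requests carry a context id (here the allocation call site, which a dynamic optimizer such as DynamoRIO would recover at run time), and same-context objects are carved out of contiguous chunks so traversals touch fewer unrelated cache lines. The chunk size and 16-byte alignment are illustrative assumptions, and this sketch never recycles individual objects.

#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

class PoolAllocator {
    struct Pool {
        std::vector<void*> chunks;  // owned chunks, freed at teardown
        char* cur = nullptr;
        size_t left = 0;
    };
    std::unordered_map<void*, Pool> pools_;  // context id -> pool
    static constexpr size_t kChunk = 1 << 16;
public:
    void* allocate(void* ctx, size_t n) {
        n = (n + 15) & ~static_cast<size_t>(15);  // keep 16-byte alignment
        Pool& p = pools_[ctx];
        if (n > p.left) {                         // start a new contiguous chunk
            size_t sz = n > kChunk ? n : kChunk;
            p.cur = static_cast<char*>(std::malloc(sz));
            p.chunks.push_back(p.cur);
            p.left = sz;
        }
        void* out = p.cur;
        p.cur += n;
        p.left -= n;
        return out;
    }
    ~PoolAllocator() {
        for (auto& [ctx, p] : pools_)
            for (void* c : p.chunks) std::free(c);
    }
};

// Usage note: wrapping malloc and passing __builtin_return_address(0)
// (a GCC/Clang builtin) as ctx approximates the "requesting context".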