Results 1 - 3 of 3
Silent Stores and Store Value Locality
- IEEE Transactions on Computers, 2001
Cited by 23 (3 self)
Abstract: Value locality, a recently discovered program attribute that describes the likelihood of the recurrence of previously seen program values, has been studied enthusiastically in the recent published literature. Much of the energy has focused on refining the initial efforts at predicting load instruction outcomes, with the balance of the effort examining the value locality of either all register-writing instructions or a focused subset of them. Surprisingly, there has been very little published characterization of or effort to exploit the value locality of data words stored to memory by computer programs. This paper presents such a characterization, including detailed source-level analysis of the causes of silent stores, proposes both memory-centric (based on message passing) and producer-centric (based on program structure) prediction mechanisms for stored data values, introduces the concept of silent stores and new definitions of multiprocessor false sharing based on these observations, and suggests new techniques for aligning cache coherence protocols and microarchitectural store handling techniques to exploit the value locality of stores. We find that realistic implementations of these techniques can significantly reduce multiprocessor data bus traffic and are more effective at reducing address bus traffic than the addition of Exclusive state to a MSI coherence protocol. We also show that squashing of silent stores can provide uniprocessor speedups greater than the addition of store-to-load forwarding.
Index Terms: Value locality, value prediction, store optimization, false sharing, cache coherence.
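To make the term concrete: a silent store is a store that writes the value already held at the target location, so it leaves memory state unchanged. The following is a minimal Python sketch of squashing such stores. It is not the paper's mechanism; the `Memory` class, its fields, and the dictionary-backed word store are hypothetical names chosen for illustration, and the model ignores caches and coherence entirely.

```python
class Memory:
    """Toy word-addressed memory that squashes silent stores (illustrative only)."""

    def __init__(self):
        self.words = {}      # address -> current value
        self.dirty = set()   # addresses changed by a non-silent store
        self.total = 0       # all stores seen
        self.silent = 0      # stores that would not have changed memory

    def store(self, addr, value):
        """Perform a store; return False if it was silent and therefore squashed."""
        self.total += 1
        if addr in self.words and self.words[addr] == value:
            # Same value already present: skip the write and leave the
            # location clean, so no write-back or invalidation is implied.
            self.silent += 1
            return False
        self.words[addr] = value
        self.dirty.add(addr)
        return True
```

In this sketch, a second `store(0x10, 5)` after an identical first one returns `False` and increments `silent`, which is the case the abstract proposes to exploit for reducing bus traffic.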
Hybrid Load-Value Predictors
- IEEE Transactions on Computers, 2002
Cited by 21 (5 self)
Abstract: Load instructions diminish processor performance in two ways. First, due to the continuously widening gap between CPU and memory speed, the relative latency of load instructions grows constantly and already slows program execution. Second, memory reads limit the available instruction-level parallelism because instructions that use the result of a load must wait for the memory access to complete before they can start executing. Load-value predictors alleviate both problems by allowing the CPU to speculatively continue processing without having to wait for load instructions, which can significantly improve the execution speed. While several hybrid load-value predictors have been proposed and found to work well, no systematic study of such predictors exists. In this paper, we investigate the performance of all hybrids that can be built out of a register value, a last value, a stride 2-delta, a last four value, and a finite context method predictor. Our analysis shows that hybrids can deliver 25 percent more speedup than the best single-component predictors. An examination of the individual components of hybrids revealed that predictors with a poor standalone performance sometimes make excellent components in a hybrid, while combining well-performing individual predictors often does not result in an effective hybrid. Our hybridization study identified the register value + stride 2-delta predictor as one of the best two-component hybrids. It matches or exceeds the speedup of two-component hybrids from the literature in spite of its substantially smaller and simpler design. Of all the predictors we studied, the register value + stride 2-delta + last four value hybrid performs best. It yields a harmonic-mean speedup over the eight SPECint95 programs of 17.2 percent.
Index Terms: Value prediction, value locality, load-value predictor, hybrid predictor, performance metrics.
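As a rough illustration of two of the component predictors named in the abstract, the Python sketch below models a last value predictor and a stride 2-delta predictor for a single load, and combines them into a hybrid. The abstract does not describe how a hybrid selects among its components, so the saturating confidence counters used here are an assumption, as are all class and function names. The register value, last four value, and finite context method predictors are omitted.

```python
class LastValue:
    """Predicts that a load returns the same value as last time."""

    def __init__(self):
        self.last = 0

    def predict(self):
        return self.last

    def update(self, value):
        self.last = value


class Stride2Delta:
    """Predicts last + stride; the stride used for prediction changes
    only after the same new stride has been seen twice in a row."""

    def __init__(self):
        self.last = 0
        self.stride1 = 0  # most recently observed stride
        self.stride2 = 0  # stride actually used for prediction

    def predict(self):
        return self.last + self.stride2

    def update(self, value):
        stride = value - self.last
        if stride == self.stride1:
            self.stride2 = stride
        self.stride1 = stride
        self.last = value


class Hybrid:
    """Picks the component with the highest confidence counter.
    The selection scheme is an assumption, not taken from the paper."""

    def __init__(self):
        self.components = [LastValue(), Stride2Delta()]
        self.confidence = [0, 0]  # 2-bit saturating counters (0..3)

    def predict(self):
        best = max(range(len(self.components)), key=lambda i: self.confidence[i])
        return self.components[best].predict()

    def update(self, value):
        for i, comp in enumerate(self.components):
            if comp.predict() == value:
                self.confidence[i] = min(self.confidence[i] + 1, 3)
            else:
                self.confidence[i] = max(self.confidence[i] - 1, 0)
            comp.update(value)


def accuracy(predictor, values):
    """Fraction of values predicted correctly, predicting before each update."""
    hits = 0
    for v in values:
        if predictor.predict() == v:
            hits += 1
        predictor.update(v)
    return hits / len(values)
```

Calling `accuracy(Hybrid(), list(range(0, 40, 4)))` exercises the hybrid on a strided sequence, where the stride component takes over once its counter overtakes the last value component; on a constant sequence both components agree.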
Reducing Memory Latency via Read-after-Read Memory Dependence Prediction
- IEEE Transactions on Computers, 2002
Cited by 1 (0 self)
Abstract: We observe that typical programs exhibit highly regular read-after-read (RAR) memory dependence streams. To exploit this regularity, we introduce read-after-read (RAR) memory dependence prediction. This technique predicts whether: 1) A load will access a memory location that a preceding load accesses and 2) exactly which this preceding load is. This prediction is done without actual knowledge of the corresponding memory addresses. We also present two techniques that utilize RAR memory dependence prediction to reduce memory latency. In the first technique, a load may obtain a value by naming a preceding load with which an RAR dependence is predicted. The second technique speculatively converts a series of LOAD1-USE1, ..., LOADN-USEN chains into a single LOAD1-USE1...USEN producer/consumer graph. This is done whenever RAR dependences are predicted among the LOADi instructions. Our techniques can be implemented as small extensions to the previously proposed read-after-write (RAW) dependence prediction-based speculative memory cloaking and speculative memory bypassing. On average, our RAR-based techniques provide correct values for an additional 20 percent (integer codes) and 30 percent (floating-point codes) of all loads. Moreover, a combined RAW- and RAR-based cloaking/bypassing mechanism improves performance by 6.44 percent (integer) and 4.66 percent (floating-point) over a highly aggressive dynamically scheduled superscalar processor that uses naive memory dependence speculation. By comparison, the original RAW-based cloaking/bypassing mechanism yields improvements of 4.28 percent (integer) and 3.20 percent (floating-point). When no memory dependence speculation is used, our techniques yield speedups of 9.85 percent (integer) and 6.14 percent (floating-point).
Index Terms: Memory dependence prediction, load, cache, dynamic optimization.
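The first technique in the abstract, where a load obtains its value by naming a preceding load, can be sketched in Python as follows. This is a simplified model under stated assumptions, not the paper's hardware design: the three tables, the `RARPredictor` class, and its method names are hypothetical, the tables are unbounded, and intervening stores and misprediction recovery are ignored. Addresses are used only in training after a load completes; prediction is indexed by the load's program counter alone.

```python
class RARPredictor:
    """Toy read-after-read dependence predictor (illustrative only)."""

    def __init__(self):
        self.last_reader = {}  # address -> PC of the most recent load that read it
        self.producer_of = {}  # consumer load PC -> predicted producer load PC
        self.last_value = {}   # load PC -> value it most recently loaded

    def predict(self, pc):
        """Return a speculative value for the load at `pc` using only its PC,
        or None if no RAR dependence is predicted."""
        producer = self.producer_of.get(pc)
        if producer is None:
            return None
        return self.last_value.get(producer)

    def train(self, pc, addr, value):
        """Record a completed load: learn which earlier load read `addr`."""
        producer = self.last_reader.get(addr)
        if producer is not None and producer != pc:
            self.producer_of[pc] = producer
        self.last_reader[addr] = pc
        self.last_value[pc] = value
```

For example, if the loads at two hypothetical PCs `0xA` and `0xB` read the same address in one loop iteration, the sketch learns that `0xB` depends on `0xA`; in the next iteration `predict(0xB)` returns the value `0xA` just loaded, before the address of `0xB` is known.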

