Results 1 - 10 of 11
Spatio-Temporal Memory Streaming
"... Recent research advocates memory streaming techniques to alleviate the performance bottleneck caused by the high latencies of off-chip memory accesses. Temporal memory streaming replays previously observed miss sequences to eliminate long chains of dependent misses. Spatial memory streaming predicts ..."
Abstract
-
Cited by 18 (3 self)
Recent research advocates memory streaming techniques to alleviate the performance bottleneck caused by the high latencies of off-chip memory accesses. Temporal memory streaming replays previously observed miss sequences to eliminate long chains of dependent misses. Spatial memory streaming predicts repetitive data-layout patterns within fixed-size memory regions. Because each technique targets a different subset of misses, their effectiveness varies across workloads and each leaves a significant fraction of misses unpredicted. In this paper, we propose Spatio-Temporal Memory Streaming (STeMS) to exploit the synergy between spatial and temporal streaming. We observe that the order of spatial accesses repeats both within and across regions. STeMS records and replays the temporal sequence of region accesses and uses spatial relationships within each region to dynamically reconstruct a predicted total miss order. Using trace-driven and cycle-accurate simulation across a suite of commercial workloads, we demonstrate that, with implementation complexity similar to that of temporal streaming, STeMS achieves equal or higher coverage than spatial or temporal memory streaming alone, and improves performance by 31%, 3%, and 18% over stride, spatial, and temporal prediction, respectively.
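The reconstruction step is easiest to see in code. Below is a minimal, illustrative sketch of the idea, with an assumed region size and plain Python lists standing in for the paper's hardware structures; it is not the actual STeMS design, only the record-triggers-temporally, expand-them-spatially pattern it describes:

```python
# Toy sketch of spatio-temporal streaming (illustrative only).
# A temporal log records the sequence of region "trigger" misses; a
# per-region spatial pattern remembers the ordered block offsets
# touched within that region. Replaying the log and expanding each
# trigger reconstructs a predicted total miss order.

REGION_BITS = 6                 # assumed: 64 blocks per region
REGION_SIZE = 1 << REGION_BITS

temporal_log = []               # ordered region-trigger addresses
spatial_patterns = {}           # region base -> ordered block offsets

def record_miss(addr):
    base = addr & ~(REGION_SIZE - 1)
    offset = addr & (REGION_SIZE - 1)
    pattern = spatial_patterns.setdefault(base, [])
    if not pattern:             # first miss in a region is its trigger
        temporal_log.append(base)
    if offset not in pattern:
        pattern.append(offset)  # keep intra-region access order

def predict_stream(trigger_addr, lookahead=3):
    """On a miss to a logged trigger, replay the next few regions in
    the temporal log, expanding each with its spatial pattern."""
    base = trigger_addr & ~(REGION_SIZE - 1)
    if base not in temporal_log:
        return []
    i = temporal_log.index(base)
    preds = []
    for region in temporal_log[i:i + lookahead]:
        preds += [region + off for off in spatial_patterns.get(region, [])]
    return preds
```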
Machine Learning-Based Prefetch Optimization for Data Center Applications
"... Performance tuning for data centers is essential and complicated. It is important since a data center comprises thousands of machines and thus a single-digit performance improvement can significantly reduce cost and power consumption. Unfortunately, it is extremely difficult as data centers are dyna ..."
Abstract
-
Cited by 9 (0 self)
Performance tuning for data centers is essential and complicated. It is important because a data center comprises thousands of machines, so even a single-digit performance improvement can significantly reduce cost and power consumption. Unfortunately, it is extremely difficult because data centers are dynamic environments where applications are frequently released and servers are continually upgraded. In this paper, we study the effectiveness of different processor prefetch configurations, which can greatly influence the performance of the memory system and the overall data center. We observe a wide performance gap, from 1.4% to 75.1%, between the worst and best configurations for 11 important data center applications. We then develop a tuning framework that attempts to predict the optimal configuration based on hardware performance counters. The framework achieves performance within 1% of the best single configuration for the same set of applications.
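The framework's core step, mapping a vector of performance-counter readings to a predicted-best prefetch configuration, amounts to an ordinary supervised classifier. The counter names, configuration labels, training values, and the nearest-neighbor rule in the sketch below are all illustrative assumptions; the paper's actual model and feature set may differ:

```python
# Hypothetical sketch of counter-driven prefetch-configuration tuning.
# Features are performance-counter readings per application; labels
# are the best configuration found offline. All names are assumed.
import math

training = [
    # ([llc_miss_rate, dram_bw_util, ipc], best_config) -- assumed data
    ([0.12, 0.80, 0.9], "all_prefetchers_on"),
    ([0.02, 0.10, 2.1], "all_prefetchers_off"),
    ([0.07, 0.45, 1.4], "l2_stream_only"),
]

def predict_config(counters):
    """1-nearest-neighbor over Euclidean distance in counter space."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(training, key=lambda t: dist(t[0], counters))[1]

print(predict_config([0.10, 0.70, 1.0]))  # -> "all_prefetchers_on"
```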
Temporal Instruction Fetch Streaming
"... Abstract—L1 instruction-cache misses pose a critical performance bottleneck in commercial server workloads. Cache access latency constraints preclude L1 instruction caches large enough to capture the application, library, and OS instruction working sets of these workloads. To cope with capacity cons ..."
Abstract
-
Cited by 6 (1 self)
L1 instruction-cache misses pose a critical performance bottleneck in commercial server workloads. Cache access latency constraints preclude L1 instruction caches large enough to capture the application, library, and OS instruction working sets of these workloads. To cope with capacity constraints, researchers have proposed instruction prefetchers that use branch predictors to explore future control flow. However, such prefetchers suffer from several fundamental flaws: their lookahead is limited by branch prediction bandwidth, their accuracy suffers from geometrically compounding branch misprediction probability, and they are ignorant of the cache contents, frequently predicting blocks already present in L1. Hence, L1 instruction misses remain a bottleneck. We propose Temporal Instruction Fetch Streaming (TIFS), a mechanism for prefetching temporally correlated instruction streams from lower-level caches. Rather than explore a program's control flow graph, TIFS predicts future instruction-cache misses directly, by recording and replaying recurring L1 instruction miss sequences. In this paper, we first present an information-theoretic offline trace analysis of instruction-miss repetition to show that 94% of L1 instruction misses occur in long, recurring sequences. Then, we describe a practical mechanism to record these recurring sequences in the L2 cache and leverage them for instruction-cache prefetching. Our TIFS design requires less than 5% storage overhead over the baseline L2 cache and improves performance by 11% on average and 24% at best across a suite of commercial server workloads.
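A minimal sketch of the record-and-replay mechanism described above, with a plain list and dictionary standing in for the paper's L2-resident structures (the replay depth and all structures here are assumptions for illustration):

```python
# Toy sketch of temporal instruction fetch streaming (illustrative).
# Recurring L1-I miss sequences are appended to a log; an index maps
# each miss address to its most recent log position, so a repeated
# miss can replay (prefetch) the blocks that followed it last time.

miss_log = []    # sequence of L1-I miss block addresses
index = {}       # block address -> last position in miss_log

def record_instruction_miss(block):
    index[block] = len(miss_log)
    miss_log.append(block)

def replay(block, depth=4):
    """On an L1-I miss, prefetch the blocks that followed this block
    in the previously recorded sequence."""
    pos = index.get(block)
    if pos is None:
        return []
    return miss_log[pos + 1 : pos + 1 + depth]
```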
Practical Off-chip Meta-data for Temporal Memory Streaming
"... Prior research demonstrates that temporal memory streaming and related address-correlating prefetchers improve performance of commercial server workloads though increased memory level parallelism. Unfortunately, these prefetchers require large on-chip meta-data storage, making previously-proposed de ..."
Abstract
-
Cited by 4 (1 self)
Prior research demonstrates that temporal memory streaming and related address-correlating prefetchers improve the performance of commercial server workloads through increased memory-level parallelism. Unfortunately, these prefetchers require large on-chip meta-data storage, making previously proposed designs impractical. Hence, to improve practicality, researchers have sought ways to enable timely prefetch while locating meta-data entirely off-chip. Unfortunately, current solutions for off-chip meta-data increase memory traffic by over a factor of three. We observe three requirements for storing meta-data off chip: minimal off-chip lookup latency, bandwidth-efficient meta-data updates, and off-chip lookup cost amortized over many prefetches. In this work, we show that: (1) minimal off-chip meta-data lookup latency can be achieved through a hardware-managed main-memory hash table, (2) bandwidth-efficient updates can be performed through probabilistic sampling of meta-data updates, and (3) off-chip lookup costs can be amortized by organizing meta-data so that a single lookup yields a long prefetch sequence. Using these techniques, we develop Sampled Temporal Memory Streaming (STMS), a practical address-correlating prefetcher that keeps predictor meta-data in main memory while achieving 90% of the performance potential of idealized on-chip meta-data storage.
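The three requirements enumerated above can be sketched together in a few lines. A Python dict stands in for the hardware-managed main-memory hash table, and the sampling rate and sequence length are assumed parameters, not the paper's values:

```python
# Illustrative sketch of the three off-chip meta-data techniques:
# (1) a hash table keyed by miss address gives a one-probe lookup,
# (2) probabilistic sampling means only a fraction of misses pay the
#     bandwidth cost of a meta-data update,
# (3) each entry stores a long sequence, so one off-chip lookup is
#     amortized over many prefetches.
import random

SAMPLE_RATE = 0.1   # assumed fraction of misses that update meta-data
SEQ_LEN = 16        # assumed prefetches amortizing one lookup

meta = {}           # stands in for the main-memory hash table
history = []        # recent global miss sequence (unbounded in toy)

def on_miss(addr):
    history.append(addr)
    # (2) sampled, bandwidth-efficient update
    if len(history) > SEQ_LEN and random.random() < SAMPLE_RATE:
        head = history[-SEQ_LEN - 1]
        meta[head] = list(history[-SEQ_LEN:])   # (3) long sequence
    # (1) a single hash lookup yields a whole prefetch stream
    return meta.get(addr, [])
```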
Timing local streams: improving timeliness in data prefetching
In Proc. of the 2010 International Conference on Supercomputing
"... Data prefetching technique is widely used to bridge the grow-ing performance gap between processor and memory. Nu-merous prefetching techniques have been proposed to exploit data patterns and correlations in the miss address stream. In general, the miss addresses are grouped by some common character ..."
Abstract
-
Cited by 4 (1 self)
Data prefetching is widely used to bridge the growing performance gap between processor and memory. Numerous prefetching techniques have been proposed to exploit data patterns and correlations in the miss address stream. In general, the miss addresses are grouped into localized streams by some common characteristic, such as the program counter or the memory region they belong to, to improve prefetch accuracy and coverage. However, existing stream localization techniques lack the timing information of misses. This drawback can lead to a large fraction of untimely prefetches, which in turn limits the effectiveness of prefetching, wastes precious bandwidth, and potentially leads to high cache pollution. This paper proposes a novel mechanism, called stream timing, that can largely reduce untimely prefetches and in turn increase overall performance. Based on the proposed stream timing technique, we extend the conventional stride prefetcher into a new prefetcher called the Time-Aware Stride (TAS) prefetcher. We have carried out extensive simulation experiments to verify the design of the stream timing technique and the TAS prefetcher. The simulation results show that the proposed stream timing technique is promising in reducing untimely prefetches, and the TAS prefetcher outperforms the existing stride prefetcher by 11% in IPC.
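A minimal sketch of the time-aware idea: alongside the stride, each per-PC stream tracks its inter-miss interval and uses it to choose a prefetch distance that hides an assumed memory latency. All structures, field names, and thresholds here are illustrative, not the TAS design itself:

```python
# Toy time-aware stride prefetcher: a confirmed stride plus the
# stream's inter-miss interval pick a lookahead distance so prefetches
# arrive just in time rather than as early as possible.

streams = {}  # pc -> {"last_addr", "last_time", "stride", "interval"}

def on_miss(pc, addr, now, mem_latency=200):
    s = streams.setdefault(pc, {"last_addr": addr, "last_time": now,
                                "stride": 0, "interval": 0})
    stride = addr - s["last_addr"]
    interval = now - s["last_time"]
    confident = stride != 0 and stride == s["stride"]
    s.update(last_addr=addr, last_time=now,
             stride=stride, interval=interval)
    if not confident or interval == 0:
        return []
    # Look far enough ahead that a prefetch completes before the
    # demand access: distance ~ memory latency / inter-miss time.
    distance = max(1, mem_latency // interval)
    return [addr + stride * d for d in range(1, distance + 1)]
```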
Temporal Streams in Commercial Server Applications
In 2008 IEEE International Symposium on Workload Characterization (IISWC)
"... Commercial server applications remain memory bound on modern multiprocessor systems because of their large data footprints, frequent sharing, complex non-strided access patterns, and long chains of dependant misses. To improve memory system performance despite these challenging access patterns, rese ..."
Abstract
-
Cited by 2 (1 self)
Commercial server applications remain memory bound on modern multiprocessor systems because of their large data footprints, frequent sharing, complex non-strided access patterns, and long chains of dependent misses. To improve memory system performance despite these challenging access patterns, researchers have proposed prefetchers that exploit temporal streams, recurring sequences of memory accesses. Although prior studies show substantial performance improvement from such schemes, they fail to explain why temporal streams arise; that is, they treat commercial applications as a black box and do not identify the specific behaviors that lead to recurring miss sequences. In this paper, we perform an information-theoretic analysis of miss traces from single-chip and multi-chip multiprocessors to identify recurring temporal streams in web serving, online transaction processing, and decision support workloads. Then, using function names embedded in the application binaries and the Solaris kernel, we identify the code modules and behaviors that give rise to temporal streams.
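As a crude stand-in for the repetition analysis the paper performs, one can measure what fraction of fixed-length windows in a miss trace repeat an earlier window; the window length k below is an arbitrary assumption and the measure is far simpler than the paper's information-theoretic treatment:

```python
# Illustrative repetition measure for a miss trace: the fraction of
# k-address windows that exactly repeat a window seen earlier.

def repeated_fraction(trace, k=4):
    seen = set()
    covered = 0
    for i in range(len(trace) - k + 1):
        window = tuple(trace[i:i + k])
        if window in seen:
            covered += 1
        seen.add(window)
    return covered / max(1, len(trace) - k + 1)

# A trace where the stream 1,2,3,4 recurs scores higher than noise.
print(repeated_fraction([1, 2, 3, 4, 1, 2, 3, 4, 9, 9]))
```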
Article history: August 2010
"... August 2010 decades [18]. The unbalanced performance improvement leads to one of the significant performance bottlenecks in high-performance computing known as memory-wall problem [24,39]. The reason behind this huge processor-memory per-formance disparity is several folds. First, most advanced arch ..."
Abstract
- Add to MetaCart
(Show Context)
... decades [18]. The unbalanced performance improvement leads to one of the most significant performance bottlenecks in high-performance computing, known as the memory-wall problem [24,39]. The reasons behind this huge processor-memory performance disparity are severalfold. First, most advanced architectural and organizational efforts focus on processor technology rather than on memory storage devices. Second, rapidly improving semiconductor technology allows many more, much smaller transistors to be built on chip for processing units, which can achieve a high computational capability.
Predicting Memory Activity Using Spatial Correlation, 2009
"... The memory wall continues to pose a performance bottleneck for computer systems—studies show that modern servers spend up to two-thirds of execution time stalled on memory accesses. Although recent trends forecast growth in processor clock frequencies to be minimal, improvements to memory access lat ..."
Abstract
- Add to MetaCart
(Show Context)
The memory wall continues to pose a performance bottleneck for computer systems: studies show that modern servers spend up to two-thirds of execution time stalled on memory accesses. Although recent trends forecast minimal growth in processor clock frequencies, improvements to memory access latency are correspondingly slow. Traditional approaches, such as large on-chip caches, hardware multithreading, and out-of-order processing, demonstrate some success at mitigating the impact of high memory latencies, but offer little hope of completely overcoming the memory wall. Prefetching/streaming techniques have been proposed for predicting and eliminating misses in desktop, scientific, and engineering applications, but are less effective across commercial workloads, which exhibit data-dependent and irregular memory behaviors. Though complex, commercial server applications nevertheless organize their data in a structured manner and at large granularity. In this thesis, we explore spatial correlation of access patterns that span page-sized regions of memory. We develop mechanisms for accurately observing and predicting repetitive spatial layouts, which lead us to propose Spatial Memory Streaming (SMS), a hardware prefetcher that exploits spatial correlation to predict cache misses in server workloads.
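A minimal sketch of the spatial-correlation mechanism: the blocks touched within a page-sized region are recorded as a bit vector keyed by a (PC, trigger-offset) signature, and a later trigger with the same signature replays the recorded layout. The region size, block size, signature choice, and structures below are assumptions, not the SMS hardware:

```python
# Toy spatial-pattern predictor over page-sized regions.

BLOCK = 64
REGION = 4096                       # assumed page-sized region
BLOCKS_PER_REGION = REGION // BLOCK

patterns = {}                       # (pc, trigger_offset) -> bit vector

def train(pc, region_offsets):
    """region_offsets: block offsets touched during one region
    generation, in order; the first access is the trigger."""
    bits = 0
    for off in region_offsets:
        bits |= 1 << off
    patterns[(pc, region_offsets[0])] = bits

def predict(pc, addr):
    """On a trigger access, replay the layout recorded for this
    (pc, offset) signature as a list of block addresses."""
    base = addr - addr % REGION
    off = (addr % REGION) // BLOCK
    bits = patterns.get((pc, off), 0)
    return [base + i * BLOCK for i in range(BLOCKS_PER_REGION)
            if bits >> i & 1]

train(pc=0x42, region_offsets=[3, 4, 7])
print(predict(0x42, 8192 + 3 * BLOCK))  # replays offsets 3, 4, 7
```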
Carnegie Mellon
"... ......Memory access latency continues to pose a crucial performance bottleneck for server workloads. 1 Prefetchers bring cache blocks from main memory to on-chip caches prior to explicit processor requests to hide cache miss latency. Prefetching improves throughput and response time by increasing me ..."
Abstract
- Add to MetaCart
Memory access latency continues to pose a crucial performance bottleneck for server workloads [1]. Prefetchers bring cache blocks from main memory to on-chip caches prior to explicit processor requests to hide cache miss latency. Prefetching improves throughput and response time by increasing memory-level parallelism [2,3] and remains an essential strategy for addressing the processor-memory performance gap. Today's systems primarily employ stride-based prefetchers, which require only simple hardware additions and minimal on-chip area. However, these prefetchers are only partially effective in commercial server workloads, such as online ...
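For reference, the stride-based prefetching this entry refers to reduces to a small per-PC table; the sketch below (with an assumed table layout and prefetch degree) issues prefetches once the same nonzero stride is observed twice in a row:

```python
# Minimal per-PC stride prefetcher (illustrative table layout).

table = {}  # pc -> (last_addr, last_stride)

def stride_prefetch(pc, addr, degree=2):
    last_addr, last_stride = table.get(pc, (addr, 0))
    stride = addr - last_addr
    table[pc] = (addr, stride)
    if stride != 0 and stride == last_stride:   # stride confirmed
        return [addr + stride * d for d in range(1, degree + 1)]
    return []

for a in (100, 104, 108, 112):
    print(stride_prefetch(7, a))  # stride 4 confirmed on 3rd access
```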
Transactional Prefetching: Narrowing the Window of Contention in Hardware Transactional Memory
"... Memory access latency is the primary performance bottle-neck in modern computer systems. Prefetching data before it is needed by a processing core allows substantial perfor-mance gains by overlapping significant portions of memory latency with useful work. Prior work has investigated this technique ..."
Abstract
- Add to MetaCart
(Show Context)
Memory access latency is the primary performance bottleneck in modern computer systems. Prefetching data before it is needed by a processing core allows substantial performance gains by overlapping significant portions of memory latency with useful work. Prior work has investigated this technique and measured potential benefits in a variety of scenarios. However, its use in speeding up Hardware Transactional Memory (HTM) has remained hitherto unexplored. In several HTM designs, transactions invalidate speculatively updated cache lines when they abort. Such cache lines tend to have high locality and are likely to be accessed again when the transaction re-executes. Coarse-grained transactions that update several cache lines are particularly susceptible to performance degradation even under moderate contention. However, such transactions show strong locality of reference, especially when contention is high. Prefetching cache lines with high locality can therefore improve overall concurrency by speeding up transactions and thereby narrowing the window of time in which such transactions persist and can cause contention. Such transactions are important since they are likely to form a common TM use case. We note that traditional prefetch techniques may not be able to track such lines adequately or issue prefetches quickly enough. This paper investigates the use of prefetching in HTMs, proposing a simple design to identify and request prefetch candidates, and measures the performance gains to be had for several representative TM workloads.
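A minimal sketch of the abort-driven idea described above: record the speculatively updated lines a transaction invalidates when it aborts, then prefetch them when that transaction retries. Identifying a transaction by the PC of its begin instruction, and all structures here, are assumptions rather than the paper's design:

```python
# Toy abort-set prefetching for hardware transactional memory.

abort_sets = {}   # assumed key: PC of tx begin -> lines from last abort

def on_abort(tx_id, invalidated_lines):
    """Record the speculatively updated lines invalidated on abort;
    they have high locality and will likely be touched on retry."""
    abort_sets[tx_id] = list(invalidated_lines)

def on_retry(tx_id):
    """Prefetch last attempt's lines before re-execution, shrinking
    the window in which the transaction can cause contention."""
    return abort_sets.get(tx_id, [])

on_abort(0x400a, [0x1000, 0x1040, 0x1080])
print(on_retry(0x400a))   # lines to prefetch before re-executing
```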