Results 1 - 9 of 9
Spatio-Temporal Memory Streaming
"... Recent research advocates memory streaming techniques to alleviate the performance bottleneck caused by the high latencies of off-chip memory accesses. Temporal memory streaming replays previously observed miss sequences to eliminate long chains of dependent misses. Spatial memory streaming predicts ..."
Abstract - Cited by 18 (3 self)
Recent research advocates memory streaming techniques to alleviate the performance bottleneck caused by the high latencies of off-chip memory accesses. Temporal memory streaming replays previously observed miss sequences to eliminate long chains of dependent misses. Spatial memory streaming predicts repetitive data layout patterns within fixed-size memory regions. Because each technique targets a different subset of misses, their effectiveness varies across workloads and each leaves a significant fraction of misses unpredicted. In this paper, we propose Spatio-Temporal Memory Streaming (STeMS) to exploit the synergy between spatial and temporal streaming. We observe that the order of spatial accesses repeats both within and across regions. STeMS records and replays the temporal sequence of region accesses and uses spatial relationships within each region to dynamically reconstruct a predicted total miss order. Using trace-driven and cycle-accurate simulation across a suite of commercial workloads, we demonstrate that, with implementation complexity similar to temporal streaming, STeMS achieves equal or higher coverage than spatial or temporal memory streaming alone, and improves performance by 31%, 3%, and 18% over stride, spatial, and temporal prediction, respectively.
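To make the mechanism concrete, below is a minimal Python sketch of the reconstruction idea described in the abstract: record the temporal order of region "trigger" misses plus the spatial offsets touched in each region, then replay the temporal stream and expand each region with its spatial pattern to produce a predicted miss order. Table names, sizes, the eight-trigger replay depth, and the sorted-offset expansion are illustrative assumptions; the paper's actual design interleaves predictions using recorded ordering information.

BLOCKS_PER_REGION = 32   # assumes 64-byte blocks and 2 KiB spatial regions

def region_of(block_addr):
    return block_addr // BLOCKS_PER_REGION

def offset_of(block_addr):
    return block_addr % BLOCKS_PER_REGION

class SteMSModel:
    def __init__(self):
        self.temporal = {}   # trigger block -> ordered list of later trigger blocks
        self.spatial = {}    # region -> set of block offsets observed in that region

    def train(self, miss_trace):
        """Record the temporal order of first-misses-per-region and the spatial
        offsets touched inside each region."""
        triggers, seen_regions = [], set()
        for blk in miss_trace:
            r = region_of(blk)
            self.spatial.setdefault(r, set()).add(offset_of(blk))
            if r not in seen_regions:            # first miss to a region = trigger
                seen_regions.add(r)
                triggers.append(blk)
        for i, t in enumerate(triggers):
            self.temporal[t] = triggers[i + 1 : i + 9]   # replay depth of 8 (assumed)

    def predict(self, trigger_blk):
        """Reconstruct a predicted miss order by expanding each predicted trigger's
        region with its recorded spatial offsets."""
        predicted = []
        for t in [trigger_blk] + self.temporal.get(trigger_blk, []):
            r = region_of(t)
            for off in sorted(self.spatial.get(r, {offset_of(t)})):
                predicted.append(r * BLOCKS_PER_REGION + off)
        return predicted

# Usage with a tiny synthetic block-address miss trace that repeats once.
trace = [0, 3, 1, 40, 42, 70, 64, 0, 3, 1, 40, 42, 70, 64]
model = SteMSModel()
model.train(trace)
print(model.predict(0))   # [0, 1, 3, 40, 42, 64, 70]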
Fingerprinting: Hash-Based Error Detection in Microprocessors
, 2008
"... Today’s commodity processors are tuned primarily for performance and power. As CMOS scaling continues into the deep sub-micron regime, soft errors and device wearout will increas-ingly jeopardize the reliability of unprotected processor pipelines. To preserve reliable operation, processor cores will ..."
Abstract - Cited by 5 (0 self)
Today’s commodity processors are tuned primarily for performance and power. As CMOS scaling continues into the deep sub-micron regime, soft errors and device wearout will increasingly jeopardize the reliability of unprotected processor pipelines. To preserve reliable operation, processor cores will require mechanisms to detect errors affecting the control and datapaths. Conventional techniques such as parity, error correcting codes, and self-checking circuits have high implementation overheads and therefore cannot be easily applied to complex and timing-critical high-performance pipelines. This thesis proposes and evaluates architectural and microarchitectural fingerprints. A fingerprint is a compact (e.g., 16-bit) signature of recent architectural or microarchitectural state updates. By periodically comparing a fingerprint against a reference over an interval of execution, the system can detect errors in a timely and bandwidth-efficient manner. Architectural fingerprints capture in-order architectural state with lightweight monitoring hardware at the retirement stages of a pipeline, while microarchitectural fingerprints leverage existing design-for-test hardware to accumulate a signature of internal state. This thesis explores two applications of fingerprints. In the Reunion execution model, this thesis shows that architectural fingerprints can detect both soft errors and input incoherence with complexity-effective redundant execution in a chip multiprocessor. Cycle-accurate simulation shows that the performance overhead is only 5-6% over more complicated designs that strictly replicate inputs. In another application, FIRST, fingerprints detect emerging wearout faults by periodically testing the processor under marginal operating conditions. Wearout fault simulation in a commercial processor shows that architectural fingerprints have high coverage of widespread wearout, while microarchitectural fingerprints provide superior coverage of both individual and widespread wearout.
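As a concrete illustration of the fingerprint mechanism, the Python sketch below hashes each retired state update into a compact running signature and compares the signatures of two redundant executions at interval boundaries. The CRC-based compression, the 16-bit truncation, the update encoding, and the interval length are assumptions standing in for the thesis's actual hardware.

import zlib

INTERVAL = 4   # retired instructions per checking interval (assumed)

def fingerprint(updates):
    """Fold a sequence of (dest_reg, value) retirement updates into a 16-bit hash."""
    fp = 0
    for reg, val in updates:
        data = reg.to_bytes(1, "little") + (val & 0xFFFFFFFF).to_bytes(4, "little")
        fp = zlib.crc32(data, fp) & 0xFFFF        # keep only a compact 16-bit signature
    return fp

def check(core_a_updates, core_b_updates):
    """Compare fingerprints interval by interval; report the first mismatch."""
    for i in range(0, max(len(core_a_updates), len(core_b_updates)), INTERVAL):
        fa = fingerprint(core_a_updates[i:i + INTERVAL])
        fb = fingerprint(core_b_updates[i:i + INTERVAL])
        if fa != fb:
            return f"mismatch in interval starting at instruction {i}"
    return "fingerprints match"

# Usage: a single corrupted retired value in one core is caught at interval granularity.
a = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
b = [(1, 10), (2, 20), (3, 31), (4, 40), (5, 50)]   # bit flip in the value written to r3
print(check(a, b))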
Practical Off-chip Meta-data for Temporal Memory Streaming
"... Prior research demonstrates that temporal memory streaming and related address-correlating prefetchers improve performance of commercial server workloads though increased memory level parallelism. Unfortunately, these prefetchers require large on-chip meta-data storage, making previously-proposed de ..."
Abstract - Cited by 4 (1 self)
Prior research demonstrates that temporal memory streaming and related address-correlating prefetchers improve performance of commercial server workloads through increased memory-level parallelism. Unfortunately, these prefetchers require large on-chip meta-data storage, making previously-proposed designs impractical. Hence, to improve practicality, researchers have sought ways to enable timely prefetch while locating meta-data entirely off-chip. Unfortunately, current solutions for off-chip meta-data increase memory traffic by over a factor of three. We observe three requirements for storing meta-data off-chip: minimal off-chip lookup latency, bandwidth-efficient meta-data updates, and off-chip lookups amortized over many prefetches. In this work, we show: (1) minimal off-chip meta-data lookup latency can be achieved through a hardware-managed main memory hash table, (2) bandwidth-efficient updates can be performed through probabilistic sampling of meta-data updates, and (3) off-chip lookup costs can be amortized by organizing meta-data to allow a single lookup to yield long prefetch sequences. Using these techniques, we develop Sampled Temporal Memory Streaming (STMS), a practical address-correlating prefetcher that keeps predictor meta-data in main memory while achieving 90% of the performance potential of idealized on-chip meta-data storage.
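The following toy Python model illustrates the three requirements listed above: a hash table (standing in for the hardware-managed main-memory table) maps a miss address to a long stored successor sequence, so one lookup amortizes over many prefetches, and meta-data updates are written back only with a sampling probability to save bandwidth. The sampling rate, sequence length, and table organization are assumptions, not the paper's parameters.

import random

SEQ_LEN = 16          # successors returned per off-chip lookup (assumed)
SAMPLE_RATE = 0.1     # fraction of update opportunities actually written back (assumed)

class SampledTMS:
    def __init__(self, seed=0):
        self.table = {}                     # miss address -> recorded successor list
        self.rng = random.Random(seed)

    def observe_miss_sequence(self, misses):
        """Probabilistically record long successor sequences for observed misses."""
        for i, addr in enumerate(misses[:-1]):
            if self.rng.random() < SAMPLE_RATE:      # sampled meta-data update
                self.table[addr] = misses[i + 1 : i + 1 + SEQ_LEN]

    def lookup(self, miss_addr):
        """One (conceptually off-chip) lookup yields a long prefetch sequence."""
        return self.table.get(miss_addr, [])

# Usage: after enough repetitions, a sampled entry predicts the recurring miss stream.
stms = SampledTMS()
stream = list(range(100, 140))
for _ in range(50):
    stms.observe_miss_sequence(stream)
print(stms.lookup(100)[:8] or "head not sampled in this run")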
EXPLOITING SOFTWARE INFORMATION FOR AN EFFICIENT MEMORY HIERARCHY
"... Power consumption is one of the most important factors in the design of today’s processor chips. Multicore and heterogeneous systems have emerged to address the rising power concerns. Since the memory hierarchy is becoming one of the major consumers of the on-chip power budget in these systems [73], ..."
Abstract - Cited by 1 (1 self)
Power consumption is one of the most important factors in the design of today’s processor chips. Multicore and heterogeneous systems have emerged to address the rising power concerns. Since the memory hierarchy is becoming one of the major consumers of the on-chip power budget in these systems [73], designing an efficient memory hierarchy is critical to future systems. We identify three sources of inefficiency in the memory hierarchies of today’s systems: (a) coherence, (b) data communication, and (c) data storage. This thesis takes the stand that many of these inefficiencies are a result of today’s software-agnostic hardware design. There is a lot of information in the software that can be exploited to build an efficient memory hierarchy. This thesis focuses on identifying some of the inefficiencies related to each of the above three sources, and proposing various techniques to mitigate them by exploiting information from the software. First, we focus on inefficiencies related to coherence and communication. Today’s hardware-based directory coherence protocols are extremely complex and incur unnecessary overheads for sending invalidation messages and maintaining sharer lists. We propose DeNovo, a hardware-software co-designed protocol, to address these issues for a class of programs that are deterministic. DeNovo assumes a disciplined programming environment and exploits features such as structured parallel control, data-race-freedom, and software
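The abstract is cut off before describing the protocol itself, so the sketch below only illustrates, under stated assumptions, one coherence simplification that becomes possible for data-race-free, phase-structured programs: each core self-invalidates shared data it did not touch during a phase when it reaches the phase boundary, removing the need for invalidation messages and sharer lists. This is a hedged illustration of the general idea, not a description of DeNovo's actual design.

class CoreCache:
    def __init__(self):
        self.data = {}        # addr -> (value, state), state in {"valid", "owned"}
        self.touched = set()  # addresses read or written in the current phase

    def read(self, addr, memory):
        self.touched.add(addr)
        if addr not in self.data:                  # miss: fetch the current value
            self.data[addr] = (memory[addr], "valid")
        return self.data[addr][0]

    def write(self, addr, value, memory):
        self.touched.add(addr)
        self.data[addr] = (value, "owned")         # the writer keeps its own copy
        memory[addr] = value                       # write-through keeps the sketch simple

    def phase_boundary(self):
        """Self-invalidate untouched, un-owned blocks; no invalidation traffic needed."""
        self.data = {a: v for a, v in self.data.items()
                     if a in self.touched or v[1] == "owned"}
        self.touched.clear()

# Usage: core1 cached a stale copy of x; after core0 updates x in a later phase,
# core1 self-invalidates the untouched copy at the boundary and re-reads the new value.
memory = {"x": 0}
core0, core1 = CoreCache(), CoreCache()
core1.read("x", memory)            # phase 0: core1 caches x = 0
core1.phase_boundary()
core0.write("x", 7, memory)        # phase 1: core0 updates x
core0.phase_boundary(); core1.phase_boundary()
print(core1.read("x", memory))     # phase 2: prints 7, with no invalidation message sent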
IN
"... An exponentially increasing demand for online services continues pushing server performance into the forefront of computer architecture. While the diversity and complexity of server workloads places demands on many aspects of server processors, the memory system has been among the key exposed bottle ..."
Abstract
An exponentially increasing demand for online services continues pushing server performance into the forefront of computer architecture. While the diversity and complexity of server workloads place demands on many aspects of server processors, the memory system has been among the key exposed bottlenecks. In particular, long-latency instruction accesses have long been recognized as one of the key factors limiting the performance of servers. Server workloads span multiple application binaries, shared libraries, and operating system modules, which together comprise hundreds of kilobytes to megabytes of code. While steady technological improvements have enabled growth in total on-chip cache capacity, cache access latency constraints preclude building L1 instruction caches large enough to capture the instruction working sets of server workloads, leaving L1 instruction-cache misses as a major bottleneck. In this work, we make the observation that instruction-cache misses repeat in long recurring sequences that we call Temporal Instruction Streams. Temporal instruction streams comprise sequences of tens to thousands of instruction-cache blocks which recur frequently during program execution. The stability and length of the instruction streams lend themselves well to prediction, allowing accurate prediction of long sequences of upcoming instruction accesses once a previously
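A minimal Python sketch of the temporal-instruction-stream idea follows: recurring sequences of instruction-cache block addresses are recorded once, and when a stream's head block misses again, the stored sequence predicts the upcoming instruction blocks. The global-history organization, lookup indexing, and prediction depth are illustrative assumptions, not the work's actual predictor structure.

class InstructionStreamPredictor:
    def __init__(self, max_stream=1024):
        self.history = []          # global history of I-cache miss block addresses
        self.index = {}            # block address -> position of its first occurrence
        self.max_stream = max_stream

    def record_miss(self, block):
        if block not in self.index:
            self.index[block] = len(self.history)
        self.history.append(block)

    def predict(self, block, n=8):
        """On a miss to a previously seen block, replay up to n following blocks."""
        pos = self.index.get(block)
        if pos is None:
            return []
        return self.history[pos + 1 : pos + 1 + min(n, self.max_stream)]

# Usage: a loop's instruction footprint recurs; its second encounter is predicted.
pred = InstructionStreamPredictor()
loop_blocks = [0x40, 0x41, 0x42, 0x80, 0x81, 0x40, 0x41, 0x42, 0x80, 0x81]
for b in loop_blocks:
    pred.record_miss(b)
print([hex(b) for b in pred.predict(0x40, n=4)])   # ['0x41', '0x42', '0x80', '0x81']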
Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two
"... One of the major challenges NoC and on-board interconnection has to face in current and future multicore chips is the skyrocketing off-chip bandwidth requirement. As the number of cores increases, the demand for off-chip bus, memory ports, and chip pins increases and this can severely hurt performan ..."
Abstract
One of the major challenges that the NoC and on-board interconnect must face in current and future multicore chips is the skyrocketing off-chip bandwidth requirement. As the number of cores increases, the demand on the off-chip bus, memory ports, and chip pins grows, which can severely hurt performance: it leads to bus congestion, processor stalls, and hence performance loss. Off-chip bandwidth demand is generated by the on-chip cache hierarchy (cache misses and cache writebacks). This paper studies the interaction among off-chip bandwidth requirements, cache performance, and overall system performance in multicore systems. The traffic from the chip to memory consists of the writes sent from the last-level on-chip cache to memory, or to the next-level external cache, whenever a block is replaced from the cache. We relax some constraints of the well-known LRU replacement policy, yielding the Modified Least Recently Used (MLRU) policy, which essentially reduces the traffic from the chip to the memory system. Simulations using the MLRU policy show a writeback reduction of more than 90% with minimal performance impact. They also reveal some interesting interactions among cache performance, overall system performance, and off-chip writeback traffic.
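The abstract does not specify which LRU constraints MLRU relaxes, so the sketch below shows one plausible, writeback-reducing relaxation purely as an assumption: among the k least-recently-used ways, prefer evicting a clean block, which avoids a dirty writeback to memory. The window size and the policy itself are illustrative, not the paper's definition of MLRU.

K = 4   # eviction candidate window among the least-recently-used ways (assumed)

def choose_victim(ways):
    """ways: list of dicts ordered MRU -> LRU, each {'tag': ..., 'dirty': bool}.
    Returns the index of the way to evict."""
    candidates = list(range(len(ways)))[-K:]          # the k least-recently-used ways
    for idx in reversed(candidates):                  # scan from true LRU upward
        if not ways[idx]["dirty"]:
            return idx                                # clean victim: no writeback traffic
    return len(ways) - 1                              # all candidates dirty: plain LRU

# Usage: the LRU block is dirty, but a slightly more recent clean block exists,
# so this MLRU-style selection avoids generating off-chip writeback traffic.
ways = [{"tag": t, "dirty": d} for t, d in
        [(0xA, False), (0xB, True), (0xC, True), (0xD, False), (0xE, True)]]
print(choose_victim(ways))   # 3 -> evicts clean tag 0xD instead of dirty LRU tag 0xE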
PBC: Prefetched Blocks Compaction
"... Abstract—Cache compression improves the performance of a multi-core system by being able to store more cache blocks in a compressed format. Compression is achieved by exploiting data patterns present within a block. For a given cache space, compression increases the effective cache capacity. However ..."
Abstract
Cache compression improves the performance of a multi-core system by storing more cache blocks in a compressed format. Compression is achieved by exploiting data patterns present within a block. For a given cache space, compression increases the effective cache capacity. However, this increase is limited by the number of tags that can be accommodated at the cache. Prefetching is another technique that improves system performance by fetching cache blocks ahead of time into the cache and hiding the off-chip latency. Commonly used hardware prefetchers, such as stream and stride, fetch multiple contiguous blocks into the cache. In this paper we propose prefetched blocks compaction (PBC), wherein we exploit the data patterns present across these prefetched blocks. PBC compacts the prefetched blocks into a single block with a single tag, effectively increasing the cache capacity. We also modify the cache organization to access these multiple cache blocks residing in a single block without any need for extra tag look-ups. PBC improves system performance by 11.1% (up to a maximum of 43.4%) on a 4-core system.
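To make the compaction idea concrete, the toy Python sketch below stores several contiguous prefetched blocks under a single tag, keeping each word as a small delta from a shared base value so the group fits in one frame and is served with one tag match. The base+delta encoding and the four-blocks-per-entry grouping are assumptions chosen only for illustration.

BLOCKS_PER_ENTRY = 4
WORDS_PER_BLOCK = 8

class CompactedEntry:
    def __init__(self, region_tag, blocks):
        assert len(blocks) == BLOCKS_PER_ENTRY
        self.tag = region_tag                          # single tag covers the whole group
        self.base = blocks[0][0]                       # shared base value
        self.deltas = [[w - self.base for w in blk] for blk in blocks]

    def read_word(self, block_idx, word_idx):
        """Serve any of the grouped blocks after one tag match, with no extra look-ups."""
        return self.base + self.deltas[block_idx][word_idx]

# Usage: a stride prefetcher fetched four neighbouring blocks holding nearby values
# (e.g., fields of adjacent array elements); compaction keeps them all behind one tag.
region_tag = 0x1234
blocks = [[1000 + 16 * b + w for w in range(WORDS_PER_BLOCK)]
          for b in range(BLOCKS_PER_ENTRY)]
entry = CompactedEntry(region_tag, blocks)
print(entry.read_word(2, 5))   # 1037, recovered from base + delta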
The Use of Memory State Knowledge to Improve Computer Memory System Organization
, 2011
"... Copyright by ..."
(Show Context)
Parallel Systems Architecture Lab
"... Caches mitigate the long memory latency that limits the performance of modern processors. However, caches can be quite inefficient. On average, a cache block in a 2MB L2 cache is dead 59 % of the time, i.e., it will not be referenced again before it is evicted. Increasing cache efficiency can improv ..."
Abstract
Caches mitigate the long memory latency that limits the performance of modern processors. However, caches can be quite inefficient. On average, a cache block in a 2MB L2 cache is dead 59% of the time, i.e., it will not be referenced again before it is evicted. Increasing cache efficiency can improve performance by reducing the miss rate or, alternatively, improve power and energy by allowing a smaller cache with the same miss rate. This paper proposes using predicted dead blocks to hold blocks evicted from other sets. When these evicted blocks are referenced again, the access can be satisfied from the other set, avoiding a costly access to main memory. The pool of predicted dead blocks can be thought of as a virtual victim cache. A virtual victim cache in a 16-way set-associative 2MB L2 cache reduces misses by 11.7%, yields an average speedup of 12.5%, and improves cache efficiency by 15% on average, where cache efficiency is defined as the average time during which cache blocks contain live information. This virtual victim cache yields a lower average miss rate than a fully-associative LRU cache of the same capacity. Using an adaptive insertion policy, the virtual victim cache gives an average speedup of 17.3% over the baseline 2MB cache. The virtual victim cache significantly reduces cache misses in multi-threaded workloads. For a 2MB cache accessed simultaneously by four threads, the virtual victim cache reduces misses by 12.9% and increases cache efficiency by 16% on average. Alternatively, a 1.7MB virtual victim cache achieves about the same performance as a larger 2MB L2 cache, reducing the number of SRAM cells required by 16%, thus maintaining performance while reducing power and area.
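The sketch below is a small Python model of the virtual-victim-cache behaviour described in the abstract: blocks evicted from one set are parked in frames of a partner set that a dead-block predictor considers dead, and a miss probes that partner set before going to memory. The partner-set mapping (flipping one index bit) and the trivial dead-block predictor are simplifying assumptions, not the paper's mechanism.

NUM_SETS = 4
WAYS = 2

class VirtualVictimCache:
    def __init__(self):
        # Each frame is None or {'tag': ..., 'live': bool, 'victim': bool}.
        self.sets = [[None] * WAYS for _ in range(NUM_SETS)]

    @staticmethod
    def partner(set_idx):
        return set_idx ^ 1                     # partner set: flip the low index bit

    def _find(self, set_idx, tag):
        return next((w for w, f in enumerate(self.sets[set_idx])
                     if f and f["tag"] == tag), None)

    def access(self, set_idx, tag):
        """Returns 'hit', 'victim-hit' (memory access avoided), or 'miss'."""
        if self._find(set_idx, tag) is not None:
            return "hit"
        p = self.partner(set_idx)
        w = self._find(p, tag)
        if w is not None and self.sets[p][w]["victim"]:
            return "victim-hit"                # served from a predicted-dead frame
        self._fill(set_idx, tag)
        return "miss"

    def _fill(self, set_idx, tag):
        s = self.sets[set_idx]
        way = next((w for w, f in enumerate(s) if f is None), 0)
        evicted = s[way]
        s[way] = {"tag": tag, "live": True, "victim": False}
        if evicted and evicted["live"]:        # park the evicted block in a frame the
            p = self.partner(set_idx)          # predictor marked as dead
            dead = next((w for w, f in enumerate(self.sets[p])
                         if f is None or not f["live"]), None)
            if dead is not None:
                self.sets[p][dead] = {"tag": evicted["tag"], "live": False,
                                      "victim": True}

# Usage: tag 0xAA is evicted from set 0 by two later fills, yet the next access
# still avoids memory because its copy was parked in partner set 1.
vvc = VirtualVictimCache()
vvc.access(0, 0xAA)
vvc.access(0, 0xBB)
vvc.access(0, 0xCC)            # evicts 0xAA into a predicted-dead frame of set 1
print(vvc.access(0, 0xAA))     # 'victim-hit'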