Using Dead Blocks as a Virtual Victim Cache
Abstract
Caches mitigate the long memory latency that limits the performance of modern processors. However, caches can be quite inefficient. On average, a cache block in a 2MB L2 cache is dead 59% of the time, i.e., it will not be referenced again before it is evicted. Increasing cache efficiency can improve performance by reducing the miss rate or, alternatively, improve power and energy by allowing a smaller cache with the same miss rate. This paper proposes using predicted dead blocks to hold blocks evicted from other sets. When these evicted blocks are referenced again, the access can be satisfied from the other set, avoiding a costly access to main memory. The pool of predicted dead blocks can be thought of as a virtual victim cache. For a set of memory-intensive single-threaded workloads, a virtual victim cache in a 16-way set-associative 2MB L2 cache reduces misses by 26%, yields a geometric mean speedup of 12.1%, and improves cache efficiency by 27% on average, where cache efficiency is defined as the average time during which cache blocks contain live information. This virtual victim cache yields a lower average miss rate than a fully associative LRU cache of the same capacity. For a set of multi-core workloads, the virtual victim cache improves throughput by 4% over LRU while improving cache efficiency by 62%. Alternatively, a 1.7MB virtual victim cache achieves about the same performance as a larger 2MB L2 cache, reducing the number of SRAM cells required by 16% and thus maintaining performance while reducing power and area.
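A minimal Python sketch of the virtual victim cache idea follows. It is a toy model under stated assumptions, not the paper's design: the dead-block predictor (a block is predicted dead once its set has seen `dead_after` accesses since the block was last touched), the partner-set pairing (set index with the lowest bit flipped, so `num_sets` must be even), the placement of parked victims at the LRU end, and all class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    tag: int
    last_touch: int          # owning set's access clock at the block's last use
    is_victim: bool = False  # True if parked here after eviction from the partner set


class VirtualVictimCache:
    """Toy LRU set-associative cache that parks evicted blocks in
    predicted-dead frames of a partner set (hypothetical policy)."""

    def __init__(self, num_sets: int, ways: int, dead_after: int):
        # num_sets is assumed even so that every set has an XOR partner.
        self.num_sets, self.ways, self.dead_after = num_sets, ways, dead_after
        self.sets = [[] for _ in range(num_sets)]  # each list ordered LRU -> MRU
        self.clock = [0] * num_sets                # per-set access counters
        self.hits = self.victim_hits = self.misses = 0

    def _partner(self, s: int) -> int:
        return s ^ 1  # hypothetical pairing: flip the lowest set-index bit

    def _predicted_dead(self, s: int, blk: Block) -> bool:
        # Hypothetical predictor: dead once its set has seen `dead_after`
        # accesses since the block was last touched.
        return self.clock[s] - blk.last_touch >= self.dead_after

    def _park(self, home: int, tag: int) -> None:
        p = self._partner(home)
        for i, blk in enumerate(self.sets[p]):
            if self._predicted_dead(p, blk):
                # Overwrite the first predicted-dead frame; place the victim at
                # the LRU end so new demand in set p displaces it first.
                del self.sets[p][i]
                self.sets[p].insert(0, Block(tag, self.clock[p], is_victim=True))
                return
        # No predicted-dead frame in the partner set: the victim is dropped.

    def _fill(self, s: int, tag: int) -> None:
        frames = self.sets[s]
        if len(frames) == self.ways:
            evicted = frames.pop(0)    # LRU eviction
            if not evicted.is_victim:  # parked victims are not re-parked
                self._park(s, evicted.tag)
        frames.append(Block(tag, self.clock[s]))

    def access(self, addr: int) -> str:
        s, tag = addr % self.num_sets, addr // self.num_sets
        self.clock[s] += 1
        frames = self.sets[s]
        for i, blk in enumerate(frames):
            if blk.tag == tag and not blk.is_victim:
                frames.append(frames.pop(i))  # promote to MRU
                blk.last_touch = self.clock[s]
                self.hits += 1
                return "hit"
        p = self._partner(s)
        for i, blk in enumerate(self.sets[p]):
            if blk.tag == tag and blk.is_victim:
                del self.sets[p][i]  # satisfied from the partner set, not memory
                self._fill(s, tag)
                self.victim_hits += 1
                return "victim hit"
        self._fill(s, tag)  # true miss: fetch from main memory
        self.misses += 1
        return "miss"
```

For example, with two 2-way sets and `dead_after=1`, the trace 1, 3, 0, 2, 4, 0 ends in a "victim hit": block 0 is evicted from set 0 by block 4, parked in a predicted-dead frame of set 1, and found there on its next reference, where a plain 2-way LRU cache would miss.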
An Intra-Tile Cache Set Balancing Scheme
Abstract
This poster describes an intra-tile cache set balancing strategy that exploits the demand imbalance across sets within the same L2 cache bank. This strategy retains some fraction of the working set at underutilized sets so as to satisfy far-flung reuses. It adapts to phase changes in programs and promotes a very flexible sharing among cache sets referred to as many-from-many sharing. Simulation results using a full system simulator demonstrate the effectiveness of the proposed scheme and show that it compares favorably with related cache designs on a 16-way tiled CMP platform.
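One way to picture how demand imbalance across sets might be detected is with per-set saturating counters. The sketch below is an illustration only: the counter width, the `high`/`low` thresholds, `SetDemandMonitor`, and `pick_receiver` are hypothetical, and the poster's actual mechanism may differ. Counters rise on misses and fall on hits, so a set's classification changes when the program changes phase, and any high-demand set may spill a victim into any low-demand set (many-from-many sharing).

```python
class SetDemandMonitor:
    """Hypothetical per-set demand tracker for one L2 bank."""

    def __init__(self, num_sets: int, max_count: int = 15, high: int = 12, low: int = 3):
        self.counters = [max_count // 2] * num_sets  # start every set at mid-range
        self.max_count, self.high, self.low = max_count, high, low

    def record(self, set_idx: int, hit: bool) -> None:
        # Saturating counter: misses raise demand, hits lower it, so the
        # classification drifts with program phases.
        c = self.counters[set_idx]
        self.counters[set_idx] = max(c - 1, 0) if hit else min(c + 1, self.max_count)

    def is_giver(self, set_idx: int) -> bool:
        # High-demand set whose victims are worth retaining elsewhere.
        return self.counters[set_idx] >= self.high

    def pick_receiver(self, giver: int):
        # Many-from-many: any underutilized set in the bank may host a victim
        # from any giver; pick the least-demanded one (None if there is none).
        candidates = [s for s, c in enumerate(self.counters)
                      if c <= self.low and s != giver]
        return min(candidates, key=lambda s: self.counters[s], default=None)
```

If a receiver set later starts missing, its counter climbs past `low` and it stops being offered as a host, which is the kind of phase adaptation the poster describes.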
Improving Cache Performance using Victim Tag Stores
(2011)
Abstract
With increasing pressure on memory bandwidth, there have been a number of proposals to improve the cache replacement policy. These mechanisms monitor cache blocks while they are in the cache and evict blocks that are deemed to have low temporal locality. However, a majority of these mechanisms are agnostic to the temporal locality of a missed block and follow a single insertion policy for all incoming blocks. There is comparatively little work on mechanisms that distinguish between missed blocks based on their temporal reuse behavior. Prior work has shown that distinguishing missed blocks based on their temporal locality and choosing the insertion policy on a per-block basis can significantly improve performance. To this end, we propose a new, simple hardware mechanism that predicts the temporal locality of a missed block before inserting it into the cache. The key insight behind the prediction scheme is that if a block with good temporal locality is prematurely evicted from the cache, it will be accessed soon after eviction. To implement this prediction scheme, our mechanism augments the conventional cache with a structure, the victim tag store, that keeps track of the addresses of blocks evicted from the cache. We provide a practical, low-complexity hardware implementation of our mechanism using Bloom filters. We compare our mechanism qualitatively and quantitatively to five different cache management mechanisms and show that it provides significant performance improvements.
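The per-block insertion decision can be sketched as follows. To keep the example short, the victim tag store is modeled as an exact FIFO of evicted addresses rather than the Bloom-filter implementation the abstract describes; the choice of MRU insertion for predicted-reuse blocks and LRU insertion for all others, along with `VTSCache` and `vts_size`, are assumptions for illustration.

```python
from collections import OrderedDict, deque


class VTSCache:
    """Toy fully associative LRU cache whose insertion position is chosen by
    a victim tag store (VTS), modeled here as an exact FIFO of evicted addresses."""

    def __init__(self, capacity: int, vts_size: int):
        self.capacity = capacity
        self.lines = OrderedDict()         # first key = LRU, last key = MRU
        self.vts = deque(maxlen=vts_size)  # oldest evicted addresses fall off
        self.hits = self.misses = 0

    def access(self, addr: int) -> bool:
        if addr in self.lines:
            self.lines.move_to_end(addr)   # promote to MRU on a hit
            self.hits += 1
            return True
        self.misses += 1
        # A missed block found in the VTS was evicted recently, so it is
        # predicted to have good temporal locality.
        predicted_reuse = addr in self.vts
        if len(self.lines) == self.capacity:
            victim, _ = self.lines.popitem(last=False)  # evict the LRU block
            self.vts.append(victim)                     # remember its address
        self.lines[addr] = None                         # new keys land at MRU
        if not predicted_reuse:
            self.lines.move_to_end(addr, last=False)    # demote to LRU
        return False
```

For example, with `capacity=2`, the trace 1, 2, 1, 2, 10, 11, 12, 2 still hits on the final access to 2, because each block of the 10, 11, 12 scan is inserted at the LRU position and only displaces its predecessor. A plain two-block LRU cache would have lost both 1 and 2 to the scan.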
Memory
Abstract
By reconfiguring part of the cache as software-managed scratchpad memory (SPM), hybrid caches can handle both unknown and predictable memory access patterns. However, existing hybrid caches provide a flexible partitioning of cache and SPM without adapting to run-time cache behavior. Previous cache set balancing techniques are either energy-inefficient or require serial tag and data array access. In this paper, an adaptive hybrid cache is proposed that dynamically remaps SPM blocks from high-demand cache sets to low-demand cache sets. This achieves 19%, 25%, 18%, and 18% reductions in the energy-runtime product over four previous representative techniques on a wide range of benchmarks.
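A toy model of the remapping step is sketched below under several assumptions that the abstract does not state: the number of SPM blocks is fixed, a remap table (here a plain list) records which set hosts each SPM block, once per epoch one SPM block moves from the set with the most misses to the set with the fewest, and every set keeps at least one cache way. `HybridCacheMapper` and its methods are hypothetical, and data movement and energy are not modeled.

```python
class HybridCacheMapper:
    """Hypothetical remap table for a hybrid cache/SPM: each set has `ways`
    frames, SPM blocks occupy some of them and the rest act as cache."""

    def __init__(self, num_sets: int, ways: int, spm_blocks: int):
        self.ways = ways
        # Remap table: SPM block id -> set index (initially spread round-robin).
        self.spm_set = [b % num_sets for b in range(spm_blocks)]
        self.misses = [0] * num_sets  # per-set demand counted over one epoch

    def spm_ways(self, s: int) -> int:
        return self.spm_set.count(s)

    def cache_ways(self, s: int) -> int:
        return self.ways - self.spm_ways(s)

    def record_miss(self, s: int) -> None:
        self.misses[s] += 1

    def rebalance(self):
        """Move one SPM block from the highest-miss set to the lowest-miss set.
        Returns the id of the moved block, or None if nothing moved."""
        sets = range(len(self.misses))
        hot = max(sets, key=self.misses.__getitem__)
        cold = min(sets, key=self.misses.__getitem__)
        moved = None
        # Only remap if it leaves the low-demand set at least one cache way.
        if hot != cold and self.cache_ways(cold) > 1:
            for b, s in enumerate(self.spm_set):
                if s == hot:
                    self.spm_set[b] = cold  # hot set regains a cache way
                    moved = b
                    break
        self.misses = [0] * len(self.misses)  # start a new epoch
        return moved
```

The design choice illustrated is that SPM capacity is conserved while cache ways migrate toward the sets that miss most; when all sets see the same demand, nothing is remapped.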
Intel Labs Pittsburgh
Abstract
Off-chip main memory has long been a bottleneck for system performance. With increasing memory pressure due to multiple on-chip cores, effective cache utilization is important. In a system with limited cache space, we would ideally like to prevent 1) cache pollution, i.e., blocks with low reuse evicting blocks with high reuse from the cache, and 2) cache thrashing, i.e., blocks with high reuse evicting each other from the cache. In this paper, we propose a new, simple mechanism to predict the reuse behavior of missed cache blocks in a manner that mitigates both pollution and thrashing. Our mechanism tracks the addresses of recently evicted blocks in a structure called the Evicted-Address Filter (EAF). Missed blocks whose addresses are present in the EAF are predicted to have high reuse, and all other blocks are predicted to have low reuse. The key observation behind this prediction scheme is that if a block with high reuse is prematurely evicted from the cache, it will be accessed soon after eviction. We show that an EAF implementation using a Bloom filter, which is cleared periodically, naturally mitigates the thrashing problem by ensuring that only a portion of a thrashing working set is retained in the cache, while incurring low storage cost and implementation complexity. We compare our EAF-based mechanism to five state-of-the-art mechanisms that address cache pollution or thrashing, and show that it provides significant performance improvements for a wide variety of workloads and system configurations.
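The Bloom-filter EAF and its periodic clearing can be sketched as below. Several details are assumptions made for illustration: the filter is cleared after as many insertions as the cache has blocks, the filter is sized at `bits_per_addr` bits per tracked address, the hash positions come from salted SHA-256 digests (an arbitrary choice), and predicted low-reuse blocks are inserted at the LRU position (the abstract does not say where they go). `BloomFilter` and `EAFCache` are hypothetical names.

```python
import hashlib
from collections import OrderedDict


class BloomFilter:
    """Plain Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.bits = [False] * num_bits
        self.num_hashes = num_hashes

    def _positions(self, addr: int):
        # k bit positions derived from salted SHA-256 digests (arbitrary choice).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % len(self.bits)

    def add(self, addr: int) -> None:
        for pos in self._positions(addr):
            self.bits[pos] = True

    def __contains__(self, addr: int) -> bool:
        return all(self.bits[pos] for pos in self._positions(addr))

    def clear(self) -> None:
        self.bits = [False] * len(self.bits)


class EAFCache:
    """Toy fully associative cache with an Evicted-Address Filter."""

    def __init__(self, capacity: int, bits_per_addr: int = 8, num_hashes: int = 3):
        self.capacity = capacity
        self.lines = OrderedDict()  # first key = LRU, last key = MRU
        self.eaf = BloomFilter(capacity * bits_per_addr, num_hashes)
        self.eaf_inserts = 0
        self.hits = self.misses = 0

    def access(self, addr: int) -> bool:
        if addr in self.lines:
            self.lines.move_to_end(addr)
            self.hits += 1
            return True
        self.misses += 1
        high_reuse = addr in self.eaf  # recently evicted => predicted high reuse
        if len(self.lines) == self.capacity:
            victim, _ = self.lines.popitem(last=False)
            self.eaf.add(victim)
            self.eaf_inserts += 1
            # Assumed clearing rule: reset after as many insertions as the
            # cache has blocks, so a thrashing set cannot stay "high reuse".
            if self.eaf_inserts == self.capacity:
                self.eaf.clear()
                self.eaf_inserts = 0
        self.lines[addr] = None
        if not high_reuse:
            self.lines.move_to_end(addr, last=False)  # low reuse: insert at LRU
        return False
```

On a cyclic pattern over six blocks, a four-block LRU cache never hits. In this sketch, blocks not found in the filter are inserted at the LRU position and mostly displace each other, so part of the cyclic working set stays resident and hits on later passes; clearing the filter keeps it from predicting high reuse for the whole thrashing set.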

