Results 1 - 10 of 18
SHiP: Signature-based Hit Predictor for High Performance Caching
"... The shared last-level caches in CMPs play an important role in improving application performance and reducing off-chip memory bandwidth requirements. In order to use LLCs more efficiently, recent research has shown that changing the re-reference prediction on cache insertions and cache hits can sign ..."
Abstract
-
Cited by 22 (3 self)
The shared last-level caches in CMPs play an important role in improving application performance and reducing off-chip memory bandwidth requirements. In order to use LLCs more efficiently, recent research has shown that changing the re-reference prediction on cache insertions and cache hits can significantly improve cache performance. A fundamental challenge, however, is how to best predict the re-reference pattern of an incoming cache line. This paper shows that cache performance can be improved by correlating the re-reference behavior of a cache line with a unique signature. We investigate the use of memory region, program counter, and instruction sequence history based signatures. We also propose a novel Signature-based Hit Predictor (SHiP) to learn the re-reference behavior of cache lines belonging to each signature. Overall, we find that SHiP offers substantial improvements over the baseline LRU replacement and state-of-the-art replacement policy proposals. On average, SHiP improves sequential and multiprogrammed application performance by roughly 10% and 12% over LRU replacement, respectively. Compared to recent replacement policy proposals such as Seg-LRU and SDBP, SHiP nearly doubles the performance gains while requiring less hardware overhead.
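As a rough illustration of the signature idea, the sketch below models a SHiP-style table of saturating counters indexed by a hashed program-counter signature; the table size, counter width, hash, and training points are illustrative assumptions rather than the paper's exact design.

class SHiPStylePredictor:
    def __init__(self, table_size=16384, counter_max=7):
        self.table_size = table_size
        self.counter_max = counter_max
        # Signature History Counter Table: one saturating counter per signature.
        self.shct = [counter_max // 2] * table_size

    def _signature(self, pc):
        # Hash the PC of the instruction that filled the line into a table index.
        return (pc ^ (pc >> 14)) % self.table_size

    def train_on_reuse(self, pc):
        # A line filled under this signature was hit again: raise its counter.
        sig = self._signature(pc)
        self.shct[sig] = min(self.shct[sig] + 1, self.counter_max)

    def train_on_eviction(self, pc, was_reused):
        # A line filled under this signature left the cache without reuse: lower it.
        if not was_reused:
            sig = self._signature(pc)
            self.shct[sig] = max(self.shct[sig] - 1, 0)

    def predict_no_reuse(self, pc):
        # A zero counter means lines from this signature rarely see reuse, so the
        # replacement policy should insert them with a distant re-reference prediction.
        return self.shct[self._signature(pc)] == 0

pred = SHiPStylePredictor()
for _ in range(4):
    pred.train_on_eviction(pc=0x400812, was_reused=False)
print(pred.predict_no_reuse(0x400812))  # True: insert such fills as likely dead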
Simple but effective heterogeneous main memory with on-chip memory controller support. SC ’10
"... promising technologies to bring more memory onto a microprocessor package to mitigate the “memory wall ” problem. In this paper, instead of using them to build caches, we study a heterogenous main memory using both on- and off-package memories providing both fast and high-bandwidth on-package access ..."
Abstract
-
Cited by 21 (0 self)
… promising technologies to bring more memory onto a microprocessor package to mitigate the "memory wall" problem. In this paper, instead of using them to build caches, we study a heterogeneous main memory using both on- and off-package memories, providing both fast, high-bandwidth on-package accesses and expandable, low-cost commodity off-package memory capacity. We introduce another layer of address translation coupled with an on-chip memory controller that can dynamically migrate data between on-package and off-package memory, either in hardware or with operating system assistance, depending on the migration granularity. Our experimental results demonstrate that such a design can achieve, on average, 83% of the effectiveness of the ideal case where all memory can be placed in high-speed on-package memory, for our simulated benchmarks.
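A minimal sketch of the extra translation layer described above, assuming a simple hotness threshold and a small on-package capacity; the threshold, capacities, and coldest-page eviction rule are illustrative assumptions, not the paper's migration policy.

class MigratingMemoryController:
    def __init__(self, on_package_pages=4, hot_threshold=4):
        self.on_package_pages = on_package_pages
        self.hot_threshold = hot_threshold
        self.remap = {}           # pages currently remapped to on-package memory
        self.access_counts = {}

    def access(self, page):
        self.access_counts[page] = self.access_counts.get(page, 0) + 1
        if page not in self.remap and self.access_counts[page] >= self.hot_threshold:
            self._migrate(page)
        return "on-package" if page in self.remap else "off-package"

    def _migrate(self, page):
        if len(self.remap) >= self.on_package_pages:
            # Evict the coldest on-package page back to off-package memory.
            coldest = min(self.remap, key=lambda p: self.access_counts[p])
            del self.remap[coldest]
        self.remap[page] = True

controller = MigratingMemoryController()
for page in [1, 2, 1, 1, 1, 3, 1]:
    print(page, controller.access(page))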
Die-Stacked DRAM Caches for Servers: Hit Ratio, Latency, or Bandwidth? Have It All with Footprint Cache
"... Recent research advocates using large die-stacked DRAM caches to break the memory bandwidth wall. Existing DRAM cache designs fall into one of two categories — block-based and pagebased. The former organize data in conventional blocks (e.g., 64B), ensuring low off-chip bandwidth utilization, but co- ..."
Abstract
-
Cited by 11 (1 self)
Recent research advocates using large die-stacked DRAM caches to break the memory bandwidth wall. Existing DRAM cache designs fall into one of two categories: block-based and page-based. The former organize data in conventional blocks (e.g., 64B), ensuring low off-chip bandwidth utilization, but co-locate tags and data in the stacked DRAM, incurring high lookup latency. Furthermore, such designs suffer from low hit ratios due to poor temporal locality. In contrast, page-based caches, which manage data at larger granularity (e.g., 4KB pages), allow for reduced tag array overhead and fast lookup, and leverage high spatial locality at the cost of moving large amounts of data on and off the chip. This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors. Footprint Cache allocates data at the granularity of pages, but identifies and fetches only those blocks within a page that will be touched during the page's residency in the cache, i.e., the page's footprint. In doing so, Footprint Cache eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency. Cycle-accurate simulation results of a 16-core server with up to 512MB of Footprint Cache indicate a 57% performance improvement over a baseline chip without a die-stacked cache. Compared to a state-of-the-art block-based design, our design improves performance by 13% while reducing dynamic energy of stacked DRAM by 24%.
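The sketch below conveys the footprint-prediction step under simplifying assumptions: it keys the footprint history only on the page offset of the triggering access (the actual design also uses the requesting PC), and the page and block sizes are the 4KB/64B examples from the abstract.

BLOCKS_PER_PAGE = 64  # 4KB page managed as 64 blocks of 64B

class FootprintPredictor:
    def __init__(self):
        self.history = {}  # trigger block offset -> bitmask of blocks to fetch

    def predict(self, trigger_block):
        # With no history, fetch only the block that triggered the allocation.
        return self.history.get(trigger_block, 1 << trigger_block)

    def record_eviction(self, trigger_block, touched_mask):
        # On page eviction, remember which blocks were actually touched.
        self.history[trigger_block] = touched_mask

predictor = FootprintPredictor()
# A page triggered by block 0 was evicted after blocks 0, 1, and 5 were touched:
predictor.record_eviction(trigger_block=0, touched_mask=(1 << 0) | (1 << 1) | (1 << 5))
# The next allocation triggered by block 0 fetches just that footprint:
mask = predictor.predict(0)
print([b for b in range(BLOCKS_PER_PAGE) if mask >> b & 1])  # [0, 1, 5]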
PACMan: Prefetch-Aware Cache Management for High Performance Caching
"... Hardware prefetching and last-level cache (LLC) management are two independent mechanisms to mitigate the growing latency to memory. However, the interaction between LLC management and hardware prefetching has received very little attention. This paper characterizes the performance of state-of-the-a ..."
Abstract
-
Cited by 11 (1 self)
Hardware prefetching and last-level cache (LLC) management are two independent mechanisms to mitigate the growing latency to memory. However, the interaction between LLC management and hardware prefetching has received very little attention. This paper characterizes the performance of state-of-the-art LLC management policies in the presence and absence of hardware prefetching. Although prefetching improves performance by fetching useful data in advance, it can interact with LLC management policies to introduce application performance variability. This variability stems from the fact that current replacement policies treat prefetch and demand requests identically. In order to provide better and more predictable performance, we propose Prefetch-Aware Cache Management (PACMan). PACMan dynamically estimates and mitigates the degree of prefetch-induced cache interference by modifying the cache insertion and hit promotion policies to treat demand and prefetch requests differently. Across a variety of emerging workloads, we show that PACMan eliminates the performance variability in state-of-the-art replacement policies under the influence of prefetching. In fact, PACMan improves performance consistently across multimedia, games, server, and SPEC CPU2006 workloads by an average of 21.9% over the baseline LRU policy. For multiprogrammed workloads, on a 4-core CMP, PACMan improves performance by 21.5% on average.
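As a hedged sketch of treating the two request types differently in an RRIP-style cache: prefetched lines are inserted with a distant re-reference prediction and are not promoted on prefetch hits, while demand requests keep the usual near insertion and promotion. The RRPV values are illustrative, not PACMan's tuned policy.

RRPV_MAX = 3  # 2-bit re-reference prediction values

def insertion_rrpv(is_prefetch):
    # Prefetches start "distant" so unneeded prefetched data ages out quickly.
    return RRPV_MAX if is_prefetch else RRPV_MAX - 1

def promotion_rrpv(current_rrpv, hit_is_prefetch):
    # Demand hits promote to the nearest position; prefetch hits do not promote.
    return current_rrpv if hit_is_prefetch else 0

print(insertion_rrpv(is_prefetch=True), insertion_rrpv(is_prefetch=False))                # 3 2
print(promotion_rrpv(2, hit_is_prefetch=True), promotion_rrpv(2, hit_is_prefetch=False))  # 2 0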
Insertion Policy Selection Using Decision Tree Analysis
"... The last-level cache (LLC) mitigates the impact of long memory access latencies in today’s microarchitectures. The insertion policy in the LLC can have a significant impact on cache efficiency. However, a fixed insertion policy can allow useless blocks to remain in the cache longer than necessary, r ..."
Abstract
-
Cited by 1 (0 self)
The last-level cache (LLC) mitigates the impact of long memory access latencies in today's microarchitectures. The insertion policy in the LLC can have a significant impact on cache efficiency. However, a fixed insertion policy can allow useless blocks to remain in the cache longer than necessary, resulting in inefficiency. We introduce insertion policy selection using Decision Tree Analysis (DTA). The technique requires minimal hardware modification over the least-recently-used (LRU) replacement policy. The policy exploits the fact that the LLC filters temporal locality: many of the lines brought into the cache are never accessed again, and even when they are re-accessed they do not experience bursts of reuse, but rather are reused when they are near the LRU position in the LRU stack. We use decision tree analysis of multi-set-dueling to choose the optimal insertion position in the LRU stack. By inserting at this position, zero-reuse lines minimize their dead time, while non-zero-reuse lines remain in the cache long enough to be reused and avoid a miss. For a 1MB, 16-way set-associative last-level cache in a single-core processor, our entry uses only 2069 bits over the LRU replacement policy.
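The sketch below only conveys the set-dueling ingredient in flattened form: a few leader sets each insert at a fixed candidate LRU-stack position and count their misses, and follower sets adopt whichever position is doing best. The paper's actual selection runs decision-tree analysis over successive duels; the candidate positions here are illustrative.

class InsertionPositionSelector:
    def __init__(self, candidate_positions=(0, 4, 8, 12)):
        self.candidates = candidate_positions
        self.miss_counters = {pos: 0 for pos in candidate_positions}

    def record_leader_miss(self, position):
        # Called on a miss in a leader set dedicated to this insertion position.
        self.miss_counters[position] += 1

    def best_position(self):
        # Follower sets insert at the candidate position with the fewest misses.
        return min(self.candidates, key=lambda p: self.miss_counters[p])

selector = InsertionPositionSelector()
for pos, misses in [(0, 120), (4, 90), (8, 70), (12, 95)]:
    for _ in range(misses):
        selector.record_leader_miss(pos)
print(selector.best_position())  # 8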
Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache
- in 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec 2014
"... Abstract—Recent research advocates large die-stacked DRAM caches in manycore servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates and the efficient use of off-chip bandwidth. Today’s stacked ..."
Abstract
-
Cited by 1 (0 self)
Recent research advocates large die-stacked DRAM caches in manycore servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates, and efficient use of off-chip bandwidth. Today's stacked DRAM cache designs fall into two categories based on the granularity at which they manage data: block-based and page-based. The state-of-the-art block-based design, called Alloy Cache, co-locates a tag with each data block (e.g., 64B) in the stacked DRAM to provide fast access to data in a single DRAM access. However, such a design suffers from low hit rates due to poor temporal locality in the DRAM cache. In contrast, the state-of-the-art page-based design, called Footprint Cache, organizes the DRAM cache at page granularity (e.g., 4KB), but fetches only the blocks that will likely be touched within a page. In doing so, Footprint Cache achieves high hit rates with moderate on-chip tag storage and reasonable lookup latency. However, multi-gigabyte stacked DRAM caches will soon be practical and needed by server applications, thereby mandating tens of MBs of tag storage even for page-based DRAM caches. We introduce a novel stacked-DRAM cache design, Unison Cache. Similar to Alloy Cache's approach, Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities. Then, leveraging the insights from the Footprint Cache design, Unison Cache employs large, page-sized cache allocation units to achieve high hit rates and reduced tag overheads, while predicting and fetching only the useful blocks within each page to minimize off-chip traffic. Our evaluation using server workloads and caches of up to 8GB reveals that Unison Cache improves performance by 14% compared to Alloy Cache due to its high hit rate, while outperforming state-of-the-art page-based designs that require impractical SRAM-based tags of around 50MB. Keywords: caches; DRAM; 3D die stacking; memory; servers.
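A simplified sketch of the lookup flow, assuming the tag and footprint bits sit in the same stacked-DRAM row as the page-sized entry so that one DRAM access resolves the tag check; the way count and field layout are illustrative, not Unison Cache's actual organization.

class UnisonLikeSet:
    def __init__(self, ways=4):
        # Each way models one page-sized entry whose tag, footprint bitmask, and
        # data blocks all reside in the same stacked-DRAM row in hardware.
        self.ways = [None] * ways

    def lookup(self, page_tag, block):
        for way in self.ways:
            if way is not None and way["tag"] == page_tag:
                if way["footprint"] >> block & 1:
                    return "hit"             # block was fetched with the footprint
                return "miss-fetch-block"    # page resident, block not yet fetched
        return "miss-allocate-page"          # allocate page, fetch predicted footprint

cache_set = UnisonLikeSet()
cache_set.ways[0] = {"tag": 0x1234, "footprint": (1 << 0) | (1 << 3), "data": {}}
print(cache_set.lookup(0x1234, 3))  # hit
print(cache_set.lookup(0x1234, 7))  # miss-fetch-block
print(cache_set.lookup(0x9999, 0))  # miss-allocate-page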
Practical Near-Data Processing for In-memory Analytics Frameworks
"... Abstract—The end of Dennard scaling has made all sys-tems energy-constrained. For data-intensive applications with limited temporal locality, the major energy bottleneck is data movement between processor chips and main memory modules. For such workloads, the best way to optimize energy is to place ..."
Abstract
-
Cited by 1 (1 self)
The end of Dennard scaling has made all systems energy-constrained. For data-intensive applications with limited temporal locality, the major energy bottleneck is data movement between processor chips and main memory modules. For such workloads, the best way to optimize energy is to place processing near the data in main memory. Advances in 3D integration provide an opportunity to implement near-data processing (NDP) without the technology problems that similar efforts had in the past. This paper develops the hardware and software of an NDP architecture for in-memory analytics frameworks, including MapReduce, graph processing, and deep neural networks. We develop simple but scalable hardware support for coherence, communication, and synchronization, and a runtime system that is sufficient to support analytics frameworks with complex data patterns while hiding all the details of the NDP hardware. Our NDP architecture provides up to 16x performance and energy advantage over conventional approaches, and 2.5x over recently proposed NDP systems. We also investigate the balance between processing and memory throughput, as well as the scalability and physical and logical organization of the memory system. Finally, we show that it is critical to optimize software frameworks for spatial locality, as it leads to 2.9x efficiency improvements for NDP. Keywords: near-data processing; processing in memory; energy efficiency; in-memory analytics.
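To illustrate the spatial-locality point in the last sentence, the toy sketch below shards a key-value dataset so that each memory vault's near-data cores apply the map function only to locally stored records; the vault count and hash-based placement are illustrative assumptions, not the paper's runtime.

NUM_VAULTS = 16

def vault_of(key):
    # Place each record in a vault based on its key.
    return hash(key) % NUM_VAULTS

def partition_by_vault(records):
    shards = {v: [] for v in range(NUM_VAULTS)}
    for key, value in records:
        shards[vault_of(key)].append((key, value))
    return shards

def ndp_map(shards, map_fn):
    # Each vault's near-data cores run map_fn over their local shard only,
    # so no record crosses the off-chip memory channels during the map phase.
    return {v: [map_fn(kv) for kv in shard] for v, shard in shards.items()}

shards = partition_by_vault([("a", 1), ("b", 2), ("c", 3)])
results = ndp_map(shards, lambda kv: (kv[0], kv[1] * 2))
print({v: out for v, out in results.items() if out})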
Improving Cache Performance by Exploiting Read-Write Disparity
"... Cache read misses stall the processor if there are no inde-pendent instructions to execute. In contrast, most cache write misses are off the critical path of execution, since writes can be buffered in the cache or the store buffer. With few excep-tions, cache lines that serve loads are more critical ..."
Abstract
-
Cited by 1 (0 self)
Cache read misses stall the processor if there are no independent instructions to execute. In contrast, most cache write misses are off the critical path of execution, since writes can be buffered in the cache or the store buffer. With few exceptions, cache lines that serve loads are more critical for performance than cache lines that serve only stores. Unfortunately, traditional cache management mechanisms do not take this disparity between read and write criticality into account. The key contribution of this paper is the new idea of distinguishing between lines that are reused by reads versus those that are reused only by writes, in order to focus cache management policies on the more critical read lines. We propose a Read-Write Partitioning (RWP) policy that minimizes read misses by dynamically partitioning the cache into clean and dirty partitions, where partitions grow in size if they are more likely to receive future read requests. We show that exploiting the differences in read-write criticality provides better performance than prior cache management mechanisms. For a single-core system, RWP provides a 5% average speedup across the entire SPEC CPU2006 suite, and a 14% average speedup for cache-sensitive benchmarks, over the baseline LRU replacement policy. We also show that RWP can perform within 3% of a new yet complex instruction-address-based technique, Read Reference Predictor (RRP), that bypasses cache lines which are unlikely to receive any read requests, while requiring only 5.4% of RRP's state overhead. On a 4-core system, our RWP mechanism improves system throughput by 6% over the baseline and outperforms three other state-of-the-art mechanisms we evaluate.
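A simplified sketch of the partitioning idea, assuming an epoch-based adjustment rule: the ways of a set are split into clean and dirty partitions, and the partition that served more read hits in the last epoch is granted one more way. The epoch length and one-way step are illustrative assumptions, not RWP's exact mechanism.

class RWPController:
    def __init__(self, ways=16, epoch=1000):
        self.ways = ways
        self.clean_ways = ways // 2   # current target size of the clean partition
        self.epoch = epoch
        self.hit_count = 0
        self.read_hits = {"clean": 0, "dirty": 0}

    def record_read_hit(self, line_is_dirty):
        self.read_hits["dirty" if line_is_dirty else "clean"] += 1
        self.hit_count += 1
        if self.hit_count % self.epoch == 0:
            self._repartition()

    def _repartition(self):
        # Grow the partition that served more read requests in the last epoch.
        if self.read_hits["clean"] > self.read_hits["dirty"]:
            self.clean_ways = min(self.ways - 1, self.clean_ways + 1)
        elif self.read_hits["dirty"] > self.read_hits["clean"]:
            self.clean_ways = max(1, self.clean_ways - 1)
        self.read_hits = {"clean": 0, "dirty": 0}

    def victim_partition(self, clean_lines_in_set):
        # Evict from whichever partition currently exceeds its target share.
        return "clean" if clean_lines_in_set > self.clean_ways else "dirty"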
Shared Last-Level Caches and The Case for Longer Timeslices
"... ABSTRACT Memory performance is important in modern systems. Contention at various levels in memory hierarchy can lead to significant application performance degradation due to interference. Further, modern, large, last-level caches (LLC) have fill times greater than the OS scheduling window. When s ..."
Abstract
- Add to MetaCart
(Show Context)
Memory performance is important in modern systems. Contention at various levels of the memory hierarchy can lead to significant application performance degradation due to interference. Further, modern, large last-level caches (LLCs) have fill times greater than the OS scheduling window. When several threads run concurrently and time-share the CPU cores, they may never be able to load their working sets into the cache before being rescheduled, and thus remain permanently stuck in the "cold-start" regime. We show that by increasing the system scheduling timeslice length it is possible to amortize the cache cold-start penalty due to multitasking and improve application performance by 10-15%.
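A back-of-the-envelope illustration of the fill-time argument, using assumed rather than measured numbers: a thread refilling a 32 MB share of the LLC at an effective 4 GB/s of demand-miss bandwidth needs roughly 8 ms, which already exceeds a typical scheduling quantum of a few milliseconds.

llc_share_bytes = 32 * 1024 * 1024   # assumed per-thread LLC share
fill_bandwidth = 4 * 1024**3         # assumed effective miss bandwidth, bytes/s
timeslice_s = 0.004                  # assumed 4 ms scheduling quantum

fill_time_s = llc_share_bytes / fill_bandwidth
print(f"fill time = {fill_time_s * 1e3:.1f} ms vs timeslice {timeslice_s * 1e3:.0f} ms")
# If the fill time exceeds the timeslice, the thread is rescheduled before its
# working set becomes resident, i.e., it stays in the cold-start regime.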
Author manuscript, published in "JWAC 2010 - 1st JILP Workshop on Computer Architecture Competitions: Cache Replacement Championship (2010)" Insertion Policy Selection Using Decision Tree Analysis
, 2010
"... The last-level cache (LLC) mitigates the impact of long memory access latencies in today’s microarchitectures. The insertion policy in the LLC can have a significant impact on cache efficiency. However, a fixed insertion policy can allow useless blocks to remain in the cache longer than necessary, r ..."
Abstract
- Add to MetaCart
(Show Context)
The last-level cache (LLC) mitigates the impact of long memory access latencies in today’s microarchitectures. The insertion policy in the LLC can have a significant impact on cache efficiency. However, a fixed insertion policy can allow useless blocks to remain in the cache longer than necessary, resulting in inefficiency. We introduce insertion policy selection using Decision Tree Analysis (DTA). The technique requires minimal hardware modification over the least-recently-used (LRU) replacement policy. This policy uses the fact that the LLC filters temporal locality. Many of the lines brought to the cache are never accessed again. Even if they are reaccessed they do not experience bursts, but rather they are reused when they are near to the LRU position in the LRU stack. We use decision tree analysis of multi-set-dueling to choose the optimal insertion position in the LRU stack. Inserting in this position, zero reuse lines minimize their dead time while the non-zero reuse lines remain in the cache long enough to be reused and avoid a miss. For a 1MB 16 way set-associative last level cache in a single core processor, our entry uses only 2069 bits over the LRU replacement policy. 1