Results 11-20 of 37
P-OPT: Program-Directed Optimal Cache Management
Cited by 3 (3 self)
Abstract. As the amount of on-chip cache increases as a result of Moore’s law, cache utilization is increasingly important as the number of processor cores multiplies and the contention for memory bandwidth becomes more severe. Optimal cache management requires knowing the future access sequence and being able to communicate this information to hardware. The paper addresses the communication problem with two new optimal algorithms for Program-directed OPTimal cache management (P-OPT), in which a program designates certain accesses as bypasses and trespasses through an extended hardware interface to effect optimal cache utilization. The paper proves the optimality of the new methods, examines their theoretical properties, and shows the potential benefit using a simulation study and a simple test on a multi-core, multi-processor PC.
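The dependence of optimal management on the future access sequence can be illustrated with the classical Belady MIN policy, which evicts the block whose next use lies furthest in the future. The sketch below is not the paper's P-OPT algorithms (bypasses and trespasses are not modeled); it assumes a fully associative cache, and the function names `opt_misses` and `lru_misses` are hypothetical.

```python
from collections import OrderedDict


def opt_misses(trace, capacity):
    """Count misses under Belady's MIN for a fully associative cache."""
    # next_use[i] is the index of the next access to trace[i], or infinity.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # block -> index of its next use
    misses = 0
    for i, block in enumerate(trace):
        if block not in cache:
            misses += 1
            if len(cache) >= capacity:
                # Evict the block that is reused furthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[block] = next_use[i]
    return misses


def lru_misses(trace, capacity):
    """Count misses under LRU for the same cache, as a baseline."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # drop the least recently used
            cache[block] = True
    return misses


trace = list("abcabcabc")
print(opt_misses(trace, 2), lru_misses(trace, 2))
```

On a cyclic trace that is one block larger than the cache, LRU misses on every access, while the future-aware policy keeps part of the working set resident.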
P-OPT: Program-directed Optimal Cache Management
Cited by 2 (2 self)
Abstract. As the amount of on-chip cache increases as a result of Moore’s law, cache utilization is increasingly important as the number of processor cores multiplies and the contention for memory bandwidth becomes more severe. Optimal cache management requires knowing the future access sequence and being able to communicate this information to hardware. The paper addresses the communication problem with two new optimal algorithms for Program-directed OPTimal cache management (P-OPT), in which a program designates certain accesses as bypasses and trespasses through an extended hardware interface to effect optimal cache utilization. The paper proves the optimality of the new methods, examines their theoretical properties, and shows the potential benefit using a simulation study and a simple test on a multi-core, multi-processor PC.
SHiP: Signature-based Hit Predictor for High Performance Caching
Cited by 2 (1 self)
The shared last-level caches (LLCs) in CMPs play an important role in improving application performance and reducing off-chip memory bandwidth requirements. In order to use LLCs more efficiently, recent research has shown that changing the re-reference prediction on cache insertions and cache hits can significantly improve cache performance. A fundamental challenge, however, is how to best predict the re-reference pattern of an incoming cache line. This paper shows that cache performance can be improved by correlating the re-reference behavior of a cache line with a unique signature. We investigate the use of memory region, program counter, and instruction sequence history based signatures. We also propose a novel Signature-based Hit Predictor (SHiP) to learn the re-reference behavior of cache lines belonging to each signature. Overall, we find that SHiP offers substantial improvements over the baseline LRU replacement and state-of-the-art replacement policy proposals. On average, SHiP improves sequential and multiprogrammed application performance by roughly 10% and 12% over LRU replacement, respectively. Compared to recent replacement policy proposals such as Seg-LRU and SDBP, SHiP nearly doubles the performance gains while requiring less hardware overhead.
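The idea of learning re-reference behavior per signature might be sketched as a table of saturating counters. The abstract does not give the table size, counter width, initial value, or update rules, so everything below (including the class name `SignatureHitPredictor` and its methods) is an illustrative assumption rather than the paper's design.

```python
class SignatureHitPredictor:
    """Table of saturating counters indexed by a signature (e.g. a PC)."""

    def __init__(self, table_size=1024, counter_max=7):
        # Sizes and the initial counter value are assumptions for illustration.
        self.table_size = table_size
        self.counter_max = counter_max
        self.counters = [1] * table_size

    def _index(self, signature):
        return signature % self.table_size

    def on_hit(self, signature):
        """A line inserted under this signature was re-referenced."""
        i = self._index(signature)
        self.counters[i] = min(self.counter_max, self.counters[i] + 1)

    def on_evict_without_reuse(self, signature):
        """A line inserted under this signature was evicted untouched."""
        i = self._index(signature)
        self.counters[i] = max(0, self.counters[i] - 1)

    def predicts_reuse(self, signature):
        """A zero counter means lines from this signature rarely hit."""
        return self.counters[self._index(signature)] > 0


predictor = SignatureHitPredictor()
streaming_pc, reusing_pc = 0x400, 0x500  # hypothetical program counters
predictor.on_evict_without_reuse(streaming_pc)
predictor.on_hit(reusing_pc)
print(predictor.predicts_reuse(streaming_pc), predictor.predicts_reuse(reusing_pc))
```

A replacement policy could consult `predicts_reuse` at insertion time and give lines with a negative prediction a lower retention priority.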
Double-DIP: Augmenting DIP with Adaptive Promotion Policies to Manage Shared L2 Caches
Cited by 1 (1 self)
In this paper, we study how the Dynamic Insert Policy (DIP) cache mechanism behaves in a multi-core shared-cache environment. Based on our observations, we explore a new direction in the design space of caches called the promotion policy. In a conventional LRU-based cache, a hit causes the line to be promoted to the MRU position in the recency stack. Instead, we suggest an incremental promotion policy where each hit on a cache line progressively moves it toward the MRU position. We describe a generalization of the DIP approach that can simultaneously adapt both the insertion and promotion policies of a shared multi-core cache. Our preliminary results indicate that promotion policies are a promising avenue to further improve the behavior of shared L2 caches.
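The difference between conventional and incremental promotion can be sketched on a recency stack for one cache set. This is a minimal illustration, assuming the stack is a Python list with index 0 as the MRU position and a promotion step of one position per hit; the step size and the function names are assumptions, not details from the abstract.

```python
def promote_to_mru(stack, block):
    """Conventional LRU: a hit moves the block straight to the MRU position."""
    stack.remove(block)
    stack.insert(0, block)


def promote_incrementally(stack, block, step=1):
    """Incremental promotion: a hit moves the block `step` positions toward MRU."""
    pos = stack.index(block)
    new_pos = max(0, pos - step)
    stack.insert(new_pos, stack.pop(pos))


conventional = ["a", "b", "c", "d"]  # index 0 is MRU, last index is LRU
promote_to_mru(conventional, "d")
print(conventional)

incremental = ["a", "b", "c", "d"]
promote_incrementally(incremental, "d")
print(incremental)
```

Under incremental promotion a block needs several hits to reach the MRU position, so a line that is touched only once or twice stays close to the eviction end of the stack.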
Characterization and Dynamic Mitigation of Intra-Application Cache Interference
Cited by 1 (1 self)
Abstract—Given the emerging dominance of CMPs, an important research problem concerns application memory performance in the face of deep memory hierarchies, where one or more caches are shared by several cores. In current systems, many factors can cause interference in the shared last-level cache (LLC). While predicting an application’s memory performance is difficult enough in an idealized setup, it becomes even more complicated in real-machine environments in which interference can stem from operating system memory accesses, and even from an application’s own prefetch requests and page table walks caused by TLB misses. This paper characterizes the degree to which intra-application interference factors such as page table walks and hardware prefetching influence performance. Using hardware performance counters on an Intel platform, we first characterize real-system LLC interference and show that application data memory references represent much less than half of the LLC misses, with hardware prefetching and page table walks causing considerable LLC interference. Based on these characterizations, we propose dynamic management methods to reduce intra-application interference. First, we evaluate a dynamic OS-reference-aware cache insertion policy that reduces interference and improves user IPCs by as much as 19% (5% on average). Second, to mitigate prefetch-induced LLC interference, we propose, implement, and evaluate an automatic prefetch manager that uses Intel PEBS capabilities to dynamically estimate prefetch-induced interference and accordingly adjust the aggressiveness of hardware prefetchers as programs run. Overall, our characterizations are important in highlighting the challenges of intra-application interference, and our hardware and software proposals offer significant solutions for addressing them.
The Sharing Tracker: Using Ideas from Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches
Cited by 1 (0 self)
Abstract—Graphics Processing Units (GPUs) have recently emerged as a new platform for high performance, general-purpose computing. Because current GPUs employ deep multithreading to hide latency, they only have small, per-core caches to capture reuse and eliminate unnecessary off-chip accesses. This paper shows that for general-purpose workloads, the ability to copy cache lines between private caches captures inter-core temporal locality and provides substantial reductions in off-chip bandwidth requirements. Unlike hardware cache coherence, a sharing tracker only needs to track cache lines in the private caches imprecisely, because it is only a performance hint. This simplifies the implementation and is so effective at capturing inter-core reuse that the L2 can be eliminated entirely. The sharing tracker is motivated by but not specific to the GPU and could be used in other manycore organizations.
Efficient Throughput Cores for Asymmetric Manycore Processors, 2009
The microprocessor industry has had to switch from developing ever more complex and more deeply pipelined single-core processors to multicore processors due to running into power, thermal and complexity limits. Future microprocessors will be asymmetric manycore chip multiprocessors, with a small number of complex cores for serial programs and serial sections of parallel programs. The majority of the cores will be small, power- and area-efficient cores to maximize overall throughput in a limited power budget. The main contributions of this dissertation are techniques for improving the performance and area-efficiency of these throughput-oriented cores. This work shows how the single-thread performance of small, scalar cores can be increased, and how such cores can be dynamically combined, to speed up programs with only a limited number of parallel threads. It also shows how to improve both the cores and the cache subsystem of a multicore processor using SIMD cores.
Deconstructing the Inefficacy of Global Cache Replacement Policies
In a conventional two-level cache hierarchy, L1 cache hits do not propagate to the L2 cache; as a result, the L2 cache only observes a “filtered” memory access stream. A frequently accessed address may hit in the L1, but since these accesses never make it to the L2, the corresponding copy in the L2 will “decay” with respect to its replacement policy state and may eventually get evicted. Previous studies have advocated the use of global replacement policies where the L1 access information propagates to the L2 to maintain a replacement policy state that is consistent with the overall global memory access stream. We first attempt to duplicate previously reported results on global cache replacement policies. Despite the intuitive explanation for why a global scheme should work, our experimental results show that the performance potential of global replacement is very limited. We deconstruct the problem with reuse-distance analysis and show that only under very specific reuse-distance profiles will a program be able to benefit from global replacement. Our experiments include the evaluation of multi-core shared caches, inclusive cache hierarchies, and a wide spectrum of cache sizes and associativities; we show that global replacement fails to provide significant performance benefits for any of these scenarios.
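Reuse-distance analysis can be sketched as follows: for each access, count the distinct blocks touched since the previous access to the same block. This is a generic textbook formulation, not the paper's tooling; the function name `reuse_distances` and the use of `None` for first-time accesses are assumptions.

```python
def reuse_distances(trace):
    """Return the reuse distance of each access, or None for a first access.

    The distance is the number of distinct blocks accessed since the
    previous access to the same block.
    """
    stack = []  # blocks ordered from least to most recently used
    distances = []
    for block in trace:
        if block in stack:
            pos = stack.index(block)
            # Every block above `pos` was touched since the last access.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(block)
    return distances


print(reuse_distances(list("abcab")))
```

In a fully associative LRU cache with capacity C, an access hits exactly when its reuse distance is defined and smaller than C, which is why a histogram of these distances characterizes how a program responds to different cache sizes.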
Understanding the Limits of Capacity Sharing in CMP Private Caches
Abstract—Chip Multi Processor (CMP) systems present interesting design challenges at the lower levels of the cache hierarchy. Private L2 caches allow easier processor-cache design reuse, thus scaling better than a system with a shared L2 cache, while offering better performance isolation and lower access latency. While some private cache management schemes that utilize space in peer private L2 caches have been recently proposed, we find that there is significant potential for improving their performance. We propose and study an oracular scheme, OPT, which identifies the performance limits of schemes to manage private caches. OPT uses offline-generated traces of cache accesses to uncover applications’ reuse patterns. OPT uses this perfect knowledge of each application’s future memory accesses to optimally place cache blocks brought on-chip in either the local or a remote private L2 cache. We discover that in order to optimally manage private caches, peer private caches must be utilized not only at local-cache-replacement time, as has been previously proposed, but also at cache-placement time: it may be better to place a missed block directly into a peer L2 rather than the traditional approach of first bringing it into the local L2. We implement OPT on a 4-core CMP with 512KB, 8-way, private caches, across 10 carefully chosen, relevant, multiprogram workload mixes. We find that compared to a baseline system that does not employ capacity sharing across private caches, OPT improves weighted speedup (the performance metric) by 13.4% on average. Further, compared to the state-of-the-art technique for private cache management, OPT improves weighted speedup by 11.2%. This shows the significant potential that exists for improving previously proposed private cache management schemes.
“Doctor of Philosophy”
To my beloved Adina, for the loving support and never ending acceptance, and for never laughing out loud when I said I’m simply thinking about my research with my eyes closed. To my advisor Dror Feitelson (who I still suspect was not in his right mind to take me as a student), for being a true role model and mentor, and for giving real meaning to the title of an academic father. To my parents, for believing despite what all the teachers said. And finally to my feisty son Yotam: although you only recently joined the team, your smile makes all the difference, redhead. The increasing gap between processor and memory speeds, as well as the introduction of multicore CPUs, have exacerbated the dependency of CPU performance on the memory subsystem. This trend motivates the search for more efficient caching mechanisms, enabling both faster service of frequently used blocks and decreased power consumption. This thesis explores the temporal locality phenomenon in an effort to devise such efficient caching mechanisms. Specifically, it is shown that while Denning’s working sets model puts

