Managing Wire Delay in Chip Multiprocessor Caches, 2006
Abstract - Cited by 4 (1 self)
Increasing on-chip wire delay and growing off-chip miss latency present two key challenges in designing large Level-2 (L2) CMP caches. Currently, some CMPs use a shared L2 cache to maximize cache capacity and minimize off-chip misses. Others use private L2 caches, replicating data to limit the delay from slow on-chip wires and minimize cache access time. Ideally, to improve performance for a wide variety of workloads, CMPs prefer both the capacity of a shared cache and the access latency of private caches. In this thesis, we propose three techniques that combine the benefits of shared and private caches. In particular, to reduce access latency in a shared cache, we investigate cache block migration and on-chip transmission lines. Migration reduces access latency by moving frequently used blocks towards the lower-latency banks. We show migration successfully reduces latency to blocks requested by only one processor, but doesn't reduce the latency to shared blocks. In contrast, transmission lines can reduce on-chip wire delay by an order of magnitude versus conventional wires and provide low latency to all shared cache banks. We demonstrate on-chip transmission lines consistently improve performance versus a baseline shared cache, but bandwidth contention can limit them from reaching their full potential. To improve the effective capacity of private caches, we propose Adaptive Selective Replication (ASR). ASR dynamically monitors workload behavior and replicates cache blocks only when it estimates the benefit of replication (lower L2 hit latency) exceeds the cost (more L2 misses). When ASR detects replication is less beneficial, processors coordinate writebacks with remote on-chip caches to conserve cache storage. ASR provides a robust CMP cache hierarchy: improving performance versus both shared and private caches.
Additionally, ASR can leverage the fast remote cache access latency provided by transmission lines and reduce off-chip misses versus a design using conventional wires. We demonstrate the combination of transmission lines and ASR outperforms either isolated technique and performs similarly to a shared cache using four times the transmission line bandwidth.
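The core of ASR, as the abstract describes it, is a runtime cost/benefit comparison: replicate a block only when the latency saved by converting remote L2 hits into local hits is estimated to exceed the penalty of the extra misses that replication's lost capacity would cause. A minimal sketch of that decision rule (function name, parameters, and cycle counts are illustrative assumptions, not the thesis's actual mechanism or numbers):

```python
def should_replicate(remote_hits, local_latency, remote_latency,
                     extra_misses, miss_penalty):
    """ASR-style decision sketch: replicate a block only when the
    estimated cycles saved by turning remote L2 hits into local hits
    exceed the estimated cycles lost to extra off-chip misses caused
    by the capacity consumed by replicas. All inputs are hypothetical
    per-interval counters and latencies (in cycles)."""
    benefit = remote_hits * (remote_latency - local_latency)
    cost = extra_misses * miss_penalty
    return benefit > cost

# 1000 remote hits saving 30 cycles each (30000) outweighs
# 50 extra misses at 400 cycles each (20000) -> replicate.
print(should_replicate(1000, 20, 50, 50, 400))   # True
# Doubling the extra misses (40000 cycles lost) flips the decision.
print(should_replicate(1000, 20, 50, 100, 400))  # False
```

In the thesis, such estimates would come from hardware monitoring of workload behavior rather than fixed inputs; this sketch only shows the shape of the comparison.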
Software Prepromotion for Non-Uniform Cache Architecture
Abstract—As a solution to growing global wire delay, non-uniform cache architecture (NUCA) has already become a trend in large cache designs. The access time of NUCA is determined by the distance between the cache bank containing the required data and the processor. Thus, an important line of NUCA research focuses on how to place data to be used into cache banks close to the processor. This paper proposes a software prepromotion technique, which prepromotes data using prepromotion instructions, much as software prefetching does. Beyond basic software prepromotion, this paper also proposes smart multihop software prepromotion (SMSP), very long software prepromotion (VLSP), and a technique combining them. SMSP intelligently chooses the cache banks into which prepromoted data are best moved, and VLSP prepromotes multiple data items with one instruction. Finally, we evaluate our approaches by testing 7 kernel benchmarks on a full-system simulator. Basic software prepromotion achieves an average IPC improvement of 2.6893%. SMSP improves IPC by 7.0928% on average, and VLSP by 7.2194% on average. Lastly, combining SMSP and VLSP achieves an average IPC improvement of 11.8650%. Index Terms—NUCA, software prepromotion, smart multihop software prepromotion, very long software prepromotion, prefetching I.
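The idea behind prepromotion is that a NUCA access costs whatever the block's current bank costs, so software can issue a hint that migrates a soon-to-be-used block into a low-latency bank before the real access arrives, analogous to a prefetch. A toy model of that effect (class, method names, and latencies are assumptions for illustration; the real technique uses dedicated prepromotion instructions, not library calls):

```python
class NucaCache:
    """Toy NUCA model: each bank has a fixed access latency, and a
    block's access time is the latency of the bank that holds it."""

    def __init__(self, bank_latency):
        self.bank_latency = bank_latency  # bank_latency[i] = cycles for bank i
        self.location = {}                # block address -> bank index

    def access(self, addr):
        """Latency of accessing addr in whichever bank holds it now."""
        return self.bank_latency[self.location[addr]]

    def prepromote(self, addr):
        """Migrate the block to the lowest-latency bank ahead of its
        use, standing in for a software prepromotion instruction."""
        nearest = min(range(len(self.bank_latency)),
                      key=self.bank_latency.__getitem__)
        self.location[addr] = nearest

cache = NucaCache(bank_latency=[4, 8, 16, 32])
cache.location[0x100] = 3        # block starts in the farthest bank
before = cache.access(0x100)     # pays the 32-cycle far-bank latency
cache.prepromote(0x100)          # hint issued before the real use
after = cache.access(0x100)      # now pays the 4-cycle near-bank latency
print(before, after)             # 32 4
```

SMSP would refine the `prepromote` target choice (which bank suits the data best, possibly over multiple hops) and VLSP would batch several addresses into one instruction; this sketch only captures the basic latency-reduction mechanism.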