Results 1 - 8 of 8
METE: Meeting End-to-End QoS in Multicores through System-Wide Resource Management
- In ACM SIGMETRICS, 2011
"... Management of shared resources in emerging multicores for achiev-ing predictable performance has received considerable attention in recent times. In general, almost all these approaches attempt to guarantee a certain level of performance QoS (weighted IPC, har-monic speedup, etc) by managing a singl ..."
Cited by 13 (1 self)
Management of shared resources in emerging multicores for achieving predictable performance has received considerable attention in recent times. In general, almost all these approaches attempt to guarantee a certain level of performance QoS (weighted IPC, harmonic speedup, etc.) by managing a single shared resource or at most a couple of interacting resources. A fundamental shortcoming of these approaches is the lack of coordination between these shared resources to satisfy a system-level QoS. This is undesirable because providing end-to-end QoS in future multicores is essential for supporting widespread adoption of these architectures in virtualized servers and cloud computing systems. An initial step towards such end-to-end QoS support in multicores is to ensure that at least the major computational and memory resources on-chip are managed efficiently in a coordinated fashion.
The Auction: Optimizing Banks Usage in Non-Uniform Cache Architectures
- In Proc. 24th Int. Conf. Supercomputing, 2010
"... ABSTRACT The growing influence of wire delay in cache design has meant that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiproce ..."
Cited by 3 (0 self)
The growing influence of wire delay in cache design has meant that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessor (CMP) architectures to reduce requests to the off-chip memory, because of the significant speed gap between processor and memory and the limited memory bandwidth. Therefore, a bank replacement policy that efficiently manages the NUCA cache is desirable. However, the decentralized nature of NUCA has prevented previously proposed replacement policies from being effective in this kind of cache. As banks operate independently of each other, their replacement decisions are restricted to a single NUCA bank. We propose a novel mechanism based on the bank replacement policy for NUCA caches on CMPs, called The Auction. This mechanism enables the replacement decisions taken in a single bank to be spread to the whole NUCA cache. Thus, global replacement policies that rely on the current state of the NUCA cache, such as evicting the least frequently accessed data in the whole NUCA cache, become feasible. Moreover, The Auction adapts to current program behaviour in order to relocate a line that is being evicted from a bank to the most suitable position in the whole cache. We propose, implement and evaluate three approaches to The Auction mechanism. We also show that The Auction manages the cache efficiently and significantly reduces requests to the off-chip memory by increasing the hit ratio in the NUCA cache. This translates into an average IPC improvement of 8% and reduces the energy consumed by the memory system by 4%.
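A minimal sketch of how auction-style relocation might look, assuming a hypothetical NUCABank with per-line access counters; the bank interface and the bid metric (each bank bids its own least-frequently-accessed victim) are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch of auction-style relocation in a banked NUCA cache.
# The bank layout, per-line access counters, and bid metric are assumptions.

class NUCABank:
    def __init__(self, bank_id, capacity):
        self.bank_id = bank_id
        self.capacity = capacity
        self.lines = {}  # address -> access count

    def victim(self):
        # Local victim: the least frequently accessed line in this bank.
        return min(self.lines, key=self.lines.get) if self.lines else None

    def insert(self, addr, count=0):
        if len(self.lines) >= self.capacity:
            del self.lines[self.victim()]
        self.lines[addr] = count

def auction_evict(banks, origin, addr):
    """When `origin` evicts `addr`, hold an auction: every other bank bids
    with the access count of its own local victim, and the line relocates
    to the bank holding the globally least frequently accessed victim."""
    count = origin.lines.pop(addr)
    bids = [(b.lines[b.victim()], b) for b in banks
            if b is not origin and b.victim() is not None]
    if bids:
        victim_count, winner = min(bids, key=lambda bid: bid[0])
        if victim_count < count:
            winner.insert(addr, count)  # evictee is hotter: keep it on chip
    # otherwise the evicted line is the globally coldest and leaves the cache
```

The point of the sketch is that a global policy such as evicting the least frequently accessed line in the whole NUCA cache becomes expressible even though each bank still decides locally.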
Clustered Caching for Improving Performance and Energy Requirements in NoC-based Multiprocessors
"... allowing to run larger applications on chip multiprocessors. Parallelism is achieved by running different threads of applications on separate processors. This leads to coherence issues of shared data. As wire delays are dominating in current SoCs, added communication over the interconnect also adds ..."
Cited by 1 (0 self)
... allowing larger applications to run on chip multiprocessors. Parallelism is achieved by running different threads of an application on separate processors, which leads to coherence issues for shared data. As wire delays dominate in current SoCs, communication over the interconnect also adds to latency and power requirements. In this paper we propose to form small clusters of cores that share the same high-level cache, instead of one global, large banked cache. Experimental evaluation shows that clustering improves both performance and power requirements. Research on application mapping on NoCs has shown that assigning nearby cores to an application improves performance; we performed experiments localising an application within a cluster and obtained improvements in performance as well as power.
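A toy sketch of the mapping contrast described above, assuming a 4x4 tiled CMP with one L2 bank per tile and clusters of four consecutively numbered tiles; the grid size, cluster size, modulo hash, and function names are all illustrative assumptions.

```python
# Toy mapping contrast: one global banked L2 vs. per-cluster L2 banks.
# The 4x4 grid, cluster size of 4, and modulo hash are assumed parameters.

GRID = 4          # 4x4 mesh of tiles, one L2 bank per tile
CLUSTER_SIZE = 4  # four consecutively numbered tiles share an L2

def global_home_bank(addr):
    # Global shared L2: the home bank may be any tile on the chip,
    # so a request can cross the whole die.
    return addr % (GRID * GRID)

def clustered_home_bank(addr, core):
    # Clustered L2: the home bank is always inside the requesting
    # core's own cluster, bounding the hop distance of every request.
    cluster = core // CLUSTER_SIZE
    return cluster * CLUSTER_SIZE + addr % CLUSTER_SIZE

def hops(core, bank):
    # Manhattan distance between a core's tile and a bank's tile.
    cx, cy = core % GRID, core // GRID
    bx, by = bank % GRID, bank // GRID
    return abs(cx - bx) + abs(cy - by)
```

Averaging hops(core, bank) over random addresses under the two mappings shows the shorter paths of the clustered scheme, which is where the latency and power savings would come from.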
Performance Improvement by N-Chance Clustered Caching in NoC-based
"... Abstract — Cache management is one of the key factors that affect the performance of present day Chip Multi-Processors. The main aspects that govern the cache management are the access latency and the cache space utilization. This paper proposes 3-chance clustered caching cache management scheme in ..."
Cited by 1 (0 self)
Cache management is one of the key factors that affect the performance of present-day chip multiprocessors. The main aspects that govern cache management are access latency and cache space utilization. This paper proposes a 3-chance clustered caching management scheme for NoC-based multi-core systems that targets both issues. The L2 banks are formed into clusters and are non-inclusive. The cache management policy concentrates on increasing the lifetime of a cache block by giving it up to three chances through rotation of data among the L2 banks, while clustering keeps the data close to the processors, thereby decreasing access latency. The caches act as non-inclusive to increase the effective cache space; the added access latency is reduced by a two-level directory protocol implemented for cache coherence and by cache clustering. Evicted L1 cache blocks are stored in the cluster's home L2 bank, and evicted L2 cache blocks follow the 3-chance rotation algorithm. Experimental results based on full-system simulation show that for a 16-core 2D-mesh the scheme increases performance by 9-15%.
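A condensed sketch of the 3-chance rotation described in the abstract: an evicted L2 block hops to the next bank in its cluster, carrying a counter of rotations used, and is written back to memory only after its third rotation. The bank interface, victim choice, and cluster ordering are assumptions; the two-level directory protocol is omitted.

```python
# Hypothetical sketch of the 3-chance rotation for evicted L2 blocks.
# Bank order, victim selection, and the eviction hook are assumptions.

MAX_CHANCES = 3

class ClusterBank:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}  # address -> rotations already used

    def full(self):
        return len(self.blocks) >= self.capacity

def evict_with_rotation(cluster, bank_idx, addr, chances):
    """Evict `addr` from cluster[bank_idx]: forward it to the next bank in
    the cluster unless it has already used its three chances, in which
    case it is written back to memory (returns None)."""
    if chances >= MAX_CHANCES:
        return None
    nxt_idx = (bank_idx + 1) % len(cluster)
    nxt = cluster[nxt_idx]
    if nxt.full():
        # Make room by rotating one of the next bank's own blocks onward
        # (victim choice is arbitrary here; a real policy would pick LRU).
        victim, used = next(iter(nxt.blocks.items()))
        del nxt.blocks[victim]
        evict_with_rotation(cluster, nxt_idx, victim, used)
    nxt.blocks[addr] = chances + 1
    return nxt
```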
Symbiotic scheduling for shared caches in multi-core systems using memory footprint signature
- In International Conference on Parallel Processing (ICPP)
"... As the trend of more cores sharing common resources on a single die and more systems crammed into enterprise computing space con-tinue, optimizing the economies of scale for a given compute capacity is becoming more critical. One major challenge in performance scal-ability is the growing L2 cache co ..."
Cited by 1 (0 self)
As the trends of more cores sharing common resources on a single die and more systems crammed into enterprise computing space continue, optimizing the economies of scale for a given compute capacity is becoming more critical. One major challenge in performance scalability is the growing L2 cache contention caused by multiple contexts running on a multi-core processor, either natively or under a virtual machine environment. Currently, an OS, at best, relies on history-based affinity information to dispatch a process or thread onto a particular processor core. Unfortunately, this simple method can easily lead to destructive performance effects due to conflicts in common resources, thereby slowing down all processes. To ameliorate the allocation/management policy of a shared cache on a multi-core, in this paper we propose Bloom filter signatures, a low-complexity architectural support that allows an OS or a Virtual Machine Monitor to infer the cache footprint characteristics and interference of applications, and then perform job scheduling based on symbiosis. Our scheme integrates hardware-level counting Bloom filters in caches to efficiently summarize cache usage behavior on a per-core, per-process or per-VM basis. We then propose and study three resource allocation algorithms to determine the optimal process-to-core mapping that minimizes interference in the L2. We executed applications using allocations generated by our new process-to-core mapping algorithms on an Intel Core 2 Duo machine and saw an average 22% (up to 54%) improvement when applications run natively, and an average 9.5% (up to 26%) improvement when running inside VMs.
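A toy software rendering of the signature idea, assuming counting Bloom filters over line addresses and a dot-product overlap score as the symbiosis metric; the filter size, hash construction, and scoring are illustrative stand-ins for the paper's hardware mechanism.

```python
# Toy counting-Bloom-filter footprint signature; sizes, hash construction,
# and the overlap score are assumed stand-ins for the hardware scheme.
import hashlib

FILTER_SIZE = 1024
NUM_HASHES = 2

def _hashes(addr):
    digest = hashlib.sha256(str(addr).encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "little") % FILTER_SIZE
            for i in range(NUM_HASHES)]

class CacheSignature:
    """Counting Bloom filter summarizing the cache regions a context touches."""
    def __init__(self):
        self.counters = [0] * FILTER_SIZE

    def record_access(self, addr):
        for h in _hashes(addr):
            self.counters[h] += 1

def interference(sig_a, sig_b):
    """Overlap of two footprints; a symbiotic scheduler would co-schedule
    the contexts whose pairwise overlap is smallest."""
    return sum(a * b for a, b in zip(sig_a.counters, sig_b.counters))
```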
Locality-Oblivious Cache Organization leveraging Single-Cycle Multi-Hop NoCs
"... Locality has always been a critical factor in on-chip data placement on CMPs as accessing further-away caches has in the past been more costly than accessing nearby ones. Sub-stantial research on locality-aware designs have thus focused on keeping a copy of the data private. However, this compli-cat ..."
Locality has always been a critical factor in on-chip data placement on CMPs, as accessing further-away caches has in the past been more costly than accessing nearby ones. Substantial research on locality-aware designs has thus focused on keeping a copy of the data private. However, this complicates the problem of data tracking and search/invalidation: tracking the state of a line at all on-chip caches at a directory or performing full-chip broadcasts are both non-scalable and extremely expensive solutions. In this paper, we make the case for Locality-Oblivious Cache Organization (LOCO), a CMP cache organization that leverages the on-chip network to create virtual single-cycle paths between distant caches, thus redefining the notion of locality. LOCO is a clustered cache organization, supporting both homogeneous and heterogeneous cluster sizes, and provides near single-cycle accesses to data anywhere within the cluster, just like a private cache. Globally, LOCO dynamically creates a virtual mesh connecting all the clusters and performs efficient global data search and migration over this virtual mesh, without having to resort to full-chip broadcasts or expensive directory lookups. Trace-driven and full-system simulations running SPLASH-2 and PARSEC benchmarks show that LOCO improves application run time by up to 44.5% over baseline private and shared caches.
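A control-flow sketch of the two-level search LOCO implies: a lookup within the requester's own cluster first (near single-cycle over the NoC), then a search across the virtual mesh of clusters instead of a broadcast or directory lookup. The data structures, helper class, and migration step here are assumptions, not the authors' implementation.

```python
# Illustrative two-level LOCO-style lookup; all structures are assumptions.

class VirtualMesh:
    """Assumed helper: clusters joined in a virtual mesh. neighbors() yields
    the other clusters (hardware would order them by hop distance)."""
    def __init__(self, clusters):
        self.clusters = clusters

    def neighbors(self, cluster):
        return [c for c in self.clusters if c is not cluster]

def loco_lookup(addr, my_cluster, virtual_mesh):
    """Search the requester's own cluster first, then walk the virtual
    mesh of clusters instead of broadcasting or consulting a directory."""
    # Level 1: any bank in the local cluster behaves like a private cache
    # thanks to the NoC's single-cycle multi-hop paths.
    for bank in my_cluster:          # each bank modeled as a dict
        if addr in bank:
            return bank[addr]

    # Level 2: query the remaining clusters over the virtual mesh.
    for cluster in virtual_mesh.neighbors(my_cluster):
        for bank in cluster:
            if addr in bank:
                data = bank.pop(addr)
                my_cluster[0][addr] = data  # migrate the line closer
                return data
    return None  # global miss: fetch from off-chip memory
```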
ADAPSCO: Distance-Aware Partially Shared Cache Organization
- Antonio García-Guirado, Universidad de Murcia
"... Many-core tiled CMP proposals often assume a partially shared last level cache (LLC) since this provides a good compromise between access latency and cache utilization. In this paper, we propose a novel way to map memory addresses to LLC banks that takes into account the average distance between the ..."
Many-core tiled CMP proposals often assume a partially shared last level cache (LLC) since this provides a good compromise between access latency and cache utilization. In this paper, we propose a novel way to map memory addresses to LLC banks that takes into account the average distance between the banks and the tiles that access them. Contrary to traditional approaches, our mapping does not group the tiles in clusters within which all the cores access the same bank for the same addresses. Instead, two neighboring cores access different sets of banks, minimizing the average distance travelled by the cache requests. Results for a 64-core CMP show that our proposal improves both execution time and the energy consumed by the network by 13% when compared to a traditional mapping. Moreover, our proposal comes at a negligible cost in terms of hardware and its benefits in both energy and execution time increase with the number of cores.
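A sketch in the spirit of this mapping, assuming an 8x8 tiled CMP where each core's addresses are spread over the four banks nearest to it, so neighboring cores see overlapping but distinct bank sets; the grid size, sharing degree, and nearest-bank selection rule are illustrative assumptions, not the paper's algorithm.

```python
# Distance-aware mapping sketch in the spirit of the proposal; the grid
# size, sharing degree, and nearest-bank rule are illustrative assumptions.

GRID = 8    # 8x8 tiled CMP, one LLC bank per tile
DEGREE = 4  # each core's addresses are spread over its 4 nearest banks

def _xy(tile):
    return tile % GRID, tile // GRID

def _dist(a, b):
    (ax, ay), (bx, by) = _xy(a), _xy(b)
    return abs(ax - bx) + abs(ay - by)

def bank_set(core):
    """The DEGREE banks closest to this core. Neighboring cores obtain
    overlapping but different sets, unlike fixed clusters where every
    core in a cluster is forced onto the same banks."""
    return sorted(range(GRID * GRID), key=lambda t: (_dist(core, t), t))[:DEGREE]

def home_bank(addr, core):
    # Addresses hash into the core's own bank set, so every request
    # travels a short, bounded distance across the mesh.
    return bank_set(core)[addr % DEGREE]
```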
Performance Analysis of Non-Uniform Cache Architecture Policies for Chip-Multiprocessors Using the Parsec v2.0 Benchmark Suite
"... have been proposed as a solution to overcome wire delays that will dominate on-chip latencies in Chip Multiprocessor designs in the near future. This novel means of organization divides the total memory area into a set of banks that provides non-uniform access latencies and thus faster access to tho ..."
Non-Uniform Cache Architectures (NUCA) have been proposed as a solution to overcome the wire delays that will dominate on-chip latencies in chip multiprocessor designs in the near future. This means of organization divides the total memory area into a set of banks that provide non-uniform access latencies, and thus faster access to those banks that are close to the processor. A NUCA model can be characterized according to the four policies that determine its behavior: bank placement, bank access, bank migration and bank replacement. Placement determines the first location of data, access defines the searching algorithm across the banks, migration decides data movements inside the memory, and replacement deals with evicted data. This paper analyzes the performance of several alternatives for each of these four policies. Moreover, the Parsec v2.0 benchmark suite is used for this evaluation because it is a representative group of upcoming shared-memory programs for chip multiprocessors. The results may help researchers identify key features of NUCA organizations and open up new areas of investigation.
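The four policies read naturally as four hooks in a NUCA model; the skeleton below, with deliberately trivial placeholder policies, is an assumed framing for such an evaluation rather than anything taken from the paper.

```python
# Assumed skeleton separating the four NUCA policies named above; each
# placeholder is deliberately trivial and meant to be swapped out.

class NUCAModel:
    def __init__(self, num_banks):
        self.banks = [dict() for _ in range(num_banks)]

    def place(self, addr):
        # Placement: where a line first lands (placeholder: static hash).
        return addr % len(self.banks)

    def access(self, addr):
        # Access: the search order across banks on a request
        # (placeholder: home bank first, then the rest in index order).
        home = self.place(addr)
        order = [home] + [i for i in range(len(self.banks)) if i != home]
        for i in order:
            if addr in self.banks[i]:
                return i
        return None  # miss

    def migrate(self, addr, hit_bank):
        # Migration: move a hit line one bank closer to the processor
        # side, modeled here as bank 0 (placeholder: gradual promotion).
        if hit_bank > 0:
            self.banks[hit_bank - 1][addr] = self.banks[hit_bank].pop(addr)

    def replace(self, bank, addr):
        # Replacement: what happens to evicted data (placeholder: drop).
        self.banks[bank].pop(addr, None)
```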