Results 1 - 4 of 4
Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?
Cited by 9 (2 self)
Abstract. On Chip Multiprocessors (CMP), it is common for multiple cores to share certain levels of cache. The sharing increases contention for cache and memory-to-chip bandwidth, further highlighting the importance of data locality analysis. As a rigorous and hardware-independent locality metric, reuse distance has served a variety of locality analyses, program transformations, and performance predictions. However, previous studies have concentrated on sequential programs running on unicore processors. On CMP, accesses by different threads (or jobs) interact in the shared cache. How reuse distance applies to the new architecture remains an open question: in particular, how the interactions in shared cache affect the collection and application of reuse distance, and how reuse-distance-based locality analysis should adapt to such architecture changes. This paper presents our explorations towards answering those questions. It first introduces the concept of concurrent reuse distance, a direct extension of the traditional concept of reuse distance that considers the data references of all co-running threads (or jobs). It then discusses the properties of concurrent reuse distance, revealing the special challenges facing its collection and application on CMP platforms. Finally, it presents solutions to those challenges for a class of multithreaded applications. The solutions center on a probabilistic model that connects concurrent reuse distance with the data locality of each individual thread. Experiments demonstrate the effectiveness of the proposed techniques in facilitating the use of concurrent reuse distance for CMP computing.
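To make the distinction between standalone and concurrent reuse distance concrete, here is a minimal Python sketch. It uses the standard definition of reuse distance (the number of distinct data elements accessed between two consecutive accesses to the same element). The function names `reuse_distances` and `round_robin` are hypothetical, and the round-robin interleaving is only an illustrative assumption: it is not the paper's probabilistic model, and real interleavings depend on scheduling and timing.

```python
from typing import Hashable, List, Optional, Sequence


def reuse_distances(trace: Sequence[Hashable]) -> List[Optional[int]]:
    """Return the reuse distance of every access in the trace.

    The distance is the number of distinct elements touched since the
    previous access to the same element; first accesses yield None.
    This is a simple quadratic-time version, written for clarity only.
    """
    last_seen = {}
    distances: List[Optional[int]] = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            prev = last_seen[addr]
            distances.append(len(set(trace[prev + 1:i])))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


def round_robin(*traces: Sequence[Hashable]) -> List[Hashable]:
    """Interleave per-thread traces one access at a time (an assumption:
    actual interleavings on a CMP are not this regular)."""
    merged: List[Hashable] = []
    for i in range(max(len(t) for t in traces)):
        for t in traces:
            if i < len(t):
                merged.append(t[i])
    return merged


thread0 = ["a", "b", "a"]
thread1 = ["x", "y", "z"]

standalone = reuse_distances(thread0)                       # thread 0 alone
concurrent = reuse_distances(round_robin(thread0, thread1))  # shared-cache view
```

In this toy trace the second access to `a` has a standalone reuse distance of 1 but a concurrent reuse distance of 3, because the co-running thread's accesses to `x` and `y` fall between the two accesses.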
ULCC: A User-Level Facility for Optimizing Shared Cache Performance on Multicores
Cited by 2 (0 self)
Scientific applications face serious performance challenges on multicore processors, one of which is caused by access contention in last level shared caches among multiple running threads. The contention increases the number of long latency memory accesses, and consequently increases application execution times. Optimizing shared cache performance is critical to reducing the execution times of multi-threaded programs on multicores. However, two problems must be solved before cache optimization techniques can be implemented on multicores at the user level. First, the cache space available to each running thread in a last level cache is difficult to predict due to access contention in the shared space, which makes cache conscious algorithms designed for single cores ineffective on multicores. Second, at the user level, programmers are not able to allocate cache space at will to running threads in the shared cache; thus data sets with strong locality may not be given sufficient cache space, and cache pollution can easily happen. To address these two issues, we have designed ULCC (User Level Cache Control), a software runtime library that enables programmers to explicitly manage and optimize last level cache usage by allocating proper cache space for different data sets of different threads. We have implemented ULCC at the user level based on a page-coloring technique for last level cache usage management. By means of multiple case studies on an Intel multicore processor, we show that with ULCC, scientific applications can achieve significant performance improvements by fully exploiting the benefit of cache optimization algorithms and by partitioning the cache space accordingly to protect frequently reused data sets and to avoid cache pollution. Our experiments with various applications show that ULCC can improve application performance by nearly 40%.
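The general idea behind page coloring can be sketched in a few lines of Python. This is not ULCC's implementation or API; the function names and the example cache geometry are hypothetical. The sketch assumes a physically indexed, set-associative cache whose set index is taken directly from the physical address bits, which does not hold for caches that hash addresses across slices.

```python
def num_colors(cache_bytes: int, ways: int, page_bytes: int = 4096) -> int:
    """Number of page colors: how many pages fit in one cache way."""
    way_bytes = cache_bytes // ways
    return max(1, way_bytes // page_bytes)


def page_color(phys_addr: int, cache_bytes: int, ways: int,
               page_bytes: int = 4096) -> int:
    """Color of the physical page containing phys_addr.

    Pages with the same color map to the same group of cache sets, so
    restricting a data set to a subset of colors confines it to the
    corresponding fraction of the cache.
    """
    frame_number = phys_addr // page_bytes
    return frame_number % num_colors(cache_bytes, ways, page_bytes)


# Hypothetical geometry: 4 MiB, 16-way last level cache with 4 KiB pages.
LLC_BYTES = 4 * 1024 * 1024
LLC_WAYS = 16

colors = num_colors(LLC_BYTES, LLC_WAYS)
color_a = page_color(0x5000, LLC_BYTES, LLC_WAYS)
color_b = page_color(0x40000, LLC_BYTES, LLC_WAYS)
```

Under these assumptions the cache has 64 colors, and a data set restricted to pages of, say, 16 colors can occupy at most a quarter of the cache, which is how coloring-based partitioning limits pollution from weak-locality data.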
Controlling Cache Utilization of HPC Applications
2011
This paper discusses the use of software cache partitioning techniques to study and improve the cache behavior of HPC applications. Cache partitioning is traditionally considered a hardware/OS solution to shared cache issues, particularly to fairness of resource utilization between multiple processes. We believe that, in the HPC context of a single application being studied or optimized on the system, with a single thread per core, cache partitioning can be used in new and interesting ways. First, we propose an implementation of software cache partitioning using the well known page coloring technique. This implementation differs from existing work by giving control of the partitioning to the application programmer. Developed on the most popular OS in HPC (Linux), this cache control scheme has low overhead in both memory and CPU while being simple to use. Second, we show how this user-controlled cache partitioning can lead to efficient measurements of the cache behavior of a parallel scientific visualization application. While existing works require expensive binary instrumentation of an application to obtain its working sets, our method only needs a few unmodified runs on the target platform. Finally, we discuss the use of our scheme to optimize memory intensive applications by isolating each of their critical data structures into dedicated cache partitions. This isolation allows the analysis of each structure's cache requirements and leads to new and significant optimization strategies. To the best of our knowledge, no other existing tool enables such tuning of HPC applications.
SRM-Buffer: An OS Buffer Management Technique to Prevent Last Level Caches from Thrashing in Multicores
Buffer caches in operating systems keep active file blocks in memory to reduce disk accesses. Related studies have focused on minimizing buffer misses and the resulting performance degradation. However, the side effects and performance implications of accessing the data in buffer caches (i.e., buffer cache hits) have been ignored. In this paper, we show that accessing buffer caches can cause serious performance degradation on multicores, particularly with shared last level caches (LLCs). There are two reasons for this problem. First, data objects in files normally have weaker locality than data objects in virtual memory spaces. Second, due to the shared structure of LLCs on multicore processors, an application accessing the data in a buffer cache may flush the to-be-reused data of its co-running applications from the shared LLC and significantly slow down these applications. The paper proposes a buffer cache design called Selected Region Mapping Buffer (SRM-buffer) for multicore systems to effectively address the cache pollution problem caused by the OS buffer. SRM-buffer improves existing OS buffer management with an enhanced page allocation policy that carefully selects which physical pages to map upon buffer misses. For a sequence of blocks accessed by an application, SRM-buffer allocates physical pages that are mapped to a selected region consisting of a small portion of the sets in the LLC. Thus, when these blocks are accessed, cache pollution is effectively limited to the small cache region. We have implemented a prototype of SRM-buffer in the Linux kernel and tested it with extensive workloads. Performance evaluation shows that SRM-buffer can improve system performance and decrease the execution times of workloads by up to 36%.

