Results 1 - 5 of 5
Accelerating Multicore Reuse Distance Analysis with Sampling and Parallelization
Cited by 8 (0 self)
Reuse distance analysis is a well-established tool for predicting cache performance, driving compiler optimizations, and assisting visualization and manual optimization of programs. Existing reuse distance analysis methods either do not account for the effects of multithreading, or suffer severe performance penalties. This paper presents a sampled, parallelized method of measuring reuse distance profiles for multithreaded programs, modeling private and shared cache configurations. The sampling technique allows it to spend much of its execution in a fast low-overhead mode, and allows the use of a new measurement method since sampled analysis does not need to consider the full state of the reuse stack. This measurement method uses O(1) data structures that may be made thread-private, allowing parallelization to reduce overhead in analysis mode. The performance of the resulting system is analyzed for a diverse set of parallel benchmarks and shown to generate accurate output compared to non-sampled full analysis as well as good results for the common application of locating low-locality code in the benchmarks, all with a performance overhead comparable to the best single-threaded analysis techniques.
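For readers unfamiliar with the metric, the following is a minimal sketch of plain (non-sampled, single-threaded) reuse distance measurement. It is not the sampled, parallel method the paper describes. The function names `reuse_distances` and `miss_ratio` are hypothetical, and the list-based LRU stack is chosen for clarity rather than speed.

```python
def reuse_distances(trace):
    """Return the reuse (LRU stack) distance of every access in `trace`.

    The distance is the number of distinct addresses touched since the
    previous access to the same address; None marks a first-time access.
    """
    stack = []       # LRU stack: most recently used address is last
    distances = []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)   # cold access, no previous use
        stack.append(addr)
    return distances


def miss_ratio(distances, cache_lines):
    """Miss ratio of a fully associative LRU cache holding `cache_lines` lines.

    An access hits only if its reuse distance is smaller than the cache size.
    """
    misses = sum(1 for d in distances if d is None or d >= cache_lines)
    return misses / len(distances)


if __name__ == "__main__":
    dists = reuse_distances(["a", "b", "c", "a", "b", "b"])
    print(dists)
    print(miss_ratio(dists, 2), miss_ratio(dists, 3))
```

Because one pass over the trace yields the distance of every access, the same list of distances can be reused to estimate the miss ratio for any cache size, which is why the profile is useful for cache performance prediction.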
StatCC: Design and Evaluation
This work presents StatCC, a simple and efficient model for estimating the shared cache miss ratios of co-scheduled applications on architectures with a hierarchy of private and shared caches. StatCC leverages the StatStack cache model to estimate the co-scheduled applications' cache miss ratios from their individual memory reuse distance distributions, and a simple performance model that estimates their CPIs based on the shared cache miss ratios. These methods are combined into a system of equations that explicitly models the CPIs in terms of the shared miss ratios and can be solved to determine both. The result is a fast algorithm with a 2% error across the SPEC CPU2006 benchmark suite compared to a simulated in-order processor and a hierarchy of private and shared caches.
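The circular dependency described above (miss ratios depend on how fast each application runs, and CPIs depend on the miss ratios) can be illustrated with a toy fixed-point iteration. This sketch is not StatCC's actual system of equations and does not use StatStack. The cache-sharing rule, the miss curve, the `penalty` value, and the names `solve_corun`, `base_cpi`, `mem_per_instr`, and `miss_curve` are all assumptions made for illustration.

```python
def solve_corun(apps, cache_size, penalty=200.0, iters=100, tol=1e-9):
    """Toy fixed-point solve for co-run CPIs and shared-cache miss ratios.

    Assumed model (hypothetical, for illustration only):
      * each app's cache share is proportional to its access rate per cycle
      * miss ratio = app["miss_curve"](share)
      * CPI = base_cpi + mem_per_instr * miss_ratio * penalty
    """
    cpis = [a["base_cpi"] for a in apps]
    misses = [0.0] * len(apps)
    for _ in range(iters):
        # Memory accesses per cycle fall as CPI rises.
        rates = [a["mem_per_instr"] / c for a, c in zip(apps, cpis)]
        total = sum(rates)
        shares = [cache_size * r / total for r in rates]
        misses = [a["miss_curve"](s) for a, s in zip(apps, shares)]
        new_cpis = [a["base_cpi"] + a["mem_per_instr"] * m * penalty
                    for a, m in zip(apps, misses)]
        converged = max(abs(n - c) for n, c in zip(new_cpis, cpis)) < tol
        cpis = new_cpis
        if converged:
            break
    return cpis, misses


if __name__ == "__main__":
    # Two identical hypothetical applications sharing an 8-line cache.
    app = {"base_cpi": 1.0, "mem_per_instr": 0.3,
           "miss_curve": lambda share: 1.0 / (1.0 + share / 8.0)}
    print(solve_corun([app, dict(app)], cache_size=8.0))
```

Simple iteration like this is not guaranteed to converge for arbitrary miss curves; it is used here only to show how the two quantities are determined together.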
Cache Conscious Task Regrouping on Multicore Processors
Because of the interference in the shared cache on multicore processors, the performance of a program can be severely affected by its co-running programs. If job scheduling does not consider how a group of tasks utilize cache, the performance may degrade significantly, and the degradation usually varies sizably and unpredictably from run to run. In this paper, we use trace-based program locality analysis and make it efficient enough for dynamic use. We show a complete on-line system for periodically measuring the parallel execution, predicting and ranking cache interference for all co-run choices, and reorganizing programs based on the prediction. We test our system on floating-point and mixed integer and floating-point workloads composed of SPEC 2006 benchmarks and compare with the default Linux job scheduler to show the benefit of the new system in improving performance and reducing performance variation.
Keywords: multicore; task grouping; online program locality analysis; lifetime sampling
Design and Evaluation of the Bandwidth Bandit
Applications that are co-scheduled on a multicore compete for shared resources, such as cache capacity and memory bandwidth. The performance degradation resulting from this contention can be substantial, which makes it important to effectively manage these shared resources. This, however, requires an understanding of how applications are impacted by such contention. While the effects of contention for cache capacity have been extensively studied, less is known about the effects of contention for memory bandwidth. This is in large part due to its complex nature, as sensitivity to bandwidth contention depends on bottlenecks at several levels of the memory system and on the interaction and locality properties of the application's access stream. This paper explores the contention effects of increased latency and decreased memory parallelism at different points in the memory hierarchy, both of which cause decreases in available bandwidth. To understand the impact of such contention on applications, it also presents a method whereby an application's overall sensitivity to different degrees of bandwidth contention can be directly measured. This method is used to demonstrate the varying contention sensitivity across a selection of benchmarks, and to explain why some of them experience substantial slowdowns long before the overall memory bandwidth saturates.
unknown title
To reduce latency and increase bandwidth to memory, modern microprocessors are designed with deep memory hierarchies including several levels of caches. For such microprocessors, the service time for fetching data from off-chip memory is about two orders of magnitude longer than fetching data from the level-one cache. Consequently, the performance of applications is largely determined by how well they utilize the caches in the memory hierarchy, captured by their miss ratio curves. However, efficiently obtaining an application's miss ratio curve and interpreting its performance implications is hard. This task becomes even more challenging when analyzing application performance on multi-core processors where several applications/threads share caches and memory bandwidths. To accomplish this, we need powerful techniques that capture applications' cache utilization and provide intuitive performance metrics. In this thesis we present three techniques for analyzing application performance: StatStack, StatCC and Cache Pirating. Our main focus is on providing memory hierarchy related performance metrics such as miss ratio, fetch ratio and bandwidth

