Linear-time Modeling of Program Working Set in Shared Cache
Cited by 4 (1 self)
Many techniques characterize the program working set by the notion of the program footprint, which is the volume of data accessed in a time window. A complete characterization requires measuring data access in all O(n^2) windows in an n-element trace. Two recent techniques have significantly reduced the measurement time, but the cost is still too high for real-size workloads. Instead of measuring all footprint sizes, this paper presents a technique for measuring the average footprint size. By confining the analysis to the average rather than the full range, the problem can be solved accurately by a linear-time algorithm. The paper presents the algorithm and evaluates it using the complete suites of 26 SPEC2000 and 29 SPEC2006 benchmarks. The new algorithm is compared against the previously fastest algorithm in both the speed of the measurement and the accuracy of shared-cache performance prediction.
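To make the footprint definition concrete, here is a minimal brute-force sketch that averages the number of distinct items over every window of one fixed length. It is not the linear-time algorithm the abstract refers to; it only illustrates the quantity being measured. The function name `average_footprint` and the list-of-addresses trace representation are assumptions made for this example.

```python
def average_footprint(trace, w):
    """Mean number of distinct items over all length-w windows of trace.

    Brute force: each window is scanned separately, so the cost grows
    with n * w. Illustrates the definition only (assumed helper name).
    """
    n = len(trace)
    if not 1 <= w <= n:
        raise ValueError("window length must be between 1 and len(trace)")
    total = 0
    for start in range(n - w + 1):
        # Footprint of one window = number of distinct items it touches.
        total += len(set(trace[start:start + w]))
    return total / (n - w + 1)


if __name__ == "__main__":
    trace = list("aabb")
    for w in range(1, len(trace) + 1):
        print(w, average_footprint(trace, w))
```

Repeating this for every window length from 1 to n covers all O(n^2) windows, which is the cost the abstract describes as too high for real-size workloads.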
Coherent Profiles: Enabling Efficient Reuse Distance Analysis of Multicore Scaling for Loop-based Parallel Programs
Cited by 3 (1 self)
Reuse distance (RD) analysis is a powerful memory analysis tool that can potentially help architects study multicore processor scaling. One key obstacle, though, is that multicore RD analysis requires measuring concurrent reuse distance (CRD) profiles across thread-interleaved memory reference streams. Sensitivity to memory interleaving makes CRD profiles architecture dependent, preventing them from analyzing different processor configurations. For loop-based parallel programs, CRD profiles shift coherently to larger CRD values with core count scaling because interleaving threads are symmetric. Simple techniques can predict such shifting, making the analysis of numerous multicore configurations from a small set of CRD profiles feasible. Given the ubiquity and scalability of loop-level parallelism, such techniques will be extremely valuable for studying future large multicore designs. This paper investigates using RD analysis to efficiently analyze multicore cache performance for loop-based parallel programs, making several contributions. First, we provide in-depth analysis of how CRD profiles change with core count scaling. Second, we develop techniques to predict CRD profile scaling, in particular employing reference groups [1] to predict coherent shift, and evaluate prediction accuracy. Third, we show core count scaling only degrades performance for last-level caches (LLCs) below 16MB for our benchmarks and problem sizes, increasing to 64–128MB if problem size scales by 64x. Finally, we apply CRD profiles to analyze multicore cache performance. When combined with existing problem scaling prediction, our techniques can predict LLC MPKI to within 11.1% of simulation across 1,728 configurations using only 36 measured CRD profiles.
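As a rough illustration of why interleaving matters, the sketch below computes reuse distance in its usual sense (the number of distinct addresses touched between two consecutive references to the same address) on a single-thread stream and on a round-robin interleaving of two symmetric threads. The helper names `reuse_distances` and `interleave`, the toy traces, and the strict round-robin interleaving are assumptions for this example, not the paper's measurement method.

```python
def reuse_distances(trace):
    """Reuse distance of each reference; None marks a first (cold) access."""
    stack = []  # LRU stack, most recently used address at the end
    out = []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            # Distinct addresses referenced since the last use of addr.
            out.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            out.append(None)
        stack.append(addr)
    return out


def interleave(*threads):
    """Round-robin merge of equally long per-thread traces (assumed model)."""
    return [ref for step in zip(*threads) for ref in step]


if __name__ == "__main__":
    t0 = ["a", "b", "a", "b"]
    t1 = ["x", "y", "x", "y"]
    print(reuse_distances(t0))                  # per-thread view
    print(reuse_distances(interleave(t0, t1)))  # shared, interleaved view
```

In this toy case every reuse moves from distance 1 in the per-thread stream to distance 3 in the interleaved stream, a small instance of the shift toward larger CRD values that the abstract describes for symmetric threads.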
Enabling Efficient Online Profiling of Homogeneous and Heterogeneous Multicore Systems
Using profiling tools is a common way to understand computer systems and software and to achieve the best performance. Profiling becomes more important as computing technology advances and makes it more difficult to intuitively reason about system characteristics. However, the recent shift in computing technology to multicore systems and heterogeneous systems requires new profiling methods that are more suited to the challenges of profiling multiple processing elements and multiple types of resources. In this dissertation, we focus on an important profiling problem for each of three application classes on modern hardware: multithreaded applications, multiprogrammed workloads, and heterogeneous systems. For multithreaded applications, we target reducing the overhead of collecting a trace of application characteristics such as memory references. Reducing the overhead reduces the impact on thread interleavings in a multithreaded application. We reduce the overhead by buffering gathered profile data in a dynamic binary instrumentation system to decouple collection of profile data from processing of profile data. By controlling the code that is generated to fill the buffer and using a variety of methods to empty the buffer, we reduce the overhead by half compared to the previous best implementation in the system.
Understanding Multicore Cache Behavior of Loop-based Parallel Programs via Reuse Distance Analysis
Understanding multicore memory behavior is crucial, but can be challenging due to the cache hierarchies employed in modern CPUs. In today’s hierarchies, performance is determined by complex thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform simulation to sort out these interactions, but this can be costly and not very insightful. An alternative is reuse distance (RD) analysis. RD analysis for multicore processors is becoming feasible because recent research has developed new notions of reuse distance that can analyze thread interactions. In particular, concurrent reuse distance (CRD) models shared cache interference, while private-stack reuse distance (PRD) models private cache replication and communication. Previous multicore RD research has centered around developing techniques and verifying accuracy. In this paper, we apply multicore RD analysis to better understand memory behavior. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. First, we develop techniques to isolate thread interactions, permitting analysis of their relative contributions. Then, we use our techniques to extract several new insights that can help architects optimize multicore cache hierarchies. One of our findings is that data sharing in parallel loops varies with reuse distance, becoming significant only at larger RD values. This implies capacity sharing in shared caches and replication/communication in private caches occur only beyond some capacity. We define Cshare to be the “turn-on capacity” for data sharing, and study its impact on private vs. shared cache performance. In addition, we find machine scaling degrades locality at smaller RD values and increases sharing frequency (i.e., reduces Cshare). We characterize how these effects vary with core count, and study their impact on the preference for private vs. shared caches.
Cache Conscious Task Regrouping on Multicore Processors
Because of the interference in the shared cache on multicore processors, the performance of a program can be severely affected by its co-running programs. If job scheduling does not consider how a group of tasks utilize cache, the performance may degrade significantly, and the degradation usually varies sizably and unpredictably from run to run. In this paper, we use trace-based program locality analysis and make it efficient enough for dynamic use. We show a complete on-line system for periodically measuring the parallel execution, predicting and ranking cache interference for all co-run choices, and reorganizing programs based on the prediction. We test our system on floating-point and mixed integer and floating-point workloads composed of SPEC 2006 benchmarks and compare with the default Linux job scheduler to show the benefit of the new system in improving performance and reducing performance variation.
Keywords: multicore; task grouping; online program locality analysis; lifetime sampling
Efficient Reuse Distance Analysis of Multicore Scaling for Loop-based Parallel Programs (appears in ACM Transactions on Computer Systems)
Reuse distance (RD) analysis is a powerful memory analysis tool that can potentially help architects study multicore processor scaling. One key obstacle, however, is that multicore RD analysis requires measuring concurrent reuse distance (CRD) and private-LRU-stack reuse distance (PRD) profiles across thread-interleaved memory reference streams. Sensitivity to memory interleaving makes CRD and PRD profiles architecture dependent, preventing them from analyzing different processor configurations. For loop-based parallel programs, CRD and PRD profiles shift coherently across RD values with core count scaling because interleaving threads are symmetric. Simple techniques can predict such shifting, making the analysis of numerous multicore configurations from a small set of CRD and PRD profiles feasible. Given the ubiquity of parallel loops, such techniques will be extremely valuable for studying future large multicore designs. This article investigates using RD analysis to efficiently analyze multicore cache performance for loop-based parallel programs, making several contributions. First, we provide an in-depth analysis of how CRD and PRD profiles change with core count scaling. Second, we develop techniques to predict CRD and PRD profile scaling, in particular employing reference groups [Zhong et al. 2003] to predict coherent shift, demonstrating 90% or greater prediction accuracy. Third, our CRD and PRD profile analyses define two application parameters with architectural implications: Ccore is the minimum shared cache capacity that “contains” locality degradation due to core count scaling, and Cshare is the capacity at which shared caches begin to provide a cache-miss reduction compared to private caches. And fourth, we apply CRD and PRD profiles to analyze multicore cache performance. When combined with existing problem scaling prediction, our techniques can predict shared LLC MPKI (private L2 cache MPKI) to within 10.7% (13.9%) of simulation across 1,728 (1,440) configurations using only 36 measured CRD (PRD) profiles.
A Generalized Theory of Collaborative Caching
Collaborative caching allows software to use hints to influence cache management in hardware. Previous theories have shown that such hints observe the inclusion property and can obtain optimal caching if the access sequence and the cache size are known ahead of time. Previously, the interface of a cache hint is limited, e.g., a binary choice between LRU and MRU. In this paper, we generalize the hint interface, where a hint is a number encoding a priority. We show the generality in a hierarchical relation where collaborative caching subsumes noncollaborative caching, and within collaborative caching, the priority hint subsumes the previous binary hint. We show two theoretical results for the general hint. The first is a new cache replacement policy, priority LRU, which permits the complete range of choices between MRU and LRU. We prove a new type of inclusion property—non-uniform inclusion—and give a one-pass algorithm to compute the miss rate for all cache sizes. Second, we show that priority hints can enable the use of the same hints to obtain optimal caching for all cache sizes, without having to know the cache size beforehand.
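For background on what a one-pass computation over all cache sizes looks like, the sketch below does it for plain fully associative LRU, where the classic (uniform) inclusion property holds: a reference hits in a cache of size c exactly when its stack distance is below c. This is not the priority LRU policy or the non-uniform inclusion result from the abstract; the function name `lru_miss_counts` and its parameters are assumptions for this example.

```python
from collections import Counter


def lru_miss_counts(trace, max_size):
    """Misses of a fully associative LRU cache for sizes 1..max_size.

    One pass over the trace builds a histogram of stack distances;
    the miss count for every size is then read off the histogram.
    Plain LRU only (assumed baseline, not priority LRU).
    """
    stack = []           # LRU stack, most recently used address at the end
    hist = Counter()     # stack distance -> number of references
    cold = 0             # first-time accesses miss at every cache size
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            hist[len(stack) - 1 - idx] += 1
            stack.pop(idx)
        else:
            cold += 1
        stack.append(addr)
    misses = {}
    for size in range(1, max_size + 1):
        # A reference misses when its distance is at least the cache size.
        far = sum(count for dist, count in hist.items() if dist >= size)
        misses[size] = cold + far
    return misses


if __name__ == "__main__":
    print(lru_miss_counts(list("abcab"), 3))
```

Dividing each count by the trace length gives the miss rate per cache size.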
Advisor: Study programme: Specialization:
It is my pleasant obligation to thank all those who supported me in my doctoral studies and in the research which resulted in this thesis. I am deeply grateful to my advisor, Petr Tůma, for his help, guidance and co-authorship of all of the included papers. I thank all my colleagues from the department for their continuous feedback and fruitful discussion. In particular, my thanks go to the rest of my co-authors and all fellows from the office room 205: Lubomír Bulej, Martin Děcký, Vojtěch Horký, Peter Libič, Lukáš Marek, Tomáš Martinec and Andrej Podzimek. I am also grateful to František Plášil for creating such an inspiring research environment that our department, and the former research group, has been.

