RapidMRC: Approximating L2 Miss Rate Curves on Commodity Systems for Online Optimizations
2009
Cited by 18 (2 self)
Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been assumed to be expensive when done in software, and consequently their usage for online optimizations has been limited. To address these problems and opportunities, we have developed a low-overhead software technique to obtain L2 MRCs online on current processors, exploiting features available in their performance monitoring units so that no changes to the application source code or binaries are required. Our technique, called RapidMRC, requires a single probing period of roughly 221 million processor cycles (147 ms), and subsequently 124 million cycles (83 ms) to process the data. We demonstrate its accuracy by comparing the obtained MRCs to the actual L2 MRCs of 30 applications taken from SPECcpu2006, SPECcpu2000, and SPECjbb2000. We show that RapidMRC can be applied to sizing cache partitions, helping to achieve performance improvements of up to 27%.
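To illustrate how MRCs can drive partition sizing, here is a minimal sketch that picks the split of a shared cache between two applications by minimizing their summed misses. It is not RapidMRC's algorithm; the function name `best_split`, the unit of partitioning, and the sample curves are all hypothetical, and the curves are assumed to be given as miss counts per partition size.

```python
def best_split(mrc_a, mrc_b, total_units):
    """Pick how many partition units to give application A.

    mrc_a[k] and mrc_b[k] are assumed to hold the estimated misses of each
    application when it owns k partition units (k = 0..total_units).
    Returns (units_for_a, combined_misses) for the cheapest split in which
    both applications receive at least one unit.
    """
    best = None
    for k in range(1, total_units):
        cost = mrc_a[k] + mrc_b[total_units - k]
        if best is None or cost < best[1]:
            best = (k, cost)
    return best


# Hypothetical curves for a cache divided into 4 units.
mrc_a = [100, 60, 30, 25, 24]  # A benefits strongly from a second unit
mrc_b = [80, 50, 40, 38, 37]   # B flattens out after two units
units_a, misses = best_split(mrc_a, mrc_b, 4)
print(units_a, misses)
```

An exhaustive scan is enough here because only `total_units - 1` splits exist for two applications; more applications would need a different search.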
Modeling Cache Sharing on Chip Multiprocessor Architectures
Cited by 9 (0 self)
As CMPs emerge as the dominant architecture for a wide range of platforms (from embedded systems and game consoles to PCs and servers), managing on-chip resources such as shared caches becomes a necessity. In this paper we propose a new statistical model of a CMP shared cache which not only describes cache sharing but also its management via a novel fine-grain mechanism. Our model, called StatShare, accurately describes the behavior of the sharing threads using run-time information (reuse-distance information for memory accesses) and helps us understand how effectively each thread uses its space. The mechanism to manage the cache at the cache-line granularity is inspired by Cache Decay, but contains important differences. Decayed cache-lines are not turned off to save leakage but are rather “available for replacement.” Decay modifies the underlying replacement policy (random, LRU) to control sharing but in a very flexible and non-strict way, which makes it superior to strict cache partitioning schemes (both fine and coarse grained). The statistical model allows us to assess a thread’s cache behavior under decay. Detailed CMP simulations show that: i) StatShare accurately predicts the thread behavior in a shared cache, and ii) managing sharing via decay (in combination with the StatShare run-time information) can be used to enforce external QoS requirements or various high-level fairness policies.
Multicore-aware reuse distance analysis
2009
Cited by 9 (2 self)
This paper presents and validates methods to extend reuse distance analysis of application locality characteristics to shared-memory multicore platforms by accounting for invalidation-based cache coherence and inter-core cache sharing. Existing reuse distance analysis methods track the number of distinct addresses referenced between reuses of the same address by a given thread, but do not model the effects of data references by other threads. This paper shows several methods to keep reuse stacks consistent so that they account for invalidations and cache sharing, either as references arise in a simulated execution or at synchronization points. These methods are evaluated against a Simics-based coherent cache simulator running several OpenMP and transaction-based benchmarks. The results show that adding multicore-awareness substantially improves the ability of reuse distance analysis to model cache behavior, reducing the error in miss ratio prediction (relative to cache simulation for a specific cache size) by an average of 70% for per-core caches and an average of 90% for shared caches.
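The single-threaded baseline the abstract describes (counting distinct addresses referenced between reuses of the same address) can be sketched as follows. This is only the classic per-thread measurement on a fully associative LRU stack, not the paper's multicore-aware method; the function names and the sample trace are hypothetical.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """For each access, count the distinct addresses referenced since the
    previous access to the same address (None for a first-time access)."""
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            # Addresses positioned after addr were touched since its last use.
            distances.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)
        else:
            distances.append(None)
            stack[addr] = True
    return distances


def lru_miss_ratio(trace, capacity):
    """Miss ratio of a fully associative LRU cache holding `capacity` lines:
    an access misses if it is cold or its reuse distance is >= capacity."""
    dists = reuse_distances(trace)
    misses = sum(1 for d in dists if d is None or d >= capacity)
    return misses / len(dists)


trace = ["a", "b", "c", "a", "b", "b"]
print(reuse_distances(trace))
print(lru_miss_ratio(trace, 2), lru_miss_ratio(trace, 3))
```

The list scan makes each access linear in the number of distinct addresses, which is acceptable for a sketch but not for long traces; tree-based structures are the usual remedy.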
Accelerating Multicore Reuse Distance Analysis with Sampling and Parallelization
Cited by 8 (0 self)
Reuse distance analysis is a well-established tool for predicting cache performance, driving compiler optimizations, and assisting visualization and manual optimization of programs. Existing reuse distance analysis methods either do not account for the effects of multithreading, or suffer severe performance penalties. This paper presents a sampled, parallelized method of measuring reuse distance profiles for multithreaded programs, modeling private and shared cache configurations. The sampling technique allows it to spend much of its execution in a fast low-overhead mode, and allows the use of a new measurement method since sampled analysis does not need to consider the full state of the reuse stack. This measurement method uses O(1) data structures that may be made thread-private, allowing parallelization to reduce overhead in analysis mode. The performance of the resulting system is analyzed for a diverse set of parallel benchmarks and shown to generate accurate output compared to non-sampled full analysis as well as good results for the common application of locating low-locality code in the benchmarks, all with a performance overhead comparable to the best single-threaded analysis techniques.
A composable model for analyzing locality of multi-threaded programs
2009
Cited by 6 (1 self)
In a multi-threaded execution, threads may negatively interfere when their private data contends for shared cache or positively interact when the data brought in by one thread is used by other threads. This paper presents a model of such cache behavior to predict locality without exhaustive simulation and provide insight into trends. The new model extends prior work that assumes no data sharing and uniform thread interleaving. Based on a single pass over an interleaved execution trace, we compute a set of per-thread statistics that includes the effect of thread interleaving and data sharing. The per-thread statistics are then composed to predict performance for all cache sizes, either for sub-clusters of threads or for futuristic environments with a larger number of similar threads. We evaluate and validate our model against exhaustive simulation using a server application running on a quad-core machine and productivity, multimedia and gaming applications running on a dual-core machine. The results indicate that our model is accurate and relies on incorporating both irregular thread interleaving and data sharing to achieve this accuracy. In addition, it identifies and separates individual factors affecting locality and scalability and hence opens new possibilities in performance tuning, program scheduling, and hardware cache design for concurrent applications.
StatStack: Efficient Modeling of LRU caches
in Proceedings of the 2010 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2010), 2010
Cited by 5 (2 self)
The identification of the memory gap in terms of the relatively slow memory accesses put a focus on cache performance in the 90s. The introduction of the moderately clocked multicores has shifted this focus from memory latency to memory bandwidth for modern processors. The multicore’s limited cache capacity per thread in combination with their current and projected off-chip memory bandwidth limitation makes this the most likely bottleneck of future computer systems. This paper presents a new and efficient way of estimating the cache performance for an application. The method has several similarities with that of Stack Distance, but instead of counting unique memory objects, as is done for Stack Distance calculations, our schema only requires the number of memory accesses to be counted between two successive accesses to the same data object. This task can be efficiently handled at runtime by existing built-in hardware counters. Furthermore, only a small fraction of the memory accesses have to be monitored for an accurate estimation. We show how low-overhead runtime data, similar to that of StatCache, is sufficient to feed this model. We evaluate the accuracy of the proposed transformation based on sparse data and compare the results with those of native stack distance based on all memory accesses. We show excellent accuracy over a wide range of cache sizes and applications.
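The raw measurement the abstract contrasts with Stack Distance (the number of memory accesses, unique or not, between two successive accesses to the same data object) can be sketched as below. This shows only the input measurement, not StatStack's statistical transformation into stack distances; the function name and trace are hypothetical.

```python
def access_reuse_distances(trace):
    """For each access, count all intervening accesses since the previous
    access to the same object (None for a first-time access)."""
    last_seen = {}  # object -> index of its most recent access
    distances = []
    for i, obj in enumerate(trace):
        if obj in last_seen:
            distances.append(i - last_seen[obj] - 1)
        else:
            distances.append(None)
        last_seen[obj] = i
    return distances


# Between the two accesses to "a" there are two accesses (b, b),
# although only one unique object was touched.
print(access_reuse_distances(["a", "b", "b", "a"]))
```

Each access needs only one dictionary lookup and one update, which is why this count is much cheaper to collect than a count of unique objects.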
Coherent Profiles: Enabling Efficient Reuse Distance Analysis of Multicore Scaling for Loop-based Parallel Programs
Cited by 3 (1 self)
Reuse distance (RD) analysis is a powerful memory analysis tool that can potentially help architects study multicore processor scaling. One key obstacle, though, is that multicore RD analysis requires measuring concurrent reuse distance (CRD) profiles across thread-interleaved memory reference streams. Sensitivity to memory interleaving makes CRD profiles architecture dependent, preventing them from analyzing different processor configurations. For loop-based parallel programs, CRD profiles shift coherently to larger CRD values with core count scaling because interleaving threads are symmetric. Simple techniques can predict such shifting, making the analysis of numerous multicore configurations from a small set of CRD profiles feasible. Given the ubiquity and scalability of loop-level parallelism, such techniques will be extremely valuable for studying future large multicore designs. This paper investigates using RD analysis to efficiently analyze multicore cache performance for loop-based parallel programs, making several contributions. First, we provide in-depth analysis on how CRD profiles change with core count scaling. Second, we develop techniques to predict CRD profile scaling, in particular employing reference groups [1] to predict coherent shift, and evaluate prediction accuracy. Third, we show core count scaling only degrades performance for last-level caches (LLCs) below 16MB for our benchmarks and problem sizes, increasing to 64–128MB if problem size scales by 64x. Finally, we apply CRD profiles to analyze multicore cache performance. When combined with existing problem scaling prediction, our techniques can predict LLC MPKI to within 11.1% of simulation across 1,728 configurations using only 36 measured CRD profiles.
Online Cache Modeling for Commodity Multicore Processors
Cited by 3 (2 self)
Modern chip-level multiprocessors (CMPs) contain multiple processor cores sharing a common last-level cache, memory interconnects, and other hardware resources. Workloads running on separate cores compete for these resources, often resulting in highly variable performance. It is generally desirable to co-schedule workloads that have minimal resource contention, in order to improve both performance and fairness. Unfortunately, commodity processors expose only limited information about the state of shared resources such as caches to the software responsible for scheduling workloads that execute concurrently. To make informed resource-management decisions, it is important to obtain accurate measurements of per-workload cache occupancies and their impact on performance, often summarized by utility functions such as miss-ratio curves (MRCs). In this paper, we first introduce an efficient online technique for estimating the cache occupancy of individual software threads using only commonly available hardware performance counters. We derive an analytical model as the basis of our occupancy estimation, and extend it for improved accuracy on modern cache configurations, considering the impact of set-associativity, line replacement policy, and memory locality effects. We demonstrate the effectiveness of occupancy estimation with a series of CMP simulations in which SPEC benchmarks execute concurrently on multiple cores. Leveraging our occupancy estimation technique, we also introduce a lightweight approach for online MRC construction, and demonstrate its effectiveness using a prototype implementation in the VMware ESX Server hypervisor. We present a series of experiments involving SPEC benchmarks, comparing the MRCs we construct online with MRCs generated offline in which various cache sizes are enforced via static page coloring.
Estimating the Performance of Cache Replacement Policies
Cited by 1 (0 self)
Caches are commonly employed to hide the latency gap between memory and the CPU by exploiting locality in memory accesses. The cache performance strongly influences a system’s overall performance, as this gap is large and ever-increasing. The efficiency of a given cache architecture, usually measured by its miss ratio, varies greatly depending on the software being executed. We present an efficient method to estimate the miss ratio using a stochastic model. The model takes into account the parameters of the cache architecture and a concise characterization of the software’s locality. In contrast to previous approaches, we consider the replacement policy as an important component of the cache architecture. To this end, we introduce policy tables as a concise representation of replacement policies. The software’s locality is characterized by stack histograms or our extension thereof: history stack histograms, which refine stack histograms by distinguishing contexts of accesses. Simulation results on the SPEC benchmarks demonstrate the strong influence of the replacement policy on the miss ratio and the precision of our estimates: average absolute errors between 0.18% and 2.92%.
Performance Modelling for Parallel PDE Solvers on NUMA-Systems
2006
A detailed model of the memory performance of a PDE solver running on a NUMA-system is set up. Due to the complexity of modern computers, such a detailed model inevitably is very complicated. Therefore, approximations are introduced that simplify the model and allow NUMA-systems and PDE solvers to be described conveniently. Using the simplified model, it is shown that PDE solvers using ordered local methods can be made very insensitive to high NUMA-ratios, allowing them to scale well on virtually any NUMA-system. PDE solvers using unordered local methods, semiglobal methods or global methods are more sensitive to high NUMA-ratios and require special techniques in order to scale well beyond a single locality group. Nevertheless, the potential performance gain of improving the data distribution on a NUMA-system can be considerable for all kinds of PDE solvers studied.

