Data caches in multitasking hard real-time systems
- In IEEE Real-Time Systems Symposium, 2003
Cited by 20 (2 self)
Data caches are essential in modern processors, bridging the widening gap between main memory and processor speeds. However, they yield very complex performance models, which makes it hard to bound execution times tightly. This paper contributes a new technique to obtain predictability in preemptive multitasking systems in the presence of data caches. We explore the use of cache partitioning, dynamic cache locking and static cache analysis to provide worst-case performance estimates in a safe and tight way. Cache partitioning divides the cache among tasks to eliminate inter-task cache interferences. We combine static cache analysis and cache locking mechanisms to ensure that all intra-task conflicts, and consequently, memory access times, are exactly predictable. To minimize the performance degradation due to cache partitioning and locking, two strategies are employed. First, the cache is loaded with data likely to be accessed so that their cache utilization is maximized. Second, compiler optimizations such as tiling and padding are applied in order to reduce cache replacement misses. Experimental results show that this scheme is fully predictable, without compromising the performance of the transformed programs. Our method outperforms static cache locking for all analyzed task sets under various cache architectures, with a CPU utilization reduction ranging between 3.8 and 20.0 times for a high performance system.
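As a rough illustration of the partitioning idea in this abstract (not the paper's algorithm), the sketch below splits the sets of a cache into disjoint contiguous ranges, one per task, so that no task can evict another task's lines. The function name `partition_sets`, the proportional-share policy, and the integer task weights are all assumptions made for the example; the abstract does not say how partition sizes are chosen.

```python
def partition_sets(num_sets, weights):
    """Split cache sets 0..num_sets-1 into disjoint contiguous ranges.

    weights maps a task name to a positive integer share (a hypothetical
    policy). Every task gets at least one set, and the last task takes
    whatever sets are left over.
    """
    if len(weights) > num_sets:
        raise ValueError("more tasks than cache sets")
    total = sum(weights.values())
    tasks = list(weights)
    partitions = {}
    start = 0
    for i, task in enumerate(tasks):
        remaining_tasks = len(tasks) - i - 1
        if remaining_tasks == 0:
            size = num_sets - start  # last task takes what is left
        else:
            size = max(1, num_sets * weights[task] // total)
            # Leave at least one set for each task still to be placed.
            size = min(size, num_sets - start - remaining_tasks)
        partitions[task] = range(start, start + size)
        start += size
    return partitions
```

Because the ranges never overlap, an access by one task can only map to sets inside its own range, which is what removes inter-task interference in this simplified picture.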
Miss rate prediction across program inputs and cache configurations
- IEEE Transactions on Computers, 2007
Cited by 17 (12 self)
Improving cache performance requires understanding cache behavior. However, measuring cache performance for one or two data input sets provides little insight into how cache behavior varies across all data input sets and all cache configurations. This paper uses locality analysis to generate a parameterized model of program cache behavior. Given a cache size and associativity, this model predicts the miss rate for arbitrary data input set sizes. This model also identifies critical data input sizes where cache behavior exhibits marked changes. Experiments show this technique is within 2 percent of the hit rate for set associative caches on a set of floating-point and integer programs using array and pointer-based data structures. Building on the new model, this paper presents an interactive visualization tool that uses a three-dimensional plot to show miss rate changes across program data sizes and cache sizes and its use in evaluating compiler transformations. Other uses of this visualization tool include assisting machine and benchmark-set design. The tool can be accessed on the Web at
Automated and accurate cache behavior analysis for codes with irregular access patterns
- 12th Workshop on Compilers for Parallel Computers (CPC 2006), 2006
Cited by 5 (4 self)
The memory hierarchy plays an essential role in the performance of current computers, so good analysis tools that help predict and understand its behavior are required. Analytical modeling is the ideal base for such tools if its traditional limitations in accuracy and scope of application are overcome. For example, while there has been extensive research on the modeling of codes with regular access patterns, less attention has been paid to codes with irregular patterns because they are harder to analyze. Nevertheless, many important applications exhibit this kind of pattern, and their lack of locality makes them more cache-demanding, which makes their study more relevant. In this paper we define the information requirements of an existing analytical model that can provide fast and accurate predictions of the cache behavior of codes with irregular access patterns. In addition, we describe the integration of the model in a research compiler oriented to automatic kernel recognition in scientific codes. The paper shows how to exploit the powerful information-gathering capabilities provided by the compiler to allow automated modeling of loop-oriented scientific codes.
Finding Optimal L1 Cache Configuration for Embedded Systems
- ASP-DAC, 2006
Cited by 4 (1 self)
Modern embedded systems execute a single application or a class of applications repeatedly. An emerging methodology for designing embedded systems uses configurable processors, where the cache size, associativity, and line size can be chosen by the designer. In this paper, a method is given to rapidly find the L1 cache miss rate of an application. An energy model and an execution time model are developed to find the best cache configuration for the given embedded application. Using benchmarks from Mediabench, we find that our method explores the design space on average 45 times faster than Dinero IV while still having 100% accuracy.
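To make the three configuration parameters concrete, here is a minimal sketch of the slow baseline that such a method tries to avoid: replaying an address trace through a simulated set-associative cache for one configuration. It is not the paper's fast method, and it assumes LRU replacement, which the abstract does not specify. The name `miss_rate` and its parameters are hypothetical.

```python
from collections import OrderedDict


def miss_rate(addresses, cache_bytes, assoc, line_bytes):
    """Replay byte addresses through an LRU set-associative cache.

    Returns misses / accesses for one (size, associativity, line size)
    configuration. LRU replacement is an assumption of this sketch.
    """
    num_sets = cache_bytes // (assoc * line_bytes)
    if num_sets < 1:
        raise ValueError("cache too small for this associativity and line size")
    # One ordered dict per set; key order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    total = 0
    for addr in addresses:
        total += 1
        block = addr // line_bytes
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)  # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= assoc:
                lines.popitem(last=False)  # evict least recently used
            lines[block] = True
    return misses / total if total else 0.0
```

Exploring a design space this way means calling `miss_rate` once per candidate configuration over the full trace, which is why a faster estimate of the same quantity is attractive.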
Data Cache Locking for Tight Timing Calculations
Cited by 3 (0 self)
Caches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data locality. In addition, caches are a source of unpredictability, resulting in programs sometimes behaving in a different way than expected. Detailed information about the number of cache misses and their causes allows us to predict cache behavior and to detect bottlenecks. Small modifications in the source code may change memory patterns, thereby altering the cache behavior. Code transformations which take the cache behavior into account might result in a high cache performance improvement. However, cache memory behavior is very hard to predict, thus making the task of optimizing and timing cache behavior very difficult. This article proposes and evaluates a new compiler framework that times cache behavior for multitasking systems. Our method explores the use of cache partitioning and dynamic cache locking to provide worst-case performance estimates in a safe and tight way for multitasking systems. We use cache partitioning, which divides the cache among tasks to eliminate inter-task
AUTOMATED DESIGN OF APPLICATION-SPECIFIC SUPERSCALAR PROCESSORS
- 2006
Cited by 1 (0 self)
Automated design of superscalar processors can provide future system-on-chip (SOC) designers with a turnkey method of generating superscalar processors that are Pareto-optimal in terms of performance, energy consumption, and area for the target application program(s). Unfortunately, current optimization methods are based on time-consuming cycle-accurate simulation, which is unsuitable for analyzing the hundreds of thousands of design options required to arrive at Pareto-optimal designs. This dissertation bridges the gap between a large design space of superscalar processors and the inability of cycle-accurate simulation to analyze a large design space, by providing a computationally and conceptually simple analytical method for generating Pareto-optimal superscalar processor designs. The proposed and evaluated analytical method consists of three parts: (1) a method for analytically estimating the performance in terms of cycles per instruction (CPI) using the application program statistics and the superscalar processor parameters, (2) a method of analytically estimating various energy consuming activities using the application program statistics and the superscalar processor parameters, and (3) a method of finding the Pareto-
A component-based definition of spatial locality
- 2006
The data layout of a program is critical to performance because it determines the spatial locality of the data access. Most quantitative notions of spatial locality are based on the overall miss rate and leave three questions not fully answered: how much can the locality of a given data layout be improved, can a data layout be improved if the miss rate cannot be lowered, and can the overall spatial locality be decomposed into smaller components? This paper describes a new definition of spatial locality that addresses these questions. The model is based on off-line profiling of a sequential execution. It has been used to analyze the spatial locality of 14 SPEC2000 benchmarks.
Modelling the Performance of the Gaussian Chemistry Code on x86 Architectures
Gaussian is a widely used scientific code with application areas in chemistry, biochemistry and material sciences. To operate efficiently on modern architectures, Gaussian employs cache blocking in the generation and processing of the two-electron integrals that are used by many of its electronic structure methods. This study uses hardware performance counters to characterise the cache and memory behavior of the integral generation code used by Gaussian in Hartree-Fock calculations. A simple performance model is proposed that aims to predict overall performance as a function of total instruction and cache miss counts. The model is parameterised for three different x86 processors: the Intel Pentium M, the P4 and the AMD Opteron. Results suggest that the model is capable of predicting execution times to an accuracy of between 5 and 15%. Use of this model in developing a dynamic cache blocking scheme is also discussed.
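The abstract does not give the form of the performance model. One common shape for a model driven by instruction and cache miss counts, assumed here purely for illustration, is linear: a base cost per instruction plus a fixed penalty per miss at each cache level. The function name, the parameter names, and the idea of fitting `cpi_base` and `penalties` per processor are assumptions of this sketch, not details taken from the study.

```python
def predicted_seconds(instructions, misses, cpi_base, penalties, clock_hz):
    """Linear model: base cycles plus a fixed penalty per cache miss.

    misses and penalties are dicts keyed by cache level (e.g. "L1", "L2").
    cpi_base and the penalty values are placeholders that would have to be
    fitted separately for each processor.
    """
    cycles = instructions * cpi_base
    for level, count in misses.items():
        cycles += count * penalties[level]
    return cycles / clock_hz
```

With counts read from hardware performance counters, the same function could be evaluated for different blocking factors to compare their predicted run times.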
A Component-based Definition of Spatial Locality
- 2008
The data layout of a program is critical to performance because it determines the spatial locality of the data access. Most quantitative notions of spatial locality are based on the overall miss rate and leave three questions not fully answered: how much can the locality of a given data layout be improved, can a data layout be improved if the miss rate cannot be lowered, and can the overall spatial locality be decomposed into finer components? This paper describes a new definition of spatial locality that addresses these questions. The model is based on online profiling and off-line analysis. It has been used to analyze 7 SPEC2000 benchmarks and 1 SPEC2006 benchmark. Among their 18 components, it finds 5 components that have a significant problem of poor spatial locality.

