Efficient Procedure Mapping using Cache Line Coloring
- In Proceedings of the SIGPLAN '97 Conference on Programming Language Design and Implementation, 1997
Cited by 67 (12 self)
As the gap between memory and processor performance continues to widen, it becomes increasingly important to exploit cache memory effectively. Both hardware and software approaches can be explored to optimize cache performance. Hardware designers focus on cache organization issues, including replacement policy, associativity, line size and the resulting cache access time. Software writers use various optimization techniques, including software prefetching, data scheduling and code reordering. Our focus is on improving memory usage through code reordering compiler techniques. In this
Adding Instruction Cache Effect to Schedulability Analysis of Preemptive Real-Time Systems
- 1996
Cited by 44 (3 self)
Cache memories are commonly avoided in real-time systems because of their unpredictable behavior. Recently, some research has been done to obtain tighter bounds on the worst-case execution time (WCET) of cached programs. These techniques usually assume a non-preemptive underlying system. However, some techniques can be applied to allow the use of caches in preemptive systems. This paper describes how to incorporate the effect of the instruction cache into Response Time schedulability Analysis (RTA). RTA is an efficient analysis for preemptive fixed-priority schedulers. We also compare, through simulations, the results of this approach with both cache partitioning (which increases cache predictability by assigning private cache partitions to tasks) and CRMA (Cached RMA, in which the cache effect is incorporated into the utilization-based Rate Monotonic schedulability analysis). The results show that the cached version of RTA (CRTA) clearly outperforms CRMA; however, the partitioning scheme may be better dependin...
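As a rough illustration of the kind of analysis this abstract describes, the sketch below runs the classic response-time fixed-point iteration for fixed-priority preemptive scheduling and charges a hypothetical cache penalty `gamma` to every release of a higher-priority task. The penalty term, the function name `response_time`, and the task set are assumptions made for illustration; they are not taken from the paper and may differ from its exact formulation.

```python
def response_time(tasks, i, gamma=0, limit=10_000):
    """Worst-case response time of task i, or None if it misses its deadline.

    tasks: list of (C, T) pairs sorted by priority, highest first, where C is
           the execution time and T the period (deadline assumed equal to T).
    gamma: assumed cache refill penalty added per higher-priority release.
    """
    c_i, t_i = tasks[i]
    r = c_i
    for _ in range(limit):
        # Interference from each higher-priority task j: ceil(r / T_j) releases,
        # each costing C_j plus the assumed cache penalty.
        interference = sum(-(-r // t_j) * (c_j + gamma) for c_j, t_j in tasks[:i])
        r_next = c_i + interference
        if r_next > t_i:
            return None      # deadline exceeded, task not schedulable
        if r_next == r:
            return r         # fixed point reached
        r = r_next
    return None              # no convergence within the iteration limit


tasks = [(1, 4), (2, 6), (3, 12)]
print(response_time(tasks, 2))           # without any cache penalty
print(response_time(tasks, 2, gamma=1))  # with one time unit per preemption
```

In this toy task set the lowest-priority task meets its deadline when the cache is ignored but not once each preemption is charged one time unit, which is the kind of effect a cache-aware analysis has to capture.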
Procedure Placement Using Temporal-Ordering Information
- ACM Transactions on Programming Languages and Systems, 1997
Automatic and Efficient Evaluation of Memory Hierarchies for Embedded Systems
- 1999
Cited by 23 (2 self)
Automation is the key to the design of future embedded systems as it permits application-specific customization while keeping design costs low. A key problem faced by automatic design systems is evaluating the performance of the vast number of alternative designs in a timely manner. For this paper, we focus on an embedded system consisting of the following components: a VLIW processor, instruction cache, data cache, and second-level unified cache. We use a hierarchical approach that partitions the system into its constituent components and evaluates each component individually. The performance of each processor is evaluated independently of its memory hierarchy, and each of the caches is simulated using the traces from a single reference processor. Because changes in the processor architecture do affect the address traces, and thus the performance of the memory hierarchy, the resulting overall performance estimate is inaccurate. To overcome this error, the changes in the processor architecture are modeled as a dilation of the reference processor's address trace, where each instruction block in the trace is conceptually stretched out by the dilation coefficient. This approach provides a projected cache performance that more accurately accounts for changes in the processor architecture. In order to understand the accuracy of the dilation model, we separate the possible errors that the model introduces and quantify these errors on a set of benchmarks. The results show the dilation model is effective for most of the design space and facilitates efficient automatic design.
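The dilation idea in this abstract can be pictured with a small sketch. It assumes, purely for illustration, that a trace is a list of `(start_address, size_in_bytes)` instruction blocks and that dilation scales both fields by the coefficient `d`; the helper names `dilate` and `count_misses` and the direct-mapped cache model are hypothetical and are not taken from the paper.

```python
def dilate(blocks, d):
    """Stretch every instruction block by the dilation coefficient d (assumed model)."""
    return [(int(start * d), max(1, int(size * d))) for start, size in blocks]


def count_misses(blocks, line_size, num_lines):
    """Count misses of a direct-mapped instruction cache over a block trace."""
    tags = [None] * num_lines
    misses = 0
    for start, size in blocks:
        first = start // line_size
        last = (start + size - 1) // line_size
        for line in range(first, last + 1):
            idx = line % num_lines      # direct-mapped: one candidate slot per line
            if tags[idx] != line:
                tags[idx] = line
                misses += 1
    return misses


reference_trace = [(0, 32), (64, 32), (0, 32)]
print(count_misses(reference_trace, line_size=32, num_lines=4))
print(count_misses(dilate(reference_trace, 2), line_size=32, num_lines=4))
```

With this toy trace the dilated blocks span more cache lines and start to conflict with each other, so the projected miss count rises without re-simulating a new processor's trace.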
An analytical model of the working-set sizes in decision-support systems
- In SIGMETRICS, 2000
Cited by 12 (0 self)
This paper presents an analytical model to study how working sets scale with database size and other application parameters in decision-support systems (DSS). The model uses application parameters that are measured on down-scaled database executions to predict cache miss ratios for executions of large databases. By applying the model to two database engines and typical DSS queries we find that, even for large databases, the most performance-critical working set is small and is caused by the instructions and private data that are required to access a single tuple. Consequently, its size is not affected by the database size. Surprisingly, database data may also exhibit temporal locality, but the size of its working set critically depends on the structure of the query, the method of scanning, and the size and the content of the database.
Balancing Design Options with Sherpa
- In CASES ’04: Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2004
Cited by 9 (1 self)
Application-specific processors offer the potential of rapidly designed logic specifically constructed to meet the performance and area demands of the task at hand. Recently, there have been several major projects that attempt to automate the process of transforming a predetermined processor configuration into a low-level description for fabrication. These projects either leave the specification of the processor to the designer, which can be a significant engineering burden, or handle it in a fully automated fashion, which completely removes the designer from the loop. In this paper we introduce a technique for guiding the design and optimization of application-specific processors. The goal of the Sherpa design framework is to automate certain design tasks and provide early feedback to help the designer navigate the architecture design space. Our approach is to decompose the overall problem of choosing an optimal architecture into a set of sub-problems that are, to the first order, independent. For each sub-problem, we create a model that relates performance to area. From this, we build a constraint system that can be solved using integer-linear programming techniques, and arrive at an ideal parameter selection for all architectural components. Our approach takes only a few minutes to explore the design space, allowing the designer or compiler to see the potential benefits of optimizations rapidly. We show that the expected performance using our model correlates strongly with detailed pipeline simulations, and present results showing design tradeoffs for several different benchmarks.
An Empirical Study on How Program Layout Affects Cache Miss Rates
- ACM SIGMETRICS Performance Evaluation Review, 1995
Cited by 4 (0 self)
Cache miss rates are quoted for a specific program, cache configuration, and input set; the effect of program layout on the miss rate has largely been ignored. We examine the variation of the miss rate resulting from randomly chosen layouts, the miss variation, for several cache configurations (cache size, line size, and set-associativity), input sets, and optimization levels for five programs in the SPEC benchmark suite. We observed miss rates that varied from 0.6m to 1.8m, where m is the mean miss rate. We did not observe any consistently good layouts across different parameters; in contrast, several layouts were consistently bad. Overall, cache line size has little effect on the miss variation, while increasing the cache size (decreasing the miss rate), decreasing the set-associativity, or increasing the optimization level increased the miss variation. We question the validity of using a single layout to represent the miss rate of a given program for a direct-mapped cache.
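The notion of miss variation can be made concrete with a toy experiment. The sketch below is not the study's methodology: it assumes a direct-mapped cache, treats each call in a trace as touching every cache line of the called procedure once, and shuffles the procedure order to generate random layouts. The procedure sizes, the trace, and the names `miss_rate` and `miss_variation` are all hypothetical.

```python
import random


def miss_rate(order, sizes, trace, line_size=32, num_lines=8):
    """Miss rate of a direct-mapped cache for one layout (procedure order)."""
    start, addr = {}, 0
    for proc in order:                  # place procedures contiguously
        start[proc] = addr
        addr += sizes[proc]
    tags = [None] * num_lines
    misses = refs = 0
    for proc in trace:
        first = start[proc] // line_size
        last = (start[proc] + sizes[proc] - 1) // line_size
        for line in range(first, last + 1):
            refs += 1
            idx = line % num_lines
            if tags[idx] != line:
                tags[idx] = line
                misses += 1
    return misses / refs


def miss_variation(sizes, trace, layouts=100, seed=0):
    """Return (min/mean, max/mean) of the miss rate over random layouts."""
    rng = random.Random(seed)
    procs = list(sizes)
    rates = []
    for _ in range(layouts):
        rng.shuffle(procs)
        rates.append(miss_rate(procs, sizes, trace))
    mean = sum(rates) / len(rates)
    return min(rates) / mean, max(rates) / mean


sizes = {"a": 96, "b": 64, "c": 128, "d": 32}   # bytes, hypothetical
trace = ["a", "b", "a", "c", "a", "d"] * 50     # hypothetical call sequence
low, high = miss_variation(sizes, trace)
print(f"miss rate ranges from {low:.2f}m to {high:.2f}m")
```

The two ratios play the role of the 0.6m and 1.8m figures quoted in the abstract, although the numbers produced by this toy setup say nothing about the SPEC programs measured there.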
Code Placement using Temporal Profile Information
- 1998
Cited by 3 (0 self)
Instruction cache performance is important to instruction fetch efficiency and overall processor performance. The layout of an executable has a substantial effect on the cache miss rate and the instruction working set size during execution. This means that the performance of an executable can be improved significantly by applying a code-placement algorithm that minimizes instruction cache conflicts and improves spatial locality. We describe an algorithm for procedure placement, one type of code-placement algorithm, that significantly differs from previous approaches in the type of information used to drive the placement algorithm. In particular, we gather temporal ordering information that summarizes the interleaving of procedures in a program trace. Our algorithm uses this information along with cache configuration and procedure size information to better estimate the conflict cost of a potential procedure ordering. It optimizes the procedure placement for single- and multi-level caches. In addition to reducing instruction cache conflicts, the algorithm simultaneously minimizes the instruction working set size of the program. We compare the performance of our algorithm with a particularly successful procedure-placement algorithm and show noticeable improvements in the instruction cache behavior, while maintaining the same instruction working set size.
Applications of Randomness in System Performance Measurement
- 1998
Cited by 3 (1 self)
This thesis presents and analyzes a simple principle for building systems: that there should be a random component in all arbitrary decisions. If no randomness is used, system performance can vary widely and unpredictably due to small changes in the system workload or configuration. This makes measurements hard to reproduce and less meaningful as predictors of performance that could be expected in similar situations.
Locality of Reference, Patterns in Program Behavior, Memory Management, and Memory Hierarchies
Cited by 2 (0 self)
Locality of reference is crucial to the performance of modern computers, but is actually poorly understood. In this paper, we survey issues in locality and memory hierarchy design, attempting to bring together what is known, correct common misconceptions, and clarify what is not known. We present a unified approach to locality, based on the concept of timescale relativity, which simply says that some patterns in program behavior are relevant to issues of caching, and others are not, and that the difference depends crucially on the timescale relevant to a particular cache. Memory hierarchies use a kind of online, adaptive algorithm to control caching; such algorithms cannot be studied properly without some understanding of the regularities in the "data" (program behavior) they must process. We attempt a vertical unification, showing that locality of reference results from regularities in the structure of programs, and from regularities in how memory allocators map program objects ont...

