Instruction Fetching: Coping with Code Bloat
In Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995
Cited by 62 (9 self)
Previous research has shown that the SPEC benchmarks achieve low miss ratios in relatively small instruction caches. This paper presents evidence that current software-development practices produce applications that exhibit substantially higher instruction-cache miss ratios than do the SPEC benchmarks. To represent these trends, we have assembled a collection of applications, called the Instruction Benchmark Suite (IBS), that provides a better test of instruction-cache performance. We discuss the rationale behind the design of IBS and characterize its behavior relative to the SPEC benchmark suite. Our analysis is based on trace-driven and trap-driven simulations and takes full account of both the application and operating-system components of the workloads. This paper then reexamines a collection of previously proposed hardware mechanisms for improving instruction-fetch performance
TRAPEDS: Producing Traces for Multicomputers Via Execution Driven Simulation
In Proceedings of the 1989 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1989
Cited by 38 (0 self)
Trace-driven simulation is an important aid in performance analysis of computer systems. Capturing address traces for these simulations is a difficult problem for single processors and particularly for multicomputers. Even when existing trace methods can be used on multicomputers, the amount of collected data typically grows with the number of processors, so I/O and trace storage costs increase. This paper presents a new technique that modifies the executable code to dynamically collect the address trace from the user code and analyzes this trace during the execution of the program. This method helps resolve the I/O and storage problems and facilitates parallel analysis of the address trace. If a trace stored on disk is desired, the generated trace information can also be written to files during execution, with a resultant drop in program execution speed. An initial implementation on the Intel iPSC/2 hypercube multicomputer is detailed, and sample simulation results are presented. The effect of this trace collection method on execution time is illustrated. Acknowledgment: This research was supported in part by a Shell Doctoral Fellowship, by a Digital Faculty...
Cache Coherence Directories for Scalable Multiprocessors
1992
Cited by 29 (5 self)
this memory bandwidth problem, since multiple processors may be referencing the same memory modules. Furthermore, it is impossible to physically locate all memory near all of the processors, so some references must incur long access latencies due to interconnect delays. Average memory latency can be reduced somewhat by distributing the global shared memory across the processing nodes. Even so, if a processor is actively sharing data, then it will reference some data that is not locally resident. One approach to improving the characteristics of the memory system is to follow the lead of uniprocessor designers by pairing each processor with a cache. Caches improve latency and bandwidth by using small, fast memories that are tightly coupled with the CPU. In a multiprocessor, they reduce the bandwidth strain due to data sharing by allowing each processor to cache its own copy of a shared data value. Unfortunately, allowing multiple processors to simultaneously cache a given datum leads to the cache consistency problem, also known as the cache coherence problem. If one processor writes a shared data value in its cache, the other cached copies of the data become stale. A processor that reads a stale copy of the data does not receive the most recently written value, violating the shared memory model we would like to provide. The simplest solution to the cache coherence problem is to disallow the caching of shared data. The performance effects stemming from the longer latencies that result from caching only private data are demonstrated by Figure 1.1. The vertical axis shows the fraction of uniprocessor utilization that is achieved by each processor. The horizontal axis shows the fraction of data references that are to shared data. The solid curve indicates the...
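The stale-copy problem described in this abstract can be illustrated with a toy model. The sketch below is not taken from the cited work: the class `ToyCache`, its write-through behavior, and the function `demo_stale_read` are all hypothetical names and assumptions chosen only to show how an unsynchronized private copy goes stale and how an invalidation would repair it.

```python
# Hypothetical toy model: every name here is an illustrative assumption.

class ToyCache:
    """A private cache holding copies of memory words, with no coherence."""

    def __init__(self, memory):
        self.memory = memory   # shared backing store: address -> value
        self.lines = {}        # locally cached copies: address -> value

    def read(self, addr):
        # On a miss, fetch from memory; on a hit, return the local copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Write-through (an assumption): update the local copy and memory,
        # but do nothing about copies held by other caches.
        self.lines[addr] = value
        self.memory[addr] = value

    def invalidate(self, addr):
        # Drop the local copy so the next read refetches from memory.
        self.lines.pop(addr, None)


def demo_stale_read():
    memory = {0x10: 1}
    p0, p1 = ToyCache(memory), ToyCache(memory)
    p1.read(0x10)              # P1 caches the value 1
    p0.write(0x10, 2)          # P0 writes 2; P1's copy is now stale
    stale = p1.read(0x10)      # hit on the stale copy
    p1.invalidate(0x10)        # the action a coherence protocol would trigger
    fresh = p1.read(0x10)      # miss, refetch from memory
    return stale, fresh
```

Calling `demo_stale_read()` returns the value P1 sees before and after the invalidation, which makes the violation of the shared-memory model concrete.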
Cache Performance for Multimedia Applications
In Proceedings of the 15th IEEE International Conference on Supercomputing, 2001
Cited by 21 (0 self)
The caching behavior of multimedia applications has been described as having high instruction reference locality within small loops, very large working sets, and poor data cache performance due to non-locality of data references. Despite this, there is no published research deriving or measuring these qualities. Utilizing the previously developed Berkeley Multimedia Workload, we present the results of execution-driven cache simulations with the goal of aiding future media processing architecture design. Our analysis examines the differences between multimedia and traditional applications in cache behavior. We find that multimedia applications actually exhibit lower instruction miss ratios and comparable data miss ratios when contrasted with other widely studied workloads. In addition, we find that longer data cache line sizes than are currently used would benefit multimedia processing.
Dynamic Pointer Allocation for Scalable Cache Coherence Directories
In International Symposium on Shared Memory Multiprocessing, 1991
Cited by 18 (0 self)
one of the primary challenges in building shared memory multiprocessors with hundreds or thousands of processors. While directory-based coherency schemes are promising because they rely on point-to-point messages rather than a network broadcast mechanism, traditional directory organizations would use a prohibitive amount of memory in a large-scale machine. In this paper we introduce a dynamic pointer allocation directory that exploits reference behavior characteristics of large-scale parallel programs to reduce directory storage requirements to manageable levels while maintaining performance comparable to traditional directory organizations.
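The abstract does not spell out the directory layout, so the following is only one plausible reading of "dynamic pointer allocation": sharer pointers are drawn from a single shared pool and chained per memory block, so storage scales with the number of actual sharers instead of with blocks times processors. The class `PointerPoolDirectory`, its methods, and the choice to raise an error when the pool is exhausted are all assumptions made for illustration, not the authors' design.

```python
class PointerPoolDirectory:
    """Directory whose sharer lists are linked lists carved from one pool.

    Hypothetical sketch: names and overflow policy are assumptions.
    """

    def __init__(self, pool_size):
        if pool_size < 1:
            raise ValueError("pool_size must be at least 1")
        self.node = [None] * pool_size                 # processor id per slot
        self.next = list(range(1, pool_size)) + [-1]   # slots start on free list
        self.free_head = 0
        self.head = {}                                 # block -> first sharer slot

    def sharers(self, block):
        """Return the processors currently recorded as caching the block."""
        result = []
        slot = self.head.get(block, -1)
        while slot != -1:
            result.append(self.node[slot])
            slot = self.next[slot]
        return result

    def add_sharer(self, block, node_id):
        """Record that node_id caches the block, allocating one pool slot."""
        if node_id in self.sharers(block):
            return
        if self.free_head == -1:
            # A real design would need a reclamation policy; this sketch stops.
            raise MemoryError("pointer pool exhausted")
        slot = self.free_head
        self.free_head = self.next[slot]
        self.node[slot] = node_id
        self.next[slot] = self.head.get(block, -1)     # push at list head
        self.head[block] = slot

    def invalidate(self, block):
        """Return the sharers to invalidate and give their slots back."""
        victims = []
        slot = self.head.pop(block, -1)
        while slot != -1:
            victims.append(self.node[slot])
            following = self.next[slot]
            self.node[slot] = None
            self.next[slot] = self.free_head
            self.free_head = slot
            slot = following
        return victims

    def free_slots(self):
        """Count the slots remaining on the free list."""
        count, slot = 0, self.free_head
        while slot != -1:
            count += 1
            slot = self.next[slot]
        return count
```

A write to a block would call `invalidate(block)` to learn which caches need point-to-point invalidation messages; the freed pointers then become available to any other block.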
CPU Cache Prefetching: Timing Evaluation of Hardware Implementations
IEEE Transactions on Computers, 1998
Cited by 17 (2 self)
Prefetching into CPU caches has long been known to be effective in reducing the cache miss ratio, but known implementations of prefetching have been unsuccessful in improving CPU performance. The reasons for this are that prefetches interfere with normal cache operations by making cache address and data ports busy, the memory bus busy, the memory banks busy, and by not necessarily being complete by the time that the prefetched data is actually referenced. In this paper, we present extensive quantitative results of a detailed cycle-by-cycle trace-driven simulation of a uniprocessor memory system in which we vary most of the relevant parameters in order to determine when and if hardware prefetching is useful. We find that, in order for prefetching to actually improve performance, the address array needs to be double ported and the data array needs to either be double ported or fully buffered. It is also very helpful for the bus to be very wide (e.g., 16 bytes), for bus transactions to be split, and for main memory to be interleaved. Under the best circumstances, i.e., with a significant investment in extra hardware, prefetching can significantly improve performance. For implementations without adequate hardware, prefetching often decreases performance.
Accuracy of Memory Reference Traces of Parallel Computations in Trace-Driven Simulation
IEEE Transactions on Parallel and Distributed Systems, 1990
Cited by 15 (4 self)
For a given input, the global trace generated by a parallel program in a shared memory multiprocessing environment may change as the memory architecture and management policies change. Consequently, if trace-driven simulation is used, care must be taken to adjust the global trace to reflect the reference pattern that would result from program execution in the new environment. Since the addresses may change as the environment changes, traditional process traces are not sufficient. We propose a method for ensuring that a correct global trace is generated in the new environment. This method involves a new characterization of a parallel program that identifies its address change points and address affecting points. An extension of traditional process traces, called the intrinsic trace of each process, is developed. The intrinsic traces maximize the decoupling of program execution from simulation by describing the address flow graph and path expressions of each process program. At each point...
Techniques for Cache and Memory Simulation Using Address Reference Traces
Int. J. Comput. Simul., 1990
Cited by 13 (1 self)
Simulation using address reference traces is one of the primary methods for the performance evaluation of the memory hierarchy of computer systems. In this paper we survey the techniques used in such a simulation. In both the uniprocessor and shared-memory multiprocessor cases, the issues can be divided into trace collection, trace storage, and trace usage. Trace collection can employ several hardware or software methods. Common concerns are that the collection method captures all of the address references of interest, that the execution overhead of the collection method is not excessive, and that the trace is of adequate length. The increasing size of caches heightens the concern about adequate trace length. Trace storage is of concern because of the large size of traces. Techniques for trace compression and trace reduction have been developed. Trace usage is of concern because of the length of a simulation. Under some circumstances it is possible to evaluate multiple cache sizes in a si...
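As a concrete picture of the trace-usage step, here is a minimal sketch of a trace-driven simulation of a single set-associative cache with LRU replacement. It is a generic textbook-style model, not code from the surveyed paper; the function name `simulate_cache` and its parameters are assumptions, and the trace is simply any iterable of integer byte addresses.

```python
from collections import OrderedDict


def simulate_cache(trace, cache_bytes, line_bytes, ways):
    """Replay an address trace and return the miss ratio.

    Hypothetical helper: models one LRU set-associative cache and
    ignores writes, timing, and multiprocessor effects.
    """
    num_sets = (cache_bytes // line_bytes) // ways
    if num_sets < 1:
        raise ValueError("cache too small for the given line size and ways")
    # One OrderedDict per set; insertion order doubles as LRU order.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    total = 0
    for addr in trace:
        total += 1
        line = addr // line_bytes            # cache-line number of the address
        entries = sets[line % num_sets]      # set the line maps to
        if line in entries:
            entries.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            if len(entries) >= ways:
                entries.popitem(last=False)  # evict the least recently used
            entries[line] = True
    return misses / total if total else 0.0
```

For example, `simulate_cache([0, 4, 64, 0, 128, 64], cache_bytes=64, line_bytes=16, ways=1)` replays six references through a four-line direct-mapped cache. Because `trace` can be a generator, references can be consumed as they are produced instead of being stored first.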
A Probabilistic Approach to the Analysis of Program Execution Time
1998
Cited by 8 (1 self)
We present a new approach to the performance prediction of parallel programs that provides information on the distribution of execution times when considering a large space of input data sets. The research aims to extend low-cost performance analysis techniques by accounting for the stochastic behavior of system parameters. Current analysis techniques are based on path analysis with the assumption of deterministic task times (mean values) instead of accounting for variance. Most of the system model parameters, however, are stochastic rather than deterministic due to data dependency in programs, for example in terms of branches and loop bounds, as well as due to various other probabilistic model abstractions. The approach is based on moments representations of distribution. We present a low-cost algorithm that computes the moments of the program execution time based on the moments associated with branching, loop bounds, and basic blocks. The novelty of the analysis technique is the combi...
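To make the idea of propagating moments through program structure concrete, the sketch below combines the first two raw moments of execution time for sequences, branches, and loops. These are standard probability identities applied under a strong assumption that all parts are statistically independent; they are not the paper's algorithm, and the helper names `seq`, `branch`, `loop`, and `variance` are hypothetical.

```python
# Each time distribution is summarized as a pair (E[T], E[T^2]).

def variance(m):
    """Variance recovered from the first two raw moments."""
    return m[1] - m[0] ** 2


def seq(a, b):
    """Moments of X + Y for independent X and Y (sequential composition)."""
    (m1a, m2a), (m1b, m2b) = a, b
    return (m1a + m1b, m2a + 2.0 * m1a * m1b + m2b)


def branch(p, a, b):
    """Moments of a mixture: path a with probability p, otherwise path b."""
    return (p * a[0] + (1 - p) * b[0], p * a[1] + (1 - p) * b[1])


def loop(n, body):
    """Moments of N i.i.d. body executions, with N independent of the body.

    n holds the moments of the iteration count; the variance follows
    Var[T] = E[N] * Var[B] + Var[N] * E[B]^2.
    """
    mean = n[0] * body[0]
    var = n[0] * variance(body) + variance(n) * body[0] ** 2
    return (mean, var + mean ** 2)
```

As a usage example, a branch that takes a 2-unit block or a 4-unit block with equal probability is `branch(0.5, (2, 4), (4, 16))`, and running that branch exactly three times is `loop((3, 9), ...)` applied to the result; `variance` then reports the spread that a mean-only path analysis would discard.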

