Towards a Theory of Cache-Efficient Algorithms
, 1999
Cited by 43 (3 self)
We describe a model that enables us to analyze the running time of an algorithm in a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our model, an extension of Aggarwal and Vitter's I/O model, enables us to establish useful relationships between the cache complexity and the I/O complexity of computations. As a corollary, we obtain cache-optimal algorithms for some fundamental problems like sorting, FFT, and an important subclass of permutations in the single-level cache model. We also show that ignoring associativity concerns could lead to inferior performance, by analyzing the average-case cache behavior of mergesort. We further extend our model to multiple levels of cache with limited associativity and present optimal algorithms for matrix transpose and sorting. Our techniques may be used for systematic exploitation of the memory hierarchy starting from the algorithm design stage, and dealing with the hitherto unresolved problem of l...
Architectural Requirements and Scalability of the NAS Parallel Benchmarks
- In Supercomputing
, 1999
"... and David E. Culler ..."
Performance Modeling for Realistic Storage Devices
, 1997
Cited by 36 (8 self)
Managing large amounts of storage is difficult and becoming more so as both the complexity and number of storage devices are increasing. One approach to this problem is a self-managing storage system. Since a self-managing storage system is a real-time system, it requires a model that quickly approximates the behavior of the storage device in a workload-dependent fashion. We develop such a model.
Our approach to modeling storage devices is to model the individual physical components of the device, such as queues, caches, and disk mechanisms, and then compose the component models. Each component model determines its behavior from the specification of the entering workload and the lower-level device behavior. To help the lower-level component model determine its behavior, each component model creates a modified workload specification that reflects how the physical component would modify the entering workload. Modifying the workload specification allows us, for example, to capture the altered spatial locality that occurs when queues reorder their requests.
Our model predicts the device behavior in terms of response time within a relative error ranging from 2% to 30% for interesting subsets of the domain of devices and workloads. To demonstrate this, the model has been validated with synthetic traces of parallel scientific file system workloads and video-on-demand applications and traces of transaction processing applications.
Our contributions to the area of performance modeling for storage devices include the following:
- An infrastructure for developing a composite model. The infrastructure
supports the development of more complicated devices and workloads
than we have validated.
- Methods to approximate the mean seek time and rotational latency of
a disk mechanism using measures of workload spatial locality.
- Methods to approximate the miss probability and the full- and partial-hit
probabilities in an I/O system's data caches using measures of workload
spatial locality.
- Methods to approximate the queue delay for non-FCFS scheduling algorithms
using a description of the workload arrival process.
These methods can be composed to provide analytic estimation procedures for the behavior of a subset of current storage devices.
Analytical Modeling of Set-Associative Cache Behavior
- IEEE Transactions on Computers
, 1998
Cited by 31 (9 self)
Cache behavior is complex and inherently unstable, yet is a critical factor affecting program performance. A method of evaluating cache performance is required, both to give quantitative predictions of miss-ratio, and information to guide optimization of cache use.
Architectural Exploration and Optimization of Local Memory in Embedded Systems
, 1997
Cited by 29 (5 self)
Embedded processor-based systems allow for the tailoring of the on-chip memory architecture based on application-specific requirements. We present an analytical strategy for exploring the on-chip memory architecture for a given application, based on a memory performance estimation scheme. The analytical technique has the important advantage of enabling a fast evaluation of candidate memory architectures in the early stages of system design. Our experiments demonstrate that our estimations closely follow the actual simulated performance, at significantly reduced run times.
1. Introduction
Increasing design complexity and shrinking product design cycle times have fueled the need for design reuse in the IC-design industry. Reuse is enabled by modern design libraries, which frequently consist of pre-designed megacells such as microprocessor cores, memories, numeric coprocessors, and modules implementing standardized functions such as JPEG. For example, the CW33000 processor core from LSI ...
Combining Trace Sampling with Single Pass Methods for Efficient Cache Simulation
, 1997
Cited by 27 (1 self)
The design of the memory hierarchy is crucial to the performance of high performance computer systems. The incorporation of multiple levels of caches into the memory hierarchy is known to increase the performance of high end machines but the development of architectural prototypes of various memory hierarchy designs is costly and time consuming. In this paper, we will describe a single pass method used in combination with trace sampling techniques to produce a fast and accurate approach for simulating multiple sizes of caches simultaneously.
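The abstract does not say which single pass method the paper uses, and the trace sampling step is not shown here. As a hedged illustration of the general idea, the sketch below uses the classic LRU stack-distance approach (an assumption, not necessarily the authors' method): one pass over a trace of block identifiers yields miss counts for several fully associative LRU cache sizes at once. The names `stack_distances` and `misses_for_sizes` are hypothetical.

```python
from collections import Counter

def stack_distances(trace):
    """Histogram of LRU stack distances; the key None counts first-time (cold) references."""
    stack = []        # LRU stack, most recently used block at the end
    hist = Counter()
    for block in trace:
        if block in stack:
            # Depth 1 means the block was the most recently used one.
            depth = len(stack) - stack.index(block)
            hist[depth] += 1
            stack.remove(block)
        else:
            hist[None] += 1
        stack.append(block)
    return hist

def misses_for_sizes(trace, sizes):
    """Miss counts for fully associative LRU caches of each size (in blocks)."""
    hist = stack_distances(trace)   # the single pass over the trace
    total = len(trace)
    result = {}
    for capacity in sizes:
        # A reference hits in a cache of this capacity iff its depth <= capacity.
        hits = sum(n for d, n in hist.items() if d is not None and d <= capacity)
        result[capacity] = total - hits
    return result

trace = [1, 2, 3, 1, 2, 3, 4, 1]
misses = misses_for_sizes(trace, [1, 2, 3, 4])
```

The list-based stack keeps the sketch short at the cost of linear-time lookups; a production simulator would use a faster structure and would also have to handle set-associative caches.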
A Novel Cache Architecture to Support Layer-Four Packet Classification at Memory Access Speeds
, 2000
Cited by 26 (2 self)
Existing and emerging layer-4 switching technologies require packet classification to be performed on more than one header field, known as layer-4 lookup. Currently, the fastest general layer-4 lookup scheme delivers a throughput of 1 Million Lookups Per Second (MLPS), far off from the 25/75 MLPS needed to support a 50/150 Gbps layer-4 router. We propose the use of route caching to speed up layer-4 lookup, and design and implement a cache architecture for this purpose. We investigated the locality behavior of Internet traffic (at layer 4) and proposed a near-LRU algorithm that best harnesses this behavior. In implementation, to best approximate fully-associative near-LRU using relatively inexpensive set-associative hardware, we invented a dynamic set-associative scheme that exploits the nice properties of N-universal hash functions. The cache architecture achieves a high and stable hit ratio above 90 percent and a throughput of up to 75 MLPS at a reasonable cost ($700/1700 for 50/150 Gbps rou...
Cache Performance Analysis of Traversals and Random Accesses
- In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms
, 1999
Cited by 25 (0 self)
This paper describes a model for studying the cache performance of algorithms in a direct-mapped cache. Using this model, we analyze the cache performance of several commonly occurring memory access patterns: (i) sequential and random memory traversals, (ii) systems of random accesses, and (iii) combinations of each. For each of these, we give exact expressions for the number of cache misses per memory access in our model. We illustrate the application of these analyses by determining the cache performance of two algorithms: the traversal of a binary search tree and the counting of items in a large array. Trace driven cache simulations validate that our analyses accurately predict cache performance.
1 Introduction
The concrete analysis of algorithms has a long and rich history. It has played an important role in understanding the performance of algorithms in practice. Traditional concrete analysis of algorithms is interested in approximating as closely as possible the number of "cost...
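To make the direct-mapped setting in this abstract concrete, here is a minimal trace-driven sketch that counts misses for a sequential and a random traversal. It is an illustration only and does not reproduce the paper's exact expressions. The function name `direct_mapped_misses` and the cache parameters (64 lines, 8 words per block, 4096 words of memory) are assumptions chosen for the example.

```python
import random

def direct_mapped_misses(addresses, num_lines, block_size):
    """Count misses for a word-address trace in a direct-mapped cache."""
    resident = [None] * num_lines   # block currently held by each cache line
    misses = 0
    for addr in addresses:
        block = addr // block_size  # memory block containing this word
        line = block % num_lines    # the single line this block can occupy
        if resident[line] != block:
            misses += 1
            resident[line] = block
    return misses

# Sequential traversal: every block is loaded once, so one miss per block.
sequential = list(range(4096))
seq_misses = direct_mapped_misses(sequential, num_lines=64, block_size=8)

# Random traversal over the same range (fixed seed for repeatability).
rng = random.Random(0)
scattered = [rng.randrange(4096) for _ in range(4096)]
rand_misses = direct_mapped_misses(scattered, num_lines=64, block_size=8)
```

With these parameters the sequential traversal incurs 4096 / 8 = 512 misses, while the random traversal misses on most accesses because memory holds far more blocks than the cache has lines.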

