Results 1 - 10 of 15
Cache-Conscious Structure Layout
, 1999
Cited by 164 (8 self)
Hardware trends have produced an increasing disparity between processor speeds and memory access times. While a variety of techniques for tolerating or reducing memory latency have been proposed, these are rarely successful for pointer-manipulating programs. This paper explores a complementary approach that attacks the source (poor reference locality) of the problem rather than its manifestation (memory latency). It demonstrates that careful data organization and layout provides an essential mechanism to improve the cache locality of pointer-manipulating programs and, consequently, their performance. It explores two placement techniques, clustering and coloring, that improve cache performance by increasing a pointer structure’s spatial and temporal locality and by reducing cache conflicts. To reduce the cost of applying these techniques, this paper discusses two strategies, cache-conscious reorganization and cache-conscious allocation, and describes two semi-automatic tools, ccmorph and ccmalloc, that use these strategies to produce cache-conscious pointer structure layouts. ccmorph is a transparent tree reorganizer that utilizes topology information to cluster and color the structure. ccmalloc is a cache-conscious heap allocator that attempts to co-locate contemporaneously accessed data elements in the same physical cache block. Our evaluations, with microbenchmarks, several small benchmarks, and a couple of large real-world applications, demonstrate that the cache-conscious structure layouts produced by ccmorph and ccmalloc offer large performance benefits, in most cases significantly outperforming state-of-the-art prefetching.
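The clustering idea in this abstract can be illustrated with a small sketch. It is not ccmorph: the function names, the heap-style node numbering, and the "three nodes per cache block" figure are assumptions chosen for illustration, and coloring is not modeled. The sketch packs small subtrees into blocks and counts how many distinct blocks a root-to-leaf search touches, compared with a plain level-order layout.

```python
from collections import deque

def build_complete_tree(depth):
    """Children map for a complete binary tree with heap-style ids (root = 1)."""
    n = 2 ** depth - 1
    return {v: [c for c in (2 * v, 2 * v + 1) if c <= n] for v in range(1, n + 1)}

def subtree_cluster_layout(children, root, nodes_per_block):
    """Assign each node a block number by packing small subtrees into blocks.

    Hypothetical helper: a block is filled breadth-first from a subtree root,
    and whatever is left on the frontier becomes the root of a later block.
    """
    block_of = {}
    next_block = 0
    pending = deque([root])              # subtree roots that still need a block
    while pending:
        frontier = deque([pending.popleft()])
        placed = 0
        while frontier and placed < nodes_per_block:
            v = frontier.popleft()
            block_of[v] = next_block
            placed += 1
            frontier.extend(children[v])
        pending.extend(frontier)         # unplaced nodes start their own blocks
        next_block += 1
    return block_of

def blocks_touched(block_of, path):
    """Number of distinct blocks visited along a search path."""
    return len({block_of[v] for v in path})

if __name__ == "__main__":
    children = build_complete_tree(6)                 # 63 nodes
    path = [1, 3, 7, 15, 31, 63]                      # one root-to-leaf search
    clustered = subtree_cluster_layout(children, 1, 3)
    level_order = {v: (v - 1) // 3 for v in children} # nodes stored in id order
    print("clustered  :", blocks_touched(clustered, path))
    print("level-order:", blocks_touched(level_order, path))
```

Under these assumptions a parent shares a block with its two children, so a search crosses a block boundary only on every other step down the tree.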
Performance Modeling for Realistic Storage Devices
, 1997
Cited by 36 (8 self)
Managing large amounts of storage is difficult and becoming more so as both the complexity and number of storage devices are increasing. One approach to this problem is a self-managing storage system. Since a self-managing storage system is a real-time system, it requires a model that quickly approximates the behavior of the storage device in a workload-dependent fashion. We develop such a model.
Our approach to modeling storage devices is to model the individual physical components of the device, such as queues, caches, and disk mechanisms, and then compose the component models. Each component model determines its behavior from the specification of the entering workload and the lower-level device behavior. To support the lower-level component model in determining its behavior, each component model creates a modified workload specification that reflects the manner in which the physical component would modify the entering workload. Modifying the workload specification allows us, for example, to capture the altered spatial locality that occurs when queues reorder their requests.
Our model predicts the device behavior in terms of response time within a relative error ranging from 2% to 30% for interesting subsets of the domain of devices and workloads. To demonstrate this, the model has been validated with synthetic traces of parallel scientific file system workloads and video-on-demand applications and traces of transaction processing applications.
Our contributions to the area of performance modeling for storage devices include the following:
- An infrastructure for developing a composite model. The infrastructure
supports the development of more complicated devices and workloads
than we have validated.
- Methods to approximate the mean seek time and rotational latency of
a disk mechanism using measures of workload spatial locality.
- Methods to approximate the miss probability and the full- and partial- hit
probabilities in an I/O system's data caches using measures of workload
spatial locality.
- Methods to approximate the queue delay for non-FCFS scheduling algorithms
using a description of the workload arrival process.
These methods can be composed to provide analytic estimation procedures for the behavior of a subset of current storage devices.
Cache Performance Analysis of Traversals and Random Accesses
- In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms
, 1999
Cited by 25 (0 self)
This paper describes a model for studying the cache performance of algorithms in a direct-mapped cache. Using this model, we analyze the cache performance of several commonly occurring memory access patterns: (i) sequential and random memory traversals, (ii) systems of random accesses, and (iii) combinations of each. For each of these, we give exact expressions for the number of cache misses per memory access in our model. We illustrate the application of these analyses by determining the cache performance of two algorithms: the traversal of a binary search tree and the counting of items in a large array. Trace driven cache simulations validate that our analyses accurately predict cache performance.
1 Introduction
The concrete analysis of algorithms has a long and rich history. It has played an important role in understanding the performance of algorithms in practice. Traditional concrete analysis of algorithms is interested in approximating as closely as possible the number of "cost...
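To make the direct-mapped setting of this abstract concrete, here is a minimal trace-driven simulator sketch. It does not reproduce the paper's analytical expressions. The function names and the cache parameters are assumptions for illustration only. It counts misses per access for a sequential traversal and for uniformly random accesses.

```python
import random

def count_misses(addresses, num_lines, block_size):
    """Simulate a direct-mapped cache and return the number of misses."""
    tags = [None] * num_lines            # memory block currently held by each line
    misses = 0
    for addr in addresses:
        mem_block = addr // block_size
        line = mem_block % num_lines     # direct mapping: one possible line per block
        if tags[line] != mem_block:
            tags[line] = mem_block
            misses += 1
    return misses

def sequential_miss_rate(n_words, num_lines, block_size):
    """Misses per access for one pass over n_words consecutive addresses."""
    return count_misses(range(n_words), num_lines, block_size) / n_words

def random_miss_rate(mem_blocks, num_lines, block_size, accesses=20000, seed=0):
    """Misses per access for uniform random accesses over mem_blocks blocks."""
    rng = random.Random(seed)
    span = mem_blocks * block_size
    trace = [rng.randrange(span) for _ in range(accesses)]
    return count_misses(trace, num_lines, block_size) / accesses

if __name__ == "__main__":
    # One miss per block of 8 words on a cold sequential pass.
    print(sequential_miss_rate(1024, num_lines=16, block_size=8))
    # With 64 memory blocks competing for 16 lines, most accesses miss.
    print(random_miss_rate(mem_blocks=64, num_lines=16, block_size=8))
```

For the sequential pass the simulator reports one miss per block. For uniform random accesses over a region larger than the cache, a simple counting argument suggests a steady-state miss rate near 1 - (cache lines / memory blocks), which the simulation can be used to check.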
Automatic and Efficient Evaluation of Memory Hierarchies for Embedded Systems
, 1999
Cited by 23 (2 self)
Automation is the key to the design of future embedded systems as it permits application-specific customization while keeping design costs low. A key problem faced by automatic design systems is evaluating the performance of the vast number of alternative designs in a timely manner. For this paper, we focus on an embedded system consisting of the following components: a VLIW processor, instruction cache, data cache, and second-level unified cache. A hierarchical approach of partitioning the system into its constituent components and evaluating each component individually is utilized. The performance of each processor is evaluated independently of its memory hierarchy, and each of the caches is simulated using the traces from a single reference processor. Since the changes in the processor architecture do indeed affect the address traces and thus the performance of the memory hierarchy, the resulting overall performance estimate is inaccurate. To overcome this error, the changes in the processor architecture are modeled as a dilation of the reference processor's address trace, where each instruction block in the trace is conceptually stretched out by the dilation coefficient. This approach provides a projected cache performance that more accurately accounts for changes in the processor architecture. In order to understand the accuracy of the dilation model, we separate the possible errors that the model introduces and quantify these errors on a set of benchmarks. The results show the dilation model is effective for most of the design space and facilitates efficient automatic design.
The Effectiveness of Affinity-Based Scheduling in Multiprocessor Networking (Extended Version)
- IEEE/ACM Transactions on Networking
, 1996
Cited by 19 (1 self)
Techniques for avoiding the high memory overheads found on many modern shared-memory multiprocessors are of increasing importance in the development of high-performance multiprocessor protocol implementations. One such technique is processor-cache affinity scheduling, which can significantly lower packet latency and substantially increase protocol processing throughput [30]. In this paper, we evaluate several aspects of the effectiveness of affinity-based scheduling in multiprocessor network protocol processing, under packet-level and connection-level parallelization approaches. Specifically, we evaluate the performance of the scheduling technique 1) when a large number of streams are concurrently supported, 2) when processing includes copying of uncached packet data, 3) as applied to send-side protocol processing, and 4) in the presence of stream burstiness and source locality, two well-known properties of network traffic. We find that affinity-based scheduling performs well under the...
Expected I-Cache Miss Rates via the Gap Model
- In 21st Annual International Symposium on Computer Architecture
, 1994
Cited by 14 (1 self)
To evaluate the performance of a memory system, computer architects must determine the miss rate m of a cache C when running program P. Typically, the measured miss rate depends on the specific address mapping M of P set arbitrarily by the compiler and linker. In this paper, we remove the effect of the address-mapping on the miss rate by analyzing a symbolic trace T of basic blocks. By assuming each basic block has an equal probability of ending up anywhere in the address map, we determine the expected miss rate averaged over all possible address mappings.
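The quantity described in this abstract, a miss rate averaged over all address mappings, can be approximated by brute force. The sketch below is a Monte Carlo estimate, not the paper's analytical Gap model. It assumes, purely for illustration, a direct-mapped instruction cache in which every basic block occupies exactly one cache line; the function names are hypothetical.

```python
import random

def miss_rate_for_mapping(trace, line_of):
    """Miss rate of a symbolic basic-block trace under one fixed mapping."""
    resident = {}                        # cache line -> basic block it holds
    misses = 0
    for bb in trace:
        line = line_of[bb]
        if resident.get(line) != bb:
            resident[line] = bb
            misses += 1
    return misses / len(trace)

def expected_miss_rate(trace, num_lines, trials=2000, seed=0):
    """Average the miss rate over random placements of the basic blocks."""
    rng = random.Random(seed)
    blocks = sorted(set(trace))
    total = 0.0
    for _ in range(trials):
        # Each basic block is equally likely to land on any cache line.
        line_of = {bb: rng.randrange(num_lines) for bb in blocks}
        total += miss_rate_for_mapping(trace, line_of)
    return total / trials

if __name__ == "__main__":
    loop = ["A", "B"] * 100              # two blocks executed alternately
    print(expected_miss_rate(loop, num_lines=4))
```

For the two-block loop, the blocks collide with probability 1/4 and then miss on every access; otherwise only the two cold misses occur, so the average should come out near 0.25 * 1 + 0.75 * 0.01.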
The Performance Impact of Scheduling for Cache Affinity in Parallel Network Processing
- In International Symposium on High Performance Distributed Computing (HPDC-4), Pentagon City
, 1995
Cited by 10 (5 self)
We explore processor-cache affinity scheduling of parallel network protocol processing, in a setting in which protocol processing executes on a shared-memory multiprocessor concurrently with a general workload of non-protocol activity. We find affinity-based scheduling can significantly reduce the communication delay associated with protocol processing, enabling the host to support a greater number of concurrent streams and to provide higher maximum throughput to individual streams. In addition, we compare the performance of two parallelization alternatives, Locking and Independent Protocol Stacks (IPS), with very different caching behaviors. We find that IPS (which maximizes cache affinity) delivers much lower message latency and significantly higher message throughput capacity, yet exhibits less robust response to intra-stream burstiness and limited intra-stream scalability.
1 Introduction
In many modern computer architectures, there is a significant difference in the amount of ti...
A Blocked All-Pairs Shortest-Paths Algorithm
- Journal of Experimental Algorithmics
, 2003
Cited by 8 (0 self)
We propose a blocked version of Floyd's all-pairs shortest-paths algorithm. The blocked algorithm makes better utilization of cache than does Floyd's original algorithm. Experiments indicate that the blocked algorithm delivers a speedup (relative to the unblocked Floyd's algorithm) between 1.6 and 1.9 on a Sun Ultra Enterprise 4000/5000 for graphs that have between 480 and 3200 vertices. The measured speedup on an SGI O2 for graphs with between 240 and 1200 vertices is between 1.6 and 2.
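A blocked Floyd algorithm is usually organized as three phases per round of tiles. The sketch below follows that common formulation and may differ in detail from the implementation measured in the paper; the function names and the tile size parameter b are assumptions. Pure Python will not show the cache speedups reported above, so the sketch only checks that the blocked order yields the same distances as the plain triple loop.

```python
INF = float("inf")

def floyd(dist):
    """Reference unblocked Floyd all-pairs shortest paths."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def _relax_tile(d, i0, j0, k0, b, n):
    """Update tile (i0, j0) using intermediate vertices k0 .. k0 + b - 1."""
    for k in range(k0, min(k0 + b, n)):
        for i in range(i0, min(i0 + b, n)):
            for j in range(j0, min(j0 + b, n)):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]

def blocked_floyd(dist, b):
    """Blocked variant: process the matrix in b x b tiles, round by round."""
    n = len(dist)
    d = [row[:] for row in dist]
    starts = range(0, n, b)
    for k0 in starts:
        # Phase 1: the diagonal tile depends only on itself.
        _relax_tile(d, k0, k0, k0, b, n)
        # Phase 2: tiles in the same row or column as the diagonal tile.
        for t0 in starts:
            if t0 != k0:
                _relax_tile(d, k0, t0, k0, b, n)
                _relax_tile(d, t0, k0, k0, b, n)
        # Phase 3: every remaining tile, using the row and column tiles.
        for i0 in starts:
            for j0 in starts:
                if i0 != k0 and j0 != k0:
                    _relax_tile(d, i0, j0, k0, b, n)
    return d

if __name__ == "__main__":
    g = [[0, 3, INF, 7], [8, 0, 2, INF], [5, INF, 0, 1], [2, INF, INF, 0]]
    print(blocked_floyd(g, 2) == floyd(g))
```

Each call to the tile helper works on at most three b x b tiles at a time, which is what lets a suitably chosen b keep the working set in cache in a compiled implementation.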
Efficient Profile-Based Evaluation of Randomising Set Index Functions For Cache Memories
- In 2nd International Symposium on Performance Analysis of Systems and Software
, 2001
Cited by 7 (2 self)
The performance of direct mapped caches is degraded by conflict misses. It has been shown that conflict misses can be reduced by using randomising set index functions, such that repeated conflicts are avoided. However, optimising the set index function requires time consuming simulations, because the design space of randomising set index functions is very large. Therefore, we developed a profile-based technique that allows one to make a fast estimation of the miss ratio incurred by a set index function. Using this technique, one can perform a fast, initial exploration of the design space of set index functions, followed by a slower, but more accurate, analysis using simulation. The profile-based technique is based on a new representation of randomising set index functions using null spaces. The profile-based technique consists of two phases. In the first phase, a program is profiled and in the second phase, a score is computed from the profile data and the null space of a set index function. We show that the computed score closely reflects the miss ratio incurred by that set index function. Computing a score is a simple operation that requires no simulation time. Therefore, only one profiling run is required to estimate the miss ratios for a wide range of set index functions.
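One common family of randomising set index functions computes each index bit as the XOR of selected address bits, which is a linear map over GF(2). The sketch below assumes that family and does not implement the paper's profile-based score. The masks, names, and the 4-set cache are illustrative assumptions. It shows the null-space view: two block addresses land in the same set exactly when their XOR is mapped to zero.

```python
def parity(x):
    """XOR of all bits of x."""
    return bin(x).count("1") & 1

def set_index(block_addr, row_masks):
    """Each index bit is the parity of the address bits selected by one mask."""
    idx = 0
    for bit, mask in enumerate(row_masks):
        idx |= parity(block_addr & mask) << bit
    return idx

def in_null_space(vec, row_masks):
    """True if the index function maps this bit vector to set 0."""
    return set_index(vec, row_masks) == 0

def count_misses(trace, row_masks):
    """Direct-mapped cache simulation under the given set index function."""
    resident = {}                        # set index -> block address it holds
    misses = 0
    for a in trace:
        s = set_index(a, row_masks)
        if resident.get(s) != a:
            resident[s] = a
            misses += 1
    return misses

# Hypothetical 4-set cache over 4-bit block addresses.
MODULO = [0b0001, 0b0010]                # conventional index: low two bits
XOR = [0b0101, 0b1010]                   # bit0 ^ bit2 and bit1 ^ bit3

if __name__ == "__main__":
    trace = [0, 4, 8, 12] * 10           # stride-4 pattern over four blocks
    print("modulo misses:", count_misses(trace, MODULO))
    print("xor misses   :", count_misses(trace, XOR))
    print("4 in null space (modulo, xor):",
          in_null_space(4, MODULO), in_null_space(4, XOR))
```

With the conventional index, the stride-4 addresses differ by vectors that lie in the null space and therefore all collide in one set; the XOR masks move those vectors out of the null space and spread the four blocks over four sets.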

