Results 11  20
of
43
Cache Performance Analysis of Traversals and Random Accesses
 In Proceedings of the Tenth Annual ACMSIAM Symposium on Discrete Algorithms
, 1999
"... This paper describes a model for studying the cache performance of algorithms in a directmapped cache. Using this model, we analyze the cache performance of several commonly occurring memory access patterns: (i) sequential and random memory traversals, (ii) systems of random accesses, and (iii) com ..."
Abstract

Cited by 24 (0 self)
 Add to MetaCart
This paper describes a model for studying the cache performance of algorithms in a directmapped cache. Using this model, we analyze the cache performance of several commonly occurring memory access patterns: (i) sequential and random memory traversals, (ii) systems of random accesses, and (iii) combinations of each. For each of these, we give exact expressions for the number of cache misses per memory access in our model. We illustrate the application of these analyses by determining the cache performance of two algorithms: the traversal of a binary search tree and the counting of items in a large array. Trace driven cache simulations validate that our analyses accurately predict cache performance. 1 Introduction The concrete analysis of algorithms has a long and rich history. It has played an important role in understanding the performance of algorithms in practice. Traditional concrete analysis of algorithms is interested in approximating as closely as possible the number of "cost...
A Comparison of Cache Aware and Cache Oblivious Static Search Trees Using Program Instrumentation
, 2002
"... An experimental comparison of cache aware and cache oblivious static search tree algorithms is presented. Both cache aware and cache oblivious algorithms outperform classic binary search on large data sets because of their better utilization of cache memory. Cache aware algorithms with implicit p ..."
Abstract

Cited by 24 (0 self)
 Add to MetaCart
An experimental comparison of cache aware and cache oblivious static search tree algorithms is presented. Both cache aware and cache oblivious algorithms outperform classic binary search on large data sets because of their better utilization of cache memory. Cache aware algorithms with implicit pointers perform best overall, but cache oblivious algorithms do almost as well and do not have to be tuned to the memory block size as cache aware algorithms require. Program instrumentation techniques are used to compare the cache misses and instruction counts for implementations of these algorithms.
HighPerformance Algorithm Engineering for Computational Phylogenetics
 J. Supercomputing
, 2002
"... A phylogeny is the evolutionary history of a group of organisms; systematists (and other biologists) attempt to reconstruct this history from various forms of data about contemporary organisms. Phylogeny reconstruction is a crucial step in the understanding of evolution as well as an important tool ..."
Abstract

Cited by 21 (7 self)
 Add to MetaCart
A phylogeny is the evolutionary history of a group of organisms; systematists (and other biologists) attempt to reconstruct this history from various forms of data about contemporary organisms. Phylogeny reconstruction is a crucial step in the understanding of evolution as well as an important tool in biological, pharmaceutical, and medical research. Phylogeny reconstruction from molecular data is very difficult: almost all optimization models give rise to NPhard (and thus computationally intractable) problems. Yet approximations must be of very high quality in order to avoid outright biological nonsense. Thus many biologists have been willing to run farms of processors for many months in order to analyze just one dataset. Highperformance algorithm engineering offers a battery of tools that can reduce, sometimes spectacularly, the running time of existing phylogenetic algorithms, as well as help designers produce better algorithms. We present an overview of algorithm engineering techniques, illustrating them with an application to the "breakpoint analysis" method of Sankoff et al., which resulted in the GRAPPA software suite. GRAPPA demonstrated a speedup in running time by over eight orders of magnitude over the original implementation on a variety of real and simulated datasets. We show how these algorithmic engineering techniques are directly applicable to a large variety of challenging combinatorial problems in computational biology.
Using PRAM Algorithms on a UniformMemoryAccess SharedMemory Architecture
 Proc. 5th Int’l Workshop on Algorithm Engineering (WAE 2001), volume 2141 of Lecture Notes in Computer Science
, 2001
"... The ability to provide uniform sharedmemory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform sharedmemory algorithm from a PRAM algorithm and present the res ..."
Abstract

Cited by 20 (11 self)
 Add to MetaCart
The ability to provide uniform sharedmemory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform sharedmemory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors (from 1 to 64) and across the entire range of instance sizes tested. This linear speedup with the number of processors is, to our knowledge, the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is a graph decomposition algorithm that also requires the computation of a spanning tree; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of sharedmemory parallel algorithms.
Accessing Multiple Sequences Through Set Associative Caches
 In Proc
, 1999
"... The cache hierarchy prevalent in todays high performance processors has to be taken into account in order to design algorithms which perform well in practice. We start from the empirical observation that external memory algorithms often turn out to be good algorithms for cached memory. This is n ..."
Abstract

Cited by 19 (4 self)
 Add to MetaCart
The cache hierarchy prevalent in todays high performance processors has to be taken into account in order to design algorithms which perform well in practice. We start from the empirical observation that external memory algorithms often turn out to be good algorithms for cached memory. This is not self evident since caches have a fixed and quite restrictive algorithm choosing the content of the cache. We investigate the impact of this restriction for the frequently occurring case of access to multiple sequences. We show that any access pattern to k = \Theta(M=B ) sequential data streams can be efficiently supported on an away set associative cache with capacity M and line size B. The bounds are tight up to lower order terms.
Cache performance of SAT solvers: A case study for efficient implementation of algorithms
, 2003
"... Abstract. We experimentally evaluate the cache performance of different SAT solvers as a case study for efficient implementation of SAT algorithms. We evaluate several different BCP mechanisms and show their respective run time and cache performances on selected benchmark instances. From the experim ..."
Abstract

Cited by 19 (1 self)
 Add to MetaCart
Abstract. We experimentally evaluate the cache performance of different SAT solvers as a case study for efficient implementation of SAT algorithms. We evaluate several different BCP mechanisms and show their respective run time and cache performances on selected benchmark instances. From the experiments we conclude that cache friendly data structure is a key element for efficient implementation of SAT solvers. We also show empirical cache miss rates of several modern SAT solvers based on the DavisLogemannLoveland algorithm with learning and nonchronological backtracking. We conclude that recently developed SAT solvers are much more cache friendly in data structures and algorithm implementations compared with their predecessors. 1
Graph and Hashing Algorithms for Modern Architectures: Design and Performance
 PROC. 2ND WORKSHOP ON ALGORITHM ENG. WAE 98, MAXPLANCK INST. FÜR INFORMATIK, 1998, IN TR MPII981019
, 1998
"... We study the eects of caches on basic graph and hashing algorithms and show how cache effects inuence the best solutions to these problems. We study the performance of basic data structures for storing lists of values and use these results to design and evaluate algorithms for hashing, BreadthFi ..."
Abstract

Cited by 12 (0 self)
 Add to MetaCart
We study the eects of caches on basic graph and hashing algorithms and show how cache effects inuence the best solutions to these problems. We study the performance of basic data structures for storing lists of values and use these results to design and evaluate algorithms for hashing, BreadthFirstSearch (BFS) and DepthFirstSearch (DFS). For the basic
Polynomial Division using Dynamic Arrays, Heaps, and Packed Exponent Vectors
 COMPUTER ALGEBRA GROUP SIMON FRASER UNIVERSITY
, 2007
"... A common way of implementing multivariate polynomial multiplication and division is to represent polynomials as linked lists of terms sorted in a term ordering and to use repeated merging. This results in poor performance on large sparse polynomials. In this paper we use an auxiliary heap of pointe ..."
Abstract

Cited by 12 (8 self)
 Add to MetaCart
A common way of implementing multivariate polynomial multiplication and division is to represent polynomials as linked lists of terms sorted in a term ordering and to use repeated merging. This results in poor performance on large sparse polynomials. In this paper we use an auxiliary heap of pointers to reduce the number of monomial comparisons in the worst case while keeping the overall storage linear. We give two variations. In the first, the size of the heap is bounded by the number of terms in the quotient(s). In the second, which is new, the size is bounded by the number of terms in the divisor(s). We use dynamic arrays of terms rather than linked lists to reduce storage allocations and indirect memory references. We pack monomials in the array to reduce storage and to speed up monomial comparisons. We give a new packing for the graded reverse lexicographical ordering. We have implemented the heap algorithms in C with an interface to Maple. For comparison we have also implemented Yan’s “geobuckets” data structure. Our timings demonstrate that heaps of pointers are comparable in speed with geobuckets but use significantly less storage.
Cacheoblivious algorithms and data structures
 IN SWAT
, 2004
"... Frigo, Leiserson, Prokop and Ramachandran in 1999 introduced the idealcache model as a formal model of computation for developing algorithms in environments with multiple levels of caching, and coined the terminology of cacheoblivious algorithms. Cacheoblivious algorithms are described as stand ..."
Abstract

Cited by 10 (1 self)
 Add to MetaCart
Frigo, Leiserson, Prokop and Ramachandran in 1999 introduced the idealcache model as a formal model of computation for developing algorithms in environments with multiple levels of caching, and coined the terminology of cacheoblivious algorithms. Cacheoblivious algorithms are described as standard RAM algorithms with only one memory level, i.e. without any knowledge about memory hierarchies, but are analyzed in the twolevel I/O model of Aggarwal and Vitter for an arbitrary memory and block size and an optimal offline cache replacement strategy. The result are algorithms that automatically apply to multilevel memory hierarchies. This paper gives an overview of the results achieved on cacheoblivious algorithms and data structures since the seminal paper by Frigo et al.