Results 1  10
of
14
Engineering a cacheoblivious sorting algorithm
 In Proc. 6th Workshop on Algorithm Engineering and Experiments
, 2004
"... The cacheoblivious model of computation is a twolevel memory model with the assumption that the parameters of the model are unknown to the algorithms. A consequence of this assumption is that an algorithm efficient in the cache oblivious model is automatically efficient in a multilevel memory mod ..."
Abstract

Cited by 25 (1 self)
 Add to MetaCart
The cacheoblivious model of computation is a twolevel memory model with the assumption that the parameters of the model are unknown to the algorithms. A consequence of this assumption is that an algorithm efficient in the cache oblivious model is automatically efficient in a multilevel memory model. Since the introduction of the cacheoblivious model by Frigo et al. in 1999, a number of algorithms and data structures in the model has been proposed and analyzed. However, less attention has been given to whether the nice theoretical proporities of cacheoblivious algorithms carry over into practice. This paper is an algorithmic engineering study of cacheoblivious sorting. We investigate a number of implementation issues and parameters choices for the cacheoblivious sorting algorithm Lazy Funnelsort by empirical methods, and compare the final algorithm with Quicksort, the established standard for comparison based sorting, as well as with recent cacheaware proposals. The main result is a carefully implemented cacheoblivious sorting algorithm, which we compare to the best implementation of Quicksort we can find, and find that it competes very well for input residing in RAM, and outperforms Quicksort for input on disk. 1
Cacheaware and cacheoblivious adaptive sorting
 In Proc. 32nd International Colloquium on Automata, Languages, and Programming, Lecture Notes in Computer Science
, 2005
"... Abstract. Two new adaptive sorting algorithms are introduced which perform an optimal number of comparisons with respect to the number of inversions in the input. The first algorithm is based on a new linear time reduction to (nonadaptive) sorting. The second algorithm is based on a new division pr ..."
Abstract

Cited by 12 (5 self)
 Add to MetaCart
Abstract. Two new adaptive sorting algorithms are introduced which perform an optimal number of comparisons with respect to the number of inversions in the input. The first algorithm is based on a new linear time reduction to (nonadaptive) sorting. The second algorithm is based on a new division protocol for the GenericSort algorithm by EstivillCastro and Wood. From both algorithms we derive I/Ooptimal cacheaware and cacheoblivious adaptive sorting algorithms. These are the first I/Ooptimal adaptive sorting algorithms. 1
Low Depth CacheOblivious Algorithms
, 2009
"... In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on a variety of parallel cache architectures. The approach is to design nested parallel algorithms that have low depth (span, critical path length) and for which the natural s ..."
Abstract

Cited by 10 (1 self)
 Add to MetaCart
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on a variety of parallel cache architectures. The approach is to design nested parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cacheoblivious model. We describe several cacheoblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparsematrix vector multiply on matrices with good vertex separators. Our sorting algorithm yields the first cacheoblivious algorithms with polylogarithmic depth and low sequential cache complexities for list ranking, Euler tour tree labeling, tree contraction, least common ancestors, graph connectivity, and minimum spanning forest. Using known mappings, our results lead to low cache complexities on multicore processors (and sharedmemory multiprocessors) with a single level of private caches or a single shared cache. We generalize these mappings to a multilevel parallel treeofcaches model that reflects current and future trends in multicore cache hierarchies—these new mappings imply that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the
Cacheoblivious algorithms and data structures
 IN SWAT
, 2004
"... Frigo, Leiserson, Prokop and Ramachandran in 1999 introduced the idealcache model as a formal model of computation for developing algorithms in environments with multiple levels of caching, and coined the terminology of cacheoblivious algorithms. Cacheoblivious algorithms are described as stand ..."
Abstract

Cited by 9 (2 self)
 Add to MetaCart
Frigo, Leiserson, Prokop and Ramachandran in 1999 introduced the idealcache model as a formal model of computation for developing algorithms in environments with multiple levels of caching, and coined the terminology of cacheoblivious algorithms. Cacheoblivious algorithms are described as standard RAM algorithms with only one memory level, i.e. without any knowledge about memory hierarchies, but are analyzed in the twolevel I/O model of Aggarwal and Vitter for an arbitrary memory and block size and an optimal offline cache replacement strategy. The result are algorithms that automatically apply to multilevel memory hierarchies. This paper gives an overview of the results achieved on cacheoblivious algorithms and data structures since the seminal paper by Frigo et al.
An Optimal CacheOblivious Priority Queue and its Application to Graph Algorithms
 SIAM JOURNAL ON COMPUTING
, 2007
"... We develop an optimal cacheoblivious priority queue data structure, supporting insertion, deletion, and deletemin operations in $O(\frac{1}{B}\log_{M/B}\frac{N}{B})$ amortized memory transfers, where $M$ and $B$ are the memory and block transfer sizes of any two consecutive levels of a multilevel ..."
Abstract

Cited by 5 (0 self)
 Add to MetaCart
We develop an optimal cacheoblivious priority queue data structure, supporting insertion, deletion, and deletemin operations in $O(\frac{1}{B}\log_{M/B}\frac{N}{B})$ amortized memory transfers, where $M$ and $B$ are the memory and block transfer sizes of any two consecutive levels of a multilevel memory hierarchy. In a cacheoblivious data structure, $M$ and $B$ are not used in the description of the structure. Our structure is as efficient as several previously developed external memory (cacheaware) priority queue data structures, which all rely crucially on knowledge about $M$ and $B$. Priority queues are a critical component in many of the best known external memory graph algorithms, and using our cacheoblivious priority queue we develop several cacheoblivious graph algorithms.
On the limits of cacheoblivious matrix transposition
 In Proc. of 2nd Symp. of Trustworthy Global Computing
, 2006
"... Abstract Intuitively, a cacheoblivious algorithm implements an adaptive strategy which runs efficiently on any memory hierarchy without requiring previous knowledge of the parameters of the hierarchy. For this reason, cacheobliviousness is an attractive feature of an algorithm meant for a global c ..."
Abstract

Cited by 3 (3 self)
 Add to MetaCart
Abstract Intuitively, a cacheoblivious algorithm implements an adaptive strategy which runs efficiently on any memory hierarchy without requiring previous knowledge of the parameters of the hierarchy. For this reason, cacheobliviousness is an attractive feature of an algorithm meant for a global computing environment, where software may be run on a variety of different platforms for load management purposes. In this paper we present a negative result on cacheobliviousness, namely, we show that an optimal cacheoblivious algorithm for the fundamental primitive of matrix transposition cannot exist without the tall cache assumption, which forces the (unknown) parameters of the memory hierarchy to satisfy a certain technical relation. Our contribution specializes the result of Brodal and Fagerberg for general permutations to matrix transposition, and provides further evidence that the tall cache assumption is often necessary to attain optimality in the context of cacheoblivious algorithms. 1
Redesigning the String Hash Table, Burst Trie, and BST to Exploit Cache
, 2011
"... A key decision when developing inmemory computing applications is choice of a mechanism to store and retrieve strings. The most efficient current data structures for this task are the hash table with movetofront chains and the burst trie, both of which use linked lists as a substructure, and vari ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
A key decision when developing inmemory computing applications is choice of a mechanism to store and retrieve strings. The most efficient current data structures for this task are the hash table with movetofront chains and the burst trie, both of which use linked lists as a substructure, and variants of binary search tree. These data structures are computationally efficient, but typical implementations use large numbers of nodes and pointers to manage strings, which is not efficient in use of cache. In this article, we explore two alternatives to the standard representation: the simple expedient of including the string in its node, and, for linked lists, the more drastic step of replacing each list of nodes by a contiguous array of characters. Our experiments show that, for large sets of strings, the improvement is dramatic. For hashing, in the best case the total space overhead is reduced to less than 1 bit per string. For the burst trie, over 300MB of strings can be stored in a total of under 200MB of memory with significantly improved search time. These results, on a variety of data sets, show that cachefriendly variants of fundamental data structures can yield remarkable gains in performance.
CacheOblivious RedBlue Line Segment Intersection
"... Abstract. We present an optimal cacheoblivious algorithm for finding all intersections between a set of nonintersecting red segments and a set of nonintersecting blue segments in the plane. Our algorithm uses O ( N B log M/B N B + T/B) memory transfers, where N is the total number of segments, M ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
Abstract. We present an optimal cacheoblivious algorithm for finding all intersections between a set of nonintersecting red segments and a set of nonintersecting blue segments in the plane. Our algorithm uses O ( N B log M/B N B + T/B) memory transfers, where N is the total number of segments, M and B are the memory and block transfer sizes of any two consecutive levels of any multilevel memory hierarchy, and T is the number of intersections. 1
Oblivious Algorithms for Multicores and Networks of Processors
"... We address the design of algorithms for multicores that are oblivious to machine parameters. We propose HM, a multicore model consisting of a parallel sharedmemory machine with hierarchical multilevel caching, and we introduce a multicoreoblivious approach to algorithms and schedulers for HM. A m ..."
Abstract
 Add to MetaCart
We address the design of algorithms for multicores that are oblivious to machine parameters. We propose HM, a multicore model consisting of a parallel sharedmemory machine with hierarchical multilevel caching, and we introduce a multicoreoblivious approach to algorithms and schedulers for HM. A multicoreoblivious algorithm is specified with no mention of any machine parameters, such as the number of cores, number of cache levels, cache sizes and block lengths. However, it is equipped with a small set of instructions that can be used to provide hints to the runtime scheduler on how to schedule parallel tasks. We present efficient multicoreoblivious algorithms for several fundamental problems including matrix transposition, FFT, sorting, the Gaussian Elimination Paradigm, list ranking, and connected components. The notion of a multicoreoblivious algorithm is complementary to that of a networkoblivious algorithm, introduced by Bilardi et al. (2007) for parallel distributedmemory machines where processors communicate pointtopoint. We show that several of our multicoreoblivious algorithms translate into efficient networkoblivious algorithms, adding to the body of known efficient networkoblivious algorithms. Keywords: list ranking multicore, cache, network, oblivious algorithm, Gaussian elimination paradigm,
Techniques for Parallel Memory Hierarchies
"... Shared memory parallel machines are a popular choice for processing large data sets as they present a comparatively simple interface for programmers. Despite this, writing scalable parallel programs on them is very difficult because of the large disparity in the costs of accessing memory locations a ..."
Abstract
 Add to MetaCart
Shared memory parallel machines are a popular choice for processing large data sets as they present a comparatively simple interface for programmers. Despite this, writing scalable parallel programs on them is very difficult because of the large disparity in the costs of accessing memory locations and limited interconnect bandwidth. For most data intensive applications, performance on these machines is bound not by the number of the processors, but by the capabilities of interconnect between processors, caches, and memory and how well the program is optimized for data locality. By using parallel memory hierarchies a tree of caches in essence as an approximate model for such machines, I propose to develop software and hardware approaches to make designing scalable nested parallel programs easier. The three techniques I plan to study (in theory and practice) are: • Designing parallel algorithms with low “memory access costs ” for frequently used problems, along with a new cost model to measure these costs. • Thread schedulers with provable guarantees on running times for a broad class of nested parallel programs, and an experimental setup for expressing and testing them. • Combinable memoryblock transactions: A scheme for scalable implementation of atomic operations required for managing concurrency in runtime systems or any other concurrent program.