STXXL: Standard template library for XXL data sets
 In: Proc. of ESA 2005. Volume 3669 of LNCS
, 2005
for processing huge data sets that can fit only on hard disks. It supports parallel disks and overlapping between disk I/O and computation, and it is the first I/O-efficient algorithm library that supports the pipelining technique that can save more than half of the I/Os. STXXL has been applied in both academic and industrial environments to a range of problems including text processing, graph algorithms, computational geometry, Gaussian elimination, visualization, analysis of microscopic images, and differential cryptographic analysis. The performance of STXXL and its applications is evaluated on synthetic and real-world inputs. We present the design of the library, how its performance features are supported, and demonstrate how the library integrates with STL. KEY WORDS: very large data sets; software library; C++ standard template library; algorithm engineering
A computational study of external-memory BFS algorithms
 In SODA
, 2006
Breadth-first search (BFS) traversal is an archetype for many important graph problems. However, computing a BFS level decomposition for massive graphs has so far been considered nonviable because of the large number of I/Os it incurs. This paper presents the first experimental evaluation of recent external-memory BFS algorithms for general graphs. With our STXXL-based implementations exploiting pipelining and disk parallelism, we were able to compute the BFS level decomposition of a web-crawl-based graph of around 130 million nodes and 1.4 billion edges in less than 4 hours using a single disk and in 2.3 hours using 4 disks. We demonstrate that some rather simple external-memory algorithms perform significantly better (minutes as compared to hours) than internal-memory BFS, even if more than half of the input resides internally.
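To make the term "BFS level decomposition" concrete, here is a minimal in-memory sketch in Python. It is not one of the external-memory algorithms evaluated in the paper. The function name `bfs_levels` and the edge-list input format are assumptions made for illustration. The sketch builds each level as "all neighbours of the current level, minus the two most recent levels", which is valid for undirected graphs and avoids a global visited set. Whether the paper's implementations use this exact formulation is not stated in the abstract.

```python
from collections import defaultdict


def bfs_levels(edges, source):
    """Return the BFS levels of an undirected graph as a list of sorted lists.

    `edges` is an iterable of (u, v) pairs; `source` is the start node.
    Hypothetical helper for illustration only; everything is held in RAM.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    levels = [[source]]
    previous, current = set(), {source}
    while current:
        # Gather the neighbours of the whole current level in one batch.
        neighbours = set()
        for node in current:
            neighbours |= adjacency[node]
        # In an undirected graph, a neighbour of level t lies in level
        # t-1, t, or t+1, so removing the two most recent levels suffices.
        next_level = neighbours - current - previous
        if next_level:
            levels.append(sorted(next_level))
        previous, current = current, next_level
    return levels
```

For example, `bfs_levels([(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)], 0)` groups the nodes by their distance from node 0.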
Low Depth Cache-Oblivious Algorithms
, 2009
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on a variety of parallel cache architectures. The approach is to design nested parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Our sorting algorithm yields the first cache-oblivious algorithms with polylogarithmic depth and low sequential cache complexities for list ranking, Euler tour tree labeling, tree contraction, least common ancestors, graph connectivity, and minimum spanning forest. Using known mappings, our results lead to low cache complexities on multicore processors (and shared-memory multiprocessors) with a single level of private caches or a single shared cache. We generalize these mappings to a multilevel parallel tree-of-caches model that reflects current and future trends in multicore cache hierarchies; these new mappings imply that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the
Cache-oblivious algorithms and data structures
 In SWAT
, 2004
Frigo, Leiserson, Prokop and Ramachandran in 1999 introduced the ideal-cache model as a formal model of computation for developing algorithms in environments with multiple levels of caching, and coined the terminology of cache-oblivious algorithms. Cache-oblivious algorithms are described as standard RAM algorithms with only one memory level, i.e. without any knowledge about memory hierarchies, but are analyzed in the two-level I/O model of Aggarwal and Vitter for an arbitrary memory and block size and an optimal offline cache replacement strategy. The result is algorithms that automatically apply to multilevel memory hierarchies. This paper gives an overview of the results achieved on cache-oblivious algorithms and data structures since the seminal paper by Frigo et al.
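A well-known example of the cache-oblivious style is recursive matrix transposition. The Python sketch below is an illustration under assumptions, not code from the survey. The function name `transpose`, its parameters, and the cutoff `base` are hypothetical. The code never refers to a cache or block size; it keeps halving the longer dimension, so the subproblems eventually fit into every level of the memory hierarchy. The cutoff only limits recursion overhead and is not tuned to any hardware parameter. The input is assumed to be a non-empty rectangular list of lists.

```python
def transpose(src, dst, row0=0, col0=0, rows=None, cols=None, base=4):
    """Write the transpose of the rows x cols block of `src` into `dst`.

    `src` is an R x C list of lists and `dst` a preallocated C x R one.
    Hypothetical helper for illustration only.
    """
    if rows is None:
        rows, cols = len(src), len(src[0])

    if rows <= base and cols <= base:
        # Small block: transpose it directly.
        for i in range(row0, row0 + rows):
            for j in range(col0, col0 + cols):
                dst[j][i] = src[i][j]
    elif rows >= cols:
        # Split the longer dimension (rows) in half and recurse.
        half = rows // 2
        transpose(src, dst, row0, col0, half, cols, base)
        transpose(src, dst, row0 + half, col0, rows - half, cols, base)
    else:
        # Split the longer dimension (columns) in half and recurse.
        half = cols // 2
        transpose(src, dst, row0, col0, rows, half, base)
        transpose(src, dst, row0, col0 + half, rows, cols - half, base)
```

A caller allocates the destination first, for example `dst = [[0] * len(src) for _ in src[0]]`, and then calls `transpose(src, dst)`.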
Computer-aided design of high-performance algorithms
, 2008
High-performance algorithms play an important role in many areas of computer science and are core components of many software systems used in real-world applications. Traditionally, the creation of these algorithms requires considerable expertise and experience, often in combination with a substantial amount of trial and error. Here, we outline a new approach to the process of designing high-performance algorithms that is based on the use of automated procedures for exploring potentially very large spaces of candidate designs. We contrast this computer-aided design approach with the traditional approach and discuss why it can be expected to yield better performing, yet simpler algorithms. Finally, we sketch out the high-level design of a software environment that supports our new design approach. Existing work on algorithm portfolios, algorithm selection, algorithm configuration and parameter tuning, as well as on general methods for discrete and continuous optimisation, fits naturally into our design approach and can be integrated into the proposed software environment.
Cache-Oblivious Databases: Limitations and Opportunities
, 2008
Cache-oblivious techniques, proposed in the theory community, have optimal asymptotic bounds on the amount of data transferred between any two adjacent levels of an arbitrary memory hierarchy. Moreover, this optimal performance is achieved without any hardware-platform-specific tuning. These properties are highly attractive to autonomous databases, especially because hardware architectures are becoming increasingly complex and diverse. In this paper, we present our design, implementation, and evaluation of the first cache-oblivious in-memory query processor, EaseDB. Moreover, we discuss the inherent limitations of the cache-oblivious approach as well as the opportunities given by the upcoming hardware architectures. Specifically, a cache-oblivious technique usually requires sophisticated algorithm design to achieve a comparable performance to its cache-conscious counterpart. Nevertheless, this development-time effort is compensated by the automaticity of performance achievement and the reduced ownership cost. Furthermore, this automaticity enables cache-oblivious techniques to outperform their cache-conscious counterparts in multithreading processors.
A Novel Parallel Sorting Algorithm for Contemporary Architectures
, 2007
Traditionally, the field of scientific computing has been dominated by numerical methods. However, modern scientific codes often combine numerical methods with combinatorial methods. Sorting, a widely studied problem in computer science, is an important primitive for combinatorial scientific computing. As high
Computing visibility on terrains in external memory
 In Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments / Workshop on Analytic Algorithms and Combinatorics (ALENEX/ANALCO
, 2007
Given an arbitrary viewpoint v and a terrain, the visibility map or viewshed of v is the set of points in the terrain that are visible from v. In this paper we consider the problem of computing the viewshed of a point on a very large grid terrain in external memory. We describe algorithms for this problem in the cache-aware and cache-oblivious models, together with an implementation and an experimental evaluation. Our algorithms are a novel application of the distribution sweeping technique and use O(sort(n)) I/Os, where sort(n) is the complexity of sorting n items of data in the I/O model. The experimental results demonstrate that our algorithm scales up and performs significantly better than the traditional internal-memory plane sweep algorithm, and can compute visibility for terrains of 1.1 billion points in less than 4 hours on a low-cost machine compared to more than 32 hours with the internal-memory algorithm.
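The visibility definition can be illustrated on a single ray of terrain cells. The Python sketch below is a one-dimensional illustration only, not the distribution-sweeping algorithm of the paper. The function name `visible_along_profile`, the unit cell spacing, and the convention that a grazing line of sight counts as visible are all assumptions. A cell is treated as visible when the slope of the line from the viewer's eye to the cell is at least as large as the slope to every cell between them.

```python
def visible_along_profile(heights, viewer_height=0.0):
    """Return one boolean per cell: is the cell visible from cell 0?

    `heights` is a non-empty list of elevations sampled at unit spacing
    along a ray that starts at the viewpoint (index 0).
    Hypothetical helper for illustration only.
    """
    eye = heights[0] + viewer_height
    visible = [True]  # the viewpoint cell itself
    max_slope = float("-inf")
    for distance in range(1, len(heights)):
        slope = (heights[distance] - eye) / distance
        # Visible if no nearer cell rises above the line of sight
        # (a grazing line of sight is counted as visible here).
        visible.append(slope >= max_slope)
        max_slope = max(max_slope, slope)
    return visible
```

For instance, with `heights = [10, 5, 8, 3, 12]` the cell of elevation 3 is hidden behind the cell of elevation 8, while the cell of elevation 12 rises above the blocking line and is visible again.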
An empirical study of cache-oblivious priority queues and their application to the shortest path problem. Available online under http://www.cs.bris.ac.uk/
, 2008
In recent years the Cache-Oblivious model of external memory computation has provided an attractive theoretical basis for the analysis of algorithms on massive datasets. Much progress has been made in discovering algorithms that are asymptotically optimal or near optimal. However, to date there are still relatively few successful experimental studies. In this paper we compare two different Cache-Oblivious priority queues based on the Funnel and Bucket Heap and apply them to the single-source shortest path problem on graphs with positive edge weights. Our results show that when RAM is limited and data is swapping to external storage, the Cache-Oblivious priority queues achieve orders of magnitude speedups over standard internal memory techniques. However, for the single-source shortest path problem both on simulated and real-world graph data, these speedups are markedly lower due to the time required to access the graph adjacency list itself.
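As a point of reference for how a priority queue drives the single-source shortest path problem, the following Python sketch shows Dijkstra's algorithm with the standard library's binary heap. It is an internal-memory illustration, not the Funnel Heap or Bucket Heap studied in the paper. The function name `dijkstra` and the adjacency-dictionary input format are assumptions. Edge weights are assumed to be positive, as in the abstract. Each relaxation reads the adjacency list of the popped node, which is the access the abstract identifies as limiting the speedup.

```python
import heapq


def dijkstra(adjacency, source):
    """Return a dict of shortest distances from `source`.

    `adjacency` maps a node to a list of (neighbour, weight) pairs with
    positive weights. Hypothetical helper for illustration only.
    """
    distances = {source: 0}
    queue = [(0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        # Skip stale queue entries left behind by earlier improvements.
        if dist > distances.get(node, float("inf")):
            continue
        # Scan the adjacency list of the settled node.
        for neighbour, weight in adjacency.get(node, ()):
            candidate = dist + weight
            if candidate < distances.get(neighbour, float("inf")):
                distances[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return distances
```

Replacing `heapq` with a different priority queue leaves the rest of the loop unchanged, which is the kind of substitution such a comparison relies on.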