Results 1-10 of 28
Better external memory suffix array construction
In: Workshop on Algorithm Engineering & Experiments, 2005
Cited by 29 (5 self)
Abstract:
Suffix arrays are a simple and powerful data structure for text processing that can be used for full-text indexes, data compression, and many other applications, in particular in bioinformatics. However, so far it has looked prohibitive to build suffix arrays for huge inputs that do not fit into main memory. This paper presents the design, analysis, implementation, and experimental evaluation of several new and improved algorithms for suffix array construction. The algorithms are asymptotically optimal in the worst case or on average. Our implementation can construct suffix arrays for inputs of up to 4 GBytes in hours on a low-cost machine. As a tool of possible independent interest, we present a systematic way to design, analyze, and implement pipelined algorithms.
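For reference, the data structure itself is small: a suffix array is just the list of suffix start positions in lexicographic order. A minimal in-memory sketch (not the paper's external-memory construction, which avoids the quadratic comparisons below):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Naive in-memory suffix array: sort suffix start positions by the suffixes
// they denote. O(n^2 log n) worst case -- only a baseline; the paper's point
// is building this I/O-efficiently for inputs far larger than main memory.
std::vector<int> suffix_array(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);   // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        // compare the suffixes starting at positions a and b
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}
```

For "banana" this yields [5, 3, 1, 0, 4, 2], i.e. the suffixes a, ana, anana, banana, na, nana in order.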
The Cache-Oblivious Gaussian Elimination Paradigm: Theoretical Framework, Parallelization and Experimental Evaluation
2009
Cited by 21 (7 self)
Abstract:
We consider triply-nested loops of the type that occur in the standard Gaussian elimination algorithm, which we denote by GEP (the Gaussian Elimination Paradigm). We present two related cache-oblivious methods, I-GEP and C-GEP, both of which reduce the number of I/Os performed by the computation over that performed by standard GEP by a factor of √M, where M is the size of the cache. Cache-oblivious I-GEP computes in place and solves most of the known applications of GEP, including Gaussian elimination and LU-decomposition without pivoting and Floyd-Warshall all-pairs shortest paths. Cache-oblivious C-GEP uses a modest amount of additional space, but is completely general and applies to any code in GEP form. Both I-GEP and C-GEP produce system-independent cache-efficient code, and are potentially applicable to use by optimizing compilers for loop transformation. We present parallel I-GEP and C-GEP that achieve good speedup and cache-obliviously match the sequential caching performance for both shared and distributed caches for sufficiently large inputs. We present extensive experimental results for both in-core and out-of-core performance of our algorithms. We consider both sequential and parallel implementations, and compare them with finely-tuned cache-aware BLAS code for matrix multiplication and Gaussian elimination without pivoting. Our results indicate that cache-oblivious GEP offers an attractive tradeoff between efficiency and portability.
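To make the class of loops concrete: every GEP computation updates c[i][j] from c[i][j], c[i][k], and c[k][j] with some problem-specific update function. A sketch of the untransformed triple loop, instantiated (as one of the paper's examples) with the Floyd-Warshall update; the cache-oblivious versions recursively block this same computation:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Standard GEP form: triply-nested loop updating c[i][j] from c[i][j],
// c[i][k], c[k][j]. Here the update function is the Floyd-Warshall
// all-pairs shortest-paths relaxation; Gaussian elimination without
// pivoting has the same loop structure with a different update.
void gep_floyd_warshall(std::vector<std::vector<int>>& c) {
    const int n = static_cast<int>(c.size());
    for (int k = 0; k < n; ++k)           // outer "pivot" loop
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)   // f(c[i][j], c[i][k], c[k][j])
                c[i][j] = std::min(c[i][j], c[i][k] + c[k][j]);
}
```

Executed verbatim, this loop incurs roughly n^3 / B cache misses; the paper's recursive blocking brings that down by the √M factor quoted above without ever naming M or B in the code.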
A computational study of external-memory BFS algorithms
In: SODA, 2006
Cited by 18 (4 self)
Abstract:
Breadth First Search (BFS) traversal is an archetype for many important graph problems. However, computing a BFS level decomposition for massive graphs was so far considered non-viable because of the large number of I/Os it incurs. This paper presents the first experimental evaluation of recent external-memory BFS algorithms for general graphs. With our STXXL-based implementations exploiting pipelining and disk parallelism, we were able to compute the BFS level decomposition of a web-crawl-based graph of around 130 million nodes and 1.4 billion edges in less than 4 hours using a single disk and 2.3 hours using 4 disks. We demonstrate that some rather simple external-memory algorithms perform significantly better (minutes as compared to hours) than internal-memory BFS, even if more than half of the input resides internally.
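The "level decomposition" being computed is simply the distance-in-edges of every vertex from the source. A sketch of the internal-memory baseline the paper compares against; the external-memory algorithms it evaluates replace the random adjacency-list accesses below with sorting and scanning passes over disk:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Internal-memory BFS level decomposition: level[v] is the number of edges
// on a shortest path from src to v, or -1 if v is unreachable. The random
// accesses to adj[u] are exactly what make this I/O-inefficient on disk.
std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int src) {
    std::vector<int> level(adj.size(), -1);
    std::queue<int> q;
    level[src] = 0;
    q.push(src);
    while (!q.empty()) {
        int u = q.front();
        q.pop();
        for (int v : adj[u])
            if (level[v] == -1) {          // first visit: one level deeper
                level[v] = level[u] + 1;
                q.push(v);
            }
    }
    return level;
}
```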
MCSTL: The Multi-Core Standard Template Library
Cited by 13 (4 self)
Abstract:
Future gains in computing performance will not stem from increased clock rates, but from even more cores in a processor. Since automatic parallelization is still limited to easily parallelizable sections of the code, most applications will soon have to support parallelism explicitly. The Multi-Core Standard Template Library (MCSTL) simplifies parallelization by providing efficient parallel implementations of the algorithms in the C++ Standard Template Library. Thus, simple recompilation will provide partial parallelization of applications that make consistent use of the STL. We present performance measurements on several architectures. For example, our sorter achieves a speedup of 21 on an 8-core 32-thread Sun T1.
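The "simple recompilation" claim means application code stays unchanged. MCSTL's descendant, libstdc++'s parallel mode, works the same way: an ordinary STL call like the one below is swapped for a multi-threaded implementation by compiling with -D_GLIBCXX_PARALLEL, with no source change.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// An ordinary std::sort call. Built normally, this is the sequential
// introsort; built with g++ -D_GLIBCXX_PARALLEL (GCC's parallel mode,
// which grew out of MCSTL), the same line dispatches to a parallel sort.
std::vector<int> sort_copy(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return v;
}
```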
Revisiting Resistance Speeds Up I/O-Efficient LTL Model Checking
2008
Cited by 10 (1 self)
Abstract:
Revisiting-resistant graph algorithms are those that can tolerate re-exploration of edges without yielding incorrect results. Revisiting-resistant I/O-efficient graph algorithms exhibit considerable speedup in practice in comparison to non-revisiting-resistant algorithms. In this paper we present a new revisiting-resistant I/O-efficient LTL model checking algorithm. We analyze its theoretical I/O complexity and experimentally compare its performance to existing I/O-efficient LTL model checking algorithms.
The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation
In: SPAA ’07: Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, 2007
Cited by 8 (2 self)
Abstract:
The cache-oblivious Gaussian Elimination Paradigm (GEP) was introduced by the authors in [6] to obtain efficient cache-oblivious algorithms for several important problems that have algorithms with triply-nested loops similar to those that occur in Gaussian elimination. These include Gaussian elimination and LU-decomposition without pivoting, all-pairs shortest paths, and matrix multiplication. In this paper, we prove several important properties of the cache-oblivious framework for GEP given in [6], which we denote by I-GEP. We build on these results to obtain C-GEP, a completely general cache-oblivious implementation of GEP that applies to any code in GEP form, and which has the same time and I/O bounds as the earlier algorithm in [6], while using a modest amount of additional space. We present an experimental evaluation of the caching performance of I-GEP and C-GEP in relation to the traditional Gaussian elimination algorithm. Our experimental results indicate that I-GEP and C-GEP outperform GEP on inputs of reasonable size, with dramatic improvement in running time over GEP when the data is out of core. ‘Tiling’, an important loop transformation technique employed by optimizing compilers to improve temporal locality in nested loops, is a cache-aware method that does not adapt to all levels of a multi-level memory hierarchy. The cache-oblivious GEP framework (either I-GEP or C-GEP) produces system-independent I/O-efficient code for triply-nested loops of the form that appears in Gaussian elimination without pivoting, and is potentially applicable to use by optimizing compilers for loop transformation.
On computational models for flash memory devices
In: Experimental Algorithms, 2009
Cited by 4 (1 self)
Abstract:
Flash-memory-based solid-state disks are fast becoming the dominant form of end-user storage devices, partly even replacing traditional hard disks. Existing two-level memory hierarchy models fail to realize the full potential of flash-based storage devices. We propose two new computation models, the general flash model and the unit-cost model, for memory hierarchies involving these devices. Our models are simple enough for meaningful algorithm design and analysis. In particular, we show that a broad range of existing external-memory algorithms and data structures based on the merging paradigm can be adapted efficiently to the unit-cost model. Our experiments show that the theoretical analysis of algorithms in our models corresponds to the empirical behavior of algorithms when using solid-state disks as external memory.
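The "merging paradigm" the abstract refers to covers algorithms whose external accesses are sequential scans of sorted runs. A minimal in-memory sketch of the core primitive, a k-way merge driven by a small priority queue; on flash, each run is read strictly sequentially, the access pattern both proposed models reward:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// k-way merge of sorted runs using a k-entry min-priority queue.
// Each run is consumed front to back, i.e. strictly sequentially --
// the pattern that flash devices (and the unit-cost model) favor.
std::vector<int> kway_merge(const std::vector<std::vector<int>>& runs) {
    using Item = std::pair<int, std::pair<int, int>>;  // (value, (run, index))
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    for (int r = 0; r < static_cast<int>(runs.size()); ++r)
        if (!runs[r].empty()) pq.push({runs[r][0], {r, 0}});
    std::vector<int> out;
    while (!pq.empty()) {
        auto [v, ri] = pq.top();
        pq.pop();
        out.push_back(v);
        auto [r, i] = ri;
        if (i + 1 < static_cast<int>(runs[r].size()))
            pq.push({runs[r][i + 1], {r, i + 1}});     // advance within run r
    }
    return out;
}
```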
Building a parallel pipelined external memory algorithm library
In: 23rd IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2009
Cited by 3 (1 self)
Abstract:
Large and fast hard disks for little money have enabled the processing of huge amounts of data on a single machine. For this purpose, the well-established STXXL library provides a framework for external memory algorithms with an easy-to-use interface. However, the clock speed of processors cannot keep up with the increasing bandwidth of parallel disks, making many algorithms actually compute-bound. To overcome this steadily worsening limitation, we exploit today’s multi-core processors with two new approaches. First, we parallelize the internal computation of the encapsulated external memory algorithms by utilizing the MCSTL library. Second, we augment the unique pipelining feature of STXXL to enable automatic task parallelization. We show, using synthetic and practical use cases, that the combination of both techniques greatly increases performance.
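The task parallelization idea is that independent pipeline stages run concurrently on different cores, connected by buffered streams. A toy sketch of that pattern (hypothetical stage functions, not STXXL's actual pipelining API): one thread produces values, another consumes them through a small hand-rolled channel, overlapping the two stages.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// Minimal blocking channel between two pipeline stages. std::nullopt
// serves as the end-of-stream marker.
class Channel {
    std::queue<std::optional<long>> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(std::optional<long> v) {
        { std::lock_guard<std::mutex> l(m_); q_.push(v); }
        cv_.notify_one();
    }
    std::optional<long> pop() {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&] { return !q_.empty(); });
        auto v = q_.front();
        q_.pop();
        return v;
    }
};

// Two-stage pipeline: stage 1 produces the squares 1..n, stage 2 sums
// them; the stages run concurrently, like tasks in a pipelined flow graph.
long pipelined_sum_of_squares(int n) {
    Channel ch;
    long sum = 0;
    std::thread producer([&] {
        for (int i = 1; i <= n; ++i) ch.push(static_cast<long>(i) * i);
        ch.push(std::nullopt);                    // close the stream
    });
    std::thread consumer([&] {
        while (auto v = ch.pop()) sum += *v;
    });
    producer.join();
    consumer.join();
    return sum;
}
```

In STXXL's setting the values flowing through such channels are large blocks of records rather than single integers, so the synchronization cost is amortized over block-sized units of work.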
Experimental study of high performance priority queues
Undergraduate Honors Thesis, 2007
Cited by 2 (1 self)
Abstract:
The priority queue is a very important and widely used data structure in computer science, with a variety of applications including Dijkstra’s Single Source Shortest Path algorithm on sparse graph types. This study presents experimental results for a variety of priority queues. The focus of the experiments is to measure the speed and performance of highly specialized priority queues in out-of-core and memory-intensive situations. The priority queues are run in-core on small input sizes as well as out-of-core using large input sizes and restricted memory. The experiments compare a variety of well-known priority queue implementations, such as the Binary Heap, with highly specialized implementations, such as the 4-ary Aligned Heap, Chowdhury and Ramachandran’s Auxiliary Buffer Heap, and the Fast Binary Heap. The experiments include cache-aware as well as cache-oblivious priority queues. The results indicate that the high-performance priority queues easily outperform traditional implementations. Overall, the Auxiliary Buffer Heap has the best performance among the priority queues considered in most in-core and out-of-core situations.
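The Dijkstra workload mentioned above is the standard driver for such benchmarks. A sketch of the binary-heap baseline (std::priority_queue with lazy deletion of stale entries), the kind of traditional implementation the specialized heaps are measured against:

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Dijkstra's SSSP driven by a binary heap with lazy deletion: outdated
// queue entries are skipped on pop instead of being decreased in place.
// adj[u] holds (neighbor, weight) pairs; weights must be non-negative.
std::vector<long> dijkstra(
        const std::vector<std::vector<std::pair<int, int>>>& adj, int src) {
    const long INF = std::numeric_limits<long>::max();
    std::vector<long> dist(adj.size(), INF);
    using Item = std::pair<long, int>;                 // (distance, vertex)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    dist[src] = 0;
    pq.push({0, src});
    while (!pq.empty()) {
        auto [d, u] = pq.top();
        pq.pop();
        if (d != dist[u]) continue;                    // stale entry, skip
        for (auto [v, w] : adj[u])
            if (d + w < dist[v]) {                     // relax edge (u, v)
                dist[v] = d + w;
                pq.push({dist[v], v});
            }
    }
    return dist;
}
```

The heap operations dominate the running time on sparse graphs, which is why swapping in a cache-friendlier heap changes end-to-end performance so visibly.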
Design and Implementation of a Practical I/O-Efficient Shortest Paths Algorithm
Cited by 2 (0 self)
Abstract:
We report on initial experimental results for a practical I/O-efficient Single-Source Shortest-Paths (SSSP) algorithm on general undirected sparse graphs where the ratio between the largest and the smallest edge weight is reasonably bounded (for example, integer weights in {1,...,2^32}) and the realistic assumption holds that main memory is big enough to keep one bit per vertex. While our implementation only guarantees average-case efficiency, i.e., assuming randomly chosen edge weights, it turns out that its performance on real-world instances with non-random edge weights is actually even better than on the respective inputs with random weights. Furthermore, compared to the currently best implementation of external-memory BFS [6], which in a sense constitutes a lower bound for SSSP, the running time of our approach always stayed within a factor of five; for the most difficult graph classes the difference was even less than a factor of two. We are not aware of any previous I/O-efficient implementation of the classic general SSSP in a (semi-)external setting: in two recent projects [10, 23], Kumar/Schwabe-like SSSP approaches on graphs of at most 6 million vertices have been tested, forcing the authors to artificially restrict the main memory size, M, to a rather unrealistic 4 to 16 MBytes in order not to leave the semi-external setting or produce huge running times for larger graphs: for random graphs of 2^20 vertices, the best previous approach needed over six hours. In contrast, for a similar ratio of input size to M, but on a 128 times larger and even sparser random graph, our approach was less than seven times slower, a relative gain of nearly 20. On a real-world 24 million node street graph, our implementation was over 40 times faster. Even larger gains of over 500 can be estimated for ran ...