Results 1 – 8 of 8
Large-Scale Pattern Search Using Reduced-Space On-Disk Suffix Arrays
Abstract

Cited by 4 (0 self)
Abstract—The suffix array is an efficient data structure for in-memory pattern search. Suffix arrays can also be used for external-memory pattern search, via two-level structures that use an internal index to identify the correct block of suffix pointers. In this paper we describe a new two-level suffix-array-based index structure that requires significantly less disk space than previous approaches. Key to the saving is the use of disk blocks that are based on prefixes rather than the more usual uniform-sampling approach, allowing reductions between blocks and sub-parts of other blocks. We also describe a new in-memory structure – the condensed BWT – and show that it allows common patterns to be resolved without access to the text. Experiments using 64 GB of English web text on a computer with 4 GB of main memory demonstrate the speed and versatility of the new approach. For this data the index is around one-third the size of previous two-level mechanisms; and the memory footprint of as little as 1% of the text size means that queries can be processed more quickly than is possible with a compact FM-index. Index Terms—String search, pattern matching, suffix array, Burrows-Wheeler transform, succinct data structure, disk-based algorithm, experimental evaluation.
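To make the in-memory baseline that this paper builds on concrete, the sketch below constructs a suffix array and answers locate queries by binary search. It is a toy reference only — the paper's contribution is the two-level on-disk structure, which this does not attempt to reproduce — and the function names are illustrative.

```python
def suffix_array(text):
    """Toy suffix array: sort suffix start positions lexicographically.
    O(n^2 log n) due to slicing; real systems use linear-time construction."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_locate(text, sa, pat):
    """Binary-search for the first suffix >= pat, then collect every
    consecutive suffix that starts with pat; returns match positions."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < pat:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and text[sa[lo]:].startswith(pat):
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)
```

Because all occurrences of a pattern are prefixes of lexicographically adjacent suffixes, they form one contiguous range in the array — the property that both in-memory and disk-block-based layouts exploit.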
Lempel-Ziv parsing in external memory
, 2013
Abstract

Cited by 2 (2 self)
Abstract. For decades, computing the LZ factorization (or LZ77 parsing) of a string has been a requisite and computationally intensive step in many diverse applications, including text indexing and data compression. Many algorithms for LZ77 parsing have been discovered over the years; however, despite the increasing need to apply LZ77 to massive data sets, no algorithm to date scales to inputs that exceed the size of internal memory. In this paper we describe the first algorithm for computing the LZ77 parsing in external memory. Our algorithm is fast in practice and will allow the next generation of text indexes to be realised for massive strings and string collections.
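For readers unfamiliar with the factorization itself, here is a deliberately naive in-memory reference version — quadratic time, and nothing like the external-memory algorithm the paper describes. The (position, length) / literal output format is one common convention, chosen here for illustration.

```python
def lz77_parse(s):
    """Naive LZ77 factorization: each factor is either a single literal
    character or (pos, length) of the longest match starting at an
    earlier position (matches may overlap the current position)."""
    i, factors = 0, []
    while i < len(s):
        best_len, best_pos = 0, -1
        # Scan all earlier start positions for the longest match;
        # real algorithms use suffix structures instead of this loop.
        for j in range(i):
            l = 0
            while i + l < len(s) and s[j + l] == s[i + l]:
                l += 1
            if l > best_len:
                best_len, best_pos = l, j
        if best_len == 0:
            factors.append(s[i])          # fresh literal
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors
```

The number of factors is a standard measure of repetitiveness, which is why LZ77 sits at the heart of both compressors and repetition-aware text indexes.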
Fast Parallel Computation of Longest Common Prefixes
Abstract
Abstract—Suffix arrays and the corresponding longest common prefix (LCP) array have wide applications in bioinformatics, information retrieval and data compression. In this work, we propose and theoretically analyze new parallel algorithms for computing the LCP array given the suffix array as input. Most of our algorithms have a work and depth (parallel time) complexity related to the LCP values of the input. We also present a slight variation of Kärkkäinen and Sanders' skew algorithm that requires linear work and polylogarithmic depth in the worst case. We present a comprehensive experimental study of our parallel algorithms along with existing parallel and sequential LCP algorithms. On a variety of real-world and artificial strings, we show that on a 40-core shared-memory machine our fastest algorithm is up to 2.3 times faster than the fastest existing parallel algorithm, and up to 21.8 times faster than the fastest sequential LCP algorithm.
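As context for the sequential baseline such parallel algorithms are measured against, the classic linear-time construction of the LCP array from a suffix array is Kasai et al.'s algorithm; a simplified sketch (names illustrative) follows.

```python
def lcp_kasai(text, sa):
    """Kasai-style linear-time LCP construction from a suffix array.
    lcp[k] = length of the longest common prefix of the suffixes at
    sa[k-1] and sa[k]; lcp[0] is 0 by convention."""
    n = len(text)
    rank = [0] * n                    # inverse permutation of sa
    for k, s in enumerate(sa):
        rank[s] = k
    lcp = [0] * n
    h = 0                             # current match length, decreases by <=1 per step
    for i in range(n):                # walk suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]       # lexicographic predecessor of suffix i
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1                # key amortization: h never resets fully
        else:
            h = 0
    return lcp
```

The amortized O(n) bound comes from the fact that h increases at most n times overall and decreases by at most one per iteration — precisely the data dependence that makes parallelizing LCP construction nontrivial.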
Indexing Arbitrary-Length k-Mers in Sequencing Reads
Abstract
We propose a lightweight data structure for indexing and querying collections of NGS read data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive with existing algorithms in space use, query times, or both. The main applications of our index include variant calling, error correction and analysis of reads from RNA-seq experiments.
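To make the count/locate interface concrete, here is a minimal hash-map sketch of k-mer counting and locating over a read collection. Note this is only an illustration of the queried interface — PgSA itself is built on a pseudogenome suffix array, not a hash table — and the function names are hypothetical.

```python
from collections import defaultdict

def kmer_index(reads, k):
    """Toy in-memory k-mer index: maps each k-mer to the (read_id, offset)
    pairs where it occurs across the read collection."""
    idx = defaultdict(list)
    for rid, read in enumerate(reads):
        for off in range(len(read) - k + 1):
            idx[read[off:off + k]].append((rid, off))
    return idx

def kmer_count(idx, kmer):
    """Number of occurrences of kmer (the 'count' query)."""
    return len(idx.get(kmer, []))

def kmer_locate(idx, kmer):
    """All (read_id, offset) occurrences of kmer (the 'locate' query)."""
    return sorted(idx.get(kmer, []))
```

A hash map fixes k at build time and pays memory per distinct k-mer, which is exactly why suffix-array-style indexes such as PgSA are attractive: one structure answers queries for arbitrary-length k-mers.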
SAIS-OPT: On the Characterization and Optimization of the SA-IS Algorithm for Suffix Array Construction
Abstract
Abstract—The suffix array and Burrows-Wheeler Transform are critical index structures in next-generation sequence analysis. The construction of such index structures for mammalian-sized genomes can take thousands of seconds (i.e. tens of minutes). Their construction is complicated by computational overheads coming from irregular or complex memory-access patterns. This paper rigorously characterizes the execution profile of the SA-IS algorithm in order to guide its optimization. The resulting optimized SA-IS, which we refer to as sais-opt, outperforms previous implementations of SA-IS as well as “best in practice” algorithms, when applied to large DNA strings. Keywords—suffix array; Burrows-Wheeler Transform; irregular memory access
Shared-memory parallelism can be simple, . . .
, 2015
Abstract
Parallelism is the key to achieving high performance in computing. However, writing efficient and scalable parallel programs is notoriously difficult, and often requires significant expertise. To address this challenge, it is crucial to provide programmers with high-level tools to enable them to develop solutions efficiently, and at the same time emphasize the theoretical and practical aspects of algorithm design to allow the solutions developed to run efficiently under all possible settings. This thesis addresses this challenge using a three-pronged approach consisting of the design of shared-memory programming techniques, frameworks, and algorithms for important problems in computing. The thesis provides evidence that with appropriate programming techniques, frameworks, and algorithms, shared-memory programs can be simple, fast, and scalable, both in theory and in practice. The results developed in this thesis serve to ease the transition into the multicore era. The first part of this thesis introduces tools and techniques for deterministic ...