DNA sequence compression using the Burrows-Wheeler Transform
Proc. IEEE Bioinformatics Conference, Stanford University, CA, 2002: 303
Cited by 20 (4 self)
We investigate offline dictionary-oriented approaches to DNA sequence compression, based on the Burrows-Wheeler Transform (BWT). The preponderance of short repeating patterns is an important phenomenon in biological sequences. Here, we propose offline methods to compress DNA sequences that exploit the different repetition structures inherent in such sequences. Repetition analysis is performed based on the relationship between the BWT and important pattern matching data structures, such as the suffix tree and suffix array. We discuss how the proposed approach can be incorporated in the BWT compression pipeline. Index terms: DNA sequence compression, repetition structures, Burrows-Wheeler Transform, BWT.
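The relationship between the BWT and the suffix array mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the paper's compression method: the function names, the naive sorting-based suffix array, and the `$` end marker are assumptions made for illustration only.

```python
def suffix_array(text: str) -> list[int]:
    """Naive suffix array: sort suffix start positions lexicographically.

    Quadratic-ish in the worst case; fine for a short illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform of `text` with a unique end marker.

    The sentinel (an assumption of this sketch) must not occur in the
    text and must sort before every other symbol.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    sa = suffix_array(s)
    # For each suffix in sorted order, emit the symbol that precedes it.
    # Index -1 wraps around to the sentinel for the suffix starting at 0.
    return "".join(s[i - 1] for i in sa)


if __name__ == "__main__":
    print(bwt("banana"))  # annb$aa
```

Repeated substrings place equal preceding symbols next to each other in the output, which is the property that BWT-based compressors exploit.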
The SBC-Tree: An Index for Run-Length Compressed Sequences
, 2008
Cited by 9 (2 self)
Run-Length Encoding (RLE) is a data compression technique that is used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure [7]. The SBC-tree supports pattern matching queries such as substring matching, prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of $O(N/B)$ pages, where $N$ is the total length of the compressed sequences, and $B$ is the disk page size. Substring matching, prefix matching, and range search execute in an optimal $O(\log_B N + |p| + T)$ I/O operations, where $|p|$ is the
Accelerating Boyer Moore Searches on Binary Texts Shmuel
Cited by 9 (3 self)
The Boyer and Moore (BM) pattern matching algorithm is considered one of the best, but its performance is reduced on binary data. Yet, searching in binary texts has important applications, such as compressed matching. The paper shows how, by means of some precomputed tables, one may implement the BM algorithm also for the binary case without referring to bits, and processing only entire blocks such as bytes or words, thereby significantly reducing the number of comparisons. Empirical comparisons show that the new variant performs better than regular binary BM and even than BDM.
On compressibility of protein sequences
 in IEEE Data Compression Conference, Snowbird, UT
, 2006
Cited by 7 (2 self)
We consider the problem of compressibility of protein sequences. Based on an observed genome-scale long-range correlation in concatenated protein sequences from different organisms, we propose a method to exploit this unusual redundancy in compressing the protein sequences. The result is a significant reduction in the number of bits required for representing the sequences. We report results in bits per symbol (bps) of 2.27, 2.55, 3.11 and 3.44 for protein sequences from M. jannaschii, H. influenzae, S. cerevisiae, and H. sapiens respectively, the same protein sequences used by Nevill-Manning and Witten in the “Protein is incompressible” paper [23]. The observed long-range correlations could have significant implications beyond compression and complexity analysis of protein sequences.
LZW Based Compressed Pattern Matching
Cited by 3 (1 self)
Compressed pattern matching is an emerging research area that addresses the following problem: given a file in compressed format and a pattern, report the occurrence(s) of the pattern in the file with minimal (or no) decompression. In this paper, we report our work on compressed pattern matching in LZW-compressed files. The reported work is based on Amir’s well-known “almost-optimal” algorithm [1] but has been improved to find not only the first occurrence of the pattern but also all other occurrences. The improvements also include multi-pattern matching and a faster implementation for the so-called “simple pattern”, which is defined as “a pattern with no symbol appearing more than once”. Extensive experiments have been conducted to test the search performance and to compare it with not only the “decompress-then-search” approach but also the best available compressed pattern matching algorithms, particularly the BWT-based algorithms [2, 3]. The results showed that our method is competitive with the best algorithms.
Approximate pattern match using the Burrows-Wheeler transform
 Proceedings of Data Compression Conference
, 2003
Cited by 2 (1 self)
The compressed pattern matching problem is to locate the occurrence(s) of a pattern $P$ in a text string $T$ using a compressed representation of $T$, with minimal (or no) decompression. In this paper, we consider approximate pattern matching directly on BWT compressed text. The BWT provides a lexicographic ordering of the input text as part of its inverse transformation process. Based on this observation, pattern matching is performed by text pre-filtering, using a fast q-gram intersection of segments from the pattern $P$ and the text $T$. Algorithms are proposed that solve the k-mismatch problem in $O\!\left(\min\left\{m(m-k)|\Sigma|^{k}\log\frac{u}{|\Sigma|},\; mu\log\frac{u}{|\Sigma|}\right\}\right)$ time in the worst case, and the k-approximate matching problem in $O\!\left(|\Sigma|\log|\Sigma| + \frac{m^{2}}{k}\log\frac{u}{|\Sigma|} + \alpha k\right)$ time on average ($\alpha \le u$), where $u = |T|$ is the size of the text, $m = |P|$ is the size of the pattern, and $\Sigma$ is the symbol alphabet. Each algorithm requires $O(u)$ auxiliary arrays, which are constructed in $O(u)$ time and space.
Compressed-Domain Pattern Matching With the Burrows-Wheeler Transform
, 2001
Cited by 2 (0 self)
This report investigates two approaches for online pattern matching in files compressed with the Burrows-Wheeler transform (Burrows & Wheeler 1994). The first is based on the Boyer-Moore pattern matching algorithm (Boyer & Moore 1977), and the second is based on binary search. The new methods use the special structure of the Burrows-Wheeler transform to achieve efficient, robust pattern matching algorithms that can be used on files that have been only partly decompressed. Experimental results show that both new methods perform considerably faster than a decompress-and-search approach for most applications, with binary search being faster than Boyer-Moore at the expense of increased memory usage. The binary search in particular is strongly related to efficient indexing strategies such as binary trees, and suggests a number of new applications of the Burrows-Wheeler transform in data storage and retrieval.
A Comparison of BWT Approaches to String Pattern Matching
Cited by 1 (1 self)
Recently a number of algorithms have been developed to search files compressed with the Burrows-Wheeler Transform (BWT) without the need for full decompression first. This allows the storage requirement of data to be reduced through the exceptionally good compression offered by BWT, while allowing fast access to the information for searching by taking advantage of the sorted nature of BWT files. We provide a detailed description of five of these algorithms: BWT-based Boyer-Moore (Bell et al. 2002), Binary Search (Bell et al. 2002), Suffix Arrays (Sadakane & Imai 1999), q-grams (Adjeroh et al. 2002) and the FM-index (Ferragina & Manzini 2001), and also present results from a set of extensive experiments that were performed to evaluate and compare the algorithms. Furthermore, we introduce a technique to improve the search times of Binary Search, Suffix Arrays and q-grams by 22% on average, as well as reduce the memory requirement of the latter two by 40% and 31%, respectively. Our results indicate that, while the compressed files of the FM-index are larger than those of the other approaches, it is able to perform searches with considerably less memory. Additionally, when only counting the occurrences of a pattern, or when locating the positions of a small number of matches, it is the fastest algorithm. For larger searches, Binary Search provides the fastest results.
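Counting occurrences without locating them, as this abstract describes for the FM-index, is usually done by backward search over the BWT string. The Python sketch below shows only that counting step, with uncompressed occurrence tables. The function name is hypothetical, the input is assumed to be a BWT string ending marker `$` included (for example `annb$aa` for `banana`), and none of the space-saving machinery of a real FM-index is reproduced.

```python
from collections import Counter


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count occurrences of `pattern` in the original text via backward search.

    `bwt_text` is the BWT of the text with a unique smallest sentinel
    (an assumption of this sketch).
    """
    counts = Counter(bwt_text)

    # first[c]: number of symbols in the text that sort strictly before c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    # occ[c][i]: number of occurrences of c in bwt_text[:i].
    occ = {ch: [0] * (len(bwt_text) + 1) for ch in counts}
    for i, ch in enumerate(bwt_text):
        for c, table in occ.items():
            table[i + 1] = table[i] + (1 if c == ch else 0)

    # Maintain the half-open range [lo, hi) of sorted suffixes that start
    # with the pattern suffix processed so far, extending it leftwards.
    lo, hi = 0, len(bwt_text)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    print(count_occurrences("annb$aa", "ana"))  # 2
```

The loop runs once per pattern symbol regardless of text length, which is consistent with counting being the case where the abstract reports the FM-index as fastest.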
Direct Suffix Sorting and its Applications
, 2008
The suffix sorting problem is to construct the suffix array for an input sequence. Given a sequence $T[0 \ldots n-1]$ of size $n = |T|$, with symbols from a fixed alphabet $\Sigma$ ($|\Sigma| \le n$), the suffix array provides a compact representation of all the suffixes of $T$ in lexicographic order. Traditionally, the suffix array is often constructed by first building the suffix tree for $T$, and then performing an in-order traversal of the suffix tree. The direct suffix sorting problem is to construct the suffix array of $T$ directly without using the suffix tree data structure. We propose a direct suffix sorting algorithm which rearranges the biological sequences of interest and facilitates high throughput pattern query, retrieval and storage in $O(n)$ time. The improved algorithm requires only $7n$ bytes of storage, including the $n$ bytes for the original string, and the $4n$ bytes for the suffix array. The basis of our improved algorithm is an extension of Shannon-Fano-Elias codes used in information theory. This is the first time information-theoretic methods have been used as the basis for solving the suffix sorting problem. The direct suffix sorting algorithm is then applied to solve the multiple sequence alignment problem. The sequences to be aligned are concatenated and then passed to
Longest-Common-Prefix Computation in Burrows-Wheeler Transformed Text
, 2006
In this paper we consider the existing algorithm for computing the Longest-Common-Prefix (LCP) array from a text string and its suffix array, and adapt it to work on Burrows-Wheeler Transform (BWT) text. We do this through a combination of preprocessing steps and improvements to the existing algorithm. Three LCP array computation algorithms are proposed, namely LCPBA, LCPBB and LCPBC, which need only the BWT text as input. LCPBA is a simple adaptation of the existing algorithm: it preprocesses the BWT text of length $n$ to generate the suffix array and the original text, and feeds the output of this step to the existing algorithm, which computes the LCP array in $O(n)$ time. LCPBB reduces the preprocessing time of LCPBA while requiring an additional $4n$ space, as it substitutes two auxiliary arrays for the original text in the LCP array computation. LCPBC goes a step further than LCPBA and LCPBB by exploiting the structure of the BWT text. It effectively reduces preprocessing time and memory requirements at the price of $\Theta(\delta n)$ time when calculating the LCP array, where $\delta$ is the average LCP. Experimental results showed that, in terms of speed, LCPBA performs practically as well as the original algorithm, and better than LCPBB and LCPBC. However, LCPBC consumes less space than the other algorithms.
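For orientation, a linear-time LCP computation from a text and its suffix array can be sketched in Python as follows. The sketch follows the widely used approach attributed to Kasai et al.; treating that as the "existing algorithm" of this abstract is an assumption, and the code does not reproduce LCPBA, LCPBB or LCPBC.

```python
def lcp_array(text: str, sa: list[int]) -> list[int]:
    """LCP array from a text and its suffix array in O(n) time.

    lcp[i] is the length of the longest common prefix of the suffixes
    at sa[i - 1] and sa[i]; lcp[0] is defined as 0.
    """
    n = len(text)

    # rank[pos] is the index of the suffix starting at pos within sa.
    rank = [0] * n
    for i, pos in enumerate(sa):
        rank[pos] = i

    lcp = [0] * n
    h = 0  # current match length, carried over between iterations
    for pos in range(n):
        if rank[pos] == 0:
            h = 0
            continue
        prev = sa[rank[pos] - 1]  # suffix just before this one in sorted order
        while pos + h < n and prev + h < n and text[pos + h] == text[prev + h]:
            h += 1
        lcp[rank[pos]] = h
        # Dropping the first symbol shortens the common prefix by at most one,
        # so the next iteration can resume from h - 1 instead of 0.
        if h > 0:
            h -= 1
    return lcp


if __name__ == "__main__":
    print(lcp_array("banana", [5, 3, 1, 0, 4, 2]))  # [0, 1, 3, 0, 0, 2]
```

Because `h` decreases by at most one per position and never exceeds `n`, the inner loop performs a linear number of comparisons in total.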