Compressed suffix arrays and suffix trees with applications to text indexing and string matching
, 2005
Cited by 192 (17 self)
The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg_|Σ| n), which is significant when Σ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in O(m lg |Σ|) time or in O(m + lg n) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg_|Σ| n + lg^ε_|Σ| n) search time in the worst case, for any constant
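For orientation, the following is a minimal sketch of pattern search over a plain (uncompressed) suffix array, the kind of baseline index the abstract compares against. It is not the compressed index the paper proposes. The function names are illustrative, the construction is the naive sort-all-suffixes approach, and the search runs in O(m lg n + occ) time rather than the O(m + lg n) achievable with additional longest-common-prefix information.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted start positions of pattern in text using binary search."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in [first, lo) begin with the pattern.
    return sorted(suffix_array[first:lo])


text = "banana"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ana"))
```

The suffix array here stores one integer per text position, which is the Ω(n lg n)-bit overhead the abstract describes.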
Nearest Common Ancestors: A survey and a new distributed algorithm
, 2002
Cited by 80 (11 self)
Several papers describe linear time algorithms to preprocess a tree, such that one can answer subsequent nearest common ancestor queries in constant time. Here, we survey these algorithms and related results. A common idea used by all the algorithms for the problem is that a solution for complete balanced binary trees is straightforward. Furthermore, for complete balanced binary trees we can easily solve the problem in a distributed way by labeling the nodes of the tree such that from the labels of two nodes alone one can compute the label of their nearest common ancestor. Whether it is possible to distribute the data structure into short labels associated with the nodes is important for several applications such as routing. Therefore, related labeling problems have received a lot of attention recently.
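To illustrate why complete balanced binary trees are the easy case, here is a small sketch that computes the nearest common ancestor from two labels alone. It assumes heap-order labels (root 1, children of node i are 2i and 2i+1). This labeling is chosen for illustration and is not necessarily the one used in the survey.

```python
def nearest_common_ancestor(u, v):
    """NCA of two nodes of a complete binary tree given heap-order labels.

    Assumption: the root is labeled 1 and node i has children 2*i and 2*i + 1,
    so a label's binary digits spell the root-to-node path.
    """
    if u < 1 or v < 1:
        raise ValueError("labels must be positive integers")

    # Bring both labels to the same depth by dropping low-order bits.
    depth_u, depth_v = u.bit_length(), v.bit_length()
    if depth_u > depth_v:
        u >>= depth_u - depth_v
    else:
        v >>= depth_v - depth_u

    # The NCA is the longest common prefix of the two bit strings:
    # discard every bit from the highest differing position downward.
    differing = (u ^ v).bit_length()
    return u >> differing


print(nearest_common_ancestor(4, 5))  # siblings under node 2
print(nearest_common_ancestor(4, 6))  # only the root is shared
```

Each step is a constant number of word operations on the labels, and no access to the tree itself is needed.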
Engineering a fast online persistent suffix tree construction
 Proceedings of the 20th International Conference on Data Engineering
, 2004
Cited by 16 (3 self)
Online persistent suffix tree construction has been considered impractical due to its excessive I/O costs. However, these prior studies have not taken into account the effects of the buffer management policy and the internal node structure of the suffix tree on the I/O behavior of construction and subsequent retrievals over the tree. In this paper, we study these two issues in detail in the context of large genomic DNA and protein sequences. In particular, we make the following contributions: (i) a novel, low-overhead buffering policy called TOPQ, which improves the on-disk behavior of suffix tree construction and subsequent retrievals, and (ii) empirical evidence that the space-efficient linked-list representation of suffix tree nodes provides significantly inferior performance when compared to the array representation. These results demonstrate that a careful choice of implementation strategies can make online persistent suffix tree construction considerably more scalable, in terms of the length of sequences indexed with a fixed memory budget, than currently perceived.
Cell Probe Lower Bounds For Succinct Data Structures
Cited by 8 (0 self)
In this paper, we consider several static data structure problems in the deterministic cell probe model. We develop a new technique for proving lower bounds for succinct data structures, where the redundancy in the storage can be small compared to the information-theoretic minimum. In fact, we succeed in matching (up to constant factors) the lower-order terms of the existing data structures with the lower-order terms provided by our lower bound. Using this technique, we obtain (i) the first lower bound for the problem of searching and retrieval of a substring in text; (ii) a cell probe lower bound for the problem of representing a permutation π with queries π(i) and π^{-1}(i) that matches the lower-order term of the existing data structures; and (iii) a lower bound for representing binary matrices that also matches upper bounds for some set of parameters. The nature of all these problems is that we are to implement two operations that are in a reciprocal relation to each other (search and retrieval, computing forward and inverse element, operations on rows and columns of a matrix). As far as we know, this paper is the first to provide an insight into such problems.
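The reciprocal pair π(i) and π^{-1}(i) can be made concrete with a small sketch of the two extremes of the space/time tradeoff. This is an illustration only, not a construction from the paper; the function names are hypothetical.

```python
def inverse_table(pi):
    """Extreme 1: store a second array, so the inverse is a single lookup.

    Costs n extra entries on top of the permutation itself.
    """
    inv = [0] * len(pi)
    for i, image in enumerate(pi):
        inv[image] = i
    return inv


def inverse_by_cycle_walk(pi, i):
    """Extreme 2: store nothing extra and walk the cycle containing i.

    Following pi from i eventually reaches the element j with pi[j] == i;
    the time is proportional to the length of that cycle.
    """
    j = i
    while pi[j] != i:
        j = pi[j]
    return j


pi = [2, 0, 3, 1]  # pi(0)=2, pi(1)=0, pi(2)=3, pi(3)=1
print(inverse_table(pi))
print([inverse_by_cycle_walk(pi, i) for i in range(len(pi))])
```

Structures between these extremes trade a small amount of redundancy for faster inverse queries, which is the regime the lower bound addresses.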
Efficient string matching algorithms for combinatorial universal denoising
 In Proc. of IEEE Data Compression Conference (DCC), Snowbird
, 2005
Cited by 2 (0 self)
Inspired by the combinatorial denoising method DUDE [13], we present efficient algorithms for implementing this idea for arbitrary contexts or for using it within subsequences. We also propose effective, efficient denoising error estimators so we can find the best denoising of an input sequence over different context lengths. Our methods are simple, drawing from string matching methods and radix sorting. We also present experimental results of our proposed algorithms.
Upper and Lower Bounds for Text Indexing Data Structures
Cited by 2 (1 self)
© Alexander Golynski 2007. The main goal of this thesis is to investigate the complexity of a variety of problems related to text indexing and text searching. We present new data structures that can be used as building blocks for full-text indices that occupy minute space (FM-indexes) and wavelet trees. These data structures can also be used to represent labeled trees and posting lists. Labeled trees are applied in XML documents, and posting lists in search engines. The main emphasis of this thesis is on lower bounds for time-space tradeoffs for the following problems: the rank/select problem, the problem of representing a string of balanced parentheses, the text retrieval problem, the problem of computing a permutation and its inverse, and the problem of representing a binary relation. These results are divided into two groups: lower bounds in the cell probe model and lower bounds in the indexing model.
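For readers unfamiliar with the rank/select problem named above, the following is a minimal sketch of the two queries on a bit vector. It assumes a simple design with one cumulative count per block and select answered by binary search over rank; this is not a structure from the thesis, and the class name and `block_size` parameter are illustrative.

```python
class RankSelectBitvector:
    """Bit vector with rank1 and select1 queries (illustrative, not succinct)."""

    def __init__(self, bits, block_size=4):
        self.bits = list(bits)
        self.block_size = block_size
        # block_ranks[b] = number of 1s before block b (the extra "redundancy").
        self.block_ranks = [0]
        for start in range(0, len(self.bits), block_size):
            ones = sum(self.bits[start:start + block_size])
            self.block_ranks.append(self.block_ranks[-1] + ones)

    def rank1(self, i):
        """Number of 1s in bits[0:i]."""
        if not 0 <= i <= len(self.bits):
            raise IndexError("rank position out of range")
        block = i // self.block_size
        inside = sum(self.bits[block * self.block_size:i])
        return self.block_ranks[block] + inside

    def select1(self, k):
        """Position of the k-th 1 (k counted from 1)."""
        if not 1 <= k <= self.rank1(len(self.bits)):
            raise ValueError("k out of range")
        lo, hi = 0, len(self.bits) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid + 1) >= k:
                hi = mid
            else:
                lo = mid + 1
        return lo


bv = RankSelectBitvector([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
print(bv.rank1(7), bv.select1(4))
```

Larger blocks store fewer counters but make each rank query scan more bits, a small-scale version of the time-space tradeoff the thesis studies.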
Direct Suffix Sorting and its Applications
, 2008
The suffix sorting problem is to construct the suffix array for an input sequence. Given a sequence T[0...n − 1] of size n = |T|, with symbols from a fixed alphabet Σ (|Σ| ≤ n), the suffix array provides a compact representation of all the suffixes of T in lexicographic order. Traditionally, the suffix array is often constructed by first building the suffix tree for T, and then performing an inorder traversal of the suffix tree. The direct suffix sorting problem is to construct the suffix array of T directly without using the suffix tree data structure. We propose a direct suffix sorting algorithm which rearranges the biological sequences of interest and facilitates high-throughput pattern query, retrieval and storage in O(n) time. The improved algorithm requires only 7n bytes of storage, including the n bytes for the original string, and the 4n bytes for the suffix array. The basis of our improved algorithm is an extension of Shannon-Fano-Elias codes used in information theory. This is the first time information-theoretic methods have been used as the basis for solving the suffix sorting problem. The direct suffix sorting algorithm is then applied to solve the multiple sequence alignment problem. The sequences to be aligned are concatenated and then passed to
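As a point of reference for what "direct" suffix sorting means, here is a sketch that builds the suffix array without any suffix tree, using the well-known prefix-doubling technique. This is a swapped-in textbook method running in O(n lg² n) time with a comparison sort; it is not the O(n) Shannon-Fano-Elias-based algorithm the abstract describes.

```python
def suffix_array_prefix_doubling(text):
    """Build the suffix array of text directly by prefix doubling."""
    n = len(text)
    if n == 0:
        return []

    # rank[i] orders suffix i by its first k characters; start with k = 1.
    rank = [ord(c) for c in text]
    order = list(range(n))
    k = 1
    while True:
        def key(i):
            # Pair of ranks describes the first 2k characters of suffix i;
            # -1 marks "past the end", which sorts before any real rank.
            return (rank[i], rank[i + k] if i + k < n else -1)

        order.sort(key=key)

        # Re-rank: suffixes with equal keys share a rank.
        new_rank = [0] * n
        for j in range(1, n):
            bump = 1 if key(order[j]) != key(order[j - 1]) else 0
            new_rank[order[j]] = new_rank[order[j - 1]] + bump
        rank = new_rank

        if rank[order[-1]] == n - 1:  # all ranks distinct: fully sorted
            return order
        k *= 2


print(suffix_array_prefix_doubling("banana"))
```

Each round doubles the number of characters the ranks account for, so at most about lg n rounds are needed.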
Stronger lower bounds for text searching and polynomial evaluation
In this paper, we give two main technical results: (i) we show a stronger lower bound for the substring search problem via compression, extending results of Demaine and López-Ortiz (SODA ’01); (ii) we improve the results of Gal and Miltersen (ICALP ’03) by showing a bound on the redundancy needed by the polynomial evaluation problem that is linear in terms of the information-theoretic minimum storage required by a polynomial.
Nearest Common Ancestors: A Survey and a New Algorithm for a Distributed Environment
Several papers describe linear time algorithms to preprocess a tree, in order to answer subsequent nearest common ancestor queries in constant time. Here, we survey these algorithms and related results. Whereas previous algorithms produce a linear space data structure, we address in this paper the problem of distributing the data structure into short labels associated with the nodes. Localized data structures have received a lot of attention recently as they play an important role for distributed applications such as routing. We conclude our survey with a new simple algorithm that labels in O(n) time all the nodes of an n-node rooted tree such that from the labels of any two nodes alone one can compute in constant time the label of their nearest common ancestor. The labels assigned by our algorithm are of size O(log n) bits.
Genome-scale Disk-based Suffix Tree Indexing
With the exponential growth of biological sequence databases, it has become critical to develop effective techniques for storing, querying, and analyzing these massive data. Suffix trees are widely used to solve many sequence-based problems, and they can be built in linear time and space, provided the resulting tree fits in main memory. To index larger sequences, several external suffix tree algorithms have been proposed in recent years. However, they suffer from several problems such as susceptibility to data skew, non-scalability to genome-scale sequences, and non-existence of suffix links, which are crucial in various suffix-tree-based algorithms. In this paper, we target DNA sequences and propose a novel disk-based suffix tree algorithm called Trellis, which effectively scales up to genome-scale sequences. Specifically, it can index the entire human genome using 2GB of memory in about 4 hours, and can recover all its suffix links within 2 hours. Trellis was compared to various state-of-the-art persistent disk-based suffix tree construction algorithms, and was shown to outperform the best previous methods, both in terms of indexing time and querying time.