Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract)
 in Proceedings of the 32nd Annual ACM Symposium on the Theory of Computing
, 2000
Cited by 189 (17 self)
Abstract. The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg_{|Σ|} n), which is significant when Σ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in O(m lg |Σ|) time or in O(m + lg n) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg_{|Σ|} n + lg^ε_{|Σ|} n) search time in the worst case, for any constant
Nearest Common Ancestors: A survey and a new distributed algorithm
, 2002
Cited by 76 (12 self)
Several papers describe linear time algorithms to preprocess a tree, such that one can answer subsequent nearest common ancestor queries in constant time. Here, we survey these algorithms and related results. A common idea used by all the algorithms for the problem is that a solution for complete balanced binary trees is straightforward. Furthermore, for complete balanced binary trees we can easily solve the problem in a distributed way by labeling the nodes of the tree such that from the labels of two nodes alone one can compute the label of their nearest common ancestor. Whether it is possible to distribute the data structure into short labels associated with the nodes is important for several applications such as routing. Therefore, related labeling problems have received a lot of attention recently.
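The remark that complete balanced binary trees admit a label-only solution can be illustrated with a minimal sketch. It assumes heap-style labels (root = 1, children of v are 2v and 2v + 1); this labeling and the function name `nca` are assumptions made for illustration and are not necessarily the scheme used in the survey.

```python
def nca(u: int, v: int) -> int:
    """Nearest common ancestor of two nodes of a complete binary tree.

    Assumes heap-style labels: the root is 1 and node x has children
    2x and 2x + 1, so the binary label spells the root-to-node path.
    """
    if u < 1 or v < 1:
        raise ValueError("labels start at 1")
    # A node's depth is bit_length - 1; lift the deeper node to the
    # depth of the shallower one by dropping its trailing path bits.
    du, dv = u.bit_length(), v.bit_length()
    if du > dv:
        u >>= du - dv
    else:
        v >>= dv - du
    # Both labels now have equal length; their longest common binary
    # prefix is the label of the nearest common ancestor.
    diff = u ^ v
    return u >> diff.bit_length()
```

The answer is computed from the two labels alone, with no access to the tree, which is the property the abstract highlights for distributed settings such as routing.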
Engineering a Fast Online Persistent Suffix Tree Construction
 In 20th Int’l Conference on Data Engineering
, 2004
Cited by 15 (3 self)
Online persistent suffix tree construction has been considered impractical due to its excessive I/O costs. However, these prior studies have not taken into account the effects of the buffer management policy and the internal node structure of the suffix tree on I/O behavior of construction and subsequent retrievals over the tree. In this paper, we study these two issues in detail in the context of large genomic DNA and protein sequences. In particular, we make the following contributions: (i) a novel, low-overhead buffering policy called TOPQ which improves the on-disk behavior of suffix tree construction and subsequent retrievals, and (ii) empirical evidence that the space-efficient linked-list representation of suffix tree nodes provides significantly inferior performance when compared to the array representation. These results demonstrate that a careful choice of implementation strategies can make online persistent suffix tree construction considerably more scalable, in terms of length of sequences indexed with a fixed memory budget, than currently perceived.
Cell Probe Lower Bounds For Succinct Data Structures
Cited by 8 (0 self)
In this paper, we consider several static data structure problems in the deterministic cell probe model. We develop a new technique for proving lower bounds for succinct data structures, where the redundancy in the storage can be small compared to the information-theoretic minimum. In fact, we succeed in matching (up to constant factors) the lower order terms of the existing data structures with the lower order terms provided by our lower bound. Using this technique, we obtain (i) the first lower bound for the problem of searching and retrieval of a substring in text; (ii) a cell probe lower bound for the problem of representing a permutation π with queries π(i) and π^{-1}(i) that matches the lower order term of the existing data structures; and (iii) a lower bound for representing binary matrices that also matches upper bounds for some set of parameters. The nature of all these problems is that we are to implement two operations that are in a reciprocal relation to each other (search and retrieval, computing forward and inverse element, operations on rows and columns of a matrix). As far as we know, this paper is the first to provide an insight into such problems.
Efficient string matching algorithms for combinatorial universal denoising
 In Proc. of IEEE Data Compression Conference (DCC), Snowbird
, 2005
Cited by 2 (0 self)
Inspired by the combinatorial denoising method DUDE [13], we present efficient algorithms for implementing this idea for arbitrary contexts or for using it within subsequences. We also propose effective, efficient denoising error estimators so we can find the best denoising of an input sequence over different context lengths. Our methods are simple, drawing from string matching methods and radix sorting. We also present experimental results of our proposed algorithms.
Upper and Lower Bounds for Text Indexing Data Structures
Cited by 2 (1 self)
© Alexander Golynski 2007. The main goal of this thesis is to investigate the complexity of a variety of problems related to text indexing and text searching. We present new data structures that can be used as building blocks for full-text indices that occupy minute space (FM-indexes) and wavelet trees. These data structures can also be used to represent labeled trees and posting lists. Labeled trees are applied in XML documents, and posting lists in search engines. The main emphasis of this thesis is on lower bounds for time-space tradeoffs for the following problems: the rank/select problem, the problem of representing a string of balanced parentheses, the text retrieval problem, the problem of computing a permutation and its inverse, and the problem of representing a binary relation. These results are divided into two groups: lower bounds in the cell probe model and lower bounds in the indexing model.
BODHI: A Database Engine for . . .
, 2006
Biodiversity research generates and uses a variety of data spanning across diverse domains, including taxonomy, geospatial and genetic domains, which vary greatly in their structural features and complexities, query processing costs and storage volumes. In this thesis, we present BODHI, a database engine that seamlessly integrates these diverse types of data, spanning the range from molecular to organism-level information. BODHI is a native object-oriented database system built around a publicly available microkernel and extensible query processor, and offers a functionally comprehensive query interface. The server is partitioned into three service modules: object, spatial and sequence, each handling the associated data domain and providing appropriate storage, modeling interfaces, and evaluation algorithms for predicates over the corresponding data types. To accelerate query response times, a variety of specialized access structures are included for each domain. Our experiments with complex cross-domain queries over a representative
Direct Suffix Sorting and its Applications
, 2008
The suffix sorting problem is to construct the suffix array for an input sequence. Given a sequence T[0...n − 1] of size n = |T|, with symbols from a fixed alphabet Σ (|Σ| ≤ n), the suffix array provides a compact representation of all the suffixes of T in lexicographic order. Traditionally, the suffix array is often constructed by first building the suffix tree for T, and then performing an in-order traversal of the suffix tree. The direct suffix sorting problem is to construct the suffix array of T directly without using the suffix tree data structure. We propose a direct suffix sorting algorithm which rearranges the biological sequences of interest and facilitates high throughput pattern query, retrieval and storage in O(n) time. The improved algorithm requires only 7n bytes of storage, including the n bytes for the original string, and the 4n bytes for the suffix array. The basis of our improved algorithm is an extension of Shannon-Fano-Elias codes used in information theory. This is the first time information-theoretic methods have been used as the basis for solving the suffix sorting problem. The direct suffix sorting algorithm is then applied to solve the multiple sequence alignment problem. The sequences to be aligned are concatenated and then passed to
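The suffix array definition used in this abstract can be made concrete with a small sketch. It builds the array by naive comparison sorting, which is not the O(n) algorithm the abstract proposes, and then answers a pattern query by binary search. The function names `build_suffix_array` and `find_occurrences` are hypothetical and chosen for illustration only.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction: sorts suffixes by direct comparison, so it is
    far slower than linear-time suffix sorting on long inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of pattern in text, found via the suffix array."""
    m = len(pattern)

    def prefix(k: int) -> str:
        # First m symbols of the k-th smallest suffix.
        return text[sa[k]:sa[k] + m]

    # Lower bound: first suffix whose m-symbol prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-symbol prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Suffixes in sa[start:lo] are exactly those that begin with pattern.
    return sorted(sa[start:lo])
```

For example, `build_suffix_array("banana")` orders the suffixes as a, ana, anana, banana, na, nana, and a query for "ana" then locates the contiguous block of suffixes beginning with that pattern.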
Stronger lower bounds for text searching and polynomial evaluation
In this paper, we give two main technical results: (i) we show a stronger lower bound for the substring search problem via compression, extending results of Demaine and López-Ortiz (SODA ’01); (ii) we improve the results of Gal and Miltersen (ICALP ’03) by showing a bound on the redundancy needed by the polynomial evaluation problem that is linear in terms of the information-theoretic minimum storage required by a polynomial.
Impact of Buffering on Persistent Suffix Tree Construction
Suffix trees are indexes that are commonly used to solve many pattern search and discovery problems efficiently over relatively static text. They are considered a powerful data structure for various sequence processing tasks in the bioinformatics domain. A serious disadvantage of suffix trees is that they are usually much larger than the underlying data sequences. This makes it impractical to consider them as memory-resident structures when indexing long sequences. The obvious solution of storing the index overflow on disk is severely hampered by the random seeks induced by standard suffix tree construction algorithms. In this paper, using a variety of DNA sequences as our testbed, we empirically evaluate two practical issues, not considered before, that impact the persistent online construction of suffix trees. First, the impact of buffering – in terms of policies for managing the buffer space as well as the amount of buffer space – is considered. We evaluate the de facto buffer management policy, LRU, against a low-overhead static policy, called TOP, that we propose in this paper. Second, we evaluate the choice of representation of suffix tree structure. We consider the commonly used and space-economical approach of linked-list representation and contrast it with the less preferred array-based node representation. Through a detailed empirical evaluation, we establish that (i) LRU shows worse performance than TOP, (ii) a well-tuned TOP saves up to 75%