Simple linear work suffix array construction
, 2003
Cited by 119 (6 self)
Abstract. Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space–time tradeoff. For any v ∈ [1, √n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
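To make the object being constructed concrete, here is a minimal Python sketch that builds a suffix array by sorting the suffixes directly. It is not DC3 and not the paper's C++ implementation: the function name `naive_suffix_array` is hypothetical, and the approach can take O(n^2 log n) time in the worst case, so it only illustrates the output that a linear-time algorithm would produce.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Illustrative baseline only: every comparison may scan O(n) characters,
    so this is far from the linear-time bound discussed in the abstract.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


# The suffixes of "banana" in sorted order are:
# "a" (5), "ana" (3), "anana" (1), "banana" (0), "na" (4), "nana" (2)
print(naive_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```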
Fast construction of nets in low dimensional metrics, and their applications
- SIAM J. Comput
, 2005
Cited by 75 (7 self)
We present a near linear time algorithm for constructing hierarchical nets in finite metric spaces with constant doubling dimension. This data structure is then applied to obtain improved algorithms for the following problems: approximate nearest neighbor search, well-separated pair decomposition, spanner construction, compact representation scheme, doubling measure, and computation of the (approximate) Lipschitz constant of a function. In all cases, the running (preprocessing) time is near linear and the space being used is linear.
Succinct Representations of lcp Information and Improvements in the Compressed Suffix Arrays
, 2002
Cited by 46 (5 self)
We introduce two succinct data structures to solve various string problems. One is for storing the information of lcp, the longest common prefix, between suffixes in the suffix array, and the other is an improvement in the compressed suffix array which supports linear time counting queries for any pattern. The former occupies only 2n + o(n) bits for a text of length n for computing lcp between adjacent suffixes in lexicographic order in constant time, and 6n + o(n) bits between any two suffixes. No data structure in the literature attained linear size. The latter has size proportional to the text size and it is applicable to texts on any alphabet Σ such that |Σ| = log^O(1) n. These space-economical data structures are useful in processing huge amounts of text data.
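For readers unfamiliar with the lcp information mentioned above, the following Python sketch computes the plain, uncompressed array of longest-common-prefix lengths between lexicographically adjacent suffixes, using the well-known linear-time scan usually attributed to Kasai et al. It stores n full integers, so it is not the 2n + o(n)-bit structure of the paper; the names `suffix_array` and `lcp_array` are hypothetical, and the suffix array itself is built naively for brevity.

```python
def suffix_array(text: str) -> list[int]:
    # Naive construction, used here only to obtain an input for lcp_array.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text: str, sa: list[int]) -> list[int]:
    """lcp[r] = length of the longest common prefix of suffixes sa[r-1] and sa[r].

    lcp[0] is defined as 0. Runs in O(n) time because the running match
    length h drops by at most one between consecutive text positions.
    """
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r

    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0  # the smallest suffix has no predecessor
            continue
        j = sa[rank[i] - 1]  # suffix just before suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp


sa = suffix_array("banana")
print(lcp_array("banana", sa))  # [0, 1, 3, 0, 0, 2]
```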
Compressed suffix trees with full functionality
- Theory of Computing Systems
Cited by 45 (5 self)
We introduce new data structures for compressed suffix trees whose size is linear in the text size. The size is measured in bits; thus they occupy only O(n log |A|) bits for a text of length n on an alphabet A. This is a remarkable improvement on current suffix trees which require O(n log n) bits. Though some components of suffix trees have been compressed, there is no linear-size data structure for suffix trees with full functionality such as computing suffix links, string-depths and lowest common ancestors. The data structure proposed in this paper is the first one that has linear size and supports all operations efficiently. Any algorithm running on a suffix tree can also be executed on our compressed suffix trees with a slight slowdown of a factor of polylog(n).
Dynamic LCA queries on trees
- SIAM Journal on Computing
, 1999
Cited by 40 (0 self)
Abstract. We show how to maintain a data structure on trees which allows for the following operations, all in worst-case constant time: 1. insertion of leaves and internal nodes, 2. deletion of leaves, 3. deletion of internal nodes with only one child, 4. determining the least common ancestor of any two nodes. We also generalize the Dietz–Sleator “cup-filling” scheduling methodology, which may be of independent interest.
When indexing equals compression: Experiments with compressing suffix arrays and applications
, 2004
Cited by 40 (5 self)
We report on a new and improved version of high-order entropy-compressed suffix arrays, which has theoretical performance guarantees similar to those in our earlier work [16], yet represents an improvement in practice. Our experiments indicate that the resulting text index offers state-of-the-art compression. In particular, we require roughly 20% of the original text size, without requiring a separate instance of the text, and support fast and powerful searches. To our knowledge, this is the best known method in terms of space for fast searching.
Reconstructing reticulate evolution in species - theory and practice
- In Proc. of 8th Annual International Conference on Computational Molecular Biology
, 2004
Cited by 38 (7 self)
We present new methods for reconstructing reticulate evolution of species due to events such as horizontal transfer or hybrid speciation; both methods are based upon extensions of Wayne Maddison’s approach in his seminal 1997 paper. Our first method is a polynomial time algorithm for constructing phylogenetic networks from two gene trees contained inside the network. We allow the network to have an arbitrary number of reticulations, but we limit the reticulation in the network so that the cycles in the network are node-disjoint (“galled”). Our second method is a polynomial time algorithm for constructing networks with one reticulation, where we allow for errors in the estimated gene trees. Using simulations, we demonstrate improved performance of this method over both NeighborNet and Maddison’s method.
Succinct data structures for flexible text retrieval systems
- Journal of Discrete Algorithms
, 2007
Cited by 32 (1 self)
We propose succinct data structures for text retrieval systems supporting document listing queries and ranking queries based on the tf*idf (term frequency times inverse document frequency) scores of documents. Traditional data structures for these problems support queries only for some predetermined keywords. Recently Muthukrishnan proposed a data structure for document listing queries for arbitrary patterns at the cost of data structure size. For computing the tf*idf scores there have been no efficient data structures for arbitrary patterns. Our new data structures support these queries using small space. The space is only 2/ε times the size of compressed documents plus 10n bits for a document collection of length n, for any 0 < ε ≤ 1. This is much smaller than the previous O(n log n) bit data structures. Query time is O(m + q log^ε n) for listing and computing tf*idf scores for all q documents containing a given pattern of length m. Our data structures are flexible in the sense that they support queries for arbitrary patterns.
The Level Ancestor Problem Simplified
Cited by 31 (0 self)
We present a very simple algorithm for the Level Ancestor Problem. A Level Ancestor Query LA(v; d) requests the depth d ancestor of node v. The Level Ancestor Problem is thus: preprocess a given rooted tree T to answer level ancestor queries. While optimal solutions to this problem already exist, our new optimal solution is simple enough to be taught and implemented.
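The query LA(v, d) defined above can be illustrated with a short Python sketch based on jump pointers (binary lifting), a standard technique with O(n log n) preprocessing and O(log n) query time. This is not the optimal algorithm the abstract refers to; the function names `preprocess` and `level_ancestor` and the parent-array input format are assumptions made for the example, and the tree is assumed to be non-empty with exactly one root.

```python
from collections import deque


def preprocess(parent: list[int]):
    """parent[v] is the parent of node v; the root has parent -1.

    Returns (depth, up) where up[k][v] is the 2^k-th ancestor of v, or -1.
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    root = -1
    for v, p in enumerate(parent):
        if p < 0:
            root = v
        else:
            children[p].append(v)

    # Breadth-first traversal to assign depths (root has depth 0).
    depth = [0] * n
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for c in children[u]:
            depth[c] = depth[u] + 1
            queue.append(c)

    # One level per bit needed to express the maximum depth.
    levels = max(1, max(depth).bit_length())
    up = [parent[:]]
    for _ in range(1, levels):
        prev = up[-1]
        up.append([prev[prev[v]] if prev[v] >= 0 else -1 for v in range(n)])
    return depth, up


def level_ancestor(v: int, d: int, depth, up) -> int:
    """Return the ancestor of v at depth d (v itself if depth[v] == d)."""
    steps = depth[v] - d
    if d < 0 or steps < 0:
        raise ValueError("requested depth is not on the root-to-v path")
    k = 0
    while steps:
        if steps & 1:
            v = up[k][v]  # jump 2^k levels toward the root
        steps >>= 1
        k += 1
    return v


# Tree: 0 is the root; 1 and 5 are its children; 2 under 1; 3 and 4 under 2.
depth, up = preprocess([-1, 0, 1, 2, 2, 0])
print(level_ancestor(3, 1, depth, up))  # 1
```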
Lowest common ancestors in trees and directed acyclic graphs
- J. Algorithms
, 2005
Cited by 30 (0 self)
We study the problem of finding lowest common ancestors (LCA) in trees and directed acyclic graphs (DAGs). Specifically, we extend the LCA problem to DAGs and study the LCA variants that arise in this general setting. We begin with a clear exposition of Berkman and Vishkin’s simple optimal algorithm for LCA in trees. The ideas presented are not novel theoretical contributions, but they lay the foundation for our work on LCA problems in DAGs. We present an algorithm that finds all-pairs-representative LCA in DAGs in Õ(n^2.688) operations, provide a transitive-closure lower bound for the all-pairs-representative-LCA problem, and develop an LCA-existence algorithm that preprocesses the DAG in transitive-closure time. We also present a suboptimal but practical O(n^3) algorithm for all-pairs-representative LCA in DAGs that uses ideas from the optimal algorithms in trees and DAGs. Our results reveal a close relationship between the LCA, all-pairs-shortest-path, and transitive-closure problems. We conclude the paper with a short experimental study of LCA algorithms in trees and DAGs. Our experiments and source code demonstrate the elegance of the preprocessing-query algorithms for LCA in trees. We show that for most trees the suboptimal Θ(n log n)-preprocessing Θ(1)-query algorithm should be preferred, and demonstrate that our proposed O(n^3) algorithm for all-pairs-representative LCA in DAGs performs well in both low and high density DAGs.
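A Θ(n log n)-preprocessing, Θ(1)-query scheme for LCA in trees is commonly realized by reducing LCA to range-minimum queries over an Euler tour and answering those with a sparse table. The Python sketch below follows that well-known reduction; it is not the paper's source code, and the function name `build_lca` and the children-list input format are assumptions made for the example.

```python
def build_lca(children: list[list[int]], root: int = 0):
    """children[v] lists the children of v. Returns a function lca(u, v)."""
    euler = [root]      # nodes in Euler-tour order
    depth_at = [0]      # depth of the node at each tour position
    first = {root: 0}   # first tour position of each node

    # Iterative DFS: record a node when entered and each time we return to it.
    stack = [(root, 0, iter(children[root]))]
    while stack:
        node, d, it = stack[-1]
        child = next(it, None)
        if child is None:
            stack.pop()
            if stack:
                euler.append(stack[-1][0])
                depth_at.append(stack[-1][1])
        else:
            first[child] = len(euler)
            euler.append(child)
            depth_at.append(d + 1)
            stack.append((child, d + 1, iter(children[child])))

    # Sparse table: table[k][i] is the tour position of minimum depth
    # within the window [i, i + 2^k).
    m = len(euler)
    table = [list(range(m))]
    k = 1
    while (1 << k) <= m:
        prev, half = table[-1], 1 << (k - 1)
        row = []
        for i in range(m - (1 << k) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if depth_at[a] <= depth_at[b] else b)
        table.append(row)
        k += 1

    def lca(u: int, v: int) -> int:
        lo, hi = sorted((first[u], first[v]))
        k = (hi - lo + 1).bit_length() - 1  # largest power of two that fits
        a, b = table[k][lo], table[k][hi - (1 << k) + 1]
        return euler[a if depth_at[a] <= depth_at[b] else b]

    return lca


# Tree: 0 is the root; 1 and 5 are its children; 2 under 1; 3 and 4 under 2.
lca = build_lca([[1, 5], [2], [3, 4], [], [], []])
print(lca(3, 4), lca(3, 5), lca(4, 1))  # 2 0 1
```

Each query touches two table entries, so its cost does not depend on n; the table itself holds O(n log n) entries, which is the space and preprocessing cost that the optimal linear-preprocessing algorithms avoid.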

