Simple linear work suffix array construction, 2003
Cited by 119 (6 self)
Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space–time tradeoff. For any v ∈ [1, √n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
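To make the structure under discussion concrete, the following is a minimal sketch of a suffix array built by sorting the suffixes directly. It is not DC3 and is nowhere near linear time; the function name `naive_suffix_array` is an illustrative assumption, not something taken from the paper.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`, in lexicographic order.

    Naive comparison sort over the suffixes themselves. This is NOT the
    linear-time DC3 algorithm; it only shows what a suffix array contains.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    # Suffixes of "banana" in sorted order: a, ana, anana, banana, na, nana
    print(naive_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

The result is just an array of n integers, which is the source of the space advantage over suffix trees that the abstract mentions.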
Reducing the Space Requirement of Suffix Trees - Software – Practice and Experience, 1999
Cited by 109 (10 self)
We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different types. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
Suffix Cactus: A Cross between Suffix Tree and Suffix Array, 1995
Cited by 38 (2 self)
The suffix cactus is a new alternative to the suffix tree and the suffix array as an index of large static texts. Its size and its performance in searches lie between those of the suffix tree and the suffix array. Structurally, the suffix cactus can be seen either as a compact variation of the suffix tree or as an augmented suffix array.
Efficient Implementation of Lazy Suffix Trees, 1999
Cited by 36 (5 self)
We present an efficient implementation of a write-only top-down construction for suffix trees. Our implementation is based on a new, space-efficient representation of suffix trees that requires only 12 bytes per input character in the worst case, and 8.5 bytes per input character on average for a collection of files of different types. We show how to efficiently implement the lazy evaluation of suffix trees such that a subtree is evaluated only when it is traversed for the first time. Our experiments show that for the problem of searching many exact patterns in a fixed input string, the lazy top-down construction is often faster and more space efficient than other methods. Copyright © 2003 John Wiley & Sons, Ltd. KEY WORDS: string matching; suffix tree; space-efficient implementation; lazy evaluation
Optimal Exact String Matching Based on Suffix Arrays - In Proceedings of the Ninth International Symposium on String Processing and Information Retrieval. Springer-Verlag, Lecture Notes in Computer Science, 2002
Cited by 34 (1 self)
Using the suffix tree of a string S, decision queries of the type "Is P a substring of S?" can be answered in O(|P|) time and enumeration queries of the type "Where are all z occurrences of P in S?" can be answered in O(|P|+z) time, totally independent of the size of S. However, in large-scale applications such as genome analysis, the space requirements of the suffix tree are a severe drawback. The suffix array is a more space-economical index structure. Using it and an additional table, Manber and Myers (1993) showed that decision queries and enumeration queries can be answered in O(|P|+log |S|) and O(|P|+log |S|+z) time, respectively, but no optimal time algorithms are known. In this paper, we show how to achieve the optimal O(|P|) and O(|P|+z) time bounds for the suffix array. Our approach is not confined to exact pattern matching. In fact, it can be used to efficiently solve all problems that are usually solved by a top-down traversal of the suffix tree. Experiments show that our method is not only of theoretical interest but also of practical relevance.
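As a baseline for the bounds quoted above, here is a rough sketch of decision and enumeration queries answered by plain binary search over a suffix array. It uses no additional table, so each query costs O(|P| log |S|) character comparisons; it is neither the Manber and Myers method nor the optimal method of this paper. The names `build_suffix_array` and `occurrences` are hypothetical, and the suffix array is built naively for brevity.

```python
def build_suffix_array(s: str) -> list[int]:
    # Naive construction by sorting suffixes; adequate for a small demo only.
    return sorted(range(len(s)), key=lambda i: s[i:])


def occurrences(s: str, sa: list[int], p: str) -> list[int]:
    """Return all start positions of p in s, found by binary search on sa."""
    m = len(p)

    # Lower bound: first suffix whose length-m prefix is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid

    # Suffixes in sa[start:lo] are exactly those that begin with p.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "mississippi"
    sa = build_suffix_array(text)
    print(occurrences(text, sa, "issi"))       # enumeration query: [1, 4]
    print(bool(occurrences(text, sa, "xyz")))  # decision query: False
```

The log |S| factor comes from the binary search itself, which is the dependence on the size of S that the suffix tree avoids.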
An Introduction to Bioinformatics Algorithms, 2004
Cited by 34 (0 self)
In the early 1990s when one of us was teaching his first bioinformatics class, he was not sure that there would be enough students to teach. Although
Suffix Trees on Words, 1995
Cited by 27 (2 self)
We present an intrinsic generalization of the suffix tree, designed to index a string of length n which has a natural partitioning into m multi-character substrings or words. The word suffix tree represents only the m suffixes that start at word boundaries. These boundaries are determined by delimiters, whose definition depends on the application. Since traditional suffix tree construction algorithms rely heavily on the fact that all suffixes are inserted, construction of a word suffix tree is nontrivial, in particular when only O(m) construction space is allowed. We solve this problem, presenting an algorithm with O(n) expected running time. In general, construction cost is Ω(n) due to the need to scan the entire input. In applications that require strict node ordering, an additional cost of sorting O(m') characters arises, where m' is the number of distinct words. This is a significant improvement over previous solutions. In some cases, when the alphabet is small, we may assume that the n characters in the input string occupy o(n) machine words. We show that this can allow a word suffix tree to be built in sublinear time.
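The idea of indexing only the m suffixes that start at word boundaries can be illustrated with a small sketch. It assumes a space character as the only delimiter and produces a sorted array of word-start positions rather than a tree; it is not the construction algorithm of the paper, and both function names are hypothetical.

```python
def word_start_positions(text: str, delimiters: str = " ") -> list[int]:
    """Return the positions where a word begins (non-delimiter after a delimiter)."""
    starts = []
    prev_is_delim = True  # treat the start of the text as a boundary
    for i, ch in enumerate(text):
        is_delim = ch in delimiters
        if prev_is_delim and not is_delim:
            starts.append(i)
        prev_is_delim = is_delim
    return starts


def sorted_word_suffixes(text: str, delimiters: str = " ") -> list[int]:
    """Sort only the suffixes that start at word boundaries (m entries, not n)."""
    return sorted(word_start_positions(text, delimiters), key=lambda i: text[i:])


if __name__ == "__main__":
    text = "to be or not to be"
    print(word_start_positions(text))  # [0, 3, 6, 9, 13, 16]
    print(sorted_word_suffixes(text))  # [16, 3, 9, 6, 13, 0]
```

Here the index holds 6 entries for a text of 18 characters, which is the kind of reduction from n to m that motivates the word suffix tree.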
An Alphabet Independent Approach to Two Dimensional Matching, 1994
Cited by 25 (8 self)
There are many solutions to the string matching problem which are strictly linear in the input size and independent of alphabet size. Furthermore, the model of computation for these algorithms is very weak: they allow only simple arithmetic and comparisons of equality between characters of the input. In contrast, algorithms for two dimensional matching have needed stronger models of computation, most notably assuming a totally ordered alphabet. The fastest algorithms for two dimensional matching have therefore had a logarithmic dependence on the alphabet size. In the worst case, this gives an algorithm that runs in O(n log m) time with O(m log m) preprocessing.
Fully-compressed suffix trees - IN: PACS 2000. LNCS, 2000
Cited by 17 (12 self)
Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the n log₂ σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but still unsatisfactory when σ is small as in DNA sequences. In this paper we introduce the first compressed suffix tree representation that breaks this linear-space barrier. Our representation requires sublinear extra space and supports a large set of navigational operations in logarithmic time. An essential ingredient of our representation is the lowest common ancestor (LCA) query. We reveal important connections between LCA queries and suffix tree navigation.
On Compact Directed Acyclic Word Graphs - Structures in Logic and Computer Science, 1997
Cited by 16 (2 self)
The Directed Acyclic Word Graph (DAWG) is a space-efficient data structure to treat and analyze repetitions in a text, especially in DNA genomic sequences. Here, we consider the Compact Directed Acyclic Word Graph of a word. We give the first direct algorithm to construct it. It runs in time linear in the length of the string on a fixed alphabet. Our implementation requires half the memory space used by DAWGs.

