Reducing the Space Requirement of Suffix Trees
 Software – Practice and Experience
, 1999
We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space-efficient representations. The most space-efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different types. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
Fast and Flexible String Matching by Combining Bitparallelism and Suffix Automata
 ACM JOURNAL OF EXPERIMENTAL ALGORITHMICS (JEA)
, 1998
... In this paper we merge bit-parallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bit-parallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as Shift-Or. It inherits from Shift-Or the ability to handle flexible patterns and from BDM the ability to skip characters. BNDM is 30%-40% faster than BDM and up to 7 times faster than Shift-Or. When compared to the fastest existing algorithms on exact patterns (which belong to the BM family), BNDM is from 20% slower to 3 times faster, depending on the alphabet size. With respect to flexible pattern searching, BNDM is by far the fastest technique to deal with classes of characters and is competitive for searching allowing errors. In particular, BNDM seems very adequate for computational biology applications, since it is the fastest algorithm to search on DNA sequences and flexible searching is an important problem in that ...
A Bitparallel Approach to Suffix Automata: Fast Extended String Matching
, 1998
We present a new algorithm for string matching. The algorithm, called BNDM, is the bit-parallel simulation of a known (but recent) algorithm called BDM. BDM skips characters using a "suffix automaton" which is made deterministic in the preprocessing. BNDM, instead, simulates the nondeterministic version using bit-parallelism. This algorithm is 20%-25% faster than BDM, 2-3 times faster than other bit-parallel algorithms, and 10%-40% faster than all the Boyer-Moore family. This makes it the fastest algorithm in all cases except for very short or very long patterns (e.g. on English text it is the fastest between 5 and 110 characters). Moreover, the algorithm is very simple, making it easy to implement other variants of BDM which are extremely complex in their original formulation. We show that, like other bit-parallel algorithms, BNDM can be extended to handle classes of characters in the pattern and in the text, multiple patterns and to allow errors in the pattern or in the text, combin...
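The bit-parallel window scan described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `bndm_search` and the 0-based match positions are assumptions made here, and Python's unbounded integers stand in for a machine word, so the usual limit on pattern length does not show up.

```python
def bndm_search(pattern, text):
    """Return the 0-based start positions of pattern in text (BNDM-style scan)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # masks[c] has bit (m - 1 - i) set wherever pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))
    full = (1 << m) - 1   # all m state bits set
    high = 1 << (m - 1)   # set when the text read so far is a pattern prefix
    matches = []
    pos = 0
    while pos <= n - m:
        j = m             # characters of the window still unread (read right to left)
        last = m          # shift to apply after this window
        state = full
        while state:
            state &= masks.get(text[pos + j - 1], 0)
            j -= 1
            if state & high:
                if j > 0:
                    last = j          # a pattern prefix starts at window offset j
                else:
                    matches.append(pos)
                    break
            state = (state << 1) & full
        pos += last
    return matches
```

For instance, `bndm_search("aba", "abababa")` returns `[0, 2, 4]`. The inner loop stops as soon as `state` becomes zero, which is where the character skipping comes from.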
Factor Oracle: A New Structure for Pattern Matching
, 1999
We introduce a new automaton on a word $p$, a sequence of letters taken from an alphabet $\Sigma$, that we call factor oracle. This automaton is acyclic, recognizes at least the factors of $p$, has $m + 1$ states and a linear number of transitions. We give an online construction to build it. We use this new structure in string matching algorithms that we conjecture optimal according to the experimental results. These algorithms are as efficient as existing ones while using less memory and being easier to implement. Keywords: indexing, finite automaton, pattern matching, algorithm design.
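The online construction mentioned in this abstract can be sketched in a few lines of Python. This is a hedged illustration based on the commonly described supply-link formulation, not code from the paper; the names `build_factor_oracle`, `accepts`, and `supply` are chosen here for illustration.

```python
def build_factor_oracle(p):
    """Build the factor oracle of p; returns one transition dict per state 0..m."""
    m = len(p)
    trans = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)          # supply link of each state (-1 for state 0)
    for i in range(1, m + 1):
        c = p[i - 1]
        trans[i - 1][c] = i          # "spine" transition spelling out p
        k = supply[i - 1]
        # Add external transitions along the chain of supply links.
        while k > -1 and c not in trans[k]:
            trans[k][c] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans


def accepts(trans, w):
    """Every state is accepting, so w is accepted if it can be read from state 0."""
    state = 0
    for c in w:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True
```

Every factor of `p` is accepted, but the converse need not hold, which is why the abstract says "at least the factors".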
Practical Suffix Tree Construction
 In Proc. 30th International Conference on Very Large Data Bases
, 2004
Large string datasets are common in a number of emerging text and biological database applications.
Direct construction of Compact Directed Acyclic Word Graphs
 COMBINATORIAL PATTERN MATCHING (AARHUS, 1997), FRANCE
, 1997
The Directed Acyclic Word Graph (DAWG) is an efficient data structure to treat and analyze repetitions in a text, especially in DNA genomic sequences. Here, we consider the Compact Directed Acyclic Word Graph of a word. We give the first direct algorithm to construct it. It runs in time linear in the length of the string on a fixed alphabet. Our implementation requires half the memory space used by DAWGs.
Suffix Trees and their Applications in String Algorithms
, 1993
The suffix tree is a compacted trie that stores all suffixes of a given text string. This data structure has been intensively employed in pattern matching on strings and trees, with a wide range of applications, such as molecular biology, data processing, text editing, term rewriting, interpreter design, information retrieval, abstract data types and many others. In this paper, we survey some applications of suffix trees and some algorithmic techniques for their construction. Special emphasis is given to the most recent developments in this area, such as parallel algorithms for suffix tree construction and generalizations of suffix trees to higher dimensions, which are important in multidimensional pattern matching. Work partially supported by the ESPRIT BRA ALCOM II under contract no. 7141 and by the Italian MURST Project "Algoritmi, Modelli di Calcolo e Strutture Informative". Part of this work was done while the author was visiting AT&T Bell Laboratories. Email: grossi@di.uni...
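To make the definition concrete, here is a small Python sketch of the uncompacted relative of the suffix tree, a suffix trie built by naive quadratic insertion. It deliberately skips path compaction and linear-time construction, so it only illustrates the idea; the function names and the `$` terminator are assumptions made for this example.

```python
def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text + terminator into a nested-dict trie."""
    s = text + terminator            # terminator assumed not to occur in text
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(root, pattern):
    """A pattern occurs in the text iff it labels a path starting at the root."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def count_leaves(node):
    """With a unique terminator, each suffix ends in its own leaf."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())
```

A suffix tree is obtained from this trie by merging every chain of single-child nodes into one edge labeled with a substring, which brings the number of nodes down to linear in the text length.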
On Compact Directed Acyclic Word Graphs
 Structures in Logic and Computer Science
, 1997
The Directed Acyclic Word Graph (DAWG) is a space-efficient data structure to treat and analyze repetitions in a text, especially in DNA genomic sequences. Here, we consider the Compact Directed Acyclic Word Graph of a word. We give the first direct algorithm to construct it. It runs in time linear in the length of the string on a fixed alphabet. Our implementation requires half the memory space used by DAWGs.
Suffix Trees and String Complexity
 Advances in Cryptology: Proc. of EUROCRYPT, LNCS 658
, 1992
Let $s = (s_1, s_2, \ldots, s_n)$ be a sequence of characters where $s_i \in \mathbb{Z}_p$ for $1 \le i \le n$. One measure of the complexity of the sequence $s$ is the length of the shortest feedback shift register that will generate $s$, which is known as the maximum order complexity of $s$ [17, 18]. We provide a proof that the expected length of the shortest feedback register to generate a sequence of length $n$ is less than $2 \log_p n + o(1)$, and also give several other statistics of interest for distinguishing random strings. The proof is based on relating the maximum order complexity to a data structure known as a suffix tree. 1 Introduction A common form of stream cipher are the so-called running key ciphers [4, 9] which are deterministic approximations to the one time pad. A running key cipher generates an ultimately periodic sequence $s = (s_1, s_2, \ldots, s_n)$, $s_i \in \mathbb{Z}_p$, $1 \le i \le n$, for a given seed or key K. Encryption is performed as with the one time pad, using $s$ as the key stream, but perfect secu...
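The complexity measure defined in this abstract can be computed directly from its definition. The Python sketch below is a brute-force illustration under that reading, not the suffix-tree-based approach the paper relies on: it looks for the smallest register length L such that every length-L context in the sequence is always followed by the same symbol. The function name is an assumption made here.

```python
def max_order_complexity(s):
    """Smallest L such that each length-L window of s determines the next symbol."""
    n = len(s)
    for length in range(n):
        successor = {}               # context -> the symbol that must follow it
        consistent = True
        for i in range(n - length):
            context = s[i:i + length]
            nxt = s[i + length]
            if successor.setdefault(context, nxt) != nxt:
                consistent = False   # same context, two different successors
                break
        if consistent:
            return length
    return 0                         # only reached for the empty sequence
```

For example, `max_order_complexity("0101")` is 1, while `max_order_complexity("0001")` is 3, because the context `00` is followed by both `0` and `1`.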
Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of the Depth
Suffix trees are the most frequently used data structures in algorithms on words. In this paper, we consider the depth of a compact suffix tree, also known as the PAT tree, under some simple probabilistic assumptions. For a biased memoryless source, we prove that the limiting distribution for the depth in a PAT tree is the same as the limiting distribution for the depth in a PATRICIA trie, even though the PATRICIA trie is constructed from statistically independent strings. As a result, we show that the limiting distribution for the depth in a PAT tree built over n suffixes is normal.