## Reducing the Space Requirement of Suffix Trees (1999)

Venue: Software – Practice and Experience

Citations: 120 (11 self)

### BibTeX

@ARTICLE{Kurtz99reducingthe,
    author = {Stefan Kurtz},
    title = {Reducing the Space Requirement of Suffix Trees},
    journal = {Software – Practice and Experience},
    year = {1999},
    volume = {29},
    pages = {1149--1171}
}

### Abstract

We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained.

Copyright © 1999 John Wiley & Sons, Ltd.

KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
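The per-character figures in the abstract translate directly into memory estimates. The following is a minimal sketch of that arithmetic; the function and constant names are illustrative assumptions, and the average figure is the one reported for the 42 test files, not a guarantee for arbitrary input.

```python
WORST_CASE_BYTES_PER_CHAR = 20   # worst-case figure quoted in the abstract
AVERAGE_TENTHS_PER_CHAR = 101    # 10.1 bytes, kept in tenths to avoid float rounding


def space_estimate(n: int) -> dict:
    """Estimate suffix tree size in bytes for an input of n characters."""
    return {
        "worst_case": WORST_CASE_BYTES_PER_CHAR * n,
        "average": AVERAGE_TENTHS_PER_CHAR * n / 10,
    }


print(space_estimate(1_000_000))
```

For an input of 1,000,000 characters this gives 20,000,000 bytes in the worst case and 10,100,000 bytes at the reported average.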

### Citations

953 |
Algorithms on strings, trees, and sequences
- Gusfield
- 1997
Citation Context ...r 40 references on suffix trees and Manber and Myers [MM93] add several more recent ones. A very thorough discussion of current knowledge on suffix tree constructions and applications can be found in [Gus97]. Despite these superior features and the wide acceptance by theoretical computer scientists, suffix trees have not seen widespread use in string processing software, in contrast to e.g. finite automa...

744 |
The Art of Computer Programming, Volume 3: Sorting and Searching
- Knuth
- 1973
Citation Context ...d so this table requires 3q integers. The space requirement for the hash table depends on the hashing technique. We use an open addressing hashing technique, with double hashing to resolve collisions [Knu73]. The hash function is based on the division method. This implies that the actual size of the hash table is the smallest prime larger than 2n. Each entry of the hash table stores two integers: the has...
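The sizing rule and collision strategy mentioned in this context (open addressing, double hashing, division method, table size equal to the smallest prime larger than 2n) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names, the secondary hash `1 + key % (m - 1)`, and the use of plain integer keys are all hypothetical choices.

```python
def is_prime(m: int) -> bool:
    """Trial division; adequate for a small illustration."""
    if m < 2:
        return False
    i = 2
    while i * i <= m:
        if m % i == 0:
            return False
        i += 1
    return True


def table_size(n: int) -> int:
    """Smallest prime strictly larger than 2n."""
    m = 2 * n + 1
    while not is_prime(m):
        m += 1
    return m


def probe_sequence(key: int, m: int):
    """Yield the slots visited by double hashing in a table of prime size m."""
    h1 = key % m              # primary hash: division method
    h2 = 1 + key % (m - 1)    # assumed secondary hash; always in 1..m-1
    for i in range(m):
        yield (h1 + i * h2) % m


def insert(table: list, key: int) -> int:
    """Store key in the first free slot of its probe sequence; return the slot."""
    for slot in probe_sequence(key, len(table)):
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")
```

Because the table size is prime and the step `h2` is never a multiple of it, each probe sequence visits every slot exactly once, so an insertion only fails when the table is completely full.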

673 | Suffix arrays: A new method for on-line string searches
- Manber, Myers
- 1993
Citation Context ...lm of string processing can be adapted so easily to achieve superb efficiency in such a great variety of applications. Apostolico [Apo85] gives over 40 references on suffix trees and Manber and Myers [MM93] add several more recent ones. A very thorough discussion of current knowledge on suffix tree constructions and applications can be found in [Gus97]. Despite these superior features and the wide accep...

644 |
Modeling for text compression
- Bell, Witten, et al.
- 1989
Citation Context ... to difficult and practice shows that they are worth the effort. 7 Experiments. For our experiments, we collected a set of 42 files from different sources: • We used 17 files from the Calgary Corpus [BCW90] and all 14 files from the Canterbury Corpus [AB97]. The Calgary Corpus usually consists of 18 files, but since the file pic is identical to the file ptt5 of the Canterbury Corpus, we did not include ...

596 | A block-sorting lossless data compression algorithm
- Burrows, Wheeler
- 1994
Citation Context ...n techniques in two applications: • In a lossless data compression program [BKS99] suffix trees in both improved implementation techniques are used to compute the Burrows and Wheeler Transformation [BW94] of a string in linear time and space. This basically means to sort all suffixes of a string in lexicographic order, see also Table 4, application 13. • In a program called REPuter [KS98] suffix tre...
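The remark that computing the Burrows and Wheeler Transformation "basically means to sort all suffixes of a string in lexicographic order" can be illustrated with a deliberately naive sketch. It sorts the suffixes with a comparison sort instead of traversing a suffix tree, so it does not achieve the linear time and space mentioned in the context; the function name and the `$` sentinel convention are assumptions for illustration.

```python
def bwt(x: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform of x via explicit suffix sorting."""
    if sentinel in x:
        raise ValueError("input must not contain the sentinel character")
    t = x + sentinel
    # Start positions of all suffixes of t, in lexicographic order of the suffixes.
    order = sorted(range(len(t)), key=lambda i: t[i:])
    # The transform lists the character preceding each sorted suffix;
    # for the suffix starting at position 0, t[-1] wraps around to the sentinel.
    return "".join(t[i - 1] for i in order)


print(bwt("abab"))
```

A suffix tree yields the same order by visiting its leaves in a depth-first traversal that follows outgoing edges in alphabetical order.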

568 |
A space economical suffix tree construction algorithm
- McCreight
- 1976
Citation Context ...lefeld † partially supported by DFG-grant Ku 1257/1-1 trees have a reputation of being very greedy for space. In fact, the most space efficient previous implementation technique for suffix trees (see [McC76]) requires 28n bytes in the worst case where n is the length of the input string. 1 The space requirement in practice is smaller, but previous authors do not give consistent numbers: • Manber and My...

445 | Linear pattern matching algorithms - Weiner - 1973 |

130 | Optimal suffix tree construction with large alphabets
- Farach
- 1997
Citation Context ...hing node, or the root; and (ii) a string w occurs in T if and only if w is a substring of x$. Figure 1 shows the suffix tree for x = abab. There are several algorithms to construct ST in linear time [4,19,20,21]. Giegerich and Kurtz [22] review three of these algorithms and reveal relationships much closer than one would think. For any i ∈ [1, n + 1], let S_i = x_i ... x_n$ denote the ith non-empty suffix of x$. No...
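For the example x = abab used in this context, the suffixes S_i and the branching nodes of the suffix tree can be enumerated by brute force. The sketch below is only meant to make the definitions concrete: it is not one of the linear-time constructions cited above, the function names are hypothetical, and it identifies a branching node with a substring of x$ that is followed by at least two different characters (the empty string stands for the root).

```python
def suffixes(x: str, sentinel: str = "$") -> list:
    """Return S_1, ..., S_{n+1}, the non-empty suffixes of x$."""
    t = x + sentinel
    return [t[i:] for i in range(len(t))]


def branching_nodes(x: str, sentinel: str = "$") -> list:
    """Substrings of x$ that are followed by two or more distinct characters."""
    t = x + sentinel
    followers = {}
    for i in range(len(t)):
        for j in range(i, len(t)):
            # t[i:j] occurs at position i and is followed there by t[j].
            followers.setdefault(t[i:j], set()).add(t[j])
    return sorted(w for w, chars in followers.items() if len(chars) >= 2)


print(suffixes("abab"))
print(branching_nodes("abab"))
```

For x = abab this reports five suffixes (one leaf each) and three branching nodes: the root, `ab`, and `b`.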

122 | The string B-tree: a new data structure for string search in external memory and its applications
- Ferragina, Grossi
- 1999
Citation Context ...tions of substrings of the input string in a less obvious way. To allow constructions and applications of suffix trees for very large input strings (like they occur in genome research), other authors [16,17] developed techniques to organize suffix trees on disk, so that the number of disk accesses is reduced. However, again these techniques are mainly optimized for string matching problems, and the behav...

118 |
The myriad virtues of subword trees
- Apostolico
- 1985
Citation Context ...gance is surpassed only by their versatility. No other idea in the realm of string processing can be adapted so easily to achieve superb efficiency in such a great variety of applications. Apostolico [Apo85] gives over 40 references on suffix trees and Manber and Myers [MM93] add several more recent ones. A very thorough discussion of current knowledge on suffix tree constructions and applications can be...

105 | REPuter: fast computation of maximal repeats in complete genomes
- Kurtz, Schleiermacher
- 1999
Citation Context ...sformation [BW94] of a string in linear time and space. This basically means to sort all suffixes of a string in lexicographic order, see also Table 4, application 13. • In a program called REPuter [KS98] suffix trees in the improved linked list imple... [Table 3: Running times (absolute and throughput in seconds) for different variants of McCreight's Algorithm; column headers: SLLI, ILLI, SHTI, IHTI, file, length, k, time, tpu...]

101 |
The smallest automaton recognizing the subwords of a text. Theoret
- Blumer, Blumer, et al.
- 1985
Citation Context ...nd efficient as suffix trees (and they are not expected to be). Second, the direct construction methods for these index structures do not run in linear worst case time. ¶ Directed acyclic word graphs [12,13] (dawgs, for short), and more space efficient variants thereof [14,15], have essentially the same applications as suffix trees. The compact dawg, which is the most space efficient of these index struc...

90 | Transducers and repetitions
- Crochemore
- 1986
Citation Context ...nd efficient as suffix trees (and they are not expected to be). Second, the direct construction methods for these index structures do not run in linear worst case time. ¶ Directed acyclic word graphs [12,13] (dawgs, for short), and more space efficient variants thereof [14,15], have essentially the same applications as suffix trees. The compact dawg, which is the most space efficient of these index struc...

77 | Linear pattern matching algorithm - Weiner - 1973 |

75 | From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica
- Giegerich, Kurtz
- 1997
Citation Context ...) = {w ∈ Σ* | w is a substring of x$}. Figure 1 shows the suffix tree for x = abab. ST can be constructed in O(n) time using one of the algorithms described in [Wei73,McC76,Ukk95,Far97]. See also [GK97] which reviews [Wei73,McC76,Ukk95] and reveals relationships between these algorithms much closer than one would think. For any i ∈ [1, n + 1], let S_i = x_i ... x_n$ denote the ith non-empty suffi...

72 |
On-line construction of suffix-trees
- Ukkonen
- 1995
Citation Context ...hing node, or the root; and (ii) a string w occurs in T if and only if w is a substring of x$. Figure 1 shows the suffix tree for x = abab. There are several algorithms to construct ST in linear time [4,19,20,21]. Giegerich and Kurtz [22] review three of these algorithms and reveal relationships much closer than one would think. For any i ∈ [1, n + 1], let S_i = x_i ... x_n$ denote the ith non-empty suffix of x$. No...

61 |
Sublinear approximate string matching and biological applications
- Chang, Lawler
- 1994
Citation Context ...x tree can still be constructed in O(kn) time, since the suffix link is retrieved at most n times during suffix tree construction [4,18]. Moreover, suffix tree applications which utilize suffix links [3,31,32] have an alphabet factor in their running time anyway (if a linked list implementation of suffix trees is used). As a consequence, linear retrieval of suffix links does not influence the asymptotic ru...

60 | Complete inverted files for efficient text retrieval and analysis
- Blumer, Blumer, et al.
- 1987
Citation Context ..., the direct construction methods for these index structures do not run in linear worst case time. ¶ Directed acyclic word graphs [12,13] (dawgs, for short), and more space efficient variants thereof [14,15], have essentially the same applications as suffix trees. The compact dawg, which is the most space efficient of these index structures, occupies 36n bytes in the worst case. Recently, Crochemore and ...

57 | On the investigation of suffix trees
- Clark
- 1991
Citation Context ... much of the space requirement is due to this additional feature. It is important to note that these numbers include the space required during the construction of suffix trees. Recently, Munro et al. [8] described a representation of suffix trees which requires n⌈log₂ n⌉ + o(n) bits. However, it is restricted to searching for string patterns, and it is not clear if there is a linear time algorithm to d...

56 | A corpus for the evaluation of lossless compression algorithms
- Arnold, Bell
- 1997
Citation Context ... the effort. 7 Experiments. For our experiments, we collected a set of 42 files from different sources: • We used 17 files from the Calgary Corpus [BCW90] and all 14 files from the Canterbury Corpus [AB97]. The Calgary Corpus usually consists of 18 files, but since the file pic is identical to the file ptt5 of the Canterbury Corpus, we did not include it here. Figure 7: McCreight's suffix tree construc...

40 | Efficient Implementation of Lazy Suffix Trees
- Giegerich, Kurtz, et al.
- 1999
Citation Context ...ould result in a larger space usage in practice. Therefore, we do not consider it further. Note that storing the nodes of the suffix tree in depth first or breadth first order (as in Giegerich et al. [25]) to save the space for the firstchild- or branchbrother-references does not allow linear time construction. This is because during linear time suffix tree constructions the relations of the nodes chan...

38 | Suffix cactus: A cross between suffix tree and suffix array
- Kärkkäinen
- 1995
Citation Context ...sistent numbers: • Manber and Myers [MM93] state that their implementation of suffix trees occupies between 18.8n and 22.4n bytes of space for real input strings (text, code, DNA). 2 • Kärkkäinen [Kar95] claims that a suffix tree can be implemented in 15n–18n bytes of space for real input strings. Unfortunately, it is not shown how to achieve this. • Crochemore and Vérin [CV97] state that s...

38 | Extended application of suffix trees to data compression
- Larsson
- 1996
Citation Context ...ber and the depth of the predecessor: Observation 4. If w -u→ wu is an edge in ST and wu = S_j for some j ∈ [1, n + 1], then u = x_i ... x_n$, where i = j + w.depth. A similar observation was made by Larsson [23], but without a clear statement about its consequences concerning the space requirement of ST. Note that storing the depth of a branching node has some practical advantages over storing the length of ...
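Observation 4 in this context says that the label of an edge leading to a leaf need not be stored: it follows from the leaf's suffix number j and the depth of its parent node w. A minimal sketch of that computation, assuming 1-based suffix numbers as in the text and a hypothetical function name:

```python
def leaf_edge_label(x: str, j: int, parent_depth: int, sentinel: str = "$") -> str:
    """Label u of the leaf edge w -> wu, where wu = S_j and |w| = parent_depth."""
    t = x + sentinel
    i = j + parent_depth   # 1-based start position of u in x$
    return t[i - 1:]       # u = x_i ... x_n$


# Suffix tree of x = abab: the leaf for S_1 = abab$ hangs below the node "ab" (depth 2).
print(leaf_edge_label("abab", 1, 2))
```

Concatenating the parent's string with the computed label reproduces the suffix, for example `"ab" + "ab$"` for S_1 in the example above.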

35 | Efficient implementation of suffix trees
- Andersson, Nilsson
- 1995
Citation Context ...fix trees and are therefore more space efficient: the suffix array of Manber and Myers [2] requires 9n bytes (including the space for construction). The level compressed trie of Andersson and Nilsson [9] takes about 12n bytes. The suffix binary search tree of Irving [10] requires 10n bytes. The suffix cactus of Kärkkäinen [5] can be implemented in 10n bytes. Finally, the PT-tree of Colussi and De Col...

30 | Generalized suffix trees for biological sequence data: applications and implementations - Bieganski, Riedl, et al. - 1994 |

28 | Modification of the Burrows and Wheeler data compression algorithm
- Balkenhol, Kurtz, et al.
- 1999
Citation Context ...t. This should make suffix trees more attractive for practical applications. Indeed, we have already used our implementation techniques in two applications: • In a lossless data compression program [BKS99] suffix trees in both improved implementation techniques are used to compute the Burrows and Wheeler Transformation [BW94] of a string in linear time and space. This basically means to sort all suffix...

21 | The Context Trees of Block Sorting Compression
- Larsson
- 1998
Citation Context ...shed with the linked list implementation. In contrast, the hash table implementation is less useful here, since it does not immediately reveal the set of edges outgoing from some node. As remarked in [Lar98], it is possible to sort the hash table such that it allows complete traversals. In a first phase all edges stored in the hash table are sorted according to the nodes they are outgoing from. This can ...

20 | A Comparison of Imperative and purely Functional Suffix tree constructions
- Giegerich, Kurtz
- 1995
Citation Context ...he same order as McCreight's algorithm. As a consequence, one could instead use Ukkonen's algorithm. We prefer McCreight's algorithm, since it is slightly faster than Ukkonen's algorithm, as shown in [GK95]. It is easy to see that the functions insertleaf and insertbranch can be implemented in constant time. Thus it remains to show how to • decide whether a branching node is small or large, • set th...

19 | An efficient implementation of trie structures
- Aoe, Morimoto, et al.
- 1992
Citation Context ... hash key). It requires randomizing hash keys, which can be very time consuming in practice. (b) The Bonsai implementation technique [28] is based on Compact Hashing, while the double-array technique [29,30] combines the advantages of arrays and lists. Both techniques are specifically designed to represent trees space efficiently. However, they both required the tree to be built from the root downward, a...

17 |
Compact hash tables using bidirectional linear probing
- Cleary
- 1984
Citation Context ... the class of chaining techniques. This hashing technique uses an overflow area and a linked list of synonyms, and saves space by only storing the remainder of the key. However, as remarked by Cleary [Cle84], each hash table entry (including the original hash location) requires a reference to the next overflow record. This reference will be of about the same size as the reduction in the key size. So, Lam...

15 |
Suffix Binary Search Trees
- Irving
- 1995
Citation Context ...of Manber and Myers [2] requires 9n bytes (including the space for construction). The level compressed trie of Andersson and Nilsson [9] takes about 12n bytes. The suffix binary search tree of Irving [10] requires 10n bytes. The suffix cactus of Kärkkäinen [5] can be implemented in 10n bytes. Finally, the PT-tree of Colussi and De Col [11] requires n log₂ n + O(n) bits. These five index structures hav...

13 |
Fundamental Algorithms for a Declarative Pattern Matching System. Dissertation, Technische Fakultät, Universität Bielefeld, available as Report
- Kurtz
- 1995
Citation Context ...ST_{i-1}, i.e. one computes (headloc_i, tailptr_i) := scanprefix(ST_{i-1}, [w]_{ST_{i-1}}, tailptr_{i-1}). (4) If [w]_{ST_{i-1}} is not a node in ST_{i-1}, then head_i = w (see [Kur95]). Thus headloc_i = [w]_{ST_{i-1}} and tailptr_i = tailptr_{i-1}. This means that with the next call of the function insertbranch, a new branching node head_i will be constructed. Its head positi...

12 | Bonsai: a compact representation of trees - Darragh, Cleary, et al. - 1993 |

11 |
Average sizes of suffix trees and dawgs
- Blumer, Haussler, et al.
- 1989
Citation Context ...y would be 5n integers, independent of the actual number q of branching nodes. However, q is usually considerably smaller than 0.8n (q = 0.62n is the theoretical average value for random strings, see [BEH89]), so that this worst case improvement would result in a larger space usage in practice. Therefore we do not further consider it. Note that storing the nodes of the suffix tree in breadth first or dep...

11 |
Genome analysis: pattern search in biological macromolecules
- Mewes, Heumann
Citation Context ...sible nodes of depths q − 1 occur in the suffix tree, and these can be represented more space efficiently using a heap. A similar technique has already been applied for hashed position trees, see [MH95]. Finally, note that the proposed implementation techniques lead to some interesting combinatorial questions: What is the expected number of small and large nodes? Are there better worst case bounds f...

9 |
The Myriad Virtues of Subword Trees. Combinatorial Algorithms on
- Apostolico
- 1985
Citation Context ...gance is surpassed only by their versatility. No other idea in the realm of string processing can be adapted so easily to achieve superb efficiency in such a great variety of applications. Apostolico [1] gives over 40 references on suffix trees, and Manber and Myers [2] add several more recent ones. A very thorough discussion of current knowledge on suffix tree constructions and applications can be f...

5 |
Direct Construction of Compact Acyclic Word Graphs
- Crochemore, Vérin
- 1997
Citation Context ... Kärkkäinen [Kar95] claims that a suffix tree can be implemented in 15n–18n bytes of space for real input strings. Unfortunately, it is not shown how to achieve this. • Crochemore and Vérin [CV97] state that suffix trees require 32.7n bytes for DNA sequences. • The strmat software package by Knight, Gusfield, and Stoye [KGS98] implements suffix trees in 24n–28n bytes for input sequenc...

5 |
A time and space efficient data structure for string searching on large texts
- Colussi, De Col
- 1996
Citation Context ...takes about 12n bytes. The suffix binary search tree of Irving [10] requires 10n bytes. The suffix cactus of Kärkkäinen [5] can be implemented in 10n bytes. Finally, the PT-tree of Colussi and De Col [11] requires n log₂ n + O(n) bits. These five index structures have two properties in common. First, they are specifically tailored to solve string matching problems, and cannot be adapted to other kinds...

5 |
Bonsai: a compact representation of trees. Software – Practice and Experience
- Darragh, Cleary, et al.
- 1993
Citation Context ...hash key (which we already did by only storing one component of the hash key). It requires randomizing hash keys, which can be very time consuming in practice. (b) The Bonsai implementation technique [28] is based on Compact Hashing, while the double-array technique [29,30] combines the advantages of arrays and lists. Both techniques are specifically designed to represent trees space efficiently. Howe...

4 |
Improved behavior of tries by adaptive branching
- Andersson, Nilsson
- 1993
Citation Context ...hich store less information than suffix trees and are therefore more space efficient: the suffix array of [MM93] requires 9n bytes (including the space for construction). The level compressed trie of [AN93] takes about 12n bytes. The suffix binary search tree of [Irv96] requires 10n bytes. Finally, the suffix cactus of [Kar95] can be implemented in 10n bytes. These four index structures have two properti...

3 |
The Strmat Software-Package
- Knight, Gusfield, et al.
- 1998
Citation Context ...ly, it is not shown how to achieve this. • Crochemore and Vérin [CV97] state that suffix trees require 32.7n bytes for DNA sequences. • The strmat software package by Knight, Gusfield, and Stoye [KGS98] implements suffix trees in 24n–28n bytes for input sequences of length at most 2^23 = 8,388,608. However, strmat can handle sets of strings, and it is unclear how much of the space requirement...

3 |
The position end-set tree: A small automaton for word recognition in biological sequences
- Lefèvre, Ikeda
- 1993
Citation Context ...programs. Since they are easily available, we propose their usage for comparing programs developed for other string processing tasks. • We added 8 of the 11 DNA sequences used by Lefèvre and Ikeda [LI93]. These are denoted by their EMBL database accession number. For the remaining three DNA sequences used in [LI93], the authors did not give enough information to locate them. • We extracted a sectio...

2 |
Average size of suffix trees
- Blumer, Ehrenfeucht, et al.
- 1989
Citation Context ...irement would be 5n integers, independent of the actual number q of branching nodes. However, q is usually considerably smaller than 0.8n (q = 0.62n is the theoretical average value for random strings [24]), so that this worst case improvement would result in a larger space usage in practice. Therefore, we do not consider it further. Note that storing the nodes of the suffix tree in depth first or brea...

1 |
A method of compressing trie structures. Software – Practice and Experience
- Morimoto, Iriguchi, Aoe
- 1994
Citation Context ... hash key). It requires randomizing hash keys, which can be very time consuming in practice. (b) The Bonsai implementation technique [28] is based on Compact Hashing, while the double-array technique [29,30] combines the advantages of arrays and lists. Both techniques are specifically designed to represent trees space efficiently. However, they both required the tree to be built from the root downward, a...