## Compressed full-text indexes (2007)

Venue: | ACM COMPUTING SURVEYS |

Citations: | 172 - 78 self |

### BibTeX

@ARTICLE{Navarro07compressedfull-text,

author = {Gonzalo Navarro and Veli Mäkinen},

title = {Compressed full-text indexes},

journal = {ACM COMPUTING SURVEYS},

year = {2007}

}

### Abstract

Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, radically changing the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes to date, focusing on the essential aspects of how they exploit text compressibility and how they efficiently solve various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.

### Citations

8563 |
Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context ...numbers in italics outside the nodes are the lexicographic ranks of the corresponding strings. Furthermore, the size of the Lempel-Ziv compressed text (slowly) converges to the entropy of the source [=-=Cover and Thomas 1991-=-]. Of more direct relevance to us is that n ′ is related to the empirical entropy Hk(T) [Kosaraju and Manzini 1999; Ferragina and Manzini 2005]. Lemma 8 Let n ′ be the number of phrases produced by LZ... |

2362 | Modern Information Retrieval
- Baeza-Yates, Ribeiro-Neto
- 1999
Citation Context ...such as Finnish and German, where “words” are actually concatenations of the particles one wishes to search. When applicable, inverted indexes require only 20%–100% of extra space on top of the text [=-=Baeza-Yates and Ribeiro 1999-=-]. Moreover, there exist compression techniques that can represent inverted index and text in about 35% of the space required by the original text [Witten et al. 1999; Navarro et al. 2000; Ziviani et ... |

1138 | A Universal Algorithm for Sequential Data Compression - Ziv, Lempel - 1977 |

972 | Computational Geometry: Algorithms and Applications - Berg, Kreveld, et al. - 1997 |

942 |
A method for the construction of minimum redundancy codes
- Huffman
- 1952
Citation Context ...alphabet dependence. The Huffman FM-Index by Grabowski et al. [2004, 2005] (Huff-FMI) uses Huffman as a tool to reduce the alphabet of the text to bits. The idea is to first compress T using Huffman [=-=Huffman 1952-=-; Bell et al. 1990]. In the resulting bit stream T′1,n′, of n′ < n(H0 + 1) bits, logically mark the bits that start a codeword. Apply the BWT over T′ to obtain bit... |

730 | Compression of individual sequences via variable-rate coding
- Ziv, Lempel
- 1978
Citation Context ...character of Σ. We will also use a Lempel-Ziv parsing called LZ78, where each phrase is formed by an already known phrase concatenated with a new character at the end. Definition 17 The LZ78 parsing [=-=Ziv and Lempel 1978-=-] of text T1,n is a sequence Z[1, n′] of phrases such that T = Z[1] Z[2] ... Z[n′], built as follows. The first phrase is Z[1] = ε. Assume we have already processed T1,i−1 producing a sequence Z[... |
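
The LZ78 parsing described in this context can be illustrated with a minimal Python sketch. It is not taken from the survey: the function name `lz78_parse` and the dictionary-based trie are assumptions made for illustration, positions are 0-based, and the initial empty phrase Z[1] = ε is left implicit.

```python
def lz78_parse(text):
    """Split `text` into LZ78 phrases (hypothetical helper, for illustration).

    Each phrase is the longest previously seen phrase plus one new character.
    The empty phrase has id 0 and is not included in the returned list.
    """
    trie = {}      # (parent phrase id, character) -> child phrase id
    phrases = []   # phrases in order of creation
    node = 0       # id of the phrase matched so far (0 = empty phrase)
    start = 0      # text position where the current phrase begins
    for i, ch in enumerate(text):
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep extending the match
        else:
            trie[(node, ch)] = len(phrases) + 1
            phrases.append(text[start:i + 1])  # known phrase + new character
            node = 0
            start = i + 1
    if start < len(text):
        # The text ended in the middle of a match, so the last phrase
        # repeats an earlier one.
        phrases.append(text[start:])
    return phrases


if __name__ == "__main__":
    print(lz78_parse("abababab"))
```

For the text `abababab` the sketch produces the phrases `a`, `b`, `ab`, `aba`, `b`; the final phrase repeats an earlier one because the text ends before a new character can extend it.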

696 |
The Art of Computer Programming, Volume 3: Sorting and Searching
- Knuth
- 1974
Citation Context ...95; Irving 1995; Colussi and de Col 1996; Kärkkäinen and Ukkonen 1996b; Crochemore and Vérin 1997; Kurtz 1998; Giegerich et al. 1999]. 3.1 Tries or Digital Trees A digital tree or trie [Fredkin 1960; =-=Knuth 1973-=-] is a data structure that stores a set of strings. It can support the search for a string in the set in time proportional to the length of the string sought, independently of the set size. Definition... |

644 | Suffix arrays: a new method for on-line string searches
- Manber, Myers
- 1990
Citation Context ...sically unchanged. CPU caches are many times faster than standard main memories. On the other hand, the classical indexes for string matching require from 4 to 20 times the text size [McCreight 1976; =-=Manber and Myers 1993-=-; Kurtz 1998]. This means that, even when we may have enough main memory to hold a text, we may need to use the disk to store the index. 1 In the UC Berkeley report How Much Information?, http://www.s... |

617 |
Text Compression
- Bell, Cleary, et al.
- 1990
Citation Context ...ightly from the original definitions, changing inessential technical details to allow for a simpler exposition. 5.1 k-th Order Empirical Entropy Opposed to the classical notion of k-th order entropy [=-=Bell et al. 1990-=-], which can only be defined for infinite sources, the k-th order empirical entropy defined by Manzini [2001] applies to finite texts. It coincides with the statistical estimation of the entropy of a ... |

565 | A Block-sorting Lossless Data Compression Algorithm
- Burrows, Wheeler
- 1994
Citation Context ...itions nsr to cover a suffix array A is equal to the number of runs in the corresponding Ψ function. 5.3 The Burrows-Wheeler Transform. The Burrows-Wheeler Transform [=-=Burrows and Wheeler 1994-=-] is a reversible transformation from strings to strings. This transformed text is easier to compress by local optimization methods [Manzini 2001]. Definition 13 Given a text T1,n and its suffix ar... |
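
The definition quoted in this context is cut off, so the following Python sketch relies on the commonly used formulation rather than on the survey's exact wording: the i-th transformed character is the text character that precedes the i-th smallest suffix, wrapping around at the start. The sketch assumes the text ends with a unique terminator `$` that sorts before every other character; the names `suffix_array`, `bwt`, and `inverse_bwt` are illustrative, and the suffix sort is deliberately naive.

```python
def suffix_array(text):
    """Naive suffix array: start positions of suffixes in sorted order (0-based)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    """Burrows-Wheeler transform of `text`, which must end with a unique '$'."""
    sa = suffix_array(text)
    # text[i - 1] is the character preceding suffix i; index -1 wraps to '$'.
    return "".join(text[i - 1] for i in sa)


def inverse_bwt(last):
    """Recover the original text from its transform (reversibility check)."""
    n = len(last)
    # A stable sort of positions by character pairs each occurrence in the
    # first column with the same occurrence in the last column.
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for first_pos, last_pos in enumerate(order):
        lf[last_pos] = first_pos
    out = []
    row = 0  # row 0 is the rotation that starts with '$'
    for _ in range(n - 1):
        out.append(last[row])  # characters come out from right to left
        row = lf[row]
    return "".join(reversed(out)) + "$"


if __name__ == "__main__":
    transformed = bwt("banana$")
    print(transformed, inverse_bwt(transformed))
```

The round trip through `inverse_bwt` illustrates the reversibility that the context mentions; a practical implementation would replace the quadratic suffix sort with a dedicated suffix array construction algorithm.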

548 |
A space–economical suffix tree construction algorithm
- McCreight
- 1976
Citation Context ...caches are many times faster than their standard main memories. On the other hand, the classical indexes for string matching require from 4 to 20 times the text size [=-=McCreight 1976-=-; Manber and Myers 1993; Kurtz 1998]. This means that, even when we may have enough main memory to hold a text, we may need to use the disk to store the index. Moreover, most existing indexes are not ... |

485 |
Information retrieval: Data structures and algorithms. Englewood Cliffs
- Frakes, Baeza-Yates
- 1992
Citation Context ...f suffix trees is their high space consumption, which is Θ(n log n) bits and at the very least 10 times the text size in practice [Kurtz 1998]. 3.3 Suffix Arrays A suffix array [Manber and Myers 1993; =-=Gonnet et al. 1992-=-] is simply a permutation of all the suffixes of T so that the suffixes are lexicographically sorted. Definition 6 The suffix array of a text T1,n is an array A[1, n] containing a permutation of the i... |

426 |
Linear pattern matching algorithms
- Weiner
- 1973
Citation Context ...are n leaves and no unary nodes, it is easy to see that suffix trees require O(n) space (the strings at the edges are represented with pointers to the text). Moreover, they can be built in O(n) time [=-=Weiner 1973-=-; McCreight 1976; Ukkonen 1995; Farach 1997]. Fig. 2 shows an example. The search for P in the suffix tree of T is similar to a trie search. Now we may use more than one character of P to traverse an ... |

348 |
Universal codeword sets and representations of the integers
- Elias
- 1975
Citation Context ...lists to compress (exactly as we will have in function Ψ later on). An efficient way to compress the occurrence lists uses differential encoding plus a variable-length coding, such as Elias-δ coding [=-=Elias 1975-=-; Witten et al. 1999]. Take the list for "be": 4, 17. First, we get smaller numbers by representing the differences (called gaps) between adjacent elements: (4 − 0), (17 − 4) = 4, 13. The binary repre... |
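
The differential encoding mentioned in this context can be sketched in Python using its own example list 4, 17. The sketch is an illustration, not the survey's code: the helper names are hypothetical, and it assumes the usual convention where the Elias-γ code of x is ⌊log2 x⌋ zeros followed by the binary form of x, and the Elias-δ code of x is the γ code of the bit length of x followed by x without its leading 1 bit.

```python
def gaps(positions):
    """Differences between adjacent elements of an increasing list (first gap from 0)."""
    out, prev = [], 0
    for p in positions:
        out.append(p - prev)
        prev = p
    return out


def elias_gamma(x):
    """Elias-gamma code of an integer x >= 1, as a string of '0'/'1'."""
    if x < 1:
        raise ValueError("Elias codes are defined for integers >= 1")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_delta(x):
    """Elias-delta code: gamma code of the bit length, then x minus its leading 1."""
    if x < 1:
        raise ValueError("Elias codes are defined for integers >= 1")
    binary = bin(x)[2:]
    return elias_gamma(len(binary)) + binary[1:]


def decode_elias_delta(bits):
    """Decode a concatenation of Elias-delta codes back into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # unary part of the gamma-coded length
            zeros += 1
            i += 1
        length = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        values.append(int("1" + bits[i:i + length - 1], 2))
        i += length - 1
    return values


def positions_from_gaps(gap_list):
    """Prefix sums turn gaps back into absolute positions."""
    out, total = [], 0
    for g in gap_list:
        total += g
        out.append(total)
    return out


if __name__ == "__main__":
    encoded = "".join(elias_delta(g) for g in gaps([4, 17]))
    print(encoded, positions_from_gaps(decode_elias_delta(encoded)))
```

Small gaps receive short codes, which is why the list is encoded as differences rather than as absolute positions.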

328 | On-line construction of suffix trees
- Ukkonen
- 1995
Citation Context ...fix trees require O(n) space (the strings at the edges are represented with pointers to the text). Moreover, they can be built in O(n) time [Weiner 1973; McCreight 1976; =-=Ukkonen 1995-=-; Farach 1997]. Fig. 2 shows an example. [Fig. 2: example suffix tree; edge labels and leaf numbers not recoverable] ... |

268 | An Introduction to the Analysis of Algorithms
- Sedgewick, Flajolet
- 1996
Citation Context ...as there is only a unary path from the node to a leaf. Instead, a pointer to the text position where the corresponding suffix starts is stored. On average, the pruned suffix trie has only O(n) nodes [=-=Sedgewick and Flajolet 1996-=-]. Yet, albeit unlikely, it might have Θ(n²) nodes in the worst case. Fortunately, there exist equally powerful structures that guarantee linear space and construction time in the worst case [Morris... |

259 |
Trie memory
- Fredkin
- 1960
Citation Context ... Kärkkäinen 1995; Irving 1995; Colussi and de Col 1996; Kärkkäinen and Ukkonen 1996b; Crochemore and Vérin 1997; Kurtz 1998; Giegerich et al. 1999]. 3.1 Tries or Digital Trees A digital tree or trie [=-=Fredkin 1960-=-; Knuth 1973] is a data structure that stores a set of strings. It can search for a string in the set in time proportional to the length of the string sought, and independent on the set size. Definiti... |

222 |
On the Complexity of Finite Sequences
- Lempel, Ziv
- 1976
Citation Context ...nen substrings and replacing repetitions by pointers to their former occurrences in T. Lempel-Ziv methods produce a parsing (or partitioning) of the text into phrases. Definition 16 The LZ76 parsing [=-=Lempel and Ziv 1976-=-] of text T1,n is a sequence Z[1, n′] of phrases such that T = Z[1] Z[2] ... Z[n′], built as follows. Assume we have already processed T1,i−1 producing sequence Z[1, p − 1]. Then, we find the lon... |

193 | High-order entropy-compressed text indexes
- Grossi, Gupta, et al.
- 2003
Citation Context ...xes that require space close to that of the best existing compression techniques, and provide search and text recovering functionality with almost optimal time complexity [Ferragina and Manzini 2005; =-=Grossi et al. 2003-=-; Ferragina et al. 2006]. More sophisticated problems are also starting to receive attention. For example, there are studies on efficient construction in little space [H... |

191 | Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees and Multisets - Raman, Raman, et al. - 2002 |

188 | Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching
- Grossi, Vitter
Citation Context ...ally foreseen by Sadakane [2000, 2003] to achieve zero-order encoding of Ψ in the Sad-CSA (Section 7.1). They do not explain how to do rank and select in constant time on this representation, but in [=-=Grossi and Vitter 2006-=-] they explore binary-searchable gap encodings as a practical alternative. An interesting result of Grossi et al. [2004] is that, since the sum of all the γ-encodings... |

180 | Opportunistic Data Structures with Application
- Ferragina, Manzini
- 2000
Citation Context ...al to the k-th order entropy of the text (a lower-bound estimate for the compression achievable by many compressor families). However, it was not until this decade that the first self-index appeared [=-=Ferragina and Manzini 2000-=-] and the potential of the relationship between text compression and text indexing was fully realized, in particular regarding the correspondence between the entropy of a text and the regularities ari... |

176 |
Patricia - practical algorithm to retrieve information coded in alphanumeric
- Morrison
- 1968
Citation Context ...t 1996]. Yet, albeit unlikely, it might have Θ(n²) nodes in the worst case. Fortunately, there exist equally powerful structures that guarantee linear space and construction time in the worst case [=-=Morrison 1968-=-; Apostolico 1985]. Definition 5 The suffix tree of a text T is a suffix trie where each unary path is converted into a single edge. Those edges are labeled by strings obtained by concatenating the ch... |

170 |
Space-efficient static trees and graphs
- Jacobson
- 1989
Citation Context ... 2005; Sadakane and Okanohara 2006]. 6.1 Basic n + o(n)-bit Solutions for Binary Sequences We start by explaining the n + o(n) bits solution supporting rank1(B, i) and select1(B, j) in constant time [=-=Jacobson 1989-=-; Munro 1996; Clark 1996]. Then we also have rank0(B, i) = i − rank1(B, i), and select0(B, j) is symmetric. Let us start with rank. The structure is composed of a two-level dictionary with partial sol... |

150 | Linear work suffix array construction - Kärkkäinen, Sanders, et al. - 2006 |

146 |
Algorithms on Strings
- Gusfield
- 1997
Citation Context ...l-text indexes permit much more sophisticated search tasks, such as approximate pattern matching, regular expression matching, pattern matching with gaps, motif discovery, and so on [Apostolico 1985; =-=Gusfield 1997-=-]. There has been a considerable amount of work on extending compressed suffix arrays functionalities to those of suffix trees [Grossi and Vitter 2000; Munro et al. 2001; Sadakane 2002; Sadakane 2003;... |

140 | Succinct Representation of Balanced Parentheses and Static Trees
- Munro, Raman
Citation Context ... general is to permit the simulation of suffix tree traversals using a compressed representation of them, such as a compressed suffix array plus a parentheses representation of the suffix tree shape [=-=Munro and Raman 1997-=-]. In addition, there has been some work on approximate string matching over compressed suffix arrays [Huynh et al. 2006; Lam et al. 2005; Chan et al. 2006]. Finally, it is also interesting to mention... |

132 | A functional approach to data structures and its use in multidimensional searching
- Chazelle
- 1988
Citation Context ...-dimensional Range Searching As we will see later, some compressed indexes reduce some search subproblems to two-dimensional range searching. We present here one classical data structure by Chazelle [=-=Chazelle 1988-=-]. For simplicity we focus on the problem variant that arises in our application: One has a set of n points over an n × n grid, such that there is exactly one point for each row i and one for each col... |

131 | An analysis of the Burrows-Wheeler transform - Manzini |

128 | Optimal suffix tree construction with large alphabets
- Farach
- 1997
Citation Context ... to see that suffix trees require O(n) space (the strings at the edges are represented with pointers to the text). Moreover, they can be built in O(n) time [Weiner 1973; McCreight 1976; Ukkonen 1995; =-=Farach 1997-=-]. Fig. 2 shows an example. The search for P in the suffix tree of T is similar to a trie search. Now we may use more than one character of P to traverse an edge, but all edges leaving from a node hav... |

127 |
Replacing suffix trees with enhanced suffix arrays
- Abouelhoda, Kurtz, et al.
- 2004
Citation Context ...earch takes worst case time O(m log n) (this can be lowered to O(m + log n) by using more space to store the length of the longest common prefixes between consecutive suffixes [Manber and Myers 1993; =-=Abouelhoda et al. 2004-=-]). A locating query requires additional O(occ) time to report the occ occurrences. Alg. 1 gives the pseudocode. Algorithm SASearch(P1,m,A[1, n],T1,n) (1) sp ← 1; st ← n + 1; (2) while sp < st do (3) ... |
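
The O(m log n) binary search over a suffix array that this context refers to can be sketched in Python as follows. The survey's own pseudocode (Algorithm SASearch) is truncated above, so this is an independent sketch under stated assumptions: positions are 0-based, the function names are hypothetical, the suffix array is built by naive sorting, and the result is a half-open interval `[sp, ep)` of suffix array entries whose suffixes start with the pattern.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_search(pattern, sa, text):
    """Return (sp, ep) such that sa[sp:ep] lists all occurrences of `pattern`.

    Two binary searches compare the pattern with the first len(pattern)
    characters of each probed suffix, so each step costs O(m) time.
    """
    m = len(pattern)

    # First entry whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    sp = lo

    # First entry whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sp, lo


if __name__ == "__main__":
    text = "alabar_a_la_alabarda$"
    sa = build_suffix_array(text)
    sp, ep = sa_search("bar", sa, text)
    print(ep - sp, sorted(sa[sp:ep]))
```

Counting needs only `ep - sp`, while locating reads the `occ = ep - sp` entries of `sa[sp:ep]`, which matches the additional O(occ) reporting time mentioned in the context.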

120 | The string B-tree: a new data structure for string search in external memory and its applications
- Ferragina, Grossi
- 1999
Citation Context ...re will always be cases where they have to operate on disk. There is not much work yet on this important issue. One of the most attractive full-text indexes for secondary memory is the String B-tree [=-=Ferragina and Grossi 1999-=-]. This is not, however, a succinct structure. Some proposals for succinct and compressed structures in this scenario exist [Clark and Munro 1996; Mäkinen et al. 2004]. A good survey on full-text inde... |

119 |
Indexing compressed text
- Ferragina, Manzini
Citation Context ...this point, there exist indexes that require space close to that of the best existing compression techniques, and provide search and text recovering functionality with almost optimal time complexity [=-=Ferragina and Manzini 2005-=-; Grossi et al. 2003; Ferragina et al. 2006]. More sophisticated problems are also starting to receive attention. For example, there are studies on efficient constructio... |

117 | Reducing the space requirement of suffix trees
- Kurtz
- 1999
Citation Context ...caches are many times faster than standard main memories. On the other hand, the classical indexes for string matching require from 4 to 20 times the text size [McCreight 1976; Manber and Myers 1993; =-=Kurtz 1998-=-]. This means that, even when we may have enough main memory to hold a text, we may need to use the disk to store the index. 1 In the UC Berkeley report How Much Information?, http://www.sims.berkeley... |

115 |
The myriad virtues of subword trees
- Apostolico
- 1985
Citation Context ...lbeit unlikely, it might have Θ(n²) nodes in the worst case. Fortunately, there exist equally powerful structures that guarantee linear space and construction time in the worst case [Morrison 1968; =-=Apostolico 1985-=-]. Definition 5 The suffix tree of a text T is a suffix trie where each unary path is converted into a single edge. Those edges are labeled by strings obtained by concatenating the characters of the r... |

110 | Compressed representations of sequences and full-text indexes
- Ferragina, Manzini, et al.
Citation Context ...ce close to that of the best existing compression techniques, and provide search and text recovering functionality with almost optimal time complexity [Ferragina and Manzini 2005; Grossi et al. 2003; =-=Ferragina et al. 2006-=-]. More sophisticated problems are also starting to receive attention. For example, there are studies on efficient construction in little space [Hon et al. 2003], manage... |

92 |
Compact pat trees
- Clark
- 1998
Citation Context ...ra 2006]. 6.1 Basic n + o(n)-bit Solutions for Binary Sequences We start by explaining the n + o(n) bits solution supporting rank1(B, i) and select1(B, j) in constant time [Jacobson 1989; Munro 1996; =-=Clark 1996-=-]. Then we also have rank0(B, i) = i − rank1(B, i), and select0(B, j) is symmetric. Let us start with rank. The structure is composed of a two-level dictionary with partial solutions (directly storing... |

87 | New text indexing functionalities of the compressed suffix arrays - Sadakane |

82 |
Efficient suffix trees on secondary storage (extended abstract)
- Clark, Munro
- 1996
Citation Context ...String B-tree [Ferragina and Grossi 1999], among others [Ko and Aluru 2006]. These are not, however, succinct structures. Some proposals for succinct and compressed structures in this scenario exist [=-=Clark and Munro 1996-=-; Mäkinen et al. 2004]. A good survey on full-text indexes in secondary memory is due to Kärkkäinen and Rao [2003]. See also [Aluru 2005, Chapter 35]. Construction. Compressed indexes are usually deri... |

72 | Space efficient linear time construction of suffix arrays
- Ko, Aluru
- 2003
Citation Context ... of sorting the n suffixes of the text. There are several more sophisticated algorithms, from the original O(n log n) time [Manber and Myers 1993] to the latest O(n) time algorithms [Kim et al. 2003; =-=Ko and Aluru 2003-=-; Kärkkäinen and Sanders 2003]. In practice, the best current algorithms are not yet linear time [Larsson and Sadakane 1999; Manzini and Ferragina 2004; Schürmann and Stoye 2005]. Fig. 3 shows our exa... |

69 | New data structures for orthogonal range searching
- Alstrup, Brodal, et al.
- 2000
Citation Context ... i is part of the answer. In the worst case every answer is reported in O(log n) time, and we need O(log n) time if we want just to count the number of occurrences. There exist other data structures [=-=Alstrup et al. 2000-=-] that require O(n log^(1+γ) n) bits of space, for any constant γ > 0, and can, after spending O(log log n) time for the query, retrieve each occurrence in constant time. Another structure in the same p... |

65 | An Experimental Study of an Opportunistic Index - Ferragina, Manzini - 2001 |

64 | Indexing text using the Ziv-Lempel trie - Navarro - 2004 |

61 |
On economic construction of the transitive closure of a directed graph (in Russian). English translation in Soviet Math. Dokl.
- Arlazarov, Dinic, et al.
- 1975
Citation Context ...visions x/y to give the integer ⌊x/y⌋. Let us start from the last level. Consider a substring smallblock of B1,n of length t = (log n)/2. This case is handled by the so-called four-Russians technique [=-=Arlazarov et al. 1975-=-]: We build a table smallrank[0, √n − 1][0, t − 1] storing all answers to rank queries for all binary sequences of length t (note that 2^t = √n). Then rank1(smallblock, i) = smallrank[smallblock, i]... |
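
The table lookup described in this context can be sketched in Python. This is a simplified illustration, not the structure from the survey: it keeps a single array of absolute counts per block instead of the full multi-level dictionary, the block length `t` is a fixed parameter rather than (log n)/2, positions are 0-based and inclusive, and the class name `RankBits` is hypothetical.

```python
class RankBits:
    """Rank over a bit list using a precomputed table for all t-bit blocks."""

    def __init__(self, bits, t=4):
        self.t = t
        self.n = len(bits)
        # smallrank[v][i] = number of 1s among the first i + 1 bits of the
        # t-bit block whose value is v (most significant bit first).
        self.smallrank = []
        for v in range(1 << t):
            row, count = [], 0
            for i in range(t):
                count += (v >> (t - 1 - i)) & 1
                row.append(count)
            self.smallrank.append(row)
        # Pad with zeros so the sequence splits into whole blocks.
        padded = list(bits) + [0] * (-len(bits) % t)
        self.blocks = []   # integer value of each block
        self.before = []   # number of 1s strictly before each block
        total = 0
        for s in range(0, len(padded), t):
            v = 0
            for b in padded[s:s + t]:
                v = (v << 1) | b
            self.blocks.append(v)
            self.before.append(total)
            total += self.smallrank[v][t - 1]

    def rank1(self, i):
        """Number of 1s in bits[0..i], with i 0-based and inclusive."""
        if not 0 <= i < self.n:
            raise IndexError("position out of range")
        blk, off = divmod(i, self.t)
        return self.before[blk] + self.smallrank[self.blocks[blk]][off]

    def rank0(self, i):
        """Number of 0s in bits[0..i], derived from rank1."""
        return (i + 1) - self.rank1(i)


if __name__ == "__main__":
    rb = RankBits([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
    print([rb.rank1(i) for i in range(10)])
```

Each query performs one division, one counter lookup, and one table lookup, so it runs in constant time; the table has 2^t rows, which is why the block length must stay logarithmic in n for the space overhead to remain small.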

60 | Complete inverted files for efficient text retrieval and analysis
- Blumer, Blumer, et al.
- 1987
Citation Context ...uctions in mind, with no relation to text compression nor self-indexing. In general these indexes have had some, but not spectacular, success in lowering the large space requirements of text indexes [=-=Blumer et al. 1987-=-; Andersson and Nilsson 1995; Kärkkäinen 1995; Irving 1995; Colussi and de Col 1996; Kärkkäinen and Ukkonen 1996b; Crochemore and Vérin 1997; Kurtz 1998; Giegerich et al. 2003]. In this section, in pa... |

60 | Rank/select operations on large alphabets: a tool for text indexing - Golynski, Munro, et al. - 2006 |

59 | Engineering a lightweight suffix array construction algorithm
- Manzini, Ferragina
Citation Context ... time algorithms [Kim et al. 2003; Ko and Aluru 2003; Kärkkäinen and Sanders 2003]. In practice, the best current algorithms are not linear-time ones [Larsson and Sadakane 1999; Itoh and Tanaka 1999; =-=Manzini and Ferragina 2004-=-; Schürmann and Stoye 2005]. Fig. 3 shows our example suffix array. Note that each subtree of the suffix tree corresponds to the subinterval in the suffix array encompassing all its leaves (in the fig... |

59 | Compressed text databases with efficient query algorithms based on the compressed suffix array - Sadakane - 2000 |

55 | Space efficient suffix trees
- Munro, Raman, et al.
Citation Context ...h) = O(log log_σ n). Grossi and Vitter [2000, 2006] show how the occ occurrences can be located more efficiently in batch when m is large enough. They also show how to modify a compressed suffix tree [=-=Munro et al. 2001-=-] so as to obtain O(m / log_σ n + log^ε n) search time, for any constant 0 < ε < 1, using O(n log σ) bits of space. This is obtained by modifying the compressed suffix tree [Munro et al. 2001] in two wa... |

54 |
Linear-time construction of suffix arrays
- Kim, Sim, et al.
- 2003
Citation Context ...y if there are long repeated substrings within the text. There are several more sophisticated algorithms, from the original O(n log n) time [Manber and Myers 1993] to the latest O(n) time algorithms [=-=Kim et al. 2003-=-; Ko and Aluru 2003; Kärkkäinen and Sanders 2003]. In practice, the best current algorithms are not linear-time ones [Larsson and Sadakane 1999; Itoh and Tanaka 1999; Manzini and Ferragina 2004; Schür... |