## A linear lower bound on index size for text retrieval (2003)

Venue: | J. Algorithms |

Citations: | 5 - 1 self |

### BibTeX

@ARTICLE{Alej03alinear,

author = {Erik D. Demaine and Alejandro López-Ortiz},

title = {A linear lower bound on index size for text retrieval},

journal = {J. Algorithms},

year = {2003},

volume = {48},

pages = {2--15}

}

### Abstract

Most information-retrieval systems preprocess the data to produce an auxiliary index structure. Empirically, it has been observed that there is a tradeoff between query response time and the size of the index. When indexing a large corpus, such as the web, the size of the index is an important consideration. In this case it would be ideal to produce an index that is substantially smaller than the text. In this work we prove a linear worst-case lower bound on the size of any index that reports the location (if any) of a substring in the text in time proportional to the length of the pattern. In other words, an index supporting linear-time substring searches requires about as much space as the original text. Here “time” is measured in the number of bit probes to the text; an arbitrary amount of computation may be done on an arbitrary amount of the index. Our lower bound applies to inverted word indices as well.

### Citations

1682 | An Introduction to Kolmogorov Complexity and its Applications
- Li, Vitányi
- 1997
Citation Context ...that the Kolmogorov complexity of an average element from a class of k objects is within an additive constant of the information theoretic lower bound, lg k. For details on Kolmogorov complexity, see [LV97]. Now consider a permutation of the integers 0, ..., n−1 that is random in this Kolmogorov sense, i.e., the Kolmogorov complexity of the permutation is lg n! + O(1). From such a permutation we co... |
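
To give a sense of scale for the quantity lg n! in the excerpt above, Stirling's approximation (a standard estimate, not taken from the excerpt itself) gives the following:

```latex
\lg n! \;=\; n \lg n \;-\; n \lg e \;+\; O(\lg n)
```

Under this estimate, a permutation of n integers that is random in the Kolmogorov sense needs roughly n lg n bits to describe.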

627 | Fast pattern matching in strings
- Knuth, Morris, Pratt
- 1977
Citation Context ...ime. To illustrate with an extreme example, in the absence of an index, it is necessary to examine the entire text to see if the query string is present. For example, the Knuth-Morris-Pratt algorithm [KMP77] requires no index and runs in time proportional to the length of the text plus the pattern. In practice such a search is done often with the UNIX utility grep. On the other end of the spectrum, a que... |
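
The index-free search described in the excerpt above can be sketched as follows. This is a minimal textbook-style version of Knuth-Morris-Pratt matching, not code from the paper; the function names `build_failure` and `kmp_find` are illustrative assumptions.

```python
def build_failure(pattern: str) -> list[int]:
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until the next character matches.
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1.

    No index over the text is built; only the pattern is preprocessed.
    """
    if not pattern:
        return 0
    failure = build_failure(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


if __name__ == "__main__":
    print(kmp_find("abracadabra", "cad"))
```

The scan touches each text character once and the fallback steps are amortized against earlier matches, which is where the running time proportional to the text length plus the pattern length comes from.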

548 | A space–economical suffix tree construction algorithm - McCreight - 1976 |

328 | On-line construction of suffix trees - Ukkonen - 1995 |

188 | Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching
- Grossi, Vitter
Citation Context ....edu. † Department of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada, alopez-o@uwaterloo.ca. Supported by NSERC. ...number of bits in the text. Recently, Grossi and Vitter [GV00] have shown that the space can be improved to O(N) bits (proportional to the size of the text), and furthermore that the search time can be improved by a factor of lg N by exploiting a machine word at... |

180 | Opportunistic Data Structures with Applications
- Ferragina, Manzini
- 2000
Citation Context ... a pattern P in a text of length N using their structure is O(|P|/lg N + |output| lg^ε N) time. Related structures. Several other text-indexing structures have been developed. Ferragina and Manzini [FM00] proposed a structure with a similar search time, O(|P| + |output| lg^ε N), but whose space requirement is related to the entropy of the text, being smaller for compressible text. Their structure also ... |

137 | Should tables be sorted - Yao - 1981 |

76 | Linear pattern matching algorithms - Weiner - 1973 |

68 | From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica - Giegerich, Kurtz - 1997 |

59 | Compressed text databases with efficient query algorithms based on the compressed suffix array
- Sadakane
- 2000
Citation Context ...ll requires Θ(lg^ε N) time. Recently, Ferragina and Manzini [FM02] have also shown how to remove the lg^ε N factor from the search time at the cost of adding a factor of lg^ε N to the space. Sadakane [Sad00] designed a structure that applies also to large alphabets, but increases the search cost to O(|P| lg N + |output| lg^ε N). All of these structures essentially encode the entire text, and hence in th... |

54 | Are bitvectors optimal - Buhrman, Miltersen, et al. |

50 | Succinct representations of LCP information and improvements in the compressed suffix arrays
- Sadakane
- 2002
Citation Context ...ets, but increases the search cost to O(|P| lg N + |output| lg^ε N). All of these structures essentially encode the entire text, and hence in the worst case also require Ω(N) bits of space. Sadakane [Sad02] showed how to extend the last four structures [GV00, FM00, FM02, Sad00] to support additional queries (finding the longest common prefix) using O(N) extra bits of space. For weaker operations, better ... |

49 | Low redundancy in static dictionaries with constant query time
- Pagh
- 2001
Citation Context ...bstring) appear in the text? Fredman, Komlós, and Szemerédi [FKS84] present a structure supporting constant-time queries and using w lg N + o(w lg N) bits of space for w query words of interest. Pagh [Pag01] improved this space bound to w lg(N/w) bits. For w = o(N) query words of interest, this index is smaller than the text of N bits. Our results. An important but relatively unstudied field of research ... |
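
As a rough numeric illustration of the two dictionary bounds quoted in the excerpt above, the sketch below evaluates only their leading terms, w lg N and w lg(N/w), for one assumed choice of N and w. The function names and the sample values are hypothetical, and lower-order terms are ignored.

```python
import math


def fks_leading_bits(w: int, n: int) -> float:
    """Leading term w * lg N of the [FKS84] bound (the o(w lg N) term is ignored)."""
    return w * math.log2(n)


def pagh_leading_bits(w: int, n: int) -> float:
    """The w * lg(N / w) bound attributed to [Pag01] in the excerpt."""
    return w * math.log2(n / w)


if __name__ == "__main__":
    n = 2 ** 20  # assumed text size in bits
    w = 2 ** 10  # assumed number of query words of interest
    print("text bits:      ", n)
    print("w lg N bits:    ", fks_leading_bits(w, n))
    print("w lg(N/w) bits: ", pagh_leading_bits(w, n))
```

For these sample values both dictionary sizes are far below the N bits of the text, which matches the excerpt's remark about w = o(N); such dictionaries answer only membership queries, not substring location.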

38 | Storing a sparse table with O(1) worst case access time
- Fredman, Komlós, Szemerédi
- 1984
Citation Context ...e. For weaker operations, better bounds are known. In particular, static dictionaries support just membership queries: does a given word (substring) appear in the text? Fredman, Komlós, and Szemerédi [FKS84] present a structure supporting constant-time queries and using w lg N + o(w lg N) bits of space for w query words of interest. Pagh [Pag01] improved this space bound to w lg(N/w) bits. For w = o(N) q... |

22 | On compressing and indexing data
- Ferragina, Manzini
Citation Context ...cture also supports counting the number of occurrences of a pattern in O(|P|) time, but to find even a single occurrence of the pattern still requires Θ(lg^ε N) time. Recently, Ferragina and Manzini [FM02] have also shown how to remove the lg^ε N factor from the search time at the cost of adding a factor of lg^ε N to the space. Sadakane [Sad00] designed a structure that applies also to large alphabets,... |

17 | The bit probe complexity measure revisited - Miltersen |