## Opportunistic Data Structures with Applications (2000)

Citations: 180 (11 self)

### BibTeX

@INPROCEEDINGS{Ferragina00opportunisticdata,
    author = {Paolo Ferragina and Giovanni Manzini},
    title = {Opportunistic Data Structures with Applications},
    booktitle = {},
    year = {2000},
    pages = {390--398}
}

### Abstract

There is growing interest in designing succinct data structures for basic searching problems (see [23] and references therein). The motivation lies in the exponential increase in available electronic data, which is outpacing even the significant growth in memory and disk storage capacities of current computers. Space reduction is attractive because it is also intimately related to performance improvements, as noted by several authors (e.g. Knuth [15], Bentley [5]). In designing these implicit data structures, the goal is to reduce as much as possible the auxiliary information kept together with the input data without introducing a significant slowdown in query performance. Yet the input data are represented in their entirety, taking no advantage of any repetitiveness in them. The importance of these issues is well known to programmers, who typically use various tricks to squeeze data as much as possible while still achieving good query performance. Their approaches, though, boil down to heuristics whose effectiveness is witnessed only by experimentation.

### Citations

805 | Managing Gigabytes: Compressing and Indexing Documents and Images
- Witten, Moffat, et al.
- 1999
Citation Context ...time versus space usage. The two main approaches are word-based indices and full-text indices. The former achieve succinct space occupancy at the cost of being mainly limited to index linguistic texts [27], the latter achieve versatility and guaranteed performance at the cost of requiring large space occupancy (see e.g. [10, 18, 21]). Some progress on full-text indices has been recently achieved [12, 2...

646 | Suffix arrays: A new method for on-line string searches
- Manber, Myers
- 1993
Citation Context ... space occupancy at the cost of being mainly limited to index linguistic texts [27], the latter achieve versatility and guaranteed performance at the cost of requiring large space occupancy (see e.g. [10, 18, 21]). Some progress on full-text indices has been recently achieved [12, 23], but an asymptotical linear space seems unavoidable and this makes word-based indices much more appealing when space occupancy...

566 | A block-sorting lossless data compression algorithm
- Burrows, Wheeler
- 1994
Citation Context ...o search for the occ occurrences of P in T in O(p + occ log^ε u) time, for any fixed ε > 0. The novelty of our approach resides in the careful combination of the Burrows-Wheeler compression algorithm [7] with the suffix array data structure [18] to obtain a sort of compressed suffix array. We indeed show how to augment the information kept by the Burrows-Wheeler algorithm, in order to support eff...

549 | A space-economical suffix tree construction algorithm
- McCreight
- 1976
Citation Context ... space occupancy at the cost of being mainly limited to index linguistic texts [27], the latter achieve versatility and guaranteed performance at the cost of requiring large space occupancy (see e.g. [10, 18, 21]). Some progress on full-text indices has been recently achieved [12, 23], but an asymptotical linear space seems unavoidable and this makes word-based indices much more appealing when space occupancy...

328 | Text Algorithms
- Crochemore, Rytter
- 1994
Citation Context ... [12], LZ78 [2], Anti-dictionaries [10]. All of those results, however, rely on a full scan of the whole compressed text. So that, although asymptotically faster than the classical scan-based methods [11], which operate on the uncompressed texts, their overall time requirement is still unacceptable when they work on large text collections. Approaches to combine compression and indexing techniques are ...

255 | Sorting and searching, volume 3 of The Art of Computer Programming
- Knuth
- 1973
Citation Context ...e, now more than ever before, because of the exponential increase of electronic data nowadays available, and because of its intimate relation with algorithmic performance improvements (see e.g. Knuth [16] and Bentley [5]). This has recently motivated an upsurging interest in the design of implicit data structures for basic searching problems (see [23] and references therein). The goal is to reduce as ...

194 | Overview of the third Text REtrieval Conference (TREC-3)
- Harman
- 1995
Citation Context ...d in Information Retrieval [5]. The first assumption is the Heaps law [16] which states that in a text of size u the number of distinct words grows as V = O(u^β) with 0 < β < 1. Typical experimental values of β [15] are in the range [0.4, 0.6], so that the vocabulary has size about √u (in practice this corresponds to few megabytes). The second assumption is the Zipf's law [41] which states that if the words of ...

192 | Glimpse: a tool to search through entire file systems
- Manber, Wu
- 1994
Citation Context ...unistic data structure in a dynamic setting and devise a variant achieving effective search and update time bounds. Finally, we show how to plug our opportunistic data structure into the Glimpse tool [19]. The result is an indexing tool which achieves sublinear space and sublinear query time complexity. 1 Introduction Data structure is a central concept in algorithmics and computer science in general....

188 | Compressed suffix arrays and suffix trees with applications to text indexing and string matching
- Grossi, Vitter
- 2005
Citation Context ...tunistic data structure allows to search for the occ occurrences of P in T in O(p + occ log^ε u) time (for any fixed ε > 0). If data are uncompressible we achieve the best space bound currently known [12]; on compressible data our solution improves the succinct suffix array of [12] and the classical suffix tree and suffix array data structures either in space or in query time or both. We also study ou...

161 | Programming Pearls - Bentley - 1986 |

141 | Data Compression: The Complete Reference
- Salomon
- 2000
Citation Context ...ression algorithm that achieves compression within a percent or so of that achieved by statistical modeling techniques, but at speeds comparable to those of algorithms based on Lempel-Ziv's (see e.g. [8, 37]). The algorithm does not process the input sequentially, but instead examines a block of text at a time, as a single unit. Possibly a block can be as large as the entire text. On each block is applie...

131 | An analysis of the Burrows-Wheeler transform - Manzini - 2001 |

130 | Agrep—a fast approximate pattern-matching tool
- Wu, Manber
- 1992
Citation Context ...lary V, then all candidate blocks of L(w) are sequentially examined to find all the w's occurrences. Complex queries (e.g. approximate or regular expression searches) can be supported by using Agrep [28] both in the vocabulary and in the block searches. Clearly, the search is efficient if the vocabulary is small, if the query is selective enough, and if the block size is not too large. The first two ...

120 | The string B-tree: a new data structure for string search in external memory and its applications
- Ferragina, Grossi
- 1999
Citation Context ... space occupancy at the cost of being mainly limited to index linguistic texts [27], the latter achieve versatility and guaranteed performance at the cost of requiring large space occupancy (see e.g. [10, 18, 21]). Some progress on full-text indices has been recently achieved [12, 23], but an asymptotical linear space seems unavoidable and this makes word-based indices much more appealing when space occupancy...

95 | Let sleeping files lie: pattern matching in Z-compressed files
- Amir, Benson, et al.
- 1996
Citation Context ... [1, u] is an array containing the lexicographically ordered sequence of the suffixes of T, represented via pointers to their starting positions (i.e., integers). For instance, if T = ababc then A = [1, 3, 2, 4, 5]. Clearly A requires u log_2 u bits, actually a lot when indexing large text collections. It is a long standing belief that suffix arrays are uncompressible because of the "apparently random" permutat...
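
The suffix array example quoted in this context (T = ababc, A = [1, 3, 2, 4, 5]) can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `suffix_array` and `pointer_bits` are hypothetical, the construction is the naive sort-all-suffixes method rather than the algorithm of any cited paper, and the bit count assumes each pointer is stored in ⌈log_2 u⌉ bits.

```python
def suffix_array(text: str) -> list[int]:
    """Return the 1-based starting positions of the suffixes of `text`
    in lexicographic order (naive construction, for illustration only)."""
    # Sort the 0-based start positions by the suffix each one denotes,
    # then shift to the 1-based convention used in the quoted example.
    return [i + 1 for i in sorted(range(len(text)), key=lambda i: text[i:])]


def pointer_bits(u: int) -> int:
    """Bits needed for u pointers of ceil(log2 u) bits each (assumed layout)."""
    # For u >= 2, (u - 1).bit_length() equals ceil(log2 u).
    return u * max(1, (u - 1).bit_length())


if __name__ == "__main__":
    print(suffix_array("ababc"))  # positions of: ababc, abc, babc, bc, c
    print(pointer_bits(5))        # 5 pointers of 3 bits each
```

Sorting whole suffixes costs far more than the specialised constructions in the literature; the point here is only to make the definition and the space estimate concrete.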

85 | String matching in Lempel-Ziv compressed strings
- Farach, Thorup
- 1998
Citation Context ...been already investigated only with respect to its impact on algorithmic performance in the context of on-line algorithms (e.g. caching and prefetching [15, 17]), string-matching algorithms (see e.g. [1, 2, 9]), sorting and computational geometry algorithms [8]. The scenario. Most of the research in the design of indexing data structures has been directed to devise solutions which offer a good trade-off be...

80 | Efficient two-dimensional compressed matching
- Amir, Benson
- 1992
Citation Context ...1, u] is stored using O(H_k(T)) + o(1) bits per input symbol in the worst case, where H_k(T) is the kth order empirical entropy of T (the bound holds for any fixed k). Given an arbitrary string P[1, p], the opportunistic data structure allows to search for the occ occurrences of P in T in O(p + occ log^ε u) time (for any fixed ε > 0). If data are uncompressible we achieve the best space bound curre...

55 | A space-economical suffix tree construction algorithm
- McCreight
- 1976
Citation Context ... Full-text indexes have been then designed to overcome these limitations at the cost of an increase in the additional space occupied by the underlying index. Examples of such indexes are: suffix trees [27], suffix arrays [23] and String B-trees [13]. Despite their superior features (elegance, versatility, and guaranteed performances), they have not seen widespread use in string processing software mainly ...

49 | Adding compression to block addressing inverted indexes
- Navarro, Moura, et al.
- 2000
Citation Context .... A collection of algorithms is currently known to solve efficiently (possibly optimally) this problem on text compressed by means of various schemes: e.g. run-length [1], LZ77 [9], LZ78 [2], Huffman [24]. All of these results, although asymptotically faster than the classical scan-based methods, they rely on the scan of the whole compressed text and thus result still unacceptable for large text colle...

48 | Lempel-Ziv parsing and sublinear-size index structures for string matching
- Kärkkäinen, Ukkonen
- 1996
Citation Context ...eving experimental trade-offs between space occupancy and query performance (see e.g. [4, 19, 27]). An interesting idea towards the direct compression of the index data structure has been proposed in [13, 14] where the properties of the Lempel-Ziv's compression scheme have been exploited to reduce the number of index points, still supporting pattern searches. As a result, the overall index requires provab...

43 | Block-addressing indices for approximate text retrieval
- Baeza-Yates, Navarro
Citation Context ...g techniques are nowadays receiving more and more attention, especially in the context of word-based indices, achieving experimental trade-offs between space occupancy and query performance (see e.g. [4, 19, 27]). An interesting idea towards the direct compression of the index data structure has been proposed in [13, 14] where the properties of the Lempel-Ziv's compression scheme have been exploited to reduc...

35 | An implicit data structure supporting insertion, deletion, and search in O(log^2 n) time
- Munro
- 1986
Citation Context ...lt extends an elegant technique proposed at the beginning of '80 in [35, 28], here adapted to manage items (i.e. texts) of variable lengths. We point out that in the literature some authors (see e.g. [30]) have investigated the design of implicit data structures for the dynamic dictionary problem, which is indeed a simpler version of the previous one. They achieved sublinear auxiliary space occupancy ...

27 | Optimal prediction for prefetching in the worst case
- Krishnan, Vitter
- 1998
Citation Context .... The exploitation of data compressibility has been already investigated only with respect to its impact on algorithmic performance in the context of on-line algorithms (e.g. caching and prefetching [15, 17]), string-matching algorithms (see e.g. [1, 2, 9]), sorting and computational geometry algorithms [8]. The scenario. Most of the research in the design of indexing data structures has been directed to...

24 | Text compression using antidictionaries
- Crochemore, Mignosi, et al.
- 1999
Citation Context ...ection of algorithms is currently known to solve efficiently (possibly optimally) this problem on text compressed by means of various schemes: e.g. run-length [1], LZ77 [12], LZ78 [2], Anti-dictionaries [10]. All of those results, however, rely on a full scan of the whole compressed text. So that, although asymptotically faster than the classical scan-based methods [11], which operate on the uncompressed...

22 | Indexing compressed text
- Moura, Navarro, et al.
- 1997
Citation Context ... some sense, our result can be interpreted as a method to compress the suffix array which is commonly considered uncompressible (by using standard compression algorithms only 5% saving has been achieved [29]), and still support effective searches for arbitrary patterns into it. In their seminal paper, Manber and Myers [23] introduced the suffix array data structure showing how to search for a pattern P[1, p...

21 | Worst-case optimal insertion and deletion methods for decomposable searching problems
- Overmars, Leeuwen
- 1981
Citation Context ...n we aim at dynamizing our compressed index in order to keep Δ in a reduced space and be able to efficiently support update and search operations. Our result exploits an elegant technique proposed in [22, 25], here adapted to manage items of variable lengths (i.e. texts). In the following we bound the space occupancy of our data structure in terms of the entropy of the concatenation of Δ's texts. A better...

17 | Using difficulty of prediction to decrease computation: fast sort, priority queue and convex hull on entropy bounded inputs
- Chen, Reif
- 1993
Citation Context ...on algorithmic performance in the context of on-line algorithms (e.g. caching and prefetching [15, 17]), string-matching algorithms (see e.g. [1, 2, 9]), sorting and computational geometry algorithms [8]. The scenario. Most of the research in the design of indexing data structures has been directed to devise solutions which offer a good trade-off between query and update time versus space usage. The ...

17 | Lempel-Ziv index for q-grams - Kärkkäinen, Sutinen - 1998 |

15 | Searching large text collections
- Baeza-Yates, Moffat, et al.
- 2002
Citation Context ...etheless an asymptotical linear space seems unavoidable and this makes inverted lists (and their derivatives) much more appealing when space issue is a primary concern. It is in fact a current belief [39, 4] that some space overhead must be paid to use full-text indices with respect to the word-based indices. In this context compression appears always as an attractive choice, if not mandatory. Processing...

15 | Compressed suffix arrays and suffix trees with applications to text indexing and string matching
- Grossi, Vitter
- 2000
Citation Context ...allows to search for the occ occurrences of P in T requiring O(p + occ log^ε u) time complexity (for any fixed ε > 0). If data are non compressible, then we achieve the best space bound currently known [14]; otherwise our solution improves the succinct suffix array in [14] and the classical suffix tree and suffix array data structures either in space or in query time complexity or both. It was a belief [39] that...

14 | Multi-method dispatching: A geometric approach with applications to string matching problems
- Ferragina, Muthukrishnan, et al.
- 1999
Citation Context ...-page size. In the RAM, it would be interesting to avoid the o(log u) overhead incurred in the listing of the pattern occurrences. In the full paper we will show how to use known techniques (see e.g. [11]) for designing hybrid indices which achieve O(occ) retrieval time cost under restrictive conditions either on the pattern length or on the number of pattern occurrences. Guaranteeing the O(occ) retri...

13 | Let sleeping files lie: Pattern matching in Z-compressed files
- Amir, Benson, et al.
- 1994
Citation Context ...g T[1, u] is an array containing the lexicographically ordered sequence of the suffixes of T, represented via pointers to their starting positions (i.e., integers). For instance, if T = ababc then A = [1, 3, 2, 4, 5]. This way A occupies u ⌈log_2 u⌉ bits (see [14] for the removal of the logarithmic term). Manber and Myers [23] introduced this data structure in the early 90s and proposed an interesting algorithm t...

12 | Optimal Dynamization of Decomposable Searching Problems
- Mehlhorn, Overmars
- 1981
Citation Context ...n we aim at dynamizing our compressed index in order to keep Δ in a reduced space and be able to efficiently support update and search operations. Our result exploits an elegant technique proposed in [22, 25], here adapted to manage items of variable lengths (i.e. texts). In the following we bound the space occupancy of our data structure in terms of the entropy of the concatenation of Δ's texts. A better...

12 | Reducing the space requirement of suffix trees
- Kurtz
- 1998
Citation Context ...ormances), they have not seen widespread use in string processing software mainly because, although asymptotically optimal in space, the constants hidden in the big-Oh notation range between 4 and 20 [22]. Such a value is sometimes a significant bottleneck that may even prevent the use of those data structures in practical applications. Some progress has been recently achieved [32, 14] but nonetheless ...

12 | Human Behaviour and the Principle of Least Effort
- Zipf
- 1949
Citation Context ...< 1. Typical experimental values of β [15] are in the range [0.4, 0.6], so that the vocabulary has size about √u (in practice this corresponds to few megabytes). The second assumption is the Zipf's law [41] which states that if the words of the vocabulary are sorted in decreasing order of frequency, then the frequency of the ith word is u/(i^θ H_V^(θ)), where H_V^(θ) = Σ_{j=1}^{|V|} 1/j^θ is a normalizati...

11 | A locally adaptive compression scheme
- Bentley, Sleator, et al.
- 1986
Citation Context ...". The BWT tends to group together characters which occur adjacent to similar text substrings. This nice property is exploited by locally-adaptive compression algorithms, such as move-to-front coding [6], in combination with statistical (i.e. Huffman or Arithmetic coders) or structured coding models. The BWT-based compressors are among the best compressors currently available since they achieve a ver...
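
Move-to-front coding, mentioned in this context as the locally-adaptive step applied after the BWT, can be sketched as follows. The function name `mtf_encode` and the explicit `alphabet` parameter are illustrative assumptions, not taken from the cited works; the sketch shows only the encoding direction.

```python
def mtf_encode(text: str, alphabet: str) -> list[int]:
    """Move-to-front encode `text` over the given initial symbol ordering."""
    table = list(alphabet)  # current symbol ordering, most recent first
    codes = []
    for ch in text:
        idx = table.index(ch)            # rank of the symbol in the table
        codes.append(idx)
        table.insert(0, table.pop(idx))  # move the symbol to the front
    return codes


if __name__ == "__main__":
    # Runs of equal characters, as the BWT tends to produce, become zeros.
    print(mtf_encode("aaabbb", "ab"))
```

Because repeated characters map to small integers (mostly zeros), the output is well suited to a statistical coder such as Huffman or arithmetic coding, which is the combination the context describes.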

10 | Implicit Data Structures
- Munro, Suwanda
- 1979
Citation Context ...gorithmic performance improvements (see e.g. Knuth [16] and Bentley [5]). This has recently motivated an upsurging interest in the design of implicit data structures for basic searching problems (see [23] and references therein). The goal is to reduce as much as possible the auxiliary information kept together with the input data without introducing any significant slowdown in the query performance. H...

9 | The BZIP2 home page
- Seward
- 1997
Citation Context ...T is an English text T^mtf usually contains more than 50% zeroes. For this reason, the string T^mtf can be efficiently compressed with a Huffman or an arithmetic coder or more complex ad-hoc techniques [8, 38]. We point out that even the simplest BWT-based algorithms outperform widely used compressors such as gzip and pkzip. More advanced BWT-based compressors, such as bzip2 [38], are among the best compre...

8 | Sorting and searching revisited
- Andersson
- 1996
Citation Context ... [1, u] is an array containing the lexicographically ordered sequence of the suffixes of T, represented via pointers to their starting positions (i.e., integers). For instance, if T = ababc then A = [1, 3, 2, 4, 5]. Clearly A requires u log_2 u bits, actually a lot when indexing large text collections. It is a long standing belief that suffix arrays are uncompressible because of the "apparently random" permutat...

8 | Glimpse: A tool to search through entire file systems
- Manber, Wu
- 1994
Citation Context ...e in the worst case. Finally, we investigate applications of our ideas to the development of novel text retrieval systems based on the concept of block addressing (first introduced in the Glimpse tool [24]). The notable feature of block addressing is that it can achieve both sublinear space overhead and sublinear query time, whereas inverted indexes pointing to words or documents achieve only the secon...

6 | Information Retrieval: Theoretical and Computational Aspects
- Heaps
- 1978
Citation Context ...) over a compressed text. A theoretical investigation of the performance of CGlimpse is yet feasible using a model generally accepted in Information Retrieval [5]. The first assumption is the Heaps law [16] which states that in a text of size u the number of distinct words grows as V = O(u^β) with 0 < β < 1. Typical experimental values of β [15] are in the range [0.4, 0.6], so that the vocabulary has size about ...

6 | Space efficient suffix trees
- Munro, Raman, et al.
- 2001
Citation Context ...ng time complexity but without the use of the so called lcp-array [23]. In [14] other hybrid indexing data structures based on the combination of their succinct suffix array and various known techniques [17, 33] have been introduced achieving O( p log u + occ log u) query-time complexity but yet requiring u) bits of storage. In Section 5, we also investigate the modifiability of our opportunistic data struc...

5 | A modified Burrows-Wheeler transformation for case-insensitive search with application to suffix array compression
- Sadakane
- 1999
Citation Context ...prefix L[1, i] (see observation (b) above). 3. Reconstruct T backward as follows: set s = 1 and T[u] = L[1] (because M[1] = #T); then, for each i = u - 1, ..., 1 do s = LF[s] and T[i] = L[s]. In [26] it is shown how to derive the suffix array A from L in linear time; however, in the context of pattern searching, the algorithm in [26] is no better than the known scan-based opportunistic algorithms ...
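
The backward reconstruction step quoted in this context (set s = 1, T[u] = L[1], then repeatedly apply s = LF[s]) can be sketched in Python. This is a hedged illustration, not the cited authors' code: the names `bwt`, `lf_mapping` and `invert_bwt` are hypothetical, indices are 0-based instead of the 1-based ones in the quote, the transform is built by naively sorting rotations, and the end marker `#` is assumed to be absent from the text and to sort before every text character.

```python
def bwt(text: str, end: str = "#") -> str:
    """Last column L of the sorted cyclic shifts of text + end marker."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


def lf_mapping(last: str) -> list[int]:
    """LF[i] = row whose first character is the occurrence last[i]."""
    # A stable sort of positions by character lists, for each character,
    # its occurrences in L in the same order they appear in column F.
    order = sorted(range(len(last)), key=lambda i: last[i])
    lf = [0] * len(last)
    for f_pos, l_pos in enumerate(order):
        lf[l_pos] = f_pos
    return lf


def invert_bwt(last: str) -> str:
    """Rebuild T backward from L, following the quoted step 3."""
    u = len(last) - 1              # length of T without the end marker
    if u == 0:
        return ""
    lf = lf_mapping(last)
    out = [""] * u
    s = 0                          # row 0 is "#T", so L[0] is T's last char
    out[u - 1] = last[0]
    for i in range(u - 2, -1, -1):
        s = lf[s]
        out[i] = last[s]
    return "".join(out)


if __name__ == "__main__":
    L = bwt("ababc")
    print(L, invert_bwt(L))
```

The LF table is built here with a plain sort for clarity; a counting-based construction would give the same mapping in linear time.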

5 | The Burrows-Wheeler transform: Theory and practice
- Manzini
- 1999
Citation Context ... These results have shed some light on the positive features of BWT-based compressors and have shown that the BWT has remarkable theoretical properties not shared by other compression algorithms (see [26] and references therein). In order to describe our results on searching in BWT-compressed files, we must commit ourselves to one of the several algorithms based on the BWT. The variant we consider, we c...

4 | Markov paging (extended abstract)
- Karlin, Phillips, et al.
- 1992
Citation Context .... The exploitation of data compressibility has been already investigated only with respect to its impact on algorithmic performance in the context of on-line algorithms (e.g. caching and prefetching [15, 17]), string-matching algorithms (see e.g. [1, 2, 9]), sorting and computational geometry algorithms [8]. The scenario. Most of the research in the design of indexing data structures has been directed to...

2 | Using difficulty of prediction to decrease computation: Fast sort, priority queue and convex hull on entropy bounded inputs - Chen, Reif - 1993 |