## New Algorithms on Wavelet Trees and Applications to Information Retrieval (2011)

### BibTeX

@MISC{Gagie11newalgorithms,
    author = {Travis Gagie and Gonzalo Navarro and Simon J. Puglisi},
    title = {New Algorithms on Wavelet Trees and Applications to Information Retrieval},
    year = {2011}
}

### Abstract

Wavelet trees are widely used in the representation of sequences, permutations, text collections, binary relations, discrete points, and other succinct data structures. We show, however, that this still falls short of exploiting all of the virtues of this versatile data structure. In particular we show how to use wavelet trees to solve fundamental algorithmic problems such as range quantile queries, range next value queries, and range intersection queries. We explore several applications of these queries in Information Retrieval, in particular document retrieval in hierarchical and temporal documents, and in the representation of inverted lists.
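As a rough illustration of the range quantile query named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a pointer-based tree over integer values with plain prefix-count lists in place of succinct rank structures; the names `WaveletTree`, `zeros`, and `range_quantile` are hypothetical. Each step of the query moves down one level, so the number of steps is bounded by the height of the tree.

```python
class WaveletTree:
    """Pointer-based wavelet tree over integers (illustrative, not succinct)."""

    def __init__(self, seq, lo=None, hi=None):
        seq = list(seq)
        self.n = len(seq)
        # Value range [lo, hi] handled by this node; the root infers it.
        self.lo = min(seq) if lo is None else lo
        self.hi = max(seq) if hi is None else hi
        self.left = self.right = None
        if self.lo == self.hi or not seq:
            return  # leaf, or an empty subtree that queries never enter
        mid = (self.lo + self.hi) // 2
        # zeros[p] = how many of the first p elements go to the left child.
        self.zeros = [0]
        for v in seq:
            self.zeros.append(self.zeros[-1] + (v <= mid))
        self.left = WaveletTree([v for v in seq if v <= mid], self.lo, mid)
        self.right = WaveletTree([v for v in seq if v > mid], mid + 1, self.hi)

    def range_quantile(self, i, j, k):
        """Return the k-th smallest value in positions i..j (1-based, inclusive)."""
        if not (1 <= i <= j <= self.n and 1 <= k <= j - i + 1):
            raise ValueError("invalid range or rank")
        node = self
        while node.lo != node.hi:
            before = node.zeros[i - 1]          # left-going elements before i
            inside = node.zeros[j] - before     # left-going elements in [i, j]
            if k <= inside:
                # The answer lies among the elements sent left.
                i, j = before + 1, before + inside
                node = node.left
            else:
                # Skip the left-going elements and continue on the right.
                k -= inside
                i, j = i - before, j - node.zeros[j]
                node = node.right
        return node.lo


S = [3, 1, 4, 1, 5, 9, 2, 6]
wt = WaveletTree(S)
print(wt.range_quantile(2, 6, 3))  # third smallest of 1, 4, 1, 5, 9 -> 4
```

Setting k to half the range length gives a range median query, the special case discussed in several of the citation contexts below.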

### Citations

2656 | Modern Information Retrieval
- Baeza-Yates, Ribeiro-Neto
- 1999
Citation Context ...ts where two (or more) queries appear simultaneously. We extend these solutions to temporal and hierarchical document collections. 4.2. Inverted Indexes The inverted index is a classical IR structure [49, 50], lying at the heart of most modern Web search engines and applications handling natural-language text collections. By “natural language” texts one refers to those that can be easily split into a sequ...

672 | Suffix arrays: A new method for on-line string searches
- Manber, Myers
- 1993
Citation Context ...n variables are used simultaneously in a large software development system, or ranking a set of gene sequences by the number of times a given substring marker occurs. By constructing a suffix array A [26] on the text collection, one can obtain in time O(|q| log |C|) (where |C| denotes the sum of document lengths in C) the range of A where all the occurrence positions of q in C are listed. The class...

392 | Expected Time Bounds for Selection
- Floyd, Rivest
- 1975
Citation Context ... 4, for a total of O(log u + r log(ℓ/r)). 3. New Algorithms 3.1. Range Quantile Two naïve ways of solving query range quantile(i, j, k) are by sequentially scanning the range in time O(j − i + 1) [30], and by storing the answers to the O(n^3) possible queries in a table and returning answers in O(1) time. Neither of these solutions is really satisfactory. Until recently there was no work on rang...

209 | Inverted files for text search engines
- Zobel, Moffat
Citation Context ... problem for general strings described above, the restriction of word queries allows inverted indexes to precompute the answer to each possible word query. Two main variants of inverted indexes exist [51, 52]. Ranked retrieval is aimed at retrieving documents that are most “relevant” to a query, under some criterion. As explained, a popular formula for relevance is w_{d,q} = tf_{d,q} × idf_q, but others built ...

200 | Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets
- Raman, Raman, et al.
Citation Context ...ee will have ⌈log σ⌉ levels and the lemma holds. Otherwise, we can store a mapping between the universe of σ possible values and the u ≤ n actual values using an “indexable dictionary” data structure [29]. This requires u log(σ/u) + O(u + log log σ) bits and maps in both directions (telling also whether a value from the universe appears in S or not) in constant time. Thus we can act as if S were a se...

199 | High-order entropy-compressed text indexes
- Grossi, Gupta, et al.
Citation Context ...temporal documents, and in the representation of inverted lists. Keywords: Information retrieval, Document retrieval, Data structures, 1D range queries, Wavelet trees 1. Introduction The wavelet tree [3] is a versatile data structure that stores a sequence S[1, n] of elements from a symbol universe [1, σ] within asymptotically the same space required by a plain representation of the sequence, n log ...

181 | Scaling and related techniques for geometry problems
- Gabow, Bentley, et al.
- 1984
Citation Context ... time to O(log n / log log n), within linear space. For the special case of semi-infinite queries (i.e., i = 1 or j = n) one can use an O(n)-words and O(log log n) time solution by Gabow et al. [41]. By using wavelet trees, we also solve the general problem in time O(log u). Our space is better than the simple linear-space solution, n + O(n / log σ) words (n of which actually replace the s...

180 | Compressed full-text indexes
- Navarro, Mäkinen
Citation Context ...within the space required to represent C in compressed form, and for example determine the range [sp, ep] within time O(|q| log σ) and list each A[i] in time O(log^{1+ε} n) for any constant ε > 0 [4, 45]. For listing the distinct documents where q appears, one option is to find out the document to which each A[i] belongs and remove duplicates. This, however, requires Ω(ep − sp + 1) time; that is, it ...

133 | A functional approach to data structures and its use in multidimensional searching
- Chazelle
- 1988
Citation Context ...ass of compressed text indexes (the FM-index family [5]), giving birth to most of its modern variants [6, 7, 4, 8]. The connection between the wavelet tree and an old geometric structure by Chazelle [9] made it evident that wavelet trees could be used for range counting and reporting points in the plane. More formally, given a set of t points P = {(x_i, y_i), 1 ≤ i ≤ t} on a discrete grid [1, n] × [1,...

122 | Indexing compressed texts
- Ferragina, Manzini
Citation Context ...ulness of the wavelet tree for many other scenarios was quickly realized. For example, it was soon adopted as a fundamental component of a large class of compressed text indexes (the FM-index family [5]), giving birth to most of its modern variants [6, 7, 4, 8]. The connection between the wavelet tree and an old geometric structure by Chazelle [9] made it evident that wavelet trees could be used for...

114 | Compressed Representations of Sequences and Full-Text Indexes
- Ferragina
- 2006

74 | Efficient algorithms for document retrieval problems
- Muthukrishnan
- 2002
Citation Context ...me O(|q| log |C|) (where |C| denotes the sum of document lengths in C) the range of A where all the occurrence positions of q in C are listed. The classical solution to document retrieval problems [27] starts by defining a document array D giving the document to which each suffix of A belongs. Then problems like document listing reduce to listing the distinct values in a range of D, and intersectio...

66 | Indexing text using the Ziv-Lempel trie
- Navarro
Citation Context ...sequently used to design powerful succinct representations of two-dimensional point grids [10, 11, 12], permutations [13], and binary relations [14], with applications to other compressed text indexes [15, 16, 17], document retrieval problems [18] and many others. In this paper we show, by uncovering new capabilities, that the full potential of wavelet trees is far from realized. In particular, we show that th...

65 | Adaptive set intersections, unions, and differences
- Demaine, López-Ortiz, et al.
- 2000
Citation Context .... A lower bound in terms of alternation (holding even for randomized algorithms) [19] is Ω(α · ∑_{1≤r≤k} log(n_r/α)), where n_r = j_r − i_r + 1. There exist adaptive algorithms matching this lower bound [42, 19, 43]. We show now that the wavelet tree representation of S[1, n] allows a rather simple intersection algorithm that approaches the lower bound, even if one starts from ranges of disordered values, possi...

53 | Succinct suffix arrays based on run-length encoding
- Mäkinen, Navarro
Citation Context ...s was quickly realized. For example, it was soon adopted as a fundamental component of a large class of compressed text indexes (the FM-index family [5]), giving birth to most of its modern variants [6, 7, 4, 8]. The connection between the wavelet tree and an old geometric structure by Chazelle [9] made it evident that wavelet trees could be used for range counting and reporting points in the plane. More for...

53 | Succinct data structures for flexible text retrieval systems
- Sadakane
Citation Context ...equencies are computed with wavelet trees as q doc frequency(C, q, d) = rank_d(D, ep) − rank_d(D, sp − 1). Document frequencies can be computed with just 2n + o(n) more bits for the case of the D array [47], and on top of a wavelet tree for the E array for more general scenarios [48]. In Section 5 we show how our new algorithms solve the document listing problem within the same time complexity O(docc ...

45 | An alphabet-friendly FM-index
- Ferragina, Manzini, et al.
- 2004
Citation Context ...s was quickly realized. For example, it was soon adopted as a fundamental component of a large class of compressed text indexes (the FM-index family [5]), giving birth to most of its modern variants [6, 7, 4, 8]. The connection between the wavelet tree and an old geometric structure by Chazelle [9] made it evident that wavelet trees could be used for range counting and reporting points in the plane. More for...

41 | A new succinct representation of RMQ-information and improvements in the enhanced suffix array
- Fischer, Heun
- 2007
Citation Context ...he power of wavelet trees for this problem. By representing D with a wavelet tree, they simulated E[i] = select_{D[i]}(D, rank_{D[i]}(D, i − 1)) without storing it. By using a 2n-bit data structure for RMQ [46], the total space was reduced to n log m + O(n) bits, and still Muthukrishnan’s algorithm was simulated within reasonable time, O(docc log m). Ranked document retrieval is usually built around t...

33 | Space-efficient algorithms for document retrieval
- Välimäki, Mäkinen
Citation Context ... representations of two-dimensional point grids [10, 11, 12], permutations [13], and binary relations [14], with applications to other compressed text indexes [15, 16, 17], document retrieval problems [18] and many others. In this paper we show, by uncovering new capabilities, that the full potential of wavelet trees is far from realized. In particular, we show that the wavelet tree allows us to solve ...

32 | Adaptive intersection and t-threshold problems
- Barbay, Kenyon
- 2002
Citation Context ...n(σ, j_1 − i_1 + 1, ..., j_k − i_k + 1)). However, we give an adaptive analysis of our method, showing it requires O(αk log(σ/α)) time, where α is the so-called alternation complexity of the problem [19]. All these algorithmic problems are well known. Har-Peled and Muthukrishnan [20] describe applications of range median queries (a special case of range quantile) to the analysis of Web advertising lo...

30 | Implicit compression boosting with applications to self-indexing
- Mäkinen, Navarro
Citation Context ...s was quickly realized. For example, it was soon adopted as a fundamental component of a large class of compressed text indexes (the FM-index family [5]), giving birth to most of its modern variants [6, 7, 4, 8]. The connection between the wavelet tree and an old geometric structure by Chazelle [9] made it evident that wavelet trees could be used for range counting and reporting points in the plane. More for...

26 | Range quantile queries: Another virtue of wavelet trees
- Gagie, Puglisi, et al.
Citation Context ...irs (x_i, y_i) ∈ P such that x^s ≤ x_i ≤ x^e, y^s ≤ y_i ≤ y^e, range report(P, x^s, x^e, y^s, y^e) = list of those pairs (x_i, y_i) ∈ P in some order. Early parts of this work appeared in SPIRE 2009 [1] and SPIRE 2010 [2]. The second author was partially supported by Fondecyt Grant 1-110066, Chile. The third author was partially supported by the Australian Research Council. Email addresses: travis.g...

22 | Transposition invariant string matching
- Mäkinen, Navarro, et al.
- 2005

19 | Position-restricted substring searching
- Mäkinen, Navarro
Citation Context ...icle. Preprint submitted to Theoretical Computer Science, November 15, 2011. Query range count is solved in time O(log σ), whereas range report takes time O((1 + occ) log σ) to report occ points [10]. These new capabilities were subsequently used to design powerful succinct representations of two-dimensional point grids [10, 11, 12], permutations [13], and binary relations [14], with application...

18 | Geometric Burrows-Wheeler transform: Linking range searching and text indexing
- Chien, Hon, et al.
- 2008
Citation Context ...sequently used to design powerful succinct representations of two-dimensional point grids [10, 11, 12], permutations [13], and binary relations [14], with applications to other compressed text indexes [15, 16, 17], document retrieval problems [18] and many others. In this paper we show, by uncovering new capabilities, that the full potential of wavelet trees is far from realized. In particular, we show that th...

16 | Range mode and range median queries on lists and trees
- Krizanc, Morin, et al.
Citation Context ...ently there was no work on range quantile queries, but several authors wrote about range median queries, the special case in which k is half the length of the interval between i and j. Krizanc et al. [31] introduced the problem of preprocessing for range median queries and gave four solutions, three of which require time superlogarithmic in n. Their fourth solution requires almost quadratic space, sto...

16 | An experimental investigation of set intersection algorithms for text searching
- Barbay, López-Ortiz, et al.
Citation Context ... to the number of distinct ancestors of the α leaves arrived at, the complexity is O(α log(u/α)) by Lemma 3. This second procedure is the basis of most algorithms for intersecting two or more lists [44]. The rint method we have presented has the same complexity, yet it is simpler, potentially faster, and more flexible (e.g., it is easily adapted to t-thresholded queries). Moreover, it is specific to...

16 | Colored range queries and document retrieval
- Gagie, Navarro, et al.
Citation Context ...D, ep) − rank_d(D, sp − 1). Document frequencies can be computed with just 2n + o(n) more bits for the case of the D array [47], and on top of a wavelet tree for the E array for more general scenarios [48]. In Section 5 we show how our new algorithms solve the document listing problem within the same time complexity O(docc log m), without using any RMQ data structure, while reporting the documents ...

16 | Managing Gigabytes, 2nd Edition
- Witten, Moffat, et al.
- 1999
Citation Context ...ts where two (or more) queries appear simultaneously. We extend these solutions to temporal and hierarchical document collections. 4.2. Inverted Indexes The inverted index is a classical IR structure [49, 50], lying at the heart of most modern Web search engines and applications handling natural-language text collections. By “natural language” texts one refers to those that can be easily split into a sequ...

15 | Succinct orthogonal range search structures on a grid with applications to text indexing
- Bose, He, et al.
Citation Context ...reas range report takes time O((1 + occ) log σ) to report occ points [10]. These new capabilities were subsequently used to design powerful succinct representations of two-dimensional point grids [10, 11, 12], permutations [13], and binary relations [14], with applications to other compressed text indexes [15, 16, 17], document retrieval problems [18] and many others. In this paper we show, by uncovering ...

15 | Searching large text collections
- Baeza-Yates, Moffat, et al.
- 2002
Citation Context ... problem for general strings described above, the restriction of word queries allows inverted indexes to precompute the answer to each possible word query. Two main variants of inverted indexes exist [51, 52]. Ranked retrieval is aimed at retrieving documents that are most “relevant” to a query, under some criterion. As explained, a popular formula for relevance is w_{d,q} = tf_{d,q} × idf_q, but others built ...

14 | Approximate range mode and range median queries
- Bose, Kranakis, et al.
- 2005
Citation Context ...e time superlogarithmic in n. Their fourth solution requires almost quadratic space, storing O(n^2 log log n / log n) words to answer queries in constant time (a word holds log σ bits). Bose et al. [32] considered approximate queries, and Har-Peled and Muthukrishnan [20] and Gfeller and Sanders [33] considered batched queries. Recently, Krizanc et al.’s fourth solution was superseded by one due to P...

13 | Improved bounds for range mode and range median queries
- Petersen
- 2008
Citation Context ...e queries, and Har-Peled and Muthukrishnan [20] and Gfeller and Sanders [33] considered batched queries. Recently, Krizanc et al.’s fourth solution was superseded by one due to Petersen and Grabowski [34, 35], who slightly reduced the space bound to O(n^2 (log log n)^2 / log^2 n) words. At about the same time we presented the early version of our work [1], Gfeller and Sanders [33] gave a similar O(n)-w...

13 | Range mode and range median queries in constant time and sub-quadratic
- Petersen, Grabowski
- 2009
Citation Context ...e queries, and Har-Peled and Muthukrishnan [20] and Gfeller and Sanders [33] considered batched queries. Recently, Krizanc et al.’s fourth solution was superseded by one due to Petersen and Grabowski [34, 35], who slightly reduced the space bound to O(n^2 (log log n)^2 / log^2 n) words. At about the same time we presented the early version of our work [1], Gfeller and Sanders [33] gave a similar O(n)-w...

12 | Compact rich-functional binary relation representations
- Barbay, Claude, et al.
Citation Context ...report occ points [10]. These new capabilities were subsequently used to design powerful succinct representations of two-dimensional point grids [10, 11, 12], permutations [13], and binary relations [14], with applications to other compressed text indexes [15, 16, 17], document retrieval problems [18] and many others. In this paper we show, by uncovering new capabilities, that the full potential of w...

12 | Range selection and median: Tight cell probe lower bounds and adaptive data structures
- Jørgensen, Larsen
- 2011
Citation Context ... [36] gave a more involved data structure that still takes O(n) words but only O(log n / log log n) time for queries. These two papers have now been merged [37]. Very recently, Jørgensen and Larsen [38] proved a matching lower bound for any data structure that takes n log^{O(1)} n space. In the sequel we show that, if S is represented using a wavelet tree, we can answer general range quantile queries ...

11 | Improved algorithms for the range next value problem and applications
- Crochemore, Iliopoulos, et al.
Citation Context ...se them for noise reduction in grey scale images. Similarly, Crochemore et al. [22] use range next value queries for interval-restricted pattern matching, and Keller et al. [23] and Crochemore et al. [24] use them for many other sophisticated pattern matching problems. Hon et al. [25] use range intersect queries for generalized document retrieval, and in a simplified form the problem also appears when...

10 | Towards optimal range medians
- Gfeller, Sanders
- 2009
Citation Context ... 2 log log n/ log n ) words to answer queries in constant time (a word holds log σ bits). Bose et al. [32] considered approximate queries, and Har-Peled and Muthukrishnan [20] and Gfeller and Sanders [33] considered batched queries. Recently, Krizanc et al.’s fourth solution was superseded by one due to Petersen and Grabowski [34, 35], who slightly reduced the space bound to O ( n 2 (log log n) 2 / lo...

10 | Alternation and redundancy analysis of the intersection problem
- Barbay, Kenyon
Citation Context .... A lower bound in terms of alternation (holding even for randomized algorithms) [19] is Ω(α · ∑_{1≤r≤k} log(n_r/α)), where n_r = j_r − i_r + 1. There exist adaptive algorithms matching this lower bound [42, 19, 43]. We show now that the wavelet tree representation of S[1, n] allows a rather simple intersection algorithm that approaches the lower bound, even if one starts from ranges of disordered values, possi...

9 | Range medians
- Har-Peled, Muthukrishnan
- 2008

8 | Self-indexed text compression using straight-line programs
- Claude, Navarro
- 2009
Citation Context ...sequently used to design powerful succinct representations of two-dimensional point grids [10, 11, 12], permutations [13], and binary relations [14], with applications to other compressed text indexes [15, 16, 17], document retrieval problems [18] and many others. In this paper we show, by uncovering new capabilities, that the full potential of wavelet trees is far from realized. In particular, we show that th...

7 | String retrieval for multipattern queries
- Hon, Shah, et al.
Citation Context ...2] use range next value queries for interval-restricted pattern matching, and Keller et al. [23] and Crochemore et al. [24] use them for many other sophisticated pattern matching problems. Hon et al. [25] use range intersect queries for generalized document retrieval, and in a simplified form the problem also appears when processing conjunctive queries in inverted indexes. We further illustrate the im...

6 | Dual-sorted inverted lists
- Navarro, Puglisi

6 | Data structures for range median queries
- Brodal, Jørgensen
- 2009
Citation Context ...tructure that supports range median queries in O(log n) time and observed in a footnote that “a generalization to arbitrary ranks will be straightforward”. A few months later, Brodal and Jørgensen [36] gave a more involved data structure that still takes O(n) words but only O(log n / log log n) time for queries. These two papers have now been merged [37]. Very recently, Jørgensen and Larsen [38] ...

5 | A fun application of compact data structures to indexing geographic data
- Brisaboa, Luaces, et al.
Citation Context ...reas range report takes time O((1 + occ) log σ) to report occ points [10]. These new capabilities were subsequently used to design powerful succinct representations of two-dimensional point grids [10, 11, 12], permutations [13], and binary relations [14], with applications to other compressed text indexes [15, 16, 17], document retrieval problems [18] and many others. In this paper we show, by uncovering ...

5 | Finding patterns in given intervals
- Crochemore, Iliopoulos, et al.
- 2007
Citation Context ... range median queries (a special case of range quantile) to the analysis of Web advertising logs. Stolinski et al. [21] use them for noise reduction in grey scale images. Similarly, Crochemore et al. [22] use range next value queries for interval-restricted pattern matching, and Keller et al. [23] and Crochemore et al. [24] use them for many other sophisticated pattern matching problems. Hon et al. [2...

5 | Range non-overlapping indexing and successive list indexing
- Keller, Kopelowitz, et al.
- 2007
Citation Context ...gs. Stolinski et al. [21] use them for noise reduction in grey scale images. Similarly, Crochemore et al. [22] use range next value queries for interval-restricted pattern matching, and Keller et al. [23] and Crochemore et al. [24] use them for many other sophisticated pattern matching problems. Hon et al. [25] use range intersect queries for generalized document retrieval, and in a simplified form th...

3 | Towards optimal range medians. Theoretical Computer Science, to appear
- Brodal, Gfeller, et al.
Citation Context .... A few months later, Brodal and Jørgensen [36] gave a more involved data structure that still takes O(n) words but only O(log n / log log n) time for queries. These two papers have now been merged [37]. Very recently, Jørgensen and Larsen [38] proved a matching lower bound for any data structure that takes n log^{O(1)} n space. In the sequel we show that, if S is represented using a wavelet tree, we ...

3 | Efficient data structures for the orthogonal range successor problem
- Yu, Hon, et al.
- 2009
Citation Context ...ce solution based on an augmented binary search tree, with query time O(log u), where once again u ≤ min(n, σ) is the number of distinct symbols in S and [1, σ] is the domain of values. Yu et al. [40] improved the time to O(log n / log log n), within linear space. For the special case of semi-infinite queries (i.e., i = 1 or j = n) one can use an O(n)-words and O(log log n) time solution by ...

1 | On efficient implementations of median filters in theory and practice, unpublished manuscript
- Stolinski, Grabowski, et al.
- 2010
Citation Context ...ic problems are well known. Har-Peled and Muthukrishnan [20] describe applications of range median queries (a special case of range quantile) to the analysis of Web advertising logs. Stolinski et al. [21] use them for noise reduction in grey scale images. Similarly, Crochemore et al. [22] use range next value queries for interval-restricted pattern matching, and Keller et al. [23] and Crochemore et al...