## Dual-Sorted Inverted Lists

Citations: 7 (4 self)

### BibTeX

@MISC{Navarro_dual-sortedinverted,

author = {Gonzalo Navarro and Simon J. Puglisi},

title = {Dual-Sorted Inverted Lists},

year = {}

}

### Abstract

Several IR tasks rely, to achieve high efficiency, on a single pervasive data structure called the inverted index. This is a mapping from the terms in a text collection to the documents where they appear, plus some supplementary data. Different orderings in the list of documents associated with a term, and different supplementary data, fit widely different IR tasks. Index designers have to choose the right order for one such task, rendering the index difficult to use for others. In this paper we introduce a general technique, based on wavelet trees, to maintain a single data structure that offers the combined functionality of two independent orderings for an inverted index, with competitive efficiency and within the space of one compressed inverted index. We show in particular that the technique allows combining an ordering by decreasing term frequency (useful for ranked document retrieval) with an ordering by increasing document identifier (useful for phrase and Boolean queries). We show not only that we can support the primitives required by the different search paradigms (e.g., in order to implement any intersection algorithm on top of our data structure), but also that the data structure offers novel ways of carrying out many operations of interest, including space-free treatment of stemming and hierarchical documents.
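The idea of serving two orderings from one structure can be illustrated with a small sketch. It is not the authors' implementation: the names `WaveletTree`, `access`, `quantile`, `postings`, and `span` are hypothetical, the postings data is made up, and plain prefix-count lists stand in for the compressed rank/select bitvectors, so the sketch does not achieve the compressed space stated in the abstract. Each term's documents are stored in decreasing term-frequency order; reading positions left to right gives the frequency ordering, while a range quantile query (the k-th smallest value in a range) gives the increasing document-identifier ordering.

```python
class WaveletTree:
    """Pointer-based wavelet tree over integer symbols in [lo, hi].

    Assumption: prefix-count lists replace the compressed rank/select
    bitvectors of a real implementation, so space is not succinct here.
    """

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:  # root call: seq must be non-empty
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        if lo == hi or not seq:
            return  # leaf, or an empty subtree that no valid query reaches
        mid = (lo + hi) // 2
        # ones[i] = how many of seq[0:i] go to the right child (bit 1)
        self.ones = [0]
        for s in seq:
            self.ones.append(self.ones[-1] + (s > mid))
        self.left = WaveletTree([s for s in seq if s <= mid], lo, mid)
        self.right = WaveletTree([s for s in seq if s > mid], mid + 1, hi)

    def access(self, i):
        """Return the symbol at 0-based position i of the sequence."""
        node = self
        while node.lo != node.hi:
            if node.ones[i + 1] - node.ones[i]:  # bit at position i is 1
                i = node.ones[i]                 # 1s before i
                node = node.right
            else:
                i = i - node.ones[i]             # 0s before i
                node = node.left
        return node.lo

    def quantile(self, l, r, k):
        """Return the k-th smallest symbol (k >= 1) in positions [l, r).

        Assumes 1 <= k <= r - l.
        """
        node = self
        while node.lo != node.hi:
            l1, r1 = node.ones[l], node.ones[r]
            zeros = (r - l) - (r1 - l1)  # symbols of the range going left
            if k <= zeros:
                l, r = l - l1, r - r1
                node = node.left
            else:
                k -= zeros
                l, r = l1, r1
                node = node.right
        return node.lo


# Hypothetical postings: term -> [(doc_id, term_frequency)],
# each list already sorted by decreasing term frequency.
postings = {"a": [(5, 9), (2, 7), (8, 1)], "b": [(2, 4), (7, 3)]}

L, span = [], {}
for term, plist in postings.items():
    span[term] = (len(L), len(L) + len(plist))  # half-open range in L
    L.extend(doc for doc, _ in plist)

wt = WaveletTree(L)


def by_frequency(term):
    """Documents of a term in decreasing term-frequency order."""
    l, r = span[term]
    return [wt.access(i) for i in range(l, r)]


def by_doc_id(term):
    """Documents of a term in increasing document-identifier order."""
    l, r = span[term]
    return [wt.quantile(l, r, k) for k in range(1, r - l + 1)]


print(by_frequency("a"))  # [5, 2, 8]
print(by_doc_id("a"))     # [2, 5, 8]
```

Both traversals cost one root-to-leaf walk per reported document, which is where a logarithmic factor in the number of documents would come from in a balanced tree of this shape.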

### Citations

2686 | Modern Information Retrieval
- Baeza-Yates, Ribeiro-Neto
- 1999
Citation Context ... and by the Australian Research Council (second author). Inverted indexes are an old and simple data structure, yet one of the most successful in IR. They play a central role in any book on the topic [6, 31, 12, 22, 11], and are also at the heart of most modern Web search engines. Given a text collection regarded as a set of documents, an inverted index is an array of lists. Each array entry corresponds to a differe...

974 | Introduction to Information Retrieval
- Manning, Raghavan, et al.
- 2008
Citation Context ... and by the Australian Research Council (second author). Inverted indexes are an old and simple data structure, yet one of the most successful in IR. They play a central role in any book on the topic [6, 31, 12, 22, 11], and are also at the heart of most modern Web search engines. Given a text collection regarded as a set of documents, an inverted index is an array of lists. Each array entry corresponds to a differe...

365 | Human Behaviour and the Principle of Least Effort
- Zipf
Citation Context ...ueries, where one has to intersect the lists. While intersection can also be done by scanning all the lists in synchronization, it is usually the case that some lists are much shorter than the others [34], and so faster intersection algorithms are possible. These algorithms are especially relevant when many words have to be intersected. Intersection queries have become extremely popular as Google-like...

213 | Inverted files for text search engines
- Zobel, Moffat
Citation Context ... that the vocabulary is much smaller than the collection size $n$, more precisely of size $O(n^{\beta})$, for some constant $0 < \beta < 1$ that depends on the text type. Two main variants of inverted indexes exist [5, 35]. Ranked retrieval is aimed at retrieving documents which are “relevant” to a query, under some criterion. Documents are regarded as vectors, where terms are the dimensions, and the values of the vect...

208 | High-order entropy-compressed text indexes
- Grossi, Gupta, et al.
- 2003
Citation Context ...n $B_v$ of the $i$th occurrence of bit $b$. As we shall see, this preprocessing allows for efficient navigation of the tree when resolving certain range queries on $L$. The wavelet tree was originally designed [17] to allow accessing any $S[i]$, as well as computing queries $\mathrm{rank}_d(L, i)$ and $\mathrm{select}_d(L, i)$ on $L$ for any value $d$, all in $O(\log D)$ time. Section 4 (Our data representation): Let $D$ be the total number of documents in...

202 | Succinct indexable dictionaries with applications to encoding k-ary trees and multisets
- Raman, Raman, et al.
Citation Context ... symbol $S_v[i]$ appears below the right child of $v$, and $B_v[i] = 0$ otherwise. Note that $S_v$ is not actually stored, only $B_v$. Finally, each bitvector $B_v$ is preprocessed for $O(1)$ rank and select queries [26]: $\mathrm{rank}_b(B_v, i)$ returns the number of occurrences of bit $b$ in $B_v[1, i]$; and $\mathrm{select}_b(B_v, i)$ returns the position in $B_v$ of the $i$th occurrence of bit $b$. As we shall see, this preprocessing allows for effic...

195 | Managing Gigabytes
- Witten, Bell
- 1999
Citation Context ... and by the Australian Research Council (second author). Inverted indexes are an old and simple data structure, yet one of the most successful in IR. They play a central role in any book on the topic [6, 31, 12, 22, 11], and are also at the heart of most modern Web search engines. Given a text collection regarded as a set of documents, an inverted index is an array of lists. Each array entry corresponds to a differe...

189 | Compressed full-text indexes
- Navarro, Mäkinen
Citation Context ...uence is represented using a bitmap $S[1, N]$, also preprocessed for rank and select queries. Thus $st_t = \mathrm{select}_1(S, t)$, and also $\mathrm{rank}_1(S, i)$ tells which list $L[i]$ belongs to. The analysis of wavelet trees [17, 23] shows that the space occupied by the wavelet tree of $L$ is $N H_0(L) + o(N \log D)$ bits. Here $N H_0(L) = \sum_d dt_d \log \frac{N}{dt_d} \le N \log D$, where $dt_d$ is the number of distinct terms in document $d$. The classical differential ...

164 | Stemming algorithms: A case study for detailed evaluation
- Hull
- 1996
Citation Context ... is $O(\log D)$. Before considering the classical and extended operations that can be carried out with our data structure, let us raise a couple of issues: 1. Stemming is a useful tool to enhance recall [21, 32]. A way to provide it is by stemming the terms directly during the parsing, yet in this case the index is unable to provide non-stemmed searching at the same time. One can of course index the stemmed ...

135 | Information Retrieval - Computational and Theoretical Aspects
- Heaps
- 1978
Citation Context ...or term of the collection, and its list points to the documents where that word appears in the text collection. The set of different words is called the vocabulary. Empirical laws well accepted in IR [19] establish that the vocabulary is much smaller than the collection size $n$, more precisely of size $O(n^{\beta})$, for some constant $0 < \beta < 1$ that depends on the text type. Two main variants of inverted inde...

120 | Filtered Document Retrieval with Frequency-Sorted Indexes
- Persin, Zobel, et al.
- 1996
Citation Context ...g the involved lists, so that documents can be assigned the combined weights over the different terms. Algorithms and different data organizations for this type of query have been intensively studied [25, 31, 35, 3, 30]. List entries are usually sorted into order of descending weight of the term in the documents. The second variant is the inverted index for so-called full-text retrieval (also known as Boolean retrie...

96 | Corpus-based stemming using cooccurrence of word variants
- Xu, Croft
- 1998
Citation Context ... is $O(\log D)$. Before considering the classical and extended operations that can be carried out with our data structure, let us raise a couple of issues: 1. Stemming is a useful tool to enhance recall [21, 32]. A way to provide it is by stemming the terms directly during the parsing, yet in this case the index is unable to provide non-stemmed searching at the same time. One can of course index the stemmed ...

92 | Compression of inverted indexes for fast query evaluation
- Scholer, Williams, et al.
- 2002
Citation Context ...f d-gaps $\langle p_1, p_2 - p_1, p_3 - p_2, \ldots, p_\ell - p_{\ell-1} \rangle$, and uses a variable-length encoding for these differences, for example γ-codes, δ-codes or Golomb codes [31]. More recent proposals use byte-aligned [29, 10, 13] or word-aligned [2, 33] codes, which lose little compression and are faster at decoding. Intersection of compressed inverted lists is still possible using a merge-type algorithm. However, approaches ...

83 | Vector-space ranking with effective early termination
- Anh, Kretser, et al.
- 2001
Citation Context ...accumulated if found; otherwise they are inserted as new candidates. There is a threshold for continuing processing each list: if the $tf_{t,d}$ values fall below it, the list is abandoned (see Anh et al. [1] and references therein). There is also a stricter threshold for inserting new elements as candidates. These heuristic thresholds provide a time/quality tradeoff. Section 3 (Wavelet trees): Let $L[1, N]$ be a sequ...

67 | Adaptive set intersections, unions, and differences
- Demaine, López-Ortiz, et al.
- 2000
Citation Context ...is the phrase query, where intersecting the documents where the words appear is the first step. The amount of recent research on intersection of inverted lists witnesses the importance of the problem [15, 8, 4, 7, 27, 28, 13, 9]. In particular, in-memory algorithms have received much attention recently, as large main memories and distributed systems make it feasible to hold the inverted index entirely in RAM. Needless to say...

56 | Pruned query evaluation using pre-computed impacts
- Anh, Moffat
Citation Context ...g the involved lists, so that documents can be assigned the combined weights over the different terms. Algorithms and different data organizations for this type of query have been intensively studied [25, 31, 35, 3, 30]. List entries are usually sorted into order of descending weight of the term in the documents. The second variant is the inverted index for so-called full-text retrieval (also known as Boolean retrie...

52 | Inverted Index Compression Using Word-Aligned Binary Codes
- Anh, Moffat
Citation Context ...2, $\ldots, p_\ell - p_{\ell-1} \rangle$, and uses a variable-length encoding for these differences, for example γ-codes, δ-codes or Golomb codes [31]. More recent proposals use byte-aligned [29, 10, 13] or word-aligned [2, 33] codes, which lose little compression and are faster at decoding. Intersection of compressed inverted lists is still possible using a merge-type algorithm. However, approaches that require direct acce...

52 | Adding compression to block addressing inverted indexes
- Navarro, Moura, et al.
Citation Context ...ly in RAM. Needless to say, space is an issue in inverted indexes, especially when combined with the goal of operating in main memory. Much research has been carried out on compressing inverted lists [31, 24, 35, 13], and on the interaction of various compressed representations with different query algorithms, including list intersections. Most of the list compression algorithms for full-text indexes rely on the f...

40 | Inverted index compression and query processing with optimized document ordering
- Yan, Ding, et al.
- 2009
Citation Context ...2, $\ldots, p_\ell - p_{\ell-1} \rangle$, and uses a variable-length encoding for these differences, for example γ-codes, δ-codes or Golomb codes [31]. More recent proposals use byte-aligned [29, 10, 13] or word-aligned [2, 33] codes, which lose little compression and are faster at decoding. Intersection of compressed inverted lists is still possible using a merge-type algorithm. However, approaches that require direct acce...

39 | A fast set intersection algorithm for sorted sequences
- Baeza-Yates
- 2004
Citation Context ...is the phrase query, where intersecting the documents where the words appear is the first step. The amount of recent research on intersection of inverted lists witnesses the importance of the problem [15, 8, 4, 7, 27, 28, 13, 9]. In particular, in-memory algorithms have received much attention recently, as large main memories and distributed systems make it feasible to hold the inverted index entirely in RAM. Needless to say...

39 | Efficient document retrieval in main memory
- Strohman, Croft
Citation Context ...g the involved lists, so that documents can be assigned the combined weights over the different terms. Algorithms and different data organizations for this type of query have been intensively studied [25, 31, 35, 3, 30]. List entries are usually sorted into order of descending weight of the term in the documents. The second variant is the inverted index for so-called full-text retrieval (also known as Boolean retrie...

34 | Adaptive intersection and t-threshold problems
- Barbay, Kenyon
- 2002
Citation Context ...is the phrase query, where intersecting the documents where the words appear is the first step. The amount of recent research on intersection of inverted lists witnesses the importance of the problem [15, 8, 4, 7, 27, 28, 13, 9]. In particular, in-memory algorithms have received much attention recently, as large main memories and distributed systems make it feasible to hold the inverted index entirely in RAM. Needless to say...

33 | Information Retrieval: Implementing and Evaluating Search Engines
- Büttcher, Clarke, et al.
- 2010

28 | Range quantile queries: Another virtue of wavelet trees
- Gagie, Puglisi, et al.
- 2009
Citation Context ...our wavelet tree representation of $L$ we can find any value $F_t[i]$. This is equivalent to finding the $i$-th smallest value in $L[st_t, st_{t+1} - 1]$. The algorithm, for a general range $L[l, r]$, is as follows [16]. Let $v$ be the root of the wavelet tree and $B_v$ its bitmap. We count with $n_1 = \mathrm{rank}_1(B_v, r) - \mathrm{rank}_1(B_v, l - 1)$ the number of 1s in $B_v[l, r]$, and with $n_0 = (r - l + 1) - n_1$ the number of 0s. If $i \le n_0$, then t...

23 | Experimental analysis of a fast intersection algorithm for sorted sequences
- Baeza-Yates, Salinger

23 | Top-k ranked document search in general text databases
- Culpepper, Navarro, et al.
Citation Context ...r approach can be applied to any ordering on the documents. A very different and interesting ordering from the one considered here is that induced by the suffix array (the D array of Culpepper et al. [14]). Applying our data structure and bys-like intersection algorithm over this ordering immediately yields efficient “bag-of-strings” queries from suffix arrays, further bridging the gap between IR prob...

21 | Intersection in integer inverted indices
- Sanders, Transier

19 | Compact set representation for information retrieval
- Culpepper, Moffat
- 2007

18 | (S,C)-dense coding: An optimized compression code for natural language text databases
- Brisaboa, Fariña, et al.
- 2003
Citation Context ...f d-gaps $\langle p_1, p_2 - p_1, p_3 - p_2, \ldots, p_\ell - p_{\ell-1} \rangle$, and uses a variable-length encoding for these differences, for example γ-codes, δ-codes or Golomb codes [31]. More recent proposals use byte-aligned [29, 10, 13] or word-aligned [2, 33] codes, which lose little compression and are faster at decoding. Intersection of compressed inverted lists is still possible using a merge-type algorithm. However, approaches ...

17 | An experimental investigation of set intersection algorithms for text searching
- Barbay, López-Ortiz, et al.
- 2009

16 | Searching large text collections
- Baeza-Yates, Moffat, et al.
- 2002
Citation Context ... that the vocabulary is much smaller than the collection size $n$, more precisely of size $O(n^{\beta})$, for some constant $0 < \beta < 1$ that depends on the text type. Two main variants of inverted indexes exist [5, 35]. Ranked retrieval is aimed at retrieving documents which are “relevant” to a query, under some criterion. Documents are regarded as vectors, where terms are the dimensions, and the values of the vect...

5 | Compressed inverted indexes for in-memory search engines
- Transier, Sanders
- 2008

4 | Succinct data structures
- Gupta
- 2007