## Top-k document retrieval in optimal time and linear space (2012)

Venue: | In Proc. 22nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2012) |

Citations: | 13 - 7 self |

### BibTeX

@INPROCEEDINGS{Navarro12top-kdocument,

author = {Gonzalo Navarro and Yakov Nekrich},

title = {Top-k document retrieval in optimal time and linear space},

booktitle = {In Proc. 22nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2012)},

year = {2012},

pages = {1066--1077}

}

### Abstract

We describe a data structure that uses O(n)-word space and reports k most relevant documents that contain a query pattern P in optimal O(|P| + k) time. Our construction supports an ample set of important relevance measures, such as the frequency of P in a document and the minimal distance between two occurrences of P in a document. We show how to reduce the space of the data structure from O(n log n) to O(n(log σ + log D + log log n)) bits, where σ is the alphabet size and D is the total number of documents.
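To make the query in the abstract concrete, the following is a minimal brute-force sketch of top-k retrieval under the term-frequency measure. It is not the paper's data structure: it scans every document on each query, so it has none of the O(|P| + k) guarantees, and the function names `count_occurrences` and `top_k_by_frequency` are hypothetical names chosen for illustration. Ties are broken by smaller document index, which is an assumption of this sketch.

```python
import heapq


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_by_frequency(documents, pattern, k):
    """Return up to k (doc_id, tf) pairs for documents containing pattern.

    Results are ordered by decreasing frequency; ties go to the smaller
    document index (an assumption made for this sketch).
    """
    scored = []
    for doc_id, text in enumerate(documents):
        tf = count_occurrences(text, pattern)
        if tf > 0:
            scored.append((tf, doc_id))
    best = heapq.nsmallest(k, scored, key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, tf) for tf, doc_id in best]


docs = ["abracadabra", "banana", "cabana", "xyz"]
print(top_k_by_frequency(docs, "a", 2))
```

A baseline like this is mainly useful as a reference against which an index-based implementation can be checked on small inputs.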

### Citations

899 | Algorithms on Strings, Trees, and Sequences - Gusfield - 1997 |
Citation Context ...e locus of a string P is the highest node v such that P is a prefix of path(v). Every occurrence of P corresponds to a unique leaf that descends from its locus. We refer the reader to, e.g., Gusfield [18] for an extensive description of the generalized suffix tree and related notions. We say that a leaf l is marked with document d if the suffix stored in l belongs to d. An internal node v is marked wi...

426 | Linear pattern matching algorithms - Weiner - 1973 |

193 | Inverted files for text search engines - Zobel, Moffat - 2006 |
Citation Context ...d challenge is to move on towards the bag-of-words paradigm of information retrieval. Our model easily handles single-word searches, and also phrases (which is quite complicated with inverted indexes [39, 3], particularly if their weights have to be computed). Handling a set of words or phrases, whose weights within any document d must be combined in some form (for example using the tf × idf model) is mo...

183 | The LCA problem revisited - Bender, Farach-Colton |
Citation Context ...cause the size per point in Lemma 3.1 is O(log m), and our widths decrease doubly exponentially. As a query may span several stripes, a structure similar to the one used in the classical RMQ solution [6] is used. This gives linear space for stripes of width up to Ω(log^δ n). Smaller ones are solved with universal tables. In addition to the global array storing p.y for each p.x, we use another array s...

132 | A functional approach to data structures and its use in multidimensional searching - Chazelle - 1988 |
Citation Context ... the need to traverse W in order to find out the real weights, so as to compare weights from different nodes. However, those weights can be computed in time O(log^ε n) and using O(n log n) extra bits [10, 31, 9]. The operations on the priority queue can be carried out in O((log log n)^2) time [1]. Thus we have the following result. Lemma 7.1. Given a grid of n × n points, there exists a data structure that ...

115 | Filtered document retrieval with frequency-sorted indexes - Persin, Zobel, et al. - 1996 |

74 | Efficient Algorithms for Document Retrieval Problems - Muthukrishnan - 2002 |
Citation Context ...queries; their structure uses O(n) words of space and reports all docc documents that contain P in O(|P| log D + docc) time, where D is the total number of documents in the collection. Muthukrishnan [27] presented a data structure that uses O(n) words of space and answers document listing queries in O(|P| + docc) time. Muthukrishnan [27] also initiated the study of more sophisticated problems in whi...

53 | Succinct data structures for flexible text retrieval systems - Sadakane |
Citation Context ...rested in reporting all documents d with tf(P, d) × idf(P) ≥ τ, where idf(P) = log(N/df(P)) and df(P) is the number of documents where P appears [3]. Using the O(n)-bit structure of Sadakane [34], we can compute idf(P) in O(|P|) time. To answer the query, we use our data structure of Theorem 1.1 in online mode on measure tf: For every reported document d we find tf(P, d) and compute tf (...

52 | Compressed suffix trees with full functionality - Sadakane |
Citation Context ..., but still an edge of a minitree can be labeled with a string of length Θ(n). Instead of representing the contracted tree and the minitrees separately, we use Sadakane’s compressed suffix tree (CST) [35] to represent the topology of the whole T in O(n) bits, and a recent compressed representation [5] of the global suffix array (SA) of the string collection, which takes O(n log σ) bits. This SA repres...

34 | Space-efficient algorithms for document retrieval - Välimäki, Mäkinen |
Citation Context ... insufficient. We have shown that our structure can use, instead, O(n(log σ + log D + log log n)) bits. There is a whole trend of reduced-space representations for general document retrieval problems [34, 37, 16, 21, 15, 11, 4]. Most of them make use of the so-called document array [27]. This approach has been shown to be very competitive in space and time for top-k problems, even using heuristic solutions [11, 29]. The spa...

27 | Range quantile queries: Another virtue of wavelet trees - Gagie, Puglisi, et al. |
Citation Context ... insufficient. We have shown that our structure can use, instead, O(n(log σ + log D + log log n)) bits. There is a whole trend of reduced-space representations for general document retrieval problems [34, 37, 16, 21, 15, 11, 4]. Most of them make use of the so-called document array [27]. This approach has been shown to be very competitive in space and time for top-k problems, even using heuristic solutions [11, 29]. The spa...

26 | Dynamic ordered sets with exponential search trees - Andersson, Thorup - 2007 |
Citation Context ...fferent nodes. However, those weights can be computed in time O(log^ε n) and using O(n log n) extra bits [10, 31, 9]. The operations on the priority queue can be carried out in O((log log n)^2) time [1]. Thus we have the following result. Lemma 7.1. Given a grid of n × n points, there exists a data structure that uses O(n) words of space and reports k most highly weighted points in a range Q = [a, b...

25 | Space-efficient framework for top-k string retrieval problems - Hon, Shah, et al. |
Citation Context ...roblem for the case when the relevance measure is tf(P, d). Their data structure uses O(n log n) words of space and answers queries in O(|P| + k + log n log log n) time. Later, Hon, Shah and Vitter [21] presented a general solution for a wide class of relevance measures. Their data structure uses linear space and needs O(|P| + k log k) time to answer a top-k query. A recent O(n) space data structur...

22 | Optimal succinctness for range minimum queries - Fischer - 2010 |
Citation Context ...ge increases by O(v log m) bits of space. Now we sort the v points in x-coordinate order, build the sequence Y[1..v] of their y-coordinates, and build a Range Minimum Query (RMQ) data structure on Y [13]. This structure requires only O(v) bits of space, does not need to access Y after construction (so we do not store Y), and answers in constant time the query rmq(c, d) = arg min_{c≤i≤d} Y[i] for any c...
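The query rmq(c, d) = arg min_{c≤i≤d} Y[i] from the context above can be illustrated with a classical sparse table. This is only a sketch of the query semantics: it is not the O(v)-bit structure of [13], since it uses O(v log v) words and keeps Y in memory, and the class name `SparseTableRMQ` is a hypothetical name. Indices are 0-based here, unlike Y[1..v] in the excerpt, and ties are resolved to the leftmost position as an assumption.

```python
class SparseTableRMQ:
    """Constant-time range-minimum queries after O(v log v) preprocessing."""

    def __init__(self, values):
        self.values = list(values)
        n = len(self.values)
        # table[j][i] holds the index of the minimum of values[i : i + 2**j].
        self.table = [list(range(n))]
        j = 1
        while (1 << j) <= n:
            prev = self.table[j - 1]
            half = 1 << (j - 1)
            row = []
            for i in range(n - (1 << j) + 1):
                left, right = prev[i], prev[i + half]
                # "<=" keeps the leftmost index when values are equal.
                if self.values[left] <= self.values[right]:
                    row.append(left)
                else:
                    row.append(right)
            self.table.append(row)
            j += 1

    def rmq(self, c, d):
        """Return the index of the minimum of values[c..d], inclusive."""
        if not 0 <= c <= d < len(self.values):
            raise IndexError("need 0 <= c <= d < len(values)")
        # Cover [c, d] with two overlapping blocks of length 2**j.
        j = (d - c + 1).bit_length() - 1
        left = self.table[j][c]
        right = self.table[j][d - (1 << j) + 1]
        return left if self.values[left] <= self.values[right] else right


Y = [5, 2, 7, 2, 9, 1, 4]
index = SparseTableRMQ(Y)
print(index.rmq(0, 4), index.rmq(0, 6))
```

The sparse table trades space for simplicity; a succinct structure would answer the same query without storing the values at all.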

12 | Probabilistic Analysis of Generalized Suffix Trees - Szpankowski - 1992 |
Citation Context ... data structures. For example, the height of our grids was bounded by O(n), but it corresponds to the height of the suffix tree. This is O(log n) on average for any text generated from a Markov model [36], and indeed small in most practical cases. A common pitfall to practicality is space usage. Even achieving linear space (i.e., O(n log n) bits) can be insufficient. We have shown that our structure c...

10 | Improved compressed indexes for full-text document retrieval - Belazzougui, Navarro, et al. - 2012 |
Citation Context ... insufficient. We have shown that our structure can use, instead, O(n(log σ + log D + log log n)) bits. There is a whole trend of reduced-space representations for general document retrieval problems [34, 37, 16, 21, 15, 11, 4]. Most of them make use of the so-called document array [27]. This approach has been shown to be very competitive in space and time for top-k problems, even using heuristic solutions [11, 29]. The spa...

9 | Efficient index for retrieving top-k most frequent documents - Hon, Shah, et al. - 2009 |
Citation Context ...e mind(P, d), the minimum distance between two occurrences of P in d, and docrank(d), an arbitrary static rank assigned to a document d. Some more complex measures have also been proposed. Hon et al. [19] presented a solution for the top-k document retrieval problem for the case when the relevance measure is tf(P, d). Their data structure uses O(n log n) words of space and answers queries in O(|P| +...

7 | String retrieval for multipattern queries - Hon, Shah, et al. |
Citation Context ...phrases, whose weights within any document d must be combined in some form (for example using the tf × idf model) is more challenging. We are only aware of some very preliminary results for this case [28, 20]. It is interesting to note that our online result allows simulating the left-to-right traversal, in decreasing weight order, of the virtual list of occurrences of any string pattern P. Therefore, fo...

7 | Top-k color queries for document retrieval - Karpinski, Nekrich - 2011 |

1 | Pruned Query Evaluation using - Anh, Moffat - 2006 |

1 | Modern Information Retrieval, 2nd edition, Addison-Wesley - Baeza-Yates, Ribeiro-Neto - 2010 |
Citation Context ...in Theorem 1.1. For instance, we might be interested in reporting all documents d with tf(P, d) × idf(P) ≥ τ, where idf(P) = log(N/df(P)) and df(P) is the number of documents where P appears [3]. Using the O(n)-bit structure of Sadakane [34], we can compute idf(P) in O(|P|) time. To answer the query, we use our data structure of Theorem 1.1 in online mode on measure tf: For every reporte...
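The thresholded query in the context above, reporting every document d with tf(P, d) × idf(P) ≥ τ, can be sketched as follows. Several points are assumptions of this sketch rather than statements from the source: N is taken to be the number of documents, the logarithm is natural, tf is computed by brute-force scanning instead of an index, and the names `term_frequency` and `report_tf_idf_at_least` are hypothetical. The loop imitates the "online mode" idea by visiting documents in decreasing tf order and stopping at the first one that falls below the threshold.

```python
import math


def term_frequency(pattern, text):
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def report_tf_idf_at_least(documents, pattern, tau):
    """Return (doc_id, score) for documents with tf * idf >= tau."""
    freqs = [(term_frequency(pattern, text), doc_id)
             for doc_id, text in enumerate(documents)]
    df = sum(1 for tf, _ in freqs if tf > 0)
    if df == 0:
        return []
    # Assumption: N is the number of documents, natural logarithm.
    idf = math.log(len(documents) / df)
    result = []
    # Visit documents in decreasing tf order; since idf is fixed for the
    # pattern, scores are non-increasing and we can stop early.
    for tf, doc_id in sorted(freqs, key=lambda pair: (-pair[0], pair[1])):
        score = tf * idf
        if tf == 0 or score < tau:
            break
        result.append((doc_id, score))
    return result


docs = ["aaa", "ab", "bb", "cc"]
print(report_tf_idf_at_least(docs, "a", 1.0))
```

Because idf(P) is the same for every document, the early stop is safe: once one document fails the threshold, all later ones in tf order fail as well.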

1 | A Linear Space Data Structure for Orthogonal Range Reporting and - Nekrich - 2009 |
Citation Context ... the need to traverse W in order to find out the real weights, so as to compare weights from different nodes. However, those weights can be computed in time O(log^ε n) and using O(n log n) extra bits [10, 31, 9]. The operations on the priority queue can be carried out in O((log log n)^2) time [1]. Thus we have the following result. Lemma 7.1. Given a grid of n × n points, there exists a data structure that ...