Results 1 
8 of
8
Colored Range Queries and Document Retrieval
"... Colored range queries are a wellstudied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important onedimensional colored range queries — colore ..."
Abstract

Cited by 17 (9 self)
 Add to MetaCart
Colored range queries are a wellstudied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important onedimensional colored range queries — colored range listing, colored range topk queries and colored range counting — and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the highorder entropies of the library of documents. We then show how (approximate) colored topk queries can be reduced to (approximate) rangemode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
Topk document retrieval in optimal time and linear space
 In Proc. 22nd Annual ACMSIAM Symposium on Discrete Algorithms (SODA 2012
, 2012
"... We describe a data structure that uses O(n)word space and reports k most relevant documents that contain a query pattern P in optimal O(P  + k) time. Our construction supports an ample set of important relevance measures, such as the frequency of P in a document and the minimal distance between t ..."
Abstract

Cited by 13 (7 self)
 Add to MetaCart
We describe a data structure that uses O(n)word space and reports k most relevant documents that contain a query pattern P in optimal O(P  + k) time. Our construction supports an ample set of important relevance measures, such as the frequency of P in a document and the minimal distance between two occurrences of P in a document. We show how to reduce the space of the data structure from O(n log n) to O(n(log σ+log D+log log n)) bits, where σ is the alphabet size and D is the total number of documents. 1
SpaceEfficient Topk Document Retrieval
"... Supporting topk document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usag ..."
Abstract

Cited by 6 (4 self)
 Add to MetaCart
Supporting topk document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reducedspace structures that support topk retrieval and propose new alternatives. Our experimental results show that our novel structures and algorithms dominate almost all the space/time tradeoff.
Faster Compact Topk Document Retrieval
"... An optimal index solving topk document retrieval [Navarro and Nekrich, SODA’12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n– 3n bytes, with O(m+(k+log log n) log log n) time, on typical texts. The index is u ..."
Abstract

Cited by 4 (3 self)
 Add to MetaCart
An optimal index solving topk document retrieval [Navarro and Nekrich, SODA’12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n– 3n bytes, with O(m+(k+log log n) log log n) time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5 % more space in practice (and in some cases as little as one half). Apart from replacing classical by compressed data structures, our main idea is to replace suffix tree sampling by frequency thresholding to achieve compression.
Faster Topk Document Retrieval in Optimal Space ⋆
"... Abstract. We consider the problem of retrieving the k documents from a collection of strings where a given pattern P appears most often. We show that, by representing the collection using a Compressed Suffix Array CSA, a data structure using the asymptotically optimal CSA+o(n) bits can answer quer ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
Abstract. We consider the problem of retrieving the k documents from a collection of strings where a given pattern P appears most often. We show that, by representing the collection using a Compressed Suffix Array CSA, a data structure using the asymptotically optimal CSA+o(n) bits can answer queries in the time needed by CSA to find the suffix array interval of the pattern plus O(k lg 2 k lg ɛ n) accesses to suffix array cells, for any constant ɛ> 0. This is lg n / lg k times faster than the only previous solution using optimal space, lg k times slower than the fastest structure that uses twice the space, and lg 2 k lg ɛ n times the lowerbound cost of obtaining k document identifiers from the CSA. To obtain the result we introduce a tool called the sampled document array, which can be of independent interest. 1
Estructuras de datos sucintas para Recuperación De Documentos
, 2012
"... La recuperación de documentos consiste en, dada una colección de documentos y un patrón de consulta, obtener los documentos más relevantes para la consulta. Cuando los documentos están disponibles con anterioridad a las consultas, es posible construir un índice que permita, al momento de realizar la ..."
Abstract
 Add to MetaCart
La recuperación de documentos consiste en, dada una colección de documentos y un patrón de consulta, obtener los documentos más relevantes para la consulta. Cuando los documentos están disponibles con anterioridad a las consultas, es posible construir un índice que permita, al momento de realizar las consultas, obtener documentos relevantes en tiempo razonable. Contar con índices que resuelvan un problema como éste es fundamental en áreas como recuperación de la información, minería de datos y bioinformática, entre otros. Cuando el texto que se indexa es lenguaje natural, la solución paradigmática corresponde al índice invertido. Sin embargo, los problemas de recuperación de documentos emergen también en escenarios en que el texto y los patrones de consulta pueden ser secuencias generales de caracteres, como lenguajes orientales, bases de datos multimedia, secuencias genómicas, etc. En estos escenarios los índices invertidos clásicos no se aplican con el mismo éxito. Si bien existen soluciones que requieren espacio lineal en este escenario de texto general, el espacio que utilizan es un problema importante: estas soluciones pueden utilizar más de 20 veces el espacio de la colección. Esta tesis presenta nuevos algoritmos y estructuras de datos para resolver algunos problemas
A General Document Retrieval in Compact Space
"... Given a collection of documents and a query pattern, document retrieval is the problem of obtaining documents that are relevant to the query. The collection is available beforehand so that a data structure, called an index, can be built on it to speed up queries. While initially restricted to natura ..."
Abstract
 Add to MetaCart
Given a collection of documents and a query pattern, document retrieval is the problem of obtaining documents that are relevant to the query. The collection is available beforehand so that a data structure, called an index, can be built on it to speed up queries. While initially restricted to natural language text collections, document retrieval problems arise nowadays in applications like bioinformatics, multimedia databases and Web mining. This requires a more general setup where text and pattern can be general sequences of symbols, and the classical inverted indexes developed for words cannot be applied. While linearspace timeoptimal solutions have been developed for most interesting queries in this general case, space usage is a serious problem in practice. In this article we develop compact data structures that solve various important document retrieval problems on general text collections. More specifically, we provide practical solutions for listing the documents where a query pattern appears, together with its frequency in each document, and for listing k documents where a query pattern appears most frequently. Some of our techniques build on existing theoretical proposals, while others are new. In particular, we introduce a novel grammarbased compressed bitmap representation that may be of independent interest when dealing with repetitive sequences. Ours are the first practical indexes that use less space when the text collection is compressible. Our experimental results show that, on various reallife text collections, our data structures are significantly smaller than the most spaceefficient previous solutions, using up to half the space without noticeably increasing the query time. Overall, document listing can be carried out in 10 to 40 milliseconds for patterns that appear 100 to 10,000 times in the collection, whereas topk retrieval is carried out in k to 10 k milliseconds.