Colored Range Queries and Document Retrieval
Cited by 17 (9 self)
Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important one-dimensional colored range queries (colored range listing, colored range top-k queries and colored range counting) and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the high-order entropies of the library of documents. We then show how (approximate) colored top-k queries can be reduced to (approximate) range-mode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
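To make the three query types concrete, the following is a minimal Python sketch of colored range listing, counting and top-k over a plain list of colors. It relies on the classic observation from Muthukrishnan's document-listing work: a color occurs in positions i..j exactly when some position p in that range has its previous same-colored position before i. The function names and the linear scans are illustrative assumptions; the abstract's compressed structures and wavelet trees are not reproduced here, and this sketch only defines what the queries return.

```python
from collections import Counter

def build_prev(colors):
    """prev[p] = last position q < p with colors[q] == colors[p], or -1."""
    last = {}
    prev = []
    for p, c in enumerate(colors):
        prev.append(last.get(c, -1))
        last[c] = p
    return prev

def colored_range_listing(colors, prev, i, j):
    """Distinct colors in colors[i..j] (inclusive), by first occurrence."""
    # A position is the first occurrence of its color inside the range
    # exactly when its previous same-colored position lies before i.
    return [colors[p] for p in range(i, j + 1) if prev[p] < i]

def colored_range_counting(prev, i, j):
    """Number of distinct colors in positions i..j (inclusive)."""
    return sum(1 for p in range(i, j + 1) if prev[p] < i)

def colored_range_top_k(colors, i, j, k):
    """The k most frequent colors in colors[i..j], by brute-force counting."""
    counts = Counter(colors[i:j + 1])
    # Ties are broken by color value so the result is deterministic.
    return sorted(counts, key=lambda c: (-counts[c], c))[:k]
```

For example, with `colors = list("abacbca")` and `prev = build_prev(colors)`, the call `colored_range_listing(colors, prev, 2, 5)` lists each color of the substring `acbc` once. Each query here costs time linear in the range length, which is the cost the indexed solutions are designed to avoid.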
Space-Efficient Top-k Document Retrieval
Cited by 6 (4 self)
Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k retrieval and propose new alternatives. Our experimental results show that our novel structures and algorithms dominate almost the entire space/time tradeoff.
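As a reference point for what a top-k document retrieval query returns, here is a brute-force Python sketch that scans every document. The helper names and the tie-breaking rule (lower document index first) are assumptions made for illustration; this is an index-free baseline, not any of the structures studied in the paper.

```python
def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count

def top_k_documents(docs, pattern, k):
    """Return up to k (doc_index, frequency) pairs, most frequent first."""
    hits = []
    for idx, doc in enumerate(docs):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            hits.append((idx, freq))
    # Sort by decreasing frequency, then by document index for stability.
    hits.sort(key=lambda pair: (-pair[1], pair[0]))
    return hits[:k]
```

A call such as `top_k_documents(["banana", "ananas", "bandana", "cherry"], "ana", 2)` scans all four documents. The baseline's cost grows with the total collection size on every query, which is the gap that indexed solutions close.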
Faster Compact Top-k Document Retrieval
Cited by 4 (3 self)
An optimal index solving top-k document retrieval [Navarro and Nekrich, SODA’12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n to 3n bytes, with O(m + (k + log log n) log log n) time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5% more space in practice (and in some cases as little as one half). Apart from replacing classical by compressed data structures, our main idea is to replace suffix tree sampling by frequency thresholding to achieve compression.
Faster Top-k Document Retrieval in Optimal Space
Cited by 2 (2 self)
We consider the problem of retrieving the k documents from a collection of strings where a given pattern P appears most often. We show that, by representing the collection using a Compressed Suffix Array CSA, a data structure using the asymptotically optimal $|CSA| + o(n)$ bits can answer queries in the time needed by CSA to find the suffix array interval of the pattern plus $O(k \lg^2 k \lg^{\epsilon} n)$ accesses to suffix array cells, for any constant $\epsilon > 0$. This is $\lg n / \lg k$ times faster than the only previous solution using optimal space, $\lg k$ times slower than the fastest structure that uses twice the space, and $\lg^2 k \lg^{\epsilon} n$ times the lower-bound cost of obtaining k document identifiers from the CSA. To obtain the result we introduce a tool called the sampled document array, which can be of independent interest.
Sorted Range Reporting
Cited by 2 (0 self)
In this paper we consider a variant of the orthogonal range reporting problem when all points should be reported in the sorted order of their x-coordinates. We show that reporting two-dimensional points with this additional condition can be organized (almost) as efficiently as the standard range reporting. Moreover, our results generalize and improve the previously known results for the orthogonal range successor problem and can be used to obtain better solutions for some stringology problems.
Forbidden Patterns
Cited by 1 (0 self)
We consider the problem of indexing a collection of documents (a.k.a. strings) of total length n such that the following kind of queries are supported: given two patterns $P^+$ and $P^-$, list all $n_{match}$ documents containing $P^+$ but not $P^-$. This is a natural extension of the classic problem of document listing as considered by Muthukrishnan [SODA’02], where only the positive pattern $P^+$ is given. Our main solution is an index of size $O(n^{3/2})$ bits that supports queries in $O(|P^+| + |P^-| + n_{match} + \sqrt{n})$ time.
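The query itself can be stated in a few lines of Python. The sketch below is an unindexed baseline that scans every document, with a hypothetical function name chosen for illustration; it shows only what a forbidden-pattern query returns and does not reflect the paper's index.

```python
def forbidden_pattern_query(docs, p_plus, p_minus):
    """Indices of documents that contain p_plus but do not contain p_minus."""
    return [
        idx
        for idx, doc in enumerate(docs)
        if p_plus in doc and p_minus not in doc
    ]
```

The scan takes time proportional to the total length of the collection per query, whereas the bound quoted in the abstract depends on the pattern lengths, the output size and the square root of n.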
Space-Efficient Data-Analysis Queries on Grids
We consider various data-analysis queries on two-dimensional points. We give new space/time tradeoffs over previous work on geometric queries such as dominance and rectangle visibility, and on semigroup and group queries such as sum, average, variance, minimum and maximum. We also introduce new solutions to queries less frequently considered in the literature such as two-dimensional quantiles, majorities, successor/predecessor, mode, and various top-k queries, considering static and dynamic scenarios.
Compressed Data Structures with Relevance (Invited Keynote)
We describe recent breakthroughs in the field of compressed data structures, in which the data structure is stored in a compressed representation that still allows fast answers to queries. We focus in particular on compressed data structures to support the important application of pattern matching on massive document collections. Given an arbitrary query pattern in textual form, the job of the data structure is to report all the locations where the pattern appears. Another variant is to report all the documents that contain at least one instance of the pattern. We are particularly interested in reporting only the most relevant documents, using a variety of notions of relevance. We discuss recently developed techniques that support fast search in these contexts as well as under additional positional and temporal constraints.
Estructuras de datos sucintas para Recuperación De Documentos (Succinct Data Structures for Document Retrieval), 2012
Document retrieval consists of, given a collection of documents and a query pattern, obtaining the documents most relevant to the query. When the documents are available before the queries, it is possible to build an index that, at query time, returns relevant documents in reasonable time. Indexes that solve such a problem are fundamental in areas such as information retrieval, data mining and bioinformatics, among others. When the indexed text is natural language, the paradigmatic solution is the inverted index. However, document retrieval problems also arise in scenarios where the text and the query patterns can be general sequences of characters, such as Oriental languages, multimedia databases, genomic sequences, etc. In these scenarios classical inverted indexes do not apply with the same success. Although there are solutions that require linear space in this general-text scenario, the space they use is a significant problem: these solutions can use more than 20 times the space of the collection. This thesis presents new algorithms and data structures to solve some problems