Results 1-10 of 13
Spaces, trees and colors: The algorithmic landscape of document retrieval on sequences
CoRR
Cited by 14 (9 self)
Document retrieval is one of the best established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to “natural language” text collections, where inverted indexes are the preferred solution. As successful as this paradigm has been, it fails to properly handle various East Asian languages and other scenarios where the “natural language” assumptions do not hold. In this survey we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and Web mining, chemoinformatics, software engineering, multimedia information retrieval, and many other fields. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and other areas.
Efficient in-memory top-k document retrieval
2012
Cited by 9 (2 self)
For over forty years the dominant data structure for ranked document retrieval has been the inverted index. Inverted indexes are effective for a variety of document retrieval tasks, and particularly efficient for large data collection scenarios that require disk access and storage. However, many efficiency-bound search tasks can now easily be supported entirely in-memory as a result of recent hardware advances. In this paper we present a hybrid algorithmic framework for in-memory bag-of-words ranked document retrieval using a self-index derived from the FM-Index, wavelet tree, and the compressed suffix tree data structures, and evaluate the various algorithmic tradeoffs for performing efficient queries entirely in-memory. We compare our approach with two classic approaches to bag-of-words queries using inverted indexes, term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing. We show that our framework is competitive with state-of-the-art indexing structures, and describe new capabilities provided by our algorithms that can be leveraged by future systems to improve effectiveness and efficiency for a variety of fundamental search operations.
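Illustrative sketch (not taken from the paper): the two classic inverted-index strategies named in this abstract, TAAT and DAAT, can be contrasted on a toy index. The function names, the whitespace tokenizer, and the scoring rule (a plain sum of term frequencies) are assumptions made for the example; real systems typically use a weighted similarity measure and compressed posting lists.

```python
import heapq
from collections import Counter, defaultdict
from itertools import groupby


def build_index(docs):
    """Map each term to a posting list of (doc_id, term_frequency), sorted by doc_id."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(text.split()).items():
            index[term].append((doc_id, tf))
    return index


def rank(scores, k):
    """Return the k best (doc_id, score) pairs; ties go to the smaller doc_id."""
    return heapq.nsmallest(k, scores, key=lambda pair: (-pair[1], pair[0]))


def taat(index, query, k):
    """Term-at-a-time: traverse one posting list completely before the next."""
    accumulators = defaultdict(int)  # one partial score per candidate document
    for term in query:
        for doc_id, tf in index.get(term, []):
            accumulators[doc_id] += tf
    return rank(accumulators.items(), k)


def daat(index, query, k):
    """Document-at-a-time: finish scoring one document before the next."""
    postings = [index[term] for term in query if term in index]
    heap = []  # min-heap holding the k best (score, -doc_id) entries seen so far
    # Merging the doc_id-sorted lists brings all postings of a document together.
    for doc_id, group in groupby(heapq.merge(*postings), key=lambda p: p[0]):
        score = sum(tf for _, tf in group)
        heapq.heappush(heap, (score, -doc_id))
        if len(heap) > k:
            heapq.heappop(heap)
    return rank([(-neg_id, score) for score, neg_id in heap], k)
```

Both functions return the same ranking. TAAT needs an accumulator for every candidate document, whereas DAAT keeps only the current document's score and a heap of size k.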
Faster Compact Top-k Document Retrieval
Cited by 8 (5 self)
An optimal index solving top-k document retrieval [Navarro and Nekrich, SODA’12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n–3n bytes, with O(m + (k + log log n) log log n) time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5% more space in practice (and in some cases as little as one half). Apart from replacing classical by compressed data structures, our main idea is to replace suffix tree sampling by frequency thresholding to achieve compression.
Faster Top-k Document Retrieval in Optimal Space
Cited by 6 (6 self)
We consider the problem of retrieving the k documents from a collection of strings where a given pattern P appears most often. We show that, by representing the collection using a Compressed Suffix Array (CSA), a data structure using the asymptotically optimal $|CSA| + o(n)$ bits can answer queries in the time needed by the CSA to find the suffix array interval of the pattern plus $O(k \lg^2 k \lg^\epsilon n)$ accesses to suffix array cells, for any constant $\epsilon > 0$. This is $\lg n / \lg k$ times faster than the only previous solution using optimal space, $\lg k$ times slower than the fastest structure that uses twice the space, and $\lg^2 k \lg^\epsilon n$ times the lower-bound cost of obtaining k document identifiers from the CSA. To obtain the result we introduce a tool called the sampled document array, which can be of independent interest.
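Illustrative sketch (not the paper's data structure): the problem stated in this abstract can be solved by brute force with a plain suffix array. The code finds the suffix array interval of the pattern, maps every cell in it to its document, and counts. The names `build`, `sa_interval`, and `top_k_documents`, and the separator `SEP`, are assumptions for the example; the quadratic-time suffix sort is only suitable for tiny inputs.

```python
from collections import Counter

SEP = "\x00"  # assumption: this symbol occurs in no document and no pattern


def build(docs):
    """Concatenate the documents and build a (naive) suffix array over the result."""
    text = SEP.join(docs) + SEP
    # doc_of[p] is the document that owns text position p (separator included).
    doc_of = [d for d, doc in enumerate(docs) for _ in range(len(doc) + 1)]
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, doc_of


def sa_interval(text, sa, pattern):
    """Return [start, end) of the suffixes that begin with pattern (binary search)."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-symbol prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-symbol prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def top_k_documents(docs, pattern, k):
    """Return up to k (doc_id, frequency) pairs, most frequent first."""
    text, sa, doc_of = build(docs)
    start, end = sa_interval(text, sa, pattern)
    freq = Counter(doc_of[sa[i]] for i in range(start, end))
    return sorted(freq.items(), key=lambda pair: (-pair[1], pair[0]))[:k]
```

This baseline touches every occurrence of the pattern, so its query cost grows with the number of occurrences rather than with k, which is the cost that indexes of the kind described in the abstract aim to avoid.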
Top-k document retrieval in compact space and near-optimal time
In Proc. 24th Annual International Symposium on Algorithms and Computation
Cited by 4 (4 self)
Abstract. Let $\mathcal{D} = \{d_1, d_2, \ldots, d_D\}$ be a given set of D string documents of total length n. Our task is to index $\mathcal{D}$ such that the k most relevant documents for an online query pattern P of length p can be retrieved efficiently. There exist linear space data structures of O(n) words for answering such queries in optimal O(p + k) time. In this paper, we describe a compact index of size $|CSA| + n \lg D + o(n \lg D)$ bits with near optimal time, $O(p + k \lg^* n)$, for the basic relevance metric term-frequency, where $|CSA|$ is the size (in bits) of a compressed full-text index of $\mathcal{D}$, and $\lg^* n$ is the iterated logarithm of n.
1 Introduction and Related Work
Top-k document retrieval is the problem of preprocessing a text collection so that, given a search pattern P[1..p] and a threshold k, we retrieve the k documents most “relevant” to P, for some definition of relevance. This is the basic problem of search engines and forms the core of the Information Retrieval (IR) field [5].
Document Retrieval on Repetitive Collections
Cited by 3 (2 self)
Abstract. Document retrieval aims at finding the most important documents where a pattern appears in a collection of strings. Traditional pattern-matching techniques yield brute-force document retrieval solutions, which has motivated the research on tailored indexes that offer near-optimal performance. However, an experimental study establishing which alternatives are actually better than brute force, and which perform best depending on the collection characteristics, has not been carried out. In this paper we address this shortcoming by exploring the relationship between the nature of the underlying collection and the performance of current methods. Via extensive experiments we show that established solutions are often beaten in practice by brute-force alternatives. We also design new methods that offer superior time/space tradeoffs, particularly on repetitive collections.
Estructuras de datos sucintas para Recuperación De Documentos [Succinct data structures for document retrieval]
2012
Document retrieval consists of, given a collection of documents and a query pattern, obtaining the documents most relevant to the query. When the documents are available before the queries, it is possible to build an index that allows relevant documents to be obtained in reasonable time when the queries are made. Having indexes that solve a problem like this is fundamental in areas such as information retrieval, data mining, and bioinformatics, among others. When the indexed text is natural language, the paradigmatic solution is the inverted index. However, document retrieval problems also arise in scenarios where the text and the query patterns may be general character sequences, such as East Asian languages, multimedia databases, genomic sequences, etc. In these scenarios classical inverted indexes do not apply with the same success. Although there are solutions that require linear space in this general-text scenario, the space they use is a significant problem: these solutions can use more than 20 times the space of the collection. This thesis presents new algorithms and data structures to solve some problems
Bottom-k Document Retrieval
Abstract. We consider the problem of retrieving the k documents from a collection of strings where a given pattern P appears least often. This has potential applications in data mining, bioinformatics, security, and big data. We show that adapting the classical linear-space solutions for this problem is trivial, but the compressed-space solutions are not easy to extend. We design a new solution for this problem that matches the best-known result when using $2|CSA| + o(n)$ bits, where CSA is a Compressed Suffix Array. Our structure answers queries in the time needed by the CSA to find the suffix array interval of the pattern plus $O(k \lg k \lg^\epsilon n)$ accesses to suffix array cells, for any constant $\epsilon > 0$.
Document Counting in Practice
Abstract. We address the problem of counting the number of strings in a collection where a given pattern appears, which has applications in information retrieval and data mining. Existing solutions are in a theoretical stage. We implement these solutions and develop some new variants, comparing them experimentally on various datasets. Our results not only show which are the best options for each situation and help discard practically unappealing solutions, but also uncover some unexpected compressibility properties of the best data structures. By taking advantage of these properties, we can reduce the size of the structures by a factor of 5–400, depending on the dataset.
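Illustrative sketch (not one of the structures evaluated in the paper): document counting can be phrased over a document array, where the number of distinct documents in a suffix array interval equals the number of cells whose previous occurrence of the same document falls before the interval. This predecessor-array idea is a swapped-in classical technique, and all function names, the separator symbol, and the linear scan used to locate the interval are assumptions chosen for brevity.

```python
def document_array(docs, sep="\x00"):
    """Build text, naive suffix array, and document array (sep is assumed unused)."""
    text = sep.join(docs) + sep
    doc_of = [d for d, doc in enumerate(docs) for _ in range(len(doc) + 1)]
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, [doc_of[p] for p in sa]


def predecessor_array(da):
    """prev[i] = largest j < i with da[j] == da[i], or -1 if there is none."""
    last_seen = {}
    prev = []
    for i, doc in enumerate(da):
        prev.append(last_seen.get(doc, -1))
        last_seen[doc] = i
    return prev


def count_documents(docs, pattern):
    """Count the documents that contain pattern at least once."""
    text, sa, da = document_array(docs)
    prev = predecessor_array(da)
    # Suffixes starting with pattern are contiguous in the suffix array;
    # a linear scan replaces the usual binary search to keep the sketch short.
    matches = [i for i in range(len(sa)) if text.startswith(pattern, sa[i])]
    if not matches:
        return 0
    lo, hi = matches[0], matches[-1]
    # Each document is counted once, at its first cell inside [lo, hi].
    return sum(1 for i in range(lo, hi + 1) if prev[i] < lo)
```

The final sum still scans the whole interval; practical structures replace it with a precomputed or compressed representation so that the count does not depend on the number of occurrences.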
Rank, select and access in grammar-compressed strings
2014
Given a string S of length N on a fixed alphabet of σ symbols, a grammar compressor produces a context-free grammar G of size n that generates S and only S. In this paper we describe data structures to support the following operations on a grammar-compressed string: $\mathrm{rank}_c(S, i)$ (return the number of occurrences of symbol c before position i in S); $\mathrm{select}_c(S, i)$ (return the position of the ith occurrence of c in S); and access(S, i, j) (return substring S[i, j]). For rank and select we describe data structures of size $O(n\sigma \log N)$ bits that support the two operations in $O(\log N)$ time. We propose another structure that uses $O(n\sigma \log(N/n) (\log N)^{1+\epsilon})$ bits and that supports the two queries in $O(\log N / \log\log N)$ time, where $\epsilon > 0$ is an arbitrary constant. To our knowledge, we are the first to study the asymptotic complexity of rank and select in the grammar-compressed setting, and we provide a hardness result showing that significantly improving the bounds we achieve would imply a major breakthrough on a hard graph-theoretical problem. Our main result for access is a method that requires $O(n \log N)$ bits of space and $O(\log N + m / \log_\sigma N)$ time to extract m = j − i + 1 consecutive symbols from S. Alternatively, we can achieve $O(\log N / \log\log N + m / \log_\sigma N)$ query time using $O(n \log(N/n) (\log N)^{1+\epsilon})$ bits of space. This matches a lower bound stated by Verbin and Yu for strings where N is polynomially related to n.
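Illustrative sketch (not the paper's data structures): the three operations can be supported on a toy straight-line program by storing, for every nonterminal, its expansion length and per-symbol counts, and then descending the parse tree. The class `SLP`, the rule encoding, 0-based positions, and the reading of rank as "occurrences in S[0:i]" are assumptions for the example. Query time here is proportional to the grammar height, not the bounds quoted in the abstract, and `access` returns a single symbol rather than a substring.

```python
class SLP:
    """Toy straight-line program: each rule is a terminal char or a (left, right) pair."""

    def __init__(self, rules, start):
        self.rules, self.start = rules, start
        self.length, self.counts = {}, {}
        self._prepare(start)

    def _prepare(self, x):
        """Store the expansion length and per-symbol counts of nonterminal x."""
        if x in self.length:
            return
        rule = self.rules[x]
        if isinstance(rule, str):
            self.length[x], self.counts[x] = 1, {rule: 1}
            return
        left, right = rule
        self._prepare(left)
        self._prepare(right)
        self.length[x] = self.length[left] + self.length[right]
        merged = dict(self.counts[left])
        for c, n in self.counts[right].items():
            merged[c] = merged.get(c, 0) + n
        self.counts[x] = merged

    def access(self, i):
        """Return the symbol S[i] (0-based)."""
        x = self.start
        while not isinstance(self.rules[x], str):
            left, right = self.rules[x]
            if i < self.length[left]:
                x = left
            else:
                i, x = i - self.length[left], right
        return self.rules[x]

    def rank(self, c, i):
        """Return the number of occurrences of c in S[0:i]."""
        x, total = self.start, 0
        while i > 0:
            rule = self.rules[x]
            if isinstance(rule, str):  # single symbol and i == 1
                return total + (rule == c)
            left, right = rule
            if i <= self.length[left]:
                x = left
            else:  # the whole left child lies inside the prefix
                total += self.counts[left].get(c, 0)
                i, x = i - self.length[left], right
        return total

    def select(self, c, j):
        """Return the 0-based position of the j-th (1-based) occurrence of c."""
        if not 1 <= j <= self.counts[self.start].get(c, 0):
            raise ValueError("no such occurrence")
        x, pos = self.start, 0
        while not isinstance(self.rules[x], str):
            left, right = self.rules[x]
            in_left = self.counts[left].get(c, 0)
            if j <= in_left:
                x = left
            else:
                j, pos, x = j - in_left, pos + self.length[left], right
        return pos
```

For example, the rules `{"A": "a", "B": "b", "C": ("A", "B"), "D": ("C", "C"), "E": ("D", "A"), "S": ("E", "D")}` with start symbol `"S"` generate the string `ababaabab`. Storing one counter per symbol per nonterminal is what makes the space of this naive approach grow with both n and σ.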