Colored Range Queries and Document Retrieval
Cited by 32 (18 self)
Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important one-dimensional colored range queries (colored range listing, colored range top-k queries, and colored range counting) and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework that includes almost all recent results on colored range listing and document listing, and which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the high-order entropies of the library of documents. We then show how (approximate) colored top-k queries can be reduced to (approximate) range-mode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and in space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
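To make the two simplest queries concrete, here is a minimal sketch of colored range listing and counting based on the well-known "previous occurrence" reduction: a position p in [i, j] holds the first occurrence of its color within the range exactly when the previous occurrence of that color lies before i. This is a plain linear scan for illustration only, not the compressed or wavelet-tree structures from the paper, and the function names are assumptions made for this example.

```python
def build_prev(colors):
    """prev[p] = index of the previous occurrence of colors[p], or -1 if none."""
    last_seen = {}
    prev = []
    for p, c in enumerate(colors):
        prev.append(last_seen.get(c, -1))
        last_seen[c] = p
    return prev


def colored_range_list(colors, prev, i, j):
    """Distinct colors in colors[i..j] (inclusive), in order of first occurrence."""
    return [colors[p] for p in range(i, j + 1) if prev[p] < i]


def colored_range_count(prev, i, j):
    """Number of distinct colors in colors[i..j] (inclusive)."""
    return sum(1 for p in range(i, j + 1) if prev[p] < i)


colors = ["a", "b", "a", "c", "b", "a"]
prev = build_prev(colors)
print(colored_range_list(colors, prev, 1, 4))
print(colored_range_count(prev, 1, 4))
```

Each query here costs time linear in the range length; the point of the data structures in the abstract is to answer the same questions far faster and in compressed space.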
Top-k document retrieval in optimal time and linear space
In Proc. 22nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2012), 2012
Cited by 29 (17 self)
We describe a data structure that uses O(n)-word space and reports the k most relevant documents that contain a query pattern P in optimal O(|P| + k) time. Our construction supports an ample set of important relevance measures, such as the frequency of P in a document and the minimal distance between two occurrences of P in a document. We show how to reduce the space of the data structure from O(n log n) to O(n(log σ + log D + log log n)) bits, where σ is the alphabet size and D is the total number of documents.
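As a point of reference for what the query returns, the following brute-force baseline scans every document and ranks by term frequency, one of the relevance measures mentioned above. It is only a specification of the expected output under that measure, not the O(|P| + k) structure described in the abstract; the function names and the tie-breaking rule (lower document index first) are assumptions made for this example.

```python
def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of a non-empty pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(documents, pattern, k):
    """Return up to k (document index, frequency) pairs, most frequent first."""
    scored = []
    for idx, doc in enumerate(documents):
        freq = count_occurrences(doc, pattern)
        if freq > 0:  # only documents that contain the pattern qualify
            scored.append((idx, freq))
    # Sort by decreasing frequency; break ties by document index.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


docs = ["abracadabra", "banana", "cabbage", "xyz"]
print(top_k_documents(docs, "a", 2))
```

The baseline takes time proportional to the total collection size per query, which is exactly the cost an index is meant to avoid.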
Spaces, trees and colors: The algorithmic landscape of document retrieval on sequences
CoRR
Cited by 14 (9 self)
Document retrieval is one of the best-established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to “natural language” text collections, where inverted indexes are the preferred solution. As successful as this paradigm has been, it fails to properly handle various East Asian languages and other scenarios where the “natural language” assumptions do not hold. In this survey we cover recent research on extending document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and Web mining, chemoinformatics, software engineering, multimedia information retrieval, and many other fields. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and other areas.
Space-Efficient Top-k Document Retrieval
Cited by 14 (8 self)
Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k retrieval and propose new alternatives. Our experimental results show that our novel structures and algorithms dominate almost the entire space/time tradeoff.
Faster Compact Top-k Document Retrieval
Cited by 8 (5 self)
An optimal index solving top-k document retrieval [Navarro and Nekrich, SODA’12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n to 3n bytes, with O(m + (k + log log n) log log n) time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5% more space in practice (and in some cases as little as one half). Apart from replacing classical data structures with compressed ones, our main idea is to replace suffix tree sampling with frequency thresholding to achieve compression.
Faster Top-k Document Retrieval in Optimal Space
Cited by 6 (6 self)
We consider the problem of retrieving the k documents from a collection of strings where a given pattern P appears most often. We show that, by representing the collection using a compressed suffix array (CSA), a data structure using the asymptotically optimal |CSA| + o(n) bits can answer queries in the time needed by the CSA to find the suffix array interval of the pattern plus O(k lg^2 k lg^ε n) accesses to suffix array cells, for any constant ε > 0. This is lg n / lg k times faster than the only previous solution using optimal space, lg k times slower than the fastest structure that uses twice the space, and lg^2 k lg^ε n times the lower-bound cost of obtaining k document identifiers from the CSA. To obtain the result we introduce a tool called the sampled document array, which can be of independent interest.
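The two steps named above, finding the suffix array interval of the pattern and then obtaining document identifiers for that interval, can be sketched with an uncompressed suffix array and a full (unsampled) document array. This is an illustration of the general framework only: it uses a naive construction, does not implement a CSA or the sampled document array, and assumes the separator character "\x00" appears in no document or pattern. All names are chosen for this example.

```python
from collections import Counter


def build_index(documents, sep="\x00"):
    """Naively build the text, suffix array, and document array."""
    text = "".join(doc + sep for doc in documents)
    doc_of = []  # doc_of[p] = document containing text position p
    for idx, doc in enumerate(documents):
        doc_of.extend([idx] * (len(doc) + 1))
    # Naive suffix sorting; fine for a demo, far too slow for real collections.
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    da = [doc_of[p] for p in sa]  # document array, aligned with sa
    return text, sa, da


def sa_interval(text, sa, pattern):
    """Half-open range [start, end) of suffixes that begin with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def top_k_by_frequency(text, sa, da, pattern, k):
    """Top-k (document index, frequency) pairs for pattern."""
    start, end = sa_interval(text, sa, pattern)
    counts = Counter(da[start:end])
    return sorted(counts.items(), key=lambda t: (-t[1], t[0]))[:k]


text, sa, da = build_index(["abracadabra", "banana", "cabbage"])
print(top_k_by_frequency(text, sa, da, "a", 2))
```

Counting over the whole interval takes time proportional to the number of occurrences; the structures in the abstract avoid exactly that scan.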
Linear-space data structures for range frequency queries on arrays and trees
In Proc. MFCS, volume 8087 of LNCS, 2013
Cited by 4 (3 self)
We present O(n)-space data structures to support various range frequency queries on a given array A[0 : n − 1] or tree T with n nodes. Given a query consisting of an arbitrary pair of pre-order rank indices (i, j), our data structures return a least frequent element, mode, or α-minority of the multiset of elements in the unique path with endpoints at indices i and j in A or T. We describe a data structure that supports range least frequent element queries on arrays in O(√(n/w)) time, improving the Θ(√n) worst-case time required by the data structure of Chan et al. (SWAT 2012), where w ∈ Ω(log n) is the word size in bits. We describe a data structure that supports range mode queries on trees in O(log log n · √(n/w)) time, improving the Θ(√n log n) worst-case time required by the data structure of Krizanc et al. (ISAAC 2003). Finally, we describe a data structure that supports range α-minority queries on trees in O(α^(−1) log log n) time, where α ∈ [0, 1] can be specified at query time.
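For readers unfamiliar with the query types, the following sketch states range mode and range least frequent element on an array by direct counting. It serves only as a definition of the expected answers; it takes time linear in the range and has nothing to do with the O(√(n/w))-time structures above. The function names and the (element, frequency) return convention are assumptions made for this example.

```python
from collections import Counter


def range_mode(A, i, j):
    """Return (element, frequency) for a most frequent element of A[i..j]."""
    counts = Counter(A[i:j + 1])
    element = max(counts, key=lambda x: counts[x])
    return element, counts[element]


def range_least_frequent(A, i, j):
    """Return (element, frequency) for a least frequent element that occurs in A[i..j]."""
    counts = Counter(A[i:j + 1])
    element = min(counts, key=lambda x: counts[x])
    return element, counts[element]


A = [3, 1, 3, 2, 1, 3, 2]
print(range_mode(A, 0, 6))
print(range_least_frequent(A, 0, 4))
```

When several elements tie, either function may return any of them, matching the "a least frequent element" wording in the abstract.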
Top-k document retrieval in compact space and near-optimal time
In Proc. 24th Annual International Symposium on Algorithms and Computation
Cited by 4 (4 self)
Let D = {d_1, d_2, ..., d_D} be a given set of D string documents of total length n. Our task is to index D such that the k most relevant documents for an online query pattern P of length p can be retrieved efficiently. There exist linear-space data structures of O(n) words for answering such queries in optimal O(p + k) time. In this paper, we describe a compact index of size |CSA| + n lg D + o(n lg D) bits with near-optimal time, O(p + k lg* n), for the basic relevance metric term-frequency, where |CSA| is the size (in bits) of a compressed full-text index of D, and lg* n is the iterated logarithm of n.
1 Introduction and Related Work
Top-k document retrieval is the problem of preprocessing a text collection so that, given a search pattern P[1..p] and a threshold k, we retrieve the k documents most “relevant” to P, for some definition of relevance. This is the basic problem of search engines and forms the core of the Information Retrieval (IR) field [5].
Top-k Color Queries on Tree Paths
2013
Cited by 1 (1 self)
We present a data structure for the following problem: Given a tree T, with each of its nodes assigned a color in a totally ordered set, preprocess T to efficiently answer queries for the top k distinct colors on the path between two nodes, reporting the colors sorted in descending order. Our data structure requires linear space of O(n) words and answers queries in O(k) time.
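A brute-force version of the query helps pin down what is being asked. The sketch below walks the path between two nodes by climbing parent pointers, collects the distinct colors, and reports the k largest in descending order. It reads "top k" as the k largest colors under the total order, which is an interpretation of the abstract, and it runs in time proportional to the path length rather than O(k). The array-based tree representation (parent, depth, color) and the function names are assumptions made for this example.

```python
def path_nodes(parent, depth, u, v):
    """Nodes on the unique path from u to v, found by climbing toward the LCA."""
    left, right = [], []
    while u != v:
        if depth[u] >= depth[v]:
            left.append(u)
            u = parent[u]
        else:
            right.append(v)
            v = parent[v]
    return left + [u] + right[::-1]


def top_k_colors_on_path(parent, depth, color, u, v, k):
    """The k largest distinct colors on the u-v path, in descending order."""
    distinct = {color[x] for x in path_nodes(parent, depth, u, v)}
    return sorted(distinct, reverse=True)[:k]


# Tree rooted at node 0: 0 has children 1 and 2; 1 has children 3 and 4; 2 has child 5.
parent = [None, 0, 0, 1, 1, 2]
depth = [0, 1, 1, 2, 2, 2]
color = [5, 3, 5, 9, 3, 7]
print(top_k_colors_on_path(parent, depth, color, 3, 5, 2))
```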