Results 1 - 4 of 4
Spaces, trees and colors: The algorithmic landscape of document retrieval on sequences
 CoRR
Cited by 14 (9 self)
Document retrieval is one of the best established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to “natural language” text collections, where inverted indexes are the preferred solution. As successful as this paradigm has been, it fails to properly handle various East Asian languages and other scenarios where the “natural language” assumptions do not hold. In this survey we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and Web mining, chemoinformatics, software engineering, multimedia information retrieval, and many other fields. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and other areas.
Space-Efficient Top-k Document Retrieval
Cited by 13 (8 self)
Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k retrieval and propose new alternatives. Our experimental results show that our novel structures and algorithms dominate almost all of the space/time tradeoff.
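To make the query concrete, the following is a minimal brute-force sketch of top-k retrieval by term frequency. It is not one of the structures studied in the paper: it scans every document at query time, whereas the indexed solutions avoid that. The function names, the choice to count overlapping occurrences, the tie-breaking rule, and the decision to skip documents with no occurrence are all assumptions made for illustration.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlapping ones included."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are also counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_by_frequency(docs, pattern, k):
    """Return up to k (doc_index, frequency) pairs, most frequent first.

    Assumption: documents where the pattern never occurs are not reported,
    and ties are broken in favor of the lower document index.
    """
    scored = []
    for index, text in enumerate(docs):
        freq = count_occurrences(text, pattern)
        if freq > 0:
            scored.append((freq, index))
    best = heapq.nlargest(k, scored, key=lambda pair: (pair[0], -pair[1]))
    return [(index, freq) for freq, index in best]


if __name__ == "__main__":
    collection = ["abracadabra", "banana", "cabbage", "aaaa"]
    print(top_k_by_frequency(collection, "a", 2))
```

The scan costs time proportional to the total collection size per query, which is the cost the linear-space and reduced-space indexes are designed to avoid.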
Bottom-k Document Retrieval
We consider the problem of retrieving the k documents from a collection of strings where a given pattern P appears least often. This has potential applications in data mining, bioinformatics, security, and big data. We show that adapting the classical linear-space solutions for this problem is trivial, but the compressed-space solutions are not easy to extend. We design a new solution for this problem that matches the best-known result when using 2|CSA| + o(n) bits, where CSA is a Compressed Suffix Array. Our structure answers queries in the time needed by the CSA to find the suffix array interval of the pattern plus O(k lg k lg^ε n) accesses to suffix array cells, for any constant ε > 0.
Top-k Term-Proximity in Succinct Space
Let D = {T_1, T_2, ..., T_D} be a collection of D string documents of n characters in total, that are drawn from an alphabet set Σ = [σ]. The top-k document retrieval problem is to preprocess D into a data structure that, given a query (P[1..p], k), can return the k documents of D most relevant to pattern P. The relevance is captured using a predefined ranking function, which depends on the set of occurrences of P in T_d. For example, it can be the term frequency (i.e., the number of occurrences of P in T_d), or it can be the term proximity (i.e., the distance between the closest pair of occurrences of P in T_d), or a pattern-independent importance score of T_d such as PageRank. Linear space and optimal query time solutions already exist for this problem. Compressed and compact space solutions are also known, but only for a few ranking functions such as term frequency and importance. However, space-efficient data structures for term-proximity-based retrieval have been evasive. In this paper we present the first sublinear space data structure for this relevance function, which uses only o(n) bits on top of any compressed suffix array of D and solves queries in time O((p + k) polylog n).
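The term-proximity ranking function defined above can be illustrated with a short brute-force sketch. This is not the succinct data structure of the paper; it only spells out the definition. The function names are hypothetical, and two points are assumptions: a smaller closest-pair distance is treated as more relevant, and documents with fewer than two occurrences of P are left out because their proximity is undefined.

```python
def occurrences(text: str, pattern: str) -> list:
    """Return the sorted starting positions of pattern in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def term_proximity(text: str, pattern: str):
    """Distance between the closest pair of occurrences, or None.

    Positions are sorted, so the closest pair is always adjacent in the
    list and one pass over consecutive pairs is enough.
    """
    positions = occurrences(text, pattern)
    if len(positions) < 2:
        return None
    return min(b - a for a, b in zip(positions, positions[1:]))


def top_k_by_proximity(docs, pattern, k):
    """Return up to k (doc_index, proximity) pairs, closest pair first.

    Assumption: ties are broken in favor of the lower document index.
    """
    scored = []
    for index, text in enumerate(docs):
        proximity = term_proximity(text, pattern)
        if proximity is not None:
            scored.append((proximity, index))
    scored.sort()
    return [(index, proximity) for proximity, index in scored[:k]]


if __name__ == "__main__":
    collection = ["ab..ab", "abab", "ab", "ab....ab"]
    print(top_k_by_proximity(collection, "ab", 2))
```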