Spaces, trees and colors: The algorithmic landscape of document retrieval on sequences
 CoRR
Cited by 14 (9 self)
Document retrieval is one of the best established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to “natural language” text collections, where inverted indexes are the preferred solution. As successful as this paradigm has been, it fails to properly handle various East Asian languages and other scenarios where the “natural language” assumptions do not hold. In this survey we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and Web mining, chemoinformatics, software engineering, multimedia information retrieval, and many other fields. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and other areas.
Space-Efficient Top-k Document Retrieval
Cited by 13 (8 self)
Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k retrieval and propose new alternatives. Our experimental results show that our novel structures and algorithms dominate almost all the space/time tradeoff.
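As a point of reference for the query defined in the abstract above (the k documents where a pattern occurs most frequently), here is a minimal sketch of a naive scan in Python. It is not one of the structures studied in the paper; the function names, the decision to count overlapping occurrences, and the tie-breaking by document identifier are assumptions made for illustration only.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlapping ones included."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are also found.
        start = text.find(pattern, start + 1)
    return count


def top_k_by_frequency(documents, pattern, k):
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Documents where the pattern does not occur are never reported.
    Ties are broken in favour of the smaller document identifier
    (an assumption, the abstract does not specify a tie rule).
    """
    hits = []
    for doc_id, doc in enumerate(documents):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            hits.append((freq, doc_id))
    best = heapq.nlargest(k, hits, key=lambda h: (h[0], -h[1]))
    return [(doc_id, freq) for freq, doc_id in best]
```

This scan reads every document for every query, so its cost grows with the total collection length n rather than with the pattern length p and k, which is the gap the indexed solutions aim to close.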
Top-k document retrieval in compact space and near-optimal time
 In Proc. 24th Annual International Symposium on Algorithms and Computation
Cited by 4 (4 self)
Abstract. Let D = {d1, d2, ..., dD} be a given set of D string documents of total length n. Our task is to index D such that the k most relevant documents for an online query pattern P of length p can be retrieved efficiently. There exist linear-space data structures of O(n) words for answering such queries in optimal O(p + k) time. In this paper, we describe a compact index of size |CSA| + n lg D + o(n lg D) bits with near-optimal time, O(p + k lg* n), for the basic relevance metric term frequency, where |CSA| is the size (in bits) of a compressed full-text index of D, and lg* n is the iterated logarithm of n.
1 Introduction and Related Work
Top-k document retrieval is the problem of preprocessing a text collection so that, given a search pattern P[1..p] and a threshold k, we retrieve the k documents most “relevant” to P, for some definition of relevance. This is the basic problem of search engines and forms the core of the Information Retrieval (IR) field [5].
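The time bound quoted in the abstract above depends on the iterated logarithm of n, written lg* n. A minimal Python sketch of the standard definition (the number of times lg must be applied before the value drops to 1 or below) shows how slowly it grows; the function name is an assumption for illustration.

```python
import math


def iterated_log2(n: float) -> int:
    """Return lg* n: how many times log2 is applied until the value is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count
```

For example, 65536 reduces as 65536, 16, 4, 2, 1, so its iterated logarithm is 4. For any collection size that fits in memory the factor is a very small constant in practice.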
Document Retrieval on Repetitive Collections
Cited by 3 (2 self)
Abstract. Document retrieval aims at finding the most important documents where a pattern appears in a collection of strings. Traditional pattern-matching techniques yield brute-force document retrieval solutions, which has motivated the research on tailored indexes that offer near-optimal performance. However, an experimental study establishing which alternatives are actually better than brute force, and which perform best depending on the collection characteristics, has not been carried out. In this paper we address this shortcoming by exploring the relationship between the nature of the underlying collection and the performance of current methods. Via extensive experiments we show that established solutions are often beaten in practice by brute-force alternatives. We also design new methods that offer superior time/space tradeoffs, particularly on repetitive collections.
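One way to picture a brute-force solution built on pattern matching, as mentioned in the abstract above, is to locate all occurrences of the pattern through a suffix array and then tally the documents they fall in. The Python sketch below is only an illustration under stated assumptions: it uses a naive suffix array construction, a hypothetical separator character assumed absent from documents and patterns, and names (`build_index`, `suffix_interval`, `brute_force_top_k`) that do not come from the paper.

```python
from collections import Counter

SEPARATOR = "\x00"  # assumption: never occurs in a document or a pattern


def build_index(documents):
    """Concatenate the documents and build a (naive) suffix array over them."""
    text = "".join(doc + SEPARATOR for doc in documents)
    # doc_of[i] is the identifier of the document covering text position i.
    doc_of = []
    for doc_id, doc in enumerate(documents):
        doc_of.extend([doc_id] * (len(doc) + 1))
    # Sorting whole suffixes is quadratic or worse; fine for a small demo only.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suffix_array, doc_of


def suffix_interval(text, suffix_array, pattern):
    """Return [lo, hi): the suffix array range of suffixes starting with pattern."""
    p = len(pattern)

    def boundary(past_equal):
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[suffix_array[mid]:suffix_array[mid] + p]
            if prefix < pattern or (past_equal and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return boundary(False), boundary(True)


def brute_force_top_k(index, pattern, k):
    """List every occurrence, count them per document, keep the k largest counts."""
    text, suffix_array, doc_of = index
    lo, hi = suffix_interval(text, suffix_array, pattern)
    freq = Counter(doc_of[suffix_array[i]] for i in range(lo, hi))
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]
```

The query cost here is proportional to the number of occurrences of the pattern, not to the number of distinct documents reported, which is the inefficiency that tailored indexes try to avoid.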
General document retrieval in compact space
 ACM Journal of Experimental Algorithmics
General Terms
Score-safe index processing has received a great deal of attention over the last two decades. By precalculating maximum term impacts during indexing, the number of scoring operations can be minimized, and the top-k documents for a query can be located efficiently. However, these methods often ignore the importance of the effectiveness gains possible when using sequential dependency models. We present a hybrid approach which leverages score-safe processing and suffix-based self-indexing structures in order to provide efficient and effective top-k document retrieval.
New Space/Time Tradeoffs for Top-k Document Retrieval on Sequences
We address the problem of indexing a collection D = {T1, T2, ..., TD} of D string documents of total length n, so that we can efficiently answer top-k queries: retrieve k documents most relevant to a pattern P of length p given at query time. There exist linear-space data structures, that is, using O(n) words, that answer such queries in optimal O(p + k) time for an ample set of notions of relevance. However, using linear space is not sufficiently good for large text collections. In this paper we explore how far the space/time tradeoff for this problem can be pushed. We obtain three results: (1) When relevance is measured as term frequency (number of times P appears in a document Ti), an index occupying |CSA| + o(n) bits answers the query in time O(t_search(p) + k lg^2 k lg^ε n), where CSA is a compressed suffix array indexing D, t_search is its time to find the suffix array interval of P, and ε > 0 is any constant. (2) With the same measure of relevance, an index occupying |CSA| + n lg D + o(n lg σ + n lg D) bits answers the query in time O(t_search(p) + k lg* k), where lg* k is the iterated logarithm of k. (3) When the relevance depends only on the documents, an index occupying |CSA| + O(n lg lg n) bits answers the query in O(t_search(p) + k t_SA) time, where t_SA is the time the CSA needs to retrieve a suffix array cell. On our way, we obtain some other results of independent interest.