Results 1-10 of 14
Space-Efficient Top-k Document Retrieval
Cited by 13 (8 self)
Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k retrieval and propose new alternatives. Our experimental results show that our novel structures and algorithms dominate almost all the space/time tradeoff.
From theory to practice: Plug and play with succinct data structures
In SEA 2014, 2014
Cited by 10 (3 self)
Engineering efficient implementations of compact and succinct structures is a time-consuming and challenging task, since there is no standard library of easy-to-use, highly optimized, and composable components. One consequence is that measuring the practical impact of new theoretical proposals is a difficult task, since older baseline implementations may not rely on the same basic components, and reimplementing from scratch can be very time-consuming. In this paper we present a framework for experimentation with succinct data structures, providing a large set of configurable components, together with tests, benchmarks, and tools to analyze resource requirements. We demonstrate the functionality of the framework by recomposing succinct solutions for document retrieval.
Faster Top-k Document Retrieval in Optimal Space
Cited by 6 (6 self)
We consider the problem of retrieving the k documents from a collection of strings where a given pattern P appears most often. We show that, by representing the collection using a Compressed Suffix Array (CSA), a data structure using the asymptotically optimal $|\mathrm{CSA}| + o(n)$ bits can answer queries in the time needed by the CSA to find the suffix array interval of the pattern plus $O(k \lg^2 k \lg^{\epsilon} n)$ accesses to suffix array cells, for any constant $\epsilon > 0$. This is $\lg n / \lg k$ times faster than the only previous solution using optimal space, $\lg k$ times slower than the fastest structure that uses twice the space, and $\lg^2 k \lg^{\epsilon} n$ times the lower-bound cost of obtaining k document identifiers from the CSA. To obtain the result we introduce a tool called the sampled document array, which can be of independent interest.
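To make the query in this abstract concrete, here is a minimal Python sketch of the brute-force baseline: find the suffix array interval of the pattern, then count document identifiers inside it. It is not the paper's structure. It uses a plain (uncompressed) suffix array built naively, and the names `build_index`, `sa_interval`, `top_k` and the separator `SEP` are illustrative assumptions.

```python
from collections import Counter

SEP = "\x00"  # assumed separator; must not occur in any document or pattern


def build_index(docs):
    """Naive suffix array and document array over the concatenated collection."""
    text = "".join(d + SEP for d in docs)
    # doc_of[p] = id of the document that owns text position p
    doc_of = [i for i, d in enumerate(docs) for _ in range(len(d) + 1)]
    # Quadratic-ish construction, acceptable for a toy example only
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    da = [doc_of[p] for p in sa]  # document array: da[i] = document of suffix sa[i]
    return text, sa, da


def sa_interval(text, sa, pattern):
    """Half-open range [lo, hi) of suffix array cells whose suffix starts with pattern."""
    m = len(pattern)

    def first(pred):
        # Binary search for the first cell whose length-m prefix satisfies pred
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(text[sa[mid]:sa[mid] + m]):
                hi = mid
            else:
                lo = mid + 1
        return lo

    return first(lambda s: s >= pattern), first(lambda s: s > pattern)


def top_k(docs, pattern, k):
    """Return up to k (doc_id, frequency) pairs, most frequent first, ties by doc id."""
    text, sa, da = build_index(docs)
    lo, hi = sa_interval(text, sa, pattern)
    freq = Counter(da[lo:hi])  # one document array entry per occurrence
    return sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


docs = ["abracadabra", "banana", "cabana"]
print(top_k(docs, "ana", 2))
```

The scan over `da[lo:hi]` costs time proportional to the number of occurrences, which is exactly the cost that the indexes surveyed in these entries try to avoid.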
Orthogonal range searching for text indexing
In Space-Efficient Data Structures, Streams, and Algorithms, 2013
Cited by 5 (2 self)
Text indexing, the problem in which one desires to preprocess a (usually large) text for future (shorter) queries, has been researched ever since the suffix tree was invented in the early 70’s. With textual data continuing to increase and with changes in the way it is accessed, new data structures and new algorithmic methods are continuously required. Therefore, text indexing is of utmost importance and is a very active research domain. Orthogonal range searching, classically associated with the computational geometry community, is one of the tools that have become increasingly important for various text indexing applications. Initially, in the mid 90’s, there were a couple of results recognizing this connection. In the last few years we have seen an increase in the use of this method and are reaching a deeper understanding of the uses of range searching for text indexing. In this monograph we survey some of these results.
A Lempel-Ziv Compressed Structure for Document Listing
Cited by 3 (3 self)
Document listing is the problem of preprocessing a set of sequences, called documents, so that later, given a short string called the pattern, we retrieve the documents where the pattern appears. While optimal-time and linear-space solutions exist, the current emphasis is on reducing the space requirements. Current document listing solutions build on compressed suffix arrays. This paper is the first attempt to solve the problem using a Lempel-Ziv compressed index of the text collections. We show that the resulting solution is very fast to output most of the resulting documents, taking more time for the final ones. This makes this index particularly useful for interactive scenarios or when listing some documents is sufficient. Yet, it also offers a competitive space/time tradeoff when returning the full answers.
Document Retrieval on Repetitive Collections
Cited by 3 (2 self)
Document retrieval aims at finding the most important documents where a pattern appears in a collection of strings. Traditional pattern-matching techniques yield brute-force document retrieval solutions, which has motivated the research on tailored indexes that offer near-optimal performance. However, an experimental study establishing which alternatives are actually better than brute force, and which perform best depending on the collection characteristics, has not been carried out. In this paper we address this shortcoming by exploring the relationship between the nature of the underlying collection and the performance of current methods. Via extensive experiments we show that established solutions are often beaten in practice by brute-force alternatives. We also design new methods that offer superior time/space tradeoffs, particularly on repetitive collections.
Ranked document selection
In SWAT, 2014
Cited by 1 (1 self)
Let D be a collection of string documents of n characters in total. The top-k document retrieval problem is to preprocess D into a data structure that, given a query (P, k), can return the k documents of D most relevant to pattern P. The relevance of a document d for a pattern P is given by a predefined ranking function w(P, d). Linear space and optimal query time solutions already exist for this problem. In this paper we consider a novel problem, document selection queries, which aim to report the k-th document most relevant to P (instead of reporting all top-k documents). We present a data structure using $O(n \log^{\epsilon} n)$ space, for any constant $\epsilon > 0$, answering selection queries in time $O(\log k / \log\log n)$, and a linear-space data structure answering queries in time $O(\log k)$, given the locus node of P in a (generalized) suffix tree of D. We also prove that it is unlikely that a succinct-space solution for this problem exists with polylogarithmic query time.
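As a point of reference for the selection query defined above, the following Python sketch answers it by brute force: score every document, sort, and pick the k-th. It assumes term frequency as the ranking function w(P, d), which the abstract leaves open, and the names `count_occurrences` and `select_kth` are hypothetical. It says nothing about how the paper's data structures work.

```python
def count_occurrences(doc, pattern):
    """Term frequency: number of (possibly overlapping) occurrences of pattern in doc."""
    count = 0
    pos = doc.find(pattern)
    while pos != -1:
        count += 1
        pos = doc.find(pattern, pos + 1)  # step by one to allow overlaps
    return count


def select_kth(docs, pattern, k, w=count_occurrences):
    """Return the id of the k-th most relevant document (k is 1-based).

    Documents with score 0 are ignored; ties are broken by smaller document id.
    Returns None when fewer than k documents have a positive score.
    """
    scored = ((w(d, pattern), i) for i, d in enumerate(docs))
    ranked = sorted(((s, i) for s, i in scored if s > 0),
                    key=lambda t: (-t[0], t[1]))
    if 1 <= k <= len(ranked):
        return ranked[k - 1][1]
    return None


docs = ["abracadabra", "banana", "cabana"]
print(select_kth(docs, "ana", 2))
```

Here the whole collection is rescanned per query, so the cost grows with n; the abstract's bounds instead depend only on k once the locus of P is known.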
Document Counting in Compressed Space
We address the problem of counting the number of strings in a collection where a given pattern appears, which has applications in information retrieval and data mining. Existing solutions are in a theoretical stage. In this paper we implement these solutions and explore compressed variants, aiming to reduce data structure size. Our main result is to uncover some unexpected compressibility properties of the fastest known data structure for the problem. By taking advantage of these properties, we can reduce the size of the structure by a factor of 5–400, depending on the dataset.
Bottom-k Document Retrieval
We consider the problem of retrieving the k documents from a collection of strings where a given pattern P appears least often. This has potential applications in data mining, bioinformatics, security, and big data. We show that adapting the classical linear-space solutions for this problem is trivial, but the compressed-space solutions are not easy to extend. We design a new solution for this problem that matches the best-known result when using $2|\mathrm{CSA}| + o(n)$ bits, where CSA is a Compressed Suffix Array. Our structure answers queries in the time needed by the CSA to find the suffix array interval of the pattern plus $O(k \lg k \lg^{\epsilon} n)$ accesses to suffix array cells, for any constant $\epsilon > 0$.
Document Counting in Practice
We address the problem of counting the number of strings in a collection where a given pattern appears, which has applications in information retrieval and data mining. Existing solutions are in a theoretical stage. We implement these solutions and develop some new variants, comparing them experimentally on various datasets. Our results not only show which are the best options for each situation and help discard practically unappealing solutions, but also uncover some unexpected compressibility properties of the best data structures. By taking advantage of these properties, we can reduce the size of the structures by a factor of 5–400, depending on the dataset.
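The counting problem stated in this abstract can be illustrated with a small Python sketch. It is a toy baseline, not one of the structures the paper implements. It uses the predecessor-array idea known from document listing: inside the suffix array interval of the pattern, a cell marks the first occurrence of its document exactly when the previous cell with the same document lies before the interval. The names `build_arrays`, `count_documents` and the separator `SEP` are assumptions made for illustration.

```python
SEP = "\x00"  # assumed separator; must not occur in any document or pattern


def build_arrays(docs):
    """Naive suffix array plus predecessor array over the document array."""
    text = "".join(d + SEP for d in docs)
    doc_of = [i for i, d in enumerate(docs) for _ in range(len(d) + 1)]
    sa = sorted(range(len(text)), key=lambda p: text[p:])  # toy construction
    da = [doc_of[p] for p in sa]
    # prev[i] = largest j < i with da[j] == da[i], or -1 if there is none
    prev, last_seen = [], {}
    for i, d in enumerate(da):
        prev.append(last_seen.get(d, -1))
        last_seen[d] = i
    return text, sa, prev


def count_documents(docs, pattern):
    """Number of distinct documents in which pattern occurs."""
    text, sa, prev = build_arrays(docs)
    # Linear scan for the matching cells; they are contiguous in the suffix array
    matches = [i for i in range(len(sa)) if text.startswith(pattern, sa[i])]
    if not matches:
        return 0
    lo, hi = matches[0], matches[-1] + 1
    # A cell is the first occurrence of its document in [lo, hi) iff prev[i] < lo
    return sum(1 for i in range(lo, hi) if prev[i] < lo)


docs = ["abracadabra", "banana", "cabana"]
print(count_documents(docs, "ana"))
```

This sketch still touches every occurrence of the pattern, so it only shows what is being counted, not how to count it in time independent of the number of occurrences.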