Results 1 - 7 of 7
Rank and select revisited and extended
 Workshop on Space-Conscious Algorithms, University of
, 2006
Cited by 33 (17 self)
The deep connection between the Burrows-Wheeler transform (BWT) and the so-called rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corresponding prefix of the sequence. Select is the inverse, retrieving the positions of the symbol occurrences. It has been shown that improvements to rank/select algorithms, in combination with the BWT, turn into improved compressed text indexes. This paper is devoted to alternative implementations and extensions of rank and select data structures. First, we show that one can use gap encoding techniques to obtain constant time rank and select queries in essentially the same space as what is achieved by the best current direct solution (and sometimes less). Second, we extend symbol rank and select to substring rank and select, giving several space/time tradeoffs for the problem. An application of these queries is in position-restricted substring searching, where one can specify the range in the text where the search is restricted to, and only occurrences residing in that range are to be reported. In addition, arbitrary occurrences are reported in text position order. Several byproducts of our results display connections with searchable partial sums, Chazelle’s two-dimensional data structures, and Grossi et al.’s wavelet trees.
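To make the rank and select definitions above concrete, here is a minimal, uncompressed sketch. It is not the gap-encoded structure of the paper: the class name `NaiveRankSelect` and the choice of storing one sorted position list per symbol are assumptions made purely for illustration, and queries take logarithmic or constant time rather than the bounds the paper targets.

```python
from bisect import bisect_right
from collections import defaultdict


class NaiveRankSelect:
    """Illustrative rank/select over a symbol sequence (hypothetical helper)."""

    def __init__(self, seq):
        # One sorted list of 0-based positions per symbol.
        self.positions = defaultdict(list)
        for i, symbol in enumerate(seq):
            self.positions[symbol].append(i)

    def rank(self, symbol, i):
        """Number of occurrences of symbol in the prefix seq[0..i], inclusive."""
        return bisect_right(self.positions.get(symbol, []), i)

    def select(self, symbol, j):
        """0-based position of the j-th occurrence (j >= 1), or -1 if absent."""
        occ = self.positions.get(symbol, [])
        return occ[j - 1] if 1 <= j <= len(occ) else -1


rs = NaiveRankSelect("abracadabra")
print(rs.rank("a", 10))   # occurrences of 'a' in the whole string
print(rs.select("a", 3))  # position of the third 'a'
```

Select inverts rank in the sense that `rs.rank(c, rs.select(c, j)) == j` whenever the j-th occurrence of `c` exists.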
Space-Efficient Framework for Top-k String Retrieval Problems
Cited by 25 (3 self)
Given a set D = {d1, d2, ..., dD} of D strings of total length n, our task is to report the “most relevant” strings for a given query pattern P. This involves somewhat more advanced query functionality than the usual pattern matching, as some notion of “most relevant” is involved. In information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for some predefined set of patterns. In the pattern matching community, the most popular pattern-matching data structures are suffix trees and suffix arrays. However, a typical suffix tree search involves going through all the occurrences of the pattern over the entire string collection, which might be a lot more than the required relevant documents. The first formal framework to study such kind of retrieval problems was given by Muthukrishnan [25]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures taking O(n log n) words of space. We study this problem in a slightly different framework of reporting the top k most relevant documents (in sorted order) under similar and more general relevance metrics. Our framework gives linear space data structure with optimal query times for arbitrary score functions. As a corollary, it improves the space utilization for the problems in [25] while maintaining optimal query performance. We also develop compressed variants of these data structures for several specific relevance metrics.
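The problem statement above can be illustrated with a brute-force baseline that scans every document, which is exactly the cost the paper's index is designed to avoid. This sketch assumes frequency as the relevance metric and a non-empty pattern; the function names `count_occurrences` and `top_k_documents` are hypothetical and do not come from the paper.

```python
import heapq


def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of a non-empty pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(docs, pattern, k):
    """Return up to k (doc_index, frequency) pairs, highest frequency first.

    Ties are broken in favour of the lower document index.
    """
    scored = ((count_occurrences(d, pattern), -i) for i, d in enumerate(docs))
    best = heapq.nlargest(k, (pair for pair in scored if pair[0] > 0))
    return [(-neg_index, freq) for freq, neg_index in best]


docs = ["banana", "bandana", "cabana", "apple"]
print(top_k_documents(docs, "ana", 2))
```

The scan costs time proportional to the total length n for every query, whereas the framework described in the abstract answers such queries from a linear-space index.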
Scalable Ranked Publish/Subscribe
Cited by 16 (3 self)
Publish/subscribe (pub/sub) systems are designed to efficiently match incoming events (e.g., stock quotes) against a set of subscriptions (e.g., trader profiles specifying quotes of interest). However, current pub/sub systems only support a simple binary notion of matching: an event either matches a subscription or it does not; for instance, a stock quote will either match or not match a trader profile. In this paper, we argue that this simple notion of matching is inadequate for many applications where only the “best” matching subscriptions are of interest. For instance, in targeted Web advertising, an incoming user (“event”) may match several different advertiser-specified user profiles (“subscriptions”), but given the limited advertising real estate, we want to quickly discover the best (e.g., most relevant) ads to display. To address this need, we initiate a study of ranked pub/sub systems. We focus on the case where subscriptions correspond to interval ranges (e.g., age in [25,35] and salary > $50,000), and events are points that match all the intervals that they stab (e.g., age = 28, salary = $65,000). In addition, each interval has a score and our goal is to quickly recover the top-scoring matching subscriptions. Unfortunately, adapting existing index structures to solve this problem results in either an unacceptable space overhead or a significant performance degradation. We thus propose two novel index structures that are both compact and efficient. Our experimental evaluation shows that the proposed structures provide a scalable basis for designing ranked pub/sub systems.
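The matching model described above (subscriptions as interval ranges, events as points, a score per subscription) can be sketched as a linear scan. This is only a reference baseline, not either of the paper's index structures. The function `top_k_matches`, the tuple layout of a subscription, the use of closed intervals, and the single score per subscription are all simplifying assumptions for illustration.

```python
import heapq
from math import inf


def top_k_matches(subscriptions, event, k):
    """Return names of the k highest-scoring subscriptions stabbed by the event.

    subscriptions: list of (name, {attribute: (low, high)}, score) tuples.
    event: {attribute: value}. Intervals are treated as closed here.
    """
    def matches(intervals):
        return all(
            attr in event and low <= event[attr] <= high
            for attr, (low, high) in intervals.items()
        )

    hits = (
        (score, name)
        for name, intervals, score in subscriptions
        if matches(intervals)
    )
    return [name for score, name in heapq.nlargest(k, hits)]


subs = [
    ("ad1", {"age": (25, 35), "salary": (50000, inf)}, 0.9),
    ("ad2", {"age": (18, 30)}, 0.5),
    ("ad3", {"age": (40, 60)}, 0.99),
]
print(top_k_matches(subs, {"age": 28, "salary": 65000}, 1))
```

Every event touches every subscription in this version, so its cost grows linearly with the number of subscriptions; the abstract's point is that a dedicated index can avoid that scan without a large space overhead.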
Compression, indexing, and retrieval for massive string data
 COMBINATORIAL PATTERN MATCHING. LNCS
, 2010
Cited by 7 (1 self)
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
Space-efficient data structures for top-k completion
 IN: PROCEEDINGS OF THE 22ND WORLD WIDE WEB CONFERENCE (WWW), 2013
Cited by 3 (0 self)
Virtually every modern search application, either desktop, web, or mobile, features some kind of query autocompletion. In its basic form, the problem consists in retrieving from a string set a small number of completions, i.e. strings beginning with a given prefix, that have the highest scores according to some static ranking. In this paper, we focus on the case where the string set is so large that compression is needed to fit the data structure in memory. This is a compelling case for web search engines and social networks, where it is necessary to index hundreds of millions of distinct queries to guarantee a reasonable coverage; and for mobile devices, where the amount of memory is limited. We present three different trie-based data structures to address this problem, each one with different space/time/complexity tradeoffs. Experiments on large-scale datasets show that it is possible to compress the string sets, including the scores, down to spaces competitive with the gzip’ed data, while supporting efficient retrieval of completions at about a microsecond per completion.
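The basic top-k completion query defined above can be sketched with a sorted array in place of a trie. This is an uncompressed stand-in, not one of the paper's three trie-based structures. The class `SortedCompleter` and its method names are hypothetical, and the sketch relies only on the fact that all strings sharing a prefix are contiguous in sorted order.

```python
import heapq
from bisect import bisect_left


class SortedCompleter:
    """Illustrative top-k prefix completion over a static scored string set."""

    def __init__(self, scored_strings):
        # scored_strings: {string: static score}
        self.items = sorted(scored_strings.items())
        self.keys = [s for s, _ in self.items]

    def complete(self, prefix, k):
        """Return up to k strings starting with prefix, highest score first."""
        i = bisect_left(self.keys, prefix)  # first string >= prefix
        candidates = []
        while i < len(self.items) and self.keys[i].startswith(prefix):
            s, score = self.items[i]
            candidates.append((score, s))
            i += 1
        return [s for score, s in heapq.nlargest(k, candidates)]


completer = SortedCompleter({"car": 5, "cart": 9, "care": 7, "dog": 10, "cat": 1})
print(completer.complete("car", 2))
```

This version inspects every string under the prefix before picking the best k, so a short prefix over a large set is slow. Avoiding that enumeration, and storing the strings and scores compressed, is what the structures in the paper address.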
Range Non-Overlapping Indexing
, 909
Abstract. We study the non-overlapping indexing problem: Given a text T, preprocess it in order to answer queries of the form: given a pattern P, report the maximal set of non-overlapping occurrences of P in T. A generalization of this problem is the range non-overlapping indexing where in addition we are given two indexes i, j to report the maximal set of non-overlapping occurrences between these two indexes. We suggest new solutions for these problems. For the non-overlapping problem our solution uses O(n) space with query time of O(m + occ_NO). For the range non-overlapping problem we propose a solution with O(n log^ε n) space for some 0 < ε < 1 and O(m + log log n + occ_{ij,NO}) query time.
1 Introduction and Related Work
Given a text T of length n over an alphabet Σ, the text indexing problem is to build an index on T which can answer pattern matching queries efficiently: Given a pattern P of length m, we want to report all its occurrences in T. There are some known solutions for this problem. For instance, the suffix tree, proposed by
Faster and Smaller Inverted Indices with Treaps
We introduce a new representation of the inverted index that performs faster ranked unions and intersections while using less space. Our index is based on the treap data structure, which allows us to intersect/merge the document identifiers while simultaneously thresholding by frequency, instead of the costlier two-step classical processing methods. To achieve compression we represent the treap topology using compact data structures. Further, the treap invariants allow us to elegantly encode differentially both document identifiers and frequencies. Results show that the space consumption is below 10% of the size of the corpus and the index performs queries up to twice as fast as previous compact representations, which in addition require more space. Modern two-stage (massive filtering / detailed ranking) information retrieval systems would benefit from this boosting of the filtration stage of the query resolution process, which would free more resources for the ranking stage, thus enabling more precise results within a given time budget.