Results 1 -
5 of
5
Rank and select revisited and extended
- Workshop on SpaceConscious Algorithms, University of
, 2006
"... The deep connection between the Burrows-Wheeler transform (BWT) and the socalled rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corr ..."
Abstract
-
Cited by 29 (17 self)
- Add to MetaCart
The deep connection between the Burrows-Wheeler transform (BWT) and the socalled rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corresponding prefix of the sequence. Select is the inverse, retrieving the positions of the symbol occurrences. It has been shown that improvements to rank/select algorithms, in combination with the BWT, turn into improved compressed text indexes. This paper is devoted to alternative implementations and extensions of rank and select data structures. First, we show that one can use gap encoding techniques to obtain constant time rank and select queries in essentially the same space as what is achieved by the best current direct solution (and sometimes less). Second, we extend symbol rank and select to substring rank and select, giving several space/time tradeoffs for the problem. An application of these queries is in position-restricted substring searching, where one can specify the range in the text where the search is restricted to, and only occurrences residing in that range are to be reported. In addition, arbitrary occurrences are reported in text position order. Several byproducts of our results display connections with searchable partial sums, Chazelle’s two-dimensional data structures, and Grossi et al.’s wavelet trees.
Scalable Ranked Publish/Subscribe
"... Publish/subscribe (pub/sub) systems are designed to efficiently match incoming events (e.g., stock quotes) against a set of subscriptions (e.g., trader profiles specifying quotes of interest). However, current pub/sub systems only support a simple binary notion of matching: an event either matches a ..."
Abstract
-
Cited by 10 (2 self)
- Add to MetaCart
Publish/subscribe (pub/sub) systems are designed to efficiently match incoming events (e.g., stock quotes) against a set of subscriptions (e.g., trader profiles specifying quotes of interest). However, current pub/sub systems only support a simple binary notion of matching: an event either matches a subscription or it does not; for instance, a stock quote will either match or not match a trader profile. In this paper, we argue that this simple notion of matching is inadequate for many applications where only the “best ” matching subscriptions are of interest. For instance, in targeted Web advertising, an incoming user (“event”) may match several different advertiser-specified user profiles (“subscriptions”), but given the limited advertising real-estate, we want to quickly discover the best (e.g., most relevant) ads to display. To address this need, we initiate a study of ranked pub/sub systems. We focus on the case where subscriptions correspond to interval ranges (e.g, age in [25,35] and salary> $50, 000), and events are points that match all the intervals that they stab (e.g., age=28, salary = $65,000). In addition, each interval has a score and our goal is to quickly recover the top-scoring matching subscriptions. Unfortunately, adapting existing index structures to solve this problem results in either an unacceptable space overhead or a significant performance degradation. We thus propose two novel index structures that are both compact and efficient. Our experimental evaluation shows that the proposed structures provide a scalable basis for designing ranked pub/sub systems. 1.
Space-Efficient Framework for Top-k String Retrieval Problems ∗
"... Abstract — Given a set D = {d1, d2,..., dD} of D strings of total length n, our task is to report the “most relevant ” strings for a given query pattern P. This involves somewhat more advanced query functionality than the usual pattern matching, as some notion of “most relevant ” is involved. In inf ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
Abstract — Given a set D = {d1, d2,..., dD} of D strings of total length n, our task is to report the “most relevant ” strings for a given query pattern P. This involves somewhat more advanced query functionality than the usual pattern matching, as some notion of “most relevant ” is involved. In information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for some predefined set of patterns. In the pattern matching community, the most popular pattern-matching data structures are suffix trees and suffix arrays. However, a typical suffix tree search involves going through all the occurrences of the pattern over the entire string collection, which might be a lot more than the required relevant documents. The first formal framework to study such kind of retrieval problems was given by Muthukrishnan [25]. He considered two metrics for relevance: frequency and proximity. He took a thresholdbased approach on these metrics and gave data structures taking O(n log n) words of space. We study this problem in a slightly different framework of reporting the top k most relevant documents (in sorted order) under similar and more general relevance metrics. Our framework gives linear space data structure with optimal query times for arbitrary score functions. As a corollary, it improves the space utilization for the problems in [25] while maintaining optimal query performance. We also develop compressed variants of these data structures for several specific relevance metrics. Keywords-document retrieval; text indexing; succinct data structures; top-k queries 1.
J.S.: Compression, indexing, and retrieval for massive string data
- Combinatorial Pattern Matching. LNCS
, 2010
"... Abstract. The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and s ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Abstract. The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers. 1
Range Non-Overlapping Indexing
, 909
"... Abstract. We study the non-overlapping indexing problem: Given a text T, preprocess it in order to answer queries of the form: given a pattern P, report the maximal set of non-overlapping occurrences of P in T. A generalization of this problem is the range non-overlapping indexing where in addition ..."
Abstract
- Add to MetaCart
Abstract. We study the non-overlapping indexing problem: Given a text T, preprocess it in order to answer queries of the form: given a pattern P, report the maximal set of non-overlapping occurrences of P in T. A generalization of this problem is the range non-overlapping indexing where in addition we are given two indexes i, j to report the maximal set of non-overlapping occurrences between these two indexes. We suggest new solutions for these problems. For the non-overlapping problem our solution uses O(n) space with query time of O(m+occNO). For the range non-overlapping problem we propose a solution with O(n log ǫ n) space for some 0 < ǫ < 1 and O(m + log log n + occij,NO) query time. 1 Introduction and Related Work Given a text T of length n over an alphabet Σ, the text indexing problem is to build an index on T which can answer pattern matching queries efficiently: Given a pattern P of length m, we want to report all its occurrences in T. There are some known solutions for this problem. For instance, the suffix tree, proposed by

