Results 1  10
of
33
Compressed fulltext indexes
 ACM COMPUTING SURVEYS
, 2007
"... Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text l ..."
Abstract

Cited by 173 (78 self)
 Add to MetaCart
Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into selfindexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying selfindexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant selfindexes up to date, focusing on the essential aspects on how they exploit the text compressibility and how they solve efficiently various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
K.: Practical EntropyCompressed Rank/Select Dictionary
 Proceedings of ALENEX’07, ACM
, 2007
"... Rank/Select dictionaries are data structures for an ordered set S ⊂ {0, 1,..., n − 1} to compute rank(x, S) (the number of elements in S which are no greater than x), and select(i, S) (the ith smallest element in S), which are the fundamental components of succinct data structures of strings, trees ..."
Abstract

Cited by 51 (1 self)
 Add to MetaCart
Rank/Select dictionaries are data structures for an ordered set S ⊂ {0, 1,..., n − 1} to compute rank(x, S) (the number of elements in S which are no greater than x), and select(i, S) (the ith smallest element in S), which are the fundamental components of succinct data structures of strings, trees, graphs, etc. In those data structures, however, only asymptotic behavior has been considered and their performance for real data is not satisfactory. In this paper, we propose novel four Rank/Select dictionaries, esp, recrank, vcode and sdarray, each of which is small if the number of elements in S is small, and indeed close to nH0(S) (H0(S) ≤ 1 is the zeroth order empirical entropy of S) in practice, and its query time is superior to the previous ones. Experimental results reveal the characteristics of our data structures and also show that these data structures are superior to existing implementations in both size and query time. 1
S.S.: Succinct indexes for strings, binary relations, and multilabeled trees
 In: Proc. SODA
, 2007
"... We define and design succinct indexes for several abstract data types (ADTs). The concept is to design auxiliary data structures that ideally occupy asymptotically less space than the informationtheoretic lower bound on the space required to encode the given data, and support an extended set of ope ..."
Abstract

Cited by 42 (11 self)
 Add to MetaCart
We define and design succinct indexes for several abstract data types (ADTs). The concept is to design auxiliary data structures that ideally occupy asymptotically less space than the informationtheoretic lower bound on the space required to encode the given data, and support an extended set of operations using the basic operators defined in the ADT. The main advantage of succinct indexes as opposed to succinct (integrated data/index) encodings is that we make assumptions only on the ADT through which the main data is accessed, rather than the way in which the data is encoded. This allows more freedom in the encoding of the main data. In this paper, we present succinct indexes for various data types, namely strings, binary relations and multilabeled trees. Given the support for the interface of the ADTs of these data types, we can support various useful operations efficiently by constructing succinct indexes for them. When the operators in the ADTs are supported in constant time, our results are comparable to previous results, while allowing more flexibility in the encoding of the given data. Usingourtechniques,wedesignasuccinctencodingthatrepresentsastringoflengthnoveranalphabetof size σ using nHk(S)+lgσ·o(n)+O ( nlgσ lglglgσ) bits to support access/rank/select operations in o((lglgσ)1+ɛ) time, for any fixed constant ɛ> 0. We also design a succinct text index using nH0(S)+O ( nlgσ) bits that lglgσ
Rank and select revisited and extended
 Workshop on SpaceConscious Algorithms, University of
, 2006
"... The deep connection between the BurrowsWheeler transform (BWT) and the socalled rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corr ..."
Abstract

Cited by 33 (17 self)
 Add to MetaCart
The deep connection between the BurrowsWheeler transform (BWT) and the socalled rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corresponding prefix of the sequence. Select is the inverse, retrieving the positions of the symbol occurrences. It has been shown that improvements to rank/select algorithms, in combination with the BWT, turn into improved compressed text indexes. This paper is devoted to alternative implementations and extensions of rank and select data structures. First, we show that one can use gap encoding techniques to obtain constant time rank and select queries in essentially the same space as what is achieved by the best current direct solution (and sometimes less). Second, we extend symbol rank and select to substring rank and select, giving several space/time tradeoffs for the problem. An application of these queries is in positionrestricted substring searching, where one can specify the range in the text where the search is restricted to, and only occurrences residing in that range are to be reported. In addition, arbitrary occurrences are reported in text position order. Several byproducts of our results display connections with searchable partial sums, Chazelle’s twodimensional data structures, and Grossi et al.’s wavelet trees.
A simple storage scheme for strings achieving entropy bounds
, 2007
"... We propose a storage scheme for a string S[1, n], drawn from an alphabet Σ, that requires space close to the kth order empirical entropy of S, and allows to retrieve any ℓlong substring of S in optimal O(1 + ..."
Abstract

Cited by 33 (5 self)
 Add to MetaCart
We propose a storage scheme for a string S[1, n], drawn from an alphabet Σ, that requires space close to the kth order empirical entropy of S, and allows to retrieve any ℓlong substring of S in optimal O(1 +
Compressed representations of permutations, and applications
 SYMPOSIUM ON THEORETICAL ASPECTS OF COMPUTER SCIENCE
"... We explore various techniques to compress a permutation π over n integers, taking advantage of ordered subsequences in π, while supporting its application π(i) and the application of its inverse π −1 (i) in small time. Our compression schemes yield several interesting byproducts, in many cases mat ..."
Abstract

Cited by 19 (11 self)
 Add to MetaCart
We explore various techniques to compress a permutation π over n integers, taking advantage of ordered subsequences in π, while supporting its application π(i) and the application of its inverse π −1 (i) in small time. Our compression schemes yield several interesting byproducts, in many cases matching, improving or extending the best existing results on applications such as the encoding of a permutation in order to support iterated applications π k (i) of it, of integer functions, and of inverted lists and suffix arrays.
SpaceEfficient Preprocessing Schemes for Range Minimum Queries on Static Arrays
, 2009
"... Given a static array of n totally ordered object, the range minimum query problem is to build an additional data structure that allows to answer subsequent online queries of the form “what is the position of a minimum element in the subarray ranging from i to j? ” efficiently. We focus on two sett ..."
Abstract

Cited by 19 (2 self)
 Add to MetaCart
Given a static array of n totally ordered object, the range minimum query problem is to build an additional data structure that allows to answer subsequent online queries of the form “what is the position of a minimum element in the subarray ranging from i to j? ” efficiently. We focus on two settings, where (1) the input array is available at query time, and (2) the input array is only available at construction time. In setting (1), we show new data structures (a) of n c(n) (2 + o(1)) bits and query time O(c(n)), or (b) with O(nHk) + o(n) bits and O(1) query size time, where Hk denotes the empirical entropy of k’th order of the input array. In setting (2), we give a data structure of optimal size 2n + o(n) bits and query time O(1). All data structures can be constructed in linear time and almost inplace.
Broadword Implementation of Rank/Select Queries
"... Research on succinct data structures (data structures occupying space close to the informationtheoretical lower bound, but achieving speed similar to their standard counterparts) has steadily increased in the last few years. However, many theoretical constructions providing asymptotically optimal bo ..."
Abstract

Cited by 17 (5 self)
 Add to MetaCart
Research on succinct data structures (data structures occupying space close to the informationtheoretical lower bound, but achieving speed similar to their standard counterparts) has steadily increased in the last few years. However, many theoretical constructions providing asymptotically optimal bounds are unusable in practise because of the very large constants involved. The study of practical implementations of the basic building blocks of such data structures is thus fundamental to obtain practical applications. In this paper we argue that 64bit and wider architectures are particularly suited to very efficient implementations of rank (counting the number of ones up to a given position) and select (finding the position of the ith bit set), two essential building blocks of all succinct data structures. Contrarily to typical 32bit approaches, involving precomputed tables, we use pervasively broadword (a.k.a. SWAR—“SIMD in A Register”) programming, which compensates the constant burden associated to succinct structures by solving problems in parallel in a register. We provide an implementation named rank9 that addresses 2 64 bits, consumes much less space and is significantly faster then current stateoftheart 32bit implementations, and a companion select9 structure that selects in nearly constant time using only access to aligned data. For sparsely populated arrays, we provide a broadword implementation of Okanohara & Sadakane’s sarray data structure [OS07]. In doing so, we develop broadword algorithms for performing selection in a word or in a sequence of words that are of independent interest. 1
Colored Range Queries and Document Retrieval
"... Colored range queries are a wellstudied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important onedimensional colored range queries — colore ..."
Abstract

Cited by 17 (9 self)
 Add to MetaCart
Colored range queries are a wellstudied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important onedimensional colored range queries — colored range listing, colored range topk queries and colored range counting — and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the highorder entropies of the library of documents. We then show how (approximate) colored topk queries can be reduced to (approximate) rangemode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
Theory and Practise of Monotone Minimal Perfect Hashing
"... Minimal perfect hash functions have been shown to be useful to compress data in several data management tasks. In particular, orderpreserving minimal perfect hash functions [12] have been used to retrieve the position of a key in a given list of keys: however, the ability to preserve any given orde ..."
Abstract

Cited by 13 (6 self)
 Add to MetaCart
Minimal perfect hash functions have been shown to be useful to compress data in several data management tasks. In particular, orderpreserving minimal perfect hash functions [12] have been used to retrieve the position of a key in a given list of keys: however, the ability to preserve any given order leads to an unavoidable �(n log n) lower bound on the number of bits required to store the function. Recently, it was observed [1] that very frequently the keys to be hashed are sorted in their intrinsic (i.e., lexicographical) order. This is typically the case of dictionaries of search engines, list of URLs of web graphs, etc. We refer to this restricted version of the problem as monotone minimal perfect hashing. We analyse experimentally the data structures proposed in [1], and along our way we propose some new methods that, albeit asymptotically equivalent or worse, perform very well in practise, and provide a balance between access speed, ease of construction, and space usage. 1