Results 1  10
of
20
Compressed fulltext indexes
 ACM COMPUTING SURVEYS
, 2007
"... Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text l ..."
Abstract

Cited by 257 (97 self)
 Add to MetaCart
(Show Context)
Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into selfindexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying selfindexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant selfindexes up to date, focusing on the essential aspects on how they exploit the text compressibility and how they solve efficiently various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
Compressed representations of sequences and fulltext indexes
 ACM Transactions on Algorithms
, 2007
"... Abstract. Given a sequence S = s1s2... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zeroorder empirical entropy of S and nH0(S) pro ..."
Abstract

Cited by 159 (79 self)
 Add to MetaCart
(Show Context)
Abstract. Given a sequence S = s1s2... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zeroorder empirical entropy of S and nH0(S) provides an Information Theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r / log log n) time. Another contribution of this paper is to show how to combine our compressed representation of integer sequences with an existing compression boosting technique to design compressed fulltext indexes that scale well with the size of the input alphabet Σ. Namely, we design a variant of the FMindex that indexes a string T [1, n] within nHk(T) + o(n) bits of storage, where Hk(T) is the kth order empirical entropy of T. This space bound holds simultaneously for all k ≤ α log Σ  n, constant 0 < α < 1, and Σ  = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P [1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log 1+ε n) time, for any constant 0 < ε < 1; and it reports a text substring of length ℓ in O(ℓ + log 1+ε n) time.
Succinct suffix arrays based on runlength encoding
 Nordic Journal of Computing
, 2005
"... A succinct fulltext selfindex is a data structure built on a text T = t1t2...tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2... pm in T, and is able to reproduce any text substring, so the selfindex re ..."
Abstract

Cited by 62 (33 self)
 Add to MetaCart
A succinct fulltext selfindex is a data structure built on a text T = t1t2...tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2... pm in T, and is able to reproduce any text substring, so the selfindex replaces the text. Several remarkable selfindexes have been developed in recent years. Many of those take space proportional to nH0 or nHk bits, where Hk is the kth order empirical entropy of T. The time to count how many times does P occur in T ranges from O(m) to O(m log n). In this paper we present a new selfindex, called RLFM index for “runlength FMindex”, that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)). The RLFM index requires nHk log σ + O(n) bits of space, for any k ≤ α log σ n and constant 0 < α < 1. Previous indexes that achieve O(m) counting time either require more than nH0 bits of space or require that σ = O(1). We also show that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ. In addition, we prove a close relationship between the kth order entropy of the text and some regularities that show up in their suffix arrays and in the BurrowsWheeler transform of T. This relationship is of independent interest and permits bounding the space occupancy of the RLFM index, as well as that of other existing compressed indexes. Finally, we present some practical considerations in order to implement the RLFM index, obtaining two implementations with different spacetime tradeoffs. We empirically compare our indexes against the best existing implementations and show that they are practical and competitive against those. 1
Compressed Text Indexes with Fast Locate
"... Abstract. Compressed text (self)indexes have matured up to a point where they can replace a text by a data structure that requires less space and, in addition to giving access to arbitrary text passages, support indexed text searches. At this point those indexes are competitive with traditional tex ..."
Abstract

Cited by 33 (16 self)
 Add to MetaCart
(Show Context)
Abstract. Compressed text (self)indexes have matured up to a point where they can replace a text by a data structure that requires less space and, in addition to giving access to arbitrary text passages, support indexed text searches. At this point those indexes are competitive with traditional text indexes (which are very large) for counting the number of occurrences of a pattern in the text. Yet, they are still hundreds to thousands of times slower when it comes to locating those occurrences in the text. In this paper we introduce a new compression scheme for suffix arrays which permits locating the occurrences extremely fast, while still being much smaller than classical indexes. In addition, our index permits a very efficient secondary memory implementation, where compression permits reducing the amount of I/O needed to answer queries. 1 Introduction and Related Work Compressed text indexing has become a popular alternative to cope with the problem of giving indexed access to large text collections without using up too much space. Reducing space is important because it gives one the chance of maintaining the whole collection in main memory. The current trend in compressed indexing is fulltext compressed selfindexes [14, 1, 4, 15, 13, 2]. Such a selfindex (for short) replaces the
Storage and Retrieval of Highly Repetitive Sequence Collections ∗
"... A repetitive sequence collection is a set of sequences which are small variations of each other. A prominent example are genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis ..."
Abstract

Cited by 30 (16 self)
 Add to MetaCart
A repetitive sequence collection is a set of sequences which are small variations of each other. A prominent example are genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, the suffix tree occupies much space, which very soon inhibits inmemory analyses. Recent advances in fulltext indexing reduce the space of the suffix tree to, essentially, that of the compressed sequences, while retaining its functionality with only a polylogarithmic slowdown. However, the underlying compression model considers only the predictability of the next sequence symbol given the k previous ones, where k is a small integer. This is unable to capture longerterm repetitiveness. For example, r identical copies of an incompressible sequence will be incompressible under this model. We develop new static and dynamic fulltext indexes that are able of capturing the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations. The new indexes can be plugged into a recent dynamic fullycompressed suffix tree, achieving full functionality for sequence analysis, while retaining the reduced space and the polylogarithmic slowdown. Our experimental results confirm the practicality of our proposal.
Runlength compressed indexes are superior for highly repetitive sequence collections
 In Proc. 15th SPIRE, LNCS 5280
, 2008
"... Abstract. A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can ..."
Abstract

Cited by 18 (8 self)
 Add to MetaCart
(Show Context)
Abstract. A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. This paper is devoted to studying ways to store massive sets of highly repetitive sequence collections in spaceefficient manner so that retrieval of the content as well as queries on the content of the sequences can be provided timeefficiently. We show that the stateoftheart entropybound fulltext selfindexes do not yet provide satisfactory space bounds for this specific task. We engineer some new structures that use runlength encoding and give empirical evidence that these structures are superior to the current structures. 1
Compression, indexing, and retrieval for massive string data
 COMBINATORIAL PATTERN MATCHING. LNCS
, 2010
"... The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as t ..."
Abstract

Cited by 12 (2 self)
 Add to MetaCart
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the wellknown technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/Oefficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
A LempelZiv text index on secondary storage
 IN PROC. CPM, LNCS 4580
, 2007
"... Fulltext searching consists in locating the occurrences of a given pattern P[1..m] in a text T[1..u], both sequences over an alphabet of size σ. In this paper we define a new index for fulltext searching on secondary storage, based on the LempelZiv compression algorithm and requiring 8uHk +o(u lo ..."
Abstract

Cited by 11 (6 self)
 Add to MetaCart
(Show Context)
Fulltext searching consists in locating the occurrences of a given pattern P[1..m] in a text T[1..u], both sequences over an alphabet of size σ. In this paper we define a new index for fulltext searching on secondary storage, based on the LempelZiv compression algorithm and requiring 8uHk +o(u log σ) bits of space, where Hk denotes the kth order empirical entropy of T, for any k = o(log σ u). Our experimental results show that our index is significantly smaller than any other practical secondarymemory data structure: 1.4–2.3 times the text size including the text, which means 39%–65 % the size of traditional indexes like String Btrees [Ferragina and Grossi, JACM 1999]. In exchange, our index requires more disk access to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to count pattern occurrences, the space can be reduced to about 1.04–1.68 times the text size, requiring about 20–60 disk accesses, depending on the pattern length.
A Compressed Text Index on Secondary Memory
"... Abstract. We introduce a practical diskbased compressed text index that, when the text is compressible, takes much less space than the suffix array. It provides very good I/O times for searching, which in particular improve when the text is compressible. In this aspect our index is unique, as compr ..."
Abstract

Cited by 9 (1 self)
 Add to MetaCart
(Show Context)
Abstract. We introduce a practical diskbased compressed text index that, when the text is compressible, takes much less space than the suffix array. It provides very good I/O times for searching, which in particular improve when the text is compressible. In this aspect our index is unique, as compressed indexes have been slower than their classical counterparts on secondary memory. We analyze our index and show experimentally that it is extremely competitive on compressible texts. 1 Introduction and Related Work Compressed fulltext selfindexing [22] is a recent trend that builds on the discovery that traditional text indexes like suffix trees and suffix arrays can be compacted to take space proportional to the compressed text size, and moreover be able to reproduce any text context. Therefore selfindexes replace the text,
Parallel and Distributed Compressed Indexes ⋆
"... Abstract. We study parallel and distributed compressed indexes. Compressed indexes are a new and functional way to index text strings. They exploit the compressibility of the text, so that their size is a function of the compressed text size. Moreover, they support a considerable amount of functions ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
(Show Context)
Abstract. We study parallel and distributed compressed indexes. Compressed indexes are a new and functional way to index text strings. They exploit the compressibility of the text, so that their size is a function of the compressed text size. Moreover, they support a considerable amount of functions, more than many classical indexes. We make use of this extended functionality to obtain, in a sharedmemory parallel machine, nearoptimal speedups for solving several stringology problems. We also show how to distribute compressed indexes across several machines. 1 Introduction and Related Work Suffix trees are extremely important for a large number of string processing problems, in particular in bioinformatics, where large DNA and protein sequences are analyzed. This partnership has produced several important results, but it has also exposed the main shortcoming of suffix trees. Their large space requirements,