Results 1-10 of 39
Compressed full-text indexes
ACM Computing Surveys, 2007
Cited by 172 (79 self)
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it has triggered a wealth of activity, produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes to date, focusing on the essential aspects of how they exploit the text compressibility and how they efficiently solve various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
Improved dynamic rank-select entropy-bound structures
In Proc. of the Latin American Theoretical Informatics (LATIN)
Cited by 12 (2 self)
Operations rank and select over a sequence of symbols have many applications to the design of succinct and compressed data structures to manage text collections, structured text, binary relations, trees, graphs, and so on. We are interested in the case where the collections can be updated via insertions and deletions of symbols. Two current solutions stand out as the best in the tradeoff of space versus time (considering all the operations). One, by Mäkinen and Navarro, achieves compressed space (i.e., $nH_0 + o(n \log \sigma)$ bits) and $O(\log n \log \sigma)$ worst-case time for all the operations, where $n$ is the sequence length, $\sigma$ is the alphabet size, and $H_0$ is the zero-order entropy of the sequence. The other solution, by Lee and Park, achieves $O(\log n (1 + \frac{\log \sigma}{\log \log n}))$ amortized time and uncompressed space, i.e., $n \log \sigma + O(n) + o(n \log \sigma)$ bits. In this paper we show that the best of both worlds can be achieved. We combine the solutions to obtain $nH_0 + o(n \log \sigma)$ bits of space and $O(\log n (1 + \frac{\log \sigma}{\log \log n}))$ worst-case time for all the operations. Apart from the best current solution, we obtain some byproducts that might be
Faster Lightweight Suffix Array Construction
Cited by 10 (3 self)
The suffix array is a data structure formed by sorting the suffixes of a string into lexicographical order. It is important for a variety of applications, perhaps most notably pattern matching, pattern discovery and block-sorting data compression. The last decade has seen intensive research toward efficient construction of suffix arrays, with algorithms striving not only to be fast but also “lightweight” (in the sense that they use small working memory). In this paper we describe a new lightweight suffix array construction algorithm. By exploiting several interesting properties of suffixes in combination with cache-conscious programming we achieve excellent runtimes. Extensive experiments show our approach to be faster than all other known algorithms for the task.
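As a point of reference for the definition above, a suffix array can be sketched by sorting suffix start positions directly. This is a naive illustration only, not the lightweight algorithm the abstract describes; the function name `suffix_array` is chosen here for illustration, and the approach takes far more time and memory than any practical construction algorithm.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of `text`,
    ordered by the lexicographic order of the suffixes.

    Naive sketch: each comparison may scan a whole suffix, so this is
    only suitable for small inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    # Suffixes of "banana" in sorted order:
    # a(5), ana(3), anana(1), banana(0), na(4), nana(2)
    print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```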
Compressed permuterm index
In Proceedings 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Cited by 9 (5 self)
The Permuterm index (Garfield, 1976) is a time-efficient and elegant solution to the string dictionary problem in which pattern queries may include one wildcard symbol (the Tolerant Retrieval problem). Unfortunately, the Permuterm index is space-inefficient because it quadruples the dictionary size. In this paper we propose the Compressed Permuterm Index, which solves the Tolerant Retrieval problem in time proportional to the length of the searched pattern and in space close to the $k$-th order empirical entropy of the indexed dictionary. We also design a dynamic version of this index that efficiently supports insertion into, and deletion from, the dictionary of individual strings. The result is based on a simple variant of the Burrows-Wheeler Transform defined on a dictionary of strings of variable length, which allows the Tolerant Retrieval problem to be solved efficiently via known (dynamic) compressed indexes [17]. We complement our theoretical study with a rich set of experiments which show that the Compressed Permuterm Index supports fast queries within a space occupancy close to that achievable by compressing the string dictionary via gzip or bzip2. This improves known approaches based on Front-Coding [19] by more than 50% in absolute space occupancy, while still guaranteeing comparable query time.
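The uncompressed Permuterm idea that the abstract starts from can be sketched as follows: store every rotation of each word followed by an end marker, then turn a one-wildcard query `X*Y` into a prefix search for `Y$X`. This is a minimal sketch of the classic scheme, not the compressed index proposed in the paper; the names `build_permuterm` and `query`, and the choice of `$` as end marker, are assumptions made for illustration.

```python
from bisect import bisect_left

END = "$"  # assumed end-of-word marker; must not occur inside any word


def build_permuterm(words):
    """Return a sorted list of (rotation, word) pairs covering every
    rotation of word + END. A word of length m contributes m + 1 entries,
    which is the source of the space blow-up."""
    entries = []
    for w in words:
        t = w + END
        for i in range(len(t)):
            entries.append((t[i:] + t[:i], w))
    entries.sort()
    return entries


def query(entries, pattern):
    """Answer a pattern containing exactly one '*' wildcard."""
    prefix, suffix = pattern.split("*")
    key = suffix + END + prefix  # rotate so the wildcard falls at the end
    i = bisect_left(entries, (key,))  # first rotation >= key
    hits = set()
    while i < len(entries) and entries[i][0].startswith(key):
        hits.add(entries[i][1])
        i += 1
    return sorted(hits)


if __name__ == "__main__":
    index = build_permuterm(["hello", "help", "halo", "world"])
    print(query(index, "h*o"))   # ['halo', 'hello']
    print(query(index, "hel*"))  # ['hello', 'help']
```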
Improving Suffix Array Locality for Fast Pattern Matching on Disk
2008
Cited by 8 (2 self)
The suffix tree (or equivalently, the enhanced suffix array) provides efficient solutions to many problems involving pattern matching and pattern discovery in large strings, such as those arising in computational biology. Here we address the problem of arranging a suffix array on disk so that querying is fast in practice. We show that the combination of a small trie and a suffix array-like blocked data structure allows queries to be answered as much as three times faster than the best alternative disk-based suffix array arrangement. Construction of our data structure requires only modest processing time on top of that required to build the suffix tree, and requires negligible extra memory.
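For context, the basic in-memory query that such disk layouts aim to accelerate is a binary search over the suffix array. The sketch below shows only that baseline and assumes the whole text and array fit in memory; it does not model the trie or the blocked on-disk structure from the abstract. The helper names are chosen for illustration.

```python
def suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of `pattern` in `text`, in increasing order.

    Two binary searches locate the contiguous range of suffixes whose
    first len(pattern) characters equal the pattern.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

Each probe of `sa[mid]` followed by a read of `text` at that position touches an unrelated location, which is the locality problem a disk-oriented arrangement has to address.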
A Four-Stage Algorithm for Updating a Burrows-Wheeler Transform
2009
Cited by 5 (4 self)
We present a four-stage algorithm that updates the Burrows-Wheeler Transform of a text $T$ when this text is modified. The Burrows-Wheeler Transform is used by many text compression applications and some self-index data structures. It operates by reordering the letters of a text $T$ to obtain a new text $bwt(T)$ which can be better compressed. Even if recent advances are offering this structure new applications, a major bottleneck still exists: $bwt(T)$ has to be entirely reconstructed from scratch whenever $T$ is modified. We study how standard edit operations (insertion, deletion, substitution of a letter or a factor) that transform a text $T$ into $T'$ affect $bwt(T)$. We then present an algorithm that directly converts $bwt(T)$ into $bwt(T')$. Based on this algorithm, we also sketch a method for converting the suffix array of $T$ into the suffix array of $T'$. We finally show, based on the experiments we conducted, that this algorithm, whose worst-case time complexity is $O(|T| \log |T| (1 + \log \sigma / \log \log |T|))$, performs really well in practice and advantageously replaces the traditional approach.
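To make the bottleneck concrete, here is a minimal sketch of the from-scratch computation that an update algorithm is meant to avoid: build all rotations of the text, sort them, and read off the last column. It is the textbook definition, not the four-stage update algorithm of the paper; the function name `bwt` and the `$` terminator are assumptions for illustration.

```python
def bwt(text, end="$"):
    """Burrows-Wheeler Transform by sorting all rotations of text + end.

    Assumes `end` does not occur in `text` and sorts before every other
    character. Quadratic space, so suitable for small examples only.
    """
    t = text + end
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    print(bwt("banana"))  # annb$aa
    # After a one-letter edit, the naive route is to recompute everything.
    print(bwt("bandana"))
```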
Rank/Select on Dynamic Compressed Sequences and Applications
2008
Cited by 5 (2 self)
Operations rank and select over a sequence of symbols have many applications to the design of succinct and compressed data structures managing text collections, structured text, binary relations, trees, graphs, and so on. We are interested in the case where the collections can be updated via insertions and deletions of symbols. Two current solutions stand out as the best in the tradeoff of space versus time (when considering all the operations). One solution, by Mäkinen and Navarro, achieves compressed space (i.e., $nH_0 + o(n \log \sigma)$ bits) and $O(\log n \log \sigma)$ worst-case time for all the operations, where $n$ is the sequence length, $\sigma$ is the alphabet size, and $H_0$ is the zero-order entropy of the sequence. The other solution, by Lee and Park, achieves $O(\log n (1 + \frac{\log \sigma}{\log \log n}))$ amortized time and uncompressed space, i.e., $n \log_2 \sigma + O(n) + o(n \log \sigma)$ bits. In this paper we show that the best of both worlds can be achieved. We combine the solutions to obtain $nH_0 + o(n \log \sigma)$ bits of space and $O(\log n (1 + \frac{\log \sigma}{\log \log n}))$ worst-case time for all the operations. Apart from the best current solution to the problem, we obtain several byproducts of independent interest applicable to partial sums, text indexes, suffix arrays, the Burrows-Wheeler transform, and others.
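The operations and the entropy measure named above can be pinned down with a small reference sketch. It uses plain linear scans, so it shows what the operations return, not how the succinct structures in the abstract achieve their bounds. The conventions are assumptions made here: `rank` counts occurrences in the prefix `seq[0:i]`, `select` is 1-based in `j` and returns a 0-based position, and `h0` is the usual empirical zero-order entropy in bits per symbol.

```python
import math
from collections import Counter


def rank(seq, c, i):
    """Number of occurrences of symbol c in seq[0:i]."""
    return seq[:i].count(c)


def select(seq, c, j):
    """0-based position of the j-th occurrence of c (j >= 1), or -1."""
    pos = -1
    for _ in range(j):
        pos = seq.find(c, pos + 1)
        if pos == -1:
            return -1
    return pos


def h0(seq):
    """Empirical zero-order entropy of seq, in bits per symbol."""
    n = len(seq)
    return sum(cnt / n * math.log2(n / cnt) for cnt in Counter(seq).values())


if __name__ == "__main__":
    s = "abracadabra"
    print(rank(s, "a", 4))    # 2
    print(select(s, "a", 3))  # 5
    print(h0("aabb"))         # 1.0
```

With a plain string, inserting or deleting a symbol costs time linear in the sequence length, which is the cost the dynamic structures are designed to avoid.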
On the Number of Elements to Reorder When Updating a Suffix Array
2011
Cited by 3 (2 self)
New algorithms have recently appeared for updating the Burrows-Wheeler transform or the suffix array when the text they index is modified. These algorithms proceed by reordering entries, and the number of such reordered entries may be as high as the length of the text. However, in practice, these algorithms are faster for updating the Burrows-Wheeler transform or the suffix array than the fastest reconstruction algorithms. In this article we focus on the number of elements to be reordered for real-life texts. We show that this number is related to LCP values and that, on average, $L_{ave}$ entries are reordered, where $L_{ave}$ denotes the average LCP value, defined as the average length of the longest common prefix between two consecutive sorted suffixes. Since we know little about the LCP distribution for real-life texts, we conduct experiments on a corpus that consists of DNA sequences and natural language texts. The results show that, apart from texts containing large repetitions, the average LCP value is close to the one expected on a random text.
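The average LCP value defined above can be computed directly from its definition for small inputs. The sketch below is naive and assumes the average is taken over the adjacent pairs of sorted suffixes; the paper may normalize slightly differently, and the function names are chosen for illustration.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def average_lcp(text):
    """Average LCP over consecutive suffixes in sorted order.

    Assumption: the mean is over the len(text) - 1 adjacent pairs.
    """
    suffixes = sorted(text[i:] for i in range(len(text)))
    if len(suffixes) < 2:
        return 0.0
    values = [lcp(x, y) for x, y in zip(suffixes, suffixes[1:])]
    return sum(values) / len(values)


if __name__ == "__main__":
    print(average_lcp("banana"))  # 1.2
    print(average_lcp("aaaa"))    # 2.0, repetitive text gives larger values
    print(average_lcp("abcd"))    # 0.0
```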
MACHINE TRANSLATION BY PATTERN MATCHING
2008
Cited by 3 (0 self)
The best systems for machine translation of natural language are based on statistical models learned from data. Conventional representation of a statistical translation model requires substantial offline computation and representation in main memory. Therefore, the principal bottlenecks to the amount of data we can exploit and the complexity of models we can use are available memory and CPU time, and the current state of the art already pushes these limits. With data size and model complexity continually increasing, a scalable solution to this problem is central to future improvement. Callison-Burch et al. (2005) and Zhang and Vogel (2005) proposed a solution that we call translation by pattern matching, which we bring to fruition in this dissertation. The training data itself serves as a proxy to the model; rules and parameters are computed on demand. It achieves our desiderata of minimal offline computation and compact representation, but is dependent on fast pattern matching algorithms on text. They demonstrated its application to a common model based on the translation of contiguous substrings, but left some open problems. Among these is a question: can this approach match the performance of conventional methods despite unavoidable differences that it induces in the model? We show how to answer this question affirmatively. The main
Space-time tradeoffs for longest-common-prefix array computation
In Proc. 19th ISAAC, 2008
Cited by 3 (1 self)
The suffix array, a space-efficient alternative to the suffix tree, is an important data structure for string processing, enabling efficient and often optimal algorithms for pattern matching, data compression, repeat finding and many problems arising in computational biology. An essential augmentation to the suffix array for many of these tasks is the Longest Common Prefix (LCP) array. In particular, the LCP array allows one to simulate bottom-up and top-down traversals of the suffix tree with significantly less memory overhead (but in the same time bounds). Since 2001 the LCP array has been computable in $\Theta(n)$ time, but the algorithm (even after subsequent refinements) requires relatively large working memory. In this paper we describe a new algorithm that provides a continuous space-time tradeoff for LCP array construction, running in $O(nv)$ time and requiring $n + O(n/\sqrt{v} + v)$ bytes of working space, where $v$ can be chosen to suit the available memory. Furthermore, the algorithm processes the suffix array, and outputs the LCP, strictly left-to-right, making it suitable for use with external memory. We show experimentally that for many naturally occurring strings our algorithm is faster than the linear-time algorithms, while using significantly less working memory.
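For comparison, the linear-time baseline mentioned in the abstract is commonly presented along the following lines (the method is generally attributed to Kasai et al.). This is a sketch of that baseline, not the space-time tradeoff algorithm of the paper; the naive `suffix_array` helper and all names are assumptions for illustration.

```python
def suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """LCP array where lcp[k] is the longest common prefix length of the
    suffixes at sa[k - 1] and sa[k], with lcp[0] = 0.

    Visits suffixes in text order and reuses the previous match length
    minus one, so the total number of character comparisons is O(n).
    """
    n = len(text)
    rank = [0] * n  # inverse suffix array: rank[pos] = index of pos in sa
    for r, pos in enumerate(sa):
        rank[pos] = r

    lcp = [0] * n
    h = 0  # match length carried over from the previous text position
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix preceding i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(lcp_array(text, sa))  # [0, 1, 3, 0, 0, 2]
```

In this sketch the `rank` and `lcp` arrays each hold n integers on top of the text and the suffix array, which illustrates the kind of working memory a tradeoff algorithm tries to reduce.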