Compressed full-text indexes
ACM COMPUTING SURVEYS, 2007
Cited by 172 (78 self)
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity, produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes to date, focusing on the essential aspects of how they exploit the text compressibility and how they efficiently solve various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
Approximate String Matching with Lempel-Ziv Compressed Indexes
2007
Cited by 5 (4 self)
A compressed full-text self-index for a text T is a data structure that requires reduced space and is able to search for patterns P in T. Furthermore, the structure can reproduce any substring of T, thus it actually replaces T. Despite the explosion of interest in self-indexes in recent years, there has not been much progress on search functionalities beyond the basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest, say, in computational biology applications. We present an ASM algorithm that works on top of a Lempel-Ziv self-index. We consider the so-called hybrid indexes, which are the best in practice for this problem. We show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to the Lempel-Ziv index. We show experimentally that our algorithm has a competitive performance and provides a useful space-time tradeoff compared to classical indexes.
Pattern matching with don’t cares and few errors
Cited by 1 (0 self)
We present solutions for the k-mismatch pattern matching problem with don’t cares. Given a text t of length n and a pattern p of length m with don’t care symbols and a bound k, our algorithms find all the places that the pattern matches the text with at most k mismatches. We first give a $\Theta(n(k + \log m \log k) \log n)$ time randomised algorithm which finds the correct answer with high probability. We then present a new deterministic $\Theta(nk^2 \log^2 m)$ time solution that uses tools originally developed for group testing. Taking our derandomisation approach further we develop an approach based on k-selectors that runs in $\Theta(nk \operatorname{polylog} m)$ time. Further, in each case the location of the mismatches at each alignment is also given at no extra cost.
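To make the problem statement above concrete, here is a minimal brute-force sketch in Python. It is not any of the algorithms from the paper: it simply checks every alignment in O(nm) time and serves only as a reference definition of the expected output. The wildcard character `?` and the function name are assumptions made for this illustration.

```python
from typing import List, Tuple

WILDCARD = "?"  # assumed don't-care symbol; the abstract does not fix one


def k_mismatch_with_dont_cares(t: str, p: str, k: int) -> List[Tuple[int, List[int]]]:
    """Naive reference: return (alignment, mismatch offsets) for every
    alignment of p in t that has at most k mismatches."""
    n, m = len(t), len(p)
    results = []
    for i in range(n - m + 1):
        mismatches = []
        for j in range(m):
            if p[j] == WILDCARD:
                continue  # a don't-care symbol matches any text character
            if t[i + j] != p[j]:
                mismatches.append(j)
                if len(mismatches) > k:
                    break  # too many mismatches, abandon this alignment
        if len(mismatches) <= k:
            results.append((i, mismatches))
    return results
```

For example, calling `k_mismatch_with_dont_cares("abcabd", "a?d", 1)` reports alignment 0 with one mismatch at pattern offset 2 and alignment 3 with no mismatches.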
Cache-Oblivious Index for Approximate String Matching
2007
Cited by 1 (0 self)
This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text T of length n and a positive integer k, we want to construct an index of T such that for any input pattern P, we can find all its k-error matches in T efficiently. This problem is well-studied in the internal-memory setting. Here, we extend some of these recent results to external-memory solutions, which are also cache-oblivious. Our first index occupies $O((n \log^k n)/B)$ disk pages and finds all k-error matches with $O((|P| + occ)/B + \log^k n \log\log_B n)$ I/Os, where B denotes the number of words in a disk page. To the best of our knowledge, this index is the first external-memory data structure that does not require $\Omega(|P| + occ + \mathrm{poly}(\log n))$ I/Os. The second index reduces the space to $O((n \log n)/B)$ disk pages, and the I/O complexity is $O((|P| + occ)/B + \log^{k(k+1)} n \log\log n)$.
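As an illustration of what "k-error matches" means, the following Python sketch scans the text directly with the classical dynamic-programming approach (often attributed to Sellers) instead of using an index. It is not the cache-oblivious index described in the abstract; it assumes that errors are unit-cost insertions, deletions, and substitutions, and the function name is hypothetical.

```python
from typing import List


def k_error_match_ends(text: str, pattern: str, k: int) -> List[int]:
    """Return the (exclusive) end positions in text where some substring
    ending there is within edit distance k of pattern."""
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any substring
    # of the text that ends at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # cell (i-1) of the previous column
        col[0] = 0          # a match may start at any text position
        for i in range(1, m + 1):
            cur = col[i]    # cell i of the previous column
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         cur + 1,           # extra text character
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = cur
        if col[m] <= k:
            ends.append(j)
    return ends
```

This scan takes O(n|P|) time per query, which is the kind of cost an index for approximate matching is meant to avoid on large texts.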
Faster Filters for Approximate String Matching
We introduce a new filtering method for approximate string matching called the suffix filter. It has some similarity with well-known filtration algorithms, which we call factor filters, and which are among the best practical algorithms for approximate string matching using a text index. Suffix filters are stronger, i.e., produce fewer false matches than factor filters. We demonstrate experimentally that suffix filters are faster in practice, too.
Approximate String Matching with Ziv-Lempel Compressed Indexes
A compressed full-text self-index for a text T is a data structure that requires reduced space and is able to search for patterns P in T. Furthermore, the structure can reproduce any substring of T, thus it actually replaces T. Despite the explosion of interest in self-indexes in recent years, there has not been much progress on search functionalities beyond the basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest, say, in computational biology applications. We present an ASM algorithm that works on top of a Lempel-Ziv self-index. We consider the so-called hybrid indexes, which are the best in practice for this problem. We show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to the Ziv-Lempel index. We show experimentally that our algorithm has a competitive performance and provides a useful space-time tradeoff compared to classical indexes.
1 Introduction and Related Work
Approximate string matching (ASM) is an important problem that arises in applications related to text searching, pattern recognition, signal processing, and computational biology,