Results 1 - 10
of
61
Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract
- in Proceedings of the 32nd Annual ACM Symposium on the Theory of Computing
, 2000
"... Abstract. The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed al ..."
Abstract
-
Cited by 172 (15 self)
- Add to MetaCart
Abstract. The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ | bits by encoding each symbol with lg |Σ | bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg |Σ | n), which is significant when Σ is of constant size, such as in ascii or unicode. On the other hand, these indexes support fast searching, either in O(m lg |Σ|) timeorinO(m +lgn) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg |Σ | n +lgɛ |Σ | n) search time in the worst case, for any constant
Opportunistic Data Structures with Applications
, 2000
"... In this paper we address the issue of compressing and indexing data. We devise a data structure whose space occupancy is a function of the entropy of the underlying data set. We call the data structure opportunistic since its space occupancy is decreased when the input is compressible and this space ..."
Abstract
-
Cited by 142 (11 self)
- Add to MetaCart
In this paper we address the issue of compressing and indexing data. We devise a data structure whose space occupancy is a function of the entropy of the underlying data set. We call the data structure opportunistic since its space occupancy is decreased when the input is compressible and this space reduction is achieved at no significant slowdown in the query performance. More precisely, its space occupancy is optimal in an information-content sense because a text T [1, u] is stored using O(H k (T )) + o(1) bits per input symbol in the worst case, where H k (T ) is the kth order empirical entropy of T (the bound holds for any fixed k). Given an arbitrary string P [1; p], the opportunistic data structure allows to search for the occ occurrences of P in T in O(p + occ log u) time (for any fixed > 0). If data are uncompressible we achieve the best space bound currently known [12]; on compressible data our solution improves the succinct suffix array of [12] and the classical suffix tree and suffix array data structures either in space or in query time or both.
Let Sleeping Files Lie: Pattern Matching in Z-compressed Files
, 1994
"... The current explosion of stored information necessitates a new model of pattern matching, that of compressed matching. In this model one tries to find all occurrences of a pattern in a compressed text in time proportional to the compressed text size, i.e., without decompressing the text. The most ef ..."
Abstract
-
Cited by 86 (2 self)
- Add to MetaCart
The current explosion of stored information necessitates a new model of pattern matching, that of compressed matching. In this model one tries to find all occurrences of a pattern in a compressed text in time proportional to the compressed text size, i.e., without decompressing the text. The most effective general purpose compression algorithms are adaptive, in that the text represented by each compression symbol is determined dynamically by the data. As a result, the encoding of a substring depends on its location. Thus the same substring may "look different" every time it appears in the compressed text. In this paper we consider pattern matching without decompression in the UNIX Z-compression. This is a variant of the Lempel-Ziv adaptive compression scheme. If n is the length of the compressed text and m is the length of the pattern, our algorithms find the first pattern occurrence in time O(n+m 2 ) or O(n log m+m). We also introduce a new criterion to measure compressed matching ...
A Text Compression Scheme That Allows Fast Searching Directly In The Compressed File
- ACM Transactions on Information Systems
, 1993
"... . A new text compression scheme is presented in this paper. The main purpose of this scheme is to speed up string matching by searching the compressed file directly. The scheme requires no modification of the string-matching algorithm, which is used as a black box; any string-matching procedure can ..."
Abstract
-
Cited by 56 (1 self)
- Add to MetaCart
. A new text compression scheme is presented in this paper. The main purpose of this scheme is to speed up string matching by searching the compressed file directly. The scheme requires no modification of the string-matching algorithm, which is used as a black box; any string-matching procedure can be used. Instead, the pattern is modified; only the outcome of the matching of the modified pattern against the compressed file is decompressed. Since the compressed file is smaller than the original file, the search is faster both in terms of I/O time and processing time than a search in the original file. For typical text files, we achieve about 30% reduction of space and slightly less of search time. A 30% space saving is not competitive with good text compression schemes, and thus should not be used where space is the predominant concern. The intended applications of this scheme are files that are searched often, such as catalogs, bibliographic files, and address books. Such files are ty...
Approximate String Matching over Ziv-Lempel Compressed Text
, 2000
"... We present the first nontrivial algorithm for approximate pattern matching on compressed text. The format we choose is the Ziv-Lempel family. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to k inse ..."
Abstract
-
Cited by 38 (11 self)
- Add to MetaCart
We present the first nontrivial algorithm for approximate pattern matching on compressed text. The format we choose is the Ziv-Lempel family. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to k insertions, deletions and substitutions. On LZ78/LZW we need O(mkn + R) time in the worst case and O(k ) +R) on average where is the alphabet size. The experimental results show a practical speedup over the basic approach of up to 2X for moderate m and small k. We extend the algorithms to more general compression formats and approximate matching models.
A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text
, 1998
"... . We address the problem of string matching on Ziv-Lempel compressed text. The goal is to search a pattern in a text without uncompressing it. This is a highly relevant issue to keep compressed text databases where efficient searching is still possible. We develop a general technique for string matc ..."
Abstract
-
Cited by 38 (7 self)
- Add to MetaCart
. We address the problem of string matching on Ziv-Lempel compressed text. The goal is to search a pattern in a text without uncompressing it. This is a highly relevant issue to keep compressed text databases where efficient searching is still possible. We develop a general technique for string matching when the text comes as a sequence of blocks. This abstracts the essential features of Ziv-Lempel compression. We then apply the scheme to each particular type of compression. We present the first algorithm to find all the matches of a pattern in a text compressed using LZ77. When we apply our scheme to LZ78, we obtain a much more efficient search algorithm, which is faster than uncompressing the text and then searching on it. Finally, we propose a new hybrid compression scheme which is between LZ77 and LZ78, being in practice as good to compress as LZ77 and as fast to search in as LZ78. 1 Introduction String matching is one of the most pervasive problems in computer science, with appli...
Fast Searching on Compressed Text Allowing Errors
, 1998
"... We present a fast compression and decompression scheme for natural language texts that allows efficient and flexible string matching by searching the compressed text directly. The compression scheme uses a word-based Huffman encoding and the coding alphabet is byte-oriented rather than bit-oriented. ..."
Abstract
-
Cited by 23 (13 self)
- Add to MetaCart
We present a fast compression and decompression scheme for natural language texts that allows efficient and flexible string matching by searching the compressed text directly. The compression scheme uses a word-based Huffman encoding and the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression times are close to the times of Compress and approximately half the times of Gzip, and decompression times are lower than those of Gzip and one third of those of Compress. The searching algorithm allows a large number of variations of the exact and approximate compressed string matching problem, such as phrases, ranges, complements, wild cards and arbitrary regular expressions. Separators and stopwords can be discarded at search time without significantly increasing the cost. The algorithm is based on a word-oriented shift-or algorithm and a fast Boyer-Moore-type filter. It concomitantly uses the vocabulary of the text available as part of the Huffman coding data. When searching for simple patterns, our experiments show that running our algorithm on a compressed text is twice as fast as running Agrep on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithm is up to 8 times faster than Agrep. We also mention the impact of our technique in inverted files pointing to documents or logical blocks as Glimpse.
Multiple Pattern Matching in LZW Compressed Text
- In Proc. DCC'98
, 1998
"... In this paper we address the problem of searching in LZW compressed text directly, and present a new algorithm for finding multiple patterns bysimulating the moveofthe Aho-Corasick pattern matching machine. The new algorithm finds all occurrences of multiple patterns whereas the algorithm propose ..."
Abstract
-
Cited by 22 (9 self)
- Add to MetaCart
In this paper we address the problem of searching in LZW compressed text directly, and present a new algorithm for finding multiple patterns bysimulating the moveofthe Aho-Corasick pattern matching machine. The new algorithm finds all occurrences of multiple patterns whereas the algorithm proposed by Amir, Benson, and Farach finds only the first occurrence of a single pattern.
Faster Approximate String Matching over Compressed Text
- In Proc. 11th IEEE Data Compression Conference (DCC'01
"... Approximate string matching on compressed text was a problem open during almost a decade. The two existing solutions are very recent. Despite that they represent important complexity breakthroughs, in most practical cases they are not useful, in the sense that they are slower than uncompressing t ..."
Abstract
-
Cited by 21 (6 self)
- Add to MetaCart
Approximate string matching on compressed text was a problem open during almost a decade. The two existing solutions are very recent. Despite that they represent important complexity breakthroughs, in most practical cases they are not useful, in the sense that they are slower than uncompressing the text and then searching the uncompressed text. In this paper we present a different approach, which reduces the problem to multipattern searching of pattern pieces plus local decompression and direct verification of candidate text areas. We show experimentally that this solution is 10--30 times faster than previous work and up to three times faster than the trivial approach of uncompressing and searching, thus becoming the first practical solution to the problem. 1
An Improved Pattern Matching Algorithm for Strings in Terms of Straight-Line Programs
- In Proc. 8th Ann. Symp. on Combinatorial Pattern Matching, volume 1264 of Lecture Notes in Computer Science
, 1997
"... We show an efficient pattern matching algorithm for strings that are succinctly described in terms of straight-line programs, in which the constants are symbols and the only operation is the concatenation. In this paper, both text T and pattern P are given by straight-line programs T and P . The len ..."
Abstract
-
Cited by 19 (5 self)
- Add to MetaCart
We show an efficient pattern matching algorithm for strings that are succinctly described in terms of straight-line programs, in which the constants are symbols and the only operation is the concatenation. In this paper, both text T and pattern P are given by straight-line programs T and P . The length of the text T (pattern P , resp.) may grows exponentially with respect to its description size jT j = n (jPj = m, resp.). We show a new combinatorial property concerning with the periodic occurrences in a text. Based on this property, we develop an O(n 2 m 2 ) time algorithm using O(nm) space, which outputs a compact representation of all occurrences of P in T . This is superior to the algorithm proposed by Karpinski et al.[11], which runs in O((n +m) 4 log (n +m)) time using O((n+m) 3 ) space, and finds only one occurrence. Moreover, our algorithm is much simpler than theirs. 1 Introduction The string pattern matching is a task to find all occurrences of a pattern in a text. In...

