Results 1–10 of 20
A Subquadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
, 2002
Cited by 56 (4 self)

Abstract:
The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix, and compares two strings of size n in O(n²) time. We address the challenge of computing the similarity of two strings in subquadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations. The speedup is achieved by dividing the dynamic programming matrix into variable-sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n² / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn² / log n), where h ≤ 1 is the entropy of the text.
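The classical O(n²) dynamic program that this entry takes as its baseline can be sketched as follows. This is a minimal illustration only, not the paper's subquadratic algorithm; the scoring function and gap cost used here are arbitrary placeholder choices:

```python
def global_alignment_score(a, b, score, gap):
    """Classical O(n^2) dynamic-programming global alignment.

    `score(x, y)` gives the substitution weight for a character pair
    (an unrestricted scoring matrix); `gap` is the cost of an
    insertion or deletion (typically negative).
    """
    m, n = len(a), len(b)
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # match/substitute
                dp[i - 1][j] + gap,                            # delete from a
                dp[i][j - 1] + gap,                            # insert into a
            )
    return dp[m][n]

# Placeholder unit scoring: +1 match, -1 mismatch, -1 gap
match_score = lambda x, y: 1 if x == y else -1
print(global_alignment_score("ACGT", "AGT", match_score, -1))  # 2
```

The paper's contribution is avoiding filling in all n² cells by partitioning this matrix into blocks induced by a Lempel-Ziv parse of both strings.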
Extending autocompletion to tolerate errors
 In SIGMOD Conference, 707–718
, 2009
Cited by 21 (0 self)

Abstract:
Autocompletion is a useful feature when a user is doing a lookup in a table of records. With every letter being typed, autocompletion displays strings that are present in the table and contain the search string typed so far as their prefix. Just as there is a need for making the lookup operation tolerant to typing errors, we argue that autocompletion also needs to be error-tolerant. In this paper, we take a first step towards addressing this problem. We capture input typing errors via edit distance. We show that a naive approach of invoking an offline edit distance matching algorithm at each step performs poorly, and present more efficient algorithms. Our empirical evaluation demonstrates the effectiveness of our algorithms.
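The baseline the paper argues against can be sketched as follows: on each keystroke, compute the minimum edit distance between the query and any prefix of each record, and keep records within threshold k. This is the naive per-keystroke approach (the function names and the example table are illustrative, not from the paper):

```python
def prefix_edit_distance(query, s):
    """Minimum edit distance between `query` and any prefix of candidate `s`.

    Standard Levenshtein DP with a rolling row: dp[j] holds the distance
    between query[:i] and s[:j] for the current row i.
    """
    n = len(s)
    dp = list(range(n + 1))  # row i = 0: distance("", s[:j]) = j
    for i in range(1, len(query) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if query[i - 1] == s[j - 1] else 1
            dp[j] = min(prev_diag + cost,  # substitute / match
                        dp[j] + 1,         # delete from query
                        dp[j - 1] + 1)     # insert into query
            prev_diag = cur
    return min(dp)  # best over all prefixes of s

def fuzzy_complete(query, table, k):
    """Return records some prefix of which is within edit distance k of `query`."""
    return [s for s in table if prefix_edit_distance(query, s) <= k]

print(fuzzy_complete("algor", ["algorithm", "allegory", "logarithm"], 1))
# -> ['algorithm']
```

Re-running the full DP from scratch on every keystroke is exactly the inefficiency the paper's incremental algorithms avoid.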
Approximate Matching of Run-Length Compressed Strings
 Algorithmica
, 2001
Cited by 18 (0 self)

Abstract:
We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of lengths m and n, compressed to m′ and n′ runs. We extend an existing algorithm for the LCS to the Levenshtein distance, achieving O(m′n + n′m) complexity.
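Run-length encoding, the compression scheme assumed here, can be illustrated in a few lines (a minimal sketch; the encoder/decoder names are our own):

```python
from itertools import groupby

def rle(s):
    """Compress a string into (character, run-length) pairs."""
    return [(c, len(list(g))) for c, g in groupby(s)]

def rle_decode(runs):
    """Expand (character, run-length) pairs back into the original string."""
    return "".join(c * n for c, n in runs)

runs = rle("aaabbbbcc")
print(runs)                              # [('a', 3), ('b', 4), ('c', 2)]
print(rle_decode(runs) == "aaabbbbcc")   # True
```

The m′ and n′ in the complexity bound above are the number of such runs, which can be far smaller than the string lengths for repetitive inputs.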
Regular Expression Searching on Compressed Text
 Journal of Discrete Algorithms
, 2003
Cited by 13 (1 self)

Abstract:
We present a solution to the problem of regular expression searching on compressed text.
Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions
Cited by 11 (5 self)

Abstract:
We present a method to speed up the dynamic programming algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. We discuss the application of our method to Viterbi's decoding and training algorithms [33], as well as to the forward-backward and Baum-Welch [6] algorithms. Our approach is based on identifying repeated substrings in the observed input sequence. Initially, we show how to exploit repetitions of all sufficiently small substrings (this is similar to the Four Russians method). Then, we describe four algorithms based alternatively on run-length encoding (RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte pair encoding (BPE). Compared to Viterbi's algorithm, we achieve speedups of Θ(log n) using the Four Russians method, Ω(r / log r) using RLE, Ω(log n / k) using LZ78, Ω(r / k) using SLP, and Ω(r) using BPE, where k is the number of hidden states, n is the length of the observed sequence, and r is its compression ratio (under each compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster in practice. Furthermore, unlike Viterbi's algorithm, our algorithms are highly parallelizable.
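The baseline being sped up is textbook Viterbi decoding, which can be sketched as follows (a minimal O(n·k²) illustration, not the paper's accelerated version; the two-state HMM at the bottom is a hypothetical example of ours):

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Textbook O(n * k^2) Viterbi decoding for a discrete time-independent HMM.

    All parameters are log-probabilities; returns the most likely state path.
    """
    # v[s] = best log-probability of any path ending in state s
    v = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # back[t][s] = best predecessor of s at step t
    for o in obs[1:]:
        nv, ptr = {}, {}
        for s in states:
            p = max(states, key=lambda q: v[q] + log_trans[q][s])
            ptr[s] = p
            nv[s] = v[p] + log_trans[p][s] + log_emit[s][o]
        v = nv
        back.append(ptr)
    # Reconstruct the best path by following back-pointers
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical two-state HMM: state "A" mostly emits 'x', state "B" mostly 'y'.
lg = math.log
states = ["A", "B"]
start = {"A": lg(0.5), "B": lg(0.5)}
trans = {"A": {"A": lg(0.9), "B": lg(0.1)}, "B": {"A": lg(0.1), "B": lg(0.9)}}
emit = {"A": {"x": lg(0.9), "y": lg(0.1)}, "B": {"x": lg(0.1), "y": lg(0.9)}}
print(viterbi("xxyy", states, start, trans, emit))  # ['A', 'A', 'B', 'B']
```

The paper's speedups come from noticing that repeated substrings of `obs` induce identical blocks of this recurrence, which can be computed once and reused.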
Random Access to GrammarCompressed Strings
, 2011
Cited by 9 (0 self)

Abstract:
Let S be a string of length N compressed into a context-free grammar S of size n. We present two representations of S achieving O(log N) random access time, and either O(n · αk(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, αk(n) is the inverse of the k-th row of Ackermann's function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query plus additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k⁴ + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
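The naive way to access position i of a grammar-compressed string is to walk down the parse tree using stored subtree lengths, which costs time proportional to the grammar's height; the paper's contribution is achieving O(log N) regardless of height. A minimal sketch of the naive walk (class and function names are our own):

```python
class Nonterminal:
    """SLP-style rule X -> left right; terminals are one-character strings."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = slp_len(left) + slp_len(right)  # cached expansion length

def slp_len(x):
    return 1 if isinstance(x, str) else x.length

def slp_access(x, i):
    """Return character i of the string derived by x, without decompressing.

    Descends by comparing i against the cached left-subtree length;
    O(grammar height) time, in contrast to the paper's O(log N).
    """
    while not isinstance(x, str):
        if i < slp_len(x.left):
            x = x.left
        else:
            i -= slp_len(x.left)
            x = x.right
    return x

# Grammar deriving "abab": X1 -> a b ; X2 -> X1 X1
x1 = Nonterminal("a", "b")
x2 = Nonterminal(x1, x1)
print("".join(slp_access(x2, i) for i in range(x2.length)))  # abab
```

Note that `x1` is shared by both children of `x2`: the grammar has 2 rules but derives a string of length 4, and in general the gap between n and N can be exponential.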
LZgrep: A Boyer-Moore String Matching Tool for Ziv-Lempel Compressed Text
 Soft. Pract. Exper
, 2005
Cited by 8 (2 self)

Abstract:
We present a Boyer-Moore approach to string matching over LZ78 and LZW compressed text. The idea is to search the text directly in compressed form instead of decompressing and then searching it. We modify the Boyer-Moore approach so as to skip text using the characters explicitly represented in the LZ78/LZW formats, modifying the basic technique where the algorithm can choose which characters to inspect. We present and compare several solutions for single and multi-pattern search. We show that our algorithms obtain speedups of up to 50% compared to the simple decompress-then-search approach. Finally, we present a public tool, LZgrep, which uses our algorithms to offer grep-like capabilities for searching directly in files compressed with Unix's Compress, an LZW compressor. LZgrep can also search files compressed with Unix gzip, using new decompress-then-search techniques we develop, which are faster than the current tools. This way, users can always keep their files in compressed form and still search them, uncompressing only when they want to view them.
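The skipping idea at the heart of Boyer-Moore can be illustrated on uncompressed text with Horspool's simplified variant (a sketch of the general technique only; the paper's contribution is adapting the skip to characters explicitly represented in LZ78/LZW phrases):

```python
def horspool_search(pattern, text):
    """Horspool's simplification of Boyer-Moore: align the pattern with a
    window of the text, and on a mismatch shift ahead based on the window's
    last character, skipping characters that can never start a match."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    # shift[c] = distance from the last occurrence of c in pattern[:-1]
    # to the end of the pattern; unseen characters allow a full shift of m
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits, pos = [], 0
    while pos + m <= len(text):
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return hits

print(horspool_search("abc", "xxabcxabc"))  # [2, 6]
```

On compressed input the algorithm cannot inspect arbitrary text positions cheaply, which is why the paper reworks the skip around the characters the LZ78/LZW encoding stores explicitly.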
Re-Use Dynamic Programming for Sequence Alignment: An Algorithmic Toolkit
 STRING ALGORITHMICS, UNITED KINGDOM
, 2005
A UNIFIED ALGORITHM FOR ACCELERATING EDIT-DISTANCE COMPUTATION via . . .
, 2009
Cited by 4 (1 self)

Abstract:
The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic-programming solution for this problem computes the edit distance between a pair of strings of total length O(N) in O(N²) time. To date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound when the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N²) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression scheme. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform ...
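As one concrete instance of the "first compress the strings" step in the paradigm above, here is a minimal LZ78-style parse, one of the compression schemes such edit-distance algorithms exploit (an illustrative sketch, not the paper's unified algorithm):

```python
def lz78_parse(s):
    """Greedy LZ78 parse: each phrase extends the longest previously seen
    phrase by one new character, emitted as (phrase index, character).
    Compressible strings yield few phrases, which is what compression-based
    edit-distance algorithms exploit."""
    phrases = []
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty phrase
    cur = ""
    for c in s:
        if cur + c in dictionary:
            cur += c                      # keep extending a known phrase
        else:
            phrases.append((dictionary[cur], c))
            dictionary[cur + c] = len(dictionary)
            cur = ""
    if cur:                               # trailing phrase is already known
        phrases.append((dictionary[cur[:-1]], cur[-1]))
    return phrases

print(lz78_parse("aaabaab"))  # [(0, 'a'), (1, 'a'), (0, 'b'), (2, 'b')]
```

A straight-line program, the representation the paper actually builds on, plays the same role generically: any of these schemes can be converted into a grammar, letting one algorithm serve them all.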
Compressed Pattern Matching for SEQUITUR
, 2000
Cited by 3 (2 self)

Abstract:
Sequitur, due to Nevill-Manning and Witten [18], is a powerful program that infers a phrase hierarchy from the input text and also provides extremely effective compression of large quantities of semi-structured text [17]. In this paper, we address the problem of searching Sequitur-compressed text directly. We present a compressed pattern matching algorithm that finds a pattern in compressed text without explicit decompression. We show that our algorithm is approximately 1.27 times faster than decompression followed by an ordinary search.
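The reason searching a phrase hierarchy directly can beat decompress-then-search is that shared phrases are processed once rather than once per occurrence. A toy sketch of that idea, counting occurrences of a single character in a grammar without full expansion (the grammar is a hypothetical example of ours, not Sequitur's output, and real compressed matching must also handle patterns that cross rule boundaries):

```python
# Toy rules of the kind Sequitur infers: each rule rewrites a nonterminal
# to a sequence of terminals and nonterminals. S derives "abaabab".
RULES = {
    "S": ["A", "A", "b"],   # S -> A A b
    "A": ["a", "b", "a"],   # A -> a b a
}

def count_char(symbol, target, memo=None):
    """Count occurrences of `target` in the expansion of `symbol`,
    visiting each shared rule only once via memoization."""
    if memo is None:
        memo = {}
    if symbol not in RULES:                 # terminal symbol
        return 1 if symbol == target else 0
    if symbol not in memo:
        memo[symbol] = sum(count_char(s, target, memo) for s in RULES[symbol])
    return memo[symbol]

print(count_char("S", "a"))  # 4  (S expands to "abaabab")
```

Here rule A is expanded conceptually once even though S uses it twice; on deeply hierarchical grammars this gap is what makes sub-decompression search times possible.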