Results 1-10 of 23
A Subquadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
, 2002
Cited by 73 (4 self)
The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix and compares two strings of size n in O(n²) time. We address the challenge of computing the similarity of two strings in subquadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations. The speedup is achieved by dividing the dynamic programming matrix into variable-sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n² / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn² / log n), where h ≤ 1 is the entropy of the text.
* Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France; email: mac@univ-mlv.fr.
† Department of Computer Science, Haifa University, Haifa 31905, Israel, phone: (972-4) 824-0103, fax: (972-4) 824-9331; Department of Computer and Information Science, Polytechnic University, Six MetroTech Center, Brooklyn, NY 11201-3840; email: landau@poly.edu; partially supported by NSF grant CCR-0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation (grants 173/98 and 282/01), by the FIRST Foundation of the Israel Academy of Science and Humanities, and by an IBM Faculty Partnership Award.
‡ Department of Computer Science, Haifa University, Haifa 31905, Israel; on education leave from the IBM T.J.W. Research Center; email: michal@cs.haifa.il; partially supported by the Israel Science Foundation (grants 173/98 and 282/01), and by the FIRST Foundation of the Israel Academy of Science ...
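The block decomposition in this abstract is driven by a Lempel-Ziv parse of each string. As a rough illustration only, the sketch below assumes the LZ78 variant (the abstract says just "Lempel-Ziv") and shows how a string is cut into phrases; the function name `lz78_parse` is hypothetical and the alignment algorithm itself is not reproduced here.

```python
def lz78_parse(text):
    """Cut text into LZ78 phrases.

    Each phrase is (index of the longest previously seen phrase, next character),
    where index 0 stands for the empty phrase.
    """
    dictionary = {"": 0}  # phrase -> index; prefix-closed by construction
    phrases = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            # Keep extending while the candidate phrase is already known.
            current += ch
        else:
            phrases.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:
        # The input ended inside a known phrase; emit it as prefix + last char.
        phrases.append((dictionary[current[:-1]], current[-1]))
    return phrases


print(lz78_parse("aababc"))  # three phrases: "a", "ab", "abc"
```

Repetitive strings yield few, long phrases, which is the property a phrase-based block decomposition relies on.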
Random Access to GrammarCompressed Strings
, 2011
Cited by 30 (3 self)
Let S be a string of length N compressed into a context-free grammar 𝒮 of size n. We present two representations of 𝒮 achieving O(log N) random access time, and either O(n · α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, α_k(n) is the inverse of the k-th row of Ackermann's function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k⁴ + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
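To make the random access problem concrete, here is a naive baseline sketch, not the data structure from the paper: it stores the expansion length of every grammar symbol and walks down from the start symbol, so a query costs time proportional to the grammar's height rather than O(log N). The grammar encoding (a dict mapping each symbol to either one character or a pair of symbols) and the function names are assumptions made for illustration.

```python
def expansion_lengths(rules):
    """Return the length of the string each symbol expands to."""
    lengths = {}

    def length(sym):
        if sym not in lengths:
            rhs = rules[sym]
            if isinstance(rhs, str):
                lengths[sym] = 1  # terminal rule: a single character
            else:
                lengths[sym] = length(rhs[0]) + length(rhs[1])
        return lengths[sym]

    for sym in rules:
        length(sym)
    return lengths


def access(rules, lengths, start, i):
    """Return character i (0-based) of the expansion of start, without decompressing."""
    if not 0 <= i < lengths[start]:
        raise IndexError(i)
    sym = start
    while True:
        rhs = rules[sym]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if i < lengths[left]:
            sym = left
        else:
            i -= lengths[left]
            sym = right


# Hypothetical grammar whose start symbol "S" expands to "ababa".
rules = {"A": "a", "B": "b", "C": ("A", "B"), "D": ("C", "C"), "S": ("D", "A")}
lengths = expansion_lengths(rules)
print("".join(access(rules, lengths, "S", i) for i in range(lengths["S"])))
```

For an unbalanced grammar the height can be as large as n, which is the gap the O(log N) representations described above are meant to close.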
Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions
Cited by 21 (6 self)
We present a method to speed up the dynamic program algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. We discuss the application of our method to Viterbi's decoding and training algorithms [33], as well as to the forward-backward and Baum-Welch [6] algorithms. Our approach is based on identifying repeated substrings in the observed input sequence. Initially, we show how to exploit repetitions of all sufficiently small substrings (this is similar to the Four Russians method). Then, we describe four algorithms based alternatively on run-length encoding (RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte pair encoding (BPE). Compared to Viterbi's algorithm, we achieve speedups of Θ(log n) using the Four Russians method, Ω(r / log r) using RLE, Ω(log n / k) using LZ78, Ω(r / k) using SLP, and Ω(r) using BPE, where k is the number of hidden states, n is the length of the observed sequence and r is its compression ratio (under each compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster in practice. Furthermore, unlike Viterbi's algorithm, our algorithms are highly parallelizable.
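For reference, the baseline that these speedups are measured against can be sketched as the textbook Viterbi recurrence, which runs in O(n k²) time for n observations and k hidden states. This is a minimal sketch of the standard algorithm only, not of the accelerated variants; the parameter names and the toy model in the usage lines are illustrative assumptions.

```python
def viterbi(obs, init, trans, emit):
    """Most likely hidden-state path for a non-empty observation sequence.

    init[s]     : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability of state s emitting observation o
    """
    k = len(init)
    score = [init[s] * emit[s][obs[0]] for s in range(k)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev = score
        pointers, score = [], []
        for s in range(k):
            best = max(range(k), key=lambda p: prev[p] * trans[p][s])
            pointers.append(best)
            score.append(prev[best] * trans[best][s] * emit[s][o])
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy two-state model with three observation symbols (made-up numbers).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
print(viterbi([0, 1, 2], init, trans, emit))
```

The inner loop over states is repeated for every observation, so a substring that occurs many times in the input triggers the same work each time; that repetition is what the compression-based methods exploit.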
A UNIFIED ALGORITHM FOR ACCELERATING EDIT-DISTANCE COMPUTATION via . . .
, 2009
Cited by 13 (1 self)
The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic-programming solution for this problem computes the edit distance between a pair of strings of total length O(N) in O(N²) time. To this date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N²) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform
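The quadratic baseline mentioned in this abstract is the standard dynamic program, sketched here with unit costs for insertion, deletion, and substitution (the unit costs and the function name are assumptions of this sketch; it does not implement the compression-based acceleration). Only two rows of the table are kept, so the space is linear while the time stays quadratic.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between strings a and b in O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```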
The SBC-Tree: An Index for Run-Length Compressed Sequences
, 2008
Cited by 9 (2 self)
Run-Length Encoding (RLE) is a data compression technique that is used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure [7]. The SBC-tree supports pattern matching queries such as substring matching, prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences, and B is the disk page size. Substring matching, prefix matching, and range search execute in an optimal O(log_B N + |p|+T) I/O operations, where |p| is the
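As a reminder of what an RLE-compressed sequence looks like, the following minimal sketch encodes a string as (symbol, run length) pairs and decodes it again. It illustrates the compression scheme only, not the SBC-tree index, and the function names are illustrative.

```python
from itertools import groupby


def rle_encode(seq):
    """Collapse each maximal run of equal symbols into a (symbol, length) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(seq)]


def rle_decode(runs):
    """Expand (symbol, length) pairs back into the original string."""
    return "".join(symbol * length for symbol, length in runs)


runs = rle_encode("aaabccdddd")
print(runs)              # [('a', 3), ('b', 1), ('c', 2), ('d', 4)]
print(rle_decode(runs))  # aaabccdddd
```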
Re-Use Dynamic Programming for Sequence Alignment: An Algorithmic Toolkit
 STRING ALGORITHMICS, UNITED KINGDOM
, 2005
Towards Formal Structural Representation of Spoken Language: An Evolving Transformation System (ETS) Approach
, 2005
Cited by 6 (0 self)
Speech recognition has been a very active area of research over the past twenty years. Despite evident progress, practitioners in the field generally agree that the performance of current speech recognition systems is rather suboptimal and that new approaches are needed. The motivation behind the undertaken research is the observation that the notion of representation of objects and concepts, once considered central in the early days of pattern recognition, has been largely marginalised by the advent of statistical approaches. As a consequence of a predominantly statistical approach to the speech recognition problem, and due to the numeric, feature-vector-based nature of the representation, the classes inductively discovered from real data using decision-theoretic techniques have little meaning outside the statistical framework. This is because decision surfaces or probability distributions are difficult to analyse linguistically. Because of the latter limitation, it is doubtful that the gap between speech recognition and linguistic research can be bridged by numeric representations. This thesis investigates an alternative, structural, approach to spoken language representation and categorisa
Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts
, 2007
Cited by 4 (0 self)
We study the approximate string matching and regular expression matching problem for the case when the text to be searched is compressed with the Ziv-Lempel adaptive dictionary compression schemes. We present a time-space trade-off that leads to algorithms improving the previously known complexities for both problems. In particular, we significantly improve the space bounds, which in practical applications are likely to be a bottleneck.
Using Edit Distance in Point-Pattern Matching
 In Proc. 8th Workshop on String Processing and Information Retrieval (SPIRE 2001), IEEE CS
, 2001
Cited by 3 (0 self)
Edit distance is a powerful measure of similarity in string matching, measuring the minimum number of insertions, deletions, and substitutions needed to convert one string into another. This measure is often contrasted with time warping in speech processing, which measures how close two trajectories are by allowing compression and expansion operations on the time scale. Time warping can be easily generalized to measure the similarity between 1D point-patterns (ascending lists of real values), as the difference between the i-th and (i − 1)-th points in a point-pattern can be considered as the value of a trajectory at time i. However, we show that edit distance is a more natural choice, and derive a measure by calculating the minimum amount of space that must be inserted and deleted between points to convert one point-pattern into another. We show that this measure defines a metric. We also define a substitution operation such that the distance calculation automatically separates the points into matching and mismatching points. The algorithms are based on dynamic programming. The main motivation for these methods is two- and higher-dimensional point-pattern matching, and therefore we generalize these methods to the 2D case, and show that this generalization leads to an NP-complete problem. There are also applications for the 1D case; we briefly discuss the matching of tree-ring sequences in dendrochronology.
Approximate pattern match using the Burrows-Wheeler transform
 Proceedings of Data Compression Conference
, 2003
Cited by 2 (1 self)
The compressed pattern matching problem is to locate the occurrence(s) of a pattern P in a text string T using a compressed representation of T, with minimal (or no) decompression. In this paper, we consider approximate pattern matching directly on BWT-compressed text. The BWT provides a lexicographic ordering of the input text as part of its inverse transformation process. Based on this observation, pattern matching is performed by text pre-filtering, using a fast q-gram intersection of segments from the pattern P and the text T. Algorithms are proposed that solve the k-mismatch problem in O(min{m(m − k)|Σ|^k log(u/|Σ|), mu log(u/|Σ|)}) time in the worst case, and the k-approximate matching problem in O(|Σ| log |Σ| + (m²/k) log(u/|Σ|) + αk) time on average (α ≤ u), where u = |T| is the size of the text, m = |P| is the size of the pattern, and Σ is the symbol alphabet. Each algorithm requires O(u) auxiliary arrays, which are constructed in O(u) time and space.
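For readers unfamiliar with the transform itself, a naive forward BWT can be sketched by sorting all rotations of the text and reading off the last column. This is a teaching sketch only: it takes O(u² log u) time, it assumes a sentinel character "$" that does not occur in the text and sorts before every other symbol, and it is not the matching algorithm from the paper.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))  # annb$aa
```

The sorted rotations are what give the transform its lexicographic ordering of the text's suffixes, the property the abstract relies on for pre-filtering.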