A Subquadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
, 2002
Cited by 56 (4 self)
The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix, and compares two strings of size n in O(n²) time. We address the challenge of computing the similarity of two strings in subquadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations. The speedup is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n² / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn² / log n) where h ≤ 1 is the entropy of the text.
Author affiliations:
Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France; email: mac@univmlv.fr.
Department of Computer Science, Haifa University, Haifa 31905, Israel, phone: (9724) 8240103, FAX: (9724) 8249331; Department of Computer and Information Science, Polytechnic University, Six MetroTech Center, Brooklyn, NY 112013840; email: landau@poly.edu; partially supported by NSF grant CCR0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation (grants 173/98 and 282/01), by the FIRST Foundation of the Israel Academy of Science and Humanities, and by IBM Faculty Partnership Award.
Department of Computer Science, Haifa University, Haifa 31905, Israel; On Education Leave from the IBM T.J.W. Research Center; email: michal@cs.haifa.il; partially supported by the Israel Science Foundation (grants 173/98 and 282/01), and by the FIRST Foundation of the Israel Academy of Science ...
Approximate Matching of Run-Length Compressed Strings
 Algorithmica
, 2001
Cited by 18 (0 self)
We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n, compressed to m′ and n′ runs. We extend an existing algorithm for the LCS to the Levenshtein distance achieving O(m′n + n′m) complexity.
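As an illustration of the run-length encoding this abstract refers to, here is a minimal Python sketch. It is not taken from the paper: the helper names `rle_encode` and `rle_decode` are hypothetical, and the sketch only shows how a string of many letters collapses into a shorter list of runs, which is the quantity the run-based complexity bounds are stated in.

```python
from itertools import groupby


def rle_encode(s: str) -> list[tuple[str, int]]:
    """Collapse each maximal block of equal characters into (char, length)."""
    # Hypothetical helper for illustration; not the paper's code.
    return [(ch, sum(1 for _ in group)) for ch, group in groupby(s)]


def rle_decode(runs: list[tuple[str, int]]) -> str:
    """Expand (char, length) pairs back into the original string."""
    return "".join(ch * count for ch, count in runs)


text = "aaabccdddd"
runs = rle_encode(text)
# The string has 10 letters but only 4 runs.
print(len(text), len(runs), runs)
```

Algorithms that work on the runs directly can be faster than the plain dynamic program whenever the number of runs is much smaller than the string length.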
Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions
Cited by 11 (5 self)
We present a method to speed up the dynamic program algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. We discuss the application of our method to Viterbi’s decoding and training algorithms [33], as well as to the forward-backward and Baum-Welch [6] algorithms. Our approach is based on identifying repeated substrings in the observed input sequence. Initially, we show how to exploit repetitions of all sufficiently small substrings (this is similar to the Four Russians method). Then, we describe four algorithms based alternatively on run length encoding (RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte pair encoding (BPE). Compared to Viterbi’s algorithm, we achieve speedups of Θ(log n) using the Four Russians method, Ω(r / log r) using RLE, Ω(log n / k) using LZ78, Ω(r / k) using SLP, and Ω(r) using BPE, where k is the number of hidden states, n is the length of the observed sequence and r is its compression ratio (under each compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster in practice. Furthermore, unlike Viterbi’s algorithm, our algorithms are highly parallelizable.
Pattern Matching in Compressed Text and Images
, 2001
Cited by 6 (5 self)
Normally compressed data needs to be decompressed before it is processed, but if the compression has been done in the right way, it is often possible to search the data without having to decompress it, or at least only partially decompress it. The problem can be divided into lossless and lossy compression methods, and then in each of these cases the pattern matching can be either exact or inexact. Much work has been reported in the literature on techniques for all of these cases, including algorithms that are suitable for pattern matching for various compression methods, and compression methods designed specifically for pattern matching. This work is surveyed in this paper. The paper also exposes the important relationship between pattern matching and compression, and proposes some performance measures for compressed pattern matching algorithms. Ideas and directions for future work are also described.
Re-Use Dynamic Programming for Sequence Alignment: An Algorithmic Toolkit
 STRING ALGORITHMICS, UNITED KINGDOM
, 2005
Towards Formal Structural Representation of Spoken Language: An Evolving Transformation System (ETS) Approach
, 2005
Cited by 5 (0 self)
Speech recognition has been a very active area of research over the past twenty years. Despite evident progress, it is generally agreed by the practitioners of the field that the performance of current speech recognition systems is rather suboptimal and new approaches are needed. The motivation behind the undertaken research is an observation that the notion of representation of objects and concepts, once considered central in the early days of pattern recognition, has been largely marginalised by the advent of statistical approaches. As a consequence of a predominantly statistical approach to the speech recognition problem, due to the numeric, feature-vector-based nature of representation, the classes inductively discovered from real data using decision-theoretic techniques have little meaning outside the statistical framework. This is because decision surfaces or probability distributions are difficult to analyse linguistically. Because of the latter limitation it is doubtful that the gap between speech recognition and linguistic research can be bridged by the numeric representations. This thesis investigates an alternative, structural, approach to spoken language representation and categorisa
The SBC-Tree: An Index for Run-Length Compressed Sequences
Cited by 4 (0 self)
Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure [7]. The SBC-tree supports pattern matching queries such as substring matching, prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences, and B is the disk page size. Substring matching, prefix matching, and range search execute in an optimal O(log_B N + |p| + T) I/O operations, where |p| is the
A UNIFIED ALGORITHM FOR ACCELERATING EDIT-DISTANCE COMPUTATION via . . .
, 2009
Cited by 4 (1 self)
The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic-programming solution for this problem computes the edit distance between a pair of strings of total length O(N) in O(N²) time. To this date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N²) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform
Using Edit Distance in Point-Pattern Matching
 In Proc. 8th Workshop on String Processing and Information Retrieval (SPIRE 2001), IEEE CS
, 2001
Cited by 2 (0 self)
Edit distance is a powerful measure of similarity in string matching, measuring the minimum number of insertions, deletions, and substitutions needed to convert a string into another string. This measure is often contrasted with time warping in speech processing, which measures how close two trajectories are by allowing compression and expansion operations on the time scale. Time warping can be easily generalized to measure the similarity between 1D point-patterns (ascending lists of real values), as the difference between the i-th and (i − 1)-th points in a point-pattern can be considered as the value of a trajectory at the time i. However, we show that edit distance is a more natural choice, and derive a measure by calculating the minimum amount of space that must be inserted and deleted between points to convert one point-pattern into another. We show that this measure defines a metric. We also define a substitution operation such that the distance calculation automatically separates the points into matching and mismatching points. The algorithms are based on dynamic programming. The main motivation for these methods is two- and higher-dimensional point-pattern matching, and therefore we generalize these methods into the 2D case, and show that this generalization leads to an NP-complete problem. There are also applications for the 1D case; we briefly discuss the matching of tree ring sequences in dendrochronology.
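For readers unfamiliar with the string measure this abstract starts from, the following is a minimal Python sketch of the textbook unit-cost edit distance computed by dynamic programming. It is an illustration only: the function name `edit_distance` is hypothetical, and the sketch covers plain strings, not the point-pattern measure derived in the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance using the classical quadratic-time DP.

    Illustrative sketch only; keeps just two rows of the DP matrix.
    """
    prev = list(range(len(b) + 1))  # cost of turning "" into b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]  # cost of turning a[:i] into ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Each cell depends only on its left, upper, and upper-left neighbours, which is why two rows suffice when only the distance (not the edit script) is needed.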
Fast Algorithms for Computing the Constrained LCS of Run-Length Encoded Strings
Cited by 2 (1 self)
In the constrained longest common subsequence (CLCS) problem, we are given two sequences X, Y and the constrained sequence P in run-length encoded (RLE) format, where |X| = n, |Y| = m and |P| = r and the numbers of runs in RLE format are N, M and R, respectively. In this paper, we show that after the sequences are encoded, the CLCS problem can be solved in O(NMr + r × min{q1, q2} + q3) time, where q1 and q2 denote the numbers of elements in the bottom and right boundaries of the partially matched blocks on the first layer, and q3 denotes the number of elements of whole boundaries of all fully matched cuboids in the DP lattice. If the compression ratio is good, our work obviously outperforms the previously known DP algorithm and the Hunt-and-Szymanski-like algorithm.