A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
, 2002
Cited by 46 (3 self)
The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix, and compares two strings of size n in O(n^2) time. We address the challenge of computing the similarity of two strings in sub-quadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations. The speed-up is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n^2 / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn^2 / log n), where h ≤ 1 is the entropy of the text.
Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France, email: mac@univ-mlv.fr.
Department of Computer Science, Haifa University, Haifa 31905, Israel, phone: (972-4) 824-0103, FAX: (972-4) 824-9331; Department of Computer and Information Science, Polytechnic University, Six MetroTech Center, Brooklyn, NY 11201-3840; email: landau@poly.edu; partially supported by NSF grant CCR-0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation (grants 173/98 and 282/01), by the FIRST Foundation of the Israel Academy of Science and Humanities, and by IBM Faculty Partnership Award.
Department of Computer Science, Haifa University, Haifa 31905, Israel; On Education Leave from the IBM T.J.W. Research Center; email: michal@cs.haifa.il; partially supported by the Israel Science Foundation (grants 173/98 and 282/01), and by the FIRST Foundation of the Israel Academy of Science ...
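The abstract above relies on a Lempel-Ziv parsing of each string to define the blocks of the dynamic programming matrix. As a rough illustration of the parsing step only, the sketch below performs a plain LZ78 phrase parse; the function name `lz78_parse` is hypothetical, and the block-based alignment itself is not reproduced here.

```python
def lz78_parse(text):
    """Split text into LZ78 phrases.

    Each phrase is the longest previously seen phrase extended by one
    character, so repetitive strings yield few, long phrases.
    """
    seen = {""}        # phrases encountered so far (empty phrase included)
    phrases = []
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in seen:
            # Keep extending while the candidate is already a known phrase.
            current = candidate
        else:
            seen.add(candidate)
            phrases.append(candidate)
            current = ""
    if current:
        # A trailing remainder may repeat an earlier phrase.
        phrases.append(current)
    return phrases


phrases_repetitive = lz78_parse("aaaaaa")
phrases_mixed = lz78_parse("abab")
```

Concatenating the phrases always gives back the input. For repetitive input the phrases grow longer as the parse proceeds, which is the property that compression-based speed-ups of this kind depend on.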
Approximate Matching of Run-Length Compressed Strings
- Algorithmica
, 2001
Cited by 15 (0 self)
We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n, compressed to m' and n' runs. We extend an existing algorithm for the LCS to the Levenshtein distance achieving O(m'n + n'm) complexity.
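To make the setting concrete, here is a minimal sketch of run-length encoding together with the classical Levenshtein dynamic program on the uncompressed strings. It is only the quadratic baseline that run-length-aware algorithms aim to beat, not the algorithm of the paper; the names `rle_encode` and `levenshtein` are illustrative.

```python
from itertools import groupby


def rle_encode(s):
    """Return the run-length encoding of s as (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def levenshtein(a, b):
    """Classical O(len(a) * len(b)) edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


runs = rle_encode("aaabbc")
distance = levenshtein("kitten", "sitting")
```

The baseline fills one cell per pair of characters, while the run-length-compressed approach is stated in terms of the number of runs, so the gain depends on how many runs the encoding produces.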
Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions
Cited by 7 (4 self)
We present a method to speed up the dynamic program algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. We discuss the application of our method to Viterbi’s decoding and training algorithms [33], as well as to the forward-backward and Baum-Welch [6] algorithms. Our approach is based on identifying repeated substrings in the observed input sequence. Initially, we show how to exploit repetitions of all sufficiently small substrings (this is similar to the Four Russians method). Then, we describe four algorithms based alternatively on run length encoding (RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte pair encoding (BPE). Compared to Viterbi’s algorithm, we achieve speedups of Θ(log n) using the Four Russians method, Ω(r / log r) using RLE, Ω(log n / k) using LZ78, Ω(r / k) using SLP, and Ω(r) using BPE, where k is the number of hidden states, n is the length of the observed sequence and r is its compression ratio (under each compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster in practice. Furthermore, unlike Viterbi’s algorithm, our algorithms are highly parallelizable.
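The idea of exploiting repeated substrings can be illustrated with a small sketch, under the assumption that each Viterbi step is written as a max-plus matrix-vector product. The combined matrix of a repeated block is then computed once and reused at every occurrence. All names and the toy two-state model below are hypothetical, and this computes only the best path score, not the paper's full algorithms.

```python
import math


def symbol_matrix(log_trans, log_emit, sym):
    """M[i][j] = log P(state j -> state i) + log P(sym | state i)."""
    k = len(log_trans)
    return [[log_trans[j][i] + log_emit[i][sym] for j in range(k)]
            for i in range(k)]


def maxplus_matvec(M, v):
    """Max-plus product of a matrix with a vector (one Viterbi step)."""
    return [max(m + x for m, x in zip(row, v)) for row in M]


def maxplus_matmul(A, B):
    """Max-plus product of two square matrices."""
    cols = list(zip(*B))
    return [[max(a + b for a, b in zip(row, col)) for col in cols]
            for row in A]


def viterbi_score(log_init, log_trans, log_emit, obs, block=None):
    """Log-probability of the best state path for obs.

    If block is given, its combined matrix is built once and applied in a
    single step at each occurrence found by a left-to-right scan.
    """
    k = len(log_init)
    v = [log_init[i] + log_emit[i][obs[0]] for i in range(k)]
    block = list(block) if block else None
    block_matrix = None
    if block:
        block_matrix = symbol_matrix(log_trans, log_emit, block[0])
        for sym in block[1:]:
            # Later symbols act after earlier ones, so multiply on the left.
            block_matrix = maxplus_matmul(
                symbol_matrix(log_trans, log_emit, sym), block_matrix)
    pos = 1
    while pos < len(obs):
        if block and list(obs[pos:pos + len(block)]) == block:
            v = maxplus_matvec(block_matrix, v)
            pos += len(block)
        else:
            v = maxplus_matvec(symbol_matrix(log_trans, log_emit, obs[pos]), v)
            pos += 1
    return max(v)


# Toy model (assumed values): log_trans[j][i] is j -> i, log_emit[i][sym].
log_init = [math.log(0.6), math.log(0.4)]
log_trans = [[math.log(0.7), math.log(0.3)], [math.log(0.4), math.log(0.6)]]
log_emit = [[math.log(0.9), math.log(0.1)], [math.log(0.2), math.log(0.8)]]
obs = [0, 1, 0, 1, 0, 1, 1]
plain = viterbi_score(log_init, log_trans, log_emit, obs)
fast = viterbi_score(log_init, log_trans, log_emit, obs, block=[0, 1])
```

Because max-plus multiplication is associative, both calls compute the same score up to floating-point rounding. The saving comes from paying for the block matrix once rather than once per occurrence.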
Towards Formal Structural Representation of Spoken Language: An Evolving Transformation System (ETS) Approach
, 2005
Cited by 5 (0 self)
Speech recognition has been a very active area of research over the past twenty years. Despite evident progress, it is generally agreed by the practitioners of the field that performance of the current speech recognition systems is rather suboptimal and new approaches are needed. The motivation behind the undertaken research is an observation that the notion of representation of objects and concepts that once was considered to be central in the early days of pattern recognition, has been largely marginalised by the advent of statistical approaches. As a consequence of a predominantly statistical approach to the speech recognition problem, due to the numeric, feature vector-based, nature of representation, the classes inductively discovered from real data using decision-theoretic techniques have little meaning outside the statistical framework. This is because decision surfaces or probability distributions are difficult to analyse linguistically. Because of the latter limitation it is doubtful that the gap between speech recognition and linguistic research can be bridged by the numeric representations. This thesis investigates an alternative, structural, approach to spoken language representation and categorisa-
Pattern Matching in Compressed Text and Images
, 2001
Cited by 4 (4 self)
Normally compressed data needs to be decompressed before it is processed, but if the compression has been done in the right way, it is often possible to search the data without having to decompress it, or at least only partially decompress it. The problem can be divided into lossless and lossy compression methods, and then in each of these cases the pattern matching can be either exact or inexact. Much work has been reported in the literature on techniques for all of these cases, including algorithms that are suitable for pattern matching for various compression methods, and compression methods designed specifically for pattern matching. This work is surveyed in this paper. The paper also exposes the important relationship between pattern matching and compression, and proposes some performance measures for compressed pattern matching algorithms. Ideas and directions for future work are also described.
The SBC-Tree: An Index for Run-Length Compressed Sequences
Cited by 2 (0 self)
Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure [7]. The SBC-tree supports pattern matching queries such as substring matching, prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences, and B is the disk page size. Substring matching, prefix matching, and range search execute in an optimal O(log_B N + |p| + T) I/O operations, where |p| is the
Accelerating Dynamic Programming
, 2009
Cited by 1 (0 self)
Dynamic Programming (DP) is a fundamental problem-solving technique that has been widely used for solving a broad range of search and optimization problems. While DP can be invoked when more specialized methods fail, this generality often incurs a cost in efficiency. We explore a unifying toolkit for speeding up DP, and algorithms that use DP as subroutines. Our methods and results can be summarized as follows. – Acceleration via Compression. Compression is traditionally used to efficiently store data. We use compression in order to identify repeats in the table that imply a redundant computation. Utilizing these repeats requires a new DP, and often different DPs for different compression schemes. We present the first provable speedup of the celebrated Viterbi algorithm (1967) that is used for the decoding and training of Hidden Markov Models (HMMs). Our speedup relies on the compression of the HMM’s observable sequence. – Totally Monotone Matrices. It is well known that a wide variety of DPs can be reduced to the problem of finding row minima in totally monotone matrices. We introduce this scheme in the context of planar graph problems. In particular, we show that planar graph problems
Using Edit Distance in Point-Pattern Matching
- In Proc. 8th Workshop on String Processing and Information Retrieval (SPIRE 2001), IEEE CS
, 2001
Edit distance is a powerful measure of similarity in string matching, measuring the minimum number of insertions, deletions, and substitutions needed to convert one string into another. This measure is often contrasted with time warping in speech processing, which measures how close two trajectories are by allowing compression and expansion operations on the time scale. Time warping can be easily generalized to measure the similarity between 1D point-patterns (ascending lists of real values), as the difference between the i-th and (i-1)-th points in a point-pattern can be considered as the value of a trajectory at time i. However, we show that edit distance is a more natural choice, and derive a measure by calculating the minimum amount of space that must be inserted and deleted between points to convert one point-pattern into another. We show that this measure defines a metric. We also define a substitution operation such that the distance calculation automatically separates the points into matching and mismatching points. The algorithms are based on dynamic programming. The main motivation for these methods is two and higher dimensional point-pattern matching, and therefore we generalize these methods to the 2D case, and show that this generalization leads to an NP-complete problem. There are also applications for the 1D case; we briefly discuss the matching of tree ring sequences in dendrochronology.
Fast Algorithms for Computing the Constrained LCS of Run-Length Encoded Strings
In the constrained longest common subsequence (CLCS) problem, we are given two sequences X, Y and the constrained sequence P in run-length encoded (RLE) format, where |X| = n, |Y| = m and |P| = r and the numbers of runs in RLE format are N, M and R, respectively. In this paper, we show that after the sequences are encoded, the CLCS problem can be solved in O(NMr + r × min{q1, q2} + q3) time, where q1 and q2 denote the numbers of elements in the bottom and right boundaries of the partially matched blocks on the first layer, and q3 denotes the number of elements of whole boundaries of all fully matched cuboids in the DP lattice. If the compression ratio is good, our work obviously outperforms the previously known DP algorithm and the Hunt-and-Szymanski-like algorithm.
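For reference, the CLCS problem itself can be stated as a three-dimensional dynamic program on the uncompressed strings. The sketch below is that standard O(nmr) baseline, not the RLE-based algorithm of the abstract; the function name `clcs_length` is hypothetical.

```python
def clcs_length(x, y, p):
    """Length of the longest common subsequence of x and y that contains p
    as a subsequence, or None if no such common subsequence exists."""
    NEG = float("-inf")                      # marks infeasible states
    n, m, r = len(x), len(y), len(p)
    # L[i][j][k]: best length over x[:i] and y[:j] while covering p[:k].
    L = [[[NEG] * (r + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        L[i][0][0] = 0
    for j in range(m + 1):
        L[0][j][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(r + 1):
                if x[i - 1] == y[j - 1]:
                    if k > 0 and x[i - 1] == p[k - 1]:
                        # The matched character also consumes p[k-1].
                        L[i][j][k] = L[i - 1][j - 1][k - 1] + 1
                    else:
                        L[i][j][k] = L[i - 1][j - 1][k] + 1
                else:
                    L[i][j][k] = max(L[i - 1][j][k], L[i][j - 1][k])
    best = L[n][m][r]
    return None if best == NEG else int(best)


unconstrained = clcs_length("xab", "abx", "")
constrained = clcs_length("xab", "abx", "x")
infeasible = clcs_length("abc", "abc", "d")
```

With an empty constraint this reduces to the ordinary LCS length. Requiring "x" in the example forces the shorter common subsequence "x", and a constraint that cannot be embedded yields None.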
A fast and simple algorithm for computing the longest common ...
- Information Processing Letters

