Results 1 - 10 of 17
An O(ND) Difference Algorithm and Its Variations
- Algorithmica, 1986
Cited by 133 (4 self)
The problems of finding a longest common subsequence of two sequences A and B and a shortest edit script for transforming A into B have long been known to be dual problems. In this paper, they are shown to be equivalent to finding a shortest/longest path in an edit graph. Using this perspective, a simple O(ND) time and space algorithm is developed where N is the sum of the lengths of A and B and D is the size of the minimum edit script for A and B. The algorithm performs well when differences are small (sequences are similar) and is consequently fast in typical applications. The algorithm is shown to have O(N + D^2) expected-time performance under a basic stochastic model. A refinement of the algorithm requires only O(N) space, and the use of suffix trees leads to an O(N lg N + D^2) time variation.
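The greedy idea behind the O(ND) bound can be sketched as follows. This is a minimal illustration, not the paper's own code: it only returns D, the size of a shortest insert/delete script, and the function name `shortest_edit_script_length` and the dictionary `furthest` are assumptions made for this example. For each number of edits d it records, per diagonal k = x - y of the edit graph, the furthest x reachable with d edits.

```python
def shortest_edit_script_length(a, b):
    """Return D, the number of insertions plus deletions in a shortest
    script that transforms sequence a into sequence b."""
    n, m = len(a), len(b)
    # furthest[k] = largest x reached so far on diagonal k (k = x - y).
    furthest = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Extend from the neighbouring diagonal that reaches further.
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]        # vertical edge: insert b[y]
            else:
                x = furthest[k - 1] + 1    # horizontal edge: delete a[x]
            y = x - k
            # Follow the free diagonal edges (matching symbols).
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m  # not reached: d = n + m always suffices


if __name__ == "__main__":
    print(shortest_edit_script_length("ABCABBA", "CBABAC"))
```

The outer loop runs D + 1 times and each pass touches at most d + 1 diagonals, which is where the dependence on D rather than on the product of the lengths comes from.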
Identifying the Semantic and Textual Differences Between Two Versions of a Program
- Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, 1990
Cited by 86 (5 self)
Text-based file comparators (e.g., the Unix utility diff) are very general tools that can be applied to arbitrary files. However, using such tools to compare programs can be unsatisfactory because their only notion of change is based on program text rather than program behavior. This paper describes a technique for comparing two versions of a program, determining which program components represent changes, and classifying each changed component as representing either a semantic or a textual change.

This work was supported in part by the Defense Advanced Research Projects Agency, monitored by the Office of Naval Research under contract N00014-88-K, by the National Science Foundation under grant CCR8958530, and by grants from Xerox, Kodak, and Cray. Author's address: Computer Sciences Department, Univ. of Wisconsin, 1210 W. Dayton St., Madison, WI 53706. Permission to copy without fee all or part of this material is granted provided that the copies are not made...
Identifying Syntactic Differences Between Two Programs
- Software: Practice and Experience, 1991
Cited by 64 (0 self)
This paper is organized into five sections, as follows. The internal form of a program, which is a variant of a parse tree, is discussed in the next section. Then the tree-matching algorithm and the synchronous pretty-printing technique are described. Experience with the comparator for the C language and some performance measurements are also presented. The last section discusses related work and concludes this paper.
A file comparison program
- Software: Practice and Experience, 1985
Cited by 51 (3 self)
This paper presents a simple method for computing a shortest sequence of insertion and deletion commands that converts one given file to another. The method is particularly efficient when the difference between the two files is small compared to the files' lengths. In experiments performed on typical files, the program often ran four times faster than the UNIX diff command.

KEY WORDS: Edit distance, Edit script, File comparison
Incremental String Comparison
- SIAM Journal on Computing, 1995
Cited by 34 (3 self)
The problem of comparing two sequences A and B to determine their LCS or the edit distance between them has been much studied. In this paper we consider the following incremental version of these problems: given an appropriate encoding of a comparison between A and B, can one incrementally compute the answer for A and bB, and the answer for A and Bb with equal efficiency, where b is an additional symbol? Our main result is a theorem exposing a surprising relationship between the dynamic programming solutions for two such "adjacent" problems. Given a threshold k on the number of differences to be permitted in an alignment, the theorem leads directly to an O(k) algorithm for incrementally computing a new solution from an old one, as contrasts the O(k²) time required to compute a solution from scratch. We further show with a series of applications that this algorithm is indeed more powerful than its non-incremental counterpart by solving the applications with greater asymptotic ef...
Longest Common Subsequences
- In Proc. of 19th MFCS, number 841 in LNCS, 1994
Cited by 25 (1 self)
The length of a longest common subsequence (LLCS) of two or more strings is a useful measure of their similarity. The LLCS of a pair of strings is related to the `edit distance', or number of mutations/errors/editing steps required in passing from one string to the other. In this talk, we explore some of the combinatorial properties of the sub- and super-sequence relations, survey various algorithms for computing the LLCS, and introduce some results on the expected LLCS for pairs of random strings.

1 Introduction

The set $\Sigma^*$ of finite strings over an unordered finite alphabet $\Sigma$ admits of several natural partial orders. Some, such as the substring, prefix, and suffix relations, depend on contiguity and lead to many interesting combinatorial questions with practical applications to string-matching. An excellent survey is given by Aho in [1]. In this talk however we will focus on the `subsequence' partial order. We say that $u = u_1 \cdots u_m$ is a subsequence of ...
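The relation between the LLCS and the edit distance mentioned in this abstract can be made concrete with the textbook dynamic program. The sketch below is illustrative only: the names `llcs` and `indel_distance` are chosen for this example, and the distance shown assumes the simple cost model in which only insertions and deletions are allowed (no substitutions).

```python
def llcs(u, v):
    """Length of a longest common subsequence of u and v,
    computed row by row in O(len(u) * len(v)) time."""
    prev = [0] * (len(v) + 1)
    for cu in u:
        cur = [0]
        for j, cv in enumerate(v, start=1):
            if cu == cv:
                cur.append(prev[j - 1] + 1)            # extend a match
            else:
                cur.append(max(prev[j], cur[j - 1]))   # skip one symbol
        prev = cur
    return prev[-1]


def indel_distance(u, v):
    """Insert/delete edit distance: every symbol outside a longest
    common subsequence must be deleted from u or inserted from v."""
    return len(u) + len(v) - 2 * llcs(u, v)


if __name__ == "__main__":
    print(llcs("ABCABBA", "CBABAC"), indel_distance("ABCABBA", "CBABAC"))
```

Only two rows of the table are kept at a time, so the space used is linear in the length of the second string.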
Measuring the Accuracy of Page-Reading Systems
- Ph.D. Dissertation, UNLV, Las Vegas, 1996
Cited by 8 (3 self)
Given a bitmapped image of a page from any document, a page-reading system identifies the characters on the page and stores them in a text file. This “OCR-generated” text is represented by a string and compared with the correct string to determine the accuracy of this process. The string editing problem is applied to find an optimal correspondence of these strings using an appropriate cost function. The ISRI annual test of page-reading systems utilizes the following performance measures, which are defined in terms of this correspondence and the string edit distance: character accuracy, throughput, accuracy by character class, marked character efficiency, word accuracy, non-stopword accuracy, and phrase accuracy. It is shown that the universe of cost functions is divided into equivalence classes, and the cost functions related to the longest common subsequence (LCS) are identified. The computation of an LCS can be made faster by a linear-time preprocessing step.
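As a rough illustration of how a measure such as character accuracy can be derived from the string edit distance, the sketch below uses unit costs for insertion, deletion, and substitution. Both the cost function and the formula (n - errors) / n, with n the length of the correct string, are assumptions made for this example; the abstract does not state the exact definitions, and the names `edit_distance` and `character_accuracy` are hypothetical.

```python
def edit_distance(correct, generated):
    """Unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(generated) + 1))
    for i, c in enumerate(correct, start=1):
        cur = [i]
        for j, g in enumerate(generated, start=1):
            cur.append(min(prev[j] + 1,              # delete c
                           cur[j - 1] + 1,           # insert g
                           prev[j - 1] + (c != g)))  # match or substitute
        prev = cur
    return prev[-1]


def character_accuracy(correct, generated):
    """Assumed definition: (n - errors) / n, where n = len(correct)
    and errors is the edit distance. May be negative for very noisy output."""
    if not correct:
        raise ValueError("the correct string must be non-empty")
    errors = edit_distance(correct, generated)
    return (len(correct) - errors) / len(correct)


if __name__ == "__main__":
    print(character_accuracy("hello world", "hallo wor1d"))
```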
Sequence Comparison: Some Theory and Some Practice
- 1988
Cited by 7 (0 self)
A brief survey of the theory and practice of sequence comparison is made focusing on diff, the UNIX file difference utility.

1 Sequence comparison

Sequence comparison is a deep and fascinating subject in Computer Science, both theoretical and practical. However, in our opinion, neither the theoretical nor the practical aspects of the problem are well understood and we feel that their mastery is a true challenge for Computer Science. The central problem can be stated very easily: find an algorithm, as efficient and practical as possible, to compute a longest common subsequence (lcs for short) of two given sequences. As usual, a subsequence of a sequence is another sequence obtained from it by deleting some (not necessarily contiguous) terms. Thus, both en/pri and en/pai are longest common subsequences of sequence/comparison and theory/and/practice.

Part of this work was done while the author was visiting the Université de Rouen, in 1987. That visit was partially supported...
New Algorithms for the Longest Common Subsequence Problem
- 1994
Cited by 6 (0 self)
Given two sequences $A = a_1 a_2 \ldots a_m$ and $B = b_1 b_2 \ldots b_n$, $m \le n$, over some alphabet $\Sigma$, a common subsequence $C = c_1 c_2 \ldots c_l$ of A and B is a sequence that can be obtained from both A and B by deleting zero or more (not necessarily adjacent) symbols. Finding a common subsequence of maximal length is called the Longest Common Subsequence (LCS) Problem. Two new algorithms based on the well-known paradigm of computing minimal matches are presented. One runs in time $O(ns + \min\{ds, pm\})$ and the other runs in time $O(ns + \min\{p(n - p), pm\})$, where $s = |\Sigma|$ is the alphabet size, p is the length of a longest common subsequence and d is the number of minimal matches. The ns term is charged by a standard preprocessing phase. When m n both algorithms are fast in situations when an LCS is expected to be short as well as in situations when an LCS is expected to be long. Further they show a much smaller degeneration in intermediate situations, especially the second al...
Bit-Parallel LCS-length Computation Revisited
- In Proc. 15th Australasian Workshop on Combinatorial Algorithms (AWOCA), 2004
Cited by 6 (4 self)
The longest common subsequence (LCS) is a classic and well-studied measure of similarity between two strings A and B. This problem has two variants: determining the length of the LCS (LLCS), and recovering an LCS itself. In this paper we address the first of these two. Let m and n denote the lengths of the strings A and B, respectively, and w denote the computer word size. First we give a slightly improved formula for the bit-parallel $O(\lceil m/w \rceil n)$ LLCS algorithm of Crochemore et al. [4]. Then we discuss the relative performance of the bit-parallel algorithms and compare our variant against one of the best conventional LLCS algorithms. Finally we propose and evaluate an $O(\lceil d/w \rceil n)$ version of the algorithm, where d is the simple (indel) edit distance between A and B.
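A bit-parallel LLCS computation of this kind can be sketched in a few lines. The example below uses the commonly cited update V' = (V + (V & M)) | (V & ~M), which is not necessarily the improved formula proposed in this paper, and it relies on Python's arbitrary-precision integers in place of an array of $\lceil m/w \rceil$ machine words. The names `bit_parallel_llcs`, `match`, and `full` are assumptions made for this illustration.

```python
def bit_parallel_llcs(a, b):
    """Length of a longest common subsequence of a and b, processing
    one symbol of b per step with whole-vector bit operations."""
    m = len(a)
    full = (1 << m) - 1          # mask keeping the low m bits
    # match[c] has bit i set exactly when a[i] == c.
    match = {}
    for i, c in enumerate(a):
        match[c] = match.get(c, 0) | (1 << i)
    v = full                     # all ones: nothing matched yet
    for c in b:
        mc = match.get(c, 0)
        # The addition propagates carries through runs of set bits,
        # which plays the role of the row-wise maximum in the usual table.
        v = ((v + (v & mc)) | (v & ~mc)) & full
    # Each zero bit among the low m bits marks an increase of the LLCS.
    return m - bin(v).count("1")


if __name__ == "__main__":
    print(bit_parallel_llcs("ABCABBA", "CBABAC"))
```

With fixed-width words the same update is applied word by word with explicit carry handling, giving n steps of $\lceil m/w \rceil$ word operations each.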

