A Guided Tour to Approximate String Matching
 ACM Computing Surveys
, 1999
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast-growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
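As an illustration of online searching under edit distance, the following is a minimal sketch of the classical column-by-column dynamic program, in which row 0 is held at zero so that a match may start at any text position. It is one textbook baseline, not necessarily any of the algorithms the survey recommends; the function name `approx_find` and its return convention (1-based end positions) are assumptions made for this example.

```python
def approx_find(pattern: str, text: str, k: int) -> list[int]:
    """Return 1-based end positions in `text` where some substring
    ending there is within edit distance k of `pattern`."""
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and any substring
    # of the text ending at the current position.
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # value of cell (i-1, j-1)
        col[0] = 0          # an occurrence may start anywhere in the text
        for i in range(1, m + 1):
            cur = col[i]    # value of cell (i, j-1), saved before overwrite
            col[i] = min(
                prev_diag + (pattern[i - 1] != c),  # match or substitution
                cur + 1,                            # extra text character
                col[i - 1] + 1,                     # missing pattern character
            )
            prev_diag = cur
        if col[m] <= k:
            hits.append(j)
    return hits
```

This sketch runs in O(mn) time and O(m) space for a pattern of length m and a text of length n.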
An O(ND) Difference Algorithm and Its Variations
 Algorithmica
, 1986
The problems of finding a longest common subsequence of two sequences A and B and a shortest edit script for transforming A into B have long been known to be dual problems. In this paper, they are shown to be equivalent to finding a shortest/longest path in an edit graph. Using this perspective, a simple O(ND) time and space algorithm is developed where N is the sum of the lengths of A and B and D is the size of the minimum edit script for A and B. The algorithm performs well when differences are small (sequences are similar) and is consequently fast in typical applications. The algorithm is shown to have O(N + D²) expected-time performance under a basic stochastic model. A refinement of the algorithm requires only O(N) space, and the use of suffix trees leads to an O(N lg N + D²) time variation.
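The greedy idea behind the O(ND) bound can be sketched as follows: for each candidate script size d, track the furthest-reaching point on every diagonal of the edit graph and stop as soon as the end corner is reached. This is a minimal sketch that returns only D (insertions and deletions, no substitutions), not the script itself; the function name `ses_length` and the use of a dictionary for the diagonal table are choices made for this example.

```python
def ses_length(a, b) -> int:
    """Length D of a shortest insert/delete edit script turning a into b."""
    n, m = len(a), len(b)
    # v[k] = furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # step down: take a symbol from b
            else:
                x = v[k - 1] + 1    # step right: drop a symbol from a
            y = x - k
            # Follow the "snake" of matching symbols at no cost.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m
```

For example, `ses_length("ABCABBA", "CBABAC")` gives 5, consistent with a longest common subsequence of length 4 for these two strings (7 + 6 - 2·4 = 5).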
Delta Algorithms: An Empirical Analysis
, 1998
Delta algorithms compress data by encoding one file in terms of another. This type of compression is useful in a number of situations: storing multiple versions of data, displaying differences, merging changes, distributing updates, storing backups, transmitting video sequences, and others. This paper studies the performance parameters of several delta algorithms, using a benchmark of over 1300 pairs of files taken from two successive releases of GNU software. Results indicate that modern delta compression algorithms based on Ziv-Lempel techniques significantly outperform diff, a popular but older delta compressor, in terms of compression ratio. The modern compressors also correlate better with the actual difference between files without sacrificing performance.
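To make "encoding one file in terms of another" concrete, here is a minimal sketch of a copy/add delta built on Python's standard `difflib.SequenceMatcher`. It is a diff-style illustration only, not one of the Ziv-Lempel-based compressors the paper evaluates; the instruction format (`"copy"` and `"add"` tuples) and the names `make_delta` and `apply_delta` are assumptions made for this example.

```python
from difflib import SequenceMatcher


def make_delta(old: str, new: str) -> list[tuple]:
    """Encode `new` as instructions that refer back to `old`."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Reuse a run of `old` instead of storing it again.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Store only the text that `old` cannot supply.
            ops.append(("add", new[j1:j2]))
        # "delete" needs no instruction: that part of `old` is simply skipped.
    return ops


def apply_delta(old: str, delta: list[tuple]) -> str:
    """Rebuild the new version from `old` and the delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            _, start, length = op
            out.append(old[start:start + length])
        else:
            out.append(op[1])
    return "".join(out)
```

Applying `apply_delta(old, make_delta(old, new))` reproduces `new`; the delta is small when the two versions share long runs.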
Incremental String Comparison
 SIAM Journal on Computing
, 1995
The problem of comparing two sequences A and B to determine their LCS or the edit distance between them has been much studied. In this paper we consider the following incremental version of these problems: given an appropriate encoding of a comparison between A and B, can one incrementally compute the answer for A and bB, and the answer for A and Bb with equal efficiency, where b is an additional symbol? Our main result is a theorem exposing a surprising relationship between the dynamic programming solutions for two such "adjacent" problems. Given a threshold k on the number of differences to be permitted in an alignment, the theorem leads directly to an O(k) algorithm for incrementally computing a new solution from an old one, in contrast to the O(k²) time required to compute a solution from scratch. We further show with a series of applications that this algorithm is indeed more powerful than its nonincremental counterpart by solving the applications with greater asymptotic ef...
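For readers unfamiliar with thresholded comparison, the sketch below shows what a threshold k buys in the ordinary, non-incremental setting: only cells within k of the main diagonal of the dynamic programming table can hold a value of at most k, so the rest can be skipped. This is a plain banded computation, not the paper's incremental algorithm and not the O(k²) from-scratch method it refers to; the function name `edit_distance_within` is an assumption made for this example.

```python
def edit_distance_within(a: str, b: str, k: int):
    """Return the edit distance of a and b if it is at most k, else None."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    inf = k + 1  # any value above k is treated as "too far"
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        # Full rows are allocated for clarity; only band cells are computed.
        cur = [inf] * (m + 1)
        if i <= k:
            cur[0] = i
        for j in range(max(1, i - k), min(m, i + k) + 1):
            cur[j] = min(
                prev[j - 1] + (a[i - 1] != b[j - 1]),  # match or substitution
                prev[j] + 1,                           # delete from a
                cur[j - 1] + 1,                        # insert into a
                inf,
            )
        prev = cur
    return prev[m] if prev[m] <= k else None
```

With "kitten" and "sitting", a threshold of 3 returns 3, while a threshold of 2 returns None.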
An Empirical Study of Delta Algorithms
, 1996
Delta algorithms compress data by encoding one file in terms of another. This type of compression is useful in a number of situations: storing multiple versions of data, distributing updates, storing backups, transmitting video sequences, and others. This paper studies the performance parameters of several delta algorithms, using a benchmark of over 1300 pairs of files taken from two successive releases of GNU software. Results indicate that modern delta compression algorithms based on Ziv-Lempel techniques significantly outperform diff, a popular but older delta compressor, in terms of compression ratio. The modern compressors also correlate better with the actual difference between files; one of them is even faster than diff in both compression and decompression speed.
1 Introduction. Delta algorithms, i.e., algorithms that compute differences between two files or strings, have a number of uses when multiple versions of data objects must be stored, transmitted, or proce...
The Table Layout Problem
 In Proc. 15th SoCG
, 1999
In this paper we study a geometric problem arising in typography: the problem of laying out a two-dimensional table. Each cell of the table has content associated with it. We may have choices on the geometry of cells (e.g., the number of rows to use for the text in a cell). The problem is to choose configurations for the cells to optimize an objective function such as minimum table height given a fixed width for the table. We formulate a combinatorial version of the table layout problem, where the objective is to choose cell geometry to minimize table size. The table layout problem is NP-complete, even for very restricted instances. One of our main results is an algorithm for computing the convex hull of the set of feasible table configurations, which gives a heuristic algorithm for table layout. We establish a connection between the fractional (LP) solution to the table layout problem and generalized network flow. We also present experimental results comparing the performance of heuristic...