A Guided Tour to Approximate String Matching
 ACM Computing Surveys
, 1999
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming an increasingly relevant issue for many fast-growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
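As an illustration of online searching under edit distance, the following is a minimal sketch of the classical column-wise dynamic programming approach (often attributed to Sellers). It is not taken from the survey itself; the function name `approx_find` and the convention of reporting 1-based end positions are assumptions made for this example.

```python
def approx_find(pattern, text, k):
    """Return 1-based end positions j such that some substring of text
    ending at j is within edit distance k of pattern.

    Hypothetical helper for illustration; runs in O(len(pattern) * len(text)).
    """
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and any substring
    # of the text ending at the current text position.
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]   # previous column, row i - 1
        col[0] = 0           # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]     # previous column, row i
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra text character
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            hits.append(j)
    return hits
```

For instance, `approx_find("abc", "xabdx", 1)` reports end positions 3 and 4, because both "ab" and "abd" are one edit away from the pattern.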
An Introduction to Machine Translation
, 1992
Abstract. In the last ten years there has been a significant amount of research in Machine Translation within a “new” paradigm of empirical approaches, often labelled collectively as “Example-based” approaches. The first manifestation of this approach caused some surprise and hostility among observers more used to different ways of working, but the techniques were quickly adopted and adapted by many researchers, often creating hybrid systems. This paper reviews the various research efforts within this paradigm reported to date, and attempts a categorisation of different manifestations of the general approach.
Optimal alignments in linear space
 CABIOS
, 1988
Space, not time, is often the limiting factor when computing optimal sequence alignments, and a number of recent papers in the biology literature have proposed space-saving strategies. However, a 1975 computer science paper by Hirschberg presented a method that is superior to the newer proposals, both in theory and in practice. The goal of this note is to give Hirschberg’s idea the visibility it deserves by developing a linear-space version of Gotoh’s algorithm, which accommodates affine gap penalties. A portable C-software package implementing this algorithm is available on the BIONET free of charge.
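Hirschberg's divide-and-conquer idea can be sketched in a simplified setting. The example below recovers a longest common subsequence rather than an affine-gap alignment, so it is not the Gotoh-based algorithm the note develops; the function names are assumptions for this illustration, and Python slicing makes temporary copies that a careful implementation would avoid.

```python
def lcs_last_row(a, b):
    """Last row of the LCS-length table for a vs. b, keeping only two rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def hirschberg_lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    n = len(b)
    # Forward scores for the first half of a against prefixes of b.
    fwd = lcs_last_row(a[:mid], b)
    # Backward scores for the second half of a against suffixes of b.
    bwd = lcs_last_row(a[mid:][::-1], b[::-1])
    # Split b where the two halves together give the largest total.
    split = max(range(n + 1), key=lambda j: fwd[j] + bwd[n - j])
    return (hirschberg_lcs(a[:mid], b[:split])
            + hirschberg_lcs(a[mid:], b[split:]))
```

Each recursion level only needs score rows of length proportional to `b`, which is what keeps the working space linear instead of quadratic.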
Algorithms for the longest common subsequence problem
 J. ACM
, 1977
Abstract. Two algorithms are presented that solve the longest common subsequence problem. The first algorithm is applicable in the general case and requires O(pn + n log n) time, where p is the length of the longest common subsequence. The second algorithm requires time bounded by O(p(m + 1 - p) log n). In the common special case where p is close to m, this algorithm takes much less time than n^2. Key words and phrases: subsequence, common subsequence, algorithm. CR categories: 3.73, 3.79, 5.25, 5.39
An O(ND) Difference Algorithm and Its Variations
 Algorithmica
, 1986
The problems of finding a longest common subsequence of two sequences A and B and a shortest edit script for transforming A into B have long been known to be dual problems. In this paper, they are shown to be equivalent to finding a shortest/longest path in an edit graph. Using this perspective, a simple O(ND) time and space algorithm is developed where N is the sum of the lengths of A and B and D is the size of the minimum edit script for A and B. The algorithm performs well when differences are small (sequences are similar) and is consequently fast in typical applications. The algorithm is shown to have O(N + D^2) expected-time performance under a basic stochastic model. A refinement of the algorithm requires only O(N) space, and the use of suffix trees leads to an O(N lg N + D^2) time variation.
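The greedy search over edit-graph diagonals can be sketched as follows. This is a minimal illustration that computes only D, the size of the shortest insert/delete script, and does not include the linear-space refinement or the suffix-tree variation; the function name `shortest_edit_length` and the array layout are assumptions for this example.

```python
def shortest_edit_length(a, b):
    """Minimum number of insertions plus deletions that turn a into b."""
    n, m = len(a), len(b)
    max_d = n + m
    offset = max_d
    # v[offset + k] = furthest x reached so far on diagonal k = x - y.
    v = [0] * (2 * max_d + 2)
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[offset + k - 1] < v[offset + k + 1]):
                x = v[offset + k + 1]        # move down: insert from b
            else:
                x = v[offset + k - 1] + 1    # move right: delete from a
            y = x - k
            # Follow the diagonal (matching symbols) at no cost.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[offset + k] = x
            if x >= n and y >= m:
                return d
    return max_d
```

On the pair "ABCABBA" and "CBABAC" the function returns 5. The outer loop runs D + 1 times and the inner loop visits at most D + 1 diagonals per round, which is where the dependence on D rather than on the full table size comes from.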
Recognition of Shapes by Editing Their Shock Graphs
 Proc. Int’l Conf. Computer Vision
, 2001
Abstract. This paper presents a novel framework for the recognition of objects based on their silhouettes. The main idea is to measure the distance between two shapes as the minimum extent of deformation necessary for one shape to match the other. Since the space of deformations is very high-dimensional, three steps are taken to make the search practical: 1) define an equivalence class for shapes based on shock-graph topology, 2) define an equivalence class for deformation paths based on shock-graph transitions, and 3) avoid complexity-increasing deformation paths by moving toward shock-graph degeneracy. Despite these steps, which tremendously reduce the search requirement, there still remain numerous deformation paths to consider. To that end, we employ an edit-distance algorithm for shock graphs that finds the optimal deformation path in polynomial time. The proposed approach gives intuitive correspondences for a variety of shapes and is robust in the presence of a wide range of visual transformations. The recognition rates on two distinct databases of 99 and 216 shapes each indicate highly successful within-category matches (100 percent in top three matches), which render the framework potentially usable in a range of shape-based recognition applications. Index Terms: Shape deformation, shock graphs, graph matching, edit distance, shape matching, object recognition, dynamic programming.
A PATTERN MATCHING MODEL FOR MISUSE INTRUSION DETECTION
This paper describes a generic model of matching that can be usefully applied to misuse intrusion detection. The model is based on Colored Petri Nets. Guards define the context in which signatures are matched. The notion of start and final states, and paths between them define the set of event sequences matched by the net. Partial order matching can also be specified in this model. The main benefits of the model are its generality, portability and flexibility.
Detecting Changes in XML Documents
 In ICDE
, 2001
We present a diff algorithm for XML data. This work is motivated by the support for change control in the context of the Xyleme project, which is investigating dynamic warehouses capable of storing massive volumes of XML data. Because of the context, our algorithm has to be very efficient in terms of speed and memory space, even at the cost of some loss of "quality". Also, it considers, besides insertions, deletions and updates (standard in diffs), a move operation on subtrees that is essential in the context of XML. Intuitively, our diff algorithm uses signatures to match (large) subtrees that were left unchanged between the old and new versions. Such exact matchings are then possibly propagated to ancestors and descendants to obtain more matchings. It also uses XML-specific information such as ID attributes. We provide a performance analysis of the algorithm. We show that it runs on average in linear time vs. quadratic time for previous algorithms. We present experiments on synthetic data that confirm the analysis. Since this problem is NP-hard, the linear time is obtained by trading some quality. We present experiments (again on synthetic data) that show that the output of our algorithm is reasonably close to the "optimal" in terms of quality. Finally we present experiments on a small sample of XML pages found on the Web.
Meaningful Change Detection in Structured Data
 In Proceedings of the ACM SIGMOD International Conference on Management of Data
, 1997
Detecting changes by comparing data snapshots is an important requirement for difference queries, active databases, and version and configuration management. In this paper we focus on detecting meaningful changes in hierarchically structured data, such as nested-object data. This problem is much more challenging than the corresponding one for relational or flat-file data. In order to describe changes better, we base our work not just on the traditional "atomic" insert, delete, update operations, but also on operations that move an entire subtree of nodes, and that copy an entire subtree. These operations allow us to describe changes in a semantically more meaningful way. Since this change detection problem is NP-hard, in this paper we present a heuristic change detection algorithm that yields close to "minimal" descriptions of the changes, and that has fewer restrictions than previous algorithms. Our algorithm is based on transforming the change detection problem to a problem of com...