Results 1-10 of 57
A Guided Tour to Approximate String Matching
ACM Computing Surveys, 1999
Cited by 418 (38 self)
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
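To make "online searching under edit distance" concrete, here is a minimal sketch of the classic column-by-column dynamic program for this problem. It is an illustration only, not code from the survey; the function name `approx_find` and the error threshold `k` are hypothetical names chosen for this example.

```python
def approx_find(pattern: str, text: str, k: int) -> list[int]:
    """Return the 0-based end positions j such that some substring of
    `text` ending at j is within edit distance k of `pattern`."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of the text that ends at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]  # value of col[i-1] in the previous column
        # col[0] stays 0 because a match may start anywhere in the text.
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra text character
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

The sketch takes O(mn) time and O(m) space for a pattern of length m and a text of length n, and it never needs to preprocess the text, which is what makes it an online method.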
Identifying Syntactic Differences Between Two Programs
Software: Practice and Experience, 1991
Cited by 81 (0 self)
This paper is organized into five sections, as follows. The internal form of a program, which is a variant of a parse tree, is discussed in the next section. Then the tree-matching algorithm and the synchronous pretty-printing technique are described. Experience with the comparator for the C language and some performance measurements are also presented. The last section discusses related work and concludes this paper.
Longest Common Subsequences
In Proc. of 19th MFCS, number 841 in LNCS, 1994
Cited by 29 (1 self)
The length of a longest common subsequence (LLCS) of two or more strings is a useful measure of their similarity. The LLCS of a pair of strings is related to the 'edit distance', or number of mutations/errors/editing steps required in passing from one string to the other. In this talk, we explore some of the combinatorial properties of the sub- and supersequence relations, survey various algorithms for computing the LLCS, and introduce some results on the expected LLCS for pairs of random strings. 1 Introduction. The set Σ* of finite strings over an unordered finite alphabet Σ admits of several natural partial orders. Some, such as the substring, prefix, and suffix relations, depend on contiguity and lead to many interesting combinatorial questions with practical applications to string matching. An excellent survey is given by Aho in [1]. In this talk however we will focus on the 'subsequence' partial order. We say that u = u_1 ··· u_m is a subsequence of ...
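The link between the LLCS and edit distance can be made concrete with a short sketch. It assumes the variant of edit distance that allows only unit-cost insertions and deletions; under that assumption the distance between strings of lengths m and n equals m + n - 2·LLCS. The function names are illustrative and do not come from the paper.

```python
def llcs(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b,
    computed row by row in O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Edit distance when only insertions and deletions are allowed:
    every character outside a fixed LCS is deleted or inserted once."""
    return len(a) + len(b) - 2 * llcs(a, b)


def is_subsequence(u: str, v: str) -> bool:
    """True if u can be obtained from v by deleting characters."""
    it = iter(v)
    # `ch in it` consumes the iterator, so matches must occur in order.
    return all(ch in it for ch in u)
```

Only one previous row is kept, so the sketch uses space proportional to the shorter dimension chosen for `b`; recovering an actual subsequence rather than its length needs either the full table or a divide-and-conquer scheme.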
Approximate Text Searching
1998
Cited by 22 (6 self)
This thesis focuses on the problem of text retrieval allowing errors, also called "approximate" string matching. The problem is to find a pattern in a text, where the pattern and the text may have "errors". This problem has received a lot of attention in recent years because of its applications in many areas, such as information retrieval, computational biology and signal processing, to name a few. The aim of this work is the development and analysis of novel algorithms to deal with the problem under various conditions, as well as a better understanding of the problem itself and its statistical behavior. Although our results are valid in many different areas, we focus our attention on typical text searching for information retrieval applications. This makes some ranges of values for the parameters of the problem more interesting than others. We have divided this presentation into two parts. The first one deals with online approximate string matching, i.e. when there is no time or space to preprocess the text. These algorithms are the core of offline algorithms as well. Online searching is the area of the problem where better algorithms existed. We have obtained new bounds for the probability of an approximate match of a pattern in ...
Expected Length of Longest Common Subsequences
Cited by 19 (2 self)
Contents
1 Introduction
2 Notation and preliminaries
  2.1 Notation and basic definitions
  2.2 Longest common subsequences
  2.3 Computing longest common subsequences
  2.4 Expected length of longest common subsequences
3 Lower bounds
  3.1 Css machines
  3.2 Analysis of css machines
  3.3 Design of css machines
  3.4 Labeled css machines
4 Upper bounds
  4.1 Collations
  4.2 Previous upper bounds
  4.3 Simple upper bound (binary alphabet)
  4.4 Simple upper bound (alphabet size 3)
  4.5 Upper bounds for binary alphabet ...
Finding Longest Increasing and Common Subsequences in Streaming Data
2003
Cited by 11 (0 self)
In this paper, we present algorithms and lower bounds for the Longest Increasing Subsequence (LIS) and Longest Common Subsequence (LCS) problems in the data streaming model.
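For orientation, the LIS problem itself can be solved in a single left-to-right pass with the well-known patience-sorting technique. The sketch below is that standard method, not the streaming algorithm of the paper; it stores one value per achievable subsequence length, so its memory use grows with the length of the answer. The name `lis_length` is illustrative.

```python
from bisect import bisect_left


def lis_length(stream) -> int:
    """Length of a longest strictly increasing subsequence,
    reading the input once from left to right."""
    # tails[i] = smallest possible last element of an increasing
    # subsequence of length i + 1 seen so far; tails stays sorted.
    tails = []
    for x in stream:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)   # x extends the longest subsequence
        else:
            tails[i] = x      # x gives a smaller tail for length i + 1
    return len(tails)
```

Because the input is consumed through plain iteration, the function also accepts a generator, which is the closest a simple sketch comes to the one-pass setting the abstract refers to.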
Algorithms for Transposition Invariant String Matching (Extended Abstract)
Journal of Algorithms, 2002
Cited by 9 (6 self)
Given strings A and B over an alphabet Σ ⊆ U, where U is some numerical universe closed...
Experimenting an Approximation Algorithm for the LCS
Discrete Applied Mathematics, 1998
Cited by 9 (0 self)
The problem of finding the longest common subsequence (lcs) of a given set of sequences over an alphabet Σ occurs in many interesting contexts, such as data compression and molecular biology, in order to measure the "similarity degree" among biological sequences. Since the problem is NP-complete in its decision version (i.e. does there exist an lcs of length at least k, for a given k?) even over a fixed alphabet, polynomial algorithms which give approximate solutions have been proposed. Among them, Long Run (LR) is the only one with guaranteed constant performance ratio.
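A minimal sketch of the idea behind Long Run follows, assuming the usual description of the algorithm: return the longest string made of a single repeated symbol that is a common subsequence of every input. That description is an assumption here, since the abstract does not spell the algorithm out, and the function name `long_run` is illustrative.

```python
from collections import Counter


def long_run(sequences: list[str]) -> str:
    """Return a longest common subsequence of the form a^k,
    i.e. one symbol repeated k times."""
    if not sequences:
        return ""
    counts = [Counter(s) for s in sequences]
    best_symbol, best_len = "", 0
    # Only symbols of the first sequence can be common to all of them.
    for symbol in sorted(set(sequences[0])):
        # a^k is a subsequence of s exactly when s contains
        # at least k copies of a.
        run = min(c[symbol] for c in counts)
        if run > best_len:
            best_symbol, best_len = symbol, run
    return best_symbol * best_len
```

Under this description the constant ratio is easy to see: an optimal lcs of length L contains some symbol at least L/|Σ| times, and that run is itself a common subsequence, so the result is never shorter than L/|Σ| for a fixed alphabet Σ.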
Measuring the Accuracy of Page-Reading Systems
Ph.D. Dissertation, UNLV, Las Vegas, 1996
Cited by 9 (3 self)
Given a bitmapped image of a page from any document, a page-reading system identifies the characters on the page and stores them in a text file. This "OCR-generated" text is represented by a string and compared with the correct string to determine the accuracy of this process. The string editing problem is applied to find an optimal correspondence of these strings using an appropriate cost function. The ISRI annual test of page-reading systems utilizes the following performance measures, which are defined in terms of this correspondence and the string edit distance: character accuracy, throughput, accuracy by character class, marked character efficiency, word accuracy, non-stopword accuracy, and phrase accuracy. It is shown that the universe of cost functions is divided into equivalence classes, and the cost functions related to the longest common subsequence (LCS) are identified. The computation of an LCS can be made faster by a linear-time preprocessing step.
Sequence Comparison: Some Theory and Some Practice
1988
Cited by 7 (0 self)
A brief survey of the theory and practice of sequence comparison is made focusing on diff, the UNIX file difference utility. 1 Sequence comparison. Sequence comparison is a deep and fascinating subject in Computer Science, both theoretical and practical. However, in our opinion, neither the theoretical nor the practical aspects of the problem are well understood and we feel that their mastery is a true challenge for Computer Science. The central problem can be stated very easily: find an algorithm, as efficient and practical as possible, to compute a longest common subsequence (lcs for short) of two given sequences. As usual, a subsequence of a sequence is another sequence obtained from it by deleting some (not necessarily contiguous) terms. Thus, both en/pri and en/pai are longest common subsequences of sequence/comparison and theory/and/practice. Part of this work was done while the author was visiting the Université de Rouen, in 1987. That visit was partially supported...
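The central problem stated in this abstract, computing an actual lcs rather than only its length, can be sketched with the textbook quadratic dynamic program followed by a traceback. This is not the algorithm used by diff, which relies on faster methods in practice; the function name `lcs` is illustrative.

```python
def lcs(a, b) -> list:
    """Return one longest common subsequence of the sequences a and b
    (strings, or lists of lines as in a file comparison)."""
    m, n = len(a), len(b)
    # table[i][j] = length of an lcs of the suffixes a[i:] and b[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk forward from the start, keeping every matched term.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # dropping a[i] loses nothing
        else:
            j += 1  # dropping b[j] loses nothing
    return out
```

Calling `"".join(lcs("sequence/comparison", "theory/and/practice"))` returns one lcs of the two strings from the abstract. Since several distinct subsequences can share the maximum length, as en/pri and en/pai do, the particular one returned depends on how ties are broken in the traceback.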