Results 1 - 10 of 35
From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction
Algorithmica, 1997
Cited by 61 (6 self)
We review the linear time suffix tree constructions by Weiner, McCreight, and Ukkonen. We use the terminology of the most recent algorithm, Ukkonen's online construction, to explain its historic predecessors. This reveals relationships much closer than one would expect, since the three algorithms are based on rather different intuitive ideas. Moreover, it completely explains the differences between these algorithms in terms of simplicity, efficiency, and implementation complexity.
Parameterized Pattern Matching: Algorithms and Applications
1994
Cited by 57 (5 self)
The problem of finding sections of code that either are identical or are related by the systematic renaming of variables or constants can be modeled in terms of parameterized strings (p-strings) and parameterized matches (p-matches) [Baker93a]. P-strings are strings over two alphabets, one of which represents parameters. Two p-strings are a parameterized match (p-match) if one p-string is obtained by renaming the parameters of the other by a one-to-one function. In this paper, we investigate parameterized pattern matching via parameterized suffix trees (p-suffix trees), defined in [Baker93a]. We give two algorithms for constructing p-suffix trees: one (eager) that runs in linear time for fixed alphabets, and another that uses auxiliary data structures and runs in O(n log n) time for variable alphabets, where n is input length. We show that using a p-suffix tree for a pattern p-string P, it is possible to search for all p-matches of P within a text p-string T in space linear in |P|...
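The p-match definition in this abstract can be illustrated with a direct check. The following is only a minimal sketch of the definition, not the p-suffix tree algorithms of the paper; the function name `is_p_match` and the choice to represent each symbol as a single character are assumptions made for illustration.

```python
def is_p_match(s, t, params):
    """Return True if p-strings s and t are a parameterized match.

    `params` is the set of parameter symbols (hypothetical representation:
    one character per symbol). All other symbols are constants and must
    agree exactly; parameters must be related by a one-to-one renaming.
    """
    if len(s) != len(t):
        return False
    fwd, bwd = {}, {}  # the renaming and its inverse, to enforce one-to-one
    for a, b in zip(s, t):
        a_is_param, b_is_param = a in params, b in params
        if a_is_param != b_is_param:
            return False  # a parameter can never match a constant
        if not a_is_param:
            if a != b:
                return False  # constants must be identical
            continue
        # Record a -> b on first sight; reject any conflicting later pair.
        if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
            return False
    return True


if __name__ == "__main__":
    print(is_p_match("axbyax", "aybxay", {"x", "y"}))
    print(is_p_match("axby", "axbx", {"x", "y"}))
```

The second call fails because both `x` and `y` would have to be renamed to `x`, which is not one-to-one.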
Approximate string matching over suffix trees
Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching, number 684 in Lecture Notes in Computer Science, 1993
Cited by 53 (1 self)
The classical approximate string-matching problem of finding the locations of approximate occurrences P′ of pattern string P in text string T such that the edit distance between P and P′ is at most k is considered. We concentrate on the special case in which T is available for preprocessing before the searches with varying P and k. It is shown how the searches can be done fast using the suffix tree of T augmented with the suffix links as the preprocessed form of T and applying dynamic programming over the tree. Three variations of the search algorithm are developed with running times O(mq + n), O(mq log q + size of the output), and O(m
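For reference, the problem stated in this abstract can be solved without any preprocessing by the classical column-wise dynamic program over the text. The sketch below shows that baseline only; it is not one of the suffix-tree variants developed in the paper, and the function name `approx_occurrences` is hypothetical.

```python
def approx_occurrences(pattern, text, k):
    """Return the 1-based end positions j in `text` at which some substring
    ending at j is within edit distance k of `pattern`.

    Baseline O(len(pattern) * len(text)) dynamic programming, one column
    per text character.
    """
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and any substring of
    # the text ending at the current position. Before any text is read,
    # matching pattern[:i] against the empty string costs i deletions.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # D[i-1][j-1] for i = 1
        # col[0] stays 0 because an occurrence may start anywhere in the text.
        for i in range(1, m + 1):
            cur = min(
                col[i] + 1,                         # extra text character c
                col[i - 1] + 1,                     # pattern character unmatched
                prev_diag + (pattern[i - 1] != c),  # match or substitution
            )
            prev_diag = col[i]  # old value becomes the diagonal for i + 1
            col[i] = cur
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approx_occurrences("abc", "xxabcxx", 0))
    print(approx_occurrences("abc", "xxabcxx", 1))
```

Only end positions are reported; recovering the start of each occurrence would require keeping more than one column.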
Fast and Practical Approximate String Matching
In Combinatorial Pattern Matching, Third Annual Symposium, 1992
Cited by 47 (0 self)
We present new algorithms for approximate string matching based on simple, but efficient, ideas. First, we present an algorithm for string matching with mismatches based on arithmetical operations that runs in linear worst-case time for most practical cases. This is a new approach to string searching. Second, we present an algorithm for string matching with errors based on partitioning the pattern that requires linear expected time for typical inputs.
1 Introduction
Approximate string matching is one of the main problems in combinatorial pattern matching. Recently, several new approaches emphasizing the expected search time and practicality have appeared [3, 4, 27, 32, 31, 17], in contrast to older results, most of which are only of theoretical interest. Here, we continue this trend by presenting two new simple and efficient algorithms for approximate string matching. First, we present an algorithm for string matching with k mismatches. This problem consists of finding all instances o...
A Generalized Suffix Tree and Its (Un)Expected Asymptotic Behaviors
SIAM J. Computing, 1996
Cited by 45 (27 self)
Suffix trees find several applications in computer science and telecommunications, most notably in algorithms on strings, data compression, and codes. Despite this, very little is known about their typical behaviors. In a probabilistic framework, we consider a family of suffix trees -- further called b-suffix trees -- built from the first n suffixes of a random word. In this family a noncompact suffix tree (i.e., such that every edge is labeled by a single symbol) is represented by b = 1, and a compact suffix tree (i.e., without unary nodes) is asymptotically equivalent to b → ∞ as n → ∞. We study several parameters of b-suffix trees, namely: the depth of a given suffix, the depth of insertion, the height and the shortest feasible path. Some new results concerning typical (i.e., almost sure) behaviors of these parameters are established. These findings are used to obtain several insights into certain algorithms on words, molecular biology and universal data compression schemes. Key Wo...
Block Edit Models for Approximate String Matching
Theoretical Computer Science, 1997
Cited by 44 (4 self)
In this paper we examine string block edit distance, in which two strings A and B are compared by extracting collections of substrings and placing them into correspondence. This model accounts for certain phenomena encountered in important real-world applications, including pen computing and molecular biology. The basic problem admits a family of variations depending on whether the strings must be matched in their entireties, and whether overlap is permitted. We show that several variants are NP-complete, and give polynomial-time algorithms for solving the remainder.
Keywords: block edit distance, approximate string matching, sequence comparison, approximate ink matching, dynamic programming.
1 Introduction
The edit distance model for string comparison [Lev66, NW70, WF74] has found widespread application in fields ranging from molecular biology to bird song classification [SK83]. A great deal of research has been devoted to this area, and numerous algorithms have been proposed for com...
Text Retrieval: Theory and Practice
In 12th IFIP World Computer Congress, volume I, 1992
Cited by 43 (14 self)
We present the state of the art of the main component of text retrieval systems: the searching engine. We outline the main lines of research and issues involved. We survey recently published results for text searching and we explore the gap between theoretical and practical algorithms. The main observation is that simpler ideas are better in practice.
1597 Shaks. Lover's Compl. 2 From off a hill whose concaue wombe reworded A plaintfull story from a sistring vale. OED2, reword, sistering 1
1 Introduction
Full text retrieval systems are becoming a popular way of providing support for on-line text. Their main advantage is that they avoid the complicated and expensive process of semantic indexing. From the end-user point of view, full text searching of on-line documents is appealing because a valid query is just any word or sentence of the document. However, when the desired answer cannot be obtained with a simple query, the user must perform his/her own semantic processing to guess w...
Approximate Nearest Neighbors and Sequence Comparison With Block Operations
In STOC, 2000
Cited by 36 (6 self)
We study sequence nearest neighbors (SNN). Let D be a database of n sequences; we would like to preprocess D so that given any on-line query sequence Q we can quickly find a sequence S in D for which d(S, Q) ≤ d(S, T) for any other sequence T in D. Here d(S, Q) denotes the distance between sequences S and Q, defined to be the minimum number of edit operations needed to transform one to another (all edit operations will be reversible so that d(S, T) = d(T, S) for any two sequences T and S). These operations correspond to the notion of similarity between sequences that we wish to capture in a given application. Natural edit operations include character edits (inserts, replacements, deletes, etc.), block edits (moves, copies, deletes, reversals) and block numerical transformations (scaling by an additive or a multiplicative constant). The SNN problem arises in many applications. We present the first known efficient algorithm for "approximate" nearest neighbor search for sequences with p...
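To make the SNN setting concrete, the sketch below scans the whole database and returns a sequence at minimum distance from the query, using character edits only (inserts, replacements, deletes). It is an exact brute-force baseline written for illustration, not the approximate algorithm with block operations that the abstract announces; the names `edit_distance` and `nearest_neighbor` are hypothetical.

```python
def edit_distance(s, t):
    """Minimum number of character inserts, deletes, and replacements
    needed to turn s into t (symmetric, since each edit is reversible)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        cur = [i]  # turning s[:i] into the empty string takes i deletes
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a
                cur[j - 1] + 1,          # insert b
                prev[j - 1] + (a != b),  # keep or replace
            ))
        prev = cur
    return prev[-1]


def nearest_neighbor(database, query):
    """Return a sequence in `database` closest to `query`.

    No preprocessing is done: every query costs one distance
    computation per database sequence.
    """
    return min(database, key=lambda s: edit_distance(s, query))


if __name__ == "__main__":
    db = ["kitten", "mitten", "sitting"]
    print(nearest_neighbor(db, "fitting"))
```

The linear scan is what preprocessing is meant to avoid; it is useful mainly as a correctness reference for faster methods.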
Asymptotic Properties Of Data Compression And Suffix Trees
IEEE Trans. Inform. Theory, 1993
Cited by 36 (10 self)
Recently, Wyner and Ziv have proved that the typical length of a repeated subword found within the first n positions of a stationary ergodic sequence is (1/h) log n in probability where h is the entropy of the alphabet. This finding was used to obtain several insights into certain universal data compression schemes, most notably the Lempel-Ziv data compression algorithm. Wyner and Ziv have also conjectured that their result can be extended to a stronger almost sure convergence. In this paper, we settle this conjecture in the negative in the so-called right domain asymptotic, that is, during a dynamic phase of expanding the data base. We prove -- under an additional assumption involving mixing conditions -- that the length of a typical repeated subword oscillates almost surely (a.s.) between (1/h_1) log n and (1/h_2) log n where 0 < h_2 < h ≤ h_1 < ∞. We also show that the length of the nth block in the Lempel-Ziv parsing algorithm reveals a similar behavior. We relate our findings to...
A sublinear algorithm for weakly approximating edit distance
In Proc. STOC 2003, 2003
Cited by 36 (4 self)
We show how to determine whether the edit distance between two given strings is small in sublinear time. Specifically, we present a test which, given two n-character strings A and B, runs in time o(n) and with high probability returns “CLOSE” if their edit distance is O(n^α), and “FAR” if their edit distance is Ω(n), where α is a fixed parameter less than 1. Our algorithm for testing the edit distance works by recursively subdividing the strings A and B into smaller substrings and looking for pairs of substrings in A, B with small edit distance. To do this, we query both strings at random places using a special technique for economizing on the samples which does not pick the samples independently and provides better query and overall complexity. As a result, our test runs in time Õ(n^{max{α/2, 2α−1}}) for any fixed α < 1. Our algorithm thus provides a trade-off between accuracy and efficiency that is particularly useful when the input data is very large. We also show a lower bound of Ω(n^{α/2}) on the query complexity of every algorithm that distinguishes pairs of strings with edit distance at most n^α from those with edit distance at least n/6.

