Results 1-10 of 41
Suffix arrays: A new method for online string searches
SIAM J. Comput., 1993
Cited by 646 (1 self)
Abstract. A new and conceptually simple data structure, called a suffix array, for online string searches is introduced in this paper. Constructing and querying suffix arrays is reduced to a sort and search paradigm that employs novel algorithms. The main advantage of suffix arrays over suffix trees is that, in practice, they use three to five times less space. From a complexity standpoint, suffix arrays permit online string searches of the type, "Is W a substring of A?" to be answered in time O(P + log N), where P is the length of W and N is the length of A, which is competitive with (and in some cases slightly better than) suffix trees. The only drawback is that in those instances where the underlying alphabet is finite and small, suffix trees can be constructed in O(N) time in the worst case, versus O(N log N) time for suffix arrays. However, an augmented algorithm is given that, regardless of the alphabet size, constructs suffix arrays in O(N) expected time, albeit with lesser space efficiency. It is believed that suffix arrays will prove to be better in practice than suffix trees for many applications.
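The "sort and search" idea in this abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the construction below simply sorts suffixes (far slower than O(N log N) in the worst case), and the query is a plain binary search costing O(P log N) rather than the O(P + log N) the paper achieves. The function names are hypothetical.

```python
def build_suffix_array(text):
    # Naive construction (an assumption for illustration): sort the start
    # positions of all suffixes by the suffix each one begins.
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text, sa, pattern):
    # Binary search for the first suffix whose leading len(pattern)
    # characters compare >= pattern; those prefixes are in sorted order
    # because the suffixes themselves are.
    p = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # The pattern occurs exactly when that suffix starts with it.
    return lo < len(sa) and text[sa[lo]:sa[lo] + p] == pattern


text = "banana"
sa = build_suffix_array(text)
print(sa, contains(text, sa, "nan"), contains(text, sa, "nab"))
```

The array stores only one integer per text position, which is where the space advantage over a suffix tree comes from.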
Parameterized Pattern Matching: Algorithms and Applications
1994
Cited by 71 (5 self)
The problem of finding sections of code that either are identical or are related by the systematic renaming of variables or constants can be modeled in terms of parameterized strings (p-strings) and parameterized matches (p-matches) [Baker93a]. P-strings are strings over two alphabets, one of which represents parameters. Two p-strings are a parameterized match (p-match) if one p-string is obtained by renaming the parameters of the other by a one-to-one function. In this paper, we investigate parameterized pattern matching via parameterized suffix trees (p-suffix trees), defined in [Baker93a]. We give two algorithms for constructing p-suffix trees: one (eager) that runs in linear time for fixed alphabets, and another that uses auxiliary data structures and runs in O(n log n) time for variable alphabets, where n is input length. We show that using a p-suffix tree for a pattern p-string P, it is possible to search for all p-matches of P within a text p-string T in space linear in |P| ...
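The definition of a p-match given in this abstract can be checked directly for two equal-length strings. The sketch below is only that check, not the p-suffix tree construction; it assumes each symbol is a single character and that the parameter alphabet is passed in as a set. The name is_pmatch is hypothetical.

```python
def is_pmatch(s, t, params):
    # Two p-strings match if non-parameter symbols agree position by
    # position and parameters are related by a one-to-one renaming.
    if len(s) != len(t):
        return False
    fwd, bwd = {}, {}  # renaming s -> t and its inverse t -> s
    for a, b in zip(s, t):
        a_is_param = a in params
        if a_is_param != (b in params):
            return False  # a parameter facing a fixed symbol
        if not a_is_param:
            if a != b:
                return False  # fixed symbols must be identical
            continue
        # Both are parameters: the mapping must be consistent both ways.
        if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
            return False
    return True


params = set("xyab")
print(is_pmatch("x=y+x", "a=b+a", params))  # consistent renaming x->a, y->b
print(is_pmatch("x=y+x", "a=a+a", params))  # x and y both map to a
```

Keeping the inverse map as well as the forward map is what enforces that the renaming is one-to-one rather than merely a function.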
From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction
Algorithmica, 1997
Cited by 66 (6 self)
We review the linear time suffix tree constructions by Weiner, McCreight, and Ukkonen. We use the terminology of the most recent algorithm, Ukkonen's online construction, to explain its historic predecessors. This reveals relationships much closer than one would expect, since the three algorithms are based on rather different intuitive ideas. Moreover, it completely explains the differences between these algorithms in terms of simplicity, efficiency, and implementation complexity.
Approximate string matching over suffix trees
Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching, number 684 in Lecture Notes in Computer Science, 1993
Cited by 55 (1 self)
The classical approximate string-matching problem of finding the locations of approximate occurrences P' of pattern string P in text string T such that the edit distance between P and P' is at most k is considered. We concentrate on the special case in which T is available for preprocessing before the searches with varying P and k. It is shown how the searches can be done fast using the suffix tree of T augmented with the suffix links as the preprocessed form of T and applying dynamic programming over the tree. Three variations of the search algorithm are developed with running times O(mq + n), O(mq log q + size of the output), and O(m
Fast and Practical Approximate String Matching
In Combinatorial Pattern Matching, Third Annual Symposium, 1992
Cited by 54 (0 self)
We present new algorithms for approximate string matching based on simple, but efficient, ideas. First, we present an algorithm for string matching with mismatches based on arithmetical operations that runs in linear worst case time for most practical cases. This is a new approach to string searching. Second, we present an algorithm for string matching with errors based on partitioning the pattern that requires linear expected time for typical inputs. 1 Introduction Approximate string matching is one of the main problems in combinatorial pattern matching. Recently, several new approaches emphasizing the expected search time and practicality have appeared [3, 4, 27, 32, 31, 17], in contrast to older results, most of which are only of theoretical interest. Here, we continue this trend, by presenting two new simple and efficient algorithms for approximate string matching. First, we present an algorithm for string matching with k mismatches. This problem consists of finding all instances o...
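One common way to read "partitioning the pattern" is the pigeonhole filter: if the pattern is cut into k + 1 pieces, any occurrence with at most k errors must contain at least one piece unchanged. The Python sketch below assumes that reading and is not taken from the paper; it finds exact piece occurrences with str.find and verifies each candidate window with a standard dynamic-programming scan. Both function names are hypothetical.

```python
def sellers_ends(text, pattern, k):
    # Column-wise dynamic programming: report every end position j
    # (exclusive) such that some substring text[i:j] is within edit
    # distance k of pattern.
    m = len(pattern)
    col = list(range(m + 1))  # distances against the empty text prefix
    ends = []
    for j, c in enumerate(text, 1):
        prev_diag, col[0] = col[0], 0  # a match may start anywhere
        for i in range(1, m + 1):
            cur = min(col[i] + 1,          # skip text character c
                      col[i - 1] + 1,      # skip pattern[i - 1]
                      prev_diag + (pattern[i - 1] != c))  # match/substitute
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j)
    return ends


def partition_search(text, pattern, k):
    m = len(pattern)
    if m <= k:
        raise ValueError("pattern must be longer than k")
    # Cut the pattern into k + 1 non-empty pieces.
    bounds = [m * p // (k + 1) for p in range(k + 2)]
    ends = set()
    for lo, hi in zip(bounds, bounds[1:]):
        piece = pattern[lo:hi]
        i = text.find(piece)
        while i != -1:
            # An occurrence containing this piece lies inside this window.
            ws = max(0, i - lo - k)
            we = min(len(text), i - lo + m + k)
            ends.update(ws + j for j in sellers_ends(text[ws:we], pattern, k))
            i = text.find(piece, i + 1)
    return sorted(ends)


print(partition_search("the quick brown fox", "quikc", 2))
```

The filter only pays for the quadratic verification near exact piece hits, which is why approaches of this kind tend to be fast when k is small relative to the pattern length.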
A Generalized Suffix Tree and Its (Un)Expected Asymptotic Behaviors
SIAM J. Computing, 1996
Cited by 52 (29 self)
Suffix trees find several applications in computer science and telecommunications, most notably in algorithms on strings, data compressions and codes. Despite this, very little is known about their typical behaviors. In a probabilistic framework, we consider a family of suffix trees, further called b-suffix trees, built from the first n suffixes of a random word. In this family a noncompact suffix tree (i.e., such that every edge is labeled by a single symbol) is represented by b = 1, and a compact suffix tree (i.e., without unary nodes) is asymptotically equivalent to b → ∞ as n → ∞. We study several parameters of b-suffix trees, namely: the depth of a given suffix, the depth of insertion, the height and the shortest feasible path. Some new results concerning typical (i.e., almost sure) behaviors of these parameters are established. These findings are used to obtain several insights into certain algorithms on words, molecular biology and universal data compression schemes. Key Wo...
Block Edit Models for Approximate String Matching
Theoretical Computer Science, 1997
Cited by 48 (4 self)
In this paper we examine string block edit distance, in which two strings A and B are compared by extracting collections of substrings and placing them into correspondence. This model accounts for certain phenomena encountered in important real-world applications, including pen computing and molecular biology. The basic problem admits a family of variations depending on whether the strings must be matched in their entireties, and whether overlap is permitted. We show that several variants are NP-complete, and give polynomial-time algorithms for solving the remainder. Keywords: block edit distance, approximate string matching, sequence comparison, approximate ink matching, dynamic programming. 1 Introduction The edit distance model for string comparison [Lev66, NW70, WF74] has found widespread application in fields ranging from molecular biology to bird song classification [SK83]. A great deal of research has been devoted to this area, and numerous algorithms have been proposed for com...
Text Retrieval: Theory and Practice
In 12th IFIP World Computer Congress, volume I, 1992
Cited by 46 (14 self)
We present the state of the art of the main component of text retrieval systems: the searching engine. We outline the main lines of research and issues involved. We survey recently published results for text searching and we explore the gap between theoretical vs. practical algorithms. The main observation is that simpler ideas are better in practice. 1597 Shaks. Lover's Compl. 2 From off a hill whose concaue wombe reworded A plaintfull story from a sistring vale. OED2, reword, sistering 1 1 Introduction Full text retrieval systems are becoming a popular way of providing support for online text. Their main advantage is that they avoid the complicated and expensive process of semantic indexing. From the end-user point of view, full text searching of online documents is appealing because a valid query is just any word or sentence of the document. However, when the desired answer cannot be obtained with a simple query, the user must perform his/her own semantic processing to guess w...
Combinatorial algorithms for DNA sequence assembly
Algorithmica, 1993
Cited by 42 (3 self)
The trend towards very large DNA sequencing projects, such as those being undertaken as part of the human genome initiative, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover it uses a limited form ...
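To make the underlying shortest common superstring problem concrete, here is a minimal Python sketch of the textbook greedy merge heuristic. It is explicitly not the four-phase method of this paper: it assumes error-free fragments, ignores reverse complements, and carries no optimality guarantee. The function names are hypothetical.

```python
def overlap(a, b):
    # Length of the longest suffix of a that is also a prefix of b.
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(fragments):
    # Drop duplicates (keeping order) and fragments contained in another.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        # Pick the ordered pair with the largest overlap and merge it.
        n, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(frags)
                      for j, b in enumerate(frags) if i != j)
        merged = frags[i] + frags[j][n:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags[0] if frags else ""


print(greedy_superstring(["ACGT", "GTCA", "CATT"]))
```

Every input fragment is a substring of the result, but the result is only a heuristic answer; sequencing errors would require approximate rather than exact overlaps.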
Asymptotic Properties Of Data Compression And Suffix Trees
IEEE Trans. Inform. Theory, 1993
Cited by 40 (11 self)
Recently, Wyner and Ziv have proved that the typical length of a repeated subword found within the first n positions of a stationary ergodic sequence is (1/h) log n in probability where h is the entropy of the alphabet. This finding was used to obtain several insights into certain universal data compression schemes, most notably the Lempel-Ziv data compression algorithm. Wyner and Ziv have also conjectured that their result can be extended to a stronger almost sure convergence. In this paper, we settle this conjecture in the negative in the so-called right domain asymptotic, that is, during a dynamic phase of expanding the data base. We prove, under an additional assumption involving mixing conditions, that the length of a typical repeated subword oscillates almost surely (a.s.) between (1/h_1) log n and (1/h_2) log n where 0 < h_2 < h ≤ h_1 < ∞. We also show that the length of the n-th block in the Lempel-Ziv parsing algorithm reveals a similar behavior. We relate our findings to...