Results 1  10
of
165
A Guided Tour to Approximate String Matching
 ACM Computing Surveys
, 1999
"... We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining t ..."
Abstract

Cited by 413 (38 self)
 Add to MetaCart
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems. 1
A PATTERN MATCHING MODEL FOR MISUSE INTRUSION DETECTION
"... This paper describes a generic model of matching that can be usefully applied to misuse intrusion detection. The model is based on Colored Petri Nets. Guards define the context in which signatures are matched. The notion of start and final states, and paths between them define the set of event seque ..."
Abstract

Cited by 150 (4 self)
 Add to MetaCart
This paper describes a generic model of matching that can be usefully applied to misuse intrusion detection. The model is based on Colored Petri Nets. Guards define the context in which signatures are matched. The notion of start and final states, and paths between them define the set of event sequences matched by the net. Partial order matching can also be specified in this model. The main benefits of the model are its generality, portability and flexibility.
A fast bitvector algorithm for approximate string matching based on dynamic programming
 J. ACM
, 1999
"... Abstract. The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with korfewer differences. Simple and practical bitvector algorithms have been designed for this problem, most notably the one used in agrep. These alg ..."
Abstract

Cited by 139 (1 self)
 Add to MetaCart
Abstract. The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with korfewer differences. Simple and practical bitvector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current stateset of the kdifference automaton for the query, and asymptotically run in either O(nmk/w) orO(nm log �/w) time where w is the word size of the machine (e.g., 32 or 64 in practice), and � is the size of the pattern alphabet. Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus, the algorithm’s performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m. Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4Russians algorithm of Wu et al. [1996]. This gives rise to an O(kn/w) expectedtime algorithm for the case where m may be arbitrarily large. In practice this new algorithm, that computes a region of the dynamic programming (d.p.) matrix w entries at a time using the basic algorithm as a subroutine, is significantly faster than our previous 4Russians algorithm, that computes the same region 4 or 5 entries at a time using table lookup. This performance improvement yields a code that is either superior or competitive with all existing algorithms except for some filtration algorithms that are superior when k/m is sufficiently small.
Fast and Flexible Word Searching on Compressed Text
, 2000
"... ... text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, ..."
Abstract

Cited by 81 (33 self)
 Add to MetaCart
... text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for displaying purposes.
Faster Approximate String Matching
 Algorithmica
, 1999
"... We present a new algorithm for online approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = \Omega\Gamma137 n) bits, ..."
Abstract

Cited by 72 (24 self)
 Add to MetaCart
We present a new algorithm for online approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = \Omega\Gamma137 n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e. whenever mk = O(log n)), where m is the pattern length and k ! m the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m = O(log n). Longer patterns can be processed by partitioning the automaton into many machine words, at O(mk=w n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps and others, at essentially the same search cost. We then explore other novel techniques t...
Spelling Approximate Repeated Or Common Motifs Using a Suffix Tree
, 1998
"... . We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet \Sigma . For instance, \Sigma may be equal to fA, C, G, Tg and the sequence represents an encoding of a DNA macromolecule. The motifs searched correspond to words over the s ..."
Abstract

Cited by 70 (7 self)
 Add to MetaCart
. We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet \Sigma . For instance, \Sigma may be equal to fA, C, G, Tg and the sequence represents an encoding of a DNA macromolecule. The motifs searched correspond to words over the same alphabet which occur a minimum number q of times in the sequence with at most e mismatches each time (q is called the quorum constraint). The second algorithm extracts common motifs from a set of N 2 sequences. In this case, the motifs must occur, again with at most e mismatches, in 1 q N distinct sequences of the set. In both cases, the words representing the motifs may never be present exactly in the sequences. We therefore speak of the motifs, repeated in a sequence or common to a set of them, as being "external" objects and denote them by the expression "valid models" if they verify the quorum constraint q. The approach we introduce here for finding all valid models corr...
Fast and Flexible String Matching by Combining Bitparallelism and Suffix Automata
 ACM JOURNAL OF EXPERIMENTAL ALGORITHMICS (JEA
, 1998
"... ... In this paper we merge bitparallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bitparallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as ShiftOr. It inher ..."
Abstract

Cited by 60 (11 self)
 Add to MetaCart
... In this paper we merge bitparallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bitparallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as ShiftOr. It inherits from ShiftOr the ability to handle flexible patterns and from BDM the ability to skip characters. BNDM is 30%40% faster than BDM and up to 7 times faster than ShiftOr. When compared to the fastest existing algorithms on exact patterns (which belong to the BM family), BNDM is from 20% slower to 3 times faster, depending on the alphabet size. With respect to flexible pattern searching, BNDM is by far the fastest technique to deal with classes of characters and is competitive to search allowing errors. In particular, BNDM seems very adequate for computational biology applications, since it is the fastest algorithm to search on DNA sequences and flexible searching is an important problem in that
Fast and Practical Approximate String Matching
 In Combinatorial Pattern Matching, Third Annual Symposium
, 1992
"... We present new algorithms for approximate string matching based in simple, but efficient, ideas. First, we present an algorithm for string matching with mismatches based in arithmetical operations that runs in linear worst case time for most practical cases. This is a new approach to string searchin ..."
Abstract

Cited by 53 (0 self)
 Add to MetaCart
We present new algorithms for approximate string matching based in simple, but efficient, ideas. First, we present an algorithm for string matching with mismatches based in arithmetical operations that runs in linear worst case time for most practical cases. This is a new approach to string searching. Second, we present an algorithm for string matching with errors based on partitioning the pattern that requires linear expected time for typical inputs. 1 Introduction Approximate string matching is one of the main problems in combinatorial pattern matching. Recently, several new approaches emphasizing the expected search time and practicality have appeared [3, 4, 27, 32, 31, 17], in contrast to older results, most of them are only of theoretical interest. Here, we continue this trend, by presenting two new simple and efficient algorithms for approximate string matching. First, we present an algorithm for string matching with k mismatches. This problem consists of finding all instances o...
Adding Compression to Block Addressing Inverted Indexes
, 2000
"... . Inverted index compression, block addressing and sequential search on compressed text are three techniques that have been separately developed for efficient, lowoverhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it dire ..."
Abstract

Cited by 49 (28 self)
 Add to MetaCart
. Inverted index compression, block addressing and sequential search on compressed text are three techniques that have been separately developed for efficient, lowoverhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it directly and faster than the uncompressed text. Inverted index compression obtains significant reduction of their original size at the same processing speed. Block addressing makes the inverted lists point to text blocks instead of exact positions and pay the reduction in space with some sequential text scanning. In this work we combine the three ideas in a single scheme. We present a compressed inverted file that indexes compressed text and uses block addressing. We consider different techniques to compress the index and study their performance with respect to the block size. We compare the index against three separate techniques for varying block sizes, showing that our index is superior to each isolated approach. For instance, with just 4% of extra space overhead the index has to scan less than 12% of the text for exact searches and about 20% allowing one error in the matches. Keywords: Text compression, inverted files, block addressing, text databases. 1.
Text Retrieval: Theory and Practice
 In 12th IFIP World Computer Congress, volume I
, 1992
"... We present the state of the art of the main component of text retrieval systems: the searching engine. We outline the main lines of research and issues involved. We survey recently published results for text searching and we explore the gap between theoretical vs. practical algorithms. The main obse ..."
Abstract

Cited by 46 (14 self)
 Add to MetaCart
We present the state of the art of the main component of text retrieval systems: the searching engine. We outline the main lines of research and issues involved. We survey recently published results for text searching and we explore the gap between theoretical vs. practical algorithms. The main observation is that simpler ideas are better in practice. 1597 Shaks. Lover's Compl. 2 From off a hill whose concaue wombe reworded A plaintfull story from a sistring vale. OED2, reword, sistering 1 1 Introduction Full text retrieval systems are becoming a popular way of providing support for online text. Their main advantage is that they avoid the complicated and expensive process of semantic indexing. From the enduser point of view, full text searching of online documents is appealing because a valid query is just any word or sentence of the document. However, when the desired answer cannot be obtained with a simple query, the user must perform his/her own semantic processing to guess w...