Results 1 - 10
of
146
A Guided Tour to Approximate String Matching
- ACM Computing Surveys
, 1999
"... We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining t ..."
Abstract
-
Cited by 306 (38 self)
- Add to MetaCart
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems. 1
A PATTERN MATCHING MODEL FOR MISUSE INTRUSION DETECTION
"... This paper describes a generic model of matching that can be usefully applied to misuse intrusion detection. The model is based on Colored Petri Nets. Guards define the context in which signatures are matched. The notion of start and final states, and paths between them define the set of event seque ..."
Abstract
-
Cited by 123 (4 self)
- Add to MetaCart
This paper describes a generic model of matching that can be usefully applied to misuse intrusion detection. The model is based on Colored Petri Nets. Guards define the context in which signatures are matched. The notion of start and final states, and paths between them define the set of event sequences matched by the net. Partial order matching can also be specified in this model. The main benefits of the model are its generality, portability and flexibility.
A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming
- Journal of the ACM
, 1999
"... The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Simple and practical bitvector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms co ..."
Abstract
-
Cited by 121 (1 self)
- Add to MetaCart
The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Simple and practical bitvector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in O(nmk=w) time where w is the word size of the machine (e.g. 32 or 64 in practice). Here we present an algorithm of comparable simplicity that requires only O(nm=w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many useful choices of k and small m. Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the ...
Fast and Flexible Word Searching on Compressed Text
, 2000
"... ... text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, ..."
Abstract
-
Cited by 74 (30 self)
- Add to MetaCart
... text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for displaying purposes.
Faster Approximate String Matching
- Algorithmica
, 1999
"... We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a non-deterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = \Omega\Gamma137 n) bits, ..."
Abstract
-
Cited by 69 (23 self)
- Add to MetaCart
We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a non-deterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = \Omega\Gamma137 n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e. whenever mk = O(log n)), where m is the pattern length and k ! m the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m = O(log n). Longer patterns can be processed by partitioning the automaton into many machine words, at O(mk=w n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps and others, at essentially the same search cost. We then explore other novel techniques t...
Spelling Approximate Repeated Or Common Motifs Using a Suffix Tree
, 1998
"... . We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet \Sigma . For instance, \Sigma may be equal to fA, C, G, Tg and the sequence represents an encoding of a DNA macromolecule. The motifs searched correspond to words over the s ..."
Abstract
-
Cited by 57 (6 self)
- Add to MetaCart
. We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet \Sigma . For instance, \Sigma may be equal to fA, C, G, Tg and the sequence represents an encoding of a DNA macromolecule. The motifs searched correspond to words over the same alphabet which occur a minimum number q of times in the sequence with at most e mismatches each time (q is called the quorum constraint). The second algorithm extracts common motifs from a set of N 2 sequences. In this case, the motifs must occur, again with at most e mismatches, in 1 q N distinct sequences of the set. In both cases, the words representing the motifs may never be present exactly in the sequences. We therefore speak of the motifs, repeated in a sequence or common to a set of them, as being "external" objects and denote them by the expression "valid models" if they verify the quorum constraint q. The approach we introduce here for finding all valid models corr...
Fast and Flexible String Matching by Combining Bit-parallelism and Suffix Automata
- ACM JOURNAL OF EXPERIMENTAL ALGORITHMICS (JEA
, 1998
"... ... In this paper we merge bit-parallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bit-parallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as Shift-Or. It inher ..."
Abstract
-
Cited by 51 (11 self)
- Add to MetaCart
... In this paper we merge bit-parallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bit-parallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as Shift-Or. It inherits from Shift-Or the ability to handle flexible patterns and from BDM the ability to skip characters. BNDM is 30%-40% faster than BDM and up to 7 times faster than Shift-Or. When compared to the fastest existing algorithms on exact patterns (which belong to the BM family), BNDM is from 20% slower to 3 times faster, depending on the alphabet size. With respect to flexible pattern searching, BNDM is by far the fastest technique to deal with classes of characters and is competitive to search allowing errors. In particular, BNDM seems very adequate for computational biology applications, since it is the fastest algorithm to search on DNA sequences and flexible searching is an important problem in that
Fast and Practical Approximate String Matching
- In Combinatorial Pattern Matching, Third Annual Symposium
, 1992
"... We present new algorithms for approximate string matching based in simple, but efficient, ideas. First, we present an algorithm for string matching with mismatches based in arithmetical operations that runs in linear worst case time for most practical cases. This is a new approach to string searchin ..."
Abstract
-
Cited by 47 (0 self)
- Add to MetaCart
We present new algorithms for approximate string matching based in simple, but efficient, ideas. First, we present an algorithm for string matching with mismatches based in arithmetical operations that runs in linear worst case time for most practical cases. This is a new approach to string searching. Second, we present an algorithm for string matching with errors based on partitioning the pattern that requires linear expected time for typical inputs. 1 Introduction Approximate string matching is one of the main problems in combinatorial pattern matching. Recently, several new approaches emphasizing the expected search time and practicality have appeared [3, 4, 27, 32, 31, 17], in contrast to older results, most of them are only of theoretical interest. Here, we continue this trend, by presenting two new simple and efficient algorithms for approximate string matching. First, we present an algorithm for string matching with k mismatches. This problem consists of finding all instances o...
Adding Compression to Block Addressing Inverted Indexes
, 2000
"... . Inverted index compression, block addressing and sequential search on compressed text are three techniques that have been separately developed for efficient, low-overhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it dire ..."
Abstract
-
Cited by 47 (26 self)
- Add to MetaCart
. Inverted index compression, block addressing and sequential search on compressed text are three techniques that have been separately developed for efficient, low-overhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it directly and faster than the uncompressed text. Inverted index compression obtains significant reduction of their original size at the same processing speed. Block addressing makes the inverted lists point to text blocks instead of exact positions and pay the reduction in space with some sequential text scanning. In this work we combine the three ideas in a single scheme. We present a compressed inverted file that indexes compressed text and uses block addressing. We consider different techniques to compress the index and study their performance with respect to the block size. We compare the index against three separate techniques for varying block sizes, showing that our index is superior to each isolated approach. For instance, with just 4% of extra space overhead the index has to scan less than 12% of the text for exact searches and about 20% allowing one error in the matches. Keywords: Text compression, inverted files, block addressing, text databases. 1.
Text Retrieval: Theory and Practice
- In 12th IFIP World Computer Congress, volume I
, 1992
"... We present the state of the art of the main component of text retrieval systems: the searching engine. We outline the main lines of research and issues involved. We survey recently published results for text searching and we explore the gap between theoretical vs. practical algorithms. The main obse ..."
Abstract
-
Cited by 43 (14 self)
- Add to MetaCart
We present the state of the art of the main component of text retrieval systems: the searching engine. We outline the main lines of research and issues involved. We survey recently published results for text searching and we explore the gap between theoretical vs. practical algorithms. The main observation is that simpler ideas are better in practice. 1597 Shaks. Lover's Compl. 2 From off a hill whose concaue wombe reworded A plaintfull story from a sistring vale. OED2, reword, sistering 1 1 Introduction Full text retrieval systems are becoming a popular way of providing support for on-line text. Their main advantage is that they avoid the complicated and expensive process of semantic indexing. From the end-user point of view, full text searching of on-line documents is appealing because a valid query is just any word or sentence of the document. However, when the desired answer cannot be obtained with a simple query, the user must perform his/her own semantic processing to guess w...

