A Guided Tour to Approximate String Matching
- ACM Computing Surveys, 1999
- Cited by 306 (38 self)
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
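To make the problem statement concrete, here is a minimal sketch of online approximate searching under edit distance, using the classical column-by-column dynamic programming formulation in O(mn) time. The function name `approx_search` and its return convention (0-based end positions) are assumptions made for illustration; the survey covers many faster algorithms than this baseline.

```python
def approx_search(pattern: str, text: str, k: int) -> list[int]:
    """Return 0-based end positions j such that some substring of text
    ending at j is within edit distance k of pattern."""
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any
    # substring of the text ending at the current position.
    # col[0] stays 0 because a match may start anywhere in the text.
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text):
        prev_diag = col[0]  # value of row i-1 in the previous column
        for i in range(1, m + 1):
            cur = min(
                prev_diag + (pattern[i - 1] != c),  # match or substitution
                col[i] + 1,                         # extra text character
                col[i - 1] + 1,                     # missing pattern character
            )
            prev_diag = col[i]
            col[i] = cur
        if col[m] <= k:
            hits.append(j)
    return hits


print(approx_search("abc", "xxabdxx", 1))
```

Only one column of the matrix is kept, so the extra space is O(m) regardless of the text length.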
NR-grep: A Fast and Flexible Pattern Matching Tool
- Software Practice and Experience (SPE), 2000
- Cited by 36 (7 self)
We present nrgrep ("nondeterministic reverse grep"), a new pattern matching tool designed for efficient search of complex patterns. Unlike previous tools of the grep family, such as agrep and Gnu grep, nrgrep is based on a single and uniform concept: the bit-parallel simulation of a nondeterministic suffix automaton. As a result, nrgrep can find from simple patterns to regular expressions, exactly or allowing errors in the matches, with an efficiency that degrades smoothly as the complexity of the searched pattern increases. Another concept fully integrated into nrgrep and that contributes to this smoothness is the selection of adequate subpatterns for fast scanning, which is also absent in many current tools. We show that the efficiency of nrgrep is similar to that of the fastest existing string matching tools for the simplest patterns, and by far unpaired for more complex patterns.
Substring selectivity estimation
- In Proceedings of the ACM Symposium on Principles of Database Systems, 1999
- Cited by 29 (4 self)
We study the problem of estimating selectivity of approximate substring queries. Its importance in databases is ever increasing as more and more data are input by users and are integrated with many typographical errors and different spelling conventions. To begin with, we consider edit distance for the similarity between a pair of strings. Based on information stored in an extended N-gram table, we propose two estimation algorithms, MOF and LBS, for the task. The latter extends the former with ideas from set hashing signatures. The experimental results show that MOF is a light-weight algorithm that gives fairly accurate estimations. However, if more space is available, LBS can give better accuracy than MOF and other baseline methods. Next, we extend the proposed solution to other similarity predicates, SQL LIKE operator and Jaccard similarity.
Extending Q-grams to estimate selectivity of string matching with low edit distance
- In VLDB, 2007
- Cited by 14 (1 self)
There are many emerging database applications that require accurate selectivity estimation of approximate string matching queries. Edit distance is one of the most commonly used string similarity measures. In this paper, we study the problem of estimating selectivity of string matching with low edit distance. Our framework is based on extending q-grams with wildcards. Based on the concepts of replacement semilattice, string hierarchy and a combinatorial analysis, we develop the formulas for selectivity estimation and provide the algorithm BasicEQ. We next develop the algorithm OptEQ by enhancing BasicEQ with two novel improvements. Finally we show a comprehensive set of experiments using three benchmarks comparing OptEQ with the state-of-the-art method SEPIA. Our experimental results show that OptEQ delivers more accurate selectivity estimations.
Bit-parallel Witnesses and their Applications to Approximate String Matching
- Algorithmica, 2005
- Cited by 12 (3 self)
We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one, BPM [Myers, J. of the ACM, 1999], searches for a pattern of length m in a text of length n permitting k differences in O(⌈m/w⌉n) time, where w is the width of the computer word. The second one, ABNDM [Navarro and Raffinot, ACM JEA, 2000], extends a sublinear-time exact algorithm to approximate searching. ABNDM relies on another algorithm, BPA [Wu and Manber, Comm. ACM, 1992], which makes use of an O(k⌈m/w⌉n) time algorithm for its internal workings. BPA is slow but flexible enough to support all operations required by ABNDM. We improve previous ABNDM analyses, showing that it is average-optimal in number of inspected characters, although the overall complexity is higher because of the O(k⌈m/w⌉) work done per inspected character. We then show that the faster BPM can be adapted to support all the operations required by ABNDM. This involves extending it to compute edit distance, to search for any pattern suffix, and to detect in advance the impossibility of a later match. The solution to those challenges is based on the concept of a witness, which permits sampling some dynamic programming matrix values so as to bound, deduce, or compute others fast. The resulting algorithm is average-optimal for m ≤ w, assuming the alphabet size is constant. In practice, it performs better than the original ABNDM and is the fastest algorithm for several combinations of m, k and alphabet sizes that are useful, for example, in natural language searching and computational biology. To show that the concept of witnesses can be used in further scenarios, we also improve a recent bit-parallel algorithm based on Myers [Fredriksson, SPIRE 2003]. The use of witn...
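As an illustration of the BPM building block mentioned above, the following is a minimal sketch of Myers' bit-vector search in the commonly used simplified form. It is not the witness-based algorithm of this paper. The function name `myers_search` is an assumption, and Python's unbounded integers stand in for a single machine word, so the sketch behaves as if m ≤ w; a real implementation for m > w would split the vectors into ⌈m/w⌉ blocks.

```python
def myers_search(pattern: str, text: str, k: int) -> list[int]:
    """Return 0-based end positions where pattern matches with at most
    k differences, using O(1) word operations per text character."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1   # keeps every vector within m bits
    top = 1 << (m - 1)    # bit for the last row of the DP matrix
    # peq[c] has bit i set iff pattern[i] == c
    peq: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    pv, mv = mask, 0      # vertical +1 / -1 deltas of the current column
    score = m             # value of the last cell of the current column
    hits = []
    for j, ch in enumerate(text):
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 deltas
        mh = pv & xh                    # horizontal -1 deltas
        if ph & top:
            score += 1
        elif mh & top:
            score -= 1
        # Shift without setting the low bit: row 0 is all zeros when
        # searching, so its horizontal delta is 0.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= k:
            hits.append(j)
    return hits


print(myers_search("abc", "xxabdxx", 1))
```

The work per text character does not depend on k, which is the source of the O(⌈m/w⌉n) bound quoted in the abstract, in contrast with the O(k⌈m/w⌉n) bound of BPA.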
Compact DFA Representation for Fast Regular Expression Search
- 2001
- Cited by 10 (6 self)
We present a new technique to encode a deterministic finite automaton (DFA). Based on the specific properties of Glushkov's nondeterministic finite automaton (NFA) construction algorithm, we are able to encode the DFA using (m + 1)(2^(m+1) + |Σ|) bits, where m is the number of characters (excluding operator symbols) in the regular expression and Σ is the alphabet. This compares favorably against the worst case of (m + 1) 2^(m+1) |Σ| bits needed by a classical DFA representation and m(2^(2m+1) + |Σ|) bits needed by the Wu and Manber approach implemented in Agrep. Our approach is practical and simple to implement, and it permits searching regular expressions of moderate size (which include most cases of interest) faster than with any previously existing algorithm, as we show experimentally.
Alternative algorithms for bit-parallel string matching
- In Proc. SPIRE’03, Lecture Notes in Computer Science 2857:80–93, 2003
- Cited by 8 (5 self)
We consider bit-parallel algorithms of Boyer-Moore type for exact string matching. We introduce a two-way modification of the BNDM algorithm. If the text character aligned with the end of the pattern is a mismatch, we continue by examining text characters after the alignment. Besides this two-way variation, we present a simplified version of BNDM without prefix search and an algorithm scheme for long patterns. We also study a different bit-parallel algorithm, which keeps the history of examined characters in a bit-vector and where shifting is based on this bit-vector. We report experiments where we compared the new algorithms with existing ones. The simplified BNDM is the most promising of the new algorithms in practice.
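For reference, the baseline BNDM algorithm that these variants modify can be sketched as follows. This is the standard one-way BNDM with prefix search, not the two-way or simplified versions proposed in the paper. The function name `bndm_search` is an assumption, and Python integers stand in for a machine word, so the sketch behaves as if the pattern fits in one word.

```python
def bndm_search(pattern: str, text: str) -> list[int]:
    """Return 0-based start positions of exact occurrences of pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    mask = (1 << m) - 1
    top = 1 << (m - 1)
    # b[c] has bit (m-1-i) set iff pattern[i] == c
    b: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        b[ch] = b.get(ch, 0) | (1 << (m - 1 - i))

    hits = []
    pos = 0
    while pos <= n - m:
        j, last = m, m
        d = mask  # every pattern factor is still possible
        # Scan the window right to left while the suffix read so far
        # is still a factor of the pattern.
        while d != 0:
            d &= b.get(text[pos + j - 1], 0)
            j -= 1
            if d & top:
                if j > 0:
                    last = j   # the window suffix is a pattern prefix
                else:
                    hits.append(pos)
                    break
            d = (d << 1) & mask
        pos += last  # shift to align the longest recognized prefix
    return hits


print(bndm_search("aba", "abababa"))
```

The variable `last` records where the most recent pattern prefix began inside the window, which gives the safe shift once the scan stops.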
LZgrep: A Boyer-Moore String Matching Tool for Ziv-Lempel Compressed Text
- Soft. Pract. Exper., 2005
- Cited by 7 (2 self)
We present a Boyer-Moore approach to string matching over LZ78 and LZW compressed text. The idea is to search the text directly in compressed form instead of decompressing and then searching it. We modify the Boyer-Moore approach so as to skip text using the characters explicitly represented in the LZ78/LZW formats, modifying the basic technique where the algorithm can choose which characters to inspect. We present and compare several solutions for single and multipattern search. We show that our algorithms obtain speedups of up to 50% compared to the simple decompress-then-search approach. Finally, we present a public tool, LZgrep, which uses our algorithms to offer grep-like capabilities, directly searching files compressed using Unix's Compress, an LZW compressor. LZgrep can also search files compressed with Unix gzip, using new decompress-then-search techniques we develop, which are faster than the current tools. This way, users can always keep their files in compressed form and still search them, uncompressing only when they want to see them.
Flexible pattern matching
- Journal of Applied Statistics, 2002
- Cited by 7 (0 self)
An important subtask of the pattern discovery process is pattern matching, where the pattern sought is already known and we want to determine how often and where it occurs in a sequence. In this paper we review the most practical techniques to find patterns of different kinds. We show how regular expressions can be searched for with general techniques, and how simpler patterns can be dealt with more simply and efficiently. We consider exact as well as approximate pattern matching. Also, we cover both sequential searching, where the sequence cannot be preprocessed, and indexed searching, where we have a data structure built over the sequence to speed up the search.
Multi-pattern string matching with q-grams
- ACM Journal of Experimental Algorithmics, 2006
- Cited by 6 (2 self)
We present three algorithms for exact string matching of multiple patterns. Our algorithms are filtering methods, which apply q-grams and bit parallelism. We ran extensive experiments with them and compared them with various versions of earlier algorithms, e.g. different trie implementations of the Aho-Corasick algorithm. All of our algorithms proved to be substantially faster than earlier solutions for sets of 1,000–10,000 patterns, and the good performance of two of them continues to 100,000 patterns. The gain is due to the improved filtering efficiency caused by q-grams.
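The general idea of q-gram filtering for multiple patterns can be sketched as below. This is a deliberately simple hash-based prefilter with verification, assumed here for illustration only; it is not one of the three bit-parallel algorithms of the paper. The function name `multi_search`, the default q = 2, and the requirement that every pattern has at least q characters are all assumptions.

```python
from collections import defaultdict


def multi_search(patterns: list[str], text: str, q: int = 2) -> list[tuple[int, str]]:
    """Return (start position, pattern) pairs for exact occurrences,
    in order of increasing start position."""
    if any(len(p) < q for p in patterns):
        raise ValueError("every pattern must have at least q characters")
    # Filter table: leading q-gram -> patterns that start with it
    by_gram: dict[str, list[str]] = defaultdict(list)
    for p in patterns:
        by_gram[p[:q]].append(p)

    hits = []
    for i in range(len(text) - q + 1):
        # Filtering phase: most positions have no candidate patterns.
        for p in by_gram.get(text[i:i + q], ()):
            # Verification phase: compare the full candidate pattern.
            if text.startswith(p, i):
                hits.append((i, p))
    return hits


print(multi_search(["grep", "graph", "rep"], "agrep nrgrep"))
```

A larger q makes each q-gram rarer, so fewer positions pass the filter and fewer verifications are needed.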

