Results 1-10 of 47
A Guided Tour to Approximate String Matching
 ACM Computing Surveys
, 1999
Cited by 404 (38 self)
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
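As a rough illustration of the online, edit-distance-based searching this survey focuses on, the following minimal sketch uses the classical column-wise dynamic programming approach (usually attributed to Sellers). The function name `approx_search` and its return format are assumptions made for this example and do not come from the survey.

```python
def approx_search(pattern: str, text: str, k: int) -> list[tuple[int, int]]:
    """Return (end_index, distance) for every text position where some
    substring ending there is within edit distance k of the pattern."""
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any substring
    # of the text ending at the current position. col[0] stays 0 because
    # a match is allowed to start anywhere in the text.
    col = list(range(m + 1))
    hits = []
    for pos, ch in enumerate(text):
        prev_diag = col[0]  # value of col[i - 1] before this text character
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(
                prev_diag + (pattern[i - 1] != ch),  # match or substitution
                old + 1,                             # extra text character
                col[i - 1] + 1,                      # missing pattern character
            )
            prev_diag = old
        if col[m] <= k:
            hits.append((pos, col[m]))
    return hits


print(approx_search("abc", "abd", 1))  # [(1, 1), (2, 1)]
```

This sketch takes O(mn) time for a pattern of length m and a text of length n. The bit-parallel and filtering techniques listed in the other entries exist largely to improve on that bound in practice.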
NRgrep: A Fast and Flexible Pattern Matching Tool
 Software Practice and Experience (SPE)
, 2000
Cited by 37 (7 self)
We present nrgrep ("nondeterministic reverse grep"), a new pattern matching tool designed for efficient search of complex patterns. Unlike previous tools of the grep family, such as agrep and GNU grep, nrgrep is based on a single and uniform concept: the bit-parallel simulation of a nondeterministic suffix automaton. As a result, nrgrep can find from simple patterns to regular expressions, exactly or allowing errors in the matches, with an efficiency that degrades smoothly as the complexity of the searched pattern increases. Another concept fully integrated into nrgrep and that contributes to this smoothness is the selection of adequate subpatterns for fast scanning, which is also absent in many current tools. We show that the efficiency of nrgrep is similar to that of the fastest existing string matching tools for the simplest patterns, and by far unpaired for more complex patterns.
Substring selectivity estimation
 In Proceedings of the ACM Symposium on Principles of Database Systems
, 1999
Cited by 30 (4 self)
We study the problem of estimating selectivity of approximate substring queries. Its importance in databases is ever increasing as more and more data are input by users and are integrated with many typographical errors and different spelling conventions. To begin with, we consider edit distance for the similarity between a pair of strings. Based on information stored in an extended N-gram table, we propose two estimation algorithms, MOF and LBS, for the task. The latter extends the former with ideas from set hashing signatures. The experimental results show that MOF is a lightweight algorithm that gives fairly accurate estimations. However, if more space is available, LBS can give better accuracy than MOF and other baseline methods. Next, we extend the proposed solution to other similarity predicates, SQL LIKE operator and Jaccard similarity.
Extending Q-grams to estimate selectivity of string matching with low edit distance
 In VLDB
, 2007
Cited by 19 (1 self)
There are many emerging database applications that require accurate selectivity estimation of approximate string matching queries. Edit distance is one of the most commonly used string similarity measures. In this paper, we study the problem of estimating selectivity of string matching with low edit distance. Our framework is based on extending q-grams with wildcards. Based on the concepts of replacement semilattice, string hierarchy and a combinatorial analysis, we develop the formulas for selectivity estimation and provide the algorithm BasicEQ. We next develop the algorithm OptEQ by enhancing BasicEQ with two novel improvements. Finally we show a comprehensive set of experiments using three benchmarks comparing OptEQ with the state-of-the-art method SEPIA. Our experimental results show that OptEQ delivers more accurate selectivity estimations.
Bit-parallel Witnesses and their Applications to Approximate String Matching
 Algorithmica
, 2005
Cited by 19 (3 self)
We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one, BPM [Myers, J. of the ACM, 1999], searches for a pattern of length m in a text of length n permitting k differences in O(⌈m/w⌉n) time, where w is the width of the computer word. The second one, ABNDM [Navarro and Raffinot, ACM JEA, 2000], extends a sublinear-time exact algorithm to approximate searching. ABNDM relies on another algorithm, BPA [Wu and Manber, Comm. ACM, 1992], which makes use of an O(k⌈m/w⌉n) time algorithm for its internal workings. BPA is slow but flexible enough to support all operations required by ABNDM. We improve previous ABNDM analyses, showing that it is average-optimal in number of inspected characters, although the overall complexity is higher because of the O(k⌈m/w⌉) work done per inspected character. We then show that the faster BPM can be adapted to support all the operations required by ABNDM. This involves extending it to compute edit distance, to search for any pattern suffix, and to detect in advance the impossibility of a later match. The solution to those challenges is based on the concept of a witness, which permits sampling some dynamic programming matrix values so as to bound, deduce, or compute others fast. The resulting algorithm is average-optimal for m ≤ w, assuming the alphabet size is constant. In practice, it performs better than the original ABNDM and is the fastest algorithm for several combinations of m, k and alphabet sizes that are useful, for example, in natural language searching and computational biology. To show that the concept of witnesses can be used in further scenarios, we also improve a recent bit-parallel algorithm based on Myers [Fredriksson, SPIRE 2003]. The use of witn...
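To give a feel for the kind of bit-parallel simulation the abstract attributes to BPA, here is a minimal sketch in the style of the Wu and Manber Shift-And extension with k errors. It is not the witness-based algorithm of the paper, and the function name `bitparallel_approx_search` is an assumption made for this example. Python integers stand in for machine words, so the ⌈m/w⌉ factor is hidden inside the big-integer arithmetic.

```python
def bitparallel_approx_search(pattern: str, text: str, k: int) -> list[int]:
    """Return the end indices in text of substrings within edit distance k
    of pattern, keeping one bit vector per allowed error count."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    full = (1 << m) - 1      # mask that keeps only the m state bits
    accept = 1 << (m - 1)    # bit that marks a complete pattern match
    masks: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    # rows[j] has bit i set when pattern[:i + 1] matches a suffix of the
    # text read so far with at most j errors. Before any text is read, a
    # prefix of length i + 1 needs i + 1 errors, so the low j bits are set.
    rows = [((1 << j) - 1) & full for j in range(k + 1)]
    ends = []
    for pos, ch in enumerate(text):
        b = masks.get(ch, 0)
        old_prev = rows[0]                   # rows[j - 1] before this character
        rows[0] = ((old_prev << 1) | 1) & b  # plain Shift-And step, no errors
        for j in range(1, k + 1):
            old_cur = rows[j]
            rows[j] = (
                (((old_cur << 1) | 1) & b)  # current character matches
                | old_prev                  # extra character in the text
                | (old_prev << 1)           # substituted character
                | (rows[j - 1] << 1)        # character missing from the text
                | 1                         # first pattern character, one error
            ) & full
            old_prev = old_cur
        if rows[k] & accept:
            ends.append(pos)
    return ends


print(bitparallel_approx_search("abc", "abd", 1))  # [1, 2]
```

Each text character costs O(k) word operations per machine word of the pattern, which corresponds to the O(k⌈m/w⌉n) bound quoted for BPA above.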
Alternative algorithms for bit-parallel string matching
 In Proc SPIRE’03, Lecture Notes in Computer Science 2857:80–93
, 2003
Cited by 14 (6 self)
We consider bit-parallel algorithms of Boyer-Moore type for exact string matching. We introduce a two-way modification of the BNDM algorithm. If the text character aligned with the end of the pattern is a mismatch, we continue by examining text characters after the alignment. Besides this two-way variation, we present a simplified version of BNDM without prefix search and an algorithm scheme for long patterns. We also study a different bit-parallel algorithm, which keeps the history of examined characters in a bit-vector and where shifting is based on this bit-vector. We report experiments where we compared the new algorithms with existing ones. The simplified BNDM is the most promising of the new algorithms in practice.
Compact DFA Representation for Fast Regular Expression Search
, 2001
Cited by 10 (6 self)
We present a new technique to encode a deterministic finite automaton (DFA). Based on the specific properties of Glushkov's nondeterministic finite automaton (NFA) construction algorithm, we are able to encode the DFA using (m + 1)(2^(m+1) + |Σ|) bits, where m is the number of characters (excluding operator symbols) in the regular expression and Σ is the alphabet. This compares favorably against the worst case of (m + 1) 2^(m+1) |Σ| bits needed by a classical DFA representation and m(2^(2m+1) + |Σ|) bits needed by the Wu and Manber approach implemented in Agrep. Our approach is practical and simple to implement, and it permits searching regular expressions of moderate size (which include most cases of interest) faster than with any previously existing algorithm, as we show experimentally.
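To make the space comparison concrete, the sketch below evaluates the three bit counts for a given m and alphabet size. It assumes the formulas read as (m + 1)(2^(m+1) + |Σ|), (m + 1) 2^(m+1) |Σ| and m(2^(2m+1) + |Σ|). The function name `dfa_encoding_bits` and the sample values m = 8 and |Σ| = 256 are chosen for illustration only and are not taken from the paper.

```python
def dfa_encoding_bits(m: int, sigma: int) -> dict[str, int]:
    """Evaluate the three space bounds (in bits) for a regular expression
    with m characters over an alphabet of size sigma."""
    return {
        # (m + 1)(2^(m+1) + |Sigma|): the compact encoding from the abstract
        "compact": (m + 1) * (2 ** (m + 1) + sigma),
        # (m + 1) 2^(m+1) |Sigma|: worst case of a classical DFA table
        "classical_worst_case": (m + 1) * 2 ** (m + 1) * sigma,
        # m(2^(2m+1) + |Sigma|): the Wu and Manber approach
        "wu_manber": m * (2 ** (2 * m + 1) + sigma),
    }


for name, bits in dfa_encoding_bits(8, 256).items():
    print(f"{name}: {bits} bits")
# compact: 6912 bits
# classical_worst_case: 1179648 bits
# wu_manber: 1050624 bits
```

For these sample values the compact encoding is over a hundred times smaller than either alternative. The gap widens as m grows because the other two bounds multiply the exponential term by |Σ| or double the exponent.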
Multipattern string matching with q-grams
 ACM Journal of Experimental Algorithmics
, 2006
Cited by 10 (4 self)
We present three algorithms for exact string matching of multiple patterns. Our algorithms are filtering methods, which apply q-grams and bit parallelism. We ran extensive experiments with them and compared them with various versions of earlier algorithms, e.g. different trie implementations of the Aho-Corasick algorithm. All of our algorithms proved to be substantially faster than earlier solutions for sets of 1,000–10,000 patterns, and the good performance of two of them continues to 100,000 patterns. The gain is due to the improved filtering efficiency caused by q-grams.
Flexible pattern matching
 Journal of Applied Statistics
, 2002
Cited by 9 (0 self)
An important subtask of the pattern discovery process is pattern matching, where the pattern sought is already known and we want to determine how often and where it occurs in a sequence. In this paper we review the most practical techniques to find patterns of different kinds. We show how regular expressions can be searched for with general techniques, and how simpler patterns can be dealt with more simply and efficiently. We consider exact as well as approximate pattern matching. Also, we cover both sequential searching, where the sequence cannot be preprocessed, and indexed searching, where we have a data structure built over the sequence to speed up the search.
LZgrep: A Boyer-Moore String Matching Tool for Ziv-Lempel Compressed Text
 Soft. Pract. Exper
, 2005
Cited by 8 (2 self)
We present a Boyer-Moore approach to string matching over LZ78 and LZW compressed text. The idea is to search the text directly in compressed form instead of decompressing and then searching it. We modify the Boyer-Moore approach so as to skip text using the characters explicitly represented in the LZ78/LZW formats, modifying the basic technique where the algorithm can choose which characters to inspect. We present and compare several solutions for single and multipattern search. We show that our algorithms obtain speedups of up to 50% compared to the simple decompress-then-search approach. Finally, we present a public tool, LZgrep, which uses our algorithms to offer grep-like capabilities, directly searching files compressed using Unix's Compress, an LZW compressor. LZgrep can also search files compressed with Unix gzip, using new decompress-then-search techniques we develop, which are faster than the current tools. This way, users can always keep their files in compressed form and still search them, uncompressing only when they want to see them.