Results 1  10
of
33
Fast and Flexible String Matching by Combining Bitparallelism and Suffix Automata
 ACM JOURNAL OF EXPERIMENTAL ALGORITHMICS (JEA
, 1998
"... ... In this paper we merge bitparallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bitparallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as ShiftOr. It inher ..."
Abstract

Cited by 77 (11 self)
 Add to MetaCart
... In this paper we merge bitparallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bitparallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as ShiftOr. It inherits from ShiftOr the ability to handle flexible patterns and from BDM the ability to skip characters. BNDM is 30%40% faster than BDM and up to 7 times faster than ShiftOr. When compared to the fastest existing algorithms on exact patterns (which belong to the BM family), BNDM is from 20% slower to 3 times faster, depending on the alphabet size. With respect to flexible pattern searching, BNDM is by far the fastest technique to deal with classes of characters and is competitive to search allowing errors. In particular, BNDM seems very adequate for computational biology applications, since it is the fastest algorithm to search on DNA sequences and flexible searching is an important problem in that
Finding patterns with variable length gaps or don’t cares
 of Lecture Notes in Computer Science
, 2006
"... Abstract. In this paper we have presented new algorithms to handle the pattern matching problem where the pattern can contain variable length gaps. Given a pattern P with variable length gaps and a text T our algorithm works in O(n + m + α log(max1<=i<=l(bi − ai))) time where n is the length o ..."
Abstract

Cited by 17 (3 self)
 Add to MetaCart
(Show Context)
Abstract. In this paper we have presented new algorithms to handle the pattern matching problem where the pattern can contain variable length gaps. Given a pattern P with variable length gaps and a text T our algorithm works in O(n + m + α log(max1<=i<=l(bi − ai))) time where n is the length of the text, m is the summation of the lengths of the component subpatterns, α is the total number of occurrences of the component subpatterns in the text and ai and bi are, respectively, the minimum and maximum number of don’t cares allowed between the ith and (i+1)st component of the pattern. We also present another algorithm which, given a suffix array of the text, can report whether P occurs in T in O(m + α log log n) time. Both the algorithms record information to report all the occurrences of P in T. Furthermore, the techniques used in our algorithms are shown to be useful in many other contexts. 1
Efficient algorithms for pattern matching with general gaps
"... and character classes ..."
(Show Context)
New techniques for regular expression searching
 Algorithmica
, 2005
"... We present two new techniques for regular expression searching and use them to derive faster practical algorithms. Based on the specific properties of Glushkov’s nondeterministic finite automaton construction algorithm, we show how to encode a deterministic finite automaton (DFA) using O(m2 m) bits, ..."
Abstract

Cited by 12 (0 self)
 Add to MetaCart
(Show Context)
We present two new techniques for regular expression searching and use them to derive faster practical algorithms. Based on the specific properties of Glushkov’s nondeterministic finite automaton construction algorithm, we show how to encode a deterministic finite automaton (DFA) using O(m2 m) bits, where m is the number of characters, excluding operator symbols, in the regular expression. This compares favorably against the worst case of O(m2 m Σ) bits needed by a classical DFA representation (where Σ is the alphabet) and O(m2 2m) bits needed by the Wu and Manber approach implemented in Agrep. We also present a new way to search for regular expressions, which is able to skip text characters. The idea is to determine the minimum length ℓ of a string matching the regular expression, manipulate the original automaton so that it recognizes all the reverse prefixes of length up to ℓ of the strings originally accepted, and use it to skip text characters as done for exact string matching in previous work. We combine these techniques into two algorithms, one able and one unable to skip text characters. The algorithms are simple to implement, and our experiments show that they permit fast searching for regular expressions, normally faster than any existing algorithm. 1
Faster regular expression matching
 Automata, Languages and Programming
, 2009
"... Abstract. Regular expression matching is a key task (and often the computational bottleneck) in a variety of widely used software tools and applications, for instance, the unix grep and sed commands, scripting languages such as awk and perl, programs for analyzing massive data streams, etc. We show ..."
Abstract

Cited by 8 (2 self)
 Add to MetaCart
Abstract. Regular expression matching is a key task (and often the computational bottleneck) in a variety of widely used software tools and applications, for instance, the unix grep and sed commands, scripting languages such as awk and perl, programs for analyzing massive data streams, etc. We show how to solve this ubiquitous task in linear space and O(nm(log log n)/(logn)3/2+n+m) time where m is the length of the expression and n the length of the string. This is the first improvement for the dominant O(nm / logn) term in Myers ’ O(nm / logn+ (n+m) logn) bound [JACM 1992]. We also get improved bounds for external memory. 1
String matching with variable length gaps
 In Proc. 17th SPIRE
, 2010
"... Abstract. We consider string matching with variable length gaps. Given a string T and a pattern P consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in T that match P. This problem is ..."
Abstract

Cited by 5 (2 self)
 Add to MetaCart
(Show Context)
Abstract. We consider string matching with variable length gaps. Given a string T and a pattern P consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in T that match P. This problem is a basic primitive in computational biology applications. Let m and n be the lengths of P and T, respectively, and let k be the number of strings in P. We present a new algorithm achieving time O((n+m) log k+α) and space O(m+A), where A is the sum of the lower bounds of the lengths of the gaps in P and α is the total number of occurrences of the strings in P within T. Compared to the previous results this bound essentially achieves the best known time and space complexities simultaneously. Consequently, our algorithm obtains the best known bounds for almost all combinations of m, n, k, A, and α. Our algorithm is surprisingly simple and straightforward to implement. 1
String indexing for patterns with wildcards. Theory Comput
 Syst
"... Abstract. We consider the problem of indexing a string t of length n to report the occurrences of a query pattern p containing m characters and j wildcards. Let occ be the number of occurrences of p in t, and σ the size of the alphabet. We obtain the following results. – A linear space index with qu ..."
Abstract

Cited by 4 (1 self)
 Add to MetaCart
(Show Context)
Abstract. We consider the problem of indexing a string t of length n to report the occurrences of a query pattern p containing m characters and j wildcards. Let occ be the number of occurrences of p in t, and σ the size of the alphabet. We obtain the following results. – A linear space index with query time O(m+σj log logn+ occ). This significantly improves the previously best known linear space index by Lam et al. [ISAAC 2007], which requires query time Θ(jn) in the worst case. – An index with query timeO(m+j+occ) using spaceO(σk 2 n logk logn), where k is the maximum number of wildcards allowed in the pattern. This is the first nontrivial bound with this query time. – A timespace tradeoff, generalizing the index by Cole et al. [STOC 2004]. Our results are obtained using a novel combination of wellknown and new techniques, which could be of independent interest. 1
Regular expression matching with multistrings and intervals
 In Proc. SODA’10
"... Regular expression matching is a key task (and often computational bottleneck) in a variety of software tools and applications. For instance, the standard grep and sed utilities, scripting languages such as perl, internet traffic analysis, XML querying, and protein searching. The basic definition of ..."
Abstract

Cited by 4 (1 self)
 Add to MetaCart
(Show Context)
Regular expression matching is a key task (and often computational bottleneck) in a variety of software tools and applications. For instance, the standard grep and sed utilities, scripting languages such as perl, internet traffic analysis, XML querying, and protein searching. The basic definition of a regular expression is that we combine characters with union, concatenation, and kleene star operators. The length m is proportional to the number of characters. However, often the initial operation is to concatenate characters in fairly long strings, e.g., if we search for certain combinations of words in a firewall. As a result, the number k of strings in the regular expression is significantly smaller than m. Our main result is a new algorithm that essentially replaces m with k in the complexity bounds for regular expression matching. More precisely, after an O(m log k) time and O(m) space preprocessing of the expression, we can match it in a string presented as a stream log w of characters in O(k w + log k) time per character, where w is the number of bits in a memory word. For large w, this corresponds to the previous best bound log w of O(m w + log m). Prior to this work no O(k) bound per character was known. We further extend our solution to efficiently handle character class interval operators C{x, y}. Here, C is a set of characters and C{x, y}, where x and y are integers such that 0 ≤ x ≤ y, represents a string of length between x and y from C. These character class intervals generalize variable length gaps which are frequently used for pattern matching in computational biology applications. 1
Efficient Bitparallel Algorithms for (δ, α)matching
"... Abstract. We consider the following string matching problem. Pattern p0p1p2... pm−1 (δ, α)matches the text substring ti0ti1ti2... ti m−1, if pj − ti j  ≤ δ for j ∈ {0,..., m − 1}, where 0 < ij+1 − ij ≤ α + 1. The task is then to find all text positions im−1 that (δ, α)match the pattern. For ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
Abstract. We consider the following string matching problem. Pattern p0p1p2... pm−1 (δ, α)matches the text substring ti0ti1ti2... ti m−1, if pj − ti j  ≤ δ for j ∈ {0,..., m − 1}, where 0 < ij+1 − ij ≤ α + 1. The task is then to find all text positions im−1 that (δ, α)match the pattern. For a text of length n, the best previously known algorithms for this string matching problem run in time O(nm) and in time O(n⌈mα/w⌉), where the former is based on dynamic programming, and the latter on bitparallelism with w bits in computer word (32 or 64 typically). We improve these to take O(nδ+⌈n/w⌉m) and O(n⌈m log(α)/w⌉), respectively, worst case time using bitparallelism. On average the algorithms run in O(⌈n/w⌉⌈αδ/σ⌉+n) and O(n) time. Our experimental results show that the algorithms work extremely well in practice. Our algorithms handle general gaps as well, having important applications in computational biology.