Fast and Flexible String Matching by Combining Bit-parallelism and Suffix Automata
ACM Journal of Experimental Algorithmics (JEA), 1998
Cited by 60 (11 self)
... In this paper we merge bit-parallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bit-parallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as Shift-Or. It inherits from Shift-Or the ability to handle flexible patterns and from BDM the ability to skip characters. BNDM is 30%-40% faster than BDM and up to 7 times faster than Shift-Or. When compared to the fastest existing algorithms on exact patterns (which belong to the BM family), BNDM is from 20% slower to 3 times faster, depending on the alphabet size. With respect to flexible pattern searching, BNDM is by far the fastest technique to deal with classes of characters and is competitive to search allowing errors. In particular, BNDM seems very adequate for computational biology applications, since it is the fastest algorithm to search on DNA sequences and flexible searching is an important problem in that
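The idea described in this abstract can be illustrated with a minimal Python sketch of BNDM for exact patterns. It is not the authors' implementation: the function name `bndm_search` and the dictionary of bit masks are assumptions made for illustration, and Python integers stand in for the machine word, so the usual limit of the pattern length to the word size does not apply here.

```python
def bndm_search(pattern, text):
    """Return the start indices of all occurrences of pattern in text.

    Illustrative BNDM sketch: each text window is scanned right to left
    while a bit vector tracks which pattern factors are still alive.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high_bit = 1 << (m - 1)
    all_ones = (1 << m) - 1
    # masks[c] has bit (m - 1 - j) set whenever pattern[j] == c.
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - j))

    hits = []
    pos = 0
    while pos <= n - m:
        j, last = m, m
        state = all_ones
        while state:
            # Consume the next window character, moving leftwards.
            state &= masks.get(text[pos + j - 1], 0)
            j -= 1
            if state & high_bit:
                if j > 0:
                    # A pattern prefix ends the window: remember the shift.
                    last = j
                else:
                    # The whole window matched the pattern.
                    hits.append(pos)
            state = (state << 1) & all_ones
        # Skip ahead by the longest safe shift found in this window.
        pos += last
    return hits
```

For example, `bndm_search("abra", "abracadabra")` examines three windows (at positions 0, 3, and 7) rather than all eight, which is the character-skipping behaviour the abstract attributes to BDM.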
Flexible pattern matching
Journal of Applied Statistics, 2002
Cited by 10 (0 self)
An important subtask of the pattern discovery process is pattern matching, where the pattern sought is already known and we want to determine how often and where it occurs in a sequence. In this paper we review the most practical techniques to find patterns of different kinds. We show how regular expressions can be searched for with general techniques, and how simpler patterns can be dealt with more simply and efficiently. We consider exact as well as approximate pattern matching. Also, we cover both sequential searching, where the sequence cannot be preprocessed, and indexed searching, where we have a data structure built over the sequence to speed up the search.
Finding patterns with variable length gaps or don’t cares
of Lecture Notes in Computer Science, 2006
Cited by 6 (3 self)
Abstract. In this paper we have presented new algorithms to handle the pattern matching problem where the pattern can contain variable length gaps. Given a pattern P with variable length gaps and a text T, our algorithm works in $O(n + m + \alpha \log(\max_{1 \le i \le l}(b_i - a_i)))$ time, where n is the length of the text, m is the summation of the lengths of the component subpatterns, α is the total number of occurrences of the component subpatterns in the text, and $a_i$ and $b_i$ are, respectively, the minimum and maximum number of don’t cares allowed between the ith and (i+1)st component of the pattern. We also present another algorithm which, given a suffix array of the text, can report whether P occurs in T in $O(m + \alpha \log \log n)$ time. Both the algorithms record information to report all the occurrences of P in T. Furthermore, the techniques used in our algorithms are shown to be useful in many other contexts.
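To make the problem statement concrete, here is a naive Python baseline for matching a pattern with variable length gaps. It is not the algorithm of the paper and does not achieve its time bounds; the names `find_all` and `match_with_gaps`, and the representation of the pattern as a list of components plus a list of `(a_i, b_i)` pairs, are assumptions made for illustration.

```python
def find_all(text, sub):
    """Start positions of all (possibly overlapping) occurrences of sub."""
    starts = []
    i = text.find(sub)
    while i != -1:
        starts.append(i)
        i = text.find(sub, i + 1)
    return starts


def match_with_gaps(text, components, gaps):
    """Return the end positions of all matches of the gapped pattern.

    components: non-empty subpatterns s_1, ..., s_k.
    gaps: pairs (a_i, b_i), the minimum and maximum number of don't
          cares allowed between component i and component i + 1.
    """
    # ends holds the index just past each match of the prefix so far.
    ends = {s + len(components[0]) for s in find_all(text, components[0])}
    for comp, (lo, hi) in zip(components[1:], gaps):
        next_ends = set()
        for s in find_all(text, comp):
            # The gap before this occurrence must have length lo..hi.
            if any((s - g) in ends for g in range(lo, hi + 1)):
                next_ends.add(s + len(comp))
        ends = next_ends
    return sorted(e - 1 for e in ends)
```

With the text `"abxxcdxcd"`, the components `["ab", "cd"]` and a gap of 1 to 2 symbols, only the occurrence ending at index 5 is reported; widening the gap to 1 to 5 symbols also admits the one ending at index 8.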
New techniques for regular expression searching
Algorithmica, 2005
Cited by 6 (0 self)
We present two new techniques for regular expression searching and use them to derive faster practical algorithms. Based on the specific properties of Glushkov’s nondeterministic finite automaton construction algorithm, we show how to encode a deterministic finite automaton (DFA) using $O(m2^m)$ bits, where m is the number of characters, excluding operator symbols, in the regular expression. This compares favorably against the worst case of $O(m2^m|\Sigma|)$ bits needed by a classical DFA representation (where Σ is the alphabet) and $O(m2^{2m})$ bits needed by the Wu and Manber approach implemented in Agrep. We also present a new way to search for regular expressions, which is able to skip text characters. The idea is to determine the minimum length ℓ of a string matching the regular expression, manipulate the original automaton so that it recognizes all the reverse prefixes of length up to ℓ of the strings originally accepted, and use it to skip text characters as done for exact string matching in previous work. We combine these techniques into two algorithms, one able and one unable to skip text characters. The algorithms are simple to implement, and our experiments show that they permit fast searching for regular expressions, normally faster than any existing algorithm.
Efficient Bit-parallel Algorithms for (δ, α)-matching
Cited by 2 (2 self)
Abstract. We consider the following string matching problem. Pattern $p_0 p_1 p_2 \ldots p_{m-1}$ (δ, α)-matches the text substring $t_{i_0} t_{i_1} t_{i_2} \ldots t_{i_{m-1}}$ if $|p_j - t_{i_j}| \le \delta$ for $j \in \{0, \ldots, m-1\}$, where $0 < i_{j+1} - i_j \le \alpha + 1$. The task is then to find all text positions $i_{m-1}$ that (δ, α)-match the pattern. For a text of length n, the best previously known algorithms for this string matching problem run in time $O(nm)$ and in time $O(n \lceil m\alpha/w \rceil)$, where the former is based on dynamic programming, and the latter on bit-parallelism with w bits in computer word (32 or 64 typically). We improve these to take $O(n\delta + \lceil n/w \rceil m)$ and $O(n \lceil m \log(\alpha)/w \rceil)$, respectively, worst case time using bit-parallelism. On average the algorithms run in $O(\lceil n/w \rceil \lceil \alpha\delta/\sigma \rceil + n)$ and $O(n)$ time. Our experimental results show that the algorithms work extremely well in practice. Our algorithms handle general gaps as well, having important applications in computational biology.
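The definition of (δ, α)-matching above can be checked directly with a simple dynamic programming sketch in Python. This is a plain baseline that looks back over the previous α + 1 positions for every cell, so it runs in O(nmα) time; it is neither the O(nm) dynamic programming algorithm nor any of the bit-parallel algorithms the abstract refers to. The function name `delta_alpha_match` and the use of integer lists for pattern and text are assumptions.

```python
def delta_alpha_match(pattern, text, delta, alpha):
    """Return all text positions i_{m-1} that (delta, alpha)-match.

    pattern, text: sequences of integers (for example, pitch values).
    A pattern symbol matches a text symbol if they differ by at most
    delta; consecutive matched text positions may be separated by at
    most alpha skipped symbols.
    """
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    # prev[i] is True if pattern[0..j-1] can be matched ending at text[i].
    prev = [abs(pattern[0] - t) <= delta for t in text]
    for j in range(1, m):
        cur = [False] * n
        for i in range(n):
            if abs(pattern[j] - text[i]) <= delta:
                # The previous symbol must sit 1 to alpha + 1 positions back.
                lo = max(0, i - alpha - 1)
                cur[i] = any(prev[lo:i])
        prev = cur
    return [i for i, ok in enumerate(prev) if ok]
```

For the pattern `[60, 64, 67]` and the text `[60, 61, 65, 50, 66, 67]` with δ = 1, allowing one skipped symbol (α = 1) yields the end position 4, while α = 0 forbids gaps and yields no match.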
Automated search for LTR retrotransposons
2002
Cited by 1 (0 self)
Introduction. In the last few years many technological improvements in the production and analysis of DNA sequence data [29, 24] have made possible the complete sequencing of whole genomes: beginning with the microbial ones [13, 7] and continuing with those of many eukaryotic species [25, 2, 34]. The enormous amount of raw genomic data available has favoured a focused attention of the scientific community to the problem of genome annotation. Genome annotation is the process of taking the raw DNA sequence data and adding the layers of analysis and interpretation necessary to extract its biological significance and place it in the context of the understanding of biological processes [31]. The annotation process, at nucleotide level, comprises different tasks: gene finding, searching for noncoding RNAs and regulatory regions, identification of large segmental duplications in the genome and identification of repetitive elements. While several different tools are available to autom
Fast practical exact and approximate pattern matching in protein sequences
In Proceedings of the 17th Australasian Workshop on Combinatorial Algorithms, 2006
Cited by 1 (1 self)
Abstract. Here we design, analyse and implement an algorithm that searches for motifs in protein sequences using masking techniques (“word-level” parallelism). Our algorithm speeds up known algorithms by a factor of 20 (or the alphabet size). Furthermore, we present graphs of the running times of the algorithm in comparison to its theoretical time complexity.
Efficient algorithms for (δ, γ, α)-matching
Cited by 1 (0 self)
Abstract. We propose new algorithms for (δ, γ, α)-matching. In this string matching problem we are given a pattern $P = p_0 p_1 \ldots p_{m-1}$ and a text $T = t_0 t_1 \ldots t_{n-1}$ over some integer alphabet $\Sigma = \{0 \ldots \sigma - 1\}$. The pattern symbol $p_i$ matches the text symbol $t_j$ iff $|p_i - t_j| \le \delta$. The pattern P (δ, γ)-matches some text substring $t_j \ldots t_{j+m-1}$ iff for all i it holds that $|p_i - t_{j+i}| \le \delta$ and $\sum_i |p_i - t_{j+i}| \le \gamma$. Finally, in (δ, γ, α)-matching we also permit at most α length gaps (text substrings) between each matching text symbol. The only known previous algorithm runs in $O(mn)$ time. We give several algorithms that improve the average case up to $O(n)$ for small α, and the worst case to $O(\min\{mn, |M|\alpha\})$ or $O(mn \log \gamma / w)$, where $M = \{(i, j) \mid |p_i - t_j| \le \delta\}$ and w is the number of bits in a machine word. We conclude with experimental results showing that the algorithms are very efficient in practice. Key words: approximate string matching, music information retrieval, bit-parallelism, sparse dynamic programming
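As a small illustration of the (δ, γ) condition defined above, the following Python sketch checks every text window by brute force in O(mn) time. It covers only the gap-free case (no α) and is not one of the algorithms proposed in the paper; the function name `delta_gamma_match` and the integer-list inputs are assumptions.

```python
def delta_gamma_match(pattern, text, delta, gamma):
    """Return the start positions j where pattern (delta, gamma)-matches.

    Each symbol pair may differ by at most delta, and the sum of all
    differences over the window may be at most gamma.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    starts = []
    for j in range(n - m + 1):
        total = 0
        ok = True
        for i in range(m):
            diff = abs(pattern[i] - text[j + i])
            total += diff
            # Stop as soon as either bound is violated.
            if diff > delta or total > gamma:
                ok = False
                break
        if ok:
            starts.append(j)
    return starts
```

With the pattern `[1, 5, 3]`, the text `[1, 5, 3, 2, 6, 4, 0, 9]` and δ = 1, the window starting at index 3 differs by 1 in every symbol: it is accepted for γ = 3 but rejected for γ = 2, which shows how γ bounds the accumulated error.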
Weighted Degenerated Approximate Pattern Matching
Cited by 1 (0 self)
Abstract. We present a bit-parallel approach to the degenerated approximate pattern matching problem. That is, the problem of finding approximate matches of a “special” pattern in a text of degenerate symbols. The special pattern is $P = s_1 *(a_1, b_1) \ldots s_\ell *(a_\ell, b_\ell)\, s_{\ell+1} *(a_{\ell+1}, b_{\ell+1}) \ldots s_\omega$, such that the symbol $*(a, b)$ is a sequence of at most b but at least a “don’t care” symbols which match any symbol within the alphabet, i.e. a sequence of subpatterns with gaps; the pattern is associated with integer weights in each subpattern $s_\ell$ for replacements, insertions, and deletions. The problem is to match the pattern such that the minimum sum of weights is achieved. The total time complexity is $(k(\log(k+2)+1)mn)/w$, where m is the length of the pattern P, n is the length of the text of degenerate symbols, k is the maximum number of edit operations performed, and w is the length of the computer word.