Fast and Flexible String Matching by Combining Bit-parallelism and Suffix Automata
- ACM Journal of Experimental Algorithmics (JEA), 1998
Cited by 51 (11 self)
... In this paper we merge bit-parallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bit-parallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as Shift-Or. It inherits from Shift-Or the ability to handle flexible patterns and from BDM the ability to skip characters. BNDM is 30%-40% faster than BDM and up to 7 times faster than Shift-Or. When compared to the fastest existing algorithms on exact patterns (which belong to the BM family), BNDM is from 20% slower to 3 times faster, depending on the alphabet size. With respect to flexible pattern searching, BNDM is by far the fastest technique to deal with classes of characters and is competitive to search allowing errors. In particular, BNDM seems very adequate for computational biology applications, since it is the fastest algorithm to search on DNA sequences and flexible searching is an important problem in that
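The idea summarised above, scanning each text window backwards while a bit vector tracks which pattern factors are still alive, can be illustrated with a minimal sketch. This is not the authors' code: the function name `bndm_search` and the convention that each pattern position is a string of allowed characters (so a plain string works as an exact pattern) are assumptions made for illustration. Python integers are unbounded, so the sketch does not enforce the usual restriction that the pattern fit in one machine word.

```python
def bndm_search(pattern, text):
    """Return start positions of all matches of `pattern` in `text`.

    `pattern` is a sequence whose elements are strings of allowed
    characters (hypothetical convention): "abc" is an exact pattern,
    ["G", "AT", "C"] allows A or T in the middle position.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # masks[c] has bit (m-1-i) set when position i of the pattern accepts c.
    masks = {}
    for i, allowed in enumerate(pattern):
        for c in allowed:
            masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))
    full = (1 << m) - 1
    top = 1 << (m - 1)
    occurrences = []
    pos = 0
    while pos <= n - m:
        j, last = m, m
        state = full  # every pattern factor is still a candidate
        while state != 0:
            # Read the window right to left.
            state &= masks.get(text[pos + j - 1], 0)
            j -= 1
            if state & top:
                if j > 0:
                    last = j  # a pattern prefix starts here: a safe shift
                else:
                    occurrences.append(pos)  # the whole window matched
            state = (state << 1) & full
        pos += last
    return occurrences


print(bndm_search("abab", "abababab"))
print(bndm_search(["G", "AT", "C"], "GACGTCGGC"))
```

The shift `last` is the reason characters can be skipped: when the bit vector becomes empty, no pattern factor ends inside the part of the window already read, so the window jumps to the last position where a pattern prefix was seen.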
Flexible pattern matching
- Journal of Applied Statistics, 2002
Cited by 7 (0 self)
An important subtask of the pattern discovery process is pattern matching, where the pattern sought is already known and we want to determine how often and where it occurs in a sequence. In this paper we review the most practical techniques to find patterns of different kinds. We show how regular expressions can be searched for with general techniques, and how simpler patterns can be dealt with more simply and efficiently. We consider exact as well as approximate pattern matching. Also, we cover both sequential searching, where the sequence cannot be preprocessed, and indexed searching, where we have a data structure built over the sequence to speed up the search.
Finding patterns with variable length gaps or don’t cares
- of Lecture Notes in Computer Science, 2006
Cited by 5 (3 self)
Abstract. In this paper we have presented new algorithms to handle the pattern matching problem where the pattern can contain variable length gaps. Given a pattern P with variable length gaps and a text T, our algorithm works in $O(n + m + \alpha \log(\max_{1 \le i \le l}(b_i - a_i)))$ time, where n is the length of the text, m is the summation of the lengths of the component subpatterns, α is the total number of occurrences of the component subpatterns in the text, and $a_i$ and $b_i$ are, respectively, the minimum and maximum number of don’t cares allowed between the ith and (i+1)st component of the pattern. We also present another algorithm which, given a suffix array of the text, can report whether P occurs in T in $O(m + \alpha \log \log n)$ time. Both the algorithms record information to report all the occurrences of P in T. Furthermore, the techniques used in our algorithms are shown to be useful in many other contexts.
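To make the problem statement concrete, here is a naive sketch of matching with variable length gaps: locate every occurrence of each component subpattern, then chain them under the gap bounds. It is a simple baseline for illustration and does not achieve the time bounds claimed in the abstract. The names `find_all` and `match_with_gaps`, and the representation of the gaps as a list of `(a, b)` pairs, are assumptions; component subpatterns are assumed to be non-empty.

```python
def find_all(text, sub):
    """Return every start position of `sub` in `text`, overlaps included."""
    starts = []
    k = text.find(sub)
    while k != -1:
        starts.append(k)
        k = text.find(sub, k + 1)
    return starts


def match_with_gaps(text, parts, gaps):
    """Return the sorted end positions (exclusive) of all full matches.

    parts: the component subpatterns, in order (non-empty strings).
    gaps:  gaps[i] = (a, b), the minimum and maximum number of don't-care
           characters allowed between parts[i] and parts[i + 1].
    """
    # End positions of partial matches covering the components seen so far.
    ends = {s + len(parts[0]) for s in find_all(text, parts[0])}
    for part, (a, b) in zip(parts[1:], gaps):
        starts = find_all(text, part)
        # Keep an occurrence only if some partial match ends a..b
        # characters before it starts.
        ends = {s + len(part) for s in starts
                if any(e + a <= s <= e + b for e in ends)}
    return sorted(ends)


print(match_with_gaps("abxxcdabcd", ["ab", "cd"], [(1, 3)]))
print(match_with_gaps("abxxcdabcd", ["ab", "cd"], [(0, 3)]))
```

The inner `any(...)` scan is what makes this baseline slow; it compares every occurrence against every surviving partial match.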
Efficient Bit-parallel Algorithms for (δ, α)-matching
Cited by 2 (2 self)
Abstract. We consider the following string matching problem. Pattern $p_0 p_1 p_2 \ldots p_{m-1}$ (δ, α)-matches the text substring $t_{i_0} t_{i_1} t_{i_2} \ldots t_{i_{m-1}}$ if $|p_j - t_{i_j}| \le \delta$ for $j \in \{0, \ldots, m-1\}$, where $0 < i_{j+1} - i_j \le \alpha + 1$. The task is then to find all text positions $i_{m-1}$ that (δ, α)-match the pattern. For a text of length n, the best previously known algorithms for this string matching problem run in time $O(nm)$ and in time $O(n \lceil m\alpha/w \rceil)$, where the former is based on dynamic programming, and the latter on bit-parallelism with w bits in computer word (32 or 64 typically). We improve these to take $O(n\delta + \lceil n/w \rceil m)$ and $O(n \lceil m \log(\alpha)/w \rceil)$, respectively, worst case time using bit-parallelism. On average the algorithms run in $O(\lceil n/w \rceil \lceil \alpha\delta/\sigma \rceil + n)$ and $O(n)$ time. Our experimental results show that the algorithms work extremely well in practice. Our algorithms handle general gaps as well, having important applications in computational biology.
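The (δ, α)-matching definition can be checked directly with a small dynamic-programming sketch over integer sequences. This is a straightforward baseline, not one of the bit-parallel algorithms from the paper, and it rescans up to α + 1 earlier cells per entry, so it is slower than the O(nm) dynamic programming the abstract mentions. The function name `delta_alpha_match_ends` is hypothetical.

```python
def delta_alpha_match_ends(pattern, text, delta, alpha):
    """Return every text position at which a (delta, alpha)-match ends.

    pattern, text: sequences of integers (for example, pitch values).
    delta: largest allowed difference between matched symbols.
    alpha: largest number of text symbols skipped between two
           consecutive matched symbols.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    # prev[i] is True when pattern[0..j] can be matched ending at text[i].
    prev = [abs(pattern[0] - t) <= delta for t in text]
    for j in range(1, m):
        cur = [False] * n
        for i in range(n):
            if abs(pattern[j] - text[i]) <= delta:
                # The previous matched symbol must lie at most
                # alpha + 1 positions to the left.
                lo = max(0, i - alpha - 1)
                cur[i] = any(prev[lo:i])
        prev = cur
    return [i for i, ok in enumerate(prev) if ok]


print(delta_alpha_match_ends([1, 5], [1, 9, 9, 5, 2, 6], delta=1, alpha=2))
print(delta_alpha_match_ends([1, 5], [1, 9, 9, 5, 2, 6], delta=1, alpha=1))
```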
Automated search for LTR retrotransposons
2002
Cited by 1 (0 self)
Introduction. In the last few years many technological improvements in the production and analysis of DNA sequence data [29, 24] have made possible the complete sequencing of whole genomes: beginning with the microbial ones [13, 7] and continuing with those of many eukaryotic species [25, 2, 34]. The enormous amount of raw genomic data available has favoured a focused attention of the scientific community to the problem of genome annotation. Genome annotation is the process of taking the raw DNA sequence data and adding the layers of analysis and interpretation necessary to extract its biological significance and place it in the context of the understanding of biological processes [31]. The annotation process, at nucleotide level, comprises different tasks: gene finding, searching for non-coding RNAs and regulatory regions, identification of large segmental duplications in the genome and identification of repetitive elements. While several different tools are available to autom
Fast practical exact and approximate pattern matching in protein sequences
- In Proceedings of the 17th Australasian Workshop on Combinatorial Algorithms, 2006
Cited by 1 (1 self)
Abstract. Here we design, analyse and implement an algorithm that searches for motifs in protein sequences using masking techniques (“word-level” parallelism). Our algorithm speeds up known algorithms by a factor of 20 (or the alphabet size). Furthermore, we present graphs of the running times of the algorithm in comparison to its theoretical time complexity.
Efficient algorithms for (δ, γ, α)-matching
Cited by 1 (0 self)
Abstract. We propose new algorithms for (δ, γ, α)-matching. In this string matching problem we are given a pattern $P = p_0 p_1 \ldots p_{m-1}$ and a text $T = t_0 t_1 \ldots t_{n-1}$ over some integer alphabet $\Sigma = \{0 \ldots \sigma - 1\}$. The pattern symbol $p_i$ matches the text symbol $t_j$ iff $|p_i - t_j| \le \delta$. The pattern P (δ, γ)-matches some text substring $t_j \ldots t_{j+m-1}$ iff for all i it holds that $|p_i - t_{j+i}| \le \delta$ and $\sum_i |p_i - t_{j+i}| \le \gamma$. Finally, in (δ, γ, α)-matching we also permit at most α length gaps (text substrings) between each matching text symbol. The only known previous algorithm runs in $O(mn)$ time. We give several algorithms that improve the average case up to $O(n)$ for small α, and the worst case to $O(\min\{mn, |M|\alpha\})$ or $O(mn \log \gamma / w)$, where $M = \{(i, j) \mid |p_i - t_j| \le \delta\}$ and w is the number of bits in a machine word. We conclude with experimental results showing that the algorithms are very efficient in practice.
Key words: approximate string matching, music information retrieval, bit-parallelism, sparse dynamic programming
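The gap-free (δ, γ) condition can be checked naively, window by window. The sketch below assumes the γ condition bounds the sum of the absolute differences over the window, which is the usual reading of this definition; gaps (the α parameter) are deliberately left out. The function name `delta_gamma_matches` is hypothetical, and this brute-force check takes O(mn) time rather than the improved bounds from the abstract.

```python
def delta_gamma_matches(pattern, text, delta, gamma):
    """Return start positions j where pattern (delta, gamma)-matches text[j:j+m].

    Each symbol may differ by at most `delta`, and the summed
    differences over the window may not exceed `gamma`.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    starts = []
    for j in range(n - m + 1):
        diffs = [abs(p - t) for p, t in zip(pattern, text[j:j + m])]
        if max(diffs) <= delta and sum(diffs) <= gamma:
            starts.append(j)
    return starts


# Integer sequences in the style of pitch values (illustrative data).
melody = [61, 62, 66, 60, 63, 64]
print(delta_gamma_matches([60, 62, 64], melody, delta=2, gamma=2))
print(delta_gamma_matches([60, 62, 64], melody, delta=2, gamma=3))
```

With γ = 2 the first window is rejected even though every symbol is within δ, because its differences add up to 3; raising γ to 3 admits it.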
Weighted Degenerated Approximate Pattern Matching
Cited by 1 (0 self)
Abstract. We present a bit-parallel approach to the degenerated approximate pattern matching problem, that is, the problem of finding approximate matches of a “special” pattern in a text of degenerate symbols. The special pattern is $P = s_1 *(a_1, b_1) \ldots s_\ell *(a_\ell, b_\ell)\, s_{\ell+1} *(a_{\ell+1}, b_{\ell+1}) \ldots s_\omega$, where the symbol $*(a, b)$ is a sequence of at most b but at least a “don’t care” symbols, which match any symbol within the alphabet, i.e. a sequence of subpatterns with gaps; the pattern is associated with integer weights in each subpattern $s_\ell$ for replacements, insertions, and deletions. The problem is to match the pattern such that the minimum sum of weights is achieved. The total time complexity is $(k(\log(k+2)+1)mn)/w$, where m is the length of the pattern P, n is the length of the text of degenerate symbols, k is the maximum number of edit operations performed, and w is the length of the computer word.
Fast Matching of CBG Patterns with
2002
The large data set sizes produced in many biological applications make pattern matching in computational biology a challenge. We present a technique for pattern matching an important class of protein patterns. We show how such a protein pattern can be represented as a logical expression, from which a circuit can be automatically synthesised and implemented on field programmable gate arrays, which leads to highly parallelisable solutions. The method was tested on the Prosite database, and almost all the patterns could be dealt with very efficiently, leading to throughput rates in most cases in excess of 10 symbols per second.

