Results 1 - 10 of 19
A novel method for multiple alignment of sequences with repeated and shuffled elements
, 2004
Fast and simple character classes and bounded gaps pattern matching, with application to protein searching
 Journal of Computational Biology
, 2001
Cited by 23 (4 self)
The problem of fast exact and approximate searching for a pattern that contains classes of characters and bounded size gaps (CBG) in a text has a wide range of applications, among which a very important one is protein pattern matching (for instance, one PROSITE protein site is associated with the CBG [RK]-x(2,3)-[DE]-x(2,3)-Y, where the brackets match any of the letters inside, and x(2,3) matches a gap of length between 2 and 3). Currently, the only way to search for a CBG in a text is to convert it into a full regular expression (RE). However, an RE is more sophisticated than a CBG, and searching for it with an RE pattern matching algorithm complicates the search and makes it slow. This is the reason why we design in this article two new practical CBG matching algorithms that are much simpler and faster than all the RE search techniques. The first one looks exactly once at each text character. The second one does not need to consider all the text characters, and hence it is usually faster than the first one, but in bad cases it may have to read the same text character more than once. We then propose a criterion based on the form of the CBG to choose a priori the faster of the two. We also show how to search permitting a few mistakes in the occurrences. We performed many practical experiments using the PROSITE database, and all of them show that our algorithms are the fastest in virtually all cases.
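For orientation, the baseline that this abstract argues against (converting a CBG into a full regular expression) can be sketched in a few lines of Python. This is not the paper's algorithm. The function names `prosite_to_regex` and `find_cbg` are hypothetical, and the sketch assumes the pattern is written in PROSITE's hyphen-separated notation, for example `[RK]-x(2,3)-[DE]-x(2,3)-Y`, and uses only the subset of that notation shown in the abstract.

```python
import re

# One pattern element: a character class like [RK], a single letter, or the
# wildcard x, optionally followed by a repeat count (n) or range (lo,hi).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a hyphen-separated CBG pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        symbol, low, high = m.groups()
        if symbol == "x":
            symbol = "."  # x stands for any single residue
        if low is None:
            parts.append(symbol)
        elif high is None:
            parts.append(f"{symbol}{{{low}}}")
        else:
            parts.append(f"{symbol}{{{low},{high}}}")
    return "".join(parts)


def find_cbg(pattern: str, text: str):
    """Return (start, matched substring) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(text)]


if __name__ == "__main__":
    print(prosite_to_regex("[RK]-x(2,3)-[DE]-x(2,3)-Y"))
    print(find_cbg("[RK]-x(2,3)-[DE]-x(2,3)-Y", "AAKGGDTTTYA"))
```

A general-purpose regex engine handles the gaps by backtracking, which is the kind of overhead the abstract says dedicated CBG algorithms avoid.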
A System for Pattern Matching Applications on Biosequences
, 1993
Cited by 21 (1 self)
ANREP is a system for finding matches to patterns composed of (1) spacing constraints called "spacers", and (2) approximate matches to "motifs" that are, recursively, patterns composed of "atomic" symbols. A user specifies such patterns via a declarative, free-format, and strongly typed language called A that is presented here in a tutorial style through a series of progressively more complex examples. The sample patterns are for protein and DNA sequences, the application domain for which ANREP was specifically created. ANREP provides a unified framework for almost all previously proposed biosequence patterns and extends them by providing approximate matching, a feature heretofore unavailable except for the limited case of individual sequences. The performance of ANREP is discussed and an appendix gives a concise specification of syntax and semantics. A portable C software package implementing ANREP is available via anonymous remote file transfer. Introduction In this paper we present...
Approximate string searching under weighted edit distance
 In Proceedings of the 3rd South American Workshop on String Processing (WSP ’96). Carleton Univ
, 1996
Cited by 12 (1 self)
Let p ∈ Σ* be a string of length m and t ∈ Σ* be a string of length n. The approximate string searching problem is to find all approximate matches of p in t having weighted edit distance at most k from p. We present a new method that preprocesses the pattern into a DFA which scans t online in linear time, thereby recognizing all positions in t where an approximate match ends. We show how to reduce the exponential preprocessing effort and propose two practical algorithms. The first algorithm constructs the states of the DFA up to a certain depth r ≥ 1. It runs in O(|Σ|^(r+1) · m + q · m + n) time and O(|Σ|^(r+1) + |Σ|^r · m) space, where q ≤ n decreases as r increases. The second algorithm constructs the transitions of the DFA when they are demanded. It runs in O(q_s · |Σ| + q_t · m + n) time and O(q_s · (|Σ| + m)) space, where q_s ≤ q_t ≤ n depend on the problem instance. Practical measurements show that our algorithms work well in practice and beat previous methods for problems of interest in molecular biology.
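To make the problem statement concrete, here is a minimal sketch of the classical column-wise dynamic programming approach, which takes O(mn) time. It is not the DFA method of the paper. The function name `approx_search_ends` and the cost-function parameters `sub_cost` and `indel_cost` are assumptions made for illustration, and the sketch reports, for each end position in t, whether some substring ending there is within weighted edit distance k of p.

```python
def approx_search_ends(p, t, k,
                       sub_cost=lambda a, b: 0 if a == b else 1,
                       indel_cost=lambda c: 1):
    """Return the 0-based indices in t where an approximate match of p ends."""
    m = len(p)
    # col[i] = minimum weighted distance between p[:i] and any substring of t
    # ending at the current text position. col[0] is always 0, because a
    # match may start anywhere in the text.
    col = [0] * (m + 1)
    for i in range(1, m + 1):
        col[i] = col[i - 1] + indel_cost(p[i - 1])

    ends = []
    for j, c in enumerate(t):
        prev_diag = col[0]
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(
                prev_diag + sub_cost(p[i - 1], c),   # align p[i-1] with c
                old + indel_cost(c),                 # c is an extra text character
                col[i - 1] + indel_cost(p[i - 1]),   # p[i-1] is missing from the text
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approx_search_ends("abc", "xxabcxx", 1))
```

Each text character costs O(m) work here, whereas the abstract describes a preprocessed DFA that scans the text in linear time.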
Estimating the Probability of Approximate Matches
 In CPM'97, Lecture Notes in Computer Science
, 1997
Cited by 8 (0 self)
this paper addresses how to define S_k(P) and how to solve the algorithmic subproblems involved in an efficient realization with respect to this definition. Section 2 introduces as our choice for S_k(P) the set of what we call the condensed, canonical edit scripts. Our choice attempts to keep small both (i) the number of edit scripts for which X(s) = 0, and (ii) the size of g(v). Doing so improves the convergence of the estimator as it places S_k(P) and CN_k(P) in closer correspondence. The remaining sections present dynamic programming algorithms for the following subtasks:
Approximate Matching of Secondary Structures
, 2001
Cited by 4 (0 self)
Several methods have been developed for identifying more or less complex RNA structures in a genome. Whatever the method is, it is always based on the search for conserved primary and secondary structures. While various efficient methods have been developed for searching motifs of the primary structure, usually represented as regular expressions, little effort has been expended on the efficient search for secondary structure signals. By a helix, we mean a stem-loop structure defined by a combination of sequence and folding constraints. We present a flexible algorithm that searches for all approximate matches of a helix in a genome. Helices are represented by special regular expressions that we call secondary expressions. The method is based on an alignment graph constructed from several copies of a pushdown automaton, arranged one on top of another. The worst-case time complexity is O(rpn), where n is the size of the genome, p the size of the secondary expression, and r its number of union symbols. We present our results of searching for specific signals of the tRNA and RNase P RNA in two genomes.
Efficient Bit-parallel Algorithms for (δ, α)-matching
Cited by 2 (2 self)
We consider the following string matching problem. Pattern p_0 p_1 p_2 ... p_(m−1) (δ, α)-matches the text substring t_(i_0) t_(i_1) t_(i_2) ... t_(i_(m−1)) if |p_j − t_(i_j)| ≤ δ for j ∈ {0, ..., m − 1}, where 0 < i_(j+1) − i_j ≤ α + 1. The task is then to find all text positions i_(m−1) that (δ, α)-match the pattern. For a text of length n, the best previously known algorithms for this string matching problem run in time O(nm) and in time O(n⌈mα/w⌉), where the former is based on dynamic programming, and the latter on bit-parallelism with w bits in a computer word (typically 32 or 64). We improve these to take O(nδ + ⌈n/w⌉m) and O(n⌈m log(α)/w⌉) worst-case time, respectively, using bit-parallelism. On average the algorithms run in O(⌈n/w⌉⌈αδ/σ⌉ + n) and O(n) time. Our experimental results show that the algorithms work extremely well in practice. Our algorithms handle general gaps as well, having important applications in computational biology.
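The (δ, α)-matching definition can be illustrated with a naive dynamic programming sketch over integer sequences. It is neither of the bit-parallel algorithms from the abstract and makes no attempt to reach their bounds: it runs in O(nmα) time. The function name `delta_alpha_match_ends` is hypothetical. The sketch returns every text position i_(m−1) at which a (δ, α)-match of the pattern can end.

```python
def delta_alpha_match_ends(pattern, text, delta, alpha):
    """Return 0-based end positions of (delta, alpha)-matches of pattern in text.

    pattern and text are sequences of integers. Consecutive matched text
    positions i_j and i_(j+1) must satisfy 0 < i_(j+1) - i_j <= alpha + 1.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    # prev[i] is True when pattern[:j+1] can be matched with pattern[j] placed
    # at text position i (initially j = 0).
    prev = [abs(pattern[0] - text[i]) <= delta for i in range(n)]
    for j in range(1, m):
        cur = [False] * n
        for i in range(n):
            if abs(pattern[j] - text[i]) <= delta:
                # pattern[j-1] must sit at one of the alpha + 1 positions
                # immediately before i.
                lo = max(0, i - alpha - 1)
                cur[i] = any(prev[lo:i])
        prev = cur
    return [i for i, ok in enumerate(prev) if ok]


if __name__ == "__main__":
    print(delta_alpha_match_ends([1, 5, 9], [1, 2, 5, 7, 8, 9, 3], 1, 1))
```

The inner `any(prev[lo:i])` window scan accounts for the extra factor of α in the running time of this sketch.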
A Pattern Language for Molecular Biology
, 1995
Cited by 2 (1 self)
In this paper we have formalised and studied a language for describing constrained patterns in biosequences. We have developed an efficient and elegant algorithm for finding a given pattern in a sequence. The efficiency of the algorithm is determined by the fact that it does not use backtracking, unlike other algorithms for dealing with constrained patterns. Key words: biosequences, patterns, search 1 Introduction During the last decade molecular biologists have focussed more and more attention on finding patterns in biosequences. There are several reasons for this interest. For instance, if we can find a common pattern present in DNA sequences believed to be related to gene regulation, then finding the same pattern elsewhere in DNA suggests that the respective part of the DNA may also play a role as a regulatory region [13]. Finding common patterns in protein sequences helps in predicting their three dimensional structure [4]. One of the many problems in research related to patterns i...
Regular expression matching with multistrings and intervals
 In Proc. SODA’10
Cited by 1 (0 self)
Regular expression matching is a key task (and often computational bottleneck) in a variety of software tools and applications. For instance, the standard grep and sed utilities, scripting languages such as perl, internet traffic analysis, XML querying, and protein searching. The basic definition of a regular expression is that we combine characters with union, concatenation, and Kleene star operators. The length m is proportional to the number of characters. However, often the initial operation is to concatenate characters in fairly long strings, e.g., if we search for certain combinations of words in a firewall. As a result, the number k of strings in the regular expression is significantly smaller than m. Our main result is a new algorithm that essentially replaces m with k in the complexity bounds for regular expression matching. More precisely, after an O(m log k) time and O(m) space preprocessing of the expression, we can match it in a string presented as a stream of characters in O(k · (log w)/w + log k) time per character, where w is the number of bits in a memory word. For large w, this corresponds to the previous best bound of O(m · (log w)/w + log m). Prior to this work no O(k) bound per character was known. We further extend our solution to efficiently handle character class interval operators C{x, y}. Here, C is a set of characters and C{x, y}, where x and y are integers such that 0 ≤ x ≤ y, represents a string of length between x and y from C. These character class intervals generalize variable length gaps which are frequently used for pattern matching in computational biology applications.