Results 1  10
of
50
A greedy algorithm for aligning DNA sequences
 J. COMPUT. BIOL
, 2000
"... For aligning DNA sequences that differ only by sequencing errors, or by equivalent errors from other sources, a greedy algorithm can be much faster than traditional dynamic programming approaches and yet produce an alignment that is guaranteed to be theoretically optimal. We introduce a new greedy a ..."
Abstract

Cited by 241 (15 self)
 Add to MetaCart
For aligning DNA sequences that differ only by sequencing errors, or by equivalent errors from other sources, a greedy algorithm can be much faster than traditional dynamic programming approaches and yet produce an alignment that is guaranteed to be theoretically optimal. We introduce a new greedy alignment algorithm with particularly good performance and show that it computes the same alignment as does a certain dynamic programming algorithm, while executing over 10 times faster on appropriate data. An implementation of this algorithm is currently used in a program that assembles the UniGene database at the National Center for Biotechnology Information.
A PATTERN MATCHING MODEL FOR MISUSE INTRUSION DETECTION
"... This paper describes a generic model of matching that can be usefully applied to misuse intrusion detection. The model is based on Colored Petri Nets. Guards define the context in which signatures are matched. The notion of start and final states, and paths between them define the set of event seque ..."
Abstract

Cited by 150 (4 self)
 Add to MetaCart
This paper describes a generic model of matching that can be usefully applied to misuse intrusion detection. The model is based on Colored Petri Nets. Guards define the context in which signatures are matched. The notion of start and final states, and paths between them define the set of event sequences matched by the net. Partial order matching can also be specified in this model. The main benefits of the model are its generality, portability and flexibility.
Parameterized Pattern Matching: Algorithms and Applications
, 1994
"... The problem of finding sections of code that either are identical or are related by the systematic renaming of variables or constants can be modeled in terms of parameterized strings (pstrings) and parameterized matches (p matches) [Baker93a]. Pstrings are strings over two alphabets, one of whic ..."
Abstract

Cited by 71 (5 self)
 Add to MetaCart
The problem of finding sections of code that either are identical or are related by the systematic renaming of variables or constants can be modeled in terms of parameterized strings (pstrings) and parameterized matches (p matches) [Baker93a]. Pstrings are strings over two alphabets, one of which represents parameters. Two pstrings are a parameterized match (pmatch) if one pstring is obtained by renaming the parameters of the other by a onetoone function. In this paper, we investigate parameterized pattern matching via parameterized suffix trees (psuffix trees), defined in [Baker93a]. We give two algorithms for constructing psuffix trees: one (eager) that runs in linear time for fixed alphabets, and another that uses auxiliary data structures and runs in O(nlog (n)) time for variable alphabets, where n is input length. We show that using a psuffix tree for a pattern pstring P, it is possible to search for all pmatches of P within a text pstring T in space linear in ï P ï...
Software Process Validation: Quantitatively Measuring the Correspondence of a Process to a Model
 ACM Transactions on Software Engineering and Methodology
, 1996
"... this article. ..."
A System for Approximate Tree Matching
, 1992
"... Ordered, labeled trees are trees in which each node has a label and the lefttoright order of its children (if it has any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology, programming compilation and natural language processing. Many of the applications ..."
Abstract

Cited by 61 (10 self)
 Add to MetaCart
Ordered, labeled trees are trees in which each node has a label and the lefttoright order of its children (if it has any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology, programming compilation and natural language processing. Many of the applications involve comparing trees or retrieving/extracting information from a repository of trees. Examples include classification of unknown patterns, analysis of newly sequenced RNA structures, semantic taxonomy for dictionary definitions, generation of interpreters for nonprocedural programming languages, and automatic error recovery and correction for programming languages. Previous systems use exact matching (or generalized regular expression matching) for tree comparison. This paper presents a system, called ApproximateTreeByExample (ATBE), which allows inexact matching of trees. The ATBE system interacts with the user through a simple, but powerful query language; graphical devices a...
Parameterized Duplication in Strings: Algorithms and an Application to Software Maintenance
 SIAM Journal on Computing
, 1997
"... As an aid in software maintenance, it would be useful to be able to track down duplication in large software systems efficiently. Duplication in code is often in the form of sections of code that are the same except for a systematic change of parameters such as identifiers and constants. To model su ..."
Abstract

Cited by 53 (5 self)
 Add to MetaCart
As an aid in software maintenance, it would be useful to be able to track down duplication in large software systems efficiently. Duplication in code is often in the form of sections of code that are the same except for a systematic change of parameters such as identifiers and constants. To model such parameterized duplication in code, this paper introduces the notions of parameterized strings and parameterized matches of parameterized strings. A data structure called a parameterized suffix tree is defined to aid in searching for parameterized matches. For fixed alphabets, algorithms are given to construct a parameterized suffix tree in linear time and to find all maximal parameterized matches over a threshold length in a parameterized pstring in time linear in the size of the input plus the number of matches reported. The algorithms have been implemented
Approximate Tree Matching in the Presence of Variable Length Don't Cares
 Journal of Algorithms
, 1993
"... Ordered labeled trees are trees in which the sibling order matters. This paper presents algorithms for three problems having to do with approximate matching for such trees with variablelength don't cares (VLDC's). In strings, a VLDC symbol in the pattern may substitute for zero or more symbols i ..."
Abstract

Cited by 38 (7 self)
 Add to MetaCart
Ordered labeled trees are trees in which the sibling order matters. This paper presents algorithms for three problems having to do with approximate matching for such trees with variablelength don't cares (VLDC's). In strings, a VLDC symbol in the pattern may substitute for zero or more symbols in the data string. For example, if "comer" is the pattern, then the "" would substitute for the substring "put" when matching the data string "computer". Approximate VLDC matching in strings means that after the best possible substitution, the pattern still need not be the same as the data string for a match to be allowed. For example, "comer" matches "counter" within distance 1 (representing the cost of removing the "m" from "comer" and having the "" substitute for "unt"). We generalize approximate VLDC string matching to three algorithms for approximate VLDC matching on trees. The time complexity of our algorithms is O(jP j \Theta jDj \Theta min(depth(P ); leaves(P )) \Theta min(de...
Recent developments in linearspace alignment methods: A survey
 J. Comput. Biol
, 1994
"... A dynamicprogramming strategy for sequence alignment first proposed in 1975 by Dan Hirschberg can be adapted to yield a number of extremely spaceefficient algorithms. Specifically, these algorithms align two sequences using only ‘‘linear space’’, i.e., an amount of computer memory that is proporti ..."
Abstract

Cited by 26 (5 self)
 Add to MetaCart
A dynamicprogramming strategy for sequence alignment first proposed in 1975 by Dan Hirschberg can be adapted to yield a number of extremely spaceefficient algorithms. Specifically, these algorithms align two sequences using only ‘‘linear space’’, i.e., an amount of computer memory that is proportional to the sum of the lengths of the two sequences being aligned. This paper begins by reviewing the basic idea, as it applies to the global (i.e., endtoend) alignment of two DNA or protein sequences. Three of our recent extensions of the technique are then outlined. The first extension computes an optimal alignment subject to the constraint that each position, i, of the first sequence must be aligned somewhere between positions L[i] and U[i] of the second sequence, for given values of L and U. The second finds all aligned position pairs (i.e., potential columns of the alignment) that occur in an alignment whose score exceeds a given threshold. The third treats the case where each of the two sequences is allowed to be an alignment (e.g., a sequence of aligned pairs), using a sensitive scoring scheme. We also describe two linearspace methods for computing k best local (i.e., involving only a part of each sequence) alignments, where k ≥ 1. One is a linearspace version of the algorithm of Waterman and Eggert (1987), and the other is based on the strategy proposed by Wilbur and Lipman (1983). Finally, we describe programs that implement various combinations of these techniques to provide a multisequence alignment method that is especially suited to handling a few very long sequences. The utility of these programs is illustrated by analysis of the locus control region of the βlike globin gene cluster of several mammals.
Fast and simple character classes and bounded gaps pattern matching, with application to protein searching
 Journal of Computational Biology
, 2001
"... The problem of fast exact and approximate searching for a pattern that contains classes of characters and bounded size gaps (CBG) in a text has a wide range of applications, among which a very important one is protein pattern matching (for instance, one PROSITE protein site is associated with the CB ..."
Abstract

Cited by 23 (4 self)
 Add to MetaCart
The problem of fast exact and approximate searching for a pattern that contains classes of characters and bounded size gaps (CBG) in a text has a wide range of applications, among which a very important one is protein pattern matching (for instance, one PROSITE protein site is associated with the CBG [RK]  x(2,3)  [DE]  x(2,3)  Y, where the brackets match any of the letters inside, and x(2,3) a gap of length between 2 and 3). Currently, the only way to search for a CBG in a text is to convert it into a full regular expression (RE). However, a RE is more sophisticated than a CBG, and searching for it with a RE pattern matching algorithm complicates the search and makes it slow. This is the reason why we design in this article two new practical CBG matching algorithms that are much simpler and faster than all the RE search techniques. The first one looks exactly once at each text character. The second one does not need to consider all the text characters, and hence it is usually faster than the first one, but in bad cases may have to read the same text character more than once. We then propose a criterion based on the form of the CBG to choose a priori the fastest between both. We also show how to search permitting a few mistakes in the occurrences. We performed many practical experiments using the PROSITE database, and all of them show that our algorithms are the fastest in virtually all cases.