Results 1-10 of 118
Finding composite regulatory patterns in DNA sequences
Bioinformatics, 2002
Cited by 71 (3 self)
Pattern discovery in unaligned DNA sequences is a fundamental problem in computational biology with important applications in finding regulatory signals. Current approaches to pattern discovery focus on monad patterns that correspond to relatively short contiguous strings. However, many of the actual regulatory signals are composite patterns that are groups of monad patterns that occur near each other. A difficulty in discovering composite patterns is that one or both of the component monad patterns in the group may be “too weak”. Since the traditional monad-based motif finding algorithms usually output one (or a few) high scoring patterns, they often fail to find composite regulatory signals consisting of weak monad parts. In this paper, we present a MITRA (MIsmatch TRee Algorithm) approach for discovering composite signals. We demonstrate that MITRA performs well for both monad and composite patterns by presenting experiments over biological and synthetic data. Availability: MITRA is available at
On The Closest String and Substring Problems
Journal of the ACM, 2002
Cited by 53 (14 self)
The problem of finding a center string that is "close" to every given string arises in computational molecular biology and coding theory. This problem has two versions: the Closest String problem and the Closest Substring problem. Given a set of strings $S = \{s_1, s_2, \ldots, s_n\}$, each of length $m$, the Closest String problem is to find the smallest $d$ and a string $s$ of length $m$ which is within Hamming distance $d$ of each $s_i \in S$. This problem comes from coding theory when we are looking for a code not too far away from a given set of codes. The Closest Substring problem, with an additional input integer $L$, asks for the smallest $d$ and a string $s$, of length $L$, which is within Hamming distance $d$ of a substring, of length $L$, of each $s_i$. This problem is much more elusive than the Closest String problem. The Closest Substring problem is formulated from applications in finding conserved regions, identifying genetic drug targets and generating genetic probes in molecular biology. Whether there are efficient approximation algorithms for both problems are major open questions in this area. We present two polynomial time approximation algorithms with approximation ratio $1 + \epsilon$ for any small $\epsilon$ to settle both questions.
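To make the Closest String definition above concrete, here is a minimal brute-force sketch. It is not the approximation scheme from the paper: it simply tries every candidate string over an assumed alphabet, so it is exponential in the string length and only usable on toy inputs. The function names and the default alphabet are illustrative assumptions.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def closest_string_bruteforce(strings, alphabet="ACGT"):
    """Return (s, d): a center string s minimising the maximum Hamming
    distance d to every input string. Exhaustive search, toy sizes only."""
    m = len(strings[0])
    best, best_d = None, m + 1
    for cand in product(alphabet, repeat=m):
        s = "".join(cand)
        d = max(hamming(s, t) for t in strings)
        if d < best_d:  # strict, so the first optimum in search order is kept
            best, best_d = s, d
    return best, best_d


center, radius = closest_string_bruteforce(["AAC", "AAG", "AAT"])
```

On this toy input no string can be at distance 0 from all three inputs, so the smallest achievable radius is 1.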
Tests for Gene Clustering
2002
Cited by 26 (3 self)
Comparing chromosomal gene order in two or more related species is an important approach to studying the forces that guide genome organization and evolution. Linked clusters of similar genes found in related genomes are often used to support arguments of evolutionary relatedness or functional selection. However, as the gene order and the gene complement of sister genomes diverge progressively due to large scale rearrangements, horizontal gene transfer, gene duplication and gene loss, it becomes increasingly difficult to determine whether observed similarities in local genomic structure are indeed remnants of common ancestral gene order, or are merely coincidences.
A 1.375-Approximation Algorithm for Sorting by Transpositions
Proceedings of 5th Workshop on Algorithms in Bioinformatics (WABI’05), LNBI 3692, 2005
Cited by 26 (0 self)
Sorting permutations by transpositions is an important problem in genome rearrangements. A transposition is a rearrangement operation in which a segment is cut out of the permutation and pasted in a different location. The complexity of this problem is still open and it has been a ten-year-old open problem to improve the best known 1.5-approximation algorithm. In this paper we provide a 1.375-approximation algorithm for sorting by transpositions. The algorithm is based on a new upper bound on the diameter of 3-permutations. In addition, we present some new results regarding the transposition diameter: We improve the lower bound for the transposition diameter of the symmetric group, and determine the exact transposition diameter of 2-permutations and simple permutations.
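The transposition operation described above can be sketched as a swap of two adjacent blocks. The snippet below also computes the exact transposition distance by breadth-first search, which is an assumption-laden illustration for very small permutations only; it is not the 1.375-approximation algorithm from the paper, and the function names and index convention are hypothetical.

```python
from collections import deque


def transpose(perm, i, j, k):
    """Cut the segment perm[i:j] and paste it just before position k,
    i.e. swap the adjacent blocks perm[i:j] and perm[j:k]."""
    if not 0 <= i < j < k <= len(perm):
        raise ValueError("need 0 <= i < j < k <= len(perm)")
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


def transposition_distance(perm):
    """Minimum number of transpositions that sort perm.
    Plain BFS over all permutations: feasible only for tiny inputs."""
    start, target = tuple(perm), tuple(sorted(perm))
    if start == target:
        return 0
    n = len(start)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        cur, dist = queue.popleft()
        for i in range(n):
            for j in range(i + 1, n):
                for k in range(j + 1, n + 1):
                    nxt = transpose(cur, i, j, k)
                    if nxt == target:
                        return dist + 1
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, dist + 1))
    raise RuntimeError("unreachable: every permutation can be sorted")
```

For example, `transpose([1, 4, 5, 2, 3], 1, 3, 5)` moves the block `4 5` past `2 3` and sorts the permutation in a single step, while `[3, 2, 1]` needs two transpositions.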
Finding an optimal inversion median: experimental results
In Proc. 1st Workshop on Algs. in Bioinformatics WABI 2001, 2001
Cited by 25 (10 self)
We derive a branch-and-bound algorithm to find an optimal inversion median of three signed permutations. The algorithm prunes to manageable size an extremely large search tree using simple geometric properties of the problem and a newly available linear-time routine for inversion distance. Our experiments on simulated data sets indicate that the algorithm finds optimal medians in reasonable time for genomes of medium size when distances are not too large, as commonly occurs in phylogeny reconstruction. In addition, we have compared inversion and breakpoint medians, and found that inversion medians generally score significantly better and tend to be far more unique, which should make them valuable in median-based tree-building algorithms.
Reliable Detection of Episodes in Event Sequences
Knowledge and Information Systems, 2004
Cited by 24 (3 self)
Suppose one wants to detect "bad" or "suspicious" subsequences in event sequences. Whether an observed pattern of activity (in the form of a particular subsequence) is significant and should be a cause for alarm depends on how likely it is to occur fortuitously. A long enough sequence of observed events will almost certainly contain any subsequence, and setting thresholds for alarm is an important issue in a monitoring system that seeks to avoid false alarms. Suppose a long sequence T of observed events contains a suspicious subsequence pattern S within it, where the suspicious subsequence S consists of m events and spans a window of size w within T. We address the fundamental problem: is a certain number of occurrences of a particular subsequence unlikely to be generated by randomness itself (i.e., indicative of suspicious activity)? If the probability of an occurrence generated by randomness is high and an automated monitoring system flags it as suspicious anyway, then such a system will suffer from generating too many false alarms. This paper quantifies the probability of such an S occurring in T within a window of size w, the number of distinct windows containing S as a subsequence, the expected number of such occurrences, and its variance, and establishes its limiting distribution, which allows an alarm threshold to be set so that the probability of false alarms is very small. We report on experiments confirming the theory and showing that we can detect bad subsequences with a low false alarm rate.
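As a rough illustration of the quantity the abstract analyses, the sketch below counts how many windows of size w in T contain S as a (not necessarily contiguous) subsequence. It only computes the observed count; the probabilities, variance and alarm threshold derived in the paper are not reproduced here, and the function names are assumptions.

```python
def is_subsequence(pattern, window) -> bool:
    """True if pattern occurs in window in order, gaps allowed."""
    it = iter(window)
    # `p in it` advances the iterator, so matches are forced to be in order.
    return all(p in it for p in pattern)


def count_windows(T, S, w: int) -> int:
    """Number of length-w windows of T that contain S as a subsequence."""
    if w <= 0:
        raise ValueError("window size must be positive")
    return sum(is_subsequence(S, T[start:start + w])
               for start in range(len(T) - w + 1))
```

With `T = "abcabc"`, `S = "ac"` and `w = 3`, only the two windows `abc` qualify, so the count is 2. A monitoring system in the spirit of the abstract would compare such a count against a threshold derived from the expected behaviour under randomness.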
Database indexing for large DNA and protein sequence collections
2002
Cited by 22 (3 self)
Our aim is to develop new database technologies for the approximate matching of unstructured string data using indexes. We explore the potential of the suffix tree data structure in this context. We present a new method of building suffix trees, allowing us to build trees in excess of RAM size, which has hitherto not been possible. We show that this method performs in practice as well as the O(n) method of Ukkonen [70]. Using this method we build indexes for 200Mb of protein and 300Mbp of DNA, whose disk image exceeds the available RAM. We show experimentally that suffix trees can be effectively used in approximate string matching with biological data. For a range of query lengths and error bounds the suffix tree reduces the size of the unoptimised O(mn) dynamic programming calculation required in the evaluation of string similarity, and the gain from indexing increases with index size. In the indexes we built this reduction is significant, and less than 0.3% of the expected matrix is evaluated. We detail the requirements for further database and algorithmic research to support efficient use of large suffix indexes in biological applications.
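For orientation, the unoptimised O(mn) dynamic programming baseline that the abstract refers to can be sketched as follows. This is a standard approximate-matching recurrence with unit edit costs, chosen here as an assumption; it evaluates the full matrix and does not use the suffix tree index the paper builds to avoid most of that work. The function name is hypothetical.

```python
def best_approx_match(pattern: str, text: str) -> int:
    """Smallest edit distance between pattern and any substring of text.
    Evaluates the whole (m+1) x (n+1) matrix column by column: O(mn) time."""
    m = len(pattern)
    prev = list(range(m + 1))  # column for the empty text prefix
    best = prev[m]
    for c in text:
        cur = [0]  # row 0 is always 0: a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            cur.append(min(prev[i - 1] + cost,  # match or substitution
                           prev[i] + 1,         # extra character in text
                           cur[i - 1] + 1))     # pattern character skipped
        best = min(best, cur[m])
        prev = cur
    return best
```

A query with an error bound k would accept the text whenever `best_approx_match(pattern, text) <= k`. An index such as a suffix tree aims to skip the columns that cannot lead to such a match.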
Fixed-Parameter Algorithms for Closest String and Related Problems
Algorithmica, 2003
Cited by 22 (7 self)
Closest String is one of the core problems in the field of consensus word analysis with particular importance for computational biology. Given k strings
On De Novo Interpretation of Tandem Mass Spectra for Peptide Identification
2003
Cited by 19 (3 self)
The correct interpretation of tandem mass spectra is a difficult problem, even when it is limited to scoring peptides against a database. De novo sequencing is considerably harder, but critical when sequence databases are incomplete or not available. In this paper we build upon earlier work due to Dancik et al. and Chen et al. to provide a dynamic programming algorithm for interpreting de novo spectra. Our method can handle most of the commonly occurring ions, including a, b, y, and their neutral losses. Additionally, we shift the emphasis away from sequencing to assigning ion types to peaks. In particular, we introduce the notion of core interpretations, which allow us to give confidence values to individual peak assignments, even in the absence of a strong interpretation. Finally, we introduce a systematic approach to evaluating de novo algorithms as a function of spectral quality. We show that our algorithm, in particular the core interpretation, is robust in the presence of measurement error and low fragmentation probability.