Results 1 - 10 of 192
A Guided Tour to Approximate String Matching
 ACM Computing Surveys
, 1999
Cited by 409 (38 self)
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
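To make the abstract's central notion concrete, the following is a minimal sketch of edit distance and of online searching with at most k errors, using the classical dynamic-programming recurrence. It is an illustration only, not code from the survey; the function names `edit_distance` and `approximate_matches` are hypothetical, and unit costs for insertion, deletion and substitution are an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approximate_matches(pattern: str, text: str, k: int) -> list[int]:
    """0-based text positions where a match with at most k errors ends.

    Same recurrence as above, but the top row is always 0 so that a match
    may start at any text position (assumed unit-cost model).
    """
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for pos, t in enumerate(text):
        new = [0]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == t else 1
            new.append(min(col[i - 1] + cost, col[i] + 1, new[i - 1] + 1))
        col = new
        if col[m] <= k:
            ends.append(pos)
    return ends
```

The search variant scans the text once, left to right, keeping a single column of the table, which is what makes it usable online (without preprocessing the text).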
Searching in Metric Spaces
, 1999
Cited by 322 (33 self)
The problem of searching the elements of a set which are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. A large number of solutions have been proposed in different areas, in many cases without cross-knowledge. Because of this, the same ideas have been reinvented several times, and very different presentations have been given for the same approaches. We
Complete suboptimal folding of RNA and the stability of secondary structures
 Biopolymers 49:145–165
, 1999
Cited by 126 (17 self)
Abstract: An algorithm is presented for generating rigorously all suboptimal secondary structures between the minimum free energy and an arbitrary upper limit. The algorithm is particularly fast in the vicinity of the minimum free energy. This enables the efficient approximation of statistical quantities, such as the partition function or measures for structural diversity. The density of states at low energies and its associated structures are crucial in assessing from a thermodynamic point of view how well-defined the ground state is. We demonstrate this by exploring the role of base modification in tRNA secondary structures, both at the level of individual sequences from Escherichia coli and by comparing artificially generated ensembles of modified and unmodified sequences with the same tRNA structure. The two major conclusions are that (1) base modification considerably sharpens the definition of the ground state structure by constraining energetically adjacent structures to be similar to the ground state, and (2) sequences whose ground state structure is thermodynamically well defined show a significant tendency to buffer single point mutations. This can have evolutionary implications, since selection pressure to improve the definition of ground states with biological function may result in increased neutrality. © 1999 John Wiley & Sons, Inc.
Within the Twilight Zone: A Sensitive Profile-Profile Comparison Tool Based on Information Theory
 J. Mol. Biol
, 2002
Cited by 99 (4 self)
This paper presents a novel approach to profile-profile comparison. The method compares two input profiles (like those that are generated by PSI-BLAST) and assigns a similarity score to assess their statistical similarity. Our profile-profile comparison tool, which allows for gaps, can be used to detect weak similarities between protein families. It has also been optimized to produce alignments that are in very good agreement with structural alignments. Tests show that the profile-profile alignments are indeed highly correlated with similarities between secondary structure elements and tertiary structure. Exhaustive evaluations show that our method is significantly more sensitive in detecting distant homologies than the popular profile-based search programs PSI-BLAST and IMPALA. The relative improvement is the same order of magnitude as the improvement of PSI-BLAST relative to BLAST. Our new tool often detects similarities that fall within the twilight zone of sequence similarity
Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots
 Discrete Applied Mathematics
, 2000
"... structure prediction with pseudoknots ..."
On Pattern Frequency Occurrences In A Markovian Sequence
 Algorithmica
, 1997
Cited by 63 (24 self)
Consider a given pattern H and a random text T generated by a Markovian source. We study the frequency of pattern occurrences in a random text when overlapping copies of the pattern are counted separately. We present exact and asymptotic formulae for all moments (including the variance), and probability of r pattern occurrences for three different regions of r, namely: (i) r = O(1), (ii) central limit regime, and (iii) large deviations regime. In order to derive these results, we first construct some language expressions that characterize pattern occurrences which are later translated into generating functions. Finally, we use analytical methods to extract asymptotic behaviors of the pattern frequency. Applications of these results include molecular biology, source coding, synchronization, wireless communications, approximate pattern matching, game theory, and stock market analysis. These findings are of particular interest to information theory (e.g., second-order properties of the re...
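The counting convention in this abstract (overlapping copies of the pattern counted separately) differs from what many string libraries do by default. A minimal sketch of that convention follows; it only illustrates the quantity being studied, not the paper's analysis, and the function name `count_overlapping` is hypothetical.

```python
def count_overlapping(pattern: str, text: str) -> int:
    """Count occurrences of pattern in text, counting overlapping copies separately."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Resume one position to the right so overlapping copies are found too.
        start = text.find(pattern, start + 1)
    return count
```

For pattern "aa" in text "aaaa" this counts three occurrences, whereas Python's built-in `str.count` reports two because it skips past each match.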
Phylogenetic Tree Construction Using Markov Chain Monte Carlo
, 1999
Cited by 59 (0 self)
We describe a Bayesian method based on Markov chain simulation to study the phylogenetic relationship in a group of DNA sequences. Under simple models of mutational events, our method produces a Markov chain whose stationary distribution is the conditional distribution of the phylogeny given the observed sequences. Our algorithm strikes a reasonable balance between the desire to move globally through the space of phylogenies and the need to make computationally feasible moves in areas of high probability. Since phylogenetic information is described by a tree, we have created new diagnostics to handle this type of data structure. An important byproduct of the Markov chain Monte Carlo phylogeny building technique is that it provides estimates and corresponding measures of variability for any aspect of the phylogeny under study.
A Polynomial Time Approximation Scheme for Minimum Routing Cost Spanning Trees
, 1998
Cited by 42 (6 self)
Given an undirected graph with nonnegative costs on the edges, the routing cost of any of its spanning trees is the sum over all pairs of vertices of the cost of the path between the pair in the tree. Finding a spanning tree of minimum routing cost is NP-hard, even when the costs obey the triangle inequality. We show that the general case is in fact reducible to the metric case and present a polynomial-time approximation scheme valid for both versions of the problem. In particular, we show how to build a spanning tree of an n-vertex weighted graph with routing cost within (1 + ε) from the minimum in time O(n^{O(1/ε)}). Besides the obvious connection to network design, trees with small routing cost also find application in the construction of good multiple sequence alignments in computational biology. The communication cost spanning tree problem is a generalization of the minimum routing cost tree problem where the routing costs of different pairs are weighted by different r...
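The routing cost defined in the first sentence of this abstract can be computed directly from its definition. The sketch below is only an illustration of the objective, not of the approximation scheme; it assumes the sum runs over unordered pairs of vertices, that the input edges really form a tree, and the function name `routing_cost` is hypothetical.

```python
from collections import defaultdict


def routing_cost(tree_edges):
    """Sum of tree-path costs over all unordered vertex pairs.

    tree_edges: iterable of (u, v, cost) triples forming a spanning tree
    (assumed, not checked).
    """
    adj = defaultdict(list)
    for u, v, w in tree_edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    total = 0
    for source in list(adj):
        # Depth-first walk; in a tree the path to each vertex is unique.
        dist = {source: 0}
        stack = [source]
        while stack:
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        total += sum(dist.values())
    # Every unordered pair was counted twice, once from each endpoint.
    return total / 2
```

On the path a-b-c with edge costs 1 and 2, the pairs cost 1, 2 and 3, giving a routing cost of 6. The walk from every vertex takes quadratic time in the number of vertices, which is fine for evaluating a candidate tree.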
Alignment-free sequence comparison - a review
 Bioinformatics
, 2003
Cited by 42 (5 self)
Motivation: Genetic recombination and, in particular, genetic shuffling are at odds with sequence comparison by alignment, which assumes conservation of contiguity between homologous segments. A variety of theoretical foundations are being used to derive alignment-free methods that overcome this limitation. The formulation of alternative metrics for dissimilarity between sequences and their algorithmic implementations are reviewed. Results: The overwhelming majority of work on alignment-free sequence has taken place in the past two decades, with most reports published in the past 5 years. Two main categories of methods have been proposed: methods based on word (oligomer) frequency, and methods that do not require resolving the sequence with fixed word length segments. The first category is based on the statistics of word frequency, on the distances defined in a Cartesian space defined by the frequency vectors, and on the information content of frequency distribution. The second category includes the use of Kolmogorov complexity and Chaos Theory. Despite their low visibility, alignment-free metrics are in fact already widely used as preselection filters for alignment-based querying of large applications. Recent work is furthering their usage as a scale-independent methodology that is capable of recognizing homology when loss of contiguity is beyond the possibility of alignment. Availability: Most of the alignment-free algorithms reviewed were implemented in MATLAB code and are available
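As a rough illustration of the first category described above (distances between word-frequency vectors), the sketch below computes a Euclidean distance between the k-word frequency profiles of two sequences. It is written in Python rather than the MATLAB mentioned in the abstract, it is not taken from the review, and the function names and the choice of relative frequencies are assumptions.

```python
import math
from collections import Counter


def word_frequencies(seq: str, k: int) -> dict:
    """Relative frequency of each overlapping word of length k in seq."""
    n = len(seq) - k + 1
    if k < 1 or n < 1:
        raise ValueError("need 1 <= k <= len(seq)")
    counts = Counter(seq[i:i + k] for i in range(n))
    return {word: c / n for word, c in counts.items()}


def word_frequency_distance(seq_a: str, seq_b: str, k: int) -> float:
    """Euclidean distance between the k-word frequency vectors of two sequences."""
    fa = word_frequencies(seq_a, k)
    fb = word_frequencies(seq_b, k)
    words = set(fa) | set(fb)  # coordinates of the shared frequency space
    return math.sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))
```

With k = 1 the distance depends only on composition, so "ACGT" and "TGCA" are at distance 0 even though no alignment would place them close together; larger k retains more local order.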
Reliable Communication Over Channels With Insertions, Deletions And Substitutions
 IEEE Transactions on Information Theory
, 2001
Cited by 42 (1 self)
A new block code is introduced which is capable of correcting multiple insertion, deletion and substitution errors. The code consists of non-linear inner codes, which we call 'watermark' codes, concatenated with low-density parity-check codes over non-binary fields. The inner code allows probabilistic resynchronisation and provides soft outputs for the outer decoder, which then completes decoding. We present codes of rate 0.7 and transmitted length 5000 bits that can correct 30 insertion/deletion errors per block. We also present codes of rate 3/14 and length 4600 bits that can correct 450 insertion/deletion errors per block.