A Guided Tour to Approximate String Matching
 ACM Computing Surveys
, 1999
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
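As an illustration of online searching under edit distance, the following is a minimal sketch of the classical column-wise dynamic programming approach (usually attributed to Sellers). It is not taken from the survey itself; the function name `approx_search` and the choice to report 1-based end positions are assumptions made for this example.

```python
def approx_search(pattern: str, text: str, k: int) -> list[int]:
    """Return 1-based end positions in `text` where some substring
    ending there is within edit distance `k` of `pattern`.

    Hypothetical helper for illustration; O(len(pattern) * len(text)) time.
    """
    m = len(pattern)
    # col[i] = least edit distance between pattern[:i] and any
    # substring of the text that ends at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra text character
                      col[i - 1] + 1)    # missing pattern character
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends


print(approx_search("survey", "surgery", 2))
```

Only one column of the matrix is kept, so the extra space is proportional to the pattern length.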
Searching in Metric Spaces
, 1999
The problem of searching the elements of a set which are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. A large number of solutions have been proposed in different areas, in many cases without cross-knowledge. Because of this, the same ideas have been reinvented several times, and very different presentations have been given for the same approaches. We
Complete suboptimal folding of RNA and the stability of secondary structures
 Biopolymers 49:145–165
, 1999
Abstract: An algorithm is presented for generating rigorously all suboptimal secondary structures between the minimum free energy and an arbitrary upper limit. The algorithm is particularly fast in the vicinity of the minimum free energy. This enables the efficient approximation of statistical quantities, such as the partition function or measures for structural diversity. The density of states at low energies and its associated structures are crucial in assessing from a thermodynamic point of view how well-defined the ground state is. We demonstrate this by exploring the role of base modification in tRNA secondary structures, both at the level of individual sequences from Escherichia coli and by comparing artificially generated ensembles of modified and unmodified sequences with the same tRNA structure. The two major conclusions are that (1) base modification considerably sharpens the definition of the ground state structure by constraining energetically adjacent structures to be similar to the ground state, and (2) sequences whose ground state structure is thermodynamically well defined show a significant tendency to buffer single point mutations. This can have evolutionary implications, since selection pressure to improve the definition of ground states with biological function may result in increased neutrality. © 1999 John Wiley & Sons, Inc.
Within the Twilight Zone: A Sensitive Profile-Profile Comparison Tool Based on Information Theory
 J. Mol. Biol
, 2002
This paper presents a novel approach to profile-profile comparison. The method compares two input profiles (like those that are generated by PSI-BLAST) and assigns a similarity score to assess their statistical similarity. Our profile-profile comparison tool, which allows for gaps, can be used to detect weak similarities between protein families. It has also been optimized to produce alignments that are in very good agreement with structural alignments. Tests show that the profile-profile alignments are indeed highly correlated with similarities between secondary structure elements and tertiary structure. Exhaustive evaluations show that our method is significantly more sensitive in detecting distant homologies than the popular profile-based search programs PSI-BLAST and IMPALA. The relative improvement is the same order of magnitude as the improvement of PSI-BLAST relative to BLAST. Our new tool often detects similarities that fall within the twilight zone of sequence similarity
Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots
 Discrete Applied Mathematics
, 2000
Phylogenetic Tree Construction Using Markov Chain Monte Carlo
, 1999
We describe a Bayesian method based on Markov chain simulation to study the phylogenetic relationship in a group of DNA sequences. Under simple models of mutational events, our method produces a Markov chain whose stationary distribution is the conditional distribution of the phylogeny given the observed sequences. Our algorithm strikes a reasonable balance between the desire to move globally through the space of phylogenies and the need to make computationally feasible moves in areas of high probability. Since phylogenetic information is described by a tree, we have created new diagnostics to handle this type of data structure. An important byproduct of the Markov chain Monte Carlo phylogeny building technique is that it provides estimates and corresponding measures of variability for any aspect of the phylogeny under study.
On Pattern Frequency Occurrences in a Markovian Sequence
 Algorithmica
, 1997
Consider a given pattern H and a random text T generated by a Markovian source. We study the frequency of pattern occurrences in a random text when overlapping copies of the pattern are counted separately. We present exact and asymptotic formulae for all moments (including the variance), and probability of r pattern occurrences for three different regions of r, namely: (i) r = O(1), (ii) central limit regime, and (iii) large deviations regime. In order to derive these results, we first construct some language expressions that characterize pattern occurrences which are later translated into generating functions. Finally, we use analytical methods to extract asymptotic behaviors of the pattern frequency. Applications of these results include molecular biology, source coding, synchronization, wireless communications, approximate pattern matching, game theory, and stock market analysis. These findings are of particular interest to information theory (e.g., second-order properties of the re...
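To make the counting convention concrete, here is a small sketch that counts overlapping occurrences and computes only the first moment (the mean) for a stationary first-order Markov source, using linearity of expectation. It does not reproduce the paper's generating-function analysis; the function names and the dictionary layout for `stationary` and `transition` are assumptions made for this example.

```python
def count_overlapping(pattern: str, text: str) -> int:
    """Count occurrences of `pattern` in `text`; overlapping copies
    are counted separately."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position only, so overlapping copies are found.
        start = text.find(pattern, start + 1)
    return count


def expected_occurrences(pattern: str, n: int,
                         stationary: dict[str, float],
                         transition: dict[str, dict[str, float]]) -> float:
    """Mean number of occurrences of `pattern` in a text of length `n`
    drawn from a stationary first-order Markov chain.

    Assumed layout: stationary[a] is the stationary probability of
    symbol a; transition[a][b] is the probability that b follows a.
    """
    m = len(pattern)
    if n < m:
        return 0.0
    # Probability that the pattern occurs at one fixed position.
    prob = stationary[pattern[0]]
    for a, b in zip(pattern, pattern[1:]):
        prob *= transition[a][b]
    # There are n - m + 1 candidate positions, each with the same probability.
    return (n - m + 1) * prob


print(count_overlapping("aa", "aaaa"))  # overlapping copies at offsets 0, 1, 2
```

Higher moments depend on how the pattern overlaps with itself, which this sketch does not attempt to model.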
Abstract shapes of RNA
 Nucleic Acids Res
, 2004
The function of a non-protein-coding RNA is often determined by its structure. Since experimental determination of RNA structure is time-consuming and expensive, its computational prediction is of great interest, and efficient solutions based on thermodynamic parameters are known. Frequently, however, the predicted minimum free energy structures are not the native ones, leading to the necessity of generating suboptimal solutions. While this can be accomplished by a number of programs, the user is often confronted with large outputs of similar structures, although he or she is interested in structures with more fundamental differences, or, in other words, with different abstract shapes. Here, we formalize the concept of abstract shapes and introduce their efficient computation. Each shape of an RNA molecule comprises a class of similar structures and has a representative structure of minimal free energy within the class. Shape analysis is implemented in the program RNAshapes. We applied RNAshapes to the prediction of optimal and suboptimal abstract shapes of several RNAs. For a given energy range, the number of shapes is considerably smaller than the number of structures, and in all cases, the native structures were among the top shape representatives. This demonstrates that the researcher can quickly focus on the structures of interest, without processing up to thousands of near-optimal solutions. We complement this study with a large-scale analysis of the growth behaviour of structure and shape spaces. RNAshapes is available for download and as an online version on the Bielefeld Bioinformatics Server.
Alignment-free sequence comparison – a review
 Bioinformatics
, 2003
Motivation: Genetic recombination and, in particular, genetic shuffling are at odds with sequence comparison by alignment, which assumes conservation of contiguity between homologous segments. A variety of theoretical foundations are being used to derive alignment-free methods that overcome this limitation. The formulation of alternative metrics for dissimilarity between sequences and their algorithmic implementations are reviewed. Results: The overwhelming majority of work on alignment-free sequence comparison has taken place in the past two decades, with most reports published in the past 5 years. Two main categories of methods have been proposed: methods based on word (oligomer) frequency, and methods that do not require resolving the sequence with fixed word length segments. The first category is based on the statistics of word frequency, on the distances defined in a Cartesian space defined by the frequency vectors, and on the information content of frequency distribution. The second category includes the use of Kolmogorov complexity and Chaos Theory. Despite their low visibility, alignment-free metrics are in fact already widely used as preselection filters for alignment-based querying of large applications. Recent work is furthering their usage as a scale-independent methodology that is capable of recognizing homology when loss of contiguity is beyond the possibility of alignment. Availability: Most of the alignment-free algorithms reviewed were implemented in MATLAB code and are available
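The first category (distances between word-frequency vectors) can be sketched as follows. This is an illustrative Python example, not the MATLAB code mentioned in the abstract; the function names and the choice of plain Euclidean distance over relative frequencies are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(seq: str, k: int) -> dict[str, float]:
    """Relative frequency of each overlapping word of length k in `seq`."""
    total = len(seq) - k + 1
    if k <= 0 or total <= 0:
        raise ValueError("need 0 < k <= len(seq)")
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: c / total for word, c in counts.items()}


def euclidean_distance(seq_a: str, seq_b: str, k: int) -> float:
    """Euclidean distance between the word-frequency vectors of two
    sequences; words absent from a sequence have frequency 0."""
    fa = kmer_frequencies(seq_a, k)
    fb = kmer_frequencies(seq_b, k)
    words = set(fa) | set(fb)
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))


# Swapping the two halves leaves the single-letter composition unchanged,
# so the distance for k = 1 is zero even though the sequences do not align.
print(euclidean_distance("AAAACCCC", "CCCCAAAA", 1))
```

Larger word lengths capture more local order at the cost of sparser frequency vectors.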
Reliable Communication Over Channels With Insertions, Deletions And Substitutions
 IEEE Transactions on Information Theory
, 2001
A new block code is introduced which is capable of correcting multiple insertion, deletion and substitution errors. The code consists of nonlinear inner codes, which we call 'watermark' codes, concatenated with low-density parity-check codes over non-binary fields. The inner code allows probabilistic resynchronisation and provides soft outputs for the outer decoder, which then completes decoding. We present codes of rate 0.7 and transmitted length 5000 bits that can correct 30 insertion/deletion errors per block. We also present codes of rate 3/14 and length 4600 bits that can correct 450 insertion/deletion errors per block.