Results 1  10
of
338
A Guided Tour to Approximate String Matching
 ACM COMPUTING SURVEYS
, 1999
"... We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining t ..."
Abstract

Cited by 585 (38 self)
 Add to MetaCart
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
Searching in Metric Spaces
, 1999
"... The problem of searching the elements of a set which are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather ge ..."
Abstract

Cited by 433 (38 self)
 Add to MetaCart
(Show Context)
The problem of searching the elements of a set which are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. A large number of solutions have been proposed in different areas, in many cases without crossknowledge. Because of this, the same ideas have been reinvented several times, and very different presentations have been given for the same approaches. We
Complete suboptimal folding of RNA and the stability of secondary structures
 BIOPOLYMERS 49:145–165
, 1999
"... An algorithm is presented for generating rigorously all suboptimal secondary structures between the minimum free energy and an arbitrary upper limit. The algorithm is particularly fast in the vicinity of the minimum free energy. This enables the efficient approximation of statistical quantities, s ..."
Abstract

Cited by 213 (23 self)
 Add to MetaCart
An algorithm is presented for generating rigorously all suboptimal secondary structures between the minimum free energy and an arbitrary upper limit. The algorithm is particularly fast in the vicinity of the minimum free energy. This enables the efficient approximation of statistical quantities, such as the partition function or measures for structural diversity. The density of states at low energies and its associated structures are crucial in assessing from a thermodynamic point of view how welldefined the ground state is. We demonstrate this by exploring the role of base modification in tRNA secondary structures, both at the level of individual sequences from Escherichia coli and by comparing artificially generated ensembles of modified and unmodified sequences with the same tRNA structure. The two major conclusions are that (1) base modification considerably sharpens the definition of the ground state structure by constraining energetically adjacent structures to be similar to the ground state, and (2) sequences whose ground state structure is thermodynamically well defined show a significant tendency to buffer single point mutations. This can have evolutionary implications, since selection pressure to improve the definition of ground states with biological function may result in increased neutrality.
Within the Twilight Zone: A Sensitive ProfileProfile Comparison Tool Based on Information Theory
 J. Mol. Biol
, 2002
"... This paper presents a novel approach to proleprole comparison. The method compares two input proles (like those that are generated by PSIBLAST) and assigns a similarity score to assess their statistical similarity. Our proleprole comparison tool, which allows for gaps, can be used to detect weak ..."
Abstract

Cited by 148 (4 self)
 Add to MetaCart
This paper presents a novel approach to proleprole comparison. The method compares two input proles (like those that are generated by PSIBLAST) and assigns a similarity score to assess their statistical similarity. Our proleprole comparison tool, which allows for gaps, can be used to detect weak similarities between protein families. It has also been optimized to produce alignments that are in very good agreement with structural alignments. Tests show that the proleprole alignments are indeed highly correlated with similarities between secondary structure elements and tertiary structure. Exhaustive evaluations show that our method is signicantly more sensitive in detecting distant homologies than the popular prolebased search programs PSIBLAST and IMPALA. The relative improvement is the same order of magnitude as the improvement of PSIBLAST relative to BLAST. Our new tool often detects similarities that fall within the twilight zone of sequence similarity
Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots
 Discrete Applied Mathematics
, 2000
"... structure prediction with pseudoknots ..."
(Show Context)
Alignmentfree sequence comparisona review
 Bioinformatics
, 2003
"... Motivation: Genetic recombination and, in particular, genetic shuffling are at odds with sequence comparison by alignment, which assumes conservation of contiguity between homologous segments. A variety of theoretical foundations are being used to derive alignmentfree methods that overcome this lim ..."
Abstract

Cited by 100 (8 self)
 Add to MetaCart
(Show Context)
Motivation: Genetic recombination and, in particular, genetic shuffling are at odds with sequence comparison by alignment, which assumes conservation of contiguity between homologous segments. A variety of theoretical foundations are being used to derive alignmentfree methods that overcome this limitation. The formulation of alternative metrics for dissimilarity between sequences and their algorithmic implementations are reviewed. Results: The overwhelming majority of work on alignmentfree sequence has taken place in the past two decades, with most reports published in the past 5 years. Two main categories of methods have been proposed—methods based on word (oligomer) frequency, and methods that do not require resolving the sequence with fixed word length segments. The first category is based on the statistics of word frequency, on the distances defined in a Cartesian space defined by the frequency vectors, and on the information content of frequency distribution. The second category includes the use of Kolmogorov complexity and Chaos Theory. Despite their low visibility, alignmentfree metrics are in fact already widely used as preselection filters for alignmentbased querying of large applications. Recent work is furthering their usage as a scaleindependent methodology that is capable of recognizing homology when loss of contiguity is beyond the possibility of alignment. Availability: Most of the alignmentfree algorithms reviewed were implemented in MATLAB code and are available
An Exact Method for Finding Short Motifs in Sequences, with Application to the Ribosome Binding Site
 Problem,” Proc. Seventh Int’l Conf. Intelligent Systems for Molecular Biology
, 1999
"... This is an investigation of methods for finding short motifs that only occur in a fraction of the input sequences. Unlike local search techniques that may not reach a global optimum, the method proposed here is guaranteed to produce the motifs with greatest zscores. This method is illustrated for t ..."
Abstract

Cited by 87 (4 self)
 Add to MetaCart
This is an investigation of methods for finding short motifs that only occur in a fraction of the input sequences. Unlike local search techniques that may not reach a global optimum, the method proposed here is guaranteed to produce the motifs with greatest zscores. This method is illustrated for the Ribosome Binding Site Problem, which is to identify the short mRNA 5 ′ untranslated sequence that is recognized by the ribosome during initiation of protein synthesis. Experiments were performed to solve this problem for each of fourteen sequenced prokaryotes, by applying the method to the full complement of genes from each. One of the interesting results of this experimentation is evidence that the recognized sequence of the thermophilic archaea A. fulgidus, M. jannaschii, M. thermoautotrophicum, andP. horikoshii may be somewhat different than the well known ShineDalgarno sequence.
Phylogenetic Tree Construction Using Markov Chain Monte Carlo
, 1999
"... We describe a Bayesian method based on Markov chain simulation to study the phylogenetic relationship in a group of DNA sequences. Under simple models of mutational events, our method produces a Markov chain whose stationary distribution is the conditional distribution of the phylogeny given the obs ..."
Abstract

Cited by 81 (0 self)
 Add to MetaCart
We describe a Bayesian method based on Markov chain simulation to study the phylogenetic relationship in a group of DNA sequences. Under simple models of mutational events, our method produces a Markov chain whose stationary distribution is the conditional distribution of the phylogeny given the observed sequences. Our algorithm strikes a reasonable balance between the desire to move globally through the space of phylogenies and the need to make computationally feasible moves in areas of high probability. Since phylogenetic information is described by a tree, we have created new diagnostics to handle this type of data structure. An important byproduct of the Markov chain Monte Carlo phylogeny building technique is that it provides estimates and corresponding measures of variability for any aspect of the phylogeny under study.
On Pattern Frequency Occurrences In A Markovian Sequence?
 Algorithmica
, 1997
"... Consider a given pattern H and a random text T generated by a Markovian source. We study the frequency of pattern occurrences in a random text when overlapping copies of the pattern are counted separately. We present exact and asymptotic formulae for all moments (including the variance), and probabi ..."
Abstract

Cited by 75 (25 self)
 Add to MetaCart
Consider a given pattern H and a random text T generated by a Markovian source. We study the frequency of pattern occurrences in a random text when overlapping copies of the pattern are counted separately. We present exact and asymptotic formulae for all moments (including the variance), and probability of r pattern occurrences for three different regions of r, namely: (i) r = O(1), (ii) central limit regime, and (iii) large deviations regime. In order to derive these results, we first construct some language expressions that characterize pattern occurrences which are later translated into generating functions. Finally, we use analytical methods to extract asymptotic behaviors of the pattern frequency. Applications of these results include molecular biology, source coding, synchronization, wireless communications, approximate pattern matching, game theory, and stock market analysis. These findings are of particular interest to information theory (e.g., secondorder properties of the re...
Abstract shapes of RNA
 Nucleic Acids Res
, 2004
"... The function of a nonproteincoding RNA is often determined by its structure. Since experimental determination of RNA structure is timeconsuming and expensive, its computational prediction is of great interest, and efficient solutions based on thermodynamic parameters are known. Frequently, howeve ..."
Abstract

Cited by 63 (13 self)
 Add to MetaCart
(Show Context)
The function of a nonproteincoding RNA is often determined by its structure. Since experimental determination of RNA structure is timeconsuming and expensive, its computational prediction is of great interest, and efficient solutions based on thermodynamic parameters are known. Frequently, however, the predicted minimum free energy structures are not the native ones, leading to the necessity of generating suboptimal solutions. While this can be accomplished by a number of programs, the user is often confronted with large outputs of similar structures, although he or she is interested in structures with more fundamentaldifferences,or, inotherwords, with different abstract shapes. Here, we formalize the concept of abstract shapes and introduce their efficient computation. Each shape of an RNA molecule comprises a class of similar structures and has a representative structure of minimal free energy within the class. Shape analysis is implemented in the program RNAshapes. We applied RNAshapes to the prediction of optimal and suboptimal abstract shapes of severalRNAs.For a given energy range, the number of shapes is considerably smaller than the number of structures, and in all cases, the native structures were among the top shape representatives. This demonstrates that the researcher can quickly focus on the structures of interest, without processing up to thousands of nearoptimal solutions. We complement this study with a largescale analysis of the growth behaviour of structure and shape spaces. RNAshapes is available for download and as an online version on the Bielefeld Bioinformatics Server.