Results 1 - 10
of
75
Multiple sequence alignment with the Clustal series of programs
- Nucleic Acids Res
, 2003
"... The Clustal series of programs are widely used in molecular biology for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees. The popularity of the programs depends on a number of factors, including not only the accuracy of the results, but also the ..."
Abstract
-
Cited by 139 (3 self)
- Add to MetaCart
The Clustal series of programs are widely used in molecular biology for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees. The popularity of the programs depends on a number of factors, including not only the accuracy of the results, but also the robustness, portability and user-friendliness of the programs. New features include NEXUS and FASTA format output, printing range numbers and faster tree calculation. Although, Clustal was originally developed to run on a local computer, numerous Web servers have been set up, notably at the EBI
Exploiting the Past and the Future in Protein Secondary Structure Prediction
, 1999
"... Motivation: Predicting the secondary structure of a protein (alpha-helix, beta-sheet, coil) is an important step towards elucidating its three dimensional structure, as well as its function. Presently, the best predictors are based on machine learning approaches, in particular neural network archite ..."
Abstract
-
Cited by 91 (19 self)
- Add to MetaCart
Motivation: Predicting the secondary structure of a protein (alpha-helix, beta-sheet, coil) is an important step towards elucidating its three dimensional structure, as well as its function. Presently, the best predictors are based on machine learning approaches, in particular neural network architectures with a fixed, and relatively short, input window of amino acids, centered at the prediction site. Although a fixed small window avoids overfitting problems, it does not permit to capture variable long-ranged information. Results: We introduce a family of novel architectures which can learn to make predictions based on variable ranges of dependencies. These architectures extend recurrent neural networks, introducing non-causal bidirectional dynamics to capture both upstream and downstream information. The prediction algorithm is completed by the use of mixtures of estimators that leverage evolutionary information, expressed in terms of multiple alignments, both at the input and output levels. While our system currently achieves an overall performance close to 76% correct prediction---at least comparable to the best existing systems---the main emphasis here is on the development of new algorithmic ideas. Availability: The executable program for predicting protein secondary structure is available from the authors free of charge. Contact: pfbaldi@ics.uci.edu, gpollast@ics.uci.edu, brunak@cbs.dtu.dk, paolo@dsi.unifi.it. 1
PROBCONS: Probabilistic consistency-based multiple sequence alignment
- Genome Res
, 2005
"... To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objec ..."
Abstract
-
Cited by 84 (5 self)
- Add to MetaCart
To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objective functions for measuring alignment quality. In this paper, we introduce prob-abilistic consistency, a novel scoring function for multiple sequence comparisons. We present PROBCONS, a practical tool for progressive protein multiple sequence alignment based on prob-abilistic consistency, and evaluate its performance on several standard alignment benchmark datasets. On the BAliBASE, SABmark, and PREFAB benchmark alignment databases, PROB-CONS achieves statistically significant improvement over other leading methods while maintain-ing practical speed. PROBCONS is publicly available as a web resource. Source code and execu-tables are available under the GNU Public License at
A Memory-Efficient Dynamic Programming Algorithm for Optimal Alignment of a Sequence to an RNA Secondary Structure
, 2002
"... Background: Covariance models (CMs) are probabilistic models of RNA secondary structure, analogous to profile hidden Markov models of linear sequence. The dynamic programming algorithm for aligning a CM to an RNA sequence of length N is O(N³) in memory. This is only practical for small RNAs. Re ..."
Abstract
-
Cited by 51 (1 self)
- Add to MetaCart
Background: Covariance models (CMs) are probabilistic models of RNA secondary structure, analogous to profile hidden Markov models of linear sequence. The dynamic programming algorithm for aligning a CM to an RNA sequence of length N is O(N³) in memory. This is only practical for small RNAs. Results:...
Parallel Hardware for Sequence Comparison and Alignment
- CABIOS
, 1996
"... Sequence comparison, a vital research tool in computational biology, is based on a simple O(n 2 ) algorithm that easily maps to a linear array of processors. This paper reviews and compares high-performance sequence analysis on general-purpose supercomputers and single-purpose, reconfigurable, and ..."
Abstract
-
Cited by 24 (0 self)
- Add to MetaCart
Sequence comparison, a vital research tool in computational biology, is based on a simple O(n 2 ) algorithm that easily maps to a linear array of processors. This paper reviews and compares high-performance sequence analysis on general-purpose supercomputers and single-purpose, reconfigurable, and programmable co-processors. The difficulty of comparing hardware from published performance figures is also noted. Introduction The vast databases produced by the Human Genome Project demand innovative tools for fast sequence and database analysis. Because sequence databases contain billions of characters it is important to locate areas of interest within a database or genome quickly. Dynamic programming organizes sequence comparison by comparing shorter subsequences first, so their costs can be made available in a table (Figure 1) for the next longer subsequence comparisons. The final entry becomes the comparison rating. Exact sequence comparison is an O(n 2 )-time dynamic programming a...
Kestrel: A Programmable Array for Sequence Analysis
, 1998
"... Kestrel is a programmable linear array processor designed for sequence analysis. Among other features, Kestrel includes an 8-bit word, a single-cycle add-and-minimize instruction, a multiplier and efficient communication using shared registers. This paper describes Kestrel’s functional units in de ..."
Abstract
-
Cited by 20 (8 self)
- Add to MetaCart
Kestrel is a programmable linear array processor designed for sequence analysis. Among other features, Kestrel includes an 8-bit word, a single-cycle add-and-minimize instruction, a multiplier and efficient communication using shared registers. This paper describes Kestrel’s functional units in detail, and examines each of their effects on system performance. With functional prototype chips completed, we will assemble a full single-board Kestrel array, with 512 processing elements on eight chips, in early 1998.
Recent developments in linear-space alignment methods: A survey
- J. Comput. Biol
, 1994
"... A dynamic-programming strategy for sequence alignment first proposed in 1975 by Dan Hirschberg can be adapted to yield a number of extremely space-efficient algorithms. Specifically, these algorithms align two sequences using only ‘‘linear space’’, i.e., an amount of computer memory that is proporti ..."
Abstract
-
Cited by 19 (4 self)
- Add to MetaCart
A dynamic-programming strategy for sequence alignment first proposed in 1975 by Dan Hirschberg can be adapted to yield a number of extremely space-efficient algorithms. Specifically, these algorithms align two sequences using only ‘‘linear space’’, i.e., an amount of computer memory that is proportional to the sum of the lengths of the two sequences being aligned. This paper begins by reviewing the basic idea, as it applies to the global (i.e., end-to-end) alignment of two DNA or protein sequences. Three of our recent extensions of the technique are then outlined. The first extension computes an optimal alignment subject to the constraint that each position, i, of the first sequence must be aligned somewhere between positions L[i] and U[i] of the second sequence, for given values of L and U. The second finds all aligned position pairs (i.e., potential columns of the alignment) that occur in an alignment whose score exceeds a given threshold. The third treats the case where each of the two sequences is allowed to be an alignment (e.g., a sequence of aligned pairs), using a sensitive scoring scheme. We also describe two linear-space methods for computing k best local (i.e., involving only a part of each sequence) alignments, where k ≥ 1. One is a linear-space version of the algorithm of Waterman and Eggert (1987), and the other is based on the strategy proposed by Wilbur and Lipman (1983). Finally, we describe programs that implement various combinations of these techniques to provide a multi-sequence alignment method that is especially suited to handling a few very long sequences. The utility of these programs is illustrated by analysis of the locus control region of the β-like globin gene cluster of several mammals.
Memory-bounded A* graph search
- In Proc. 15th International Flairs Conference
, 2002
"... We describe a framework for reducing the space complexity of graph search algorithms such as A* that use Open and Closed lists to keep track of the frontier and interior nodes of the search space. We propose a sparse representation of the Closed list in which only a fraction of already expanded node ..."
Abstract
-
Cited by 18 (4 self)
- Add to MetaCart
We describe a framework for reducing the space complexity of graph search algorithms such as A* that use Open and Closed lists to keep track of the frontier and interior nodes of the search space. We propose a sparse representation of the Closed list in which only a fraction of already expanded nodes need to be stored to perform the two functions of the Closed List- preventing duplicate search effort and allowing solution extraction. Our proposal is related to earlier work on search algorithms that do not use a Closed list at all [Korf and Zhang, 2000]. However, the approach we describe has several advantages that make it effective for a wider variety of problems. 1
The Repeat Pattern Toolkit (RPT): Analyzing the Structure and Evolution of the C. elegans Genome
- In Second International Conference on Intelligent Systems for Molecular Biology
, 1994
"... Over 3:6 million bases of DNA sequence from chromosome III of the C. elegans have been determined. The availability of this extended region of contiguous sequence has allowed us to analyze the nature and prevalence of repetitive sequences in the genome of a eukaryotic organism with a high gene densi ..."
Abstract
-
Cited by 18 (3 self)
- Add to MetaCart
Over 3:6 million bases of DNA sequence from chromosome III of the C. elegans have been determined. The availability of this extended region of contiguous sequence has allowed us to analyze the nature and prevalence of repetitive sequences in the genome of a eukaryotic organism with a high gene density. We have assembled a Repeat Pattern Toolkit (RPT) to analyze the patterns of repeats occurring in DNA. The tools include identifying signi øcant local alignments (utilizing both two-way and three-way alignments), dividing the set of alignments into connected components (signifying repeat families), computing evolutionary distance between repeat family members, constructing minimum spanning trees from the connected components, and visualizing the evolution of the repeat families. Over 7000 families of repetitive sequences were identiøed. The size of the families ranged from isolated pairs to over 1600 segments of similar sequence. Approximately 12:3% of the analyzed sequence participates i...
Bidirectional dynamics for protein secondary structure prediction
- Sequence Learning: Paradigms, Algorithms, and Applications
, 2000
"... For certain categories of sequences, information from both the past and the future can be used for analysis and predictions at time t. This is the case for biological sequences where the nature and function of a region in a sequence may strongly depend on events located both upstream and downstream. ..."
Abstract
-
Cited by 18 (5 self)
- Add to MetaCart
For certain categories of sequences, information from both the past and the future can be used for analysis and predictions at time t. This is the case for biological sequences where the nature and function of a region in a sequence may strongly depend on events located both upstream and downstream. We develop a new family of adaptive graphical model architectures for learning non-causal sequence translations. These architectures employ two chains of hidden variables that propagate information from the past and from the future, respectively. This general idea can be instantiated either as a stochastic model (generalizing input output hidden Markov models), or as a neural network (generalizing recurrent neural networks). We illustrate the methodology by applying bidirectional models to the problem of protein secondary structure prediction. 1

