Hidden Markov models in computational biology: applications to protein modeling
- JOURNAL OF MOLECULAR BIOLOGY
, 1994
Cited by 436 (29 self)
Hidden Markov Models (HMMs) are applied to the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. These methods are demonstrated on the globin family, the protein kinase catalytic domain, and the EF-hand calcium binding motif. In each case the parameters of an HMM are estimated from a training set of unaligned sequences. After the HMM is built, it is used to obtain a multiple alignment of all the training sequences. It is also used to search the SWISS-PROT 22 database for other sequences that are members of the given protein family, or contain the given domain. The HMM produces multiple alignments of good quality that agree closely with the alignments produced by programs that incorporate three-dimensional structural information. When employed in discrimination tests (by examining how closely the sequences in a database fit the globin, kinase and EF-hand HMMs), the HMM is able to distinguish members of these families from non-members with a high degree of accuracy. Both the HMM and PROFILESEARCH (a technique used to search for relationships between a protein sequence and multiply aligned sequences) perform better in these tests than PROSITE (a dictionary of sites and patterns in proteins). The HMM appears to have a slight advantage over PROFILESEARCH in terms of lower rates of false
Using Dirichlet Mixture Priors to Derive Hidden Markov Models for Protein Families
- PROCEEDINGS OF THE THIRD INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS FOR MOLECULAR BIOLOGY
, 1993
Cited by 56 (6 self)
A Bayesian method for estimating the amino acid distributions in the states of a hidden Markov model (HMM) for a protein family or the columns of a multiple alignment of that family is introduced. This method uses Dirichlet mixture densities as priors over amino acid distributions. These mixture densities are determined from examination of previously constructed HMMs or multiple alignments. It is shown that this Bayesian method can improve the quality of HMMs produced from small training sets. Specific experiments on the EF-hand motif are reported, for which these priors are shown to produce HMMs with higher likelihood on unseen data, and fewer false positives and false negatives in a database search task.
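The estimate described in this abstract can be illustrated with a small sketch. It assumes the standard posterior-mean form for a Dirichlet mixture prior: each component is weighted by how well it explains the observed counts, and the component-wise estimates are then averaged. The function names and the three-letter toy alphabet are illustrative assumptions, not the authors' code, and the mixture parameters are made up.

```python
from math import exp, lgamma, log

def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))

def dirichlet_mixture_estimate(counts, weights, components):
    """Posterior mean distribution for `counts` under a Dirichlet mixture prior.

    weights:    mixture coefficients q_j (assumed to sum to 1)
    components: one Dirichlet parameter vector alpha_j per component
    """
    # Unnormalised log posterior of each component:
    # log q_j + log B(n + alpha_j) - log B(alpha_j)
    log_post = []
    for q, alpha in zip(weights, components):
        shifted = [n + a for n, a in zip(counts, alpha)]
        log_post.append(log(q) + log_beta(shifted) - log_beta(alpha))

    # Normalise in a numerically stable way.
    top = max(log_post)
    post = [exp(v - top) for v in log_post]
    norm = sum(post)
    post = [p / norm for p in post]

    # Average the per-component posterior means.
    n_total = sum(counts)
    estimate = [0.0] * len(counts)
    for w, alpha in zip(post, components):
        denom = n_total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            estimate[i] += w * (n + a) / denom
    return estimate

# Toy example: 3-letter alphabet, two hypothetical components.
print(dirichlet_mixture_estimate([3, 0, 1], [0.5, 0.5], [[1, 1, 2], [2, 1, 1]]))
```

With no observations the estimate reduces to the weighted mean of the component means, which is why such priors help most when the training set is small.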
Improving Prediction of Protein Secondary Structure using Structured Neural Networks and Multiple Sequence Alignments
- J. Comput. Biol
, 1996
Cited by 53 (4 self)
The prediction of protein secondary structure by use of carefully structured neural networks and multiple sequence alignments has been investigated. Separate networks are used for predicting the three secondary structures α-helix, β-strand and coil. The networks are designed using a priori knowledge of amino acid properties with respect to the secondary structure and of the characteristic periodicity in α-helices. Since these single-structure networks all have fewer than 600 adjustable weights, over-fitting is avoided. To obtain a three-state prediction of α-helix, β-strand or coil, ensembles of single-structure networks are combined with another neural network. This method gives an overall prediction accuracy of 66.3% when using seven-fold cross-validation on a database of 126 non-homologous globular proteins. Applying the method to multiple sequence alignments of homologous proteins increases the prediction accuracy significantly to 71.3% with corresponding Matthews' correlation c...
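For readers unfamiliar with the Matthews correlation coefficient mentioned at the end of this abstract, here is a minimal sketch of the usual two-class definition. It assumes per-class counts of true/false positives and negatives (for example, helix versus non-helix residues); the function name is illustrative and the zero-denominator convention is an assumption.

```python
from math import sqrt

def matthews_correlation(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention assumed here: return 0 when a row or column is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom

# Perfect prediction gives 1.0; a chance-level prediction gives 0.0.
print(matthews_correlation(50, 50, 0, 0))
print(matthews_correlation(25, 25, 25, 25))
```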
Regularizers for Estimating Distributions of Amino Acids from Small Samples
, 1995
Cited by 19 (4 self)
This paper examines several different methods for estimating the distribution of amino acids in a specific context, given a very small sample of amino acids from that distribution. These distribution estimators, sometimes called regularizers, are frequently used when aligning sequences to each other or to models such as profiles or hidden Markov models. The distribution estimators considered here are zero-offsets, pseudocounts, substitution matrices (with several variants), feature alphabets, and Dirichlet mixture regularizers.
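Two of the simpler estimators named in this abstract can be sketched as follows. The sketch assumes the common definitions: a zero-offset adds the same small constant to every count, while pseudocounts add a per-letter amount proportional to a background distribution. The function names, parameter names, and the two-letter toy alphabet are illustrative assumptions.

```python
def zero_offset_estimate(counts, alphabet, offset=1.0):
    """Add the same constant `offset` to every count, then normalise."""
    total = sum(counts.get(a, 0) for a in alphabet) + offset * len(alphabet)
    return {a: (counts.get(a, 0) + offset) / total for a in alphabet}

def pseudocount_estimate(counts, background, weight=1.0):
    """Add `weight * background[a]` to each count, then normalise.

    `background` is assumed to be a probability distribution (sums to 1).
    """
    total = sum(counts.get(a, 0) for a in background) + weight
    return {a: (counts.get(a, 0) + weight * background[a]) / total
            for a in background}

# Toy example over a 2-letter alphabet with two observations of "A".
print(zero_offset_estimate({"A": 2}, "AC", offset=1.0))
print(pseudocount_estimate({"A": 2}, {"A": 0.5, "C": 0.5}, weight=2.0))
```

Both estimators keep unseen letters at a non-zero probability, which matters when the sample is very small.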
Identification of Protein Motifs Using Conserved Amino Acid Properties and Partitioning Techniques
- PROC. OF THIRD INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS FOR MOLECULAR BIOLOGY
, 1995
Cited by 16 (1 self)
Analyzing a set of protein sequences involves a fundamental relationship between the coherency of the set and the specificity of the motif that describes it. Motifs may be obscured by training sets that contain incoherent sequences, in part due to protein subclasses, contamination, or errors. We develop an algorithm for motif identification that systematically explores possible patterns of coherency within a set of protein sequences. Our algorithm constructs alternative partitions of the training set data, where one subset of each partition is presumed to contain coherent data and is used for forming a motif. The motif is described by multiple overlapping amino acid groups based on evolutionary, biochemical, or physical properties. We demonstrate our method on a training set of reverse transcriptases that contains subclasses, sequence errors, misalignments, and contaminating sequences. Despite these complications, our program identifies a novel motif for the subclass o...
Discovering Empirically Conserved Amino Acid Substitution Groups in Databases of Protein Families
- J. PURE APPL. ALGEBRA
, 1996
Cited by 9 (0 self)
This paper introduces a method for identifying amino acid substitution groups that are conserved empirically in aligned positions from databases of protein families. Existing approaches view amino acid substitution as a pairwise phenomenon and characterize it using substitution matrices. In contrast, the method presented here identifies subsets of amino acids that are conserved empirically using a conditional distribution matrix, which contains entries for every combination of individual amino acids and subsets of amino acids. Each row in the conditional distribution matrix contains the distribution of amino acids in those aligned positions that contain a given subset of amino acids. The algorithm converts a database of protein families into a conditional distribution matrix and then examines each possible substitution group for evidence of conservation. A substitution group is empirically conserved when it has characteristics of compactness and isolation, meaning that am...
Evaluating regularizers for estimating distributions of amino acids
- In
, 1995
Cited by 7 (0 self)
This paper makes a quantitative comparison of different methods, called regularizers, for estimating the distribution of amino acids in a specific context, given a very small sample of amino acids from that distribution. The regularizers considered here are zero-offsets, pseudocounts, substitution matrices (with several variants), and Dirichlet mixture regularizers. Each regularizer is evaluated based on how well it estimates the distributions of the columns of a multiple alignment—specifically, the expected encoding cost per amino acid using the regularizer and all possible samples from each column. In general, pseudocounts give the lowest encoding costs for samples of size zero, substitution matrices give the lowest encoding costs for samples of size one, and Dirichlet mixtures give the lowest for larger samples. One of the substitution matrix variants, which added pseudocounts and scaled counts, does almost as well as the best known Dirichlet mixtures, but with a lower computation cost.
A similar fragments merging approach to learn automata on proteins
- In: Machine Learning: ECML
, 2005
Cited by 7 (2 self)
Internal publication no. 1735, July 2005, 18 pages. Abstract: We propose here to learn automata for the characterization of protein families to overcome the limitations of the position-specific characterizations classically used in Pattern Discovery. We introduce a new heuristic approach learning non-deterministic automata based on selection and ordering of significantly similar fragments to be merged and on identification of physico-chemical properties. Quality of the characterization of the major intrinsic protein (MIP) family is assessed by leave-one-out cross-validation for a large range of model specificity. Key-words: grammatical inference, automata, proteins. Goulven Kerbellec is supported by a PhD research grant from Région Bretagne.
The Context-Dependence of Amino Acid Properties
- Intelligent Systems in Molecular Biology, AAAI
, 1997
Cited by 5 (0 self)
One of the current limitations of using sequence alignments to identify proteins with similar structures is that some proteins with similar structures do not have significant sequence similarity by identity. One way to address this "hidden-homology" problem is to match amino acids based on their chemical and physical properties. However, the amino acid properties overlap, creating orthogonal dimensions of similarity, the relative strengths of which are ambiguous. It has been observed that the role an amino acid plays (and hence the property that is important) at a site in a protein depends on its secondary and tertiary environment. To approximate and take advantage of this dependence on context for improving the sensitivity of alignments of proteins whose structures are unknown, we propose a surrogate definition of context based on the pattern of hydropathy in a small window of contiguous neighbors surrounding each amino acid. We present the results of an experiment in which a search-b...
BLOMAP: An Encoding of Amino Acids which Improves Signal Peptide Cleavage Site Prediction
- In Chen Y., Wong L: Proc. 3rd Asia-Pacific Bioinformatics Conference, Imperial
, 2005
Cited by 4 (0 self)
Research on cleavage site prediction for signal peptides has focused mainly on the application of different classification algorithms to achieve improved prediction accuracies. This paper addresses the fundamental issue of amino acid encoding to present amino acid sequences in the most beneficial way for machine learning algorithms. A comparison of several standard encoding methods shows that for cleavage site prediction the frequently used orthonormal encoding is inferior to other methods. The best results are achieved with a new encoding method named BLOMAP (based on the BLOSUM62 substitution matrix) using a Naïve Bayes classifier.
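The orthonormal encoding that this abstract uses as its baseline is commonly understood as one-hot encoding: each residue becomes a 20-dimensional vector with a single 1. The sketch below illustrates only that baseline under this assumption; the alphabet ordering and function name are illustrative, and the BLOMAP values themselves are not given in the abstract, so they are not reproduced.

```python
# The 20 standard amino acids in an arbitrary (alphabetical) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def orthonormal_encode(sequence):
    """Encode each residue as a 20-dimensional one-hot vector."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vectors = []
    for aa in sequence:
        vec = [0] * len(AMINO_ACIDS)
        vec[index[aa]] = 1  # raises KeyError for non-standard residues
        vectors.append(vec)
    return vectors

encoded = orthonormal_encode("MKA")
print(len(encoded), len(encoded[0]))
```

Every pair of distinct residues is equally distant under this encoding, so it carries no information about which substitutions are biochemically similar.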

