Hidden Markov models in computational biology: applications to protein modeling
- JOURNAL OF MOLECULAR BIOLOGY
, 1994
Cited by 439 (29 self)
Hidden Markov Models (HMMs) are applied to the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. These methods are demonstrated on the globin family, the protein kinase catalytic domain, and the EF-hand calcium binding motif. In each case the parameters of an HMM are estimated from a training set of unaligned sequences. After the HMM is built, it is used to obtain a multiple alignment of all the training sequences. It is also used to search the SWISS-PROT 22 database for other sequences that are members of the given protein family, or contain the given domain. The HMM produces multiple alignments of good quality that agree closely with the alignments produced by programs that incorporate three-dimensional structural information. When employed in discrimination tests (by examining how closely the sequences in a database fit the globin, kinase and EF-hand HMMs), the HMM is able to distinguish members of these families from non-members with a high degree of accuracy. Both the HMM and PROFILESEARCH (a technique used to search for relationships between a protein sequence and multiply aligned sequences) perform better in these tests than PROSITE (a dictionary of sites and patterns in proteins). The HMM appears to have a slight advantage over PROFILESEARCH in terms of lower rates of false...
Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization
- Machine Learning
, 1995
Cited by 167 (7 self)
The MEME algorithm extends the expectation maximization (EM) algorithm for identifying motifs in unaligned biopolymer sequences. The aim of MEME is to discover new motifs in a set of biopolymer sequences where little or nothing is known in advance about any motifs that may be present. MEME innovations expand the range of problems which can be solved using EM and increase the chance of finding good solutions. First, subsequences which actually occur in the biopolymer sequences are used as starting points for the EM algorithm to increase the probability of finding globally optimal motifs. Second, the assumption that each sequence contains exactly one occurrence of the shared motif is removed. This allows multiple appearances of a motif to occur in any sequence and permits the algorithm to ignore sequences with no appearance of the shared motif, increasing its resistance to noisy data. Third, a method for probabilistically erasing shared motifs after they are found is incorporated so tha...
Approaches to the Automatic Discovery of Patterns in Biosequences
, 1995
Cited by 125 (21 self)
This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering those patterns which are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis presented of the ways in which an assessment can be made of the significance and usefulness of the discovered patterns. It is shown that this problem is related to problems studied in the field of machine learning. The largest part of this paper comprises a review of a number of existing methods developed to solve this problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examples are given of patterns that have been discovered...
Position-based sequence weights
- J. Mol. Biol
, 1994
Cited by 74 (3 self)
Sequence weighting methods have been used to reduce redundancy and emphasize diversity in multiple sequence alignment and searching applications. Each of these methods is based on a notion of distance between a sequence and an ancestral or generalized sequence. We describe a different approach, which bases weights on the diversity observed at each position in the alignment, rather than on a sequence distance measure. These position-based weights make minimal assumptions, are simple to compute, and perform well in comprehensive evaluations. Redundancy is a common feature of sequence databanks, where a typical gene or protein family is represented by a highly non-random sample of sequences. For example, an ancient protein family might be represented by a few highly diverged microbial and invertebrate sequences plus many mammalian sequences that form a closely related subgroup. This situation can be detrimental in sequence alignment and searching applications, where it is usually desirable to represent the diversity among related sequences. Since closely related sequences are largely redundant, they provide less information in a multiple sequence alignment than their distant cousins. Sequence weighting methods have been introduced to compensate for over-representation
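The abstract above does not spell out the weighting formula. As a minimal sketch, assuming the commonly cited formulation in which a sequence receives 1/(k·n) from each column (k is the number of distinct residue types in the column, n is the number of sequences sharing that sequence's residue there), position-based weights could be computed as follows. The function name and the treatment of gap characters as ordinary symbols are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def position_based_weights(alignment):
    """Return one weight per aligned sequence, normalized to sum to 1.

    Assumed formulation: each column contributes 1 / (k * n) to a sequence,
    where k is the number of distinct symbols in the column and n is the
    number of sequences carrying the same symbol as this sequence.
    Gap characters are treated like any other symbol (an assumption).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if length == 0:
        raise ValueError("aligned sequences must not be empty")
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    weights = [0.0] * len(alignment)
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        k = len(counts)  # diversity observed at this position
        for i, seq in enumerate(alignment):
            weights[i] += 1.0 / (k * counts[seq[col]])

    total = sum(weights)  # equals the alignment length
    return [w / total for w in weights]


if __name__ == "__main__":
    # Two identical sequences share weight; the divergent one is up-weighted.
    print(position_based_weights(["AAA", "AAA", "CCC"]))
```

Under this formulation the two redundant sequences in the toy alignment split the weight that a single copy would have received, which is the redundancy correction the abstract describes.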
Empirical statistical estimates for sequence similarity searches
- J. Mol. Biol
, 1998
Cited by 66 (3 self)
Sequence similarity searches today are the most effective method for exploiting the information in the rapidly growing DNA and protein sequence databases. One of the most dramatic improvements
Valencia A: Effective use of sequence correlation and conservation in fold recognition
- J Mol Biol
, 1999
Cited by 37 (3 self)
Protein families are a rich source of information; sequence conservation and sequence correlation are two of the main properties that can be derived from the analysis of multiple sequence alignments. Sequence conservation is related to the direct evolutionary pressure to retain the chemical characteristics of some positions in order to maintain a given function. Sequence correlation is attributed to the small sequence adjustments needed to maintain protein stability against constant mutational drift. Here, we showed that sequence conservation and correlation were each frequently informative enough to detect incorrectly folded proteins. Furthermore, combining conservation, correlation, and polarity, we achieved an almost perfect discrimination between native and incorrectly folded proteins. Thus, we made use of this information for threading by evaluating the models suggested by a threading method according to the degree of proximity of the corresponding correlated, conserved, and apolar residues. The results showed that the fold recognition capacity of a given threading approach could be improved almost fourfold by selecting the alignments that score best under the three different sequence-based approaches.
Global Self Organization of All Known Protein Sequences Reveals Inherent Biological Signatures
, 1997
Cited by 35 (3 self)
A global classification of all currently known protein sequences is performed. Every protein sequence is partitioned into segments of 50 amino acids and a dynamic programming distance is calculated between each pair of segments. This space of segments is first embedded into Euclidean space with small metric distortion. A novel self-organized cross-validated clustering algorithm is then applied to the embedded space with Euclidean distances. The resulting hierarchical tree of clusters offers a new representation of protein sequences and families, which compares favorably with the most updated classifications based on functional and structural protein data. Motifs and domains such as the Zinc Finger, EF hand, Homeobox, EGF-like and others are automatically correctly identified. A novel representation of protein families is introduced, from which functional biological kinship of protein families can be deduced, as demonstrated for the transporters family. The self organization method prese...
A flexible motif search technique based on generalized profiles
- COMPUTERS AND CHEMISTRY
, 1996
Cited by 26 (5 self)
... generalized profile syntax serving as a motif definition language; and (2) a motif search method specifically adapted to the problem of finding multiple instances of a motif in the same sequence. The new profile structure, which is the core of the generalized profile syntax, combines the functions of a variety of motif descriptors implemented in other methods, including regular expression-like patterns, weight matrices, previously used profiles, and certain types of hidden Markov models (HMMs). The relationship between generalized profiles and other biomolecular motif descriptors is analyzed in detail, with special attention to HMMs. Generalized profiles are shown to be equivalent to a particular class of HMMs, and conversion procedures in both directions are given. The conversion procedures provide an interpretation for local alignment in the framework of stochastic models, allowing for clear, simple significance tests. A mathematical statement of the motif search problem defines the new method exactly without linking it to a specific algorithmic solution. Part of the definition includes a new definition of disjointness of alignments.
Using substitution probabilities to improve position-specific scoring matrices
- Computer Applications in the Biosciences
, 1996
Cited by 25 (0 self)
Each column of amino acids in a multiple alignment of protein sequences can be represented as a vector of 20 amino acid counts. For alignment and searching applications, the count vector is an imperfect representation of a position, because the observed sequences are an incomplete sample of the full set of related sequences. One general solution to this problem is to model unobserved sequences by adding artificial "pseudo-counts" to the observed counts. We introduce a simple method for computing pseudo-counts that combines the diversity observed in each alignment position with amino acid substitution probabilities. In extensive empirical tests, this position-based method out-performed other pseudo-count methods and was a substantial improvement over the traditional average score method used for constructing profiles.
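To make the general pseudo-count idea concrete, here is a minimal sketch of the baseline approach only: a fixed total pseudo-count distributed according to a background distribution. It is not the position-based method the abstract introduces, which additionally uses per-column diversity and substitution probabilities. The function name, the `total_pseudo` parameter, and the uniform default background are illustrative assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_probabilities(column, total_pseudo=1.0, background=None):
    """Estimate residue probabilities for one alignment column.

    Observed counts are combined with `total_pseudo` artificial counts
    spread over the residues according to `background` (uniform by
    default; both choices are assumptions made for illustration).
    Symbols outside the 20 residues, such as gaps, are ignored.
    """
    if background is None:
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    counts = Counter(c for c in column if c in background)
    observed = sum(counts.values())
    denom = observed + total_pseudo
    if denom <= 0:
        raise ValueError("need at least one observed residue or pseudo-count")
    return {
        a: (counts[a] + total_pseudo * background[a]) / denom
        for a in AMINO_ACIDS
    }


if __name__ == "__main__":
    probs = column_probabilities("AAAA", total_pseudo=1.0)
    # Residues never observed in the column still get non-zero probability.
    print(probs["A"], probs["C"])
```

The point of the sketch is only that unobserved residues keep a small non-zero probability; choosing how many pseudo-counts to add and how to distribute them is the question the paper addresses.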
ParaMEME: A Parallel Implementation and a Web Interface for a DNA and Protein Motif Discovery Tool
- Computer Applications in the Biosciences
, 1996
Cited by 25 (9 self)
Many advanced software tools fail to reach a wide audience because they require specialized hardware, installation expertise, or an abundance of CPU cycles. The worldwide web offers a new opportunity for distributing such systems. One such program, MEME, discovers repeated patterns, called motifs, in sets of DNA or protein sequences. This tool is now available to biologists over the worldwide web, using an asynchronous, single-program multiple-data version of the program called ParaMEME that runs on an Intel Paragon XP/S parallel computer at the San Diego Supercomputer Center. ParaMEME scales gracefully to 64 nodes on the Paragon with efficiencies above 72% for large data sets. The worldwide web interface to ParaMEME accepts a set of sequences interactively from a user, submits the sequences to the Paragon for analysis, and e-mails the results back to the user. ParaMEME is available for free public use at http://www.sdsc.edu/MEME. Keywords: intelligent system, supercomputer, ...

