Results 1  10
of
658
Pfam protein families database
 Nucleic Acids Research, 2008, 36(Database issue): D281–D288
"... Pfam is a comprehensive collection of protein domains and families, represented as multiple sequence alignments and as profile hidden Markov models. The current release of Pfam (22.0) contains 9318 protein families. Pfam is now based not only on the UniProtKB sequence database, but also on NCBI GenP ..."
Abstract

Cited by 749 (14 self)
 Add to MetaCart
(Show Context)
Pfam is a comprehensive collection of protein domains and families, represented as multiple sequence alignments and as profile hidden Markov models. The current release of Pfam (22.0) contains 9318 protein families. Pfam is now based not only on the UniProtKB sequence database, but also on NCBI GenPept and on sequences from selected metagenomics projects. Pfam is available on the web from the consortium members using a new, consistent and improved website design in the UK
The Infinite Hidden Markov Model
 Machine Learning
, 2002
"... We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. Th ..."
Abstract

Cited by 632 (41 self)
 Add to MetaCart
We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying statetransition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infiniteconsider, for example, symbols being possible words appearing in English text.
Survey of clustering algorithms
 IEEE TRANSACTIONS ON NEURAL NETWORKS
, 2005
"... Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the ..."
Abstract

Cited by 483 (4 self)
 Add to MetaCart
(Show Context)
Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure, and cluster validation, are also discussed.
Hidden Markov models for detecting remote protein homologies
 Bioinformatics
, 1998
"... A new hidden Markov model method (SAMT98) for nding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (hmm) from the sequence and homologs found using the hmm for database search. SAMT98 is ..."
Abstract

Cited by 466 (15 self)
 Add to MetaCart
(Show Context)
A new hidden Markov model method (SAMT98) for nding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (hmm) from the sequence and homologs found using the hmm for database search. SAMT98 is also used to construct model libraries automatically from sequences in structural databases. We evaluate the SAMT98 method with four datasets. Three of the test sets are foldrecognition tests, where the correct answers are determined by structural similarity. The fourth uses a curated database. The method is compared against wublastp and against doubleblast, a twostep method similar to ISS, but using blast instead of fasta. Results SAMT98 had the fewest errors in all tests dramatically so for the foldrecognition tests. At the minimumerror point on the SCOPdomains test, SAMT98 got 880 true positives and 68 false positives, doubleblast got 533 true positives with 71 false positives, and wublastp got 353 true positives with 24 false positives. The method is optimized to recognize superfamilies, and would require parameter adjustment to be used to nd family or fold relationships. One key to the performance of the hmm method is a new scorenormalization technique that compares the score to the score with a reversed model rather than to a uniform null model. Availability A World Wide Web server, as well as information on obtaining the Sequence Alignment and PREPRINT to appear in Bioinformatics, 1999
Protein homology detection by HMMHMM comparison
 Bioinformatics
, 2005
"... Motivation: Protein homology detection and sequence alignment are at the basis of protein structure prediction, function prediction, and evolution. Results: We have generalized the alignment of protein sequences with a profile hidden Markov model (HMM) to the case of pairwise alignment of profile H ..."
Abstract

Cited by 397 (8 self)
 Add to MetaCart
(Show Context)
Motivation: Protein homology detection and sequence alignment are at the basis of protein structure prediction, function prediction, and evolution. Results: We have generalized the alignment of protein sequences with a profile hidden Markov model (HMM) to the case of pairwise alignment of profile HMMs. We present a method for detecting distant homologous relationships between proteins based on this approach. The method (HHsearch) is benchmarked together with BLAST, PSIBLAST, HMMER, and the profileprofile comparison tools PROF_SIM and COMPASS, in an allagainstall comparison of a database of 3691 protein domains from SCOP 1.63 with pairwise sequence identities below 20%. Sensitivity: When predicted secondary structure is included in the HMMs, HHsearch is able to detect between 2.7 and 4.2 times more homologs than PSIBLAST or HMMER and between 1.44 and 1.9 times more than COMPASS or PROF_SIM for a rate of false positives of 10%. Approximately half of the improvement over the profile–profile comparison methods is attributable to the use of profile HMMs in place of simple profiles. Alignment quality: Higher sensitivity is mirrored by an increased alignment quality. HHsearch produced 1.2, 1.7, and 3.3 times more good alignments (“balanced ” score> 0.3) than the next best method (COMPASS), and 1.6, 2.9, and 9.4 times more than PSIBLAST, at the family, superfamily, and fold level. Speed: HHsearch scans a query of 200 residues against 3691 domains in 33s on an AMD64 3GHz PC. This is 10 times faster than PROF_SIM and 17 times faster than
A hidden Markov model for predicting transmembrane helices in protein sequences
 In Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology (ISMB
, 1998
"... A novel method to model and predict the location and orientation of alpha helices in membrane spanning proteins is presented. It is based on a hidden Markov model (HMM) with an architecture that corresponds closely to the biological system. The model is cyclic with 7 types of states for helix core, ..."
Abstract

Cited by 367 (9 self)
 Add to MetaCart
(Show Context)
A novel method to model and predict the location and orientation of alpha helices in membrane spanning proteins is presented. It is based on a hidden Markov model (HMM) with an architecture that corresponds closely to the biological system. The model is cyclic with 7 types of states for helix core, helix caps on either side, loop on the cytoplasmic side, two loops for the noncytoplasmic side, and a globular domain state in the middle of each loop. The two loop paths on the noncytoplasmic side are used to model short and long loops separately, which corresponds biologically to the two known different membrane insertions mechanisms. The close mapping between the biological and computational states allows us to infer which parts of the model architecture are important to capture the information that encodes the membrane topology, and to gain a better understanding of the mechanisms and constraints involved. Models were estimated both by maximum likelihood and a discriminative method, and a method for reassignment of the membrane helix boundaries were developed. In a cross validated test on single sequences, our transmembrane HMM, TMHMM, correctly predicts the entire topology for 77 % of the sequencesin a standard dataset of 83 proteins with known topology. The same accuracy was achieved on a larger dataset of 160 proteins. These results compare favourably with existing methods.
PROBCONS: Probabilistic consistencybased multiple sequence alignment
 Genome Res
, 2005
"... To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objec ..."
Abstract

Cited by 262 (11 self)
 Add to MetaCart
(Show Context)
To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objective functions for measuring alignment quality. In this paper, we introduce probabilistic consistency, a novel scoring function for multiple sequence comparisons. We present PROBCONS, a practical tool for progressive protein multiple sequence alignment based on probabilistic consistency, and evaluate its performance on several standard alignment benchmark datasets. On the BAliBASE, SABmark, and PREFAB benchmark alignment databases, PROBCONS achieves statistically significant improvement over other leading methods while maintaining practical speed. PROBCONS is publicly available as a web resource. Source code and executables are available under the GNU Public License at
Hidden Markov processes
 IEEE Trans. Inform. Theory
, 2002
"... Abstract—An overview of statistical and informationtheoretic aspects of hidden Markov processes (HMPs) is presented. An HMP is a discretetime finitestate homogeneous Markov chain observed through a discretetime memoryless invariant channel. In recent years, the work of Baum and Petrie on finite ..."
Abstract

Cited by 260 (5 self)
 Add to MetaCart
(Show Context)
Abstract—An overview of statistical and informationtheoretic aspects of hidden Markov processes (HMPs) is presented. An HMP is a discretetime finitestate homogeneous Markov chain observed through a discretetime memoryless invariant channel. In recent years, the work of Baum and Petrie on finitestate finitealphabet HMPs was expanded to HMPs with finite as well as continuous state spaces and a general alphabet. In particular, statistical properties and ergodic theorems for relative entropy densities of HMPs were developed. Consistency and asymptotic normality of the maximumlikelihood (ML) parameter estimator were proved under some mild conditions. Similar results were established for switching autoregressive processes. These processes generalize HMPs. New algorithms were developed for estimating the state, parameter, and order of an HMP, for universal coding and classification of HMPs, and for universal decoding of hidden Markov channels. These and other related topics are reviewed in this paper. Index Terms—Baum–Petrie algorithm, entropy ergodic theorems, finitestate channels, hidden Markov models, identifiability, Kalman filter, maximumlikelihood (ML) estimation, order estimation, recursive parameter estimation, switching autoregressive processes, Ziv inequality. I.
A Discriminative Framework for Detecting Remote Protein Homologies
, 1999
"... A new method for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily. The method is a variant of support vector machines using a new kernel function. The kernel function is derived from a generative statistical model for a ..."
Abstract

Cited by 255 (4 self)
 Add to MetaCart
(Show Context)
A new method for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily. The method is a variant of support vector machines using a new kernel function. The kernel function is derived from a generative statistical model for a protein family, in this case a hidden Markov model. This general approach of combining generative models like HMMs with discriminative methods such as support vector machines may have applications in other areas of biosequence analysis as well.
Sequence Comparisons Using Multiple Sequences Detect Three Times as Many Remote . . .
, 1998
"... The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representati ..."
Abstract

Cited by 244 (16 self)
 Add to MetaCart
The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representation of the characteristics shared by related sequences. Here we describe an assessment of three of these methods: the SAMT98 implementation of a hidden Markov model procedure; PSIBLAST; and the intermediate sequence search (ISS) procedure. We determined the extent to which these procedures can detect evolutionary relationships between the members of the sequence database PDBD40J. This database, derived from the structural classification of proteins (SCOP), contains the sequences of proteins of known structure whose sequence identities with each other are 40 % or less. The evolutionary relationships that exist between those that have low sequence identities were found by the examination of their structural details and, in many cases, their functional