Results 1  10
of
418
The Infinite Hidden Markov Model
 Machine Learning
, 2002
"... We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. Th ..."
Abstract

Cited by 488 (33 self)
 Add to MetaCart
We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying statetransition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infiniteconsider, for example, symbols being possible words appearing in English text.
Hidden Markov models for detecting remote protein homologies
 Bioinformatics
, 1998
"... A new hidden Markov model method (SAMT98) for nding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (hmm) from the sequence and homologs found using the hmm for database search. SAMT98 is ..."
Abstract

Cited by 306 (12 self)
 Add to MetaCart
A new hidden Markov model method (SAMT98) for nding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (hmm) from the sequence and homologs found using the hmm for database search. SAMT98 is also used to construct model libraries automatically from sequences in structural databases. We evaluate the SAMT98 method with four datasets. Three of the test sets are foldrecognition tests, where the correct answers are determined by structural similarity. The fourth uses a curated database. The method is compared against wublastp and against doubleblast, a twostep method similar to ISS, but using blast instead of fasta. Results SAMT98 had the fewest errors in all tests dramatically so for the foldrecognition tests. At the minimumerror point on the SCOPdomains test, SAMT98 got 880 true positives and 68 false positives, doubleblast got 533 true positives with 71 false positives, and wublastp got 353 true positives with 24 false positives. The method is optimized to recognize superfamilies, and would require parameter adjustment to be used to nd family or fold relationships. One key to the performance of the hmm method is a new scorenormalization technique that compares the score to the score with a reversed model rather than to a uniform null model. Availability A World Wide Web server, as well as information on obtaining the Sequence Alignment and PREPRINT to appear in Bioinformatics, 1999
Survey of clustering algorithms
 IEEE TRANSACTIONS ON NEURAL NETWORKS
, 2005
"... Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the ..."
Abstract

Cited by 231 (3 self)
 Add to MetaCart
Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure, and cluster validation, are also discussed.
A Discriminative Framework for Detecting Remote Protein Homologies
, 1999
"... A new method for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily. The method is a variant of support vector machines using a new kernel function. The kernel function is derived from a generative statistical model for a ..."
Abstract

Cited by 193 (4 self)
 Add to MetaCart
A new method for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily. The method is a variant of support vector machines using a new kernel function. The kernel function is derived from a generative statistical model for a protein family, in this case a hidden Markov model. This general approach of combining generative models like HMMs with discriminative methods such as support vector machines may have applications in other areas of biosequence analysis as well.
Sequence Comparisons Using Multiple Sequences Detect Three Times as Many Remote . . .
, 1998
"... The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representati ..."
Abstract

Cited by 183 (15 self)
 Add to MetaCart
The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representation of the characteristics shared by related sequences. Here we describe an assessment of three of these methods: the SAMT98 implementation of a hidden Markov model procedure; PSIBLAST; and the intermediate sequence search (ISS) procedure. We determined the extent to which these procedures can detect evolutionary relationships between the members of the sequence database PDBD40J. This database, derived from the structural classification of proteins (SCOP), contains the sequences of proteins of known structure whose sequence identities with each other are 40 % or less. The evolutionary relationships that exist between those that have low sequence identities were found by the examination of their structural details and, in many cases, their functional
Hidden Markov processes
 IEEE Trans. Inform. Theory
, 2002
"... Abstract—An overview of statistical and informationtheoretic aspects of hidden Markov processes (HMPs) is presented. An HMP is a discretetime finitestate homogeneous Markov chain observed through a discretetime memoryless invariant channel. In recent years, the work of Baum and Petrie on finite ..."
Abstract

Cited by 170 (3 self)
 Add to MetaCart
Abstract—An overview of statistical and informationtheoretic aspects of hidden Markov processes (HMPs) is presented. An HMP is a discretetime finitestate homogeneous Markov chain observed through a discretetime memoryless invariant channel. In recent years, the work of Baum and Petrie on finitestate finitealphabet HMPs was expanded to HMPs with finite as well as continuous state spaces and a general alphabet. In particular, statistical properties and ergodic theorems for relative entropy densities of HMPs were developed. Consistency and asymptotic normality of the maximumlikelihood (ML) parameter estimator were proved under some mild conditions. Similar results were established for switching autoregressive processes. These processes generalize HMPs. New algorithms were developed for estimating the state, parameter, and order of an HMP, for universal coding and classification of HMPs, and for universal decoding of hidden Markov channels. These and other related topics are reviewed in this paper. Index Terms—Baum–Petrie algorithm, entropy ergodic theorems, finitestate channels, hidden Markov models, identifiability, Kalman filter, maximumlikelihood (ML) estimation, order estimation, recursive parameter estimation, switching autoregressive processes, Ziv inequality. I.
Hidden Markov models for sequence analysis: extension and analysis of the basic method
, 1996
"... Hidden Markov models (HMMs) are a highly effective means of modeling a family of unaligned sequences or a common motif within a set of unaligned sequences. The trained HMM can then be used for discrimination or multiple alignment. The basic mathematical description of an HMM and its expectationmaxi ..."
Abstract

Cited by 164 (20 self)
 Add to MetaCart
Hidden Markov models (HMMs) are a highly effective means of modeling a family of unaligned sequences or a common motif within a set of unaligned sequences. The trained HMM can then be used for discrimination or multiple alignment. The basic mathematical description of an HMM and its expectationmaximization training procedure is relatively straightforward. In this paper, we review the mathematical extensions and heuristics that move the method from the theoretical to the practical. Then, we experimentally analyze the effectiveness of model regularization, dynamic model modification, and optimization strategies. Finally it is demonstrated on the SH2 domain how a domain can be found from unaligned sequences using a special model type. The experimental work was completed with the aid of the Sequence Alignment and Modeling software suite. 1 Introduction Since their introduction to the computational biology community (Haussler et al., 1993; Krogh et al., 1994a), hidden Markov models (HMMs...
Using the Fisher kernel method to detect remote protein homologies
 In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology
, 1999
"... A new method, called the Fisher kernel method, for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily. The method is a variant of support vector machines using a new kernel function. The kernel function is derived from a hid ..."
Abstract

Cited by 160 (3 self)
 Add to MetaCart
A new method, called the Fisher kernel method, for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily. The method is a variant of support vector machines using a new kernel function. The kernel function is derived from a hidden Markov model. The general approach of combining generative models like HMMs with discriminative methods such as support vector machines may have applications in other areas of biosequence analysis as well.
A generalized hidden markov model for the recognition of human genes
 in DNA. In: Proc. Int. Conf. Intell
, 1996
"... We present a statistical model of genes in DNA. A Generalized Hidden Markov Model (GtlMM) provides the framework for describing the grasnmar of a legal parse of a DNA sequence (Stormo & Haussler 1994). Probabilities are assigned to transitions between states in tile GItMM and to the generation of ea ..."
Abstract

Cited by 157 (14 self)
 Add to MetaCart
We present a statistical model of genes in DNA. A Generalized Hidden Markov Model (GtlMM) provides the framework for describing the grasnmar of a legal parse of a DNA sequence (Stormo & Haussler 1994). Probabilities are assigned to transitions between states in tile GItMM and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize these probabilities using a standardized training set. Given a new candidate sequence, the best parse is deduced from the model using a dynamic programlning algorithm to identify the path through the model with maximum probability. Tile GHMM is flexible and modular, so new sensors and additional states can be inserted easily. In addition, it provides simple solutions for integrating cardinality constraints, reading frame constraints, "indels’, and homology searching. The description and results of an implementation of such a genefinding model, called Genie, is presented. The exon sensor is a codon frequency model conditioned on windowed nucleotide frequency and the preceding eodon. Two neural networks are used, as in (Brunak, Engelbrecht, & Knudsen 1991), for splice site prediction. We show that this simple model perforins quite well. For a crossvalidated standard test set of 304 genes [ftp://wwwhgc.lbl.gov/pub/genesets] in human DNA, our genefinding system identified up to 85 % of proteincoding bases correctly with a specificity of 80%. 58 % of exons were exactly identified with a specificity of 51%. Genie is shown to perform favorably compared with several other genefinding systems.
DT: GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences
 J Mol Biol
, 1999
"... A new protein fold recognition method is described which is both fast and reliable. The method uses a traditional sequence alignment algorithm to generate alignments which are then evaluated by a method derived from threading techniques. As a ®nal step, each threaded model is evaluated by a neural n ..."
Abstract

Cited by 152 (10 self)
 Add to MetaCart
A new protein fold recognition method is described which is both fast and reliable. The method uses a traditional sequence alignment algorithm to generate alignments which are then evaluated by a method derived from threading techniques. As a ®nal step, each threaded model is evaluated by a neural network in order to produce a single measure of con®dence in the proposed prediction. The speed of the method, along with its sensitivity and very low falsepositive rate makes it ideal for automatically predicting the structure of all the proteins in a translated bacterial genome (proteome). The method has been applied to the genome of Mycoplasma genitalium, and analysis of the results shows that as many as 46 % of the proteins derived from the predicted protein coding regions have a signi®cant relationship to a protein of known structure. In some cases, however, only one domain of the protein can be predicted, giving a total coverage of 30 % when calculated as a fraction of the number of amino acid residues in the whole proteome. # 1999 Academic Press