Hidden Markov models for sequence analysis: extension and analysis of the basic method
, 1996
Cited by 219 (23 self)
Hidden Markov models (HMMs) are a highly effective means of modeling a family of unaligned sequences or a common motif within a set of unaligned sequences. The trained HMM can then be used for discrimination or multiple alignment. The basic mathematical description of an HMM and its expectation-maximization training procedure is relatively straightforward. In this paper, we review the mathematical extensions and heuristics that move the method from the theoretical to the practical. Then, we experimentally analyze the effectiveness of model regularization, dynamic model modification, and optimization strategies. Finally, it is demonstrated on the SH2 domain how a domain can be found from unaligned sequences using a special model type. The experimental work was completed with the aid of the Sequence Alignment and Modeling software suite. 1 Introduction Since their introduction to the computational biology community (Haussler et al., 1993; Krogh et al., 1994a), hidden Markov models (HMMs...
Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology
, 1996
Cited by 175 (24 self)
This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences, when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined with observed amino acid frequencies, to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, such that remotely related family members can be more reliably recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and other methods for computing these expected amino acid distributions in database search, resulting in fewer false positives and false negatives for the families tested. This paper corrects a previously p...
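The way a Dirichlet mixture is combined with observed counts, as described above, can be sketched as follows. This is a minimal illustration, assuming the standard posterior-mean form of the estimate, in which each component's contribution is weighted by its posterior probability given the counts. The three-letter alphabet, the function names, and all mixture parameters are hypothetical and are not taken from the paper.

```python
import math


def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def mixture_posterior_estimate(counts, mix_coeffs, components):
    """Estimate letter probabilities from counts under a Dirichlet mixture prior.

    counts     -- observed count for each letter
    mix_coeffs -- mixture coefficient q_j for each component (sums to 1)
    components -- Dirichlet parameter vector alpha_j for each component
    """
    # Unnormalized log posterior of each component:
    # log q_j + log B(alpha_j + n) - log B(alpha_j)
    log_w = [
        math.log(q)
        + log_beta([a + n for a, n in zip(alpha, counts)])
        - log_beta(alpha)
        for q, alpha in zip(mix_coeffs, components)
    ]
    # Normalize in a numerically stable way.
    top = max(log_w)
    weights = [math.exp(x - top) for x in log_w]
    norm = sum(weights)
    posteriors = [w / norm for w in weights]

    # Posterior mean: sum_j P(j | n) * (n_i + alpha_ji) / (|n| + |alpha_j|)
    total = sum(counts)
    estimate = [0.0] * len(counts)
    for p_j, alpha in zip(posteriors, components):
        denom = total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            estimate[i] += p_j * (n + a) / denom
    return estimate


# Hypothetical two-component mixture over a three-letter alphabet.
components = [[10.0, 1.0, 1.0], [1.0, 1.0, 10.0]]
mix_coeffs = [0.5, 0.5]
estimate = mixture_posterior_estimate([5, 0, 0], mix_coeffs, components)
```

With no observations the estimate reduces to the prior mean of the mixture; as counts accumulate, the component that best explains them dominates and the estimate moves toward the observed frequencies.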
Grouping Web Page References into Transactions for Mining World Wide Web Browsing Patterns
 Dept. of Computer Science, Univ. of Minnesota
, 1997
Cited by 42 (3 self)
Web-based organizations often generate and collect large volumes of data in their daily operations. Analyzing such data involves the discovery of meaningful relationships from a large collection of primarily unstructured data, often stored in Web server access logs. While traditional domains for data mining, such as point-of-sale databases, have naturally defined transactions, there is no convenient method of clustering web references into transactions. This paper identifies a model of user browsing behavior that separates web page references into those made for navigation purposes and those for information content purposes. A transaction identification method based on the browsing model is defined and successfully tested against other methods, such as the maximal forward reference algorithm proposed in [1]. Transactions identified by the proposed methods are used to discover association rules from real world data using the WEBMINER system [7]. 1 Introduction and Background As more or...
Weighting Hidden Markov Models For Maximum Discrimination
 Bioinformatics
, 1998
Cited by 26 (3 self)
1.1 Motivation Hidden Markov models can efficiently and automatically build statistical representations of related sequences. Unfortunately, training sets are frequently biased toward one subgroup of sequences, leading to an insufficiently general model. This work evaluates sequence weighting methods based on the maximum-discrimination idea. 1.2 Results One good method scales sequence weights by an exponential that ranges between 0.1 for the best scoring sequence and 1.0 for the worst. Experiments with a curated data set show that while training with one or two sequences performed worse than single-sequence Probabilistic Smith-Waterman, training with five or ten sequences reduced errors by 20% and 51%, respectively. This new version of the SAM HMM suite outperforms HMMer (17% reduction over PSW for 10 training sequences), Meta-MEME (28% reduction), and unweighted SAM (31% reduction). 1.3 Availability A World-Wide Web server, as well as information on obtaining the Sequence Alignme...
Evaluating regularizers for estimating distributions of amino acids
 In Proc. of Third Int. Conf. on Intelligent Systems for Molecular Biology
, 1995
Cited by 10 (0 self)
This paper makes a quantitative comparison of different methods, called regularizers, for estimating the distribution of amino acids in a specific context, given a very small sample of amino acids from that distribution. The regularizers considered here are zero-offsets, pseudocounts, substitution matrices (with several variants), and Dirichlet mixture regularizers. Each regularizer is evaluated based on how well it estimates the distributions of the columns of a multiple alignment: specifically, the expected encoding cost per amino acid using the regularizer and all possible samples from each column. In general, pseudocounts give the lowest encoding costs for samples of size zero, substitution matrices give the lowest encoding costs for samples of size one, and Dirichlet mixtures give the lowest for larger samples. One of the substitution matrix variants, which added pseudocounts and scaled counts, does almost as well as the best known Dirichlet mixtures, but with a lower computation cost.
Keywords: regularizers, entropy, encoding cost, pseudocounts, Gribskov average score, substitution matrices, data-dependent pseudocounts, Dirichlet mixture priors
1 Why estimate amino acid distributions?
Most search and comparison algorithms for proteins need to estimate the probabilities of the twenty amino acids in a given context. This probability is often expressed indirectly as a score for each of the amino acids, with positive scores for expected amino acids and negative scores for unexpected ones. As Altschul pointed out [1], any alignment-scoring system is really making an assertion about the probability of the test sequences given the reference sequence. The score for an alignment is the sum of the scores for individual matched positions, plus the costs for insertions and deletions. For each match position, there are twenty scores, one for each of the possible amino acids in the test sequence.
Each match score can be interpreted as the logarithm of the ratio of two estimated probabilities: the probability of the test amino acid given the amino acid in the reference sequence and the probability of the test amino acid in the background distribution. If we define $\hat{P}_t(i)$ as the estimated probability of amino acid $i$ in position $t$ and $\hat{P}_0(i)$ as the estimated background probability in any position, then the score for $i$ in column $t$ is $\log_b(\hat{P}_t(i)/\hat{P}_0(i))$ for some arbitrary logarithmic base $b$. Any method for estimating the probabilities $\hat{P}_t(i)$ and $\hat{P}_0(i)$ defines a match-scoring system. Rather than looking at the final scoring system, this paper will concentrate on methods that can be used for estimating the probabilities themselves. In more sophisticated models than single sequence alignments, such as multiple alignments, profiles ... For alignment and search problems, we usually add scores from many positions, and so fairly small improvements in computing individual match scores can add up to significant overall differences. For example, the small differences between the PAM and BLOSUM scoring matrices have been shown to make a significant difference in the quality of search results. The differences between regularizers are often fairly small; this paper attempts to quantify these small differences for several regularizers. Section 2 explains the measure used to quantify the tests; Section 3 explains the notion of posterior counts; Section 4 describes the data used for training and testing; and Section 5 presents the different methods and quantitative comparisons of them. From: ISMB-95 Proceedings.
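The log-odds match score described above can be illustrated with the simplest of the regularizers the abstract names, pseudocounts. The sketch below is only an illustration under stated assumptions: the uniform background distribution, the pseudocount weight of 1.0, the example column counts, and the function names are all hypothetical and do not come from the paper.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pseudocount_estimate(counts, background, weight=1.0):
    """Estimate P_t(i) = (n_i + weight * P_0(i)) / (N + weight)."""
    total = sum(counts.get(a, 0) for a in background)
    return {
        a: (counts.get(a, 0) + weight * background[a]) / (total + weight)
        for a in background
    }


def log_odds_scores(estimate, background, base=2.0):
    """Match score for each amino acid: log_b(P_t(i) / P_0(i))."""
    return {a: math.log(estimate[a] / background[a], base) for a in background}


# Assumed uniform background; a real application would use observed
# background frequencies instead.
background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}

# Hypothetical alignment column containing three leucines and one isoleucine.
column_counts = {"L": 3, "I": 1}

p_t = pseudocount_estimate(column_counts, background, weight=1.0)
scores = log_odds_scores(p_t, background)
```

Amino acids seen in the column receive positive scores and unseen ones receive negative scores. With an empty sample the estimate falls back to the background distribution, so every score is zero.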
An alternative model of amino acid replacement
 Bioinformatics
, 2005
Cited by 10 (3 self)
Motivation: The observed correlations between pairs of homologous protein sequences are typically explained in terms of a Markovian dynamic of amino acid substitution. This model assumes that every location on the protein sequence has the same background distribution of amino acids, an assumption that is incompatible with the observed heterogeneity of protein amino acid profiles and with the success of profile multiple sequence alignment. Results: We propose an alternative model of amino acid replacement during protein evolution based upon the assumption that the variation of the amino acid background distribution from one residue to the next is sufficient to explain the observed sequence correlations of homologs. The resulting dynamical model of independent replacements drawn from heterogeneous backgrounds is simple and consistent, and provides a unified homology match score for sequence–sequence, sequence–profile and profile–profile alignment. Contact:
Measurements of Protein Sequence-Structure Correlations
, 2004
Cited by 10 (4 self)
Correlations between protein structures and amino acid sequences are widely used for protein structure prediction. For example, secondary structure predictors generally use correlations between a secondary structure sequence and corresponding primary structure sequence, whereas threading algorithms, and similar tertiary structure predictors, typically incorporate inter-residue contact potentials. To investigate the relative importance of these interactions we measured the mutual information between the primary structure, secondary structure and side-chain surface exposure, both for adjacent residues along the amino acid sequence, and for tertiary structure contacts between residues distantly separated along the backbone. We find that local interactions along the amino acid chain are far more important than nonlocal contacts, and that correlations between proximate amino acids are essentially uninformative. This suggests that knowledge-based contact potentials may be less important for structure prediction than is generally believed.
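For readers unfamiliar with the quantity being measured, mutual information between two discrete variables can be computed from paired observations as sketched below. This is a naive plug-in estimator, assumed here purely for illustration; it is not necessarily the estimator used in the paper, and it is biased for small samples. The function name and the toy residue and secondary-structure labels are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs, base=2.0):
    """Plug-in estimate of I(X;Y) from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    marginal_x = Counter(x for x, _ in pairs)
    marginal_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) * p(y)) ), written in terms of counts
        mi += (count / n) * math.log(
            count * n / (marginal_x[x] * marginal_y[y]), base
        )
    return mi


# Toy data: residue type paired with a secondary-structure label.
perfectly_linked = [("A", "H"), ("G", "E")] * 10
unrelated = [("A", "H"), ("A", "E"), ("G", "H"), ("G", "E")] * 5

mi_linked = mutual_information(perfectly_linked)
mi_unrelated = mutual_information(unrelated)
```

When one variable fully determines the other, the estimate equals the entropy of either variable; when the two are independent in the sample, it is zero.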
Inference of Entropies of Discrete Random Variable with Unknown Cardinality
 arXiv:physics/0207009v1 [physics.data-an]
, 2002
Cited by 4 (1 self)
We examine the recently introduced NSB estimator of entropies of severely undersampled discrete variables and devise a procedure for calculating the involved integrals. We discover that the output of the estimator has a well-defined limit for large cardinalities of the variables being studied. Thus one can estimate entropies with no a priori assumptions about these cardinalities, and a closed-form solution for such estimates is given.
Using Mixtures of Common Ancestors for Estimating the Probabilities of Discrete Events in Biological Sequences
 In Proceedings of the Ninth International Conference on Intelligent Systems for Molecular Biology
, 2002
Cited by 4 (1 self)
Accurately estimating probabilities from observations is important for probabilistic-based approaches to problems in computational biology. In this paper we present a biologically motivated method for estimating probability distributions over discrete alphabets from observations using a mixture model of common ancestors. The method is an extension of substitution matrix-based probability estimation methods. In contrast to previous such methods, our method has a simple Bayesian interpretation and has the advantage over Dirichlet mixtures that it is both effective and simple to compute for large alphabets.
Dirichlet Mixtures for Query Estimation in Information Retrieval
, 2005
Cited by 3 (1 self)
Treated as small samples of text, user queries require smoothing to better estimate the probabilities of their true model. Traditional techniques to perform this smoothing include automatic query expansion and local feedback. This paper applies the bioinformatics smoothing technique, Dirichlet mixtures, to the task of query estimation. We discuss Dirichlet mixtures' relation to relevance models, probabilistic latent semantic indexing, and other information retrieval techniques. We describe how Dirichlet mixtures give insight into the value of retaining the original query in query expansion. On the task of ad hoc retrieval, query estimation by Dirichlet mixtures generally does not perform well, but aspects of its behavior show promise. Experiments where the original query is mixed with the models estimated by relevance models and Dirichlet mixtures confirm that query estimation methods should not fully discount the prior information held in a query.