Results 1 - 10
of
32
Hidden Markov models in computational biology: applications to protein modeling
- JOURNAL OF MOLECULAR BIOLOGY
, 1994
"... Hidden.Markov Models (HMMs) are applied t.0 the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. These methods are demonstrated the on globin family, the protein kinase catalytic domain, and the EF-hand calcium binding moti ..."
Abstract
-
Cited by 436 (29 self)
- Add to MetaCart
Hidden.Markov Models (HMMs) are applied t.0 the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. These methods are demonstrated the on globin family, the protein kinase catalytic domain, and the EF-hand calcium binding motif. In each case the parameters of an HMM are estimated from a training set of unaligned sequences. After the HMM is built, it is used to obtain a multiple alignment of all the training sequences. It is also used to search the. SWISS-PROT 22 database for other sequences. that are members of the given protein family, or contain the given domain. The Hi " produces multiple alignments of good quality that agree closely with the alignments produced by programs that incorporate threedimensional structural information. When employed in discrimination tests (by examining how closely the sequences in a database fit the globin, kinase and EF-hand HMMs), the '\ HMM is able to distinguish members of these families from non-members with a high degree of accuracy. Both the HMM and PROFILESEARCH (a technique used to search for relationships between a protein sequence and multiply aligned sequences) perform better in these tests than PROSITE (a dictionary of sites and patterns in proteins). The HMM appecvs to have a slight advantage over PROFILESEARCH in terms of lower rates of false
Finding motifs using random projections
, 2001
"... Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif Å of length 15, where each planted instance differs from Å in 4 positions. Whereas previous algorithms all failed to solve this (15,4)-motif problem, Pevz ..."
Abstract
-
Cited by 174 (5 self)
- Add to MetaCart
Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif Å of length 15, where each planted instance differs from Å in 4 positions. Whereas previous algorithms all failed to solve this (15,4)-motif problem, Pevzner and Sze introduced algorithms that succeeded. However, their algorithms failed to solve the considerably more difficult (14,4)-, (16,5)-, and (18,6)motif problems. We introduce a novel motif discovery algorithm based on the use of random projections of the input’s substrings. Experiments on simulated data demonstrate that this algorithm performs better than existing algorithms and, in particular, typically solves the difficult (14,4)-, (16,5)-, and (18,6)-motif problems quite efficiently. A probabilistic estimate shows that the small values of � for which the algorithm fails to recover the planted Ð � �-motif are in all likelihood inherently impossible to solve. We also present experimental results on realistic biological data by identifying ribosome binding sites in prokaryotes as well as a number of known transcriptional regulatory motifs in eukaryotes. 1. CHALLENGING MOTIF PROBLEMS Pevzner and Sze [23] considered a very precise version of the motif discovery problem of computational biology, which had also been considered by Sagot [26]. Based on this formulation, they issued an algorithmic challenge: Planted Ð � �-Motif Problem: Suppose there is a fixed but unknown nucleotide sequence Å (the motif) of length Ð. The problem is to determine Å, givenØ nucleotide sequences each of length Ò, and each containing a planted variant of Å. More precisely, each such planted variant is a substring that is Å with exactly � point substitutions. One instantiation that they labeled “The Challenge Problem ” was parameterized as finding a planted (15,4)-motif in Ø � sequences each of length Ò � �. These values of Ò, Ø, andÐ are
Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization
- Machine Learning
, 1995
"... . The MEME algorithm extends the expectation maximization (EM) algorithm for identifying motifs in unalignedbiopolymer sequences. The aim of MEME is to discover new motifs in a set of biopolymer sequences where little or nothing is known in advance about any motifs that may be present. MEME innovati ..."
Abstract
-
Cited by 166 (7 self)
- Add to MetaCart
. The MEME algorithm extends the expectation maximization (EM) algorithm for identifying motifs in unalignedbiopolymer sequences. The aim of MEME is to discover new motifs in a set of biopolymer sequences where little or nothing is known in advance about any motifs that may be present. MEME innovations expand the range of problems which can be solved using EM and increase the chance of finding good solutions. First, subsequences which actually occur in the biopolymer sequences are used as starting points for the EM algorithm to increase the probability of finding globally optimal motifs. Second, the assumption that each sequence contains exactly one occurrence of the shared motif is removed. This allows multiple appearances of a motif to occur in any sequence and permits the algorithm to ignore sequences with no appearance of the shared motif, increasing its resistance to noisy data. Third, a method for probabilistically erasing shared motifs after they are found is incorporated so tha...
Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology
, 1996
"... This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences, when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein dat ..."
Abstract
-
Cited by 105 (20 self)
- Add to MetaCart
This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences, when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined with observed amino acid frequencies, to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, such that remotely related family members can be more reliably recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and other methods for computing these expected amino acid distributions in database search, resulting in fewer false positives and false negatives for the families tested. This paper corrects a previously p...
Algorithms for Extracting Structured Motifs Using a Suffix Tree With an Application to Promoter and Regulatory Site Consensus Identification
, 2000
"... This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. Structured motifs may be described as an ordered collection of p # 1 "boxes" (each box corresponding to one part of the structured motif), p substitution rates (one for each box) and p ..."
Abstract
-
Cited by 71 (5 self)
- Add to MetaCart
This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. Structured motifs may be described as an ordered collection of p # 1 "boxes" (each box corresponding to one part of the structured motif), p substitution rates (one for each box) and p - 1 intervals of distance (one for each pair of successive boxes in the collection). The contents of the boxes -- that is, the motifs themselves -- are unknown at the start of the algorithm. This is precisely what the algorithms are meant to find. A suffix tree is used for finding such motifs. The algorithms are efficient enough to be able to infer site consensi, such as, for instance, promoter sequences or regulatory sites, from a set of unaligned sequences corresponding to the non coding regions upstream from all genes of a genome. In particular, both algorithms time complexity scales linearly with N 2 n where n is the average length of the sequences and N their number. An application t...
Stochastic Context-Free Grammars for Modeling RNA
, 1993
"... this paper, we apply stochastic context-free grammars (SCFGs) to the problems of statistical modeling, database searching, multiple alignment, and prediction of the secondary structure of RNA families. This approach is highly related to our previous work on modeling protein families with HMMs [HKMS9 ..."
Abstract
-
Cited by 36 (4 self)
- Add to MetaCart
this paper, we apply stochastic context-free grammars (SCFGs) to the problems of statistical modeling, database searching, multiple alignment, and prediction of the secondary structure of RNA families. This approach is highly related to our previous work on modeling protein families with HMMs [HKMS93, KBM
Smooth On-Line Learning Algorithms for Hidden Markov Models
, 1994
"... he modeling and analysis of DNA and protein sequences in biology (Baldi et al. (1992) and (1993), Cardon and Stormo (1992), Haussler et al. (1992), Krogh et al. (1993), and references therein) and optical character recognition (Levin and Pieraccini (1993)). A first order HMMM is characterized by a ..."
Abstract
-
Cited by 36 (5 self)
- Add to MetaCart
he modeling and analysis of DNA and protein sequences in biology (Baldi et al. (1992) and (1993), Cardon and Stormo (1992), Haussler et al. (1992), Krogh et al. (1993), and references therein) and optical character recognition (Levin and Pieraccini (1993)). A first order HMMM is characterized by a set of states, an alphabet of symbols, a probability transition matrix T = (t ij ) and a probability emission matrix E = (e ij ). The parameter t ij (resp. e ij ) represents the probability of transition from state i to state j (resp. of emission of symbol j from state i). HMMs can be viewed as adaptive systems: given a training sequence of symbols O, the parameters of a HMM can be iteratively adjusted in order the optimize the fit between the model and the data, as measu
Significantly Lower Entropy Estimates for Natural DNA Sequences
- Journal of Computational Biology
, 1996
"... If DNA were a random string over its alphabet fA; C; G; Tg, an optimal code would assign 2 bits to each nucleotide. We imagine DNA to be a highly ordered, purposeful molecule, and might therefore reasonably expect statistical models of its string representation to produce much lower entropy estima ..."
Abstract
-
Cited by 28 (1 self)
- Add to MetaCart
If DNA were a random string over its alphabet fA; C; G; Tg, an optimal code would assign 2 bits to each nucleotide. We imagine DNA to be a highly ordered, purposeful molecule, and might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than five-fold (1.66) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principle source of these improvements, is the formal inclusion of inexac...
Extracting structured motifs using a suffix tree - algorithms and application to promoter consensus identification
- In Proceedings of RECOMB 2000
, 2000
"... promoter consensus identification ..."
Inferring regulatory elements from a whole genome. An analysis of Helicobacter pylori sigma(80) family of promoter signals
- J. Mol. Biol
, 2000
"... binding site. E-mail address of the corresponding author: ..."
Abstract
-
Cited by 21 (3 self)
- Add to MetaCart
binding site. E-mail address of the corresponding author:

