Results 1  10
of
93
Fitting a mixture model by expectation maximization to discover motifs in biopolymers
 Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology
, 1994
"... ABSTRACT: The algorithm described in this paper discovers one or more motifs in a collection of DNA or protein sequences by using the technique of expectation maximization to fit a twocomponent finite mixture model to the set of sequences. Multiple motifs are found by fitting a twocomponent finite ..."
Abstract

Cited by 530 (4 self)
 Add to MetaCart
ABSTRACT: The algorithm described in this paper discovers one or more motifs in a collection of DNA or protein sequences by using the technique of expectation maximization to fit a twocomponent finite mixture model to the set of sequences. Multiple motifs are found by fitting a twocomponent finite mixture model to the data, probabilistically erasing the occurrences of the motif thus found, and repeating the process to find successive motifs. The algorithm requires only a set of sequences and a number specifying the width of the motifs as input. It returns a model of each motif and a threshold which together can be used as a Bayesoptimal classifier for searching for occurrences of the motif in other databases. The algorithm estimates how many times each motif occurs in the input dataset and outputs an alignment of the occurrences of the motif. The algorithm is capable of discovering several different motifs with differing numbers of occurrences in a single dataset. Motifs are discovered by treating the set of sequences as though they were created by a stochastic process which can be modelled as a mixture of two densities, one of which generated the occurrences of the motif, and the other the rest of the positions in the sequences. Expectation maximization is used to estimate the parameters of the two densities and the mixing
Hidden Markov models in computational biology: applications to protein modeling
 JOURNAL OF MOLECULAR BIOLOGY
, 1994
"... Hidden.Markov Models (HMMs) are applied t.0 the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. These methods are demonstrated the on globin family, the protein kinase catalytic domain, and the EFhand calcium binding moti ..."
Abstract

Cited by 524 (35 self)
 Add to MetaCart
Hidden.Markov Models (HMMs) are applied t.0 the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. These methods are demonstrated the on globin family, the protein kinase catalytic domain, and the EFhand calcium binding motif. In each case the parameters of an HMM are estimated from a training set of unaligned sequences. After the HMM is built, it is used to obtain a multiple alignment of all the training sequences. It is also used to search the. SWISSPROT 22 database for other sequences. that are members of the given protein family, or contain the given domain. The Hi " produces multiple alignments of good quality that agree closely with the alignments produced by programs that incorporate threedimensional structural information. When employed in discrimination tests (by examining how closely the sequences in a database fit the globin, kinase and EFhand HMMs), the '\ HMM is able to distinguish members of these families from nonmembers with a high degree of accuracy. Both the HMM and PROFILESEARCH (a technique used to search for relationships between a protein sequence and multiply aligned sequences) perform better in these tests than PROSITE (a dictionary of sites and patterns in proteins). The HMM appecvs to have a slight advantage over PROFILESEARCH in terms of lower rates of false
Finding motifs using random projections
, 2001
"... Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif Å of length 15, where each planted instance differs from Å in 4 positions. Whereas previous algorithms all failed to solve this (15,4)motif problem, Pevz ..."
Abstract

Cited by 210 (5 self)
 Add to MetaCart
Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif Å of length 15, where each planted instance differs from Å in 4 positions. Whereas previous algorithms all failed to solve this (15,4)motif problem, Pevzner and Sze introduced algorithms that succeeded. However, their algorithms failed to solve the considerably more difficult (14,4), (16,5), and (18,6)motif problems. We introduce a novel motif discovery algorithm based on the use of random projections of the input’s substrings. Experiments on simulated data demonstrate that this algorithm performs better than existing algorithms and, in particular, typically solves the difficult (14,4), (16,5), and (18,6)motif problems quite efficiently. A probabilistic estimate shows that the small values of � for which the algorithm fails to recover the planted Ð � �motif are in all likelihood inherently impossible to solve. We also present experimental results on realistic biological data by identifying ribosome binding sites in prokaryotes as well as a number of known transcriptional regulatory motifs in eukaryotes. 1. CHALLENGING MOTIF PROBLEMS Pevzner and Sze [23] considered a very precise version of the motif discovery problem of computational biology, which had also been considered by Sagot [26]. Based on this formulation, they issued an algorithmic challenge: Planted Ð � �Motif Problem: Suppose there is a fixed but unknown nucleotide sequence Å (the motif) of length Ð. The problem is to determine Å, givenØ nucleotide sequences each of length Ò, and each containing a planted variant of Å. More precisely, each such planted variant is a substring that is Å with exactly � point substitutions. One instantiation that they labeled “The Challenge Problem ” was parameterized as finding a planted (15,4)motif in Ø � sequences each of length Ò � �. These values of Ò, Ø, andÐ are
Hidden Markov models for sequence analysis: extension and analysis of the basic method
, 1996
"... Hidden Markov models (HMMs) are a highly effective means of modeling a family of unaligned sequences or a common motif within a set of unaligned sequences. The trained HMM can then be used for discrimination or multiple alignment. The basic mathematical description of an HMM and its expectationmaxi ..."
Abstract

Cited by 163 (20 self)
 Add to MetaCart
Hidden Markov models (HMMs) are a highly effective means of modeling a family of unaligned sequences or a common motif within a set of unaligned sequences. The trained HMM can then be used for discrimination or multiple alignment. The basic mathematical description of an HMM and its expectationmaximization training procedure is relatively straightforward. In this paper, we review the mathematical extensions and heuristics that move the method from the theoretical to the practical. Then, we experimentally analyze the effectiveness of model regularization, dynamic model modification, and optimization strategies. Finally it is demonstrated on the SH2 domain how a domain can be found from unaligned sequences using a special model type. The experimental work was completed with the aid of the Sequence Alignment and Modeling software suite. 1 Introduction Since their introduction to the computational biology community (Haussler et al., 1993; Krogh et al., 1994a), hidden Markov models (HMMs...
Approaches to the Automatic Discovery of Patterns in Biosequences
, 1995
"... This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering those patterns which a ..."
Abstract

Cited by 138 (21 self)
 Add to MetaCart
This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering those patterns which are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis presented of the ways in which an assessment can be made of the significance and usefulness of the discovered patterns. It is shown that this problem is related to problems studied in the field of machine learning. The largest part of this paper comprises a review of a number of existing methods developed to solve this problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examples are given of patterns that have been discovered...
Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology
, 1996
"... This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences, when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein dat ..."
Abstract

Cited by 130 (22 self)
 Add to MetaCart
This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences, when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined with observed amino acid frequencies, to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, such that remotely related family members can be more reliably recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and other methods for computing these expected amino acid distributions in database search, resulting in fewer false positives and false negatives for the families tested. This paper corrects a previously p...
Probabilistic discovery of time series motifs
, 2003
"... Several important time series data mining problems reduce to the core task of finding approximately repeated subsequences in a longer time series. In an earlier work, we formalized the idea of approximately repeated subsequences by introducing the notion of time series motifs. Two limitations of thi ..."
Abstract

Cited by 119 (22 self)
 Add to MetaCart
Several important time series data mining problems reduce to the core task of finding approximately repeated subsequences in a longer time series. In an earlier work, we formalized the idea of approximately repeated subsequences by introducing the notion of time series motifs. Two limitations of this work were the poor scalability of the motif discovery algorithm, and the inability to discover motifs in the presence of noise. Here we address these limitations by introducing a novel algorithm inspired by recent advances in the problem of pattern discovery in biosequences. Our algorithm is probabilistic in nature, but as we show empirically and theoretically, it can find time series motifs with very high probability even in the presence of noise or “don’t care ” symbols. Not only is the algorithm fast, but it is an anytime algorithm, producing likely candidate motifs almost immediately, and gradually improving the quality of results over time.
Algorithms for Extracting Structured Motifs Using a Suffix Tree With an Application to Promoter and Regulatory Site Consensus Identification
, 2000
"... This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. Structured motifs may be described as an ordered collection of p # 1 "boxes" (each box corresponding to one part of the structured motif), p substitution rates (one for each box) and p ..."
Abstract

Cited by 87 (7 self)
 Add to MetaCart
This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. Structured motifs may be described as an ordered collection of p # 1 "boxes" (each box corresponding to one part of the structured motif), p substitution rates (one for each box) and p  1 intervals of distance (one for each pair of successive boxes in the collection). The contents of the boxes  that is, the motifs themselves  are unknown at the start of the algorithm. This is precisely what the algorithms are meant to find. A suffix tree is used for finding such motifs. The algorithms are efficient enough to be able to infer site consensi, such as, for instance, promoter sequences or regulatory sites, from a set of unaligned sequences corresponding to the non coding regions upstream from all genes of a genome. In particular, both algorithms time complexity scales linearly with N 2 n where n is the average length of the sequences and N their number. An application t...
Gibbs recursive sampler: finding transcription factor binding sites
 Nucleic Acids Res
, 2003
"... The Gibbs Motif Sampler is a software package for locating common elements in collections of biopolymer sequences. In this paper we describe a new variation of the Gibbs Motif Sampler, the Gibbs Recursive Sampler, which has been developed specifically for locating multiple transcription factor bindi ..."
Abstract

Cited by 59 (5 self)
 Add to MetaCart
The Gibbs Motif Sampler is a software package for locating common elements in collections of biopolymer sequences. In this paper we describe a new variation of the Gibbs Motif Sampler, the Gibbs Recursive Sampler, which has been developed specifically for locating multiple transcription factor binding sites for multiple transcription factors simultaneously in unaligned DNA sequences that may be heterogeneous in DNA composition. Here we describe the basic operation of the webbased version of this sampler. The sampler may be accessed at
On The Closest String and Substring Problems
 Journal of the ACM
, 2002
"... The problem of finding a center string that is `close' to every given string arises in computational molecular biology and coding theory. This problem has two versions: the Closest String problem and the Closest Substring problem. Given a set of strings S = fs 1 ; s 2 ; : : : ; s n g, each of lengt ..."
Abstract

Cited by 55 (14 self)
 Add to MetaCart
The problem of finding a center string that is `close' to every given string arises in computational molecular biology and coding theory. This problem has two versions: the Closest String problem and the Closest Substring problem. Given a set of strings S = fs 1 ; s 2 ; : : : ; s n g, each of length m, the Closest String problem is to find the smallest d and a string s of length m which is within Hamming distance d to each s i 2 S. This problem comes from coding theory when we are looking for a code not too far away from a given set of codes. Closest Substring problem, with an additional input integer L, asks for the smallest d and a string s, of length L, which is within Hamming distance d away from a substring, of length L, of each s i . This problem is much more elusive than the Closest String problem. The Closest Substring problem is formulated from applications in finding conserved regions, identifying genetic drug targets and generating genetic probes in molecular biology. Whether there are efficient approximation algorithms for both problems are major open questions in this area. We present two polynomial time approximation algorithms with approximation ratio 1 + ffl for any small ffl to settle both questions.