## A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA (1996)

### BibTeX

@INPROCEEDINGS{Kulp96ageneralized,
    author = {David Kulp and David Haussler and Martin G. Reese and Frank H. Eeckman},
    title = {A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA},
    booktitle = {},
    year = {1996},
    pages = {134--142},
    publisher = {AAAI Press}
}

### Abstract

We present a statistical model of genes in DNA. A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence (Stormo & Haussler 1994). Probabilities are assigned to transitions between states in the GHMM and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize these probabilities using a standardized training set. Given a new candidate sequence, the best parse is deduced from the model using a dynamic programming algorithm to identify the path through the model with maximum probability. The GHMM is flexible and modular, so new sensors and additional states can be inserted easily. In addition, it provides simple solutions for integrating cardinality constraints, reading frame constraints, "indels", and homology searching. The description and results of an implementation of such a gene-finding model, called Genie, are presented. The exon sensor is a codon fre...
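The dynamic programming step described in the abstract can be illustrated with a small segment-based Viterbi search, in which each state emits a whole subsequence rather than a single base. This is a minimal sketch under stated assumptions and not Genie's implementation: the function `ghmm_viterbi`, the helper `toy_log_emit`, the two states `C` and `N`, the uniform length model, and every probability below are hypothetical and chosen only to make the example runnable.

```python
import math

NEG_INF = float("-inf")


def ghmm_viterbi(seq, states, log_init, log_trans, log_emit, max_len):
    """Return (best_log_prob, parse) for a generalized HMM.

    parse is a list of (state, segment) pairs covering seq in order.
    All model arguments are hypothetical stand-ins for trained sensors.
    """
    n = len(seq)
    # best[j][q]: best log-probability of parsing seq[:j] so that the
    # last segment ends at j and was emitted by state q.
    best = [{q: NEG_INF for q in states} for _ in range(n + 1)]
    back = [{q: None for q in states} for _ in range(n + 1)]

    for j in range(1, n + 1):
        for q in states:
            # Try every allowed start i for a segment seq[i:j] emitted by q.
            for i in range(max(0, j - max_len), j):
                emit = log_emit(q, seq[i:j])
                if emit == NEG_INF:
                    continue
                if i == 0:
                    # First segment: use the initial-state probability.
                    score, prev = log_init.get(q, NEG_INF) + emit, None
                else:
                    # Otherwise extend the best parse that ends at i.
                    score, prev = NEG_INF, None
                    for p in states:
                        s = best[i][p] + log_trans.get((p, q), NEG_INF) + emit
                        if s > score:
                            score, prev = s, p
                if score > best[j][q]:
                    best[j][q] = score
                    back[j][q] = (i, prev)

    # Trace back from the best final state.
    q = max(states, key=lambda s: best[n][s])
    total = best[n][q]
    if total == NEG_INF:
        return NEG_INF, []
    parse, j = [], n
    while q is not None:
        i, prev = back[j][q]
        parse.append((q, seq[i:j]))
        j, q = i, prev
    parse.reverse()
    return total, parse


# Toy model (all values are made up): "C" favors G/C, "N" favors A/T.
MAX_LEN = 4
BASE_P = {
    "C": {"G": 0.4, "C": 0.4, "A": 0.1, "T": 0.1},
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
}


def toy_log_emit(state, segment):
    # log P(segment | state) = log P(length | state) + log P(bases | length, state),
    # with a uniform length distribution over 1..MAX_LEN.
    length_term = math.log(1.0 / MAX_LEN)
    return length_term + sum(math.log(BASE_P[state][b]) for b in segment)


log_init = {"C": math.log(0.5), "N": math.log(0.5)}
log_trans = {("C", "N"): 0.0, ("N", "C"): 0.0}  # states must alternate

total, parse = ghmm_viterbi("GGCCAATT", ["C", "N"], log_init, log_trans,
                            toy_log_emit, MAX_LEN)
print(parse)
```

The search costs on the order of `len(seq) * max_len * len(states)**2` emission and transition lookups, which is why a bound on segment length (or a pre-filtered set of candidate segment boundaries) matters in practice.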

### Citations

832 | An introduction to Hidden Markov Models
- Rabiner, Juang
- 1986
Citation Context ...rforms as well as the better published gene-finding systems when compared against a standard test set. Methods System Framework Hidden Markov Models have been used for decades in pattern recognition (Rabiner & Juang 1986). More recently, their applicability to computational biology has gained recognition, see e.g. (Krogh et al. 1994). In (Krogh, Mian, & Haussler 1994), an HMM was built for identifying gene structure ...

523 | Hidden Markov models in computational biology: Applications to protein modeling
- Krogh, Brown, et al.
- 1994
Citation Context ... Framework Hidden Markov Models have been used for decades in pattern recognition (Rabiner & Juang 1986). More recently, their applicability to computational biology has gained recognition, see e.g. (Krogh et al. 1994). In (Krogh, Mian, & Haussler 1994), an HMM was built for identifying gene structure in E. coli. HMMs have been generalized to allow one state in the model to generate more than one symbol (Stormo &...

187 | Evaluation of gene structure prediction programs
- Burset, Guigo
- 1996
Citation Context ...lable via anonymous FTP from www-hgc.lbl.gov in directory /pub/genesets/. For comparison with other gene-finding systems, we also tested Genie against a second data set, provided by Burset and Guigo (Burset & Guigo 1996). This data set of 570 genes from many different organisms was used in (Burset & Guigo 1996) to compare the effectiveness of many different gene-finders. Our system, like most of those tested in (Bur...

164 | GENMARK: parallel gene recognition for both DNA strands
- Borodovsky, McIninch
- 1993
Citation Context ...istical measures with database homology searching to identify gene features (see, for example, FGENEH (Solovyev, A., & Lawrence 1994), GRAILII (Xu et al. 1994), GenLang (Dong & Searls 1994), GENMARK (Borodovsky & McIninch 1993), and GeneID (Guigo et al. 1992)). The development of gene-finding systems raises research questions regarding the effective and efficient implementation of the system separate from the efficacy of i...

102 | Prediction of human mRNA donor and acceptor sites from the DNA sequence
- Brunak, Engelbrecht, et al.

78 | Assessment of protein coding measures
- Fickett, Tung
- 1992
Citation Context ...ial, internal and terminal exons. Research historically could be categorized as either statistical or homology based, and most research until recently aimed to characterize a single feature. Fickett (Fickett & Tung 1992) provides an overview and evaluation of many statistical measures for signal and content sensors. Recently, gene-finding systems have been developed that employ many of the known recognition techniqu...

76 | Prediction of gene structure
- Guigó, Knudsen, et al.
- 1992
Citation Context ...earching to identify gene features (see, for example, FGENEH (Solovyev, A., & Lawrence 1994), GRAILII (Xu et al. 1994), GenLang (Dong & Searls 1994), GENMARK (Borodovsky & McIninch 1993), and GeneID (Guigo et al. 1992)). The development of gene-finding systems raises research questions regarding the effective and efficient implementation of the system separate from the efficacy of its components. In this paper, we...

69 | Gene structure prediction by linguistic methods
- Dong, Searls
- 1994
Citation Context ... methods combine multiple statistical measures with database homology searching to identify gene features (see, for example, FGENEH (Solovyev, A., & Lawrence 1994), GRAILII (Xu et al. 1994), GenLang (Dong & Searls 1994), GENMARK (Borodovsky & McIninch 1993), and GeneID (Guigo et al. 1992)). The development of gene-finding systems raises research questions regarding the effective and efficient implementation of the ...

65 | Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Res
- Solovyev, Salamov, et al.
- 1994

63 | A hidden Markov model that finds genes in E. coli DNA
- Krogh, Mian, Haussler
- 1994
Citation Context ... Framework Hidden Markov Models have been used for decades in pattern recognition (Rabiner & Juang 1986). More recently, their applicability to computational biology has gained recognition, see e.g. (Krogh et al. 1994). In (Krogh, Mian, & Haussler 1994), an HMM was built for identifying gene structure in E. coli. HMMs have been generalized to allow one state in the model to generate more than one symbol (Stormo &...

56 | Neural Networks for Speech and Sequence Recognition
- Bengio
- 1996
Citation Context ...on of the parse as an ordered set of state/sequence pairs: $\phi = \{(q_1, x_1), (q_2, x_2), \ldots, (q_k, x_k)\}$. Then equation 4 can be generalized (Fong 1995; Auger & Lawrence 1989; Sankoff 1992; Bengio 1996) as $P(X, \phi) = P(q_1 \mid B) \left( \prod_{i=1}^{k} P(x_i \mid q_i) \right) \left( \prod_{i=1}^{k-1} P(q_{i+1} \mid \mathrm{node}(q_i)) \right)$ (5). Each term $P(x_i \mid q_i)$ can be further decomposed using $P(x_i \mid q_i) = P(x_i \mid l(x_i), q_i)$ P(l(x ...

52 | Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks
- Snyder, Stormo
- 1993
Citation Context ...e efficacy of its components. In this paper, we present the results of the implementation of a gene-finding system as a Generalized Hidden Markov Model. Our system is similar in design to GeneParser (Snyder & Stormo 1993), but is based on a rigorous probabilistic framework. We show how a GHMM offers a simple elegant model of genes in eukaryotic DNA. The probabilistic framework provides meaningful answers (in a probab...

31 | Algorithms for the optimal identification of segment neighborhoods
- Auger, Lawrence
- 1989
Citation Context ...ion of subsequences) and a redefinition of the parse as an ordered set of state/sequence pairs: $\phi = \{(q_1, x_1), (q_2, x_2), \ldots, (q_k, x_k)\}$. Then equation 4 can be generalized (Fong 1995; Auger & Lawrence 1989; Sankoff 1992; Bengio 1996) as $P(X, \phi) = P(q_1 \mid B) \left( \prod_{i=1}^{k} P(x_i \mid q_i) \right) \left( \prod_{i=1}^{k-1} P(q_{i+1} \mid \mathrm{node}(q_i)) \right)$ (5). Each term $P(x_i \mid q_i)$ can be further decomposed using $P(x_i \mid q_i)$ = P ...

20 | Prediction of exon-intron structure by a dynamic programming approach. BioSystems
- Gelfand, Roytberg
- 1993
Citation Context ...et al (Krogh et al. 1994). Dynamic Program for Optimizing Parse The Viterbi algorithm is used to maximize equation 5 for $\phi$. This approach is well described elsewhere including (Snyder & Stormo 1993; Gelfand & Roytberg 1993; Auger & Lawrence 1989; Sankoff 1992; Gelfand & Roytberg 1995; Bengio 1996). Notable differences from the standard dynamic programming algorithm relate to accommodating the GHMM framework. Specifical...

20 | Optimally parsing a sequence into different classes based on multiple types of evidence
- Stormo, Haussler
- 1994
Citation Context ...@genome.lbl.gov Abstract We present a statistical model of genes in DNA. A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence (Stormo & Haussler 1994). Probabilities are assigned to transitions between states in the GHMM and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize thes...

7 | Novel neural network prediction systems for human promoters and splice sites
- Reese, Eeckman
- 1995
Citation Context ...al current work includes designing a graphical interface for use by biologists at large-scale sequencing centers such as Lawrence Berkeley National Laboratory, incorporating a promoter signal sensor (Reese 1995), and providing multiple gene recognition capability. We hope to report results regarding these enhancements by the time of the conference. Acknowledgments We wish to thank Gary Stormo for his suppor...

6 | Efficient optimal decomposition of a sequence into disjoint regions, each matched to some template in an inventory
- Sankoff
- 1992
Citation Context ...d a redefinition of the parse as an ordered set of state/sequence pairs: $\phi = \{(q_1, x_1), (q_2, x_2), \ldots, (q_k, x_k)\}$. Then equation 4 can be generalized (Fong 1995; Auger & Lawrence 1989; Sankoff 1992; Bengio 1996) as $P(X, \phi) = P(q_1 \mid B) \left( \prod_{i=1}^{k} P(x_i \mid q_i) \right) \left( \prod_{i=1}^{k-1} P(q_{i+1} \mid \mathrm{node}(q_i)) \right)$ (5). Each term $P(x_i \mid q_i)$ can be further decomposed using $P(x_i \mid q_i)$ = P(x_i | l(x_i), ...

6 | Structure of vertebrate genes: a statistical analysis implicating selection
- Smith
- 1988
Citation Context ...an internal exon -- then the distribution of the number of internal exons is geometric over $P(E \mid A)$. Experimental evidence indicates that the number of exons in a gene is not geometric (Hawkins 1988)(Smith 1988). Hence we would like to impose an arbitrary distribution constraint on the "cardinality" of exons (Wu 1995). The solution requires the removal of all cycles in the GHMM by virtually "unspooling" the...

5 | Optimally parsing a sequence into different classes based on multiple types of information
- Stormo, Haussler
- 1994
Citation Context ...rg@cse.ucsc.edu Abstract We present a statistical model of genes in DNA. A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence (Stormo & Haussler 1994). Probabilities are assigned to transitions between states in the GHMM and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize thes...

4 | Algorithms for the optimal identification of segment neighborhoods
- Auger, Lawrence
- 1989
Citation Context ...994). Dynamic Program for Optimizing Parse The Viterbi algorithm is used to maximize equation 5 for $\phi$. This approach is well described elsewhere including (Snyder & Stormo 1993; Gelfand & Roytberg 1993; Auger & Lawrence 1989; Sankoff 1992; Gelfand & Roytberg 1995; Bengio 1996). Notable differences from the standard dynamic programming algorithm relate to accommodating the GHMM framework. Specifically, a first pass through th...

4 | Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks
- Snyder, Stormo
- 1993
Citation Context ...he efficacy of its components. In this paper, we present the results of the implementation of a gene-finding system as a Generalized Hidden Markov Model. Our system is similar in design to GeneParser (Snyder & Stormo 1993), but is based on a rigorous probabilistic framework. We show how a GHMM offers a simple elegant model of genes in eukaryotic DNA. The probabilistic framework provides meaningful answers (in a probabil...

3 | Efficient optimal decomposition of a sequence into disjoint regions, each matched to some template in an inventory
- Sankoff
- 1992
Citation Context ...ion of subsequences) and a redefinition of the parse as an ordered set of state/sequence pairs: $\phi = \{(q_1, x_1), (q_2, x_2), \ldots, (q_k, x_k)\}$. Then equation 4 can be generalized (Fong 1995; Auger & Lawrence 1989; Sankoff 1992; Bengio 1996) as $P(X, \phi) = P(q_1 \mid B) \left( \prod_{i=1}^{k} P(x_i \mid q_i) \right) \left( \prod_{i=1}^{k-1} P(q_{i+1} \mid \mathrm{node}(q_i)) \right)$ (5). Each term $P(x_i \mid q_i)$ can be further decomposed using $P(x_i \mid q_i) = P(x_i \mid l(x_i), q_i) \, P(l(x_i) \mid q_i)$ where l(x_i...

2 | A survey on intron and exon lengths. Nucl. Acids Res. 16, 9893-9908
- Hawkins
- 1988
Citation Context ...n acceptor to an internal exon -- then the distribution of the number of internal exons is geometric over $P(E \mid A)$. Experimental evidence indicates that the number of exons in a gene is not geometric (Hawkins 1988)(Smith 1988). Hence we would like to impose an arbitrary distribution constraint on the "cardinality" of exons (Wu 1995). The solution requires the removal of all cycles in the GHMM by virtually "uns...

2 | A phase-specific dynamic programming algorithm for parsing gene structure. Gene-Finding and Gene Structure Prediction Workshop
- Wu
- 1995
Citation Context ...al evidence indicates that the number of exons in a gene is not geometric (Hawkins 1988)(Smith 1988). Hence we would like to impose an arbitrary distribution constraint on the "cardinality" of exons (Wu 1995). The solution requires the removal of all cycles in the GHMM by virtually "unspooling" the graph. Figure 2 shows the unspooled version of figure 1. The transition probabilities $P(E_{i+1} \mid A_i)$ can b...

1 | Parsing of DNA sequences using dynamic programming with a state machine model. unpublished manuscript
- Fong
- 1995
Citation Context ... concatenation of subsequences) and a redefinition of the parse as an ordered set of state/sequence pairs: $\phi = \{(q_1, x_1), (q_2, x_2), \ldots, (q_k, x_k)\}$. Then equation 4 can be generalized (Fong 1995; Auger & Lawrence 1989; Sankoff 1992; Bengio 1996) as $P(X, \phi) = P(q_1 \mid B) \left( \prod_{i=1}^{k} P(x_i \mid q_i) \right) \left( \prod_{i=1}^{k-1} P(q_{i+1} \mid \mathrm{node}(q_i)) \right)$ (5). Each term $P(x_i \mid q_i)$ can be further decomposed u...

1 | Dynamic programming for gene recognition
- Gelfand, Roytberg
- 1995
Citation Context ...e The Viterbi algorithm is used to maximize equation 5 for $\phi$. This approach is well described elsewhere including (Snyder & Stormo 1993; Gelfand & Roytberg 1993; Auger & Lawrence 1989; Sankoff 1992; Gelfand & Roytberg 1995; Bengio 1996). Notable differences from the standard dynamic programming algorithm relate to accommodating the GHMM framework. Specifically, a first pass through the sequence establishes candidate tr...

1 | A Hidden Markov Model that finds genes
- Krogh, et al.
- 1994

1 | Assessment of protein coding measures. Nucl. Acids Res
- Fickett, Tung
- 1992
Citation Context ...tial, internal and terminal exons. Research historically could be categorized as either statistical or homology based, and most research until recently aimed to characterize a single feature. Fickett (Fickett & Tung 1992) provides an overview and evaluation of many statistical measures for signal and content sensors. Recently, gene-finding systems have been developed that employ many of the known recognition techniqu...
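
The citation contexts for Hawkins (1988), Smith (1988), and Wu (1995) note that a cyclic model implies a geometric distribution over the number of internal exons, and that "unspooling" the graph allows an arbitrary distribution instead. A short worked derivation may make this concrete. The symbols $p$, $N$, and $h_i$ are introduced here for illustration only and are not the paper's notation, and the derivation assumes the only branch on the cycle is the choice at the acceptor.

$$
P(N = n) = p^{n}(1 - p), \qquad n = 0, 1, 2, \ldots, \qquad p = P(E \mid A)
$$

If the cycle is unrolled so that the probability of continuing to another internal exon after $i$ internal exons is a separate parameter $h_i$, then

$$
P(N = n) = (1 - h_n) \prod_{i=0}^{n-1} h_i, \qquad h_i = \frac{P(N > i)}{P(N \ge i)}
$$

The product telescopes to $P(N \ge n)$ and the leading factor equals $P(N = n) / P(N \ge n)$, so any target distribution over exon counts can be reproduced by a suitable choice of the $h_i$; the geometric case is recovered when every $h_i$ equals $p$.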