A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA
1996
Cited by 122 (13 self)
We present a statistical model of genes in DNA. A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence (Stormo & Haussler 1994). Probabilities are assigned to transitions between states in the GHMM and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize these probabilities using a standardized training set. Given a new candidate sequence, the best parse is deduced from the model using a dynamic programming algorithm to identify the path through the model with maximum probability. The GHMM is flexible and modular, so new sensors and additional states can be inserted easily. In addition, it provides simple solutions for integrating cardinality constraints, reading frame constraints, "indels", and homology searching. The description and results of an implementation of such a gene-finding model, called Genie, are presented. The exon sensor is a codon fre...
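The "path through the model with maximum probability" mentioned in the abstract above can be illustrated with a Viterbi-style dynamic program. The following is only a minimal sketch for an ordinary two-state HMM, not the GHMM or Genie itself: the state names, the function name `viterbi`, and every probability are made-up assumptions chosen for illustration.

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return (log-probability, path) of the most probable state path for seq."""
    # best[s] holds the best log-score and path among all paths ending in state s.
    best = {
        s: (math.log(start_p[s]) + math.log(emit_p[s][seq[0]]), [s])
        for s in states
    }
    for base in seq[1:]:
        step = {}
        for s in states:
            # Pick the predecessor that maximizes the score of arriving in s.
            prev = max(states, key=lambda p: best[p][0] + math.log(trans_p[p][s]))
            score = (best[prev][0] + math.log(trans_p[prev][s])
                     + math.log(emit_p[s][base]))
            step[s] = (score, best[prev][1] + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])

# Hypothetical toy parameters (all strictly positive, so log() is defined).
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

log_prob, path = viterbi("GGCCAATT", STATES, START, TRANS, EMIT)
print(path)
```

Working in log space avoids numerical underflow on long sequences. A generalized HMM would additionally let each state emit a variable-length segment, which this sketch does not attempt.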
Gene Structure Prediction by Linguistic Methods
Genomics, 1994
Cited by 55 (2 self)
The higher-order structure of genes and other features of biological sequences can be described by means of formal grammars. These grammars can then be used by general-purpose parsers to detect and assemble such structures by means of syntactic pattern recognition. We describe a grammar and parser for eukaryotic protein-encoding genes, which by some measures is as effective as current connectionist and combinatorial algorithms in predicting gene structures for sequence database entries. Parameters on the grammar rules are optimized for several different species, and mixing experiments are performed to determine the degree of species specificity and the relative importance of compositional, signal-based, and syntactic components in gene prediction.
Introduction
Formal language theory views languages as sets of strings over some alphabet, and specifies potentially infinite languages with concise sets of rules called grammars [10]. Grammars are an exceptionally well-studied methodology, fami...
Improved Splice Site Detection in Genie
J. Comput. Biol., 1997
Cited by 41 (3 self)
We present an improved splice site predictor for the genefinding program Genie. Genie is based on a generalized Hidden Markov Model (GHMM) that describes the grammar of a legal parse of a multi-exon gene in a DNA sequence. In Genie, probabilities are estimated for gene features by using dynamic programming to combine information from multiple content and signal sensors, including sensors that integrate matches to homologous sequences from a database. One of the hardest problems in genefinding is to determine the complete gene structure correctly. The splice site sensors are the key signal sensors that address this problem. We replaced the existing splice site sensors in Genie with two novel neural networks based on dinucleotide frequencies. Using these novel sensors, Genie shows significant improvements in the sensitivity and specificity of gene structure identification. Experimental results in tests using a standard set of annotated genes showed that Genie identified 86% of coding nuc...
A Method for Identifying Splice Sites and Translational Start Sites in Eukaryotic mRNA
1997
Cited by 40 (1 self)
This paper describes a new method for determining the consensus sequences that signal the start of translation and the boundaries between exons and introns (donor and acceptor sites) in eukaryotic mRNA. The method takes into account the dependencies between adjacent bases, in contrast to the usual technique of considering each position independently. When coupled with a dynamic program to compute the most likely sequence, new consensus sequences emerge. The consensus sequence information is summarized in conditional probability matrices which, when used to locate signals in uncharacterized genomic DNA, have greater sensitivity and specificity than conventional matrices. Species-specific versions of these matrices are especially effective at distinguishing true and false sites.
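The idea of a conditional probability matrix that captures dependencies between adjacent bases can be sketched as a first-order position-specific model: each position stores P(base | previous base) rather than an independent P(base). This is an illustrative sketch only, not the paper's procedure. The function names, the pseudocount smoothing, and the four training strings are assumptions invented for the example.

```python
import math

BASES = "ACGT"

def train_conditional_matrix(sites, pseudocount=1.0):
    """Estimate P(base at i | base at i-1) from equal-length aligned sites."""
    length = len(sites[0])
    # Position 0 has no predecessor, so it gets a plain base distribution.
    first_counts = {b: pseudocount for b in BASES}
    for site in sites:
        first_counts[site[0]] += 1
    total = sum(first_counts.values())
    first = {b: c / total for b, c in first_counts.items()}

    cond = []  # cond[i - 1][prev][cur] for positions i = 1 .. length - 1
    for i in range(1, length):
        counts = {p: {c: pseudocount for c in BASES} for p in BASES}
        for site in sites:
            counts[site[i - 1]][site[i]] += 1
        cond.append({
            p: {c: counts[p][c] / sum(counts[p].values()) for c in BASES}
            for p in BASES
        })
    return first, cond

def log_score(candidate, first, cond):
    """Log-probability of a candidate site under the conditional matrices."""
    score = math.log(first[candidate[0]])
    for i in range(1, len(candidate)):
        score += math.log(cond[i - 1][candidate[i - 1]][candidate[i]])
    return score

# Hypothetical aligned training sites (not real data).
SITES = ["AGGTAA", "AGGTGA", "CGGTAA", "AGGTAC"]
first, cond = train_conditional_matrix(SITES)
print(log_score("AGGTAA", first, cond) > log_score("TTTTTT", first, cond))
```

The pseudocount keeps every conditional probability positive, so unseen base pairs lower the score instead of making it undefined.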
Computational Methods for the Identification of Genes in Vertebrate Genomic Sequences
Hum. Mol. Genet., 1997
Cited by 36 (3 self)
Research into new methods to identify genes in anonymous genomic sequences has been going on for more than 15 years. Over this period of time, the field has evolved from the designing of programs to identify protein coding regions in compact mitochondrial or bacterial genomes, to the challenge of predicting the detailed organization of multi-exon vertebrate genes. The best program currently available perfectly locates more than 80% of the internal coding exons, and only 5% of the predictions do not overlap a real exon. Given such accuracy, computational methods are indeed very useful; however, they do not alleviate the need for experimental validation. Although performance is satisfactory for identifying the coding moiety of genes (internal coding exons), the determination of the full extent of the transcript (5′ and 3′ extremities of the gene) and the location of promoter regions are still unreliable. As the human and mouse genome sequencing projects enter a production mode, the fully automated annotation of megabase-long anonymous genomic sequences is the next big challenge in bioinformatics.
Finding Genes in DNA with a Hidden Markov Model
Journal of Computational Biology, 1997
Cited by 36 (0 self)
This study describes a new Hidden Markov Model (HMM) system for segmenting uncharacterized genomic DNA sequences into exons, introns, and intergenic regions. Separate HMM modules were designed and trained for specific regions of DNA: exons, introns, intergenic regions, and splice sites. The models were then tied together to form a biologically feasible topology. The integrated HMM was trained further on a set of eukaryotic DNA sequences, and tested by using it to segment a separate set of sequences. The resulting HMM system, which is called VEIL (Viterbi Exon-Intron Locator), obtains an overall accuracy on test data of 92% of total bases correctly labelled, with a correlation coefficient of 0.73. Using the more stringent test of exact exon prediction, VEIL correctly located both ends of 53% of the coding exons, and 49% of the exons it predicts are exactly correct. These results compare favorably to the best previous results for gene structure prediction, and demonstrate the benefits of...
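The base-level figures quoted above (fraction of bases correctly labelled, and a correlation coefficient) can be computed from a per-base comparison of predicted and true labels. The sketch below assumes the correlation coefficient is the usual one derived from the 2x2 confusion counts of coding versus non-coding bases; the abstract does not define it, so that choice, the label encoding ("C" for coding, "N" for non-coding), and the function name are all assumptions.

```python
import math

def base_level_metrics(true_labels, pred_labels):
    """Return (accuracy, correlation coefficient) for per-base C/N labels."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, pred_labels):
        if truth == "C" and pred == "C":
            tp += 1  # coding base predicted as coding
        elif truth == "N" and pred == "N":
            tn += 1  # non-coding base predicted as non-coding
        elif truth == "N" and pred == "C":
            fp += 1  # non-coding base wrongly predicted as coding
        else:
            fn += 1  # coding base wrongly predicted as non-coding
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # The coefficient is undefined when a row or column of the table is empty;
    # returning 0.0 in that case is a convention chosen for this sketch.
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, cc

# Hypothetical 8-base example: one coding base is missed by the prediction.
accuracy, cc = base_level_metrics("CCCCNNNN", "CCCNNNNN")
print(accuracy, round(cc, 4))
```

Exact-exon measures, such as the fraction of exons with both ends correct, need exon boundary coordinates rather than per-base labels and are not covered by this sketch.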
KDD for Science Data Analysis: Issues and Examples
In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996
Cited by 33 (2 self)
The analysis of the massive data sets collected by scientific instruments demands automation as a prerequisite to analysis. There is an urgent need to create an intermediate level at which scientists can operate effectively, isolating them from the massive sizes and harnessing human analysis capabilities to focus on tasks in which machines do not even remotely approach humans: namely, creative data analysis, theory and hypothesis formation, and drawing insights into underlying phenomena. We give an overview of the main issues in the exploitation of scientific datasets, present five case studies where KDD tools play important and enabling roles, and conclude with future challenges for data mining and KDD techniques in science data analysis.
Keywords: Applications in Science, Data Analysis, overview article, large databases, automated analysis, scientific data sets, scientific discovery.
To appear: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining...
A Probabilistic Learning Approach to Whole-Genome Operon Prediction
In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, 2000
Cited by 33 (7 self)
We present a computational approach to predicting operons in the genomes of prokaryotic organisms. Our approach uses machine learning methods to induce predictive models for this task from a rich variety of data types including sequence data, gene expression data, and functional annotations associated with genes. We use multiple learned models that individually predict promoters, terminators and operons themselves. A key part of our approach is a dynamic programming method that uses our predictions to map every known and putative gene in a given genome into its most probable operon. We evaluate our approach using data from the E. coli K-12 genome.
Finding Genes in Human DNA with a Hidden Markov Model
Journal of Computational Biology, 1996
Cited by 24 (3 self)
This study describes a new Hidden Markov Model (HMM) system for segmenting uncharacterized human genomic DNA into exons, introns, and intergenic regions. Three separate models were designed for each of the three types of human DNA (exons, introns, and intergenic), and training was performed on a corpus collected specifically for this project. The model was then augmented using biological knowledge about splice junction consensus sites, which were used to tie together the separately trained models. The resulting integrated model was then used to segment a test set of human DNA sequences that were not used during training. The initial results are highly encouraging and indicate that an HMM can form the basis of an effective gene-finding system.
1 Introduction
Robust computational solutions to the gene-finding problem are a valuable resource for the Human Genome Program and for the molecular biology community at large. Software that can reliably identify putative genes in DNA sequence ca...

