Prediction of complete gene structures in human genomic DNA
- J. Mol. Biol
, 1997
Cited by 487 (7 self)
The problem of identifying genes in genomic DNA sequences by computational methods has attracted considerable research attention in recent years. From one point of view, the problem is closely
A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA
, 1996
Cited by 122 (13 self)
We present a statistical model of genes in DNA. A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence (Stormo & Haussler 1994). Probabilities are assigned to transitions between states in the GHMM and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize these probabilities using a standardized training set. Given a new candidate sequence, the best parse is deduced from the model using a dynamic programming algorithm to identify the path through the model with maximum probability. The GHMM is flexible and modular, so new sensors and additional states can be inserted easily. In addition, it provides simple solutions for integrating cardinality constraints, reading frame constraints, "indels", and homology searching. The description and results of an implementation of such a gene-finding model, called Genie, are presented. The exon sensor is a codon fre...
Improved Splice Site Detection in Genie
- J. COMPUT. BIOL
, 1997
Cited by 41 (3 self)
We present an improved splice site predictor for the genefinding program Genie. Genie is based on a generalized Hidden Markov Model (GHMM) that describes the grammar of a legal parse of a multi-exon gene in a DNA sequence. In Genie, probabilities are estimated for gene features by using dynamic programming to combine information from multiple content and signal sensors, including sensors that integrate matches to homologous sequences from a database. One of the hardest problems in genefinding is to determine the complete gene structure correctly. The splice site sensors are the key signal sensors that address this problem. We replaced the existing splice site sensors in Genie with two novel neural networks based on dinucleotide frequencies. Using these novel sensors, Genie shows significant improvements in the sensitivity and specificity of gene structure identification. Experimental results in tests using a standard set of annotated genes showed that Genie identified 86% of coding nuc...
A Method for Identifying Splice Sites and Translational Start Sites in Eukaryotic mRNA
, 1997
Cited by 40 (1 self)
This paper describes a new method for determining the consensus sequences that signal the start of translation and the boundaries between exons and introns (donor and acceptor sites) in eukaryotic mRNA. The method takes into account the dependencies between adjacent bases, in contrast to the usual technique of considering each position independently. When coupled with a dynamic program to compute the most likely sequence, new consensus sequences emerge. The consensus sequence information is summarized in conditional probability matrices which, when used to locate signals in uncharacterized genomic DNA, have greater sensitivity and specificity than conventional matrices. Species-specific versions of these matrices are especially effective at distinguishing true and false sites.
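The idea of conditioning each position on its neighbour can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the pseudocount, and the four toy donor-like sites are all assumptions made for illustration. Each matrix entry estimates P(base at position i | base at position i-1), and a small dynamic program recovers the most likely sequence under the matrix.

```python
import math

BASES = "ACGT"

def train_conditional_matrix(sites, pseudocount=1.0):
    """Estimate first-order conditional probabilities from aligned sites.

    Returns (first, cond): first[b] is P(b) at position 0, and
    cond[i][prev][b] is P(b at i | prev at i-1) for i >= 1
    (cond[0] is an unused uniform placeholder).
    """
    length = len(sites[0])
    first_counts = {b: pseudocount for b in BASES}
    cond_counts = [{p: {b: pseudocount for b in BASES} for p in BASES}
                   for _ in range(length)]
    for s in sites:
        first_counts[s[0]] += 1
        for i in range(1, length):
            cond_counts[i][s[i - 1]][s[i]] += 1
    total = sum(first_counts.values())
    first = {b: c / total for b, c in first_counts.items()}
    cond = [{p: {b: row[b] / sum(row.values()) for b in BASES}
             for p, row in position.items()}
            for position in cond_counts]
    return first, cond

def log_score(seq, first, cond):
    """Log-probability of a candidate site under the conditional matrix."""
    score = math.log(first[seq[0]])
    for i in range(1, len(seq)):
        score += math.log(cond[i][seq[i - 1]][seq[i]])
    return score

def most_likely_sequence(first, cond):
    """Dynamic program over positions: best-scoring sequence ending in each base."""
    best = {b: (math.log(first[b]), b) for b in BASES}
    for i in range(1, len(cond)):
        best = {
            b: max((best[p][0] + math.log(cond[i][p][b]), best[p][1] + b)
                   for p in BASES)
            for b in BASES
        }
    return max(best.values())[1]

# Hypothetical training sites, invented for this example only.
toy_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
first, cond = train_conditional_matrix(toy_sites)
print(log_score("AGGTAAGT", first, cond), most_likely_sequence(first, cond))
```

A conventional weight matrix would drop the dependence on `prev` and score each position independently; the conditional version needs more training data per entry, which is why the sketch adds a pseudocount.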
GeneSplicer: a new computational method for splice site prediction
- Nucleic Acids Res
, 2001
Cited by 37 (1 self)
GeneSplicer is a new, flexible system for detecting splice sites in the genomic DNA of various eukaryotes. The system has been tested successfully using DNA from two reference organisms: the model plant Arabidopsis thaliana and human. It was compared to six programs representing the leading splice site detectors for each of these species: NetPlantGene, NetGene2, HSPL, NNSplice, GENIO and SpliceView. In each case GeneSplicer performed comparably to the best alternative, in terms of both accuracy and computational efficiency.
Computational Methods for the Identification of Genes in Vertebrate Genomic Sequences
- Hum. Mol. Genet
, 1997
Cited by 36 (3 self)
Research into new methods to identify genes in anonymous genomic sequences has been going on for more than 15 years. Over this period of time, the field has evolved from the designing of programs to identify protein coding regions in compact mitochondrial or bacterial genomes, to the challenge of predicting the detailed organization of multi-exon vertebrate genes. The best program currently available perfectly locates more than 80% of the internal coding exons, and only 5% of the predictions do not overlap a real exon. Given such accuracy, computational methods are indeed very useful; however, they do not alleviate the need for experimental validation. If the performances are satisfactory for the identification of the coding moiety of genes (internal coding exons), the determination of the full extent of the transcript (5′ and 3′ extremities of the gene) and the location of promoter regions are still unreliable. As the human and mouse genome sequencing projects enter a production mode, the fully automated annotation of megabase-long anonymous genomic sequences is the next big challenge in bioinformatics.
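The two exon-level figures quoted in this abstract (exons located perfectly, and predictions that overlap no real exon) can be computed as in the sketch below. The function name, the inclusive `(start, end)` coordinate convention, and the example coordinates are assumptions for illustration and are not taken from the review.

```python
def exon_level_accuracy(true_exons, predicted_exons):
    """Compare predicted exons with annotated exons.

    Exons are (start, end) tuples with inclusive coordinates (an assumption).
    """
    true_set = set(true_exons)
    pred_set = set(predicted_exons)
    exact = true_set & pred_set  # both boundaries match

    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    no_overlap = [p for p in pred_set
                  if not any(overlaps(p, t) for t in true_set)]
    return {
        # fraction of real exons predicted with both ends correct
        "exact_sensitivity": len(exact) / len(true_set) if true_set else 0.0,
        # fraction of predictions that are exactly correct
        "exact_specificity": len(exact) / len(pred_set) if pred_set else 0.0,
        # fraction of predictions that touch no real exon at all
        "no_overlap_fraction": len(no_overlap) / len(pred_set) if pred_set else 0.0,
    }

# Hypothetical annotation and prediction, invented for this example.
annotated = [(100, 200), (300, 400), (500, 650), (800, 900)]
predicted = [(100, 200), (300, 400), (510, 650), (1000, 1100)]
print(exon_level_accuracy(annotated, predicted))
```

In this toy case two of four annotated exons are matched exactly, one prediction has a shifted boundary, and one prediction falls outside every real exon.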
Finding Genes in DNA with a Hidden Markov Model
- Journal of Computational Biology
, 1997
Cited by 36 (0 self)
This study describes a new Hidden Markov Model (HMM) system for segmenting uncharacterized genomic DNA sequences into exons, introns, and intergenic regions. Separate HMM modules were designed and trained for specific regions of DNA: exons, introns, intergenic regions, and splice sites. The models were then tied together to form a biologically feasible topology. The integrated HMM was trained further on a set of eukaryotic DNA sequences, and tested by using it to segment a separate set of sequences. The resulting HMM system, which is called VEIL (Viterbi Exon-Intron Locator), obtains an overall accuracy on test data of 92% of total bases correctly labelled, with a correlation coefficient of 0.73. Using the more stringent test of exact exon prediction, VEIL correctly located both ends of 53% of the coding exons, and 49% of the exons it predicts are exactly correct. These results compare favorably to the best previous results for gene structure prediction, and demonstrate the benefits of...
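As a rough illustration of the Viterbi segmentation that gives VEIL its name, the sketch below runs the standard Viterbi algorithm on a toy two-state HMM. It is not VEIL's topology: the two states, all probabilities, and the example sequence are assumptions chosen only so that GC-rich stretches are labelled "exon" and AT-rich stretches "noncoding".

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable state path for seq (log-space Viterbi)."""
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]])
              for s in states}
    backpointers = []
    for symbol in seq[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # best predecessor state for arriving in s
            prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (scores[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # trace back from the best final state
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Toy parameters, invented for illustration (not trained on real data).
STATES = ["noncoding", "exon"]
START = {"noncoding": 0.9, "exon": 0.1}
TRANS = {"noncoding": {"noncoding": 0.9, "exon": 0.1},
         "exon": {"noncoding": 0.1, "exon": 0.9}}
EMIT = {"noncoding": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
        "exon": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

labels = viterbi("ATATATGCGCGCGCATATAT", STATES, START, TRANS, EMIT)
print("".join("E" if s == "exon" else "-" for s in labels))
```

A system like the one described would use separate trained modules for exons, introns, intergenic regions, and splice sites tied into one topology; the decoding step over the combined model is the same kind of maximum-probability path search.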
KDD for Science Data Analysis: Issues and Examples
- In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining
, 1996
Cited by 33 (2 self)
The analysis of the massive data sets collected by scientific instruments demands automation as a pre-requisite to analysis. There is an urgent need to create an intermediate level at which scientists can operate effectively; isolating them from the massive sizes and harnessing human analysis capabilities to focus on tasks in which machines do not even remotely approach humans, namely creative data analysis, theory and hypothesis formation, and drawing insights into underlying phenomena. We give an overview of the main issues in the exploitation of scientific datasets, present five case studies where KDD tools play important and enabling roles, and conclude with future challenges for data mining and KDD techniques in science data analysis.
Keywords: Applications in Science, Data Analysis, overview article, large databases, automated analysis, scientific data sets, scientific discovery.
To appear: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining...
The Conserved Exon Method for Gene Finding
, 2000
Cited by 33 (0 self)
A new approach to gene finding is introduced called the "Conserved Exon Method" (CEM). It is based on the idea of looking for conserved protein sequences by comparing pairs of DNA sequences, identifying putative exon pairs based on conserved regions and splice junction signals then chaining pairs of putative exons together. It simultaneously predicts gene structures in both human and mouse genomic sequences (or in other pairs of sequences at the appropriate evolutionary distance). Experimental results indicate the potential usefulness of this approach.
Finding Genes in Human DNA with a Hidden Markov Model
- Journal of Computational Biology
, 1996
Cited by 24 (3 self)
This study describes a new Hidden Markov Model (HMM) system for segmenting uncharacterized human genomic DNA into exons, introns, and intergenic regions. Three separate models were designed for each of the three types of human DNA (exons, introns, and intergenic), and training was performed on a corpus collected specifically for this project. The model was then augmented using biological knowledge about splice junction consensus sites, which were used to tie together the separately trained models. The resulting integrated model was then used to segment a test set of human DNA sequences that were not used during training. The initial results are highly encouraging and indicate that an HMM can form the basis of an effective gene-finding system.
1 Introduction
Robust computational solutions to the gene-finding problem are a valuable resource for the Human Genome Program and for the molecular biology community at large. Software that can reliably identify putative genes in DNA sequence ca...

