Results 1 - 10 of 18
Prediction of complete gene structures in human genomic DNA
- J. Mol. Biol
, 1997
Cited by 487 (7 self)
The problem of identifying genes in genomic DNA sequences by computational methods has attracted considerable research attention in recent years. From one point of view, the problem is closely
Improved Splice Site Detection in Genie
- J. COMPUT. BIOL
, 1997
Cited by 41 (3 self)
We present an improved splice site predictor for the genefinding program Genie. Genie is based on a generalized Hidden Markov Model (GHMM) that describes the grammar of a legal parse of a multi-exon gene in a DNA sequence. In Genie, probabilities are estimated for gene features by using dynamic programming to combine information from multiple content and signal sensors, including sensors that integrate matches to homologous sequences from a database. One of the hardest problems in genefinding is to determine the complete gene structure correctly. The splice site sensors are the key signal sensors that address this problem. We replaced the existing splice site sensors in Genie with two novel neural networks based on dinucleotide frequencies. Using these novel sensors, Genie shows significant improvements in the sensitivity and specificity of gene structure identification. Experimental results in tests using a standard set of annotated genes showed that Genie identified 86% of coding nuc...
KDD for Science Data Analysis: Issues and Examples
- In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining
, 1996
Cited by 33 (2 self)
The analysis of the massive data sets collected by scientific instruments demands automation as a pre-requisite to analysis. There is an urgent need to create an intermediate level at which scientists can operate effectively; isolating them from the massive sizes and harnessing human analysis capabilities to focus on tasks in which machines do not even remotely approach humans---namely, creative data analysis, theory and hypothesis formation, and drawing insights into underlying phenomena. We give an overview of the main issues in the exploitation of scientific datasets, present five case studies where KDD tools play important and enabling roles, and conclude with future challenges for data mining and KDD techniques in science data analysis. Keywords: Applications in Science, Data Analysis, overview article, large databases, automated analysis, scientific data sets, scientific discovery. To appear: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining...
Identification of Genes in Human Genomic DNA
, 1997
Cited by 23 (1 self)
A general probabilistic model of the gene structural and compositional properties of human genomic DNA is introduced and applied to the problem of identifying genes in unannotated human genomic sequences. The model uses a "Hidden semi-Markov" or semi-Markov source architecture which incorporates probabilistic descriptions of fundamental transcriptional, translational and splicing signals, as well as length distributions and compositional features of exons, introns and intergenic regions. Distinct sets of model parameters are derived which account for many of the substantial differences in gene density and structure observed in distinct C+G compositional regions ("isochores") of the human genome. A novel model building procedure, termed Maximal Dependence Decomposition, is introduced which captures potentially important dependencies between non-adjacent as well as adjacent positions in a biological signal. Application of this model to the donor splice signal not only gives better discrimination of potential donor sites than previous probabilistic models, but also reveals subtle properties of this signal which suggest aspects of its biochemical function. Acceptor
Evaluation of gene-finding programs on mammalian sequences
- Genome Res
, 2001
Integrating Database Homology in a Probabilistic Gene Structure Model
- Proceedings of the Pacific Symposium on Biocomputing
, 1997
Cited by 17 (5 self)
We present an improved stochastic model of genes in DNA, and describe a method for integrating database homology into the probabilistic framework. A generalized hidden Markov model (GHMM) describes the grammar of a legal parse of a DNA sequence. Probabilities are estimated for gene features by using dynamic programming to combine information from multiple sensors. We show how matches to homologous sequences from a database can be integrated into the probability estimation by interpreting the likelihood of a sequence in terms of the bit-cost to encode a sequence given a homology match. We also demonstrate how homology matches in protein databases can be exploited to help identify splice sites. Our experiments show significant improvements in the sensitivity and specificity of gene structure identification when these new features are added to our gene-finding system, Genie. Experimental results in tests using a standard set of annotated genes showed that Genie identified 95% of coding nucleotides correctly with a specificity of 91%, and 77% of exons were identified exactly.
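As a rough illustration of the nucleotide-level figures quoted in this abstract, the sketch below computes sensitivity and specificity from per-base coding labels. It assumes the convention common in gene-finding evaluations, where sensitivity is the fraction of true coding nucleotides that are predicted as coding and specificity is the fraction of predicted coding nucleotides that are truly coding; the abstract does not define either term. The function name and the toy labels are hypothetical.

```python
def nucleotide_accuracy(true_coding, pred_coding):
    """Return (sensitivity, specificity) for per-nucleotide coding labels.

    Assumed convention (not stated in the abstract):
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP)
    Both inputs are equal-length sequences of 0/1 flags, one per nucleotide.
    """
    if len(true_coding) != len(pred_coding):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(true_coding, pred_coding))
    tp = sum(1 for t, p in pairs if t and p)        # coding, predicted coding
    fn = sum(1 for t, p in pairs if t and not p)    # coding, missed
    fp = sum(1 for t, p in pairs if not t and p)    # non-coding, predicted coding

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity


# Toy example: 8 nucleotides, 4 truly coding, 4 predicted coding, 3 shared.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
guess = [1, 1, 1, 0, 1, 0, 0, 0]
sn, sp = nucleotide_accuracy(truth, guess)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Exon-level accuracy (exons identified exactly) would need interval comparison rather than per-base flags and is not covered by this sketch.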
Periodic Sequence Patterns in Human Exons
- In Proceedings of the 1995 Conference on Intelligent Systems for Molecular Biology (ISMB95), in Cambridge (UK). Menlo Park, CA
, 1995
Cited by 5 (3 self)
We analyse the sequential structure of human exons and their flanking introns by hidden Markov models. Together, models of donor site regions, acceptor site regions and flanked internal exons, show that exons --- besides the reading frame --- hold a specific periodic pattern. The pattern, which has the consensus: non-T(A/T)G and a minimal periodicity of roughly 10 nucleotides, is not a consequence of the nucleotide statistics in the three codon positions, nor of the well known nucleosome positioning signal. We discuss the relation between the pattern and other known sequence elements responsible for the intrinsic bending or curvature of DNA. Keywords: DNA, sequential structure, periodicity, exon, intron, hidden Markov models. Introduction Besides specifying the choice and order of amino acids in proteins, genetic material holds a multitude of additional signals playing an important role in ...
Chemical Information: How Do We Get It and What Do We Do With It?
- Proceedings of the 1995 International Chemical Informatics Conference, Nimes
, 1995
Cited by 4 (4 self)
There is an acute need for high quality, freely available, public domain data sets for many areas of chemistry. These are not simply information repositories for casual perusal, but can form the foundation for queries which are complex in two senses: reaching across the WorldWideWeb for data; and using algorithms beyond Boolean logic to process the garnered data. Implementing this paradise requires that we carefully address some fundamental questions of data and program syntax and semantics, and socially organize to provide the financial support. In the interim, building the infrastructure offers opportunities for experimentation, as I will illustrate with our work on Klotho.
Computational Genefinding
, 1998
Cited by 2 (1 self)
Introduction Computational methodology for finding genes and other functional sites in genomic DNA has evolved significantly over the last 20 years. Excellent recent surveys have been given by Guigó [10], Claverie [3], Krogh [14] and others. Among the types of functional sites in genomic DNA that researchers have sought to recognize are splice sites, start and stop codons, branch points, promoters and terminators of transcription, polyadenylation sites, ribosomal binding sites, topoisomerase II binding sites, topoisomerase I cleavage sites, and various transcription factor binding sites [8]. Local sites such as these are called signals and methods for detecting them may be called signal sensors. Genomic DNA signals can be contrasted with extended and variable length regions such as exons and introns, which are recognized by different methods that may be called content sensors [26]. 2 Signal Sensors The most bas
Gene Structure Prediction From Many Attributes
, 1998
Cited by 2 (0 self)
Considerable research effort has been directed in recent years toward the problem of computationally identifying genes in DNA sequences. A fundamental component of a genefinding system is a predictor which, when given a window of DNA sequence data, predicts whether or not it codes for protein product. In this paper we propose that mistake-driven, multiplicative-weight-update learning algorithms operating over a large feature set are well suited to this prediction problem, and describe a system we have built which takes this approach. Our system is fast, simple, and produces more accurate classifiers than have previously been obtained for a range of different sequence lengths. We conclude that a system of this type will be a useful component in larger gene-finding programs. Keywords: exon prediction; gene identification; coding sequence; machine learning; Winnow; multiplicative-weight-update learning algorithm 1 Introduction A major thrust of biological research over the p...
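The mistake-driven, multiplicative-weight-update idea named in this abstract can be sketched with a textbook Winnow learner over binary features. This is not the authors' system: the dinucleotide presence features, the threshold (set to the number of features), and the promotion factor of 2 are illustrative assumptions, and the names `kmer_features` and `Winnow` are hypothetical.

```python
from itertools import product


def kmer_features(window, k=2):
    """Binary presence vector over all 4**k k-mers (an assumed feature set)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    present = {window[i:i + k] for i in range(len(window) - k + 1)}
    return [1 if kmer in present else 0 for kmer in kmers]


class Winnow:
    """Textbook Winnow: weights change only on mistakes, multiplicatively."""

    def __init__(self, n_features, alpha=2.0):
        self.w = [1.0] * n_features      # all weights start at 1
        self.alpha = alpha               # promotion/demotion factor (assumed)
        self.theta = float(n_features)   # decision threshold (assumed)

    def predict(self, x):
        score = sum(wi for wi, xi in zip(self.w, x) if xi)
        return 1 if score >= self.theta else 0

    def update(self, x, y):
        """Predict, then promote or demote active weights if the guess was wrong."""
        y_hat = self.predict(x)
        if y_hat != y:
            # False negative: multiply active weights; false positive: divide.
            factor = self.alpha if y == 1 else 1.0 / self.alpha
            self.w = [wi * factor if xi else wi for wi, xi in zip(self.w, x)]
        return y_hat


# Toy usage: the first feature alone determines the label.
examples = [([1, 0, 0, 0], 1), ([0, 1, 1, 1], 0)]
clf = Winnow(n_features=4)
for _ in range(5):
    for x, y in examples:
        clf.update(x, y)
print(clf.predict([1, 0, 0, 0]), clf.predict([0, 1, 1, 1]))
```

For DNA windows, `kmer_features(window)` would supply the binary input vector; a real feature set for coding-region prediction would be far larger than the 16 dinucleotides used here.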

