Integrating Genomic Homology into Gene Structure Prediction
, 2001
Cited by 201 (12 self)
TWINSCAN is a new gene-structure prediction system that directly extends the probability model of GENSCAN, allowing it to exploit homology between two related genomes. Separate probability models are used for conservation in exons, introns, splice sites, and UTRs, reflecting the differences among their patterns of evolutionary conservation. TWINSCAN is specifically designed for the analysis of high-throughput genomic sequences containing an unknown number of genes. In experiments on high-throughput mouse sequences, using homologous sequences from the human genome, TWINSCAN shows notable improvement over GENSCAN in exon sensitivity and specificity and dramatic improvement in exact gene sensitivity and specificity. This improvement can be attributed entirely to modeling the patterns of evolutionary conservation in genomic sequence.
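The exon-level sensitivity and specificity mentioned in this abstract can be illustrated with a minimal sketch. It assumes the convention common in gene-finding evaluations, where an exon counts as correct only if both boundaries match the annotation exactly, sensitivity is the fraction of annotated exons that were predicted, and "specificity" is the fraction of predicted exons that are correct. The function name and the coordinate tuples are hypothetical and are not taken from TWINSCAN or GENSCAN.

```python
def exon_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) at the exact-exon level.

    Each exon is a (start, end) tuple; an exon is correct only if
    both coordinates match an annotated exon exactly (assumption).
    """
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)
    sensitivity = correct / len(annotated) if annotated else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical coordinates, for illustration only.
predicted_exons = [(100, 200), (300, 400), (500, 600)]
annotated_exons = [(100, 200), (300, 410)]
sens, spec = exon_accuracy(predicted_exons, annotated_exons)
```

Here only one predicted exon matches exactly, so sensitivity is 1/2 and specificity is 1/3. Exact-gene accuracy can be computed the same way by comparing whole exon lists instead of single exons.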
Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res
, 2000
The Conserved Exon Method for Gene Finding
, 2000
Cited by 41 (0 self)
A new approach to gene finding is introduced, called the "Conserved Exon Method" (CEM). It is based on the idea of looking for conserved protein sequences by comparing pairs of DNA sequences, identifying putative exon pairs based on conserved regions and splice junction signals, and then chaining pairs of putative exons together. It simultaneously predicts gene structures in both human and mouse genomic sequences (or in other pairs of sequences at the appropriate evolutionary distance). Experimental results indicate the potential usefulness of this approach.
TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders
 Bioinformatics
, 2004
hybrid Markov/semi-Markov chains
 Computational Statistics and Data Analysis
, 2005
Cited by 17 (4 self)
Models that combine Markovian states with implicit geometric state occupancy distributions and semi-Markovian states with explicit state occupancy distributions are investigated. This type of model retains the flexibility of hidden semi-Markov chains for the modeling of short or medium size homogeneous zones along sequences but also enables the modeling of long zones with Markovian states. The forward-backward algorithm, which in particular makes it possible to implement the E-step of the EM algorithm efficiently, and the Viterbi algorithm for the restoration of the most likely state sequence are derived. It is also shown that macro-states, i.e. series-parallel networks of states with common observation distribution, are not a valid alternative to semi-Markovian states but may be useful at a more macroscopic level to combine Markovian states with semi-Markovian states. This statistical modeling approach is illustrated by the analysis of branching and flowering patterns in plants.
Calculating the exact probability of language-like patterns in biomolecular sequences
 Proc. Int. Conf. Intell. Syst. Mol. Biol
, 1998
Cited by 10 (0 self)
We present algorithms for the exact computation of the probability that a random string of a certain length matches a given regular expression. These algorithms can be used to determine statistical significance in a variety of pattern searches such as motif searches and gene finding. This work improves upon work of Kleffe and Langbecker (Kleffe & Langbecker 1990) and of Sewell and Durbin (Sewell & Durbin 1995) in several ways. First, in many cases of interest, the algorithms presented here are faster. In addition, the type of pattern considered here strictly includes those of both previous works but also allows, for instance, arbitrary length gaps. Also, the type of probability model which can be used is more general than that of Sewell and Durbin, allowing for Markov chains. The problem solved in this work is in fact in the class of NP-hard problems, which are believed to be intractable. However, the problem is fixed-parameter tractable, meaning that it is tractable for small patterns. The problem is also computationally feasible for many patterns which occur in practice. As a sample application, we consider calculating the statistical significance of most of the PROSITE patterns as in Sewell and Durbin. Whereas their method was only fast enough to exactly compute the probabilities for sequences of length 13 larger than the pattern length, we calculate these probabilities for sequences of up to length 2000. In addition, we calculate most of these probabilities using a first-order Markov chain. Most of the PROSITE patterns have high significance at length 2000 under both the i.i.d. and Markov chain models. For further applications, we demonstrate the calculation of the probability of a PROSITE pattern occurring on either strand of a random DNA sequence of up to 500 kilobases and the probability of a simple gene model occurring in a random sequence of up to 1 megabase.
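The general idea of computing an exact match probability can be sketched for the simplest case: a fixed word pattern under an i.i.d. letter model. The sketch below is not the algorithm of the paper. It assumes the pattern has already been turned into a deterministic finite automaton (here a KMP-style automaton built by the hypothetical helper `build_word_dfa`) and then pushes probability mass through the automaton one letter at a time. Exact rational arithmetic is used so that no rounding occurs.

```python
from fractions import Fraction


def build_word_dfa(word, alphabet):
    """KMP-style DFA: state k means the last k letters equal word[:k].

    State len(word) is an absorbing accept state, so once the word has
    occurred anywhere in the string the automaton stays accepting.
    """
    m = len(word)
    dfa = []
    for k in range(m):
        row = {}
        for c in alphabet:
            s = word[:k] + c
            # Longest suffix of s that is also a prefix of word.
            j = min(len(s), m)
            while j > 0 and s[-j:] != word[:j]:
                j -= 1
            row[c] = j
        dfa.append(row)
    dfa.append({c: m for c in alphabet})  # absorbing accept state
    return dfa


def match_probability(dfa, accept, probs, n):
    """Probability that an i.i.d. random string of length n reaches `accept`."""
    dist = [Fraction(0)] * len(dfa)
    dist[0] = Fraction(1)
    for _ in range(n):
        nxt = [Fraction(0)] * len(dfa)
        for state, mass in enumerate(dist):
            if mass:
                for c, p in probs.items():
                    nxt[dfa[state][c]] += mass * p
        dist = nxt
    return dist[accept]


# Uniform DNA letters (an assumption made for the example).
uniform = {c: Fraction(1, 4) for c in "ACGT"}
dfa = build_word_dfa("ACG", "ACGT")
p4 = match_probability(dfa, 3, uniform, 4)  # P(a length-4 string contains ACG)
```

Each step costs time proportional to the number of automaton states times the alphabet size, so the total cost grows linearly with the sequence length. The difficulty the abstract refers to comes from the size of the automaton for general patterns, which this sketch does not address.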
A Better Method for Length Distribution Modeling in HMMs and Its Application to Gene Finding
Cited by 8 (4 self)
Hidden Markov models (HMMs) have proved to be a useful abstraction in modeling biological sequences. In some situations it is necessary to use generalized HMMs in order to model the length distributions of some sequence elements because basic HMMs force geometric-like distributions. In this paper we suggest the use of arbitrary length distributions with geometric tails to model lengths of elements in biological sequences. We give an algorithm for annotation of a biological sequence in O(ndm ) time using such length distributions coupled with a suitable generalization of the HMM; here n is the length of the sequence, m is the number of states in the model, d is a parameter of the length distribution, and is a small constant dependent on model topology (compared to previously proposed algorithms with O(n ) time [10]).
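A length distribution with a geometric tail can be sketched as follows. This is one possible parametrization chosen for illustration, not necessarily the one used in the paper: the first d lengths get arbitrary weights, and every length beyond d decays from the last explicit weight by a constant ratio q. The names `geometric_tail_pmf`, `head`, and `q` are hypothetical.

```python
def geometric_tail_pmf(head, q):
    """Build P(length) with an arbitrary head and a geometric tail.

    head: non-negative weights for lengths 1..d (d = len(head)).
    q:    decay ratio of the tail, 0 <= q < 1, so that for length > d
          the weight is head[-1] * q ** (length - d).
    Returns a function mapping a length to its normalized probability.
    """
    if not head or not 0 <= q < 1:
        raise ValueError("need a non-empty head and 0 <= q < 1")
    d = len(head)
    # Head mass plus the closed-form sum of the geometric tail.
    total = sum(head) + head[-1] * q / (1 - q)

    def pmf(length):
        if length < 1:
            return 0.0
        if length <= d:
            return head[length - 1] / total
        return head[-1] * q ** (length - d) / total

    return pmf


# Explicit weights for lengths 1..3, then halving for each extra unit of length.
pmf = geometric_tail_pmf([1, 3, 2], 0.5)
```

With these weights the normalizing constant is 8, so `pmf(2)` is 0.375 and `pmf(4)` is 0.125. Only the d head values and the ratio q need to be stored, which is what lets the tail be handled like an ordinary self-transition while the head keeps its arbitrary shape.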
Evidence Combination in Hidden Markov Models for Gene Prediction
, 2005
Cited by 4 (2 self)
This thesis introduces new techniques for finding genes in genomic sequences. Genes are regions of a genome encoding proteins of an organism. Identification of genes in a genome is an important step in the annotation process after a new genome is sequenced. The prediction accuracy of gene finding can be greatly improved by using experimental evidence. This evidence includes homologies between the genome and databases of known proteins, or evolutionary conservation of genomic sequence in different species. We propose a flexible framework to incorporate several different sources of such evidence into a gene finder based on a hidden Markov model. Various sources of evidence are expressed as partial probabilistic statements about the annotation of positions in the sequence, and these are combined with the hidden Markov model to obtain the final gene prediction. The opportunity to
Gradient-Based Feature Selection for Conditional Random Fields and Its Applications in Computational Genetics
 21st IEEE International Conference on Tools with Artificial Intelligence (ICTAI
, 2009
Cited by 2 (1 self)
Gene prediction is one of the first and most important steps in understanding the genome of a species, and different approaches have been proposed. In 2007, a de novo gene predictor called CONTRAST, based on Conditional Random Fields (CRFs), was introduced and shown to substantially outperform previous predictors. However, the oversized feature set used in the model has posed several issues, such as overfitting and excessive computational demand. To resolve these issues, we did a thorough survey of two existing feature selection methods for CRFs, namely the gain-based and gradient-based methods, and applied the latter to CONTRAST. The results show that with the gradient-based feature selection scheme, we are able to achieve comparable or even better prediction accuracy on testing data, using only a very small fraction of the features from the candidate pool. The feature selection method also helps researchers better understand the underlying structure of genomic sequences and provides further insight into the function and evolutionary dynamics of genomes.
Computational genomics: Mapping, comparison, and annotation of genomes
Cited by 2 (2 self)
The field of genomics provides many challenges to computer scientists and mathematicians. The area of computational genomics has been expanding recently, and the timely application of computer science in this field is proving to be an essential component of the large international effort in genomics. In this thesis we address key issues in the different stages of genome research: planning of a genome sequencing project, obtaining and assembling sequence information, and ultimately study, cross-species comparison, and annotation of finished genomic sequence. We present applications of computational techniques to the above areas: (1) In relation to the early stages of a genome project, we address physical mapping, and we present results on the theoretical problem of finding minimum superstrings of hypergraphs, a combinatorial problem motivated by physical mapping. We also present a statistical and simulation study of “walking with clone-end sequences”, an important method for sequencing a large genome. (2) Turning to the problem of obtaining the finished genomic sequence, we present ARACHNE, a prototype software system for assembling sequence data that are derived