Integrating Genomic Homology into Gene Structure Prediction
2001
Cited by 137 (6 self)
TWINSCAN is a new gene-structure prediction system that directly extends the probability model of GENSCAN, allowing it to exploit homology between two related genomes. Separate probability models are used for conservation in exons, introns, splice sites, and UTRs, reflecting the differences among their patterns of evolutionary conservation. TWINSCAN is specifically designed for the analysis of high-throughput genomic sequences containing an unknown number of genes. In experiments on high-throughput mouse sequences, using homologous sequences from the human genome, TWINSCAN shows notable improvement over GENSCAN in exon sensitivity and specificity and dramatic improvement in exact gene sensitivity and specificity. This improvement can be attributed entirely to modeling the patterns of evolutionary conservation in genomic sequence.
Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing
Bioinformatics, 2001
Cited by 56 (6 self)
Motivation: Comparison of multimegabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences efficiently, existing algorithms find alignments by expanding around short runs of matching bases with no substitutions or other differences. Unfortunately, exact matches that are short enough to occur often in significant alignments also occur frequently by chance in the background sequence. Thus, these algorithms must trade off between efficiency and sensitivity to features without long exact matches. Results: We introduce a new algorithm, lsh-all-pairs, to find ungapped local alignments in genomic sequence with up to a specified fraction of substitutions. The length and substitution rate of these alignments can be chosen so that they appear frequently in significant similarities yet still remain rare in the background sequence. The algorithm finds ungapped alignments efficiently using a randomized search technique, locality-sensitive hashing. We have found lsh-all-pairs to be both efficient and sensitive for finding local similarities with as little as 63% identity in mammalian genomic sequences up to tens of megabases in length. Availability: Contact the author at the address below. Contact: jbuhler@cs.washington.edu Supplementary Information: the sequences and local alignment data described in this work are available at http://bio.cs.washington.edu/jbuhler-bioinformatics-2001/. Keywords: local alignment, genome annotation, locality-sensitive hashing
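The general idea of locality-sensitive hashing for ungapped similarities can be illustrated with a minimal sketch. This is not the lsh-all-pairs implementation from the paper; the function names and the parameters d (window length), r (maximum substitutions), k (sampled positions per hash), and iterations are assumptions chosen for illustration. Each round hashes every length-d window by its bases at k randomly chosen positions, so windows that differ in only a few positions are likely to share a bucket in at least one round; colliding windows are then verified by Hamming distance.

```python
import random
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def lsh_similar_pairs(seq1, seq2, d, r, k, iterations, seed=0):
    """Return a set of (i, j) with hamming(seq1[i:i+d], seq2[j:j+d]) <= r.

    Illustrative sketch only: the search is randomized, so pairs with
    substitutions may be missed; identical windows always collide.
    """
    rng = random.Random(seed)
    found = set()
    for _ in range(iterations):
        # Random projection: look at only k of the d window positions.
        positions = sorted(rng.sample(range(d), k))
        buckets = defaultdict(list)
        for i in range(len(seq1) - d + 1):
            key = "".join(seq1[i + p] for p in positions)
            buckets[key].append(i)
        for j in range(len(seq2) - d + 1):
            key = "".join(seq2[j + p] for p in positions)
            # Windows sharing a key are candidates; verify them exactly.
            for i in buckets.get(key, ()):
                if (i, j) not in found and \
                        hamming(seq1[i:i + d], seq2[j:j + d]) <= r:
                    found.add((i, j))
    return found


if __name__ == "__main__":
    pairs = lsh_similar_pairs("GGACGTACGTCC", "TTTACGTACGTAAA",
                              d=8, r=1, k=4, iterations=5)
    print(sorted(pairs))
```

In this sketch, a smaller k makes collisions between diverged windows more likely but also admits more chance collisions that must be verified, while more iterations reduce the probability of missing a true similarity.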
Gene structure prediction and alternative splicing analysis using genomically aligned ESTs
Genome Res, 2001
Gene finding with a hidden Markov model of genome structure and evolution
2003
Cited by 47 (7 self)
Motivation: A growing number of genomes are sequenced. The differences in evolutionary pattern between functional regions can thus be observed genome-wide in a whole set of organisms. The diverse evolutionary pattern of different functional regions can be exploited in the process of genomic annotation. The modelling of evolution by the existing comparative gene finders leaves room for improvement. Results: A probabilistic model of both genome structure and evolution is designed. This type of model is called
Efficient Multiple Genome Alignment
2002
Cited by 36 (9 self)
Motivation: To allow a direct comparison of the genomic DNA sequences of sufficiently similar organisms, there is an urgent need for software tools that can align more than two genomic sequences. Results: We developed...
ExonHunter: a comprehensive approach to gene finding
Bioinformatics, 2005
Cited by 21 (5 self)
We present ExonHunter, a new and comprehensive gene finder system that outperforms existing systems, featuring several new ideas and approaches. Our system combines numerous sources of information (genomic sequences, ESTs, and protein databases of related species) with a gene finder based on a hidden Markov model in a novel and systematic way. In our framework, various sources of information are expressed as partial probabilistic statements about positions in the sequence and their annotation. We then combine these into the final prediction with a quadratic programming method extending existing methods. Allowing only partial statements is key to our transparent handling of missing information and coping with the heterogeneous character of individual sources of information. We also give a new method for modeling the length distribution of intergenic regions in hidden Markov models. On a commonly used test set, ExonHunter performs significantly better than ROSETTA, SLAM, or TWINSCAN, and more than two thirds of genes were predicted completely correctly.
Gene identification in novel eukaryotic genomes by self-training algorithm
Nucleic Acids Res, 2005
Methods in comparative genomics: Genome correspondence, gene identification, and regulatory motif discovery
Journal of Computational Biology, 2004
Cited by 14 (4 self)
In Kellis et al. (2003), we reported the genome sequences of S. paradoxus, S. mikatae, and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genome-wide comparative analysis allowed the identification of functionally important sequences, both coding and noncoding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes. (1) We present methods for the automatic determination of genome correspondence. The algorithms enabled the automatic identification of orthologs for more than 90% of genes and intergenic regions across the four species despite the large number of duplicated genes in the yeast genome. The remaining ambiguities in the gene correspondence revealed recent gene family expansions in regions of rapid genomic change. (2) We present methods for the identification of protein-coding genes based on their patterns of nucleotide conservation across related species. We observed the pressure to conserve the reading frame of functional proteins and developed a test for gene identification with high sensitivity and specificity. We used this test to revisit the genome of S. cerevisiae, reducing the overall gene count by 500 genes (10% of previously
A Computational Model for RNA Multiple Structural Alignment
Proc. Symp. on Combinatorial Pattern Matching, LNCS 3103, 2004
Cited by 9 (0 self)
This paper addresses the problem of aligning multiple sequences of non-coding RNA genes. We approach this problem with the biologically motivated paradigm that scoring of ncRNA alignments should be based primarily on secondary structure rather than nucleotide conservation. We introduce a novel graph theoretic model (NLG) for analyzing algorithms based on this approach, prove that the RNA multiple alignment problem is NP-Complete in this model, and present a polynomial time algorithm that approximates the optimal structure of size S within a factor of O(log^2 S).

