Results 1 - 10 of 144
RSEARCH: Finding homologs of single structured RNA sequences
BMC Bioinformatics, 2003
Cited by 83 (0 self)
Background: Many trans-acting noncoding RNA genes and cis-acting RNA regulatory elements conserve secondary structure rather than primary sequence. Most homology search tools only look at the primary sequence level, however.
Aligning Gene Expression Time Series With Time Warping Algorithms
2001
Cited by 76 (2 self)
Motivation: Increasingly, biological processes are being studied through time series of RNA expression data collected for large numbers of genes. Because common processes may unfold at varying rates in different experiments or individuals, methods are needed that will allow corresponding expression states in different time series to be mapped to one another. Results: We present implementations of time warping algorithms applicable to RNA and protein expression data and demonstrate their application to published yeast RNA expression time series. Programs executing two warping algorithms are described, a simple warping algorithm and an interpolative algorithm, along with programs that generate graphics that visually present alignment information. We show time warping to be superior to simple clustering at mapping corresponding time states. We document the impact of statistical measurement noise and sample size on the quality of time alignments, and present issues related to statistical assessment of alignment quality through alignment scores. We also discuss directions for algorithm improvement including development of multiple time series alignments and possible applications to causality searches and non-temporal processes ('concentration warping'). Availability: Academic implementations of alignment programs genewarp and genewarpi and the graphics generation programs grphwarp and grphwarpi are available as Win32 system DOS box executables on our web site along with documentation on their use. The publicly available data on which they were demonstrated may be found at http://genome-www.stanford.edu/cellcycle/. Postscript files generated by grphwarp and grphwarpi may be directly printed or viewed using GhostView software available at http://www.cs.wisc.edu/~ghost/. Con...
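The abstract does not spell out the warping recurrence, so the following is only a minimal sketch of classic dynamic time warping, assuming an absolute-difference cost between two single-gene expression series. It is not the genewarp or genewarpi implementation; the function name `dtw` and the cost function are assumptions made for illustration.

```python
def dtw(a, b):
    """Align two numeric time series by classic dynamic time warping.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    mapping time points of `a` to time points of `b`.
    Assumption: pointwise cost is |a[i] - b[j]|; the paper may use another.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both series
                                 cost[i - 1][j],      # advance a only
                                 cost[i][j - 1])      # advance b only
    # Trace back from the end to recover which time points were paired.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    total, path = dtw([0, 1, 2], [0, 1, 1, 2])
    print(total, path)
```

In the small example, the second series dwells one extra step at the value 1, and the warping path pairs that repeated point with the single matching point in the first series.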
Identification of protein coding regions by database similarity search
Nature Genetics, 1993
Cited by 64 (1 self)
Correspondence should be addressed to W.G. Summary: Sequence similarity between a translated nucleotide sequence and a known biological protein can provide strong evidence for the presence of a homologous coding region, and such similarities can often be identified even between distantly related genes. The computer program BLASTX performed conceptual translation of a nucleotide query sequence followed by a protein database search in one programmatic step. The BLAST search algorithm combined with Karlin-Altschul statistics yields a predictable selectivity that has been parameterized. We characterized the sensitivity of BLASTX recognition to the presence of substitution, insertion and deletion errors in the query sequence and to sequence divergence. Reading frames were reliably identified in the presence of 1% query errors, a rate that is typical for primary nucleotide sequence data. BLASTX is appropriate for use in moderate and large scale sequencing projects at the earliest opportunity, when the data are most prone to containing errors.
Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing
Bioinformatics, 2001
Cited by 56 (6 self)
Motivation: Comparison of multimegabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences efficiently, existing algorithms find alignments by expanding around short runs of matching bases with no substitutions or other differences. Unfortunately, exact matches that are short enough to occur often in significant alignments also occur frequently by chance in the background sequence. Thus, these algorithms must trade off between efficiency and sensitivity to features without long exact matches. Results: We introduce a new algorithm, lsh-all-pairs, to find ungapped local alignments in genomic sequence with up to a specified fraction of substitutions. The length and substitution rate of these alignments can be chosen so that they appear frequently in significant similarities yet still remain rare in the background sequence. The algorithm finds ungapped alignments efficiently using a randomized search technique, locality-sensitive hashing. We have found lsh-all-pairs to be both efficient and sensitive for finding local similarities with as little as 63% identity in mammalian genomic sequences up to tens of megabases in length. Availability: Contact the author at the address below. Contact: jbuhler@cs.washington.edu Supplementary Information: the sequences and local alignment data described in this work are available at http://bio.cs.washington.edu/jbuhler-bioinformatics-2001/. Keywords: local alignment, genome annotation, locality-sensitive hashing
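As a rough illustration of the idea named in this abstract, the sketch below hashes every fixed-length window of a sequence on a few randomly chosen positions, then checks windows that share a bucket for a small Hamming distance. It is a toy version under stated assumptions, not lsh-all-pairs itself: the function names, the parameters `d`, `r`, `k`, `trials`, and the in-memory bucket scheme are all hypothetical, and nothing here addresses the engineering needed for megabase inputs.

```python
import random
from collections import defaultdict
from itertools import combinations


def hamming(s, t):
    """Number of mismatching positions between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(s, t))


def lsh_candidate_pairs(seq, d, r, k, trials, seed=0):
    """Find pairs of length-d windows of `seq` with at most r substitutions.

    Each trial samples k of the d window positions at random and buckets
    windows by the bases at those positions. Windows that differ in few
    positions are likely to agree on the sample and so land in one bucket.
    Every reported pair is verified, so there are no false positives, but
    a similar pair can be missed if it never collides in any trial.
    """
    rng = random.Random(seed)
    hits = set()
    for _ in range(trials):
        positions = sorted(rng.sample(range(d), k))
        buckets = defaultdict(list)
        for start in range(len(seq) - d + 1):
            window = seq[start:start + d]
            key = "".join(window[p] for p in positions)
            buckets[key].append(start)
        for starts in buckets.values():
            # Quadratic in bucket size; acceptable only for a toy example.
            for i, j in combinations(starts, 2):
                if hamming(seq[i:i + d], seq[j:j + d]) <= r:
                    hits.add((i, j))
    return hits


if __name__ == "__main__":
    print(sorted(lsh_candidate_pairs("GATTACAGATTACA", d=7, r=1, k=3, trials=4)))
```

Raising `k` makes chance collisions in background sequence rarer at the cost of missing more true pairs per trial, and raising `trials` recovers sensitivity; this is the efficiency versus sensitivity trade-off the abstract describes.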
Topology Prediction for Helical Transmembrane Proteins at 86% Accuracy
Protein Sci., 1996
Cited by 43 (11 self)
Previously, we introduced a neural network system predicting locations of transmembrane helices based on evolutionary profiles (PHDhtm; Rost et al., 1995). Here, we describe an improvement and an extension of that system. The improvement is achieved by a dynamic programming-like algorithm that optimises helices compatible with the neural network output. The extension is the prediction of topology (orientation of first loop region with respect to membrane) by applying to the refined prediction the observation that positively charged residues are more abundant in extra-cytoplasmic regions. Furthermore, we introduce a method to reduce the number of false positives, i.e., proteins falsely predicted with membrane helices. The evaluation of prediction accuracy is based on a cross-validation and a double-blind test set (in total 131 proteins). The final method appears to be more accurate than other methods published. (1) For almost 89% (±3%) of the test proteins all transmembrane helices are predicted correctly. (2) For more than 86% (±3%) of the proteins topology is predicted correctly. (3) We define reliability indices which correlate with prediction accuracy: for one half of the proteins segment accuracy rises to 98%; and for two-thirds accuracy of topology prediction is 95%. (4) The rate of proteins for which transmembrane helices are predicted falsely is below 2% (±1%). Finally, the method is applied to 1616 sequences of Haemophilus influenzae. We predict 19% of the genome sequences to contain one or more transmembrane helices. This appears to be lower than what we predicted previously for the yeast VIII chromosome (about 25%).
Indexing and Retrieval for Genomic Databases
IEEE Transactions on Knowledge and Data Engineering, 2002
Cited by 40 (6 self)
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino-acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences only and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments, and that index-based searching is as accurate as existing exhaustive search schemes.
A first-generation linkage disequilibrium map of human chromosome 22
Nature, 2002
Cited by 37 (4 self)
A collection of DNA sequence variants across the genome may be used to test specific genes or regions of the human genome for association with a variety of phenotypes such as disease risk or variable drug response. Detecting association relies on the non-random correlation (“linkage disequilibrium”, LD) of one of the marker alleles with a trait-related variant. We have measured LD along the complete sequence of human chromosome 22. Duplicate genotyping and analysis of 1,504 markers in CEPH reference families at a median spacing of 15 kb reveals a highly variable pattern of LD along the chromosome, in which extensive regions of virtually complete LD up to 758 kb in length are interspersed with regions of little or no detectable LD. The LD patterns are replicated in 1,286 overlapping markers genotyped on a panel of unrelated U.K. Caucasians. There is a strong correlation between high LD and low recombination frequency in the extant genetic map, reflecting patterns of historic recombination between ancestral chromosomes that yield conserved blocks of LD in present-day chromosomes. This study demonstrates the feasibility of developing genome-wide maps of LD.
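The abstract does not say which LD statistic was used, so the sketch below simply computes the three standard pairwise measures (D, |D'| and r²) for two biallelic markers. It assumes phased haplotype counts are already available, which is a simplification; the function name `ld_measures` and its argument names are hypothetical.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Pairwise linkage disequilibrium from four haplotype counts.

    Arguments are counts of haplotypes carrying alleles A/a at marker 1
    and B/b at marker 2. Returns (D, D_prime, r_squared).
    Assumption: haplotypes are phased and both markers are polymorphic.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    # D is the excess of the AB haplotype over its expectation under independence.
    D = p_AB - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both markers must be polymorphic")
    # D' rescales |D| by the largest value it could take given the allele frequencies.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(D) / d_max
    r_squared = D * D / denom
    return D, d_prime, r_squared


if __name__ == "__main__":
    print(ld_measures(50, 0, 0, 50))    # only AB and ab haplotypes observed
    print(ld_measures(25, 25, 25, 25))  # alleles assort independently
```

"Virtually complete LD" in the abstract corresponds to values near 1, and "little or no detectable LD" to values near 0, on either normalised measure.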
A Linear Time Algorithm for Finding All Maximal Scoring Subsequences
In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, 1999
Cited by 33 (3 self)
Given a sequence of real numbers ("scores"), we present a practical linear time algorithm to find those nonoverlapping, contiguous subsequences having greatest total scores. This improves on the best previously known algorithm, which requires quadratic time in the worst case. The problem arises in biological sequence analysis, where the high-scoring subsequences correspond to regions of unusual composition in a nucleic acid or protein sequence. For instance, Altschul, Karlin, and others have used this approach to identify transmembrane regions, DNA binding domains, and regions of high charge in proteins. Keywords: maximal scoring subsequence, locally optimal subsequence, maximum sum interval, sequence analysis. 1 Introduction When analyzing long nucleic acid or protein sequences, the identification of unusual subsequences is an important task, since such features may be biologically significant. A common approach is to assign a score to each residue, and then look for contig...
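The abstract states the problem but not the procedure. The sketch below follows the usual cumulative-score formulation: each candidate segment records the running total just before its first score (L) and just after its last score (R), and a new positive score is merged leftward whenever that yields a higher-scoring segment. The function name is hypothetical, and the plain right-to-left scan used here is an assumption made for brevity; it does not by itself guarantee the linear running time the abstract claims, which depends on bookkeeping not reproduced here.

```python
def maximal_scoring_subsequences(scores):
    """Return all maximal scoring subsequences as (start, end, total).

    `end` is exclusive, so scores[start:end] is the segment.
    """
    segs = []   # candidate segments as [start, end, L, R]
    total = 0   # cumulative score of everything read so far
    for idx, s in enumerate(scores):
        left = total
        total += s
        if s <= 0:
            continue  # non-positive scores never start a segment
        cur = [idx, idx + 1, left, total]
        while True:
            # Find the rightmost earlier segment whose L is below cur's L.
            j = len(segs) - 1
            while j >= 0 and segs[j][2] >= cur[2]:
                j -= 1
            if j < 0 or segs[j][3] >= cur[3]:
                # No such segment, or it already reaches a total at least
                # as high as cur's R: cur stands as its own candidate.
                segs.append(cur)
                break
            # Otherwise extend cur left to absorb segment j and everything
            # after it, then re-examine the enlarged segment.
            cur = [segs[j][0], cur[1], segs[j][2], cur[3]]
            del segs[j:]
    return [(a, b, r - l) for a, b, l, r in segs]


if __name__ == "__main__":
    print(maximal_scoring_subsequences([4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]))
```

On the example input the three segments found are `[4]`, `[3]` and `[1, 2, -2, 2, -2, 1, 5]`; the last one shows several short positive runs being merged once a later score makes the combined segment score higher than any of its parts.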
Sequence Comparison Significance and Poisson Approximation
Stat. Sci., 1994
Cited by 31 (4 self)
The Chen-Stein method of Poisson approximation has been used to establish theorems about comparison of two DNA or protein sequences. The most useful result for sequence alignment applies to alignment scoring for aligned letters and no gaps. However, there has not been a valid method to assign statistical significance to alignment scores with gaps. In this paper we extend Poisson approximation techniques using the Aldous clumping heuristic to a practical method of estimating statistical significance.
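For orientation, the well-known ungapped result that this abstract refers to can be written as follows. This is the standard Karlin-Altschul form, given here as background rather than as the paper's gapped extension; the symbols are assumptions not used in the abstract: $m$ and $n$ are the two sequence lengths, $S$ is a score threshold, and $K$ and $\lambda$ are constants determined by the scoring scheme and letter frequencies.

```latex
% Expected number of distinct local alignments scoring at least S
E(S) \approx K\, m\, n\, e^{-\lambda S}

% Poisson approximation: probability that at least one such alignment occurs
P\left(S_{\max} \ge S\right) \approx 1 - e^{-E(S)} = 1 - \exp\!\left(-K\, m\, n\, e^{-\lambda S}\right)
```

The second line is where the Poisson approximation enters: the number of high-scoring alignments is treated as approximately Poisson with mean $E(S)$, so the chance of seeing none is $e^{-E(S)}$.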
Accurate formula for p-values of gapped local sequence and profile alignments
J. Mol. Biol., 2000
Cited by 31 (1 self)
A simple general approximation for the distribution of gapped local alignment scores is presented, suitable for assessing significance of comparisons between two protein sequences or a sequence and a profile. The approximation takes account of the scoring scheme (i.e., gap penalty and substitution matrix or profile), sequence composition and length. Use of this formula means it is unnecessary to fit an extreme-value distribution to simulations or to the results of data-bank searches. The method is based on the theoretical ideas introduced in (Mott & Tribe, 1999). Extensive simulation studies show that score thresholds produced by the method are accurate to within ±5% in 95% of cases. We also investigate factors which affect the accuracy of alignment statistics, and show that any method based on asymptotic theory is limited because asymptotic behaviour is not strictly achieved for many real protein sequences, due to extreme composition effects. Consequently it may not be practicable to find a general formula that is significantly more accurate until the sub-asymptotic behaviour of alignments is better understood.

