Empirical statistical estimates for sequence similarity searches
J. Mol. Biol., 1998
Cited by 66 (3 self)
Sequence similarity searches today are the most effective method for exploiting the information in the rapidly growing DNA and protein sequence databases. One of the most dramatic improvements
Sequence Comparison Significance and Poisson Approximation
Stat. Sci., 1994
Cited by 31 (4 self)
The Chen-Stein method of Poisson approximation has been used to establish theorems about comparison of two DNA or protein sequences. The most useful result for sequence alignment applies to alignment scoring for aligned letters and no gaps. However, there has not been a valid method to assign statistical significance to alignment scores with gaps. In this paper we extend Poisson approximation techniques, using the Aldous clumping heuristic, to a practical method of estimating statistical significance.
Accurate formula for p-values of gapped local sequence and profile alignments
J. Mol. Biol., 2000
Cited by 31 (1 self)
A simple general approximation for the distribution of gapped local alignment scores is presented, suitable for assessing significance of comparisons between two protein sequences or a sequence and a profile. The approximation takes account of the scoring scheme (i.e., gap penalty and substitution matrix or profile), sequence composition, and length. Use of this formula means it is unnecessary to fit an extreme-value distribution to simulations or to the results of data-bank searches. The method is based on the theoretical ideas introduced by Mott & Tribe (1999). Extensive simulation studies show that score thresholds produced by the method are accurate to within ±5% 95% of the time. We also investigate factors which affect the accuracy of alignment statistics, and show that any method based on asymptotic theory is limited because asymptotic behaviour is not strictly achieved for many real protein sequences, due to extreme composition effects. Consequently, it may not be practicable to find a general formula that is significantly more accurate until the sub-asymptotic behaviour of alignments is better understood.
Qualified Answers That Reflect User Needs and Preferences
In Proceedings of the International Conference on Very Large Data Bases, 1994
Cited by 16 (6 self)
This paper introduces a formalism to describe the needs and preferences of database users. Because of the precise formulation of these concepts, we have found an automatic and very simple mechanism to incorporate user needs and preferences into the query answering process. In the formalism, the user provides a lattice of domain-independent values that define preferences and needs, and a set of domain-specific user constraints qualified with lattice values. The constraints are automatically incorporated into a relational or deductive database through a series of syntactic transformations that produces an annotated deductive database. Query answering procedures for deductive databases are then used, with minor modifications, to obtain annotated answers to queries. Because preference declaration is separated from data representation and management, preferences can be easily altered without touching the database. Also, the query language allows users to ask for answers at different prefere...
Rapid Significance Estimation in Local Sequence Alignment with Gaps
2001
Cited by 13 (1 self)
To assess the significance of sequence alignments, it is crucial to know the distribution of alignment scores of pairs of random sequences. For gapped local alignment it is empirically known that the shape of this distribution is of the Gumbel form. However, determining the parameters of this distribution is computationally very expensive. We present a new algorithmic approach that estimates the more important of the Gumbel parameters at least five times faster than the traditional methods. Actual runtimes of our algorithm, ranging from less than a second to a few minutes on a workstation, bring significance estimation into the realm of interactive applications.
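As an illustration of how a Gumbel-form score distribution turns into a p-value, the sketch below uses the standard extreme-value expression P(S >= x) = 1 - exp(-K m n e^(-lambda x)). It is not the estimation algorithm of the paper above; the function name and the parameter values in the usage line are assumptions chosen only for illustration, and in practice lambda and K must be estimated for the scoring scheme in use.

```python
import math

def gumbel_pvalue(score, m, n, lam, K):
    """P(S >= score) for two random sequences of lengths m and n,
    assuming a Gumbel (extreme-value) null distribution.
    lam and K are the Gumbel parameters (assumed known here)."""
    # Expected number of chance alignments scoring at least `score`.
    expected_hits = K * m * n * math.exp(-lam * score)
    # 1 - exp(-E), computed with expm1 for accuracy when E is tiny.
    return -math.expm1(-expected_hits)

# Hypothetical parameter values, for illustration only.
print(gumbel_pvalue(score=50, m=300, n=300, lam=0.267, K=0.041))
```

Because the score enters through the exponent lambda * x, small errors in lambda shift p-values by large factors, which is one reason it is commonly treated as the more important of the two parameters.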
Bootstrapping and Normalization for Enhanced Evaluations of Pairwise Sequence Comparison
Proc. IEEE, 2002
Hybrid Alignment: High-Performance with Universal Statistics
Bioinformatics, 2002
Cited by 7 (1 self)
The score statistics of a recently introduced "hybrid alignment" algorithm are studied in detail numerically. An extensive survey across the 2,216 models of protein domains contained in the Pfam v5.4 database (Bateman et al. 2000. Nucl. Acids Res. 28:263-266) verifies the theoretical predictions: for the position-specific scoring functions used in the Pfam models, the score statistics of hybrid alignment obey the Gumbel distribution, with the key Gumbel parameter taking on the asymptotic value 1 universally for all models. Thus, the use of hybrid alignment eliminates the time-consuming computer simulations normally needed to assign p-values to alignment scores. The performance of the hybrid algorithm in detecting sequence homology is also studied, using protein sequences from the SCOP (Murzin et al. 1995. J. Mol. Biol. 247:536-540) and PfamA databases. The performance is found to be comparable to the best of the existing methods. Hybrid alignment is thereby established as a high-performance alignment algorithm with well-characterized, universal statistics.
An Analytic Approach to Significance Assessment in Local Sequence Alignment with Gaps
1999
Cited by 6 (2 self)
A detailed study of the Smith-Waterman alignment algorithm is performed in order to find an analytical approach to the problem of assessing the statistical significance of local alignments with gaps. The significance is shown to be given in terms of an eigenvalue equation which captures the dynamics of the much simpler global alignment algorithm. This eigenvalue equation is then explicitly solved for a simple scoring system and the resulting significance estimations are verified by a comparison to extensive numerical simulations.
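For readers unfamiliar with the algorithm under study, here is a minimal sketch of Smith-Waterman local alignment scoring. It assumes a simple match/mismatch scoring system with a linear gap penalty; the function name and default scores are illustrative choices, not taken from the paper, and the eigenvalue analysis itself is not reproduced here.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b
    (linear gap penalty; scoring values are illustrative defaults)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for ca in a:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Clamping at 0 lets an alignment restart anywhere, which is
            # what makes the alignment local; dropping the clamp gives
            # the global alignment recursion.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

print(smith_waterman_score("xxACGTyy", "zzACGTww"))
```

Only two rows of the table are kept because the score alone is needed; recovering the alignment itself would require the full table or a traceback structure.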
Statistical significance in biological sequence comparison
Handbook of Statistical Genetics, 2001
Cited by 4 (0 self)
The availability of comprehensive sequence databases, rapid sequence comparison methods, and accurate statistical estimates for sequence similarity has fundamentally changed the practice of biochemistry and molecular biology. With the possible exceptions of E. coli and Saccharomyces, the vast majority of the genes in newly sequenced genomes are characterized by sequence similarity searching. BLAST, FASTA, and Smith-Waterman similarity searches provide the most informative and reliable method for inferring the biological function of an anonymous gene (or the protein that it encodes). Typically, 60–80% of eubacterial (and yeast) genes share statistically significant sequence similarity with sequences from another organism. Significant sequence similarity can be used to infer common ancestors and similar three-dimensional structures, and is routinely used to assign functions in metabolic pathways. Even for the first archaebacterial genome sequenced (M. jannaschii; Bult et al., 1996), similarity-based functional gene assignments could be made for about 50% of the genes (Andrade et al., 1997) and subsequent sequence analyses (Koonin, 1997) suggested functions for another 20% of the genes. Unfortunately, some investigators are uncomfortable inferring the relationship between two sequences from a probability or expectation value; they prefer to think in terms of percent identity
A Practical Approach to Significance Assessment in Alignment with Gaps
Cited by 1 (0 self)
Current numerical methods for assessing the statistical significance of local alignments with gaps are time consuming. Analytical solutions thus far have been limited to specific cases. Here, we present a new line of attack on the problem of statistical significance assessment. We combine this new approach with known properties of the dynamics of the global alignment algorithm and high-performance numerical techniques, and present a novel method for assessing significance of gaps within practical time scales. The results and performance of these new methods test very well against tried methods, with drastically less effort.

