Sequence Comparisons Using Multiple Sequences Detect Three Times as Many Remote . . .
, 1998
Cited by 147 (14 self)
The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representation of the characteristics shared by related sequences. Here we describe an assessment of three of these methods: the SAM-T98 implementation of a hidden Markov model procedure; PSI-BLAST; and the intermediate sequence search (ISS) procedure. We determined the extent to which these procedures can detect evolutionary relationships between the members of the sequence database PDBD40-J. This database, derived from the structural classification of proteins (SCOP), contains the sequences of proteins of known structure whose sequence identities with each other are 40% or less. The evolutionary relationships that exist between those that have low sequence identities were found by the examination of their structural details and, in many cases, their functional
GenTHREADER: An Efficient and Reliable Protein Fold Recognition Method for Genomic Sequences
- J. Mol. Biol
, 1999
Cited by 119 (8 self)
Ouzounis et al., 1993; Abagyan et al., 1994; Nishikawa & Matsuo, 1994; Flöckner et al., 1995; Lathrop & Smith, 1996; Madej et al., 1995; Fischer & Eisenberg, 1996; Defay & Cohen, 1996; Russell et al., 1996). Blind testing has shown that fold recognition methods can be very effective (Shortle, 1997), and so it is surprising that they are not being more widely applied to genome analysis. Three problems with fold recognition methods probably contribute to their lack of use: their slowness, the requirement for human intervention to interpret the results and the inaccuracy of sequence-structure alignments produced. Different methods suffer from each of these problems to differing degrees. Of the three problems, the lack of automation in the fold recognition process is perhaps the biggest problem in the application of threading methods to genomic sequence analysis. Whilst it is reasonable to require some human intervention when predicting the structure of just a few sequences, this is clearl
Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology
, 1996
Cited by 105 (20 self)
This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences, when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined with observed amino acid frequencies, to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, such that remotely related family members can be more reliably recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and other methods for computing these expected amino acid distributions in database search, resulting in fewer false positives and false negatives for the families tested. This paper corrects a previously p...
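As a rough illustration of how a Dirichlet mixture can be combined with observed counts, the sketch below computes the posterior mean estimate of symbol probabilities for one alignment column. It is a minimal sketch and not the authors' code: the function names, the three-letter toy alphabet, and the mixture parameters are all assumptions made for the example (a real protein mixture would have 20 pseudocount values per component).

```python
from math import exp, lgamma, log


def log_beta(values):
    """Log of the multivariate Beta function: sum(lgamma(v)) - lgamma(sum(v))."""
    return sum(lgamma(v) for v in values) - lgamma(sum(values))


def posterior_mean(counts, mixture):
    """Estimate symbol probabilities from counts and a Dirichlet mixture.

    counts  -- observed count for each symbol in one column
    mixture -- list of (weight, alphas) pairs; hypothetical toy parameters
    """
    # Log of weight * P(counts | component), up to a constant shared by
    # all components (the multinomial coefficient cancels on normalisation).
    log_w = []
    for weight, alphas in mixture:
        posterior = [n + a for n, a in zip(counts, alphas)]
        log_w.append(log(weight) + log_beta(posterior) - log_beta(alphas))

    # Normalise in a numerically stable way to get component posteriors.
    shift = max(log_w)
    w = [exp(x - shift) for x in log_w]
    z = sum(w)

    total = sum(counts)
    estimate = [0.0] * len(counts)
    for wj, (_, alphas) in zip(w, mixture):
        denom = total + sum(alphas)
        for i, (n, a) in enumerate(zip(counts, alphas)):
            # Each component contributes its own pseudocount-smoothed estimate.
            estimate[i] += (wj / z) * (n + a) / denom
    return estimate
```

With a single component this reduces to ordinary pseudocount smoothing, for example posterior_mean([3, 1, 0], [(1.0, [1.0, 1.0, 1.0])]) gives 4/7, 2/7 and 1/7. With no observations the estimate falls back to the weighted prior means of the components.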
Within the Twilight Zone: A Sensitive Profile-Profile Comparison Tool Based on Information Theory
- J. Mol. Biol
, 2002
Cited by 71 (4 self)
This paper presents a novel approach to profile-profile comparison. The method compares two input profiles (like those that are generated by PSI-BLAST) and assigns a similarity score to assess their statistical similarity. Our profile-profile comparison tool, which allows for gaps, can be used to detect weak similarities between protein families. It has also been optimized to produce alignments that are in very good agreement with structural alignments. Tests show that the profile-profile alignments are indeed highly correlated with similarities between secondary structure elements and tertiary structure. Exhaustive evaluations show that our method is significantly more sensitive in detecting distant homologies than the popular profile-based search programs PSI-BLAST and IMPALA. The relative improvement is the same order of magnitude as the improvement of PSI-BLAST relative to BLAST. Our new tool often detects similarities that fall within the twilight zone of sequence similarity
Valencia A: Effective use of sequence correlation and conservation in fold recognition
- J Mol Biol
, 1999
Cited by 37 (3 self)
Protein families are a rich source of information; sequence conservation and sequence correlation are two of the main properties that can be derived from the analysis of multiple sequence alignments. Sequence conservation is related to the direct evolutionary pressure to retain the chemical characteristics of some positions in order to maintain a given function. Sequence correlation is attributed to the small sequence adjustments needed to maintain protein stability against constant mutational drift. Here, we showed that sequence conservation and correlation were each frequently informative enough to detect incorrectly folded proteins. Furthermore, combining conservation, correlation, and polarity, we achieved an almost perfect discrimination between native and incorrectly folded proteins. Thus, we made use of this information for threading by evaluating the models suggested by a threading method according to the degree of proximity of the corresponding correlated, conserved, and apolar residues. The results showed that the fold recognition capacity of a given threading approach could be improved almost fourfold by selecting the alignments that score best under the three different sequence-based approaches.
Grouping Web Page References into Transactions for Mining World Wide Web Browsing Patterns
- Dept. of Computer Science, Univ. of Minnesota
, 1997
Cited by 27 (2 self)
Web-based organizations often generate and collect large volumes of data in their daily operations. Analyzing such data involves the discovery of meaningful relationships from a large collection of primarily unstructured data, often stored in Web server access logs. While traditional domains for data mining, such as point of sale databases, have naturally defined transactions, there is no convenient method of clustering web references into transactions. This paper identifies a model of user browsing behavior that separates web page references into those made for navigation purposes and those for information content purposes. A transaction identification method based on the browsing model is defined and successfully tested against other methods, such as the maximal forward reference algorithm proposed in [1]. Transactions identified by the proposed methods are used to discover association rules from real world data using the WEBMINER system [7].
Fast Probabilistic Analysis of Sequence Function Using Scoring Matrices
, 2000
Cited by 13 (4 self)
Motivation: We present techniques for increasing the speed of sequence analysis using scoring matrices. Our techniques are based on calculating, for a given scoring matrix, the quantile function, which assigns a probability, or p, value to each segmental score. Our techniques also permit the user to specify a p threshold to indicate the desired tradeoff between sensitivity and speed for a particular sequence analysis. The resulting increase in speed should allow scoring matrices to be used more widely in large-scale sequencing and annotation projects. Results: We develop three techniques for increasing the speed of sequence analysis: probability filtering, lookahead scoring, and permuted lookahead scoring. In probability filtering, we compute the score threshold that corresponds to the user-specified p threshold. We use the score threshold to limit the number of segments that are retained in the search process. In lookahead scoring, we test intermediate scores to determine whether they wi...
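The probability filtering step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes integer scores, independent background symbol frequencies, and an exact score distribution built by convolving one matrix column at a time. The function names and the data layout (a list of per-position dictionaries) are hypothetical.

```python
from collections import defaultdict


def score_distribution(matrix, background):
    """Exact distribution of segment scores under the background model.

    matrix     -- list of dicts, one per position, mapping symbol -> int score
    background -- dict mapping symbol -> probability (assumed independent)
    """
    dist = {0: 1.0}
    for column in matrix:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for symbol, s in column.items():
                nxt[score + s] += prob * background[symbol]
        dist = dict(nxt)
    return dist


def score_threshold(matrix, background, p_threshold):
    """Smallest score whose upper-tail probability is at most p_threshold.

    Returns None when even the maximum score is more probable than that.
    """
    dist = score_distribution(matrix, background)
    tail = 0.0
    threshold = None
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > p_threshold:
            break
        threshold = score
    return threshold


def filter_segments(sequence, matrix, threshold):
    """Keep only (start, score) pairs for segments scoring >= threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = sum(matrix[k][sequence[start + k]] for k in range(width))
        if score >= threshold:
            hits.append((start, score))
    return hits
```

A lower p threshold yields a higher score threshold, so fewer segments are retained and the search runs faster at the cost of sensitivity. Callers should check for a None threshold before filtering.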
A Novel Approach to Remote Homology Detection: Jumping Alignments
- J Comput Biol
, 2002
Cited by 11 (1 self)
We describe a new algorithm for protein classification and the detection of remote homologs. The rationale is to exploit both vertical and horizontal information of a multiple alignment in a well-balanced manner. This is in contrast to established methods such as profiles and profile hidden Markov models which focus on vertical information as they model the columns of the alignment independently and to family pairwise search which focuses on horizontal information as it treats given sequences separately. In our setting, we want to select from a given database of "candidate sequences" those proteins that belong to a given superfamily. In order to do so, each candidate sequence is separately tested against a multiple alignment of the known members of the superfamily by means of a new jumping alignment algorithm. This algorithm is an extension of the Smith-Waterman algorithm and computes a local alignment of a single sequence and a multiple alignment. In contrast to traditional methods, however, this alignment is not based on a summary of the individual columns of the multiple alignment. Rather, the candidate sequence is at each position aligned to one sequence of the multiple alignment, called the "reference sequence." In addition, the reference sequence may change within the alignment, while each such jump is penalized. To evaluate the discriminative quality of the jumping alignment algorithm, we compare it to profiles, profile hidden Markov models, and family pairwise search on a subset of the SCOP database of protein domains. The discriminative quality is assessed by median false positive counts (med-FP-counts). For moderate med-FP-counts, the number of successful searches with our method is considerably higher than with the competing methods.
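The idea of aligning a candidate to one reference row at a time, with penalized jumps between rows, can be illustrated with a simplified dynamic program. This is a sketch only and is not the authors' recurrence: it assumes a simple match/mismatch score, a linear gap penalty, a fixed non-negative jump penalty, and it scores a gap character in the reference row like a gap. The function name and all default parameter values are hypothetical.

```python
def jumping_alignment_score(candidate, rows, match=2, mismatch=-1, gap=2, jump=3):
    """Best local alignment score of a sequence against a multiple alignment.

    candidate -- query sequence (string)
    rows      -- equal-length aligned sequences; '-' marks an alignment gap
    """
    n, m, k_rows = len(candidate), len(rows[0]), len(rows)
    # H[i][j][k]: best score of a local alignment ending at candidate
    # position i and alignment column j, with row k as reference sequence.
    H = [[[0] * k_rows for _ in range(m + 1)] for _ in range(n + 1)]

    def reach(i, j, k):
        # Either stay on row k, or jump from the best row and pay the penalty.
        cell = H[i][j]
        return max(cell[k], max(cell) - jump)

    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(k_rows):
                ref = rows[k][j - 1]
                if ref == "-":
                    sub = -gap  # simplification: treat as a gap
                else:
                    sub = match if ref == candidate[i - 1] else mismatch
                score = max(
                    0,                          # start a new local alignment
                    reach(i - 1, j - 1, k) + sub,  # align residue to column
                    reach(i - 1, j, k) - gap,      # residue left unaligned
                    reach(i, j - 1, k) - gap,      # column skipped
                )
                H[i][j][k] = score
                best = max(best, score)
    return best
```

With a single row this reduces to a plain Smith-Waterman score. With rows "ACDEFGHI" and "KLMNPQRS", the candidate "ACDEPQRS" scores 13 under the defaults: eight matches minus one jump penalty.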
De Novo Protein Design. II. Plasticity in Sequence Space
- J. Mol. Biol
, 1999
Cited by 2 (0 self)
It has been hypothesized that the total number of different protein folds is finite, and roughly of the order of 1000 (Chothia, 1992; Orengo et al., 1994; Wang, 1998). Once examples of every fold are known, protein structure prediction would reduce to the inverse protein folding problem, which consists in identifying which sequences are compatible with a given fold (Drexler, 1981). This alternative view of structure determination is the basis of the new field of structural genomics, which aims to deliver structural information about most genome-derived protein sequences. While it is not feasible to determine experimentally the structure of every protein, useful models can be obtained by fold recognition and comparative modeling, provided there is a comprehensive library of folds (Kim, 1998). Structural genomics is currently focusing on the construction of such a library, and a figure of 10,000 to 100,000 representative proteins has been proposed (Sali, 1998). W
Protein Design. I. In Search of Stability and Specificity
- J. Mol. Biol
, 1999
this report, the first in a series of two, the method is fully described, emphasizing the importance of each term of the energy function considered. The success of the procedure is assessed with respect to its ability to design in and design out sequences for different template backbones, including the B1 domain of protein G, lambda repressor and myoglobin. Comparisons with experimental mutation data are provided for protein G and lambda repressor. In the accompanying paper, we show in greater detail how much sequence information can be retrieved from the backbone template of a protein using our physical energy function (Koehl & Levitt, 1999)

