Results 1 - 10
of
479
A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features
- Machine Learning
, 1993
"... In the past, nearest neighbor algorithms for learning from examples have worked best in domains in which all features had numeric values. In such domains, the examples can be treated as points and distance metrics can use standard definitions. In symbolic domains, a more sophisticated treatment of t ..."
Abstract
-
Cited by 249 (3 self)
- Add to MetaCart
In the past, nearest neighbor algorithms for learning from examples have worked best in domains in which all features had numeric values. In such domains, the examples can be treated as points and distance metrics can use standard definitions. In symbolic domains, a more sophisticated treatment of the feature space is required. We introduce a nearest neighbor algorithm for learning in domains with symbolic features. Our algorithm calculates distance tables that allow it to produce real-valued distances between instances, and attaches weights to the instances to further modify the structure of feature space. We show that this technique produces excellent classification accuracy on three problems that have been studied by machine learning researchers: predicting protein secondary structure, identifying DNA promoter sequences, and pronouncing English text. Direct experimental comparisons with the other learning algorithms show that our nearest neighbor algorithm is comparable or superior ...
CATH -- a hierarchic classification of protein domain structures
- STRUCTURE
, 1997
"... Background: Protein evolution gives rise to families of structurally related proteins, within which sequence identities can be extremely low. As a result, structure-based classifications can be effective at identifying unanticipated relationships in known structures and in optimal cases function can ..."
Abstract
-
Cited by 210 (11 self)
- Add to MetaCart
Background: Protein evolution gives rise to families of structurally related proteins, within which sequence identities can be extremely low. As a result, structure-based classifications can be effective at identifying unanticipated relationships in known structures and in optimal cases function can also be assigned. The ever increasing number of known protein structures is too large to classify all proteins manually, therefore, automatic methods are needed for fast evaluation of protein structures. Results: We present a semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures (CATH). The four main levels of our classification are protein class (C), architecture (A), topology (T) and homologous superfamily (H). Class is the simplest level, and it essentially describes the secondary structure composition of each domain. In contrast, architecture summarises the shape revealed by the orientations of the secondary structure units, such as barrels and sandwiches. At the topology level, sequential connectivity is considered, such that members of the same architecture might have quite different topologies. When structures belonging to the same T-level have suitably high similarities combined with similar functions, the proteins are assumed to be evolutionarily related and put into the same homologous superfamily. Conclusions: Analysis of the structural families generated by CATH reveals the prominent features of protein structure space. We find that nearly a third of the homologous superfamilies (H-levels) belong to ten major T-levels, which we call superfolds, and furthermore that nearly two-thirds of these H-levels cluster into nine simple architectures. A database of well-characterised protein structure families, such as CATH, will facilitate the assignment of structure–function/ evolution relationships to both known and newly determined protein structures.
Learning structured prediction models: a large margin approach
, 2004
"... We consider large margin estimation in a broad range of prediction models where inference involves solving combinatorial optimization problems, for example, weighted graphcuts or matchings. Our goal is to learn parameters such that inference using the model reproduces correct answers on the training ..."
Abstract
-
Cited by 127 (7 self)
- Add to MetaCart
We consider large margin estimation in a broad range of prediction models where inference involves solving combinatorial optimization problems, for example, weighted graphcuts or matchings. Our goal is to learn parameters such that inference using the model reproduces correct answers on the training data. Our method relies on the expressive power of convex optimization problems to compactly capture inference or solution optimality in structured prediction models. Directly embedding this structure within the learning formulation produces concise convex problems for efficient estimation of very complex and diverse models. We describe experimental results on a matching task, disulfide connectivity prediction, showing significant improvements over state-of-the-art methods. 1.
Knowledge-based protein secondary structure assignment
- Proteins
, 1995
"... ABSTRACT We have developed an auto-matic algorithm STRIDE for protein secondary structure assignment from atomic coordinates based on the combined use of hydrogen bond energy and statistically derived backbone tor-sional angle information. Parameters of the pattern recognition procedure were optimiz ..."
Abstract
-
Cited by 97 (2 self)
- Add to MetaCart
ABSTRACT We have developed an auto-matic algorithm STRIDE for protein secondary structure assignment from atomic coordinates based on the combined use of hydrogen bond energy and statistically derived backbone tor-sional angle information. Parameters of the pattern recognition procedure were optimized using designations provided by the crystallog-raphers as a standard-of-truth. Comparison to the currently most widely used technique DSSP by Kabsch and Sander (Biopolymers 222577-2637, 1983) shows that STRIDE and DSSP as-sign secondary structural states in 58 and 31% of 226 protein chains in our data sample, re-spectively, in greater agreement with the spe-cific residue-by-residue definitions provided by the discoverers of the structures while in 11 % of the chains, the assignments are the same. STRIDE delineates every 11 th helix and every 32nd strand more in accord with published assignments. Q 1995 Wiley-Liss, Inc. Key words: protein structure analysis, hydro-gen bond, torsional angle, a-helix, p-sheet
A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach
- J MOL BIOL
, 2001
"... We have introduced a new method of protein secondary structure prediction which is based on the theory of support vector machine (SVM). SVM represents a new approach to supervised pattern classification which has been successfully applied to a wide range of pattern recognition problems, including ob ..."
Abstract
-
Cited by 97 (1 self)
- Add to MetaCart
We have introduced a new method of protein secondary structure prediction which is based on the theory of support vector machine (SVM). SVM represents a new approach to supervised pattern classification which has been successfully applied to a wide range of pattern recognition problems, including object recognition, speaker identification, gene function prediction with microarray expression profile, etc. In these cases, the performance of SVM either matches or is significantly better than that of traditional machine learning approaches, including neural networks. The first use of the SVM approach to predict protein secondary structure is described here. Unlike the previous studies, we first constructed several binary classifiers, then assembled a tertiary classifier for three secondary structure states (helix, sheet and coil) based on these binary classifiers. The SVM method achieved a good performance of segment overlap accuracy SOV = 76.2 % through sevenfold cross validation on a database of 513 non-homologous protein chains with multiple sequence alignments, which out-performs existing methods. Meanwhile three-state overall per-residue accuracy Q 3 achieved 73.5 %, which is at least comparable to existing single prediction methods. Furthermore a useful "reliability index" for the predictions was developed. In addition, SVM has many attractive features, including effective avoidance of overfitting, the ability to handle large feature spaces, information condensing of the given data set, etc. The SVM method is conveniently applied to many other pattern classification tasks in biology.
Review: Protein Secondary Structure Prediction Continues to Rise
- J. Struct. Biol
, 2001
"... f prediction accuracy? We shall see. 2001 Academic Press INTRODUCTION History. Linus Pauling correctly guessed the formation of helices and strands (14, 15) (and falsely hypothesized other structures). Three years before Pauling's guess was verified by the publications of the first X-ray structure ..."
Abstract
-
Cited by 92 (13 self)
- Add to MetaCart
f prediction accuracy? We shall see. 2001 Academic Press INTRODUCTION History. Linus Pauling correctly guessed the formation of helices and strands (14, 15) (and falsely hypothesized other structures). Three years before Pauling's guess was verified by the publications of the first X-ray structures (16, 17), one group had already ventured to predict secondary structure from sequence (18). The first-generation prediction methods following in the 1960s and 1970s were all based on single amino acid propensities (19). The second-generation methods dominating the scene until the early 1990s used propensities for segments of 3--51 adjacent residues (19). Basically any imaginable theoretical algorithm had been applied to the problem of predicting secondary structure from sequence. However, it seemed that prediction accuracy stalled at levels slightly above 60% (percentage of residues predicted correctly in one of the three states: helix, strand, and other). The reason for this limit was the
Exploiting the Past and the Future in Protein Secondary Structure Prediction
, 1999
"... Motivation: Predicting the secondary structure of a protein (alpha-helix, beta-sheet, coil) is an important step towards elucidating its three dimensional structure, as well as its function. Presently, the best predictors are based on machine learning approaches, in particular neural network archite ..."
Abstract
-
Cited by 91 (19 self)
- Add to MetaCart
Motivation: Predicting the secondary structure of a protein (alpha-helix, beta-sheet, coil) is an important step towards elucidating its three dimensional structure, as well as its function. Presently, the best predictors are based on machine learning approaches, in particular neural network architectures with a fixed, and relatively short, input window of amino acids, centered at the prediction site. Although a fixed small window avoids overfitting problems, it does not permit to capture variable long-ranged information. Results: We introduce a family of novel architectures which can learn to make predictions based on variable ranges of dependencies. These architectures extend recurrent neural networks, introducing non-causal bidirectional dynamics to capture both upstream and downstream information. The prediction algorithm is completed by the use of mixtures of estimators that leverage evolutionary information, expressed in terms of multiple alignments, both at the input and output levels. While our system currently achieves an overall performance close to 76% correct prediction---at least comparable to the best existing systems---the main emphasis here is on the development of new algorithmic ideas. Availability: The executable program for predicting protein secondary structure is available from the authors free of charge. Contact: pfbaldi@ics.uci.edu, gpollast@ics.uci.edu, brunak@cbs.dtu.dk, paolo@dsi.unifi.it. 1
Improving the Prediction of Protein Secondary Structure in Three and Eight Classes Using Recurrent Neural Networks and Profiles
, 2001
"... Secondarystructurepredictions areincreasinglybecomingtheworkhorseforseveralmethodsaimingatpredictingproteinstructure andfunction.Hereweuseensemblesofbidirectionalrecurrentneuralnetworkarchitectures, PSIBLAST -derivedprofiles,andalargenonredundant trainingsettoderivetwonewpredictors:(a)the secondvers ..."
Abstract
-
Cited by 87 (21 self)
- Add to MetaCart
Secondarystructurepredictions areincreasinglybecomingtheworkhorseforseveralmethodsaimingatpredictingproteinstructure andfunction.Hereweuseensemblesofbidirectionalrecurrentneuralnetworkarchitectures, PSIBLAST -derivedprofiles,andalargenonredundant trainingsettoderivetwonewpredictors:(a)the secondversionoftheSSproprogramforsecondary structureclassificationintothreecategoriesand(b) thefirstversionoftheSSpro8programforsecondarystructureclassificationintotheeightclasses producedbytheDSSPprogram.Wedescribethe resultsofthreedifferenttestsetsonwhichSSpro achievedasustainedperformanceofabout78% correctprediction.Wereportconfusionmatrices, comparePSI-BLASTtoBLAST-derivedprofiles,and assessthecorrespondingperformanceimprovements. SSproandSSpro8areimplementedasweb servers,availabletogetherwithotherstructural featurepredictorsat:http://promoter.ics.uci.edu/ BRNN-PRED/.Proteins2002;47:228--235.
Comprehensive assessment of automatic structural alignment against a manual standard, the Scop classification of proteins
- Protein Sci
, 1998
"... ..."

