A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach
- J MOL BIOL
, 2001
Cited by 97 (1 self)
We have introduced a new method of protein secondary structure prediction which is based on the theory of support vector machines (SVM). SVM represents a new approach to supervised pattern classification which has been successfully applied to a wide range of pattern recognition problems, including object recognition, speaker identification, and gene function prediction from microarray expression profiles. In these cases, the performance of SVM either matches or is significantly better than that of traditional machine learning approaches, including neural networks. The first use of the SVM approach to predict protein secondary structure is described here. Unlike the previous studies, we first constructed several binary classifiers, then assembled a tertiary classifier for three secondary structure states (helix, sheet and coil) based on these binary classifiers. The SVM method achieved a segment overlap accuracy of SOV = 76.2% through sevenfold cross-validation on a database of 513 non-homologous protein chains with multiple sequence alignments, which outperforms existing methods. Meanwhile, the three-state overall per-residue accuracy Q3 reached 73.5%, which is at least comparable to existing single prediction methods. Furthermore, a useful "reliability index" for the predictions was developed. In addition, SVM has many attractive features, including effective avoidance of overfitting, the ability to handle large feature spaces, and information condensing of the given data set. The SVM method can be conveniently applied to many other pattern classification tasks in biology.
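As a rough illustration of two ideas in this abstract, the sketch below computes the three-state per-residue accuracy Q3 and shows one common way to assemble a three-state decision from binary classifiers. The abstract does not say how the binary classifiers were combined, so taking the largest one-vs-rest decision value is an assumption here, and the function names, scores, and example strings are hypothetical.

```python
# Minimal sketch, not the authors' implementation.
# States: H = helix, E = sheet (strand), C = coil.

def q3_accuracy(observed: str, predicted: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)


def assemble_three_state(scores: dict) -> str:
    """Pick the state whose one-vs-rest classifier gives the largest score.

    `scores` maps each state ("H", "E", "C") to the decision value of a
    hypothetical binary classifier (state versus the other two). This
    argmax rule is an assumed combination scheme.
    """
    return max(scores, key=scores.get)


# Made-up example data for illustration only.
observed = "HHHHCCEEEE"
predicted = "HHHCCCEEEC"
print(q3_accuracy(observed, predicted))          # 8 of 10 residues agree
print(assemble_three_state({"H": 0.2, "E": -1.0, "C": 0.7}))
```

Q3 counts residues independently; the segment overlap measure (SOV) reported in the abstract scores agreement at the level of whole segments and is not reproduced here.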
Review: Protein Secondary Structure Prediction Continues to Rise
- J. Struct. Biol
, 2001
Cited by 92 (13 self)
... of prediction accuracy? We shall see. © 2001 Academic Press. Introduction. History: Linus Pauling correctly guessed the formation of helices and strands (14, 15) (and falsely hypothesized other structures). Three years before Pauling's guess was verified by the publication of the first X-ray structures (16, 17), one group had already ventured to predict secondary structure from sequence (18). The first-generation prediction methods of the 1960s and 1970s were all based on single amino acid propensities (19). The second-generation methods, which dominated the scene until the early 1990s, used propensities for segments of 3-51 adjacent residues (19). Basically any imaginable theoretical algorithm had been applied to the problem of predicting secondary structure from sequence. However, it seemed that prediction accuracy stalled at levels slightly above 60% (percentage of residues predicted correctly in one of the three states: helix, strand, and other). The reason for this limit was the
Improving the Prediction of Protein Secondary Structure in Three and Eight Classes Using Recurrent Neural Networks and Profiles
, 2001
Cited by 87 (21 self)
Secondary structure predictions are increasingly becoming the workhorse for several methods aiming at predicting protein structure and function. Here we use ensembles of bidirectional recurrent neural network architectures, PSI-BLAST-derived profiles, and a large nonredundant training set to derive two new predictors: (a) the second version of the SSpro program for secondary structure classification into three categories and (b) the first version of the SSpro8 program for secondary structure classification into the eight classes produced by the DSSP program. We describe the results of three different test sets on which SSpro achieved a sustained performance of about 78% correct prediction. We report confusion matrices, compare PSI-BLAST to BLAST-derived profiles, and assess the corresponding performance improvements. SSpro and SSpro8 are implemented as web servers, available together with other structural feature predictors at: http://promoter.ics.uci.edu/BRNN-PRED/. Proteins 2002;47:228-235.
The principled design of large-scale recursive neural network architectures: DAG-RNNs and the protein structure prediction problem
, 2003
Cited by 36 (8 self)
We describe a general methodology for the design of large-scale recursive neural network architectures (DAG-RNNs) which comprises three fundamental steps: (1) representation of a given domain using suitable directed acyclic graphs (DAGs) to connect visible and hidden node variables; (2) parameterization of the relationship between each variable and its parent variables by feedforward neural networks; and (3) application of weight-sharing within appropriate subsets of DAG connections to capture stationarity and control model complexity. Here we use these principles to derive several specific classes of DAG-RNN architectures based on lattices, trees, and other structured graphs. These architectures can process a wide range of data structures with variable sizes and dimensions. While the overall resulting models remain probabilistic, the internal deterministic dynamics allows efficient propagation of information, as well as training by gradient descent, in order to tackle large-scale problems. These methods are used here to derive state-of-the-art predictors for protein structural features such as secondary structure (1D) and both fine- and coarse-grained contact maps (2D). Extensions, relationships to graphical models, and implications for the design of neural architectures are briefly discussed. The protein prediction servers are available over the
Disulfide connectivity prediction using recursive neural networks and evolutionary information
- Bioinformatics
, 2004
Cited by 31 (3 self)
Motivation. We focus on the prediction of disulfide bridges in proteins starting from their amino acid sequence and from the knowledge of the disulfide bonding state of each cysteine. The location of disulfide bridges is a structural feature that conveys important information about the protein main chain conformation and can therefore help towards the solution of the folding problem. Existing approaches based on weighted graph matching algorithms do not take advantage of evolutionary information. Recursive neural networks (RNN), on the other hand, can handle in a natural way complex data structures such as graphs whose vertices are labeled by real vectors, allowing us to incorporate multiple alignment profiles in the graphical representation of disulfide connectivity patterns. Results. The core of the method is the use of machine learning tools to rank alternative disulfide connectivity patterns. We develop an ad-hoc RNN architecture for scoring labeled undirected graphs that represent connectivity patterns. In order to compare our algorithm with previous methods, we report experimental results on the SWISS-PROT 39 data set. We find that using multiple alignment profiles allows us to obtain significant prediction accuracy improvements, clearly demonstrating the important role played by evolutionary information. Availability. The Web interface of the predictor is available at
Prediction of Coordination Number and Relative Solvent Accessibility in Proteins
, 2001
Cited by 30 (10 self)
Knowing the coordination number and relative solvent accessibility of all the residues in a protein is crucial for deriving constraints useful in modeling protein folding and protein structure and in scoring remote homology searches. We develop ensembles of bidirectional recurrent neural network architectures to improve the state of the art in both contact and accessibility prediction, leveraging a large corpus of curated data together with evolutionary information. The ensembles are used to discriminate between two different states of residue contacts or relative solvent accessibility, higher or lower than a threshold determined by the average value of the residue distribution or the accessibility cutoff. For coordination numbers, the ensemble achieves performances ranging within 70.6-73.9% depending on the radius adopted to discriminate contacts (6-12). These performances represent gains of 16-20% over the baseline statistical predictor, always assigning an amino acid to the largest class, and are 4-7% better than any previous method. A combination of different radius predictors further improves performance. For accessibility thresholds in the relevant 15-30% range, the ensemble consistently achieves a performance above 77%, which is 10-16% above the baseline prediction and better than other existing predictors, by up to several percentage points. For both problems, we quantify the improvement due to evolutionary information in the form of PSI-BLAST-generated profiles over BLAST profiles. The prediction programs are implemented in the form of two web servers, CONpro and ACCpro, available at http://promoter.ics.uci.edu/BRNN-PRED/. Proteins 2002;47:142-153.
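The baseline mentioned in this abstract, a predictor that always assigns the largest class, can be sketched in a few lines. This is only an illustration of how a gain over such a baseline might be computed; the function names and the label counts below are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Minimal sketch of a majority-class baseline; not the authors' code.

def majority_baseline_accuracy(labels) -> float:
    """Accuracy (in %) of always predicting the most frequent class."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return 100.0 * max(counts.values()) / len(labels)


def gain_over_baseline(model_accuracy: float, labels) -> float:
    """Percentage points by which a model exceeds the majority baseline."""
    return model_accuracy - majority_baseline_accuracy(labels)


# Made-up two-state labels (1 = above threshold, 0 = below), for illustration.
labels = [1] * 55 + [0] * 45
print(majority_baseline_accuracy(labels))   # share of the largest class
print(gain_over_baseline(72.0, labels))     # 72.0 is a hypothetical model accuracy
```

Reporting the gain alongside the raw accuracy matters because the baseline shifts with the threshold: a more unbalanced split makes the largest class, and hence the baseline, harder to beat.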
EVA: Large-Scale Analysis of Secondary Structure Prediction
, 2001
Cited by 28 (7 self)
EVA is a web-based server that evaluates automatic structure prediction servers continuously and objectively. Since June 2000, EVA collected more than 20,000 secondary structure predictions. The EVA sets sufficed to conclude that the field of secondary structure prediction has advanced again. Accuracy increased substantially in the 1990s through using evolutionary information taken from the divergence of proteins in the same structural family. Recently, the evolutionary information resulting from improved searches and larger databases has again boosted prediction accuracy by more than 4% to its current height around 76% of all residues predicted correctly in one of the three states: helix, strand, or other. The best current methods solved most of the problems raised at earlier CASP meetings: All good methods now get segments right and perform well on strands. Is the recent increase in accuracy significant enough to make predictions even more useful? We believe the answer is affirmative. What is the limit of prediction accuracy? We shall see. All data are available through the EVA web site at {cubic.bioc.columbia.edu/eva/}. The raw data for the results presented are available at {eva}/sec/bup_common/2001_02_22/. Proteins 2001;Suppl 5:192-199. © 2002 Wiley-Liss, Inc. Keywords: automatic evaluation; large-scale assessment; protein structure prediction
A general framework for unsupervised processing of structured data
- NEUROCOMPUTING
, 2004
Combining Discriminant Models with new Multi-Class SVMs
, 2000
Cited by 27 (6 self)
The idea of combining models instead of simply selecting the best one, in order to improve performance, is well known in statistics and has a long theoretical background. However, making full use of theoretical results is ordinarily subject to the satisfaction of strong hypotheses (weak correlation among the errors, availability of large training sets, possibility to rerun the training procedure an arbitrary number of times, etc.). In contrast, the practitioner who has to make a decision is frequently faced with the difficult problem of combining a given set of pretrained classifiers, with highly correlated errors, using only a small training sample. Overfitting is then the main risk, which cannot be overcome but with a strict complexity control of the combiner selected. This suggests that SVMs, which implement the SRM inductive principle, should be well suited for these difficult situations. Investigating this idea, we introduce a new family of multi-class SVMs and assess them as ensemble methods on a real-world problem. This task, protein secondary structure prediction, is an open problem in biocomputing for which model combination appears to be an issue of central importance. Experimental evidence highlights the gain in quality resulting from combining some of the most widely used prediction methods with our SVMs rather than with the ensemble methods traditionally used in the field. The gain is increased when the outputs of the combiners are post-processed with a simple DP algorithm.

