
## Building multiclass classifiers for remote homology detection and fold recognition (2006)

Venue: | BMC BIOINFORMATICS |

Citations: | 10 - 6 self |

### BibTeX

@ARTICLE{Rangwala06buildingmulticlass,

author = {Huzefa Rangwala and George Karypis},

title = {Building multiclass classifiers for remote homology detection and fold recognition},

journal = {BMC BIOINFORMATICS},

year = {2006},

pages = {2006}

}


### Abstract

Background: Protein remote homology detection and fold recognition are central problems in computational biology. Supervised learning algorithms based on support vector machines are currently one of the most effective methods for solving these problems. These methods are primarily used to solve binary classification problems and they have not been extensively used to solve the more general multiclass remote homology prediction and fold recognition problems.

Results: We present a comprehensive evaluation of a number of methods for building SVM-based multiclass classification schemes in the context of the SCOP protein classification. These methods include schemes that directly build an SVM-based multiclass model, schemes that employ a second-level learning approach to combine the predictions generated by a set of binary SVM-based classifiers, and schemes that build and combine binary classifiers for various levels of the SCOP hierarchy beyond those defining the target classes.

Conclusion: Analyzing the performance achieved by the different approaches on four different datasets, we show that most of the proposed multiclass SVM-based classification approaches are quite effective in solving the remote homology prediction and fold recognition problems. The schemes that use predictions from binary models constructed for ancestral categories within the SCOP hierarchy tend not only to lead to lower error rates but also to reduce the number of errors in which a superfamily is assigned to an entirely different fold and a fold is predicted as being from a different SCOP class. Our results also show that the limited size of the training data makes it hard to learn complex second-level models, and that models of moderate complexity lead to consistently better results.

#### Background

Breakthroughs in large-scale sequencing have led to a surge in the available protein sequence information that has far outstripped our ability to experimentally characterize their functions.
As a result, researchers are increasingly relying on computational techniques to classify proteins into functional and structural families based solely on their primary amino acid sequences. While satisfactory methods exist to detect homologs with high levels of similarity, accurately detecting homologs at low levels of sequence similarity (remote homology detection) remains a challenging problem.

Over the years, several methods have been developed to address the problems of remote homology prediction and fold recognition. These include methods based on pairwise sequence comparisons [...]. Recent advances in string kernels that have been specifically designed for protein sequences and capture their evolutionary relationships [...].

Even though highly accurate SVM-based binary classifiers can go a long way in addressing some of the biologist's requirements, it is still unknown how best to combine the predictions of a set of SVM-based binary classifiers to solve the multiclass classification problem and assign a protein sequence to a particular superfamily or fold. Moreover, it is not clear whether schemes that combine binary classifiers are inherently better suited for solving the remote homology prediction and fold recognition problems than schemes that directly build an SVM-based multiclass classification model.

The work of Ding et al. [17] recognized this problem and used a simple voting mechanism to combine the predictions obtained from binary base classifiers. They used not only one-versus-rest classifiers but also several one-versus-one classifiers, and a combination of the two, to obtain good classification results. Huang et al. [18] exploited the hierarchical nature of the SCOP database by making predictions in a hierarchical fashion: a classifier was first used to classify the sequences into the four major classes, and then into folds. Recently, Ie et al.
[19] developed schemes for combining the outputs of a set of binary SVM-based classifiers, primarily for solving the remote homology detection problem. Specifically, borrowing ideas from error-correcting output codes [...]. These schemes are thoroughly evaluated for both remote homology detection and fold recognition using four different datasets derived from SCOP.

#### Results

##### Algorithms For Direct SVM-based K-way Classifier Solution

One way of solving the K-way classification problem using support vector machines is to use one of the many multiclass formulations for SVMs that were developed over the years. In this study we evaluate the effectiveness of one of these formulations, developed by Crammer and Singer [31], which leads to reasonably efficient optimization problems. This formulation aims to learn a matrix $W$ of size $K \times n$ such that the predicted class $y^*$ for an instance $x$ is given by

$$y^* = \arg\max_{i=1,\ldots,K} \langle W_i, x \rangle,$$

where $W_i$ is the $i$th row of $W$, whose dimension is $n$. This formulation models each class $i$ by its own hyperplane (whose normal vector corresponds to the $i$th row of $W$) and assigns an example $x$ to the class $i$ that maximizes its corresponding hyperplane distance. $W$ itself is learned from the training data following a maximum-margin formulation with soft constraints that gives rise to the following optimization problem [31]:

$$\min_{W} \; \frac{\beta}{2}\|W\|^2 + \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad \forall i, z: \; \langle W_{y_i}, x_i \rangle + \delta_{y_i,z} - \langle W_z, x_i \rangle \ge 1 - \xi_i,$$

where $\xi_i \ge 0$ are slack variables, $\beta > 0$ is a regularization constant, and $\delta_{y_i,z}$ is equal to 1 if $z = y_i$, and 0 otherwise.
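As a rough illustration of the direct K-way formulation, the sketch below evaluates the prediction rule and the soft-margin objective for a given matrix `W`. It assumes the standard Crammer-Singer objective (regularizer plus one slack per training example, with each slack taken at its smallest feasible value) and a plain linear inner product; the function names `cs_predict` and `cs_objective` and the toy data are hypothetical, and this is not the solver used in the paper.

```python
import numpy as np

def cs_predict(W, X):
    """Assign each row of X to the class whose hyperplane score <W_i, x> is largest."""
    return np.argmax(X @ W.T, axis=1)

def cs_objective(W, X, y, beta=1.0):
    """Soft-margin objective: (beta/2)*||W||^2 + sum of per-example slacks.

    The slack of example i is the smallest xi_i >= 0 that satisfies
    <W_{y_i}, x_i> + delta(y_i, z) - <W_z, x_i> >= 1 - xi_i for every class z.
    """
    m = len(y)
    scores = X @ W.T                          # shape (m, K)
    delta = np.zeros_like(scores)
    delta[np.arange(m), y] = 1.0              # delta(y_i, z)
    true_scores = scores[np.arange(m), y][:, None]
    # For z = y_i the expression is 0, so the max is never negative.
    slack = np.max(1.0 - delta + scores - true_scores, axis=1)
    return 0.5 * beta * np.sum(W ** 2) + np.sum(slack)

# Toy example (hypothetical data): two classes, two features.
W = np.eye(2)
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
pred = cs_predict(W, X)
obj = cs_objective(W, X, y, beta=1.0)
```

In a kernelized setting such as the one used in the paper, the inner products would be replaced by kernel evaluations in the dual problem; the linear form here is only meant to make the constraint structure concrete.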
As in the binary support vector machines, the dual version of the optimization problem and the resulting classifier depend only on inner products, which allows us to use any of the recently developed protein string kernels.

##### Merging K One-vs-Rest Binary Classifiers

An alternate way of solving the K-way classification problem in the context of SVM is to first build a set of $K$ one-versus-rest binary classification models $\{f_1, f_2, \ldots, f_K\}$, use all of them to predict an instance $x$, and then, based on the predictions of these base classifiers $\{f_1(x), f_2(x), \ldots, f_K(x)\}$, assign $x$ to one of the $K$ classes.

##### Max Classifier

A common way of combining the predictions of a set of $K$ one-versus-rest binary classifiers is to assume that the $K$ outputs are directly comparable and assign $x$ to the class that achieved the highest one-versus-rest prediction value; that is, the prediction $y^*$ for an instance $x$ is given by

$$y^* = \arg\max_{i=1,\ldots,K} f_i(x).$$

However, the assumption that the output scores of the different binary classifiers are directly comparable may not be valid, as different classes may be of different sizes and/or less separable from the rest of the dataset, which indirectly affects the nature of the binary model that was learned.
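The MaxClassifier rule can be sketched in a few lines. The example assumes the one-versus-rest outputs have already been collected into a matrix with one row per instance and one column per class; the name `max_classifier` and the scores are hypothetical.

```python
import numpy as np

def max_classifier(scores):
    """Return, for each row of one-vs-rest outputs f_1(x)..f_K(x), the index of the largest."""
    scores = np.asarray(scores, dtype=float)
    return np.argmax(scores, axis=1)

# Hypothetical outputs of K = 3 binary classifiers for two instances.
scores = np.array([[0.2, -1.0, 0.5],
                   [1.3,  0.1, -0.4]])
pred = max_classifier(scores)
```

Because the rule compares raw outputs across classifiers, any systematic difference in the scale of the individual classifiers carries straight through to the prediction, which is the weakness the second-level models are meant to address.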
##### Cascaded SVM-Learning Approaches

A promising approach that has been explored in combining the outputs of $K$ binary classification models is to formulate it as a cascaded learning problem in which a second-level model is trained on the outputs of the binary classifiers to correctly solve the multiclass classification problem. A simple model that can be learned is the scaling model, in which the final prediction for an instance $x$ is given by

$$y^* = \arg\max_{i=1,\ldots,K} w_i f_i(x),$$

where $w_i$ is a factor used to scale the functional output of the $i$th classifier, and the set of $K$ scaling factors $w_i$ makes up the model that is learned during the second-level training phase. An extension to the above scheme is to also incorporate a shift parameter $s_i$ with each of the classes and learn a model whose prediction is given by

$$y^* = \arg\max_{i=1,\ldots,K} \left( w_i f_i(x) + s_i \right). \tag{5}$$

The motivation behind this model is to emulate the expressive power of the z-score approach (i.e., $w_i = 1/\sigma_i$, $s_i = -\mu_i/\sigma_i$) but learn these parameters using a maximum-margin framework. A further model applies the Crammer-Singer formulation to the outputs of the binary classifiers, with the prediction given by

$$y^* = \arg\max_{i=1,\ldots,K} \langle W_i, f(x) \rangle, \tag{6}$$

where $f(x) = (f_1(x), f_2(x), \ldots, f_K(x))$ is the vector containing the $K$ outputs of the one-versus-rest binary classifiers. We will refer to this as the Crammer-Singer (CS) model.

Comparing the scaling approach to the Crammer-Singer approach, we can see that the Crammer-Singer methodology is more general and should be able to learn a weight vector similar to that of the scaling approach. In the scaling approach, there is a single weight value associated with each of the classes, whereas the Crammer-Singer approach has, for each class, a whole weight vector whose dimension equals the number of features. During the training stage, if all the off-diagonal weights of the Crammer-Singer model are zero ($w_{i,j} = 0$ for all $i \neq j$), the model is equivalent to the scaling weight vector. Thus, we would expect the Crammer-Singer setting to fit the dataset much better during the training stage.

##### Use of Hierarchical Information

One of the key characteristics of remote homology detection and fold recognition is that the target classes are naturally organized in a hierarchical fashion.
This hierarchical organization is evident in the tree-structured organization of the various known protein structures that is produced by widely used protein structure classification schemes such as SCOP. In our study we use the SCOP classification database to define the remote homology prediction and fold recognition problems. SCOP organizes the proteins into four primary levels (class, fold, superfamily, and family) based on structure and sequence similarity. Within the SCOP classification, remote homology prediction corresponds to predicting the superfamily of a particular protein under the constraint that the protein is not similar to any of its descendant families, whereas fold recognition corresponds to predicting the fold (i.e., the second level of the hierarchy) under the constraint that the protein is not similar to any of its descendant superfamilies. These two constraints are important because if they are violated, then we are actually solving either the family or the remote homology prediction problem, respectively.

The question that arises is whether, and how, we can take advantage of the fact that the target classes (either superfamilies or folds) correspond to a level in a hierarchical classification scheme so as to improve the overall classification performance. The approach investigated in this study is primarily motivated by the different schemes presented above to combine the functional outputs of multiple one-versus-rest binary classifiers. A general way of doing this is to learn a binary one-versus-rest model for each, or a subset, of the nodes of the hierarchical classification scheme, and then combine these models using an approach similar to the CS scheme. Note that the output space of this model is still the $K_f$ possible folds, but the model combines information from the fold-level binary classifiers as well as from the binary classifiers for the superfamily- and class-level models.
In addition to CS-type models, the hierarchical information can also be used to build simpler models by combining selective subsets of binary classifiers. In our study we experimented with such models by focusing only on the subsets of nodes that are characteristic of each target class and are uniquely determined by it. Specifically, given a target class (i.e., superfamily or fold), the path starting from that node and moving upwards towards the root of the classification hierarchy uniquely identifies a set of nodes corresponding to higher-level classes containing the target class. For example, if the target class is a superfamily, this path will identify the superfamily itself, its corresponding fold, and its corresponding class in the SCOP hierarchy. In a similar fashion, we can use the scale & shift approach for every node in the hierarchical tree, which allows an extra shift parameter to be associated with each of the nodes being modeled. Note that similar approaches can be used to define models for fold recognition, where a weight vector is learned to combine the target fold-level node with its specific class-level node. A model can also be learned without considering all the levels along the path to the root of the tree.

The generic problem of classifying within the context of a hierarchical classification system has recently been studied by the machine learning community and a number of alternative approaches have been developed [34, 35, 36].

##### Implementation

We learn the weight vector by a cross-validation set-up on the training set using either the ranking perceptron [37] or the structured SVM algorithm [34], both of which work on the principles of large-margin discriminative classifiers. We also introduce the notion of loss functions that are optimized for the different integration methods. The exact training methodology, including the programs used for this study, is explained in the methods section.
##### Structured Output Spaces

The various models introduced for merging $K$ one-versus-rest binary classifiers can be expressed using a unified framework that was recently introduced for learning in structured output spaces [34, 37, 38, 39]. This framework [34] learns a discriminant function $F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ over input/output pairs, from which it derives predictions by maximizing $F$ over the response variable for a given input $x$. Hence, the general form of the hypothesis $h$ is

$$h(x; \theta) = \arg\max_{y \in \mathcal{Y}} F(x, y; \theta), \tag{9}$$

where $\theta$ denotes a parameter vector. $F$ is a $\theta$-parameterized family of functions designed such that $F(x, y; \theta)$ achieves its maximum value for the correct output $y$. Among the various choices for $F$, if we focus on those that are linear in a combined feature representation of inputs and outputs, $\Psi(x, y)$, then Equation 9 can be rewritten as [34]:

$$h(x; \theta) = \arg\max_{y \in \mathcal{Y}} \langle \theta, \Psi(x, y) \rangle. \tag{10}$$

The specific form of $\Psi$ depends on the nature of the problem, and it is this flexibility that allows us to represent the hypothesis spaces introduced for merging binary models in terms of Equation 10. [...]

Similarly, for the scale & shift approach (Equation 5), the $\Psi(x, y)$ function maps the $(x, y)$ pair onto a feature space of size $2K_f$, where the first $K_f$ dimensions are used to encode the scaling factors and the second $K_f$ dimensions are used to encode the shift factors. Specifically, given an example $x$ belonging to fold $i$, $\Psi(x, y)$ maps $(x, y)$ onto the vector whose $i$th entry is $f_i(x)$, whose $(K_f + i)$th entry is one, and whose remaining entries are zero. Then, from Equation 10 we have that

$$y^* = \arg\max_{i=1,\ldots,K_f} \left( \theta_i f_i(x) + \theta_{K_f + i} \right),$$

which is equivalent to Equation 5, with the first half of $\theta$ corresponding to the scale vector $w$ and the second half corresponding to the shift vector $s$.

Finally, in the case of the Crammer-Singer approach, the $\Psi(x, y)$ function maps $(x, y)$ onto a feature space of size $K_f \times K_f$.
Specifically, given a sequence $x$ belonging to fold $i$, $\Psi(x, y)$ maps $(x, y)$ onto the vector whose $K_f$ entries starting at $(i-1)K_f$ are set to $f(x)$ (i.e., the fold prediction outputs) and whose remaining $(K_f - 1)K_f$ entries are set to zero. Then, by rewriting Equation 10 in terms of the above combined input-output representation, we get

$$y^* = \arg\max_{i=1,\ldots,K_f} \sum_{j=1}^{K_f} \theta_{(i-1)K_f + j} \, f_j(x).$$

This is equivalent to Equation 6, as $\theta$ can be viewed as the matrix $W$ with $K_f$ rows and $K_f$ columns.

##### Ranking Perceptron

One way of learning $\theta$ in Equation 10 is to use the recently developed extension to Rosenblatt's linear perceptron classifier [40], called the ranking perceptron [37]. This is an online learning algorithm that iteratively updates $\theta$ for each training example that is misclassified according to Equation 10. For each misclassified example $x_i$, $\theta$ is updated by adding to it a multiple of $\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)$, where $\hat{y}_i$ is given by Equation 10 (i.e., the erroneously predicted class for $x_i$). This online learning framework is identical to that used in standard perceptron learning and is known to converge when the examples are linearly separable. However, this convergence property does not hold when the examples are not linearly separable.

For our study, we have extended the ranking perceptron algorithm to follow a large-margin classification principle whose goal is to learn a $\theta$ that tries to satisfy the following $m$ constraints (one for each of the training examples):

$$\forall i: \; \langle \theta, \Psi(x_i, y_i) \rangle - \max_{y \in \mathcal{Y} \setminus y_i} \langle \theta, \Psi(x_i, y) \rangle \ge \beta \|\theta\|_2, \tag{14}$$

where the required margin is given by $\beta \|\theta\|_2$ and $\beta$ is a user-specified constant. Note that the margin is expressed in terms of $\theta$'s length to ensure that the separation constraints are invariant to simple scaling transformations. The ranking perceptron algorithm was also used in [...].

Algorithm 1 shows our extended ranking perceptron algorithm that uses the constraints of Equation 14 to guide its online learning. The key steps in this algorithm are lines 8-10, which update $\theta$ based on the satisfaction or violation of the constraints for each one of the $m$ training instances.
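The joint feature maps and the margin-based update can be sketched as follows. This is an illustrative approximation rather than the paper's Algorithm 1 (which is not reproduced in this text): the learning rate `alpha`, the epoch-level patience counter, the choice of starting point `theta0`, and all function names are assumptions, and classes are indexed from 0.

```python
import numpy as np

def psi_scaling(f, y):
    """Scaling model: K entries, entry y holds f_y(x)."""
    v = np.zeros(len(f))
    v[y] = f[y]
    return v

def psi_scale_shift(f, y):
    """Scale & shift model: 2K entries, entry y holds f_y(x), entry K+y holds 1."""
    K = len(f)
    v = np.zeros(2 * K)
    v[y] = f[y]
    v[K + y] = 1.0
    return v

def psi_cs(f, y):
    """Crammer-Singer model: K*K entries, block y holds the whole output vector f(x)."""
    K = len(f)
    v = np.zeros(K * K)
    v[y * K:(y + 1) * K] = f
    return v

def class_scores(theta, f, psi):
    """Scores <theta, Psi(x, y)> for every candidate class y."""
    return np.array([theta @ psi(f, y) for y in range(len(f))])

def error_rate(theta, F, labels, psi):
    preds = np.array([int(np.argmax(class_scores(theta, f, psi))) for f in F])
    return float(np.mean(preds != labels))

def ranking_perceptron(F, labels, psi, theta0, beta=0.01, alpha=0.1,
                       patience=100, max_epochs=1000):
    """Margin-based ranking perceptron; returns the theta with the lowest training error."""
    theta = np.asarray(theta0, dtype=float).copy()
    best_theta, best_err = theta.copy(), error_rate(theta, F, labels, psi)
    stale = 0
    for _ in range(max_epochs):
        for f, y in zip(F, labels):
            s = class_scores(theta, f, psi)
            rivals = s.copy()
            rivals[y] = -np.inf
            y_hat = int(np.argmax(rivals))        # strongest competing class
            # Update when the margin constraint of Equation 14 is not strictly met.
            if s[y] - s[y_hat] <= beta * np.linalg.norm(theta):
                theta += alpha * (psi(f, y) - psi(f, y_hat))
        err = error_rate(theta, F, labels, psi)
        if err < best_err:
            best_theta, best_err, stale = theta.copy(), err, 0
        else:
            stale += 1
            if stale >= patience:                 # no improvement for `patience` epochs
                break
    return best_theta, best_err

# Hypothetical binary-classifier outputs for four training examples, K = 3.
F = np.array([[0.9, 1.0, -1.0], [0.2, 1.1, -0.8], [1.5, 0.3, -0.5], [-0.2, 0.1, -0.1]])
labels = np.array([0, 1, 0, 2])
theta0 = np.concatenate([np.ones(3), np.zeros(3)])   # w = 1, s = 0 reproduces MaxClassifier
base_err = error_rate(theta0, F, labels, psi_scale_shift)
theta, err = ranking_perceptron(F, labels, psi_scale_shift, theta0)
```

Starting from `w = 1, s = 0` means the first evaluated model is exactly the MaxClassifier, so the returned training error can never be worse than that baseline. Swapping in `psi_scaling` or `psi_cs` (with a `theta0` of matching length) switches the hypothesis space without touching the learner.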
Since the ranking perceptron algorithm is not guaranteed to converge when the examples are not linearly separable, Algorithm 1 incorporates an explicit stopping criterion: after each iteration it computes the training error rate of $\theta$, and it terminates when $\theta$'s error rate has not improved in 100 consecutive iterations. The algorithm returns the $\theta$ that achieved the lowest training error rate over all iterations.

##### SVM-Struct

Recently, an efficient way of learning the vector $\theta$ of Equation 10 has been formulated as a convex optimization problem [34]. In this approach $\theta$ is learned subject to the following $m$ nonlinear constraints:

$$\forall i: \; \max_{y \in \mathcal{Y} \setminus y_i} \langle \theta, \Psi(x_i, y) \rangle < \langle \theta, \Psi(x_i, y_i) \rangle. \tag{15}$$

This hard-margin problem can be converted to a soft-margin equivalent to allow errors in the training set. This is done by introducing a slack variable, $\xi_i$, for every nonlinear constraint of Equation 15. The soft-margin problem is expressed as [34]:

$$\min_{\theta, \xi} \; \frac{1}{2}\|\theta\|^2 + \frac{C}{m} \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad \forall i, \; \forall y \in \mathcal{Y} \setminus y_i: \; \langle \theta, \Psi(x_i, y_i) - \Psi(x_i, y) \rangle \ge 1 - \xi_i, \quad \xi_i \ge 0.$$

The results of classification depend on the value of $C$, the misclassification cost that determines the trade-off between the generalization capability of the model being learned and maximizing the margin. It needs to be optimized to prevent under-fitting and over-fitting the data during the training phase. Note that the SVM-Struct algorithm was also used in [...].

##### Loss Functions

The loss function plays a key role while learning $\theta$ in both the SVM-Struct and ranking perceptron optimizations. Until now, our discussion has focused on the zero-one loss, which assigns a penalty of one for a misclassification and zero for a correct prediction. However, in cases where the class sizes vary significantly across the different folds, such a zero-one loss function may not be the most appropriate as it [...]

The percent similarity between two sequences is computed by aligning the pair of sequences using SW-GSM with a gap opening of 5.0 and gap extension of 1.0. "Avg. Pairwise Similarity" is the average of all the pairwise percent identities, "Avg. Max.
Similarity" is the average of the maximum pairwise percent identity for each sequence, i.e., it measures the similarity to its most similar sequence. The "Avg. Pairwise Similarity (within folds)" and "Avg. Pairwise Similarity (outside folds)" are the averages of the average pairwise percent sequence similarity within the same fold and outside the fold for a given sequence.

While using the hierarchical information in the cascaded learning approaches, we experimented with a weighted loss function in which a larger penalty was assigned when the predicted label did not share the same ancestor as the true label than when the predicted and true class labels shared the same ancestors. This variation did not result in an improvement compared to the zero-one and balanced loss. Hence, we do not report results of using such hierarchical loss functions here.

##### Evaluation

The performance of various schemes in terms of zero-one and balanced error is summarized in [...]. These tables also show the performance achieved by incorporating different types of hierarchical information in the two-level learning framework. For remote homology prediction they present results that combine information from the ancestor nodes (fold and fold+class), whereas for fold recognition they present results that combine information from ancestor nodes (class), descendant nodes (superfamily), and their combination (superfamily+class).

##### Zero-one Versus Balanced Loss Function

The direct K-way and the two-level learning approaches can be trained using either the zero-one or the balanced loss functions (the MaxClassifier scheme does not explicitly optimize a loss function).
The zero-one loss function achieved consistently worse results than those achieved by the balanced loss function for both the remote homology detection (comparing [...]

##### Performance of Direct K-way Classifier

Comparing the direct K-way classifiers against the MaxClassifier approach, we see that the error rates achieved by the direct approach are smaller for both the remote homology detection and fold recognition problems. In many cases these improvements are substantial. For example, the direct K-way classifier achieves a 10.9% zero-one error rate for sf40 compared to a corresponding error rate of 21.0% achieved by MaxClassifier. In addition, contrary to the common belief that learning SVM-based direct multiclass classifiers is computationally very expensive, we found that the Crammer-Singer formulation that we used required time comparable to that required for building the various binary classifiers used by the MaxClassifier approach.

##### Non-Hierarchical Two-Level Learning Approaches

Analyzing the performance of the various two-level classifiers that do not use hierarchical information, we see that the scaling (S) and scale & shift (SS) schemes achieve better error rates than those achieved by the Crammer-Singer (CS) scheme. Since the hypothesis space of the CS scheme is a superset of the hypothesis spaces of the S and SS schemes, we found this result surprising at first. However, in analyzing the characteristics of the models that were learned, we noticed that the reason for this performance difference is that the CS scheme tended to overfit the data. This was evident from the fact that the CS scheme had lower error rates on the training set than either the S or SS schemes (results not reported here). Since CS's linear model has more parameters than the other two schemes, and the training set is the same size for all three and rather limited, such overfitting can easily occur.
Note that these observations regarding these three approaches hold for the two-level approaches that use hierarchical information as well.

Comparing the performance of the S and SS schemes against that of the direct K-way classifier, we see that the two-level schemes are somewhat worse for sf40 and fd25 and considerably better for sf95 and fd40. In addition, they are consistently and substantially better than the MaxClassifier approach across all four datasets.

##### SVM-Struct versus Ranking Perceptron

For the two-level approaches that do not use hierarchical information, [...]. This relative performance of the perceptron algorithm is both surprising and expected. The surprising aspect is that it is able to keep up with the considerably more sophisticated, mathematically rigorous, and computationally expensive optimizers used in SVM-Struct, which tend to converge to a local minimum solution that is close to the global minimum. However, this behavior, especially when the results of the CS scheme are taken into account, was expected because the hypothesis spaces of the S and SS schemes are rather small (the number of variables in the S and SS models are $K$ and $2K$, respectively) and as such the optimization problem is relatively easy. In the case of the CS scheme, which is parameterized by $K^2$ variables, the optimization problem becomes harder, and SVM-Struct's optimization framework is capable of finding a better solution. Due to this observation we did not pursue the ranking perceptron algorithm any further when we considered two-level models that incorporate hierarchy information.

##### Hierarchical Two-Level Learning Approaches

The results for remote homology prediction show that the use of hierarchical information does not improve the overall error rates.
The situation is different for fold recognition, in which the use of hierarchical information leads to some improvements for fd40, especially in terms of balanced error [...]. Even though the use of hierarchical information does not improve the overall classification accuracy, as the results in [...]

##### Comparison with Earlier Results

As discussed in the introduction, our research in this paper was motivated by the recent work of Ie et al. [19], in which they looked at the same problem of solving the K-way classification problem in the context of remote homology and fold recognition and presented a two-level learning approach based on the simple scaling model (S) with and without hierarchical information. These results show that the zero-one and balanced error rates of our algorithms are in most cases less than half of those achieved by the previous algorithms. This performance advantage can be attributed to three factors: (i) differences in the one-vs-rest binary classifiers ([19] used the profile kernel [14], whereas our schemes used the SW-PSSM kernel); (ii) our implementation of the ranking perceptron, which allows for a better specification of the classification margin; and (iii) a model selection step, described in detail in the methods section, that was used to optimize our results.

#### Discussion

The work described in this paper helps to answer three fundamental questions. First, whether or not SVM-based approaches that directly learn multiclass classification models can effectively and computationally efficiently solve the problems of remote homology prediction and fold recognition. Second, whether or not the recently developed highly accurate binary SVM-based one-vs-rest classifiers for remote homology prediction and fold recognition can be utilized to build an equally effective multiclass prediction scheme.
Third, whether or not the incorporation of binary SVM-based prediction models for coarser and/or finer levels of a typical protein structure hierarchical classification scheme can be used to improve the multiclass classification performance.

The experimental evaluation of a number of previously developed methods and methods introduced in the course of this work shows that, to a large extent, the answer to all three of these questions is yes. The Crammer-Singer-based direct K-way classifier is able to learn effective models in a reasonable amount of time. Its classification performance is better than that of MaxClassifier and comparable to that achieved by the two-level learning schemes in three out of the four datasets. The two-level learning framework shows the best overall results, consistently producing the lowest error rates. The performance of this framework greatly depends on the complexity of the hypothesis space used during the second-level learning. Complex hypothesis spaces (e.g., the one based on Crammer-Singer) tend to overfit the training dataset, whereas simpler spaces (e.g., scaling and scale & shift) produced better and more consistent results. We believe that this is a direct consequence of the limited training set size. However, since the size of the training set depends on the number of proteins with known 3D structure, this limitation is not expected to be removed in the near future.

The results shown in the table are optimized for the balanced loss function.

The use of hierarchical information further improves the performance of the two-level learning framework. Not only does it achieve somewhat lower zero-one and balanced error rates, but it also leads to a significant reduction in the number of prediction errors in which a test instance is assigned to a superfamily or fold that belongs to an entirely different fold or SCOP class from itself.
As a result, in the context of protein structure prediction via comparative modeling [41], we expect that structures built from the predictions of hierarchical two-level classifiers will lead to better models.

#### Conclusion

In this paper we presented various SVM-based algorithms for solving the K-way classification problem in the context of remote homology prediction and fold recognition. Our results show that direct K-way SVM-based formulations and algorithms based on the two-level learning paradigm are quite effective for solving these problems and achieve better results than those obtained by using a set of binary one-vs-rest SVM-based classifiers. Moreover, our results and analysis showed that the two-level schemes that incorporate predictions from binary models constructed for ancestral categories within the SCOP hierarchy tend not only to lead to lower error rates but also to reduce the number of errors in which a superfamily is assigned to an entirely different fold and a fold is predicted as being from a different SCOP class.

#### Methods

##### Dataset Description

We evaluated the performance of the various schemes on four datasets. The first dataset, referred to as sf95 (superfamily - 95%), was created by Ie et al. [19] to evaluate the performance of the multiclass classification algorithms that they developed. The other three datasets, referred to as sf40 (superfamily - 40%), fd25 (fold - 25%), and fd40 (fold - 40%), were created for this study and are available at the supplementary website. The sf95 dataset was derived from SCOP 1.65, whereas the other datasets were derived from SCOP 1.67. Datasets sf95 and sf40 are designed to evaluate the performance of remote homology prediction and were derived by taking only the domains with less than 95% and 40% pairwise sequence identity according to Astral [42], respectively.
This set of domains was further reduced by keeping only the domains belonging to folds that (i) contained at least three superfamilies and (ii) one of these superfamilies contained multiple families. For sf95, the resulting dataset contained 2115 domains organized in 25 folds and 47 superfamilies, whereas for sf40, the resulting dataset contained 1119 domains organized in 25 folds and 37 superfamilies.

Datasets fd25 and fd40 were designed to evaluate the performance of fold recognition and were derived by taking only the domains with less than 25% and 40% pairwise sequence identity, respectively. This set of domains was further reduced by keeping only the domains belonging to folds that (i) contained at least three superfamilies and (ii) at least three of these superfamilies contained more than three domains. For fd25, the resulting dataset contained 1294 domains organized in 25 folds and 137 superfamilies, whereas for fd40, the resulting dataset contained 1651 domains organized in 27 folds and 158 superfamilies.

The results for Ie et al. were obtained from the supplementary website for the work [...].

##### Binary Classifiers

The various one-versus-rest binary classifiers were constructed using SVMs. These classifiers used the recently developed [15] Smith-Waterman based profile kernel function (SW-PSSM), which has been shown to achieve the best reported results for remote homology prediction and fold recognition. The SW-PSSM kernel computes a local alignment between two protein sequences, in which the similarity between two sequence positions is determined using a PICASSO-like scoring function [43, 44] and a position-independent affine gap modeling scheme. The parameters for the affine gap model (i.e., gap-opening (go) and gap-extension (ge) costs) and the zero-shift (zs) for our base classifiers were set to go = 3.0, ge = 0.75, and zs = 1.5.
These values were determined in [...].

##### Direct K-way Classifier

The direct K-way classification models were built using the publicly available implementation of the algorithm from the authors [31]. To ensure that the schemes are compared fairly, we use the same SW-PSSM kernel function used by the binary SVM classifiers. We tested the direct K-way classifiers using linear kernel functions as well, but the performance of the SW-PSSM kernel was substantially better.

##### Performance Assessment Measures

The performance of the classification algorithms was assessed using the zero-one (ZE) and the balanced error rate (BE) [...]. In addition, the performance of the various classifiers was evaluated using the previously established approach for evaluating fold recognition methods introduced in [46, 47], which does not penalize certain types of misclassifications. For each test instance, this scheme ranks the various classes from the most to the least likely, and a test instance is considered to be correctly classified if its true class is among the highest-ranked $n$ classes (i.e., top$_n$). The classes in the ranked list that are within the same next higher-level SCOP ancestral class are ignored and do not count towards determining the highest-ranked $n$ classes. That is, in the case of fold recognition, the folds that are part of the same SCOP class as the test instance are ignored and do not count in determining the $n$ highest-ranked predictions. Similarly, in the case of remote homology detection, this scheme ignores the superfamilies that are part of the same SCOP fold as the test sequence. Using a small value for $n$ that is greater than one, this measure assesses the ability of a classifier to find the correct class among its highest-ranked predictions, and by penalizing only the substantially wrong mispredictions (i.e., different SCOP classes or folds), it can assess the severity of the misclassifications of the different schemes.
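The assessment measures can be sketched as below. The balanced error is taken here to be the unweighted mean of the per-class error rates, which is the usual definition but is an assumption as far as this text goes. The top-n measure follows the description above by skipping every class, other than the true one, that shares the true class's parent in the hierarchy. The function names, the `parent` array encoding, and the toy scores are hypothetical.

```python
import numpy as np

def zero_one_error(y_true, y_pred):
    """Fraction of misclassified instances."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

def balanced_error(y_true, y_pred):
    """Mean of the per-class error rates, so small classes count as much as large ones."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] != c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

def top_n_error(scores, y_true, parent, n):
    """Top-n error that ignores classes sharing the true class's parent node.

    scores: (m, K) array of class scores; parent[k] is the ancestor id of class k.
    """
    scores = np.asarray(scores, dtype=float)
    parent = np.asarray(parent)
    errors = 0
    for s, y in zip(scores, y_true):
        siblings = parent == parent[y]
        siblings[y] = False                       # never ignore the true class itself
        # Count the non-sibling classes ranked strictly above the true class.
        better = int(np.sum((s > s[y]) & ~siblings))
        errors += int(better >= n)
    return errors / len(y_true)

# Toy example: classes 0 and 1 share a parent; class 2 sits under a different one.
scores = np.array([[0.9, 0.8, 0.1],
                   [0.2, 0.1, 0.7]])
y_true = np.array([1, 0])
parent = np.array([0, 0, 1])
top1 = top_n_error(scores, y_true, parent, n=1)
top3 = top_n_error(scores, y_true, parent, n=3)
ze = zero_one_error([0, 0, 0, 1], [0, 0, 0, 0])
be = balanced_error([0, 0, 0, 1], [0, 0, 0, 0])
```

In the toy data the first instance is outranked only by a sibling class, so it counts as correct at n = 1, while the second is outranked by a class from a different parent and counts as an error. The last two lines show how a classifier that ignores a rare class looks acceptable under zero-one error yet poor under balanced error.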
In our experiments we computed the error rates for n = 1 and n = 3.

#### Training Methodology

For each dataset we separated the proteins into test and training sets, ensuring that the test set is never used during any part of the learning phase. For sf95 and sf40 (fd25 and fd40), the test set is constructed by selecting from each superfamily (fold) all the sequences that are part of one family (superfamily). Thus, during training, the dataset does not contain any sequences that are homologous (remotely homologous) to the sequences in the test set, which allows us to assess remote homology prediction (fold recognition) performance. This is a standard protocol for evaluating remote homology detection and fold recognition and has been used in a number of earlier studies.

The models for the two-level approach can be learned in three phases by first splitting the training set into two sets, one for learning the first-level model and the other for learning the second-level model. In the first phase, the k one-vs-rest binary classifiers are trained using the training set for the first level. In the second phase, each of these k classifiers is used to predict the training set for the second level. Finally, in the third phase, the second-level model is trained using these predictions. However, due to the limited size of the available training set, we followed a different approach that does not require us to split the training set into two sets. This approach was motivated by the cross-validation methodology and is similar to that used in

#### Model Selection

The performance of the SVM depends on the parameter that controls the trade-off between the margin and the misclassification cost (the "C" parameter in SVM-Struct), whereas the performance of the ranking perceptron depends on the parameter in Algorithm 1. We therefore perform a model selection, or parameter selection, step.
To perform this exercise fairly, we split our test set into two equal halves of similar distributions, namely sets A and B. Using set A, we vary the controlling parameters and select the best-performing model for set A. We use this selected model and compute the accuracy for set B.
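The model selection step on the test-set halves A and B can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' procedure: the helper names are hypothetical, `models` is assumed to map each candidate parameter value (for example a value of C) to a predictor already trained on the training set, and "similar distributions" is approximated by alternating the instances of each class between the two halves.

```python
from collections import defaultdict


def stratified_halves(labels):
    """Split instance indices into halves A and B with similar class mix.

    Instances of each class are dealt alternately to A and B, so a class
    with an odd number of instances gives its extra instance to A.
    """
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    half_a, half_b = [], []
    for idxs in by_class.values():
        half_a.extend(idxs[0::2])
        half_b.extend(idxs[1::2])
    return sorted(half_a), sorted(half_b)


def accuracy(predict, X, y, idxs):
    """Fraction of the instances in idxs that predict() labels correctly."""
    return sum(predict(X[i]) == y[i] for i in idxs) / len(idxs)


def select_on_a_report_on_b(models, X, y):
    """Pick the parameter that does best on half A, then score it on half B."""
    half_a, half_b = stratified_halves(y)
    best = max(models, key=lambda p: accuracy(models[p], X, y, half_a))
    return best, accuracy(models[best], X, y, half_b)


# Hypothetical toy data: three stand-in "models" keyed by a parameter value.
X = list(range(8))
y = [0, 0, 0, 0, 1, 1, 1, 1]
models = {
    0.1: lambda x: 0,
    1.0: lambda x: 0 if x < 4 else 1,
    10.0: lambda x: 1,
}
best_param, acc_b = select_on_a_report_on_b(models, X, y)
print(best_param, acc_b)
```

Keeping half B out of the parameter search means the reported accuracy is not biased by the choice of parameter, which is the point of splitting the test set in the first place.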