Results 1 - 10
of
58
Mismatch string kernels for discriminative protein classification
- Bioinformatics
, 2004
"... Motivation Classification of proteins sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training an ..."
Abstract
-
Cited by 90 (7 self)
- Add to MetaCart
Motivation Classification of proteins sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns. Results We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP data sets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highestweighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies. Availability SVM software is publically available at
Profile-based string kernels for remote homology detection and motif extraction
- Journal of Bioinformatics and Computational Biology
, 2004
"... We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation nei ..."
Abstract
-
Cited by 57 (7 self)
- Add to MetaCart
We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences (“k-mers”) in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with
Semi-supervised protein classification using cluster kernels
, 2003
"... A key issue in supervised protein classification is the representation of input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However,
such representations are based only on labeled data — examples with known 3D ..."
Abstract
-
Cited by 36 (3 self)
- Add to MetaCart
A key issue in supervised protein classification is the representation of input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However,
such representations are based only on labeled data — examples with known 3D structures, organized into structural classes — while in practice, unlabeled data is far more plentiful. In this work, we develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences. We show that our methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data. We achieve equal or superior performance to previously presented cluster kernel methods while achieving far greater computational efficiency.
Fast string kernels using inexact matching for protein sequences
- Journal of Machine Learning Research
, 2004
"... We describe several families of k-mer based string kernels related to the recently presented mismatch kernel and designed for use with support vector machines (SVMs) for classification of protein sequence data. These new kernels – restricted gappy kernels, substitution kernels, and wildcard kernels ..."
Abstract
-
Cited by 28 (0 self)
- Add to MetaCart
We describe several families of k-mer based string kernels related to the recently presented mismatch kernel and designed for use with support vector machines (SVMs) for classification of protein sequence data. These new kernels – restricted gappy kernels, substitution kernels, and wildcard kernels – are based on feature spaces indexed by k-length subsequences (“k-mers”) from the string alphabet Σ. However, for all kernels we define here, the kernel value K(x,y) can be computed in O(cK(|x|+|y|)) time, where the constant cK depends on the parameters of the kernel but is independent of the size |Σ | of the alphabet. Thus the computation of these kernels is linear in the length of the sequences, like the mismatch kernel, but we improve upon the parameter-dependent constant cK = k m+1 |Σ | m of the (k,m)-mismatch kernel. We compute the kernels efficiently using a trie data structure and relate our new kernels to the recently described transducer formalism. In protein classification experiments on two benchmark SCOP data sets, we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models, and we investigate the dependence of the kernels on parameter choice.
A Kernel Approach for Learning From Almost Orthogonal Patterns
, 2002
"... In kernel methods, all the information about the training data is contained in the Gram matrix. If this matrix has large diagonal values, which arises for many types of kernels, then kernel methods do not perform well. We propose and test several methods for dealing with this problem by reducing the ..."
Abstract
-
Cited by 26 (3 self)
- Add to MetaCart
In kernel methods, all the information about the training data is contained in the Gram matrix. If this matrix has large diagonal values, which arises for many types of kernels, then kernel methods do not perform well. We propose and test several methods for dealing with this problem by reducing the dynamic range of the matrix while preserving the positive definiteness of the Hessian of the quadratic programming problem that one has to solve when training a Support Vector Machine.
On a theory of learning with similarity functions
- In International Conference on Machine Learning
, 2006
"... Abstract. Kernel functions have become an extremely popular tool in machine learning, with an attractive theory as well. This theory views a kernel as implicitly mapping data points into a possibly very high dimensional space, and describes a kernel function as being good for a given learning proble ..."
Abstract
-
Cited by 24 (8 self)
- Add to MetaCart
Abstract. Kernel functions have become an extremely popular tool in machine learning, with an attractive theory as well. This theory views a kernel as implicitly mapping data points into a possibly very high dimensional space, and describes a kernel function as being good for a given learning problem if data is separable by a large margin in that implicit space. However, while quite elegant, this theory does not necessarily correspond to the intuition of a good kernel as a good measure of similarity, and the underlying margin in the implicit space usually is not apparent in “natural ” representations of the data. Therefore, it may be difficult for a domain expert to use the theory to help design an appropriate kernel for the learning task at hand. Moreover, the requirement of positive semi-definiteness may rule out the most natural pairwise similarity functions for the given problem domain. In this work we develop an alternative, more general theory of learning with similarity functions (i.e., sufficient conditions for a similarity function to allow one to learn well) that does not require reference to implicit spaces, and does not require the function to be positive semi-definite (or even symmetric). Instead, our theory talks in terms of more direct properties of how the function behaves as a similarity measure. Our results also generalize the standard theory in the sense that any good kernel function under the usual definition can be shown to also be a good similarity function under our definition (though with some loss in the parameters). In this way, we provide the first steps towards a theory of kernels and more general similarity functions that describes the effectiveness of a given function in terms of natural similarity-based properties. 1
Fast kernels for inexact string matching
- Sixteenth Annual Conference on Learning Theory and Seventh Kernel Workshop
, 2003
"... Abstract. We introduce several new families of string kernels designed in particular for use with support vector machines (SVMs) for classification of protein sequence data. These kernels – restricted gappy kernels, substitution kernels, and wildcard kernels – are based on feature spaces indexed by ..."
Abstract
-
Cited by 20 (5 self)
- Add to MetaCart
Abstract. We introduce several new families of string kernels designed in particular for use with support vector machines (SVMs) for classification of protein sequence data. These kernels – restricted gappy kernels, substitution kernels, and wildcard kernels – are based on feature spaces indexed by k-length subsequences from the string alphabet Σ (or the alphabet augmented by a wildcard character), and hence they are related to the recently presented (k, m)-mismatch kernel and string kernels used in text classification. However, for all kernels we define here, the kernel value K(x, y) can be computed in O(cK(|x | + |y|)) time, where the constant cK depends on the parameters of the kernel but is independent of the size |Σ | of the alphabet. Thus the computation of these kernels is linear in the length of the sequences, like the mismatch kernel, but we improve upon the parameterdependent constant cK = k m+1 |Σ | m of the mismatch kernel. We compute the kernels efficiently using a recursive function based on a trie data structure and relate our new kernels to the recently described transducer formalism. Finally, we report protein classification experiments on a benchmark SCOP dataset, where we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models.
Efficient Remote Homology Detection Using Local Structure
- BIOINFORMATICS
, 2003
"... Motivation: The function of an unknown biological sequence can often be accurately inferred if we are able to map this unknown sequence to its corresponding homologous family. At present, discriminative methods such as SVM-Fisher and SVM-pairwise, which combine support vector machine and sequence si ..."
Abstract
-
Cited by 18 (2 self)
- Add to MetaCart
Motivation: The function of an unknown biological sequence can often be accurately inferred if we are able to map this unknown sequence to its corresponding homologous family. At present, discriminative methods such as SVM-Fisher and SVM-pairwise, which combine support vector machine and sequence similarity, are recognized as the most accurate methods, with SVM-pairwise being the most accurate. However, these methods typically encode sequence information into their feature vectors and ignore the structure information. They are also computationally inefficient. Based on these observations, we present an alternative method for SVM-based protein classification. Our proposed method, SVM-I-sites, utilizes structure similarity for remote homology detection. Result:
Remote homolog detection using local sequence-structure correlations
- Proteins
, 2004
"... ABSTRACT Remote homology detection refers to the detection of structural homology in proteins when there is little or no sequence similarity. In this article, we present a remote homolog detection method called SVM-HMMSTR that overcomes the reliance on detectable sequence similarity by transforming ..."
Abstract
-
Cited by 15 (1 self)
- Add to MetaCart
ABSTRACT Remote homology detection refers to the detection of structural homology in proteins when there is little or no sequence similarity. In this article, we present a remote homolog detection method called SVM-HMMSTR that overcomes the reliance on detectable sequence similarity by transforming the sequences into strings of hidden Markov states that represent local folding motif patterns. These state strings are transformed into fixeddimension feature vectors for input to a support vector machine. Two sets of features are defined: an order-independent feature set that captures the amino acid and local structure composition; and an order-dependent feature set that captures the sequential ordering of the local structures. Tests using the Structural Classification of Proteins (SCOP) 1.53 data set show that the SVM-HMMSTR gives a significant improvement over several current methods. Proteins 2004;57:518–530. © 2004 Wiley-Liss, Inc. Key words: remote homology; local structure; support vector machines; hidden Markov model; protein folding; I-sites; HMMSTR
FSSA: a novel method for identifying functional signatures from structural alignments
- Bioinformatics
, 2005
"... doi:10.1093/bioinformatics/bti471 ..."

