Results 1–10 of 87
Mismatch string kernels for discriminative protein classification
Bioinformatics, 2004
Cited by 131 (8 self)
Motivation: Classification of protein sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns. Results: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus, the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP datasets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies. Availability: SVM software is publicly available at
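To make the construction concrete, here is a minimal, naive sketch of the (k, m)-mismatch feature map in Python (the toy two-letter alphabet and brute-force neighborhood enumeration are assumptions for illustration; the paper computes the same kernel far more efficiently with a mismatch tree data structure):

```python
from itertools import product
from collections import Counter

def mismatch_neighborhood(kmer, alphabet, m):
    """All k-mers over `alphabet` within Hamming distance m of `kmer`."""
    return [cand for cand in map("".join, product(alphabet, repeat=len(kmer)))
            if sum(a != b for a, b in zip(kmer, cand)) <= m]

def mismatch_features(seq, k, m, alphabet):
    """Feature vector Phi(seq): each k-mer in seq increments the count of
    every k-mer in its (k, m)-mismatch neighborhood."""
    phi = Counter()
    for i in range(len(seq) - k + 1):
        for beta in mismatch_neighborhood(seq[i:i + k], alphabet, m):
            phi[beta] += 1
    return phi

def mismatch_kernel(x, y, k=3, m=1, alphabet="ACGT"):
    """K(x, y) = <Phi(x), Phi(y)> for the (k, m)-mismatch kernel."""
    px = mismatch_features(x, k, m, alphabet)
    py = mismatch_features(y, k, m, alphabet)
    return sum(px[w] * py[w] for w in px)
```

For m = 0 this reduces to the exact k-mer spectrum kernel; increasing m credits shared patterns that differ by up to m mutations.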
Profile-based string kernels for remote homology detection and motif extraction
Journal of Bioinformatics and Computational Biology, 2004
Cited by 71 (8 self)
We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences (“k-mers”) in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained: the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with
Semi-supervised protein classification using cluster kernels
Advances in Neural Information Processing Systems 16, 2004
Fast string kernels using inexact matching for protein sequences
Journal of Machine Learning Research, 2004
Cited by 42 (0 self)
We describe several families of k-mer-based string kernels related to the recently presented mismatch kernel and designed for use with support vector machines (SVMs) for classification of protein sequence data. These new kernels – restricted gappy kernels, substitution kernels, and wildcard kernels – are based on feature spaces indexed by k-length subsequences (“k-mers”) from the string alphabet Σ. However, for all kernels we define here, the kernel value K(x, y) can be computed in O(c_K(|x| + |y|)) time, where the constant c_K depends on the parameters of the kernel but is independent of the size |Σ| of the alphabet. Thus the computation of these kernels is linear in the length of the sequences, like the mismatch kernel, but we improve upon the parameter-dependent constant c_K = k^(m+1)|Σ|^m of the (k, m)-mismatch kernel. We compute the kernels efficiently using a trie data structure and relate our new kernels to the recently described transducer formalism. In protein classification experiments on two benchmark SCOP data sets, we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models, and we investigate the dependence of the kernels on parameter choice.
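As an illustration of why these variants escape the dependence on |Σ|, here is a naive sketch of one of them, the wildcard kernel (the function names and toy parameters are assumptions for illustration; the paper computes the kernel with a trie rather than by materializing feature vectors). Each k-mer matches every pattern obtained by replacing at most m of its positions with a wildcard '*', weighted by λ per wildcard, so the number of features touched per k-mer depends only on k and m, not on the alphabet size:

```python
from itertools import combinations
from collections import Counter

def wildcard_features(seq, k, m, lam=1.0):
    """Phi(seq) over k-mer patterns with at most m wildcard '*' positions,
    each occurrence weighted by lam ** (number of wildcards)."""
    phi = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        for n_wild in range(m + 1):
            for pos in combinations(range(k), n_wild):
                pattern = "".join("*" if j in pos else c
                                  for j, c in enumerate(kmer))
                phi[pattern] += lam ** n_wild
    return phi

def wildcard_kernel(x, y, k=3, m=1, lam=0.5):
    """K(x, y) = <Phi(x), Phi(y)> for the (k, m)-wildcard kernel."""
    px, py = wildcard_features(x, k, m, lam), wildcard_features(y, k, m, lam)
    return sum(px[w] * py[w] for w in px)
```

With m = 0 this again collapses to the exact k-mer spectrum kernel, since no wildcard positions are allowed.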
On a theory of learning with similarity functions
In International Conference on Machine Learning, 2006
Cited by 38 (9 self)
Kernel functions have become an extremely popular tool in machine learning, with an attractive theory as well. This theory views a kernel as implicitly mapping data points into a possibly very high dimensional space, and describes a kernel function as being good for a given learning problem if data is separable by a large margin in that implicit space. However, while quite elegant, this theory does not necessarily correspond to the intuition of a good kernel as a good measure of similarity, and the underlying margin in the implicit space usually is not apparent in “natural” representations of the data. Therefore, it may be difficult for a domain expert to use the theory to help design an appropriate kernel for the learning task at hand. Moreover, the requirement of positive semidefiniteness may rule out the most natural pairwise similarity functions for the given problem domain. In this work we develop an alternative, more general theory of learning with similarity functions (i.e., sufficient conditions for a similarity function to allow one to learn well) that does not require reference to implicit spaces, and does not require the function to be positive semidefinite (or even symmetric). Instead, our theory talks in terms of more direct properties of how the function behaves as a similarity measure. Our results also generalize the standard theory in the sense that any good kernel function under the usual definition can be shown to also be a good similarity function under our definition (though with some loss in the parameters). In this way, we provide the first steps towards a theory of kernels and more general similarity functions that describes the effectiveness of a given function in terms of natural similarity-based properties.
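The constructive side of this theory is easy to sketch: given a similarity function, map each example to its vector of similarities to a set of landmark examples, then run any standard linear learner in that space. A minimal sketch, with a hypothetical similarity measure chosen only for illustration:

```python
import numpy as np

def similarity_features(points, landmarks, sim):
    """Map each example x to the vector (sim(x, z_1), ..., sim(x, z_d))
    over landmark examples z_i; any standard linear learner can then be
    trained on these feature vectors."""
    return np.array([[sim(x, z) for z in landmarks] for x in points])

# Hypothetical similarity measure for illustration: nothing in the
# construction requires it to be positive semidefinite, or even symmetric.
def sim(a, b):
    return float(np.exp(-abs(a - b)))
```

For example, mapping the points [0.0, 3.0] against themselves as landmarks yields rows (1.0, e^-3) and (e^-3, 1.0), which a linear separator can then work with directly.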
A Kernel Approach for Learning From Almost Orthogonal Patterns, 2002
Cited by 30 (4 self)
In kernel methods, all the information about the training data is contained in the Gram matrix. If this matrix has large diagonal values, which arises for many types of kernels, then kernel methods do not perform well. We propose and test several methods for dealing with this problem by reducing the dynamic range of the matrix while preserving the positive definiteness of the Hessian of the quadratic programming problem that one has to solve when training a Support Vector Machine.
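One simple way to realize this idea (an illustrative sketch, not necessarily the exact transform studied in the paper) is to shrink the Gram matrix entries with a sign-preserving sub-polynomial power and then shift the spectrum just enough to restore positive semidefiniteness:

```python
import numpy as np

def reduce_dynamic_range(K, p=0.5, eps=1e-8):
    """Shrink large Gram-matrix entries relative to small ones with a
    sign-preserving sub-polynomial power, then add a ridge so the result
    stays positive semidefinite."""
    A = np.sign(K) * np.abs(K) ** p
    A = 0.5 * (A + A.T)                         # keep the matrix symmetric
    lam_min = np.linalg.eigvalsh(A)[0]          # smallest eigenvalue
    if lam_min < 0:
        A += (-lam_min + eps) * np.eye(len(A))  # spectral shift restores PSD
    return A
```

On a 2x2 matrix with diagonal 10 and off-diagonal 1, p = 0.5 shrinks the diagonal-to-off-diagonal ratio from 10 to about 3.2 while the matrix remains positive semidefinite.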
Fast kernels for inexact string matching
Sixteenth Annual Conference on Learning Theory and Seventh Kernel Workshop, 2003
Cited by 27 (7 self)
We introduce several new families of string kernels designed in particular for use with support vector machines (SVMs) for classification of protein sequence data. These kernels – restricted gappy kernels, substitution kernels, and wildcard kernels – are based on feature spaces indexed by k-length subsequences from the string alphabet Σ (or the alphabet augmented by a wildcard character), and hence they are related to the recently presented (k, m)-mismatch kernel and string kernels used in text classification. However, for all kernels we define here, the kernel value K(x, y) can be computed in O(c_K(|x| + |y|)) time, where the constant c_K depends on the parameters of the kernel but is independent of the size |Σ| of the alphabet. Thus the computation of these kernels is linear in the length of the sequences, like the mismatch kernel, but we improve upon the parameter-dependent constant c_K = k^(m+1)|Σ|^m of the mismatch kernel. We compute the kernels efficiently using a recursive function based on a trie data structure and relate our new kernels to the recently described transducer formalism. Finally, we report protein classification experiments on a benchmark SCOP dataset, where we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models.
Efficient Remote Homology Detection Using Local Structure
Bioinformatics, 2003
Cited by 24 (2 self)
Motivation: The function of an unknown biological sequence can often be accurately inferred if we are able to map this unknown sequence to its corresponding homologous family. At present, discriminative methods such as SVM-Fisher and SVM-pairwise, which combine support vector machines and sequence similarity, are recognized as the most accurate methods, with SVM-pairwise being the most accurate. However, these methods typically encode sequence information into their feature vectors and ignore the structure information. They are also computationally inefficient. Based on these observations, we present an alternative method for SVM-based protein classification. Our proposed method, SVM-I-sites, utilizes structure similarity for remote homology detection. Result:
Similarity-based Classification: Concepts and Algorithms, 2008
Cited by 24 (2 self)
This report reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and the most comprehensive set of experimental results to date. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning.