## Mismatch string kernels for discriminative protein classification (2004)

Venue: Bioinformatics

Citations: 152 (8 self)

### BibTeX

@ARTICLE{Leslie04mismatchstring,

author = {Christina Leslie and Eleazar Eskin and Adiel Cohen and Jason Weston and William Stafford Noble},

title = {Mismatch string kernels for discriminative protein classification},

journal = {Bioinformatics},

year = {2004},

volume = {20},

pages = {467--476}

}

### Abstract

Motivation: Classification of protein sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns.

Results: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP data sets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies.

Availability: SVM software is publicly available at
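
To make the idea of "shared occurrences of fixed-length patterns, allowing for mutations" concrete, here is a minimal brute-force sketch in Python. It assumes the usual (k, m)-mismatch feature map, in which each k-mer of a sequence adds a count of 1 to every k-mer within m mismatches of it, and takes the kernel value as the inner product of the two feature vectors. It does not use the mismatch tree mentioned in the abstract, so it is far slower than the authors' approach; the function names and the `AMINO_ACIDS` constant are illustrative choices, not taken from the paper's software.

```python
from collections import Counter
from itertools import combinations, product

# Assumed alphabet: the 20 standard amino acids (illustrative constant).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mismatch_neighborhood(kmer, m, alphabet=AMINO_ACIDS):
    """Return the set of all k-mers that differ from `kmer` in at most m positions."""
    neighbors = {kmer}
    for d in range(1, m + 1):
        # Choose d positions to alter, then try every letter at those positions.
        for positions in combinations(range(len(kmer)), d):
            for letters in product(alphabet, repeat=d):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                neighbors.add("".join(candidate))
    return neighbors


def mismatch_features(seq, k, m, alphabet=AMINO_ACIDS):
    """Sparse feature vector: each k-mer in `seq` counts once for every neighbor."""
    features = Counter()
    for i in range(len(seq) - k + 1):
        for beta in mismatch_neighborhood(seq[i:i + k], m, alphabet):
            features[beta] += 1
    return features


def mismatch_kernel(x, y, k, m, alphabet=AMINO_ACIDS):
    """Inner product of the (k, m)-mismatch feature vectors of x and y."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    if len(fx) > len(fy):
        fx, fy = fy, fx  # iterate over the smaller vector
    return sum(count * fy[beta] for beta, count in fx.items())
```

Under these assumptions, `mismatch_kernel("AB", "BA", k=2, m=1, alphabet="AB")` evaluates to 2, because the two mismatch neighborhoods share the patterns AA and BB, whereas with `m=0` (exact matching only) the value is 0.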

### Citations

10096 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ...gorithm attempts to learn a decision boundary between the different classes. In this category, several successful techniques [15, 19, 17] use protein sequences to train a support vector machine (SVM) [26, 8] classifier. In this paper, we present a method for using SVMs for remote homology detection, based on a family of kernel functions called mismatch kernels. A kernel function measures the similarity b...

5820 | Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res
- Altschul, Madden, et al.
- 1997
Citation Context ... functional and structural families [∗ Formerly William Noble Grundy, see www.gs.columbia.edu/~noble/name-change.html] based on sequence homology. Approaches based on pairwise similarity of sequences [27, 1, 2], profiles for protein families [11], consensus patterns using motifs [4, 3] and hidden Markov models [16, 9, 5] have all been used for this problem. Recent research suggests that the best-performing ...

5344 | Basic local alignment search tool
- Altschul, Gish, et al.
- 1990
Citation Context ... functional and structural families based on sequence homology. Approaches based on pairwise similarity of sequences [27, 1, 2], profiles for protein families [11], consensus patterns using motifs [4, 3] and hidden Markov models [16, 9, 5] have all been used for this problem. Recent research suggests that the best-performing ...

1442 | A training algorithm for optimal margin classifiers
- Boser
- 1992
Citation Context ...ethod, we can recover conservation information as an output of our training. 2 SVMs and Kernels Support Vector Machines (SVMs) are a class of supervised learning algorithms first introduced by Vapnik [6, 26]. Given a set of labeled training vectors (x_i, y_i), i = 1 ... m, where the x_i are real vectors and the y_i are ±1, training an SVM amounts to solving an optimization problem that determines a linear clas...

1171 | SCOP: a structural classification of proteins database for the investigation of sequences and structures
- Murzin
- 1995
Citation Context ...er, when mismatch kernels are used with SVMs, we can implement the classification to make linear-time prediction on test sequences. We report results for two sets of experiments over the SCOP database [21]. In the first set of experiments, we test our method on the benchmark dataset assembled by [15], where SCOP sequences are augmented by domain homologs of positive training sequences in order to assis...

1065 | An Introduction to Support Vector Machines
- Cristianini, Shawe-Taylor
- 2000
Citation Context ...gorithm attempts to learn a decision boundary between the different classes. In this category, several successful techniques [15, 19, 17] use protein sequences to train a support vector machine (SVM) [26, 8] classifier. In this paper, we present a method for using SVMs for remote homology detection, based on a family of kernel functions called mismatch kernels. A kernel function measures the similarity b...

970 | Algorithms on strings, trees and sequences
- Gusfield
- 1997
Citation Context ...we directly and efficiently compute the kernel matrix using a data structure called a mismatch tree. Mismatch tree data structure The mismatch tree data structure is similar to a trie or suffix tree (Gusfield, 1997). We use the mismatch tree to represent the feature space (the set of all k-mers) and to organize a lexical traversal of all instances of k-mers that occur (with mismatches) in the data. The entire k...

568 | Hidden Markov models in computational biology. Applications to protein modeling
- Krogh, Brown, et al.
- 1994
Citation Context ...milies based on sequence homology. Approaches based on pairwise similarity of sequences [27, 1, 2], profiles for protein families [11], consensus patterns using motifs [4, 3] and hidden Markov models [16, 9, 5] have all been used for this problem. Recent research suggests that the best-performing methods are discriminative: protein sequences are seen as a set of labeled examples — positive if they are in th...

433 | Algorithms on strings, trees, and sequences: computer science and computational biology
- Gusfield
- 1997
Citation Context ...ng the path AL. Many of the instances reach the maximum number of allowed mismatches and are not passed down to their child nodes. The mismatch tree data structure is similar to a trie or suffix tree [12]. We use the mismatch tree to represent the feature space (the set of all k-mers) and to organize a lexical traversal of all instances of k-mers that occur (with mismatches) in the data. The entire ke...

413 | Text classification using string kernels
- Lodhi, Shawe-Taylor, et al.
- 2000
Citation Context ...iency. Recently, Chris Watkins [28] and David Haussler [13] have defined a set of kernel functions over strings, and one of these string kernels has been implemented for a text classification problem [20]. However, the cost of computing each kernel entry is O(n^2) in the length of the input sequences, making them too slow for most biological applications. The (k, m)-mismatch kernel for a pair of leng...

412 | Convolution Kernels on Discrete Structures
- Haussler
- 1999
Citation Context ...ubsequences in remotely related protein sequences. An important characteristic of any protein classification algorithm is its computational efficiency. Recently, Chris Watkins [28] and David Haussler [13] have defined a set of kernel functions over strings, and one of these string kernels has been implemented for a text classification problem [20]. However, the cost of computing each kernel entry is O...

316 | The spectrum kernel: A string kernel for SVM protein classification
- Leslie
- 2002
Citation Context ...sitive if they are in the family and negative otherwise — and a learning algorithm attempts to learn a decision boundary between the different classes. In this category, several successful techniques [15, 19, 17] use protein sequences to train a support vector machine (SVM) [26, 8] classifier. In this paper, we present a method for using SVMs for remote homology detection, based on a family of kernel function...

308 | Profile analysis: detection of distantly related proteins
- Gribskov, McLachlan, et al.
- 1987
Citation Context ...tural families based on sequence homology. Approaches based on pairwise similarity of sequences [27, 1, 2], profiles for protein families [11], consensus patterns using motifs [4, 3] and hidden Markov models [16, 9, 5] have all been used for this problem. Recent research suggests that the best-performing methods are discriminative: protein ...

212 | A discriminative framework for detecting remote protein homologies
- Jaakkola, Diekhans, et al.
- 1999
Citation Context ...sitive if they are in the family and negative otherwise — and a learning algorithm attempts to learn a decision boundary between the different classes. In this category, several successful techniques [15, 19, 17] use protein sequences to train a support vector machine (SVM) [26, 8] classifier. In this paper, we present a method for using SVMs for remote homology detection, based on a family of kernel function...

197 | The PROSITE database, its status
- Bairoch, Bucher, et al.
- 1997
Citation Context ...tural families based on sequence homology. Approaches based on pairwise similarity of sequences [27, 1, 2], profiles for protein families [11], consensus patterns using motifs [4, 3] and hidden Markov models [16, 9, 5] have all been used for this problem. Recent research suggests that the best-performing methods are discriminative: protein sequences are seen as a set of labeled e...

173 | On comparing classifiers: pitfalls to avoid and a recommended approach
- Salzberg
- 1997
Citation Context ... one homology detection method. Qualitatively, the curves for SVM-Fisher and mismatch-SVM are quite similar. When we compare the overall performance of two methods using a two-tailed signed rank test [14, 25], we find that almost none of the differences between methods are statistically significant. Using a p-value threshold of 0.05 and including a Bonferroni adjustment to account for multiple comparisons...

166 | Hidden Markov models of biological primary sequence information
- Baldi, Chauvin, et al.
- 1994
Citation Context ...milies based on sequence homology. Approaches based on pairwise similarity of sequences [27, 1, 2], profiles for protein families [11], consensus patterns using motifs [4, 3] and hidden Markov models [16, 9, 5] have all been used for this problem. Recent research suggests that the best-performing methods are discriminative: protein sequences are seen as a set of labeled examples — positive if they are in th...

159 | Multiple alignment using hidden Markov models
- Eddy
- 1995
Citation Context ...milies based on sequence homology. Approaches based on pairwise similarity of sequences [27, 1, 2], profiles for protein families [11], consensus patterns using motifs [4, 3] and hidden Markov models [16, 9, 5] have all been used for this problem. Recent research suggests that the best-performing methods are discriminative: protein sequences are seen as a set of labeled examples — positive if they are in th...

159 | Combining pairwise sequence similarity and support vector machines for remote protein homology detection
- Liao, Noble
- 2002
Citation Context ...sitive if they are in the family and negative otherwise — and a learning algorithm attempts to learn a decision boundary between the different classes. In this category, several successful techniques [15, 19, 17] use protein sequences to train a support vector machine (SVM) [26, 8] classifier. In this paper, we present a method for using SVMs for remote homology detection, based on a family of kernel function...

137 | Mismatch string kernels for SVM protein classification
- Leslie
- 2003
Citation Context ...ance than our mismatch-SVM approach. However, mismatch-SVM performs as well as SVM-pairwise, the best-performing method reported in [19] for this benchmark. The current work is an expanded version of [18], which defined the mismatch kernel and presented results on the Jaakkola et al. dataset. Here, in addition to reporting experiments on the second benchmark dataset and comparing to the SVM-pairwise ...

129 | Dynamic alignment kernels
- Watkins
- 1999
Citation Context ...tion rule to conserved subsequences in remotely related protein sequences. An important characteristic of any protein classification algorithm is its computational efficiency. Recently, Chris Watkins [28] and David Haussler [13] have defined a set of kernel functions over strings, and one of these string kernels has been implemented for a text classification problem [20]. However, the cost of computin...

92 | Fast kernels for string and tree matching
- Vishwanathan, Smola
- 2002
Citation Context ...ces will have a large k-spectrum kernel value if they share many of the same k-mers. One can extend this idea by taking weighted sums of k-spectrum kernels for different values of k, as described in (Vishwanathan and Smola, 2002). For a more sensitive and biologically realistic kernel, we want to allow some degree of mismatching in our feature map. That is, we want the kernel value between two sequences x and y to be large i...

88 | An algorithm for finding signals of unknown length in DNA sequences
- Pavesi, Mauri, et al.
- 2001
Citation Context ...traversal of all instances of k-mers that occur (with mismatches) in the data. The entire kernel matrix is computed in one traversal of the tree. Our algorithm is similar to the approach presented in [24, 22] for finding subsequence patterns that occur with mismatches. A related data structure was also used for sparse prediction trees [10, 23]. A (k, m)-mismatch tree is a rooted tree of depth k where each...

72 | The ASTRAL compendium for sequence and structure analysis
- Brenner
- 2000
Citation Context ...and BLAST over the same dataset. Sequences for these experiments were extracted from the Structural Classification of Proteins (SCOP) [21] version 1.53 using the Astral database (astral.stanford.edu, [7]), removing similar sequences using an E-value threshold of 10^-25. This procedure resulted in 4352 distinct sequences, grouped into families, superfamilies, and folds. All pairwise E-values are comp...

66 | A new discriminative kernel from probabilistic models
- Tsuda, Kawanabe, et al.
- 2002
Citation Context ...MM, and the SVM-Fisher method. We note that, more recently, a newer version of the SAM HMM software has become available (Karplus et al., 2001), modifications to the Fisher kernel have been explored (Tsuda et al., 2002), and other novel approaches to homology detection have been introduced (Spang et al., 2002). Figure 3 illustrates the mismatch kernel’s performance relative to the profile HMM and SVM-Fisher homolog...

64 | What is the value added by human intervention in protein structure prediction
- Karplus, Karchin, et al.
- 2001
Citation Context ...rimental results from Jaakkola et al. for two methods: the SAM-T98 iterative HMM, and the SVM-Fisher method. We note that, more recently, a newer version of the SAM HMM software has become available (Karplus et al., 2001), modifications to the Fisher kernel have been explored (Tsuda et al., 2002), and other novel approaches to homology detection have been introduced (Spang et al., 2002). Figure 3 illustrates the mism...

35 | Protein family classification using sparse Markov transducers
- Eskin, Grundy, et al.
- 2000
Citation Context ...he tree. Our algorithm is similar to the approach presented in [24, 22] for finding subsequence patterns that occur with mismatches. A related data structure was also used for sparse prediction trees [10, 23]. A (k, m)-mismatch tree is a rooted tree of depth k where each internal node has 20 (more generally, l) branches, each labeled with an amino acid (symbol from A). A leaf node represents a fixed k-mer...

29 | The PRINTS protein fingerprint database in its fifth year
- Attwood, Beck, et al.
- 1998
Citation Context ...tural families based on sequence homology. Approaches based on pairwise similarity of sequences [27, 1, 2], profiles for protein families [11], consensus patterns using motifs [4, 3] and hidden Markov models [16, 9, 5] have all been used for this problem. Recent research suggests that the best-performing methods are discriminative: protein sequences are seen as a set of labeled e...

29 | An efficient extension to mixture techniques for prediction and decision trees
- Pereira, Singer
- 1999
Citation Context ...he tree. Our algorithm is similar to the approach presented in [24, 22] for finding subsequence patterns that occur with mismatches. A related data structure was also used for sparse prediction trees [10, 23]. A (k, m)-mismatch tree is a rooted tree of depth k where each internal node has 20 (more generally, l) branches, each labeled with an amino acid (symbol from A). A leaf node represents a fixed k-mer...

26 | Fast kernels for inexact string matching
- Leslie, Kuang
- 2003

25 | Embedding strategies for effective use of information from multiple sequence alignments
- Henikoff, Henikoff
- 1997
Citation Context ... one homology detection method. Qualitatively, the curves for SVM-Fisher and mismatch-SVM are quite similar. When we compare the overall performance of two methods using a two-tailed signed rank test [14, 25], we find that almost none of the differences between methods are statistically significant. Using a p-value threshold of 0.05 and including a Bonferroni adjustment to account for multiple comparisons...

20 | A novel approach to remote homology detection: jumping alignments
- Spang, Rehmsmeier, et al.
- 2002
Citation Context ...oftware has become available (Karplus et al., 2001), modifications to the Fisher kernel have been explored (Tsuda et al., 2002), and other novel approaches to homology detection have been introduced (Spang et al., 2002). Figure 3 illustrates the mismatch kernel’s performance relative to the profile HMM and SVM-Fisher homology detection methods. The figure includes results for all 33 SCOP families, and each series c...

12 | Spelling approximate or repeated motifs using a suffix tree
- Sagot
- 1998
Citation Context ...traversal of all instances of k-mers that occur (with mismatches) in the data. The entire kernel matrix is computed in one traversal of the tree. Our algorithm is similar to the approach presented in [24, 22] for finding subsequence patterns that occur with mismatches. A related data structure was also used for sparse prediction trees [10, 23]. A (k, m)-mismatch tree is a rooted tree of depth k where each...

8 | Computer alignment of sequences, chapter Phylogenetic Analysis of DNA Sequences
- Waterman, Joyce, et al.
- 1991
Citation Context ... functional and structural families based on sequence homology. Approaches based on pairwise similarity of sequences [27, 1, 2], profiles for protein families [11], consensus patterns using motifs [4, 3] and hidden Markov models [16, 9, 5] have all been used for this problem. Recent research suggests that the best-performing ...

2 | Phylogenetic analysis of DNA sequences. Computer Alignment of Sequences, pp.59–72
- Waterman, Joyce, et al.
- 1991
Citation Context ... problems in computational biology is the classification of protein sequences into functional and structural families based on sequence homology. Approaches based on pairwise similarity of sequences (Waterman et al., 1991; Altschul et al., 1990, 1997), profiles for protein families (Gribskov et al., 1987), consensus patterns using motifs (Bairoch, 1995; Attwood et al., 1998) and hidden Markov models (Krogh et al., 199...