## Rational kernels: Theory and algorithms (2004)

### Download Links

- [www.research.att.com]
- [www2.research.att.com]
- [jmlr.csail.mit.edu]
- [www.ai.mit.edu]
- [www.ics.uci.edu]
- [jmlr.org]
- [tagh.de]
- DBLP

### Other Repositories/Bibliography

Venue: Journal of Machine Learning Research

Citations: 38 (7 self)

### BibTeX

@ARTICLE{Cortes04rationalkernels:,

author = {Corinna Cortes and Patrick Haffner and Mehryar Mohri},

title = {Rational kernels: Theory and algorithms},

journal = {Journal of Machine Learning Research},

year = {2004},

volume = {5},

pages = {1035--1062}

}

### Abstract

Many classification algorithms were originally designed for fixed-size vectors. Recent applications in text and speech processing and computational biology require however the analysis of variable-length sequences and more generally weighted automata. An approach widely used in statistical learning techniques such as Support Vector Machines (SVMs) is that of kernel methods, due to their computational efficiency in high-dimensional feature spaces. We introduce a general family of kernels based on weighted transducers or rational relations, rational kernels, that extend kernel methods to the analysis of variable-length sequences or more generally weighted automata. We show that rational kernels can be computed efficiently using a general algorithm of composition of weighted transducers and a general single-source shortest-distance algorithm. Not all rational kernels are positive definite and symmetric (PDS), or equivalently verify the Mercer condition, a condition that guarantees the convergence of training for discriminant classification algorithms such as SVMs. We present several theoretical results related to PDS rational kernels. We show that under some general conditions these kernels are

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ...utomata which contain the correct result in most cases. An approach widely used in statistical learning techniques such as Support Vector Machines (SVMs) (Boser et al., 1992; Cortes and Vapnik, 1995; Vapnik, 1998) is that of kernel methods, due to their computational efficiency in high-dimensional feature spaces. We introduce a general family of kernels based on weighted transducers or rational relations, rat... |

2171 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context ...o use the full weighted automata which contain the correct result in most cases. An approach widely used in statistical learning techniques such as Support Vector Machines (SVMs) (Boser et al., 1992; Cortes and Vapnik, 1995; Vapnik, 1998) is that of kernel methods, due to their computational efficiency in high-dimensional feature spaces. We introduce a general family of kernels based on weighted transducers or rational ... |

2028 | Learning with Kernels
- Scholkopf, Smola
- 2002
Citation Context ...,x_j)_{i,j≤n} for all n ≥ 1 and all {x_1,...,x_n} ⊆ X is symmetric and all its eigenvalues are non-negative. PDS kernels can be used to construct other families of kernels that also meet these conditions (Schölkopf and Smola, 2002). Polynomial kernels of degree p are formed from the expression $(K + a)^p$, and Gaussian kernels can be formed as $\exp(-d^2/\sigma^2)$ with $d^2(x,y) = K(x,x) + K(y,y) - 2K(x,y)$. The following sections wi... |
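The two constructions quoted above can be checked numerically on a small Gram matrix. The following is a minimal sketch, assuming NumPy and a toy base kernel (a plain dot product); the function names `polynomial_kernel`, `gaussian_kernel`, and `is_pds` are illustrative and do not come from the paper.

```python
import numpy as np

def polynomial_kernel(K, a=1.0, p=2):
    """Entrywise (K + a)^p for a base Gram matrix K (assumes a >= 0)."""
    return (K + a) ** p

def gaussian_kernel(K, sigma=1.0):
    """exp(-d^2 / sigma^2) with d^2(x, y) = K(x, x) + K(y, y) - 2 K(x, y)."""
    diag = np.diag(K)
    d2 = diag[:, None] + diag[None, :] - 2.0 * K
    return np.exp(-d2 / sigma ** 2)

def is_pds(K, tol=1e-9):
    """Symmetric with all eigenvalues non-negative (up to a tolerance)."""
    return bool(np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol)

# Toy base kernel: dot products of three 2-d points (hypothetical data).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = X @ X.T

K_poly = polynomial_kernel(K, a=1.0, p=2)
K_gauss = gaussian_kernel(K, sigma=1.0)
print(is_pds(K), is_pds(K_poly), is_pds(K_gauss))
```

The eigenvalue test is only a finite-sample check on one Gram matrix; it illustrates the definition rather than proving that a kernel is PDS.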

1291 | A training algorithm for optimal margin classifiers
- Boser, Guyon, et al.
Citation Context ...preferable instead to use the full weighted automata which contain the correct result in most cases. An approach widely used in statistical learning techniques such as Support Vector Machines (SVMs) (Boser et al., 1992; Cortes and Vapnik, 1995; Vapnik, 1998) is that of kernel methods, due to their computational efficiency in high-dimensional feature spaces. We introduce a general family of kernels based on weighted... |

830 | Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
- Durbin, Eddy, et al.
- 1998
Citation Context ...thers. Rational kernels provide a unified framework for the design of computationally efficient kernels for strings or weighted automata. The framework includes in particular pair-HMM string kernels (Durbin et al., 1998; Watkins, 1999), Haussler’s convolution kernels for strings, the path kernels of Takimoto and Warmuth (2003), and other classes of string kernels introduced for computational biology. We also showed ... |

487 | Boostexter: A boosting-based system for text categorization - Schapire, Singer - 2000 |

385 | Text classification using string kernels
- Lodhi, Cristianini, et al.
- 2001
Citation Context ...t is an arbitrary function mapping K to R. Figure 1 shows an example of a weighted transducer over the probability semiring corresponding to the gappy n-gram kernel with decay factor λ as defined by Lodhi et al. (2001). Such gappy n-gram kernels are rational kernels (Cortes et al., 2003c). Rational kernels can be naturally extended to kernels over weighted automata. Let A be a weighted automaton defined over the s... |

368 | Convolution Kernels on Discrete Structures
- Haussler
- 1999
Citation Context ... rational kernels. We also study the relationship between rational kernels and some commonly used string kernels or similarity measures such as the edit-distance, the convolution kernels of Haussler (Haussler, 1999), and some string kernels used in the context of computational biology (Leslie et al., 2003). We show that these kernels are all specific instances of rational kernels. In each case, we explicitly de... |

305 | LINPACK User's Guide
- Dongarra, Bunch, et al.
- 1979
Citation Context ...nes a PDS rational kernel, M is a symmetric matrix with non-negative eigenvalues, i.e., M is symmetric positive semi-definite. The Cholesky decomposition extends to the case of semidefinite matrices (Dongarra et al., 1979): there exists an upper triangular matrix $R = (R_{ij})$ with non-negative diagonal elements such that $M = RR^t$. Let $Y = \{y_1,...,y_n\}$ be an arbitrary subset of n distinct strings of $\Sigma^*$. Define the weig... |

274 | Combinatorial Optimization: Networks and
- Lawler
- 1976
Citation Context ...s linear: $O(|Q| + (T_\oplus + T_\otimes)|E|)$, where $T_\oplus$ denotes the maximum time to compute ⊕ and $T_\otimes$ the time to compute ⊗ (Mohri, 2002). The algorithm can then be viewed as a generalization of Lawler’s algorithm (Lawler, 1976) to the case of an arbitrary semiring K. It is then based on a generalized relaxation of the outgoing transitions of each state of M visited in reverse topological order (Mohri, 2002). Footnote 4: See Pereira... |
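The relaxation in reverse topological order described in this context can be sketched as follows. This is a minimal illustration for an acyclic automaton, not the paper's implementation: the arc-list representation, the function name `shortest_distance`, and the example weights are assumptions made here. Each state and each arc is visited a constant number of times, which matches the quoted linear bound.

```python
import math

def shortest_distance(num_states, arcs, initial, finals, plus, times, zero):
    """Semiring sum of the weights of all paths from `initial` to a final state.

    arcs:   list of (source, destination, weight) of an ACYCLIC automaton
    finals: dict mapping final states to their final weights
    """
    out = [[] for _ in range(num_states)]
    indegree = [0] * num_states
    for src, dst, w in arcs:
        out[src].append((dst, w))
        indegree[dst] += 1

    # Kahn's algorithm: the list grows while it is being iterated.
    order = [q for q in range(num_states) if indegree[q] == 0]
    for q in order:
        for dst, _ in out[q]:
            indegree[dst] -= 1
            if indegree[dst] == 0:
                order.append(dst)
    if len(order) != num_states:
        raise ValueError("automaton has a cycle")

    # d[q] = final weight of q (+) sum over outgoing arcs of w (x) d[next].
    d = [zero] * num_states
    for q in reversed(order):
        total = finals.get(q, zero)
        for dst, w in out[q]:
            total = plus(total, times(w, d[dst]))
        d[q] = total
    return d[initial]

# Hypothetical 4-state automaton with two paths from state 0 to state 3.
arcs = [(0, 1, 0.5), (0, 2, 0.5), (1, 3, 0.4), (2, 3, 0.2)]

# Probability semiring (+, x): total weight of both paths.
prob = shortest_distance(4, arcs, 0, {3: 1.0},
                         lambda a, b: a + b, lambda a, b: a * b, 0.0)
# Tropical semiring (min, +): weight of the lightest path.
trop = shortest_distance(4, arcs, 0, {3: 0.0},
                         min, lambda a, b: a + b, math.inf)
print(prob, trop)
```

Passing the semiring operations as arguments keeps the traversal identical across semirings, which is the point of the generalization mentioned in the text.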

179 | Automata-Theoretic Aspects of Formal Power Series
- Salomaa, Soittola
- 1978
Citation Context ...ith the input alphabet of T2. Then, the result of the composition of T1 and T2 is a weighted transducer T1 ◦ T2 which, when it is regulated, is defined for all x,y by (Berstel, 1979; Eilenberg, 1974; Salomaa and Soittola, 1978; Kuich and Salomaa, 1986)² $[\![T_1 \circ T_2]\!](x,y) = \bigoplus_{z \in \Delta^*} [\![T_1]\!](x,z) \otimes [\![T_2]\!](z,y)$. Footnote 2: We use a matrix notation for the definition of composition as opposed to a functional notation. ... |

126 | Mismatch string kernels for SVM protein classification
- Leslie, Eskin, et al.
- 2003
Citation Context ...monly used string kernels or similarity measures such as the edit-distance, the convolution kernels of Haussler (Haussler, 1999), and some string kernels used in the context of computational biology (Leslie et al., 2003). We show that these kernels are all specific instances of rational kernels. In each case, we explicitly describe the corresponding weighted transducer. These transducers are often simple and efficie... |

124 | Speech Recognition by Composition of Weighted Finite Automata
- Pereira, Riley
- 1997
Citation Context ...ansducer. 4.1 Composition of weighted transducers. There exists a general and efficient composition algorithm for weighted transducers which takes advantage of the sparseness of the input transducers (Pereira and Riley, 1997; Mohri et al., 1996). States in the composition T1 ◦ T2 of two weighted transducers T1 and T2 are identified with pairs of ... [Figure: example weighted transducers] |

114 | Transductions and Context-Free Languages. Teubner Studienbucher
- Berstel
- 1979
Citation Context ...tput alphabet of T1, coincides with the input alphabet of T2. Then, the result of the composition of T1 and T2 is a weighted transducer T1 ◦ T2 which, when it is regulated, is defined for all x,y by (Berstel, 1979; Eilenberg, 1974; Salomaa and Soittola, 1978; Kuich and Salomaa, 1986)² $[\![T_1 \circ T_2]\!](x,y) = \bigoplus_{z \in \Delta^*} [\![T_1]\!](x,z) \otimes [\![T_2]\!](z,y)$. Footnote 2: We use a matrix notation for the definition of composition as opposed to a ... |

97 | The Design Principles of a Weighted Finite-State Transducer
- Mohri, Pereira, et al.
- 2000
Citation Context ...ased on the following criterion: it is considered an error if the highest scoring class given by the classifier is none of these labels. 7.2.3 Implementation and Results. We used the AT&T FSM Library (Mohri et al., 2000) and the GRM Library (Allauzen et al., 2004) for the implementation of the n-gram rational kernels $K_n$ used. We used these kernels with SVMs, using a general learning library for large-margin classifi... |

80 | Weighted automata in text and speech processing
- Mohri, Pereira, et al.
- 1996
Citation Context ... of weighted transducers. There exists a general and efficient composition algorithm for weighted transducers which takes advantage of the sparseness of the input transducers (Pereira and Riley, 1997; Mohri et al., 1996). States in the composition T1 ◦ T2 of two weighted transducers T1 and T2 are identified with pairs of ... [Figure: example weighted transducers] |

80 | The kernel trick for distances
- Scholkopf
- 2000
Citation Context ...ere d can be represented by a rational kernel. We will use these results later when dealing with the case of the edit-distance. Footnote 7: Many of the results given by Berg et al. (1984) are re-presented in (Schölkopf, 2001) with the terminology of conditionally positive definite instead of negative definite kernels. We adopt the original terminology used by Berg et al. (1984). Definition... |

73 | Semiring Frameworks and Algorithms for Shortest-Distance Problems - Mohri |

59 | Path kernels and multiplicative updates - Takimoto, Warmuth - 2003 |

39 | Optimizing SVMs for complex call classification
- Haffner, Tur, et al.
- 2003
Citation Context ...ational kernels $K_n$ used. We used these kernels with SVMs, using a general learning library for large-margin classification (LLAMA), which offers an optimized multi-class recombination of binary SVMs (Haffner et al., 2003). Training took a few hours on a single processor of a 2.4 GHz Intel Pentium Linux cluster with 2 GB of memory and 512 KB cache. In our experiments, we used the trigram kernel $K_3$ with a ... |

25 | Edit-distance of weighted automata: General definitions and algorithms - Mohri - 2003 |

20 | Positive definite rational kernels - Cortes, Haffner, et al. - 2003 |

12 | Automata, Languages and Machines, volume A–B
- Eilenberg
- 1974
Citation Context ... T1, coincides with the input alphabet of T2. Then, the result of the composition of T1 and T2 is a weighted transducer T1 ◦ T2 which, when it is regulated, is defined for all x, y by (Berstel, 1979; Eilenberg, 1974; Salomaa and Soittola, 1978; Kuich and Salomaa, 1986):² $[\![T_1 \circ T_2]\!](x,y) = \bigoplus_{z \in \Delta^*} [\![T_1]\!](x,z) \otimes [\![T_2]\!](z,y)$ (5). Note that a transducer can be viewed as a matrix over a countable set $\Sigma^* \times \Delta^*$ and com... |
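The matrix view of composition in this context can be illustrated on finite weighted relations. The sketch below assumes each relation is given explicitly as a dictionary from string pairs to weights, which is only possible for finite relations; the name `compose_relations` and the example weights are hypothetical.

```python
def compose_relations(t1, t2, plus, times):
    """Semiring matrix product of two finite weighted relations.

    t1 maps (x, z) to a weight, t2 maps (z, y) to a weight; the result maps
    (x, y) to the semiring sum over z of t1[(x, z)] (x) t2[(z, y)].
    """
    result = {}
    for (x, z1), w1 in t1.items():
        for (z2, y), w2 in t2.items():
            if z1 != z2:
                continue  # only matching intermediate strings contribute
            w = times(w1, w2)
            key = (x, y)
            result[key] = plus(result[key], w) if key in result else w
    return result

# Hypothetical relations over the probability semiring (+, x).
T1 = {("a", "b"): 0.5, ("a", "c"): 0.5}
T2 = {("b", "d"): 0.2, ("c", "d"): 0.4, ("e", "d"): 0.9}

T12 = compose_relations(T1, T2, lambda u, v: u + v, lambda u, v: u * v)
print(T12)
```

Pairs with no matching intermediate string are simply absent from the result, which corresponds to the semiring zero in the matrix view.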

8 | A general weighted grammar library
- Allauzen, Mohri, et al.
Citation Context ...nsidered an error if the highest scoring class given by the classifier is none of these labels. 7.2.3 Implementation and Results. We used the AT&T FSM Library (Mohri et al., 2000) and the GRM Library (Allauzen et al., 2004) for the implementation of the n-gram rational kernels $K_n$ used. We used these kernels with SVMs, using a general learning library for large-margin classification (LLAMA), which offers an optimized mu... |

7 | Rational kernels - Cortes, Haffner, Mohri - 2002 |

5 | Weighted automata kernels - general framework and algorithms
- Cortes, Haffner, et al.
- 2003
Citation Context ...ral difficult large-vocabulary spoken-dialog classification tasks based on deployed spoken-dialog systems. Our results ... Footnote 1: We have described in shorter publications part of the material presented here (Cortes et al., 2003a,b,c,d).

| Semiring | Set | ⊕ | ⊗ | 0 | 1 |
|---|---|---|---|---|---|
| Boolean | {0,1} | ∨ | ∧ | 0 | 1 |
| Probability | R+ | + | × | 0 | 1 |
| Log | R ∪ {−∞,+∞} | ⊕log | + | +∞ | 0 |
| Tropical | R ∪ {−∞,+∞} | min | + | +∞ | 0 |

Table 1: Semiring e... |
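The four semirings of Table 1 can be written down as small records of operations and identity elements. This is a sketch under assumptions: the `Semiring` record and its field names are illustrative, and ⊕log is taken to be the usual log-semiring sum, a ⊕log b = −log(e^(−a) + e^(−b)), which the truncated snippet does not spell out.

```python
import math
import operator
from typing import Any, Callable, NamedTuple

class Semiring(NamedTuple):
    plus: Callable[[Any, Any], Any]   # the semiring sum
    times: Callable[[Any, Any], Any]  # the semiring product
    zero: Any                         # identity for plus
    one: Any                          # identity for times

def log_plus(a, b):
    """Assumed definition: -log(exp(-a) + exp(-b)), computed stably."""
    if a == math.inf:
        return b
    if b == math.inf:
        return a
    if a == b:
        return a - math.log(2.0)
    return min(a, b) - math.log1p(math.exp(-abs(a - b)))

BOOLEAN = Semiring(operator.or_, operator.and_, False, True)
PROBABILITY = Semiring(operator.add, operator.mul, 0.0, 1.0)
LOG = Semiring(log_plus, operator.add, math.inf, 0.0)
TROPICAL = Semiring(min, operator.add, math.inf, 0.0)

# Weights in the log semiring are negative log probabilities, so adding
# probabilities 0.5 and 0.25 should give -log(0.75).
combined = LOG.plus(-math.log(0.5), -math.log(0.25))
print(combined, -math.log(0.75))
```

Writing the identities explicitly makes it easy to check the semiring laws on sample values before plugging a semiring into a generic algorithm.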

4 | Distribution Kernels Based on Moments of Counts
- Cortes, Mohri
- 2004
Citation Context ...der moments of the distribution of the counts of sequences, moment kernels, and report the results of our experiments on the same tasks which demonstrate a consistent gain in classification accuracy (Cortes and Mohri, 2004). Rational kernels can be used in a similar way in many other natural language processing, speech processing, and bioinformatics tasks. Acknowledgments: We thank Allen ... |

3 | Automata, Languages and Machines, volume A–B - Eilenberg - 1974 |

3 | Voice signatures - Shafran, Riley, Mohri |

1 | Lattice Kernels for Spoken Dialog Classification - Cortes, Haffner, Mohri |

1 | Rational Kernels: Theory and Algorithms
- Pereira, Riley
- 1997
Citation Context ...1 and T2. 4.1 Composition of weighted transducers. There exists a general and efficient composition algorithm for weighted transducers which takes advantage of the sparseness of the input transducers (Pereira and Riley, 1997; Mohri et al., 1996). States in the composition T1 ◦ T2 of two weighted transducers T1 and T2 are identified with pairs of a state of T1 and a state of T2. Leaving aside transitions with ε inputs or ... |
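The pair-of-states construction described in this context can be sketched for the ε-free case, which is the case the snippet sets out before it is cut off. The dictionary layout of a transducer, the function name `compose`, and the example weights are assumptions made for illustration and are not the FSM Library interface. Only pairs reachable from the pair of start states are created, which is how the construction benefits from sparse inputs.

```python
from collections import deque

def compose(t1, t2, times=lambda a, b: a * b):
    """Epsilon-free composition of two weighted transducers.

    Each transducer is a dict with keys:
      'start'  : start state
      'finals' : {state: final weight}
      'arcs'   : {state: [(input, output, weight, next_state), ...]}
    States of the result are pairs (state of t1, state of t2).
    """
    start = (t1["start"], t2["start"])
    arcs, finals = {}, {}
    queue, seen = deque([start]), {start}
    while queue:
        q = queue.popleft()
        q1, q2 = q
        if q1 in t1["finals"] and q2 in t2["finals"]:
            finals[q] = times(t1["finals"][q1], t2["finals"][q2])
        for a, b, w1, n1 in t1["arcs"].get(q1, []):
            for b2, c, w2, n2 in t2["arcs"].get(q2, []):
                if b != b2:
                    continue  # output of t1 must match input of t2
                nxt = (n1, n2)
                arcs.setdefault(q, []).append((a, c, times(w1, w2), nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}

# Hypothetical transducers over the probability semiring.
T1 = {"start": 0, "finals": {1: 1.0},
      "arcs": {0: [("a", "b", 0.5, 1), ("a", "c", 0.3, 1)]}}
T2 = {"start": 0, "finals": {1: 1.0},
      "arcs": {0: [("b", "d", 0.2, 1)]}}

T12 = compose(T1, T2)
print(T12["arcs"])
```

In this example the arc `a:c` of `T1` finds no matching input in `T2`, so the result has a single arc labelled `a:d` leading to the final pair `(1, 1)`.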