## Convolution Kernels for Natural Language (2001)

Venue: Advances in Neural Information Processing Systems 14

Citations: 255 (7 self)

### BibTeX

@INPROCEEDINGS{Collins01convolutionkernels,
  author    = {Michael Collins and Nigel Duffy},
  title     = {Convolution Kernels for Natural Language},
  booktitle = {Advances in Neural Information Processing Systems 14},
  year      = {2001},
  pages     = {625--632},
  publisher = {MIT Press}
}

### Abstract

We describe the application of kernel methods to Natural Language Processing (NLP) problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional representations of these structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.

### Citations

2188 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context: ...s on the ATIS corpus of parse trees. (§1 Introduction) Kernel methods have been widely used to extend the applicability of many well-known algorithms, such as the Perceptron [1], Support Vector Machines [6], or Principal Component Analysis [15]. A key property of these algorithms is that the only operation they require is the evaluation of dot products between pairs of examples. One may therefore replac...
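
The excerpt's key observation, that these algorithms only touch examples through dot products, can be illustrated with a dual-form perceptron in which the dot product is replaced by an arbitrary Mercer kernel. This is a minimal sketch, not code from the paper; the function names and toy data are illustrative.

```python
def kernel_perceptron(X, y, K, epochs=10):
    """Dual (kernelized) perceptron: learn a mistake count alpha_i
    per training example instead of an explicit weight vector."""
    n = len(X)
    alpha = [0] * n
    for _ in range(epochs):
        for i in range(n):
            # The decision value uses only kernel evaluations,
            # never an explicit feature vector.
            score = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(n))
            if y[i] * score <= 0:      # mistake (or tie): update
                alpha[i] += 1
    return alpha

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

# Toy linearly separable data.
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -1.5)]
y = [1, 1, -1, -1]
alpha = kernel_perceptron(X, y, linear_kernel)
```

Swapping `linear_kernel` for a kernel over discrete structures turns the same learner into a classifier over, say, parse trees, which is exactly the move the paper makes.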

2112 | Building a Large Annotated Corpus of English: the Penn Treebank
- Marcus, Santorini, et al.
- 1993
Citation Context: ...plied to this problem. (§4 Experimental Results) To demonstrate the utility of convolution kernels for natural language we applied our tree kernel to the problem of parsing the Penn treebank ATIS corpus [14]. We split the treebank randomly into a training set of size 800, a development set of size 200 and a test set of size 336. This was done 10 different ways to obtain statistically significant results...

524 | An efficient boosting algorithm for combining preferences
- Freund, Iyer, et al.
- 2003
Citation Context: ...and Tagging. This section formalizes the use of kernels for parsing and tagging problems. The method is derived by the transformation from ranking problems to a margin-based classification problem in [8]. It is also related to the Markov Random Field methods for parsing suggested in [13], and the boosting methods for parsing in [4]. We consider the following set-up: training data is a set of exampl...

413 | Large margin classification using the perceptron algorithm
- Freund, Schapire
- 1999
Citation Context: ...A PCFG was trained on the training set, and a beam search was used to give a set of parses, with PCFG probabilities, for each of the sentences. We applied a variant of the voted perceptron algorithm [7], which is a more robust version of the original perceptron algorithm with performance similar to that of SVMs. The voted perceptron can be kernelized in the same way that SVMs can, but it can be consi...
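
The voted perceptron variant mentioned in this excerpt can be sketched as follows: every intermediate prediction vector is kept, weighted by the number of rounds it survived, and test-time predictions take a weighted vote. In dual form each vector is just a list of mistake indices, so the whole procedure kernelizes. A hedged sketch, not the authors' implementation; names are illustrative.

```python
def voted_perceptron_train(X, y, K, epochs=10):
    """Return a list of (mistake_indices, survival_count) pairs,
    one per intermediate perceptron hypothesis."""
    mistakes = []        # indices defining the current prediction vector
    survived = 1         # rounds the current vector has survived
    hypotheses = []
    for _ in range(epochs):
        for i in range(len(X)):
            score = sum(y[j] * K(X[j], X[i]) for j in mistakes)
            if y[i] * score <= 0:
                # Freeze the old vector with its vote weight, then update.
                hypotheses.append((list(mistakes), survived))
                mistakes.append(i)
                survived = 1
            else:
                survived += 1
    hypotheses.append((list(mistakes), survived))
    return hypotheses

def voted_predict(hypotheses, X, y, K, x):
    """Weighted vote over the sign of every stored hypothesis."""
    total = 0
    for mist, count in hypotheses:
        s = sum(y[j] * K(X[j], x) for j in mist)
        total += count * (1 if s > 0 else -1)
    return 1 if total > 0 else -1
```

Only kernel evaluations against past mistake examples are needed, so any Mercer kernel, including a tree kernel, can be plugged in for `K`.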

387 | Text classification using string kernels
- Lodhi, Saunders, et al.
Citation Context: ...goes into some detail describing which construction operations are valid in this context, i.e. which operations maintain the essential Mercer conditions. This paper and previous work by Lodhi et al. [12] examining the application of convolution kernels to strings provide some evidence that convolution kernels may provide an extremely useful tool for applying modern machine learning techniques to high...

369 | Convolution kernels on discrete structures
- Haussler
- 1999
Citation Context: ...to parsing using the perceptron algorithm, giving experimental results on the ATIS corpus of parses. The kernels we describe are instances of "Convolution Kernels", which were introduced by Haussler [10] and Watkins [16], and which involve a recursive calculation over the "parts" of a discrete structure. Although we concentrate on NLP tasks in this paper, the kernels should also be useful in computat...
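
The "recursive calculation over the parts" can be made concrete for the tree case: the count of common subtrees rooted at two nodes follows from the counts at their children. A sketch under an assumed minimal `Node` class; `lam` stands in for the size-decay parameter λ used in such kernels, and the example sentence is illustrative.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def production(n):
    # A node's grammar rule: its label plus its children's labels.
    return (n.label, tuple(c.label for c in n.children))

def common_subtrees(n1, n2, lam=1.0):
    """C(n1, n2): decayed count of common subtrees rooted at n1 and n2."""
    if not n1.children or production(n1) != production(n2):
        return 0.0          # leaves anchor no rule; different rules share nothing
    prod = lam
    for c1, c2 in zip(n1.children, n2.children):
        prod *= 1 + common_subtrees(c1, c2, lam)
    return prod

def nodes(t):
    yield t
    for c in t.children:
        yield from nodes(c)

def tree_kernel(t1, t2, lam=1.0):
    # Sum C over all node pairs; memoizing C gives an
    # O(|T1| * |T2|) dynamic program.
    return sum(common_subtrees(a, b, lam)
               for a in nodes(t1) for b in nodes(t2))

t = Node("S", [Node("NP", [Node("D", [Node("the")]),
                           Node("N", [Node("man")])]),
               Node("VP", [Node("V", [Node("ran")])])])
```

With `lam=1.0`, `tree_kernel(t, t)` counts the subtrees of `t`, since the kernel of a tree with itself sums the squared counts of each subtree type.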

271 | Discriminative reranking for natural language parsing
- Collins, Koo
- 2005
Citation Context: ...tion from ranking problems to a margin-based classification problem in [8]. It is also related to the Markov Random Field methods for parsing suggested in [13], and the boosting methods for parsing in [4]. We consider the following set-up: training data is a set of example input/output pairs. In parsing we would have training examples {s_i, t_i} where each s_i is a sentence and each t_i is the cor...

179 | Kernel Principal Component Analysis
- Schölkopf, Müller
- 1999
Citation Context: ...(§1 Introduction) Kernel methods have been widely used to extend the applicability of many well-known algorithms, such as the Perceptron [1], Support Vector Machines [6], or Principal Component Analysis [15]. A key property of these algorithms is that the only operation they require is the evaluation of dot products between pairs of examples. One may therefore replace the dot product with a Mercer kernel...

146 | Beyond Grammar: An Experience-Based Theory of Language
- Bod
- 1998
Citation Context: ...that is exponential in its size). Because of this we would like to design algorithms whose computational complexity does not depend on n. Representations of this kind have been studied extensively by Bod [2]. However, the work in [2] involves training and decoding algorithms that depend computationally on the number of subtrees involved. The parameter estimation techniques described in [2] do not corre...

122 | Dynamic alignment kernels
- Watkins
- 1999
Citation Context: ...the perceptron algorithm, giving experimental results on the ATIS corpus of parses. The kernels we describe are instances of "Convolution Kernels", which were introduced by Haussler [10] and Watkins [16], and which involve a recursive calculation over the "parts" of a discrete structure. Although we concentrate on NLP tasks in this paper, the kernels should also be useful in computational biology, wh...

90 | Statistical techniques for natural language parsing
- Charniak
- 1997
Citation Context: ...lities are typically estimated using maximum likelihood estimation, which gives simple relative frequency estimates. Competing analyses for the same sentence are ranked using these probabilities. See [3] for an introduction to these methods. This paper proposes an alternative to generative models such as PCFGs and HMMs. Instead of identifying parameters with rules of the grammar, we show how kernels...

58 | Efficient algorithms for parsing the DOP model
- Goodman
- 1996
Citation Context: ...spite of an exponentially large number of subtrees, and that efficient parameter estimation techniques exist which optimize discriminative criteria that have been well-studied theoretically. Goodman [9] gives an ingenious conversion of the model in [2] to an equivalent PCFG whose number of rules is linear in the size of the training data, thus solving many of the computational issues. An exact imple...

33 | The DOP estimation method is biased and inconsistent
- Johnson
- 2002
Citation Context: ...depend computationally on the number of subtrees involved. The parameter estimation techniques described in [2] do not correspond to maximum-likelihood estimation or a discriminative criterion: see [11] for discussion. The methods we propose show that the score for a parse can be calculated in polynomial time in spite of an exponentially large number of subtrees, and that efficient parameter estimat...

13 | Parsing with a Single Neuron: Convolution Kernels for Natural Language Problems
- Collins, Duffy
- 2001
Citation Context: ...fragments exponentially with their size. It is straightforward to design similar kernels for tagging problems (see figure 1) and for another common structure found in NLP, dependency structures. See [5] for details. In the tagging kernel, the implicit feature representation tracks all features consisting of a subsequence of state labels, each with or without an underlying word. For example, the pair...

4 | Estimators for stochastic "unification-based" grammars
- Johnson, Geman, et al.
- 1999
Citation Context: ...lems. The method is derived by the transformation from ranking problems to a margin-based classification problem in [8]. It is also related to the Markov Random Field methods for parsing suggested in [13], and the boosting methods for parsing in [4]. We consider the following set-up: training data is a set of example input/output pairs. In parsing we would have training examples {s_i, t_i} where each s_i is a sentence and e...