## An SVM Based Voting Algorithm with Application to Parse Reranking (2003)


Venue: Proc. of CoNLL 2003

Citations: 34 (4 self)

### BibTeX

@INPROCEEDINGS{Shen03ansvm,
  author = {Libin Shen and Aravind K. Joshi},
  title = {An SVM Based Voting Algorithm with Application to Parse Reranking},
  booktitle = {Proc. of CoNLL 2003},
  year = {2003},
  pages = {9--16}
}

### Abstract

This paper introduces a novel Support Vector Machines (SVMs) based voting algorithm for reranking, which provides an indirect way of handling sequential models. We present a risk formulation for this voting algorithm under the PAC framework. Applied to the parse reranking problem, the algorithm achieves labeled recall/precision of 89.4%/89.8% on section 23 of the Penn Treebank WSJ.
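The voting idea in the abstract can be illustrated with a toy sketch. All names here, and the hand-set linear pairwise scorer, are illustrative assumptions, not the paper's actual feature set or trained SVM: each pair of candidate parses casts a vote via a pairwise decision function, and the candidate collecting the most votes wins.

```python
# Toy sketch of voting-based reranking (illustrative names; the paper's
# actual scorer is an SVM over parse features, not this hand-set w).
from itertools import combinations

def score_pair(w, xa, xb):
    """Pairwise decision value; positive means xa is preferred to xb.
    Stands in for an SVM decision function on the pair (xa, xb)."""
    return sum(wi * (a - b) for wi, a, b in zip(w, xa, xb))

def rerank(w, candidates):
    """Every pair of candidates casts one vote; the candidate
    collecting the most votes is returned as the best parse."""
    votes = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if score_pair(w, candidates[i], candidates[j]) > 0:
            votes[i] += 1
        else:
            votes[j] += 1
    return max(range(len(candidates)), key=votes.__getitem__)

# Toy run: a weight vector that only looks at the second feature
# picks the candidate whose second feature is largest.
best = rerank([0.0, 1.0], [[1.0, 0.2], [0.5, 0.9], [0.3, 0.4]])  # -> 1
```

Voting over all pairs is what makes the method a reranker rather than a plain classifier: no single decision has to be globally correct, only the aggregate preference.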

### Citations

9946 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ... ordinal regression. We then apply this algorithm to the parse reranking problem. 1.1 A Short Introduction of SVMs In this section, we give a short introduction of Support Vector Machines. We follow (Vapnik, 1998)’s definition of SVMs. For each training sample (yi, xi), yi represents its class, and xi represents its input vector defined on a d-dimensional space. Suppose the training samples {(y1, x1), ..., (y...
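Following the definition quoted in this excerpt, a minimal sketch of the linear decision rule; the weight vector would come from training, and the values used below are illustrative only:

```python
# Minimal sketch of the linear SVM decision rule behind the quoted
# definition: each sample (y_i, x_i) has class y_i in {-1, +1} and a
# d-dimensional input x_i. Weights here are illustrative, not learned.
def svm_decision(w, b, x):
    """f(x) = w . x + b; the SVM predicts the class sign(f(x))."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Map the decision value to a class label in {-1, +1}."""
    return 1 if svm_decision(w, b, x) >= 0 else -1
```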

2506 | Conditional random fields: probabilistic modeling for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context ...ptimization in margin maximization. label bias problem, which means that the transitions leaving a given state compete only against each other, rather than against all other transitions in the model (Lafferty et al., 2001). Intuitively, it is the local normalization that results in the label bias problem. One way of using discriminative machine learning algorithms in sequential models is to rerank the n-best outputs o...

2235 | Building a Large Annotated Corpus for English: The Penn Treebank
- Marcus, Santorini, et al.
- 1993
Citation Context ...fier. The soft margin parameter C is set to its default value in SVM-light. We use the same data set as described in (Collins, 2000; Collins and Duffy, 2002). Section 2-21 of the Penn WSJ Treebank (Marcus et al., 1994) are used as training data, and section 23 is used for final test. The training data contains around 40,000 sentences, each of which has 27 distinct parses on average. Of the 40,000 training sentence...

1053 | An Introduction to Support Vector Machines - Cristianini, Shawe-Taylor - 2000 |

1023 | Head-driven Statistical Models for Natural Language Parsing
- Collins
- 1999
Citation Context ...LR/LP = labeled recall/precision. CBs = average number of crossing brackets per sentence. 0 CBs, 2 CBs are the percentage of sentences with 0 or ≤ 2 crossing brackets respectively. CO99 = Model 2 of (Collins, 1999). CH00 = (Charniak, 2000). CO00 = (Collins, 2000). ≤40 Words (2245 sentences) Model LR LP CBs 0 CBs 2 CBs CO99 88.5% 88.7% 0.92 66.7% 87.1% CH00 90.1% 90.1% 0.74 70.1% 89.6% CO00 90.1% 90.4% 0.73 70...

878 | A Maximum-Entropy-Inspired Parser
- Charniak
- 2000
Citation Context ...recision. CBs = average number of crossing brackets per sentence. 0 CBs, 2 CBs are the percentage of sentences with 0 or ≤ 2 crossing brackets respectively. CO99 = Model 2 of (Collins, 1999). CH00 = (Charniak, 2000). CO00 = (Collins, 2000). ≤40 Words (2245 sentences) Model LR LP CBs 0 CBs 2 CBs CO99 88.5% 88.7% 0.92 66.7% 87.1% CH00 90.1% 90.1% 0.74 70.1% 89.6% CO00 90.1% 90.4% 0.73 70.7% 89.6% SVM 89.9% 90.3% ...

765 | Probabilistic outputs for support vector machines and comparison to regularized likelihood methods
- Platt
- 1999
Citation Context ...ng hyperplane, but not a probability. A possible solution to this problem is to map SVMs’ results into probabilities through a Sigmoid function, and use Viterbi search to combine those probabilities (Platt, 1999). However, this approach conflicts with SVMs’ purpose of achieving the so-called global optimization. First, this approach may constrain SVMs to local features because of the left-to-right scanning...
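The sigmoid mapping mentioned in this excerpt, in the spirit of Platt (1999), can be sketched as follows. The parameters A and B would normally be fit on held-out data by minimizing negative log-likelihood; the default values below are placeholders, not fitted:

```python
# Sketch of mapping an SVM decision value f to a probability via a
# sigmoid, following the form used in Platt (1999). A and B are
# normally fit on held-out data; these defaults are placeholders.
import math

def platt_probability(f, A=-1.0, B=0.0):
    """P(y = +1 | f) = 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + math.exp(A * f + B))

# With A < 0, larger decision values map to higher probabilities;
# with B == 0, a decision value of 0 maps to probability 0.5.
```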

677 | An Introduction to Tree Adjoining Grammars
- Joshi
- 1987
Citation Context ... harmful as they are in tree kernel systems. One way to include more useful features is to take advantage of the derivation tree and the elementary trees in Lexicalized Tree Adjoining Grammar (LTAG) (Joshi and Schabes, 1997). The basic idea is that each elementary tree and every segment in a derivation tree is linguistically meaningful. We also plan to apply this algorithm to other sequential models, especially to the S...

511 | Making large-scale support vector machine learning practical
- Joachims
- 1998
Citation Context ...ns, samples are usually inseparable even if the kernel trick is used. SVMs can still be trained to maximize the margin through the method of soft margin. 5 Experiments and Analysis We use SVM-light (Joachims, 1998) as the SVM classifier. The soft margin parameter C is set to its default value in SVM-light. We use the same data set as described in (Collins, 2000; Collins and Duffy, 2002). Section 2-21 of the ...

428 | Large margin classification using the perceptron algorithm
- Freund, Schapire
- 1999
Citation Context ...ut of an existing parser (Collins, 1999, Model 2). One is based on Markov Random Fields, and the other is based on a boosting approach. In (Collins and Duffy, 2002), the use of Voted Perceptron (VP) (Freund and Schapire, 1999) for the parse reranking problem has been described. In that paper, the tree kernel (Collins and Duffy, 2001) has been used to efficiently count the number of common subtrees as described in (Bod, 19...

279 | Convolution kernels for natural language
- Collins, Duffy
- 2001
Citation Context ... on a boosting approach. In (Collins and Duffy, 2002), the use of Voted Perceptron (VP) (Freund and Schapire, 1999) for the parse reranking problem has been described. In that paper, the tree kernel (Collins and Duffy, 2001) has been used to efficiently count the number of common subtrees as described in (Bod, 1998). In this paper we will follow the reranking approach. We describe a novel SVM-based voting algorithm for ...

222 | New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron
- Collins, Duffy
- 2002
Citation Context ...e to take part in the reranking. In recent years, reranking techniques have been successfully applied to the so-called history-based models (Black et al., 1993), especially to parsing (Collins, 2000; Collins and Duffy, 2002). In a history-based model, the current decision depends on the decisions made previously. Therefore, we may regard parsing as a special form of sequential model without losing generality. Collins (2...

194 | Chunking with Support Vector Machines
- Kudo, Matsumoto
- 2001
Citation Context ...on of the margin. In ordinal regression, the margin is min |f(ri) − f(ri−1)|, where f is the regression function for ordinal values. In our algorithm, the margin is min |score(xi1) − score(xij)|. In (Kudo and Matsumoto, 2001), SVMs have been employed in the NP chunking task, a typical labeling problem. However, they have used a deterministic algorithm for decoding. In (Collins, 2000), two reranking algorithms were propos...

163 | Towards historybased grammars: Using richer models for probabilistic parsing
- Black, Jelinek, et al.
- 1992
Citation Context ... penalized due to the label bias problem) will have a chance to take part in the reranking. In recent years, reranking techniques have been successfully applied to the so-called history-based models (Black et al., 1993), especially to parsing (Collins, 2000; Collins and Duffy, 2002). In a history-based model, the current decision depends on the decisions made previously. Therefore, we may regard parsing as a specia...

156 | Beyond grammar: An experience-based theory of language
- Bod
- 1998
Citation Context ...e, 1999) for the parse reranking problem has been described. In that paper, the tree kernel (Collins and Duffy, 2001) has been used to efficiently count the number of common subtrees as described in (Bod, 1998). In this paper we will follow the reranking approach. We describe a novel SVM-based voting algorithm for reranking. It provides an alternative way of using a large margin classifier for sequential m...

99 | Learning algorithms with optimal stability in neural networks
- Krauth, Mézard
- 1987
Citation Context ...s existing in the training data, but these algorithm are not supposed to maximize margins. Variants of the Perceptron algorithm, which are known as Approximate Maximal Margin classifier, such as PAM (Krauth and Mezard, 1987), ALMA (Gentile, 2001) and PAUM (Li et al., 2002), produce decision hyperplanes within ratio of the maximal margin. However, almost all these algorithms are reported to be inferior to SVMs in accurac...

61 | The perceptron algorithm with uneven margins
- Li, Zaragoza, et al.
- 2002
Citation Context ...not supposed to maximize margins. Variants of the Perceptron algorithm, which are known as Approximate Maximal Margin classifier, such as PAM (Krauth and Mezard, 1987), ALMA (Gentile, 2001) and PAUM (Li et al., 2002), produce decision hyperplanes within ratio of the maximal margin. However, almost all these algorithms are reported to be inferior to SVMs in accuracy, while more efficient in training. Furthermore,...

20 | Framewise phone classification using support vector machines
- Salomon, King, et al.
- 2002
Citation Context ...for SVM-light is roughly O(n^2.1) (Joachims, 1998), where n is the number of the training samples. One solution to the scaling difficulties is to use the Kernel Fisher Discriminant as described in (Salomon et al., 2002). In this paper, we divide training data into slices to speed up training. Each slice contains two pairs of parses from each sentence. Specifically, slice i contains positive samples ((p̃k, pki), +1...

9 | Introduction to large margin classifiers
- Smola, Bartlett, et al.
- 2000
Citation Context ...ral Risk Minimization (SRM) under Probably Approximately Correct (PAC) framework; test error is related to training data error, number of training samples and the capacity of the learning machine (Smola et al., 2000). Vapnik-Chervonenkis (VC) dimension (Vapnik, 1999), as well as some other measures, is used to estimate the complexity of the hypothesis space, or the capacity of the learning machine. The drawback ...

6 | Large margin rank boundaries for ordinal regression |

2 | Structural risk minimization over data-dependent hierarchies - Williamson, Shawe-Taylor, et al. - 1998 |

1 | Support vector machines for parse selection
- Dijkstra
- 2001
Citation Context ...ll the parses for the ith sentence. We may take xi1 as positive samples, and xij (j > 1) as negative samples. However, experiments have shown that this is not the best way to utilize SVMs in reranking (Dijkstra, 2001). A trick to be used here is to take a pair of parses as a sample: for any i and j > 1, (xi1, xij) is a positive sample, and (xij, xi1) is a negative sample. Similar idea was employed in the early wo...
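The pairing trick described in this excerpt can be sketched as follows; the function name and the use of plain objects as stand-ins for parse feature vectors are hypothetical:

```python
# Sketch of the pairing trick from the excerpt above: for a sentence
# with best parse x_i1 and alternatives x_ij (j > 1), each ordered
# pair of parses becomes one training sample for the SVM.
# Names are illustrative; the paper pairs feature vectors of parses.
def make_pair_samples(best, alternatives):
    """Return ((x_i1, x_ij), +1) and ((x_ij, x_i1), -1) for each alternative."""
    samples = []
    for alt in alternatives:
        samples.append(((best, alt), +1))  # correct ordering: positive sample
        samples.append(((alt, best), -1))  # swapped ordering: negative sample
    return samples
```

Training on both orderings keeps the pairwise classifier anti-symmetric, which is what allows its sign to be read as a preference between two parses at reranking time.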

1 | From margin to sparsity |