## Filtering-ranking perceptron learning for partial parsing (2004)

Venue: Machine Learning

Citations: 16 (6 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Carreras04filtering-rankingperceptron,
  author    = {Xavier Carreras and Lluís Màrquez and Jorge Castro},
  title     = {Filtering-ranking perceptron learning for partial parsing},
  booktitle = {MACHINE LEARNING},
  year      = {2004},
  pages     = {1--3}
}
```

### Abstract

This work introduces a phrase recognition system based on perceptrons, and a global online learning algorithm that trains them together. The method applies to complex domains in which some structure has to be recognized. This global problem is broken down into two layers of local subproblems: a filtering layer, which reduces the search space by identifying plausible phrase candidates, and a ranking layer, which discriminatively builds the optimal phrase structure. A recognition-based feedback rule is presented which communicates to each local function the errors it commits from a global point of view, and allows training them together online as perceptrons. As a result, the learned functions automatically behave as filters and rankers, rather than binary classifiers, which we argue to be better suited to this type of problem. Extensive experimentation on partial parsing tasks gives state-of-the-art results and evinces the advantages of the global training method over optimizing each function locally, as in the traditional approach.
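The two-layer scheme and its global, mistake-driven feedback can be illustrated with a deliberately simplified sketch. The feature map, greedy decoder, and update rule below are toy stand-ins chosen for brevity, not the authors' actual filtering functions or inference procedure:

```python
# Toy sketch of global online learning for phrase recognition: a decoder
# proposes a phrase structure, and feedback promotes missed gold phrases
# and demotes wrongly predicted ones (hypothetical simplification).
from collections import defaultdict

def phi(words, i, j):
    """Toy feature map for a candidate phrase spanning words[i:j+1]."""
    return {f"first={words[i]}": 1.0, f"last={words[j]}": 1.0,
            f"len={j - i + 1}": 1.0}

def score(w, feats):
    return sum(w[f] * v for f, v in feats.items())

def decode(w, words, max_len=3):
    """Greedily pick the highest-scoring non-overlapping positive candidates."""
    cands = [(i, j) for i in range(len(words))
             for j in range(i, min(i + max_len, len(words)))]
    cands.sort(key=lambda s: score(w, phi(words, *s)), reverse=True)
    chosen, used = [], set()
    for (i, j) in cands:
        if score(w, phi(words, i, j)) <= 0:
            break
        if not used.intersection(range(i, j + 1)):
            chosen.append((i, j))
            used.update(range(i, j + 1))
    return sorted(chosen)

def train(data, epochs=10):
    """Global feedback: compare the decoded structure against the gold one."""
    w = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            pred = decode(w, words)
            for span in set(gold) - set(pred):   # missed phrases: promote
                for f, v in phi(words, *span).items():
                    w[f] += v
            for span in set(pred) - set(gold):   # over-predicted: demote
                for f, v in phi(words, *span).items():
                    w[f] -= v
    return w
```

The key property the abstract describes survives even in this sketch: updates are driven by errors of the *global* decoded structure, so each local scoring function is trained in the role it actually plays at inference time, rather than as an isolated binary classifier.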

### Citations

2331 | Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML
- Lafferty, McCallum, et al.
- 2001
Citation Context: ... In this way, Conditional Random Fields (CRF) provide techniques to estimate a single exponential model for the joint probability of the entire sequence of labels given the sequence of observations (Lafferty, McCallum, & Pereira, 2001). More recently, Dynamic CRFs represent a generalization of linear-chain CRFs, which are able to represent several subtasks jointly in a single graphical model and to capture interactions between the ...

674 | Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning 2:285–318
- Littlestone
- 1988
Citation Context: ...he FR-Perceptron architecture with other online algorithms not suitable for kernels, which might report better results. We have in mind to replace Perceptron by some variants of the Winnow algorithm (Littlestone, 1988), an online mistake-driven algorithm with multiplicative updates. Finally, we would like to move to other complex Natural Language Processing tasks which involve intermediate subtasks, so as to apply ...

491 | Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms
- Collins
- 2002
Citation Context: ...ector in the same dimensionality produces a prediction score for that pair. Then, the learning problem consists of estimating the weight vector so that the correct solution is ranked the highest. In (Collins, 2002), a perceptron algorithm is presented for tagging problems, working with representations for which there exists a decoding algorithm that, given the weight vector, picks the top-ranked solution for a ...

446 | Shallow parsing with conditional random fields
- Sha, Pereira
- 2003
Citation Context: ... Essentially, we reproduce feature spaces similar to those of other relevant works for these tasks, which reported state-of-the-art results. For Chunking, we consider (Kudo & Matsumoto, 2001; Collins, 2002; Sha & Pereira, 2003), and for Clause Identification (Carreras et al., 2002). First, we define a set of primitive functions which apply to words or sequences of words, and will be used to define the representations: − Wo ...

438 | Max-margin Markov networks
- Taskar, Guestrin, et al.
- 2003
Citation Context: ...here have been also attempts to maximize the global margins in a training sample, such as Hidden Markov Support Vector Machines (Altun, Tsochantaridis, & Hofmann, 2003) or Max-Margin Markov Networks (Taskar, Guestrin, & Koller, 2003). Regarding the pure probabilistic methods there has been also interest on training globally avoiding to some extent the locality of conditional models. In this way, Conditional Random Fields (CRF) p...

412 | Large margin classification using the perceptron algorithm
- Freund, Schapire
- 1999
Citation Context: ...ly makes use of the last w_J vector). Moreover, we work with the dual formulation of the model, which allows the use of kernel functions, and thus learning non-linear separators. It is also shown in (Freund & Schapire, 1999) that a vector w can be expressed as the sum of instances x_j that were added (s_j = +1) or subtracted (s_j = −1) in order to create it, as w = ∑_{j=1}^{J} s_j x_j. Given a kernel function K(x, x′), th...
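The dual representation described in this excerpt, w = ∑_j s_j x_j with predictions computed through kernel evaluations only, can be sketched as follows. This is a generic kernel perceptron in the spirit of Freund & Schapire (1999), not the paper's FR-Perceptron:

```python
# Dual-form perceptron: the weight vector is stored implicitly as signed
# support instances (s_j, x_j), so the predictor only ever evaluates the
# kernel K and never builds explicit feature vectors.
def kernel_perceptron_train(X, y, K, epochs=10):
    support = []  # pairs (s_j, x_j) with s_j = +1 (added) or -1 (subtracted)
    for _ in range(epochs):
        for x, label in zip(X, y):
            score = sum(s * K(xj, x) for s, xj in support)
            if (1 if score > 0 else -1) != label:  # mistake-driven update
                support.append((label, x))
    return support

def kernel_perceptron_predict(support, K, x):
    return 1 if sum(s * K(xj, x) for s, xj in support) > 0 else -1

# A degree-2 polynomial kernel makes XOR linearly separable in the
# implicit feature space, which no linear kernel can achieve.
poly2 = lambda a, b: (1 + sum(ai * bi for ai, bi in zip(a, b))) ** 2
```

With `poly2`, the four XOR points ((0,0) and (1,1) negative, (0,1) and (1,0) positive) are learned exactly, illustrating why the dual formulation matters: the non-linear separator is obtained without ever materializing the expanded feature space.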

408 | Text chunking using transformation-based learning - Ramshaw, Marcus - 1995

345 | Parsing By Chunks
- Abney
- 1991
Citation Context: ...em into a number of local learnable subproblems. Then, a decoder algorithm infers the global solution from the outcomes of the local subproblems. The basic task in partial parsing, known as Chunking (Abney, 1991; Tjong Kim Sang & Buchholz, 2000), consists of finding non-recursive base phrases (or chunks) organized in sequence. A typical approach is to perform a tagging. In this case, local subproblems includ...

269 | Discriminative reranking for natural language parsing - Collins - 2000

248 | Ultraconservative online algorithms for multiclass problems - Crammer, Singer - 2003

189 | Hidden Markov support vector machines
- Altun, Tsochantaridis, et al.
- 2003
Citation Context: ...p-ranked solutions. Thus, rather than simply making margins positive, there have been also attempts to maximize the global margins in a training sample, such as Hidden Markov Support Vector Machines (Altun, Tsochantaridis, & Hofmann, 2003) or Max-Margin Markov Networks (Taskar, Guestrin, & Koller, 2003). Regarding the pure probabilistic methods there has been also interest on training globally avoiding to some extent the locality of c...

182 | Chunking with support vector machines
- Kudo, Matsumoto
- 2001
Citation Context: ...rase, etc.), and the inference process consists of sequentially computing the optimal tag sequence which encodes the phrases, by means of dynamic programming (Punyakanok & Roth, 2001) or beam search (Kudo & Matsumoto, 2001). When hierarchical structure has to be recognized, additional local decisions are required to determine the embedding of phrases, resulting in a more complex inference process which recursively buil...
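The tagging-based inference this excerpt refers to, dynamic programming over a phrase-encoding tag sequence, can be sketched with a small Viterbi decoder over B/I/O chunk tags. The emission scores and transition penalties below are toy inputs, not those of the cited systems:

```python
# Viterbi decoding over B(egin)/I(nside)/O(utside) chunk tags: dynamic
# programming finds the highest-scoring tag sequence, with transition
# scores used e.g. to penalize the illegal pair O -> I.
def viterbi(emissions, transitions, tags=("B", "I", "O")):
    """emissions: per-word dicts {tag: score}; transitions: {(prev, cur): score}."""
    # chart[i][t] = (best score of a sequence ending in tag t at word i, backpointer)
    chart = [{t: (emissions[0].get(t, 0.0), None) for t in tags}]
    for i in range(1, len(emissions)):
        col = {}
        for t in tags:
            prev = max(tags, key=lambda p: chart[i - 1][p][0]
                       + transitions.get((p, t), 0.0))
            col[t] = (chart[i - 1][prev][0] + transitions.get((prev, t), 0.0)
                      + emissions[i].get(t, 0.0), prev)
        chart.append(col)
    # backtrack from the best final tag
    tag = max(tags, key=lambda t: chart[-1][t][0])
    path = [tag]
    for i in range(len(emissions) - 1, 0, -1):
        tag = chart[i][tag][1]
        path.append(tag)
    return path[::-1]
```

Phrases are then read off the tag sequence: each maximal run `B I ... I` is one chunk. Beam search, the alternative the excerpt mentions, replaces the exact per-tag maximization with a pruned set of partial hypotheses.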

163 | Learning to parse natural language with maximum entropy models
- Ratnaparkhi
- 1999
Citation Context: ... of systems can be found for deeper levels of partial parsing (Tjong Kim Sang & Déjean, 2001; Carreras, Màrquez, Punyakanok, & Roth, 2002; Kudo & Matsumoto, 2002), or in full parsing (Magerman, 1996; Ratnaparkhi, 1999; Haruno, Shirai, Ooyama, & Aizawa, 1999). In general, a learning system for these tasks makes use of several learned functions which interact in some way to determine the final structure. A usual met...

134 | Introduction to the CoNLL-2000 Shared Task: Chunking - Sang, Buchholz - 2000

131 | On convergence proofs on perceptrons
- Novikoff
- 1962
Citation Context: ... a training set {(x_i, y_i)}_{i=1}^m satisfying Definition 2 with margin γ. Then the number of se-lf stages of FR-Perceptron is at most 2R²_SE/γ². Proof. It follows immediately from Novikoff's proof (Novikoff, 1962) for the standard perceptron algorithm. ✷ Now, to conclude the convergence of the algorithm FR-Perceptron we must demonstrate that after a se-lf stage there is room only for a finite number of consec...
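For context, the standard perceptron mistake bound from Novikoff (1962), which the excerpt's proof adapts, can be stated as follows (this is the textbook result, not a quotation from the paper):

```latex
% Novikoff (1962): bounded, linearly separable data implies a finite
% number of perceptron mistakes, independent of the number of examples.
\|x_i\| \le R
\quad\text{and}\quad
\exists\, \mathbf{u},\ \|\mathbf{u}\| = 1:\ \
y_i \langle \mathbf{u}, x_i \rangle \ge \gamma \ \ \forall i
\;\Longrightarrow\;
\#\{\text{mistakes}\} \;\le\; \left(\frac{R}{\gamma}\right)^{2}.
```

The excerpt's bound of 2R²_SE/γ² has the same R²/γ² shape, with R_SE playing the role of the radius for the paper's start-end representation.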

122 | Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data
- Sutton, Rohanimanesh, et al.
- 2004
Citation Context: ...More recently, Dynamic CRFs represent a generalization of linear-chain CRFs, which are able to represent several subtasks jointly in a single graphical model and to capture interactions between them (Sutton, Rohanimanesh, & McCallum, 2004). In this paper we introduce a learning architecture based on filters and rankers for the general task of recognizing phrases in a sentence. The strategy for recognizing phrases is adopted from (Carr...

90 | The use of classifiers in sequential inference
- Punyakanok, Roth
- 2001
Citation Context: ...phrase of some type (noun phrase, verb phrase, etc.), and the inference process consists of sequentially computing the optimal tag sequence which encodes the phrases, by means of dynamic programming (Punyakanok & Roth, 2001) or beam search (Kudo & Matsumoto, 2001). When hierarchical structure has to be recognized, additional local decisions are required to determine the embedding of phrases, resulting in a more complex ...

89 | Use of support vector learning for chunk identification
- Kudoh, Matsumoto
- 2000
Citation Context: ...7 94.17 (Kudo & Matsumoto, 2001) SVM voting 93.89 93.92 93.91 (Kudo & Matsumoto, 2001) SVM single 93.95 93.75 93.85 FRP-Chunker F&R VP 94.20 93.38 93.79 (Zhang et al., 2002) Winnow 93.54 93.60 93.57 (Kudo & Matsumoto, 2000) SVM 93.45 93.51 93.48 (van Halteren, 2000) MBL&WPD voting 93.13 93.51 93.32 (Tjong Kim Sang, 2000) MBL voting 94.04 91.00 92.50 performing a pairwise multiclass prediction. They report a performance...

67 | Kernel Methods for Relational Learning
- Cumby, Roth
- 2003
Citation Context: ...e dealing with. As an alternative of working in implicit feature spaces via kernel functions, we would like to work also with explicitly propositionalized rich feature spaces, as the ones defined in (Cumby & Roth, 2003). This approach would open the avenue for using the FR-Perceptron architecture with other online algorithms not suitable for kernels, which might report better results. We have in mind to replace Per...

61 | A Family of Additive Online Algorithms for Category Ranking - Crammer, Singer - 2003

53 | Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods - Collins - 2004

48 | Constraint classification for multiclass classification and ranking
- Har-Peled, Roth, et al.
- 2003
Citation Context: ...l solution of an example. For instance, in the multiclass setting a certain prototype is demoted only if its prediction score is higher than the score of the true-label's prototype. Also related, in (Har-Peled, Roth, & Zimak, 2002) a generalized constraint classification setting is proposed, which allows to model, in such a conservative way, multiclass and multilabel classification problems, ...

44 | Text chunking based on a generalization of winnow
- Zhang, Damerau, et al.
- 2002
Citation Context: ... Table 6. Comparison on the Chunking task, with the top-performing systems published so far on the test set. Reference Technique Precision Recall F1 (Zhang, Damerau, & Johnson, 2002) Winnow (+) 94.28 94.07 94.17 (Kudo & Matsumoto, 2001) SVM voting 93.89 93.92 93.91 (Kudo & Matsumoto, 2001) SVM single 93.95 93.75 93.85 FRP-Chunker F&R VP 94.17 93.26 93.71 (Zhang, Damerau & John...

32 | Using decision trees to construct a practical parser
- Haruno, Shirai, et al.
- 1999
Citation Context: ...found for deeper levels of partial parsing (Tjong Kim Sang & Déjean, 2001; Carreras, Màrquez, Punyakanok, & Roth, 2002; Kudo & Matsumoto, 2002), or in full parsing (Magerman, 1996; Ratnaparkhi, 1999; Haruno, Shirai, Ooyama, & Aizawa, 1999). In general, a learning system for these tasks makes use of several learned functions which interact in some way to determine the final structure. A usual methodology for solving the local subproble...

31 | Phrase recognition by filtering and ranking with perceptrons - Carreras, Màrquez - 2003

17 | Boosting Trees for Clause Splitting
- Carreras, Marquez
- 2001
Citation Context: ... the top-performing systems published so far on the test set. Reference Technique Precision Recall F1 FR-Perceptron F&R VP 88.17 82.10 85.03 (Carreras et al., 2002) AdaBoost class. 90.18 78.11 83.71 (Carreras & Màrquez, 2001) AdaBoost class. 84.82 78.85 81.73 (Molina & Pla, 2001) HMM 70.85 70.51 70.68 each prediction; and, (c) batch, with the SVM binary classification algorithm. Note that in this setting, where start-end...

14 | Chunking with WPDV models - Halteren

13 | Learning and inference for clause identification - Carreras, Màrquez, et al. - 2002
- Carreras, Marquez, et al.
Citation Context: ...ses, resulting in a more complex inference process which recursively builds the global solution. Such type of systems can be found for deeper levels of partial parsing (Tjong Kim Sang & Déjean, 2001; Carreras, Màrquez, Punyakanok, & Roth, 2002; Kudo & Matsumoto, 2002), or in full parsing (Magerman, 1996; Ratnaparkhi, 1999; Haruno, Shirai, Ooyama, & Aizawa, 1999). In general, a learning system for these tasks makes use of several learned fu...

13 | Introduction to the CoNLL-2001 shared task: Clause identification - Sang, F, et al. - 2001

12 | Japanese dependency analysis using cascaded chunking
- Kudo, Matsumoto
- 2002
Citation Context: ...ocess which recursively builds the global solution. Such type of systems can be found for deeper levels of partial parsing (Tjong Kim Sang & Déjean, 2001; Carreras, Màrquez, Punyakanok, & Roth, 2002; Kudo & Matsumoto, 2002), or in full parsing (Magerman, 1996; Ratnaparkhi, 1999; Haruno, Shirai, Ooyama, & Aizawa, 1999). In general, a learning system for these tasks makes use of several learned functions which interact i...

10 | Online learning via global feedback for phrase recognition
- Carreras, Màrquez
- 2004
Citation Context: ...approximated to behave as word filters and phrase rankers, and thus, become adapted to the recognition strategy. This paper extends and gives a comprehensive empirical study of two preliminary works (Carreras & Màrquez, 2003a, 2003b). Regarding the analysis of the algorithm a convergence proof is presented. Regarding the empirical study, we provide extensive experimentation on two relevant problems of the Partial Parsing...

10 | Learning grammatical structure using statistical decision-trees
- Magerman
- 1996
Citation Context: ...ution. Such type of systems can be found for deeper levels of partial parsing (Tjong Kim Sang & Déjean, 2001; Carreras, Màrquez, Punyakanok, & Roth, 2002; Kudo & Matsumoto, 2002), or in full parsing (Magerman, 1996; Ratnaparkhi, 1999; Haruno, Shirai, Ooyama, & Aizawa, 1999). In general, a learning system for these tasks makes use of several learned functions which interact in some way to determine the final str...

10 | Clause Detection using HMM
- Molina, Pla
- 2001
Citation Context: ... Reference Technique Precision Recall F1 FR-Perceptron F&R VP 88.17 82.10 85.03 (Carreras et al., 2002) AdaBoost class. 90.18 78.11 83.71 (Carreras & Màrquez, 2001) AdaBoost class. 84.82 78.85 81.73 (Molina & Pla, 2001) HMM 70.85 70.51 70.68 This fact indicates that the score function can be straightforwardly learned, at the cost of working in a computationally very expensive search space. However, the combination ...

7 | Text chunking by system combination - Sang - 2000

2 | Inference with classifiers: The phrase identification problem - Punyakanok, Roth - 2004

1 | Springer-Verlag Lecture Notes Series in Artificial Intelligence 1147
- Molina, Pla
- 2001
Citation Context: ... Reference Technique Precision Recall F1 FR-Perceptron F&R VP 88.17 82.10 85.03 (Carreras et al., 2002) AdaBoost class. 90.18 78.11 83.71 (Carreras & Màrquez, 2001) AdaBoost class. 84.82 78.85 81.73 (Molina & Pla, 2001) HMM 70.85 70.51 70.68 each prediction; and, (c) batch, with the SVM binary classification algorithm. Note that in this setting, where start-end functions are fixed, the only advantage of training a ...