## Comparison of Evaluation Metrics for a Broad Coverage Parser

LREC Workshop: Beyond PARSEVAL Towards Improved Evaluation Measures for Parsing Systems (2002)

Citations: 38 (5 self)

### BibTeX

@MISC{Crouch02comparisonof,

author = {Richard Crouch and Ronald M. Kaplan and Tracy H. King and Stefan Riezler},

title = {Comparison of Evaluation Metrics for a Broad Coverage Parser LREC Workshop: Beyond PARSEVAL Towards Improved Evaluation Measures for Parsing Systems},

year = {2002}

}

### Abstract

This paper reports on the use of two distinct evaluation metrics for assessing a stochastic parsing model consisting of a broad-coverage Lexical-Functional Grammar (LFG), an efficient constraint-based parser, and a stochastic disambiguation model. The first evaluation metric measures matches of predicate-argument relations in LFG f-structures (henceforth the LFG annotation scheme) to a gold standard of manually annotated f-structures for a subset of the UPenn Wall Street Journal (WSJ) treebank. The other metric maps predicate-argument relations in LFG f-structures to dependency relations (henceforth DR annotations) as proposed by Carroll et al. (1999). For evaluation, these relations are matched against Carroll et al.’s gold standard, which was manually annotated on a subset of the Brown corpus. The parser plus stochastic disambiguator gives an F-measure of 79% (LFG) or 73% (DR) on the WSJ test set. This shows that the two evaluation schemes are similar in spirit, although accuracy is impaired systematically by mapping one annotation scheme to the other. A systematic loss of accuracy is also incurred by corpus variation: training the stochastic disambiguation model on WSJ data and testing on Carroll et al.’s Brown corpus data yields an F-score of 74% (DR) for dependency-relation match. A variant of this measure comparable to the measure reported by Carroll et al. yields an F-measure of 76%. We examine divergences between annotation schemes, aiming at a future improvement of methods for assessing parser quality.
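The relation-matching scores reported in the abstract can be illustrated with a minimal sketch. It assumes the balanced (harmonic-mean) F-measure and a hypothetical encoding of each relation as a `(relation type, head, dependent)` triple; the function name `prf` and the example triples are illustrative only and are not taken from the paper or from either gold standard.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical encoding: (relation type, head, dependent).
Relation = Tuple[str, str, str]


def prf(predicted: Iterable[Relation], gold: Iterable[Relation]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for multiset relation matching."""
    pred_counts = Counter(predicted)
    gold_counts = Counter(gold)
    # Multiset intersection: a relation counts as matched at most as often
    # as it occurs in both the parser output and the gold standard.
    matched = sum((pred_counts & gold_counts).values())
    n_pred = sum(pred_counts.values())
    n_gold = sum(gold_counts.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    # Balanced F-measure (an assumption; other weightings are possible).
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f


# Made-up relations for a sentence such as "Mary saw the dog today".
gold = [("subj", "see", "Mary"), ("obj", "see", "dog"), ("detmod", "dog", "the")]
predicted = [
    ("subj", "see", "Mary"),
    ("obj", "see", "dog"),
    ("ncmod", "dog", "the"),    # wrong relation type, so no match
    ("ncmod", "see", "today"),  # not in the gold standard
]
p, r, f = prf(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Under this scheme, a relation that is correct in head and dependent but mislabeled after mapping between annotation schemes counts against both precision and recall, which is one way a systematic loss from scheme conversion could arise.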

### Citations

2107 | Numerical Recipes in C: The Art of Scientific Computing - Press, Flannery, et al. - 1992 |

301 | The Penn Treebank: Annotating Predicate Argument Structure - Marcus, Kim, et al. - 1994 |
Citation Context: ...n reasonably be argued that the standard evaluation procedure for stochastic parsing, precision and recall of matching labeled bracketing to section 23 of the UPenn Wall Street Journal (WSJ) treebank (Marcus et al., 1994), is not appropriate for assessing the quality of parsers on matching predicate-argument relations. A new standard for evaluation on predicate-argument relations and for annotating a gold standard is n...

277 | Inside-Outside Reestimation from Partially Bracketed Corpora - Pereira, Schabes - 1992 |

273 | Convolution kernels for natural language - Collins, Duffy - 2002 |
Citation Context: ...e estimation techniques have recently received great attention in the statistical machine learning community and have already been applied to statistical parsing (Johnson et al., 1999; Collins, 2000; Collins and Duffy, 2001). In discriminative estimation, only the conditional relation of an analysis given an example is considered relevant, whereas in maximum likelihood estimation the joint probability of the training da...

150 | Estimators for Stochastic “Unification-Based” Grammars - Johnson, Geman, et al. - 1999 |
Citation Context: ...mployed for stochastic disambiguation is the well-known family of exponential models. These models have already been applied successfully for disambiguation of various constraint-based grammars (LFG (Johnson et al., 1999), HPSG (Bouma et al., 2000), DCG (Osborne, 2000)). In this paper we are concerned with conditional exponential models of the form $p_\lambda(x \mid y) = Z_\lambda(y)^{-1} e^{\lambda \cdot f(x)}$, where $X(y)$ is the set of parses for sentence $y$, $Z_\lambda(y) = \sum_{x \in X(y)} e^{\lambda \cdot f(x)}$ is a normalizing constant,...

101 | Corpus variation and parser performance - Gildea |

98 | A Grammar Writer’s Cookbook - Butt, King, et al. - 1999 |
Citation Context: ...sures and corpora are described in section 4. 2. Robust Parsing using LFG 2.1. A Broad-Coverage Lexical-Functional Grammar The grammar used for this project has been developed in the ParGram project (Butt et al., 1999). It uses LFG as a formalism, producing c(onstituent)-structures (trees) and f(unctional)-structures (attribute value matrices) as output. The c-structures encode constituency. Each c-structure has a...

92 | The interface between phrasal and functional constraints - Maxwell, Kaplan - 1994 |
Citation Context: ...ode constituency. Each c-structure has at least one corresponding f-structure. F-structures encode predicate-argument relations and other grammatical information, e.g., number, tense. The XLE parser (Maxwell and Kaplan, 1993) was used to produce packed representations, specifying all possible grammar analyses of the input. The grammar has 314 rules with regular expression right-hand sides which compile into a collection ...

57 | Corpus annotation for parser evaluation - Carroll, Minnen, et al. - 1999 |
Citation Context: ...et of the UPenn Wall Street Journal treebank. The other metric maps predicate-argument relations in LFG f-structures to dependency relations (henceforth DR annotations) as proposed by Carroll et al. (Carroll et al., 1999). For evaluation, these relations are matched against Carroll et al.’s gold standard which was manually annotated on a subset of the Brown corpus. The parser plus stochastic disambiguator gives an F...

56 | Maximum conditional likelihood via bound maximization and the CEM algorithm - Jebara, Pentland - 1998 |
Citation Context: ...variant. Furthermore, only sentences which received at most 1,000 parses were taken under consideration. From this se... (Footnote 2: An alternative numerical method would be a combination of iterative scaling techniques with a conditional EM algorithm (Jebara and Pentland, 1998). However, it has been shown experimentally that conjugate gradient techniques can outperform iterative scaling techniques by far in running time (Minka, 2001).)

22 | Wide-coverage computational analysis of Dutch - Alpino - 2000 |