## A distributional analysis of a lexicalized statistical parsing model (2004)

Venue: EMNLP

Citations: 20 (0 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Bikel04adistributional,
  author    = {Daniel M. Bikel},
  title     = {A distributional analysis of a lexicalized statistical parsing model},
  booktitle = {Proceedings of EMNLP},
  year      = {2004},
  pages     = {182--189}
}
```

### Abstract

This paper presents some of the first data visualizations and analyses of distributions for a lexicalized statistical parsing model, in order to better understand their nature. In the course of this analysis, we have paid particular attention to parameters that include bilexical dependencies. The prevailing view has been that such statistics are very informative but suffer greatly from sparse-data problems. By using a parser to constrain-parse its own output, and by hypothesizing and testing for distributional similarity with back-off distributions, we present evidence that (a) bilexical statistics are actually used quite often, but that (b) the distributions are so similar to those that do not include head words as to be nearly indistinguishable insofar as parse decisions are concerned. Finally, our analysis provides, for the first time, an effective way to do parameter selection for a generative lexicalized statistical parsing model.

### Citations

8635 | Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context: ...onal similarity, as explored by (Lee, 1999), is the Jensen-Shannon divergence (Lin, 1991): JS(p ∥ q) = (1/2)[D(p ∥ avg_{p,q}) + D(q ∥ avg_{p,q})] (2), where D is the Kullback-Leibler divergence (Cover and Thomas, 1991) and where avg_{p,q} = (1/2)(p(A) + q(A)) for an event A in the event space of at least one of the two distributions. One interpretation for the Jensen-Shannon divergence due to Slonim et al. (2002) is ...
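The Jensen-Shannon divergence quoted in the context above can be sketched in a few lines. This is a minimal illustration of the formula as stated (mean KL divergence from the average distribution, base-2 logarithm), not the paper's implementation; the example distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions
    given as dicts mapping events to probabilities."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in p if p[a] > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: the mean KL divergence of p and q
    from their pointwise average distribution avg_{p,q}."""
    events = set(p) | set(q)
    avg = {a: 0.5 * (p.get(a, 0.0) + q.get(a, 0.0)) for a in events}
    return 0.5 * (kl_divergence(p, avg) + kl_divergence(q, avg))

# Identical distributions have zero divergence (hypothetical example data).
p = {"NP": 0.5, "VP": 0.3, "PP": 0.2}
print(jensen_shannon(p, p))  # -> 0.0
```

Unlike the KL divergence, this quantity is symmetric in p and q and, with base-2 logarithms, bounded between 0 and 1, which is what makes it convenient for comparing back-off distributions.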

2124 | Building a large annotated corpus of English: The Penn Treebank
- Marcus, Santorini, et al.
- 1993
Citation Context: ...a Collins-style model as the basis for analysis, throughout this paper, we will attempt to present information that is widely applicable because it pertains to properties of the widely-used Treebank (Marcus et al., 1993) and lexicalized parsing models in general. This work also sheds light on the much-discussed “bilexical dependencies” of statistical parsing models. Beginning with the seminal work at IBM (Black et a...

959 | Head-Driven Statistical Models for Natural Language Parsing
- Collins
- 1999
Citation Context: ...een to compare a model’s overall parsing performance with and without a feature. Often, it has seemed that features that are derived from linguistic principles result in higher-performing models (cf. (Collins, 1999)). While this may be true, it is clearly inappropriate to highlight ex post facto the linguistically-motivated features and rationalize their inclusion and state how effective they are. A rigorous an...

829 | A maximum-entropy-inspired parser - Charniak - 2000

417 | Divergence Measures Based on the Shannon Entropy
- Lin
- 1991
Citation Context: ...ke almost no difference in terms of parse accuracy. 5.1 Distributional similarity A useful metric for measuring distributional similarity, as explored by (Lee, 1999), is the Jensen-Shannon divergence (Lin, 1991): JS(p ∥ q) = (1/2)[D(p ∥ avg_{p,q}) + D(q ∥ avg_{p,q})] (2), where D is the Kullback-Leibler divergence (Cover and Thomas, 1991) and where avg_{p,q} = (1/2)(p(A) + q(A)) for an event A in the eve...

255 | Three new probabilistic models for dependency parsing: An exploration
- Eisner
- 1996
Citation Context: ... dependencies” of statistical parsing models. Beginning with the seminal work at IBM (Black et al., 1991; Black et al., 1992b; Black et al., 1992a), and continuing with such lexicalist approaches as (Eisner, 1996), these features have been lauded for their ability to approximate a word’s semantics as a means to override syntactic preferences with semantic ones (Collins, 1999; Eisner, 2000). However, the work ...

232 | Measures of distributional similarity
- Lee
- 1999
Citation Context: ...sus those that do not are so similar as to make almost no difference in terms of parse accuracy. 5.1 Distributional similarity A useful metric for measuring distributional similarity, as explored by (Lee, 1999), is the Jensen-Shannon divergence (Lin, 1991): JS(p ∥ q) = (1/2)[D(p ∥ avg_{p,q}) + D(q ∥ avg_{p,q})] (2), where D is the Kullback-Leibler divergence (Cover and Thomas, 1991) and where avg_{p,q} ...

230 | A procedure for quantitatively comparing the syntactic coverage of English grammars
- Black
- 1991
Citation Context: ... al., 1993) and lexicalized parsing models in general. This work also sheds light on the much-discussed “bilexical dependencies” of statistical parsing models. Beginning with the seminal work at IBM (Black et al., 1991; Black et al., 1992b; Black et al., 1992a), and continuing with such lexicalist approaches as (Eisner, 1996), these features have been lauded for their ability to approximate a word’s semantics as a ...

155 | Natural Language Parsing as Statistical Pattern Recognition - Magerman - 1994

110 | Intricacies of Collins’ parsing model - Bikel

95 | Corpus variation and parser performance - Gildea - 2001

54 | Development and evaluation of a broad-coverage probabilistic grammar of English-language computer manuals
- Black, Lafferty, et al.
- 1992
Citation Context: ...calized parsing models in general. This work also sheds light on the much-discussed “bilexical dependencies” of statistical parsing models. Beginning with the seminal work at IBM (Black et al., 1991; Black et al., 1992b; Black et al., 1992a), and continuing with such lexicalist approaches as (Eisner, 1996), these features have been lauded for their ability to approximate a word’s semantics as a means to override sy...

52 | Conditional structure versus conditional estimation in NLP models
- Klein, D, et al.
- 2002
Citation Context: ...re (the extreme example is TO, a tag that can only be assigned to the word to). This is an example of the “label bias” problem, which has been the subject of recent discussion (Lafferty et al., 2001; Klein and Manning, 2002). Of course, just because there is “label bias” does not necessarily mean there is a problem. If the decoder pursues a theory to a nonterminal/part-of-speech tag preterminal that has an extremely low ...

51 | Bilexical grammars and their cubic-time parsing algorithms
- Eisner
- 2000
Citation Context: ...alist approaches as (Eisner, 1996), these features have been lauded for their ability to approximate a word’s semantics as a means to override syntactic preferences with semantic ones (Collins, 1999; Eisner, 2000). However, the work of Gildea (2001) showed that, with an approximate reimplementation of Collins’ Model 1, removing all parameters that involved dependencies between a modifier word and its head res...