Master of Logic Thesis; Institute for Logic, Language and Computation
Data-Oriented Parsing (DOP) is a probabilistic performance approach to parsing natural language. Since Scha (1990) introduced it, several DOP models have been proposed, with promising results. One important feature of these models is the probability estimation procedure. Two major estimators have been put forward: Bod (1993) uses a relative frequency estimator, and Bonnema (1999) adds a rescaling factor to correct for tree size effects. Both estimators, however, exhibit biases. Moreover, Bod's estimator has been shown to be inconsistent (Johnson, 2002), meaning that the model's probability estimates do not approach the true probabilities that generated the data as the sample size grows. In this thesis, we implement a new estimation procedure that addresses the shortcomings of the two previous methods. The main idea is to treat derivation events not as disjoint, but as interrelated in a hierarchical cascade of parse tree derivations. We show that this new estimator, called the Back-Off DOP (BO-DOP) estimator, outperforms both previous estimators. We tested it on the OVIS treebank, which comes from a Dutch-language, speech-based system, and report error reductions of up to 11.4% and 15% compared to Bod's and Bonnema's estimators, respectively.