## Fast Exact Inference with a Factored Model for Natural Language Parsing (2003)

Venue: Advances in Neural Information Processing Systems 15 (NIPS)

Citations: 220 (7 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Klein03fastexact,
  author    = {Dan Klein and Christopher D. Manning},
  title     = {Fast Exact Inference with a Factored Model for Natural Language Parsing},
  booktitle = {Advances in Neural Information Processing Systems 15 (NIPS)},
  year      = {2003},
  pages     = {3--10},
  publisher = {MIT Press}
}
```

### Abstract

We present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the component models, and a level of performance comparable to similar, non-factored models. Most importantly, unlike other modern parsing models, the factored model admits an extremely effective A* parsing algorithm, which enables efficient, exact inference.
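As a toy illustration of the factored scoring described above (the candidates and probabilities below are invented), each lexicalized analysis L = (T, D) is scored by the product P(T) * P(D), and the parser seeks the maximizing pair:

```python
import math

# Hypothetical candidate analyses for one sentence: each is a lexicalized
# tree L = (T, D), with log-probabilities from the two independent sub-models.
candidates = [
    {"name": "attach-to-verb", "log_p_tree": math.log(0.4), "log_p_dep": math.log(0.7)},
    {"name": "attach-to-noun", "log_p_tree": math.log(0.5), "log_p_dep": math.log(0.3)},
]

def factored_score(c):
    # score(L) = P(T) * P(D): a sum of log-probs. No renormalization is
    # needed when we only want the most likely parse.
    return c["log_p_tree"] + c["log_p_dep"]

best = max(candidates, key=factored_score)
print(best["name"])  # attach-to-verb: 0.4*0.7 = 0.28 > 0.5*0.3 = 0.15
```

Note that the syntactically preferred candidate (P(T) = 0.5) loses once the dependency model's preference is factored in, which is exactly the kind of interaction the factored model is meant to capture.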

### Citations

1005 | Head-driven Statistical Models for Natural Language Parsing
- Collins
- 1999
Citation Context: ...known to be very effective. Additionally, methods based only on key lexical dependencies have been shown to be very effective in choosing between valid syntactic forms [1]. Modern statistical parsers [2, 3] standardly use complex joint models over both category labels and lexical items, where "everything is conditioned on everything" to the extent possible within the limits of data sparseness and fin...

861 | A Maximum-Entropy-Inspired Parser
- Charniak
- 2000
Citation Context: ...age, contain more attachment ambiguity. The top F1 of 86.7% is greater than that of the lexicalized parsers presented in [15, 16], but less than that of the newer, more complex parsers presented in [3, 2], which reach as high as 90.1% F1. 7 A tree T is viewed as a set of constituents c(T). Constituents in the correct and the proposed tree must have the same start, end, and label to be considered id...

540 | Training products of experts by minimizing contrastive divergence
- Hinton
- 2002
Citation Context: ...gs. Therefore, the total mass assigned to valid structures will be less than one. We could imagine fixing this by renormalizing. For example, this situation fits into the product-of-experts framework [6], with one semantic expert and one syntactic expert that must agree on a single structure. However, since we are presently only interested in finding most-likely parses, no global renormalization cons...
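The snippet's point that renormalization is irrelevant for finding the most likely parse can be checked directly: dividing every product score by the same total mass Z never changes the argmax (the scores below are invented stand-ins for P(T) * P(D) products):

```python
# Invented product scores P(T) * P(D) for a few candidate parses. Dividing by
# the total mass Z on valid structures (global renormalization) rescales every
# score identically, so the ranking, and hence the best parse, is unchanged.
scores = {"parse_a": 0.28, "parse_b": 0.15, "parse_c": 0.07}
Z = sum(scores.values())  # total (unnormalized) mass
normalized = {k: v / Z for k, v in scores.items()}

best_raw = max(scores, key=scores.get)
best_norm = max(normalized, key=normalized.get)
print(best_raw, best_norm)  # parse_a parse_a
```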

451 | A New Statistical Parser Based on Bigram Lexical Dependencies
- Collins
- 1996
Citation Context: ...dition of the dependency model helps more for longer sentences, which, on average, contain more attachment ambiguity. The top F1 of 86.7% is greater than that of the lexicalized parsers presented in [15, 16], but less than that of the newer, more complex parsers presented in [3, 2], which reach as high as 90.1% F1. 7 A tree T is viewed as a set of constituents c(T). Constituents in the correct and th...

336 | Statistical Decision-Tree Modeling for Parsing
- Magerman
- 1995
Citation Context: ...dition of the dependency model helps more for longer sentences, which, on average, contain more attachment ambiguity. The top F1 of 86.7% is greater than that of the lexicalized parsers presented in [15, 16], but less than that of the newer, more complex parsers presented in [3, 2], which reach as high as 90.1% F1. 7 A tree T is viewed as a set of constituents c(T). Constituents in the correct and th...

301 | Structural ambiguity and lexical relations
- Hindle, Rooth
Citation Context: ...uments, lexical preferences are known to be very effective. Additionally, methods based only on key lexical dependencies have been shown to be very effective in choosing between valid syntactic forms [1]. Modern statistical parsers [2, 3] standardly use complex joint models over both category labels and lexical items, where "everything is conditioned on everything" to the extent possible within th...

280 | Discriminative Reranking for Natural Language Parsing
- Collins
- 2000
Citation Context: ...ct inference for standard parsers, a common strategy is to take a PCFG backbone, extract a set of top parses, either the top k or all parses within a score threshold of the top parse, and rerank them [3, 17]. This pruning is done for efficiency; the question is whether it is hurting accuracy. That is, would exact inference be preferable? Figure 5 shows the result of parsing with our combined model, using...
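The prune-then-rerank strategy described in the snippet can be sketched as follows; the function name, k, and threshold are illustrative, not taken from any cited parser:

```python
# Sketch of pruning before reranking: keep either the top k parses or all
# parses within a log-score threshold of the best one, then rerank only that
# subset. Exact inference would instead consider every parse.
def prune_candidates(parses, k=50, threshold=5.0):
    """parses: list of (log_score, parse) pairs from the PCFG backbone."""
    parses = sorted(parses, key=lambda p: p[0], reverse=True)
    best = parses[0][0]
    within = [p for p in parses if best - p[0] <= threshold]
    return within[:k]  # anything outside the beam can never be reranked up

candidates = [(-2.0, "t1"), (-2.5, "t2"), (-9.0, "t3")]
print(prune_candidates(candidates, k=2))  # [(-2.0, 't1'), (-2.5, 't2')]
```

If the truly best parse under the final model falls outside the threshold or the top k, this pruning discards it, which is the accuracy question the snippet raises.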

274 | Trainable grammars for speech recognition
- Baker
- 1979
Citation Context: ...l and includes e, the words before it, and the words after it, as depicted in figure 3. Outside scores are a Viterbi analog of the standard outside probabilities given by the inside-outside algorithm [11]. For the syntactic model, P(T), well-known cubic PCFG parsing algorithms are easily adapted to find outside scores. For the semantic model, P(D), there are several presentations of cubic dependency ...

238 | Tree-Bank Grammars
- Charniak
- 1996
Citation Context: ...llowing sub-models. For P(T), we used successively more accurate PCFGs. The simplest, PCFG-BASIC, used the raw treebank grammar, with nonterminals and rewrites taken directly from the training trees [7]. In this model, nodes rewrite atomically, in a top-down manner, in only the ways observed in the training data. For improved models of P(T), tree nodes' labels were annotated with various contextual...
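Reading a grammar directly off the training trees, PCFG-BASIC style, is relative-frequency estimation over observed rewrites. A minimal sketch, assuming trees encoded as nested tuples (not the paper's actual data structures):

```python
from collections import Counter

# Trees are nested tuples: (label, child, child, ...); leaves are word strings.
def count_rules(tree, counts):
    label, *children = tree
    if all(isinstance(c, str) for c in children):
        return  # preterminal -> word; lexical rules would be handled analogously
    rhs = tuple(c[0] for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        count_rules(c, counts)

def estimate_pcfg(trees):
    """P(lhs -> rhs) = count(lhs -> rhs) / count(lhs), over the training trees."""
    counts = Counter()
    for t in trees:
        count_rules(t, counts)
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

trees = [("S", ("NP", "dogs"), ("VP", "bark")),
         ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats")))]
grammar = estimate_pcfg(trees)
print(grammar[("S", ("NP", "VP"))])  # 1.0
```

As the snippet says, such a grammar can only rewrite nodes in ways observed in training; unseen rewrites get no probability mass at all.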

223 | PCFG Models of Linguistic Tree Representations
- Johnson
- 1998
Citation Context: ...the ways observed in the training data. For improved models of P(T), tree nodes' labels were annotated with various contextual markers. In PCFG-PA, each node was marked with its parent's label as in [8]. It is now well known that such annotation improves the accuracy of PCFG parsing by weakening the PCFG independence assumptions. For example, the NP in figure 1a would actually have been labeled NP^S....
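Parent annotation as described in the snippet can be sketched in a few lines; the tuple tree representation and the `^` separator are assumptions for illustration, not the paper's code:

```python
# PCFG-PA-style parent annotation: each nonterminal label is marked with its
# parent's label (e.g. the NP under S becomes "NP^S"), which weakens the PCFG
# independence assumptions. Trees are nested tuples; leaves are word strings.
def annotate_parent(tree, parent=None):
    label, *children = tree
    new_label = f"{label}^{parent}" if parent else label
    new_children = [annotate_parent(c, label) if isinstance(c, tuple) else c
                    for c in children]
    return (new_label, *new_children)

tree = ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats")))
print(annotate_parent(tree))
# ('S', ('NP^S', 'dogs'), ('VP^S', ('V^VP', 'chase'), ('NP^VP', 'cats')))
```

After annotation, an NP under S and an NP under VP are distinct symbols, so the grammar can learn different rewrite distributions for subjects and objects.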

98 | Parsing Algorithms and Metrics
- Goodman
- 1996
Citation Context: ...al. [14] find that pruning based on estimates for P(e|s) raises accuracy slightly, for a non-lexicalized PCFG. As they note, their pruning metric seems to mimic Goodman's maximum-constituents parsing [18], which maximizes the expected number of correct nodes rather than the likelihood of the entire parse. In any case, we see it as valuable to have an exact parser with which these types of questions ca...

96 | Efficient parsing for bilexical context-free grammars and head-automaton grammars
- Eisner, Satta
- 1999
Citation Context: ...), well-known cubic PCFG parsing algorithms are easily adapted to find outside scores. For the semantic model, P(D), there are several presentations of cubic dependency parsing algorithms, including [9] and [12]. These can also be adapted to produce outside scores in cubic time, though since their basic data structures are not edges, there is some subtlety. For space reasons, we omit the details of ...

83 | Grammatical Trigrams: A Probabilistic Model of Link Grammar
- Lafferty, Sleator, et al.
- 1992
Citation Context: ...-known cubic PCFG parsing algorithms are easily adapted to find outside scores. For the semantic model, P(D), there are several presentations of cubic dependency parsing algorithms, including [9] and [12]. These can also be adapted to produce outside scores in cubic time, though since their basic data structures are not edges, there is some subtlety. For space reasons, we omit the details of these pha...

59 | Parsing and hypergraphs
- Klein, Manning
- 2001
Citation Context: ...[residue of Figure 4, "Performance of the sub-models alone": PCFG-LING 83.7/82.1/82.9; dependency accuracy DEP-BASIC 76.3, DEP-VAL 85.0] (but given in [13]). However, removing edges by inside score is not practical (see section 4 for an empirical demonstration), because all small edges end up having better scores than any large edges. Luckily, the optim...
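The snippet argues that ordering the agenda by inside score alone always favors small edges; an A*-style priority adds an admissible outside estimate so that nearly complete parses can compete. A toy sketch (all scores invented):

```python
import heapq

# Edges are popped by inside log-score PLUS an admissible outside estimate
# (an optimistic bound on the cost of completing the parse), rather than by
# inside score alone. Python's heapq is a min-heap, so priorities are negated.
def astar_pop_order(edges):
    """edges: list of (inside_logprob, outside_estimate, name) triples."""
    agenda = [(-(inside + outside), name) for inside, outside, name in edges]
    heapq.heapify(agenda)
    order = []
    while agenda:
        order.append(heapq.heappop(agenda)[1])
    return order

edges = [
    (-1.0, -20.0, "small-edge"),   # great inside score, costly to complete
    (-8.0,  -0.5, "large-edge"),   # worse inside score, but nearly a full parse
]
print(astar_pop_order(edges))  # ['large-edge', 'small-edge']
```

Under inside score alone, small-edge (-1.0) would be popped first; adding the outside estimate reverses the order, which is the behavior the snippet describes.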

56 | Edge-Based Best-First Chart Parsing
- Charniak, Goldwater, et al.
- 1998
Citation Context: ...tes. [2] uses extensive probabilistic pruning -- this amounts to giving pruned edges infinitely low priority. Absolute pruning can, and does, prevent the most likely parse from being returned at all. [14] removes edges in order of estimates of their correctness. This, too, may result in the first parse found not being the most likely parse, but it has another more subtle drawback: if we hold back an e...

27 | What is the minimal set of fragments that achieves maximal parse accuracy?
- Bod
- 2001
Citation Context: ...tatistical interactions between syntactic and semantic structure, and, if deeper underlying variables of communication are not modeled, everything tends to be dependent on everything else in language [4]. However, the above considerations suggest that there might be considerable value in a factored model, which provides separate models of syntactic configurations and lexical dependencies, and then co...

15 | Parsing with treebank grammars: Empirical bounds, theoretical models, and the structure of the Penn treebank
- Klein, Manning
- 2001
Citation Context: ...ar is huge (these grammars often contain tens of thousands of rules once binarized), and larger sentences are more likely to contain structures which unlock increasingly large regions of the grammar ([10] describes how this can cause the sentence length to leak into terms which are analyzed as constant, leading to empirical growth far faster than the predicted bounds). We did implement a version of th...

1 | Dependency Syntax: Theory and Practice
- Mel'čuk
- 1988
Citation Context: ...shown in figure 1. Figure 1a is a plain phrase-structure tree T, which primarily models syntactic units, figure 1b is a dependency tree D, which primarily models word-to-word selectional affinities [5], and figure 1c is a lexicalized phrase-structure tree L, which carries both category and (part-of-speech tagged) head word information at each node. A lexicalized tree can be viewed as the pair L = (...
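The view of a lexicalized tree as a pair L = (T, D) can be illustrated by two projections, one discarding head words to recover the phrase-structure tree T, and one collecting head-to-head attachments to recover the dependencies D (the node representation here is hypothetical, not the paper's):

```python
# Nodes are (category, head_word, children); leaves have empty children.
def project_tree(node):
    """Discard head words: the phrase-structure projection T of L."""
    cat, head, children = node
    return (cat, [project_tree(c) for c in children]) if children else (cat, head)

def project_deps(node, deps):
    """Collect (dependent_head, governing_head) pairs: the dependency projection D."""
    cat, head, children = node
    for c in children:
        if c[1] != head:               # a child whose head word differs...
            deps.append((c[1], head))  # ...depends on this node's head
        project_deps(c, deps)
    return deps

L = ("S", "barks",
     [("NP", "dog", [("DT", "the", []), ("NN", "dog", [])]),
      ("VP", "barks", [("VBZ", "barks", [])])])
print(project_tree(L))
print(project_deps(L, []))  # [('dog', 'barks'), ('the', 'dog')]
```

Head words propagate up the tree along head children (here "barks" from VBZ through VP to S), so each non-head child contributes exactly one dependency arc.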