## Dynamic Programming for Linear-Time Incremental Parsing

Citations: 42 (3 self)

### BibTeX

@MISC{Huang_dynamicprogramming,
  author = {Liang Huang and Kenji Sagae},
  title = {Dynamic Programming for Linear-Time Incremental Parsing},
  year = {}
}

### Abstract

Incremental parsing techniques such as shift-reduce have gained popularity thanks to their efficiency, but there remains a major problem: the search is greedy and only explores a tiny fraction of the whole space (even with beam search), as opposed to dynamic programming. We show that, surprisingly, dynamic programming is in fact possible for many shift-reduce parsers, by merging “equivalent” stacks based on feature values. Empirically, our algorithm yields up to a five-fold speedup over a state-of-the-art shift-reduce dependency parser with no loss in accuracy. Better search also leads to better learning, and our final parser outperforms all previously reported dependency parsers for English and Chinese, yet is much faster.
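The merging idea in the abstract — collapsing stacks that agree on the feature values the model can see — can be sketched as a small modification to beam search. The following is a minimal illustrative sketch, not the authors' code; the `State` shape, signatures, and scores are all hypothetical:

```python
from collections import namedtuple

# Toy parser state: `sig` is the kernel-feature signature (the only part of
# the stack visible to the scoring features), `score` is the prefix score.
State = namedtuple("State", ["sig", "score"])

def merge_equivalent(candidates, beam_width):
    """Plain beam search keeps the top-k full stacks. With dynamic
    programming, states sharing a signature are merged first (keeping the
    best score), so one beam slot stands for many distinct stacks."""
    best = {}
    for st in candidates:
        if st.sig not in best or st.score > best[st.sig].score:
            best[st.sig] = st
    return sorted(best.values(), key=lambda s: -s.score)[:beam_width]

candidates = [
    State(sig=("saw", "girl"), score=1.5),
    State(sig=("saw", "girl"), score=2.0),  # "equivalent" stack, better score
    State(sig=("saw", "with"), score=1.0),
]
merged = merge_equivalent(candidates, beam_width=8)
assert len(merged) == 2 and merged[0].score == 2.0
```

In the actual algorithm, merged states additionally carry predictor states and backpointers so that the packed forest can be recovered; this sketch only shows the merging criterion itself.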

### Citations

822 | A Maximum-Entropy-Inspired Parser - Charniak |

652 | An efficient context-free parsing algorithm - Earley - 1970
Citation Context: ...parsing which runs in polynomial time in theory, but linear-time (with beam search) in practice. The key idea is to merge equivalent stacks according to feature functions, inspired by Earley parsing (Earley, 1970; Stolcke, 1995) and generalized LR parsing (Tomita, 1991). However, our formalism is more flexible and our algorithm more practical. Specifically, we make the following contributions: • theoretically... |

271 | Non-projective dependency parsing using spanning tree algorithms - McDonald, Pereira, et al. - 2005 |

255 | Three New Probabilistic Models for Dependency Parsing: An Exploration - Eisner - 1996
Citation Context: ..., and the DP algorithm subsumes the non-DP one as a special case where no two states are equivalent. 3.5 Example: Edge-Factored Model As a concrete example, Figure 4 simulates an edge-factored model (Eisner, 1996; McDonald et al., 2005a) using shift-reduce with dynamic programming, which is similar to bilexical PCFG parsing using CKY (Eisner and Satta, 1999). Here the kernel feature function is f̃(j, S) = (j... |

211 | PCFG models of linguistic tree representations - Johnson - 1998
Citation Context: ...stack top. For example, the kernel feature function in Eq. 5 is bounded and monotonic, since f2 is less refined than f1 and f0. These two requirements are related to grammar refinement by annotation (Johnson, 1998), where annotations must be bounded and monotonic: for example, one cannot refine a grammar by only remembering the grandparent but not the parent symbol. The difference here is that the annotations... |

203 | The Theory of Parsing - Aho, Ullman - 1972
Citation Context: ...left-to-right scan of the input sentence, and at each step, choose one of the two actions: either shift the current word onto the stack, or reduce the top two (or more) items at the end of the stack (Aho and Ullman, 1972). To adapt it to dependency parsing, we split the reduce action into two cases, re↶ and re↷, depending on which one of the two items becomes the head after reduction. This procedure is known as “arc-... |
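The three actions described in this excerpt — shift, plus two reduce variants that differ in which of the top two items becomes the head — can be sketched as follows. This is an illustrative toy, not the paper's implementation; the function names, arc convention `(head, dependent)`, and the tiny example sentence are all assumptions:

```python
def shift(stack, queue):
    # Move the next input word onto the stack.
    stack.append(queue.pop(0))

def reduce_left(stack, arcs):
    # One reduce case: the top item s0 becomes the head of s1.
    s0, s1 = stack.pop(), stack.pop()
    arcs.add((s0, s1))   # arc stored as (head, dependent)
    stack.append(s0)

def reduce_right(stack, arcs):
    # The other reduce case: the second item s1 becomes the head of s0.
    s0, s1 = stack.pop(), stack.pop()
    arcs.add((s1, s0))
    stack.append(s1)

# Parse "I saw her" (word indices 0, 1, 2): "saw" heads both "I" and "her".
stack, queue, arcs = [], [0, 1, 2], set()
shift(stack, queue); shift(stack, queue)
reduce_left(stack, arcs)    # "saw" takes "I" as dependent
shift(stack, queue)
reduce_right(stack, arcs)   # "saw" takes "her" as dependent
assert arcs == {(1, 0), (1, 2)} and stack == [1]
```

Note that each reduce consumes two stack items and pushes one, so a full parse of an n-word sentence uses exactly n shifts and n−1 reduces.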

187 | An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities - Stolcke - 1995
Citation Context: ...runs in polynomial time in theory, but linear-time (with beam search) in practice. The key idea is to merge equivalent stacks according to feature functions, inspired by Earley parsing (Earley, 1970; Stolcke, 1995) and generalized LR parsing (Tomita, 1991). However, our formalism is more flexible and our algorithm more practical. Specifically, we make the following contributions: • theoretically, we show that... |

180 | Improved inference for unlexicalized parsing - Petrov, Klein - 2007
Citation Context: .../C++, Py=Python, Ja=Java. Time is in seconds per sentence. Search spaces: ‡ linear; others exponential. (on a 3.2GHz Xeon CPU). Best-performing constituency parsers like Charniak (2000) and Berkeley (Petrov and Klein, 2007) do outperform our parser, since they consider more information during parsing, but they are at least 5 times slower. Figure 8 shows the parse time in seconds for each test sentence. The observed tim... |

163 | Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences - Frazier, Rayner - 1982
Citation Context: ...Both have pros and cons: the former performs an exact search (in cubic time) over an exponentially large space, while the latter is much faster (in linear time) and is psycholinguistically motivated (Frazier and Rayner, 1982), but its greedy nature may suffer from severe search errors, as it only explores a tiny fraction of the whole space even with a beam. Can we combine the advantages of both approaches, that is, const... |

148 | Better k-best Parsing - Huang, Chiang - 2005 |

140 | Incremental parsing with the perceptron algorithm - Collins, Roark - 2004 |

121 | Statistical Dependency Analysis with Support Vector machines - Yamada, Matsumoto - 2003 |

119 | The Structure of Shared Forests in Ambiguous Parsing - Billot, Lang - 1989
Citation Context: ...f techniques for tabular simulation of nondeterministic pushdown automata based on deductive systems (Lang, 1974), which allow for cubic-time exhaustive shift-reduce parsing with context-free grammars (Billot and Lang, 1989). Our work advances this line of research in two aspects. First, ours is more general than GLR in... 9 Duan et al. (2007) and Zhang and Clark (2008) did not report word accuracies, but those can be reco... |

92 | Efficient parsing for bilexical context-free grammars and head automaton grammars - Eisner, Satta - 1999
Citation Context: ...a concrete example, Figure 4 simulates an edge-factored model (Eisner, 1996; McDonald et al., 2005a) using shift-reduce with dynamic programming, which is similar to bilexical PCFG parsing using CKY (Eisner and Satta, 1999). Here the kernel feature function is f̃(j, S) = (j, h(s1), h(s0))... 5 Note that using inside cost v for ordering would be a bad idea, as it will always prefer shorter derivations like in best-first p... |

86 | Forest reranking: Discriminative parsing with non-local features - Huang - 2008 |

81 | Deterministic techniques for efficient non-deterministic parsers - Lang - 1974
Citation Context: ...ynamic programming chart in chart parsing (see Footnote 4). In fact, Tomita’s GLR is an instance of techniques for tabular simulation of nondeterministic pushdown automata based on deductive systems (Lang, 1974), which allow for cubic-time exhaustive shift-reduce parsing with context-free grammars (Billot and Lang, 1989). Our work advances this line of research in two aspects. First, ours is more general than... |

39 | Incrementality in deterministic dependency parsing - Nivre - 2004
Citation Context: ...it to dependency parsing, we split the reduce action into two cases, re↶ and re↷, depending on which one of the two items becomes the head after reduction. This procedure is known as “arc-standard” (Nivre, 2004), and has been engineered to achieve state-of-the-art parsing accuracy in Huang et al. (2009), which is also the reference parser in our experiments. 2 More formally, we describe a parser configurati... |

26 | Coarse-to-fine n-best parsing and MaxEnt discriminative reranking - Charniak, Johnson - 2005
Citation Context: ...oracle) k-best lists than those in the beam (see Fig. 6b). The forest itself has an oracle of 98.15 (as if k → ∞), computed à la Huang (2008, Sec. 4.1). These candidate sets may be used for reranking (Charniak and Johnson, 2005; Huang, 2008). 8 4.3 Perceptron Training and Early Updates Another interesting advantage of DP over non-DP is the faster training with perceptron, even when both parsers use the same beam width. This... |

21 | Simple semi-supervised dependency parsing - Koo, Carreras, et al. - 2008 |

19 | Bilingually-constrained (monolingual) shift-reduce parsing - Huang, Jiang, et al. - 2009 |

18 | Probabilistic models for action-based Chinese dependency parsing - Duan, Zhao, et al. - 2007 |

13 | Online large-margin training of dependency parsers - McDonald, Crammer, et al. - 2005
Citation Context: ...el score, which triggers update immediately. By contrast, in non-DP beam search, states such as p might still... 8 DP’s k-best lists are extracted from the forest using the algorithm of Huang and Chiang (2005), rather than those in the final beam as in the non-DP case, because many derivations have been merged during dynamic programming. |

7 | Linear complexity context-free parsing pipelines via chart constraints - Roark, Hollingshead - 2009 |

7 | Graph-structured Stack and Natural Language Parsing - Tomita - 1988
Citation Context: ...w · fsh(i, s′d...s′0) and λ = w · fre↶(j, sd...s0). Figure 3: Deductive system for shift-reduce parsing with dynamic programming. The predictor state set π is an implicit graph-structured stack (Tomita, 1988) while the prefix cost c is inspired by Stolcke (1995). The re↷ case is similar, replacing s′0 ↶ s0 with s′0 ↷ s0, and λ with ρ = w · fre↷(j, sd...s0). Irrelevant information in a deduction step... |