## Cube Summing, Approximate Inference with Non-Local Features, and Dynamic Programming without Semirings


Citations: 7 (3 self)

### BibTeX

@MISC{Gimpel_cubesumming,

author = {Kevin Gimpel and Noah A. Smith},

title = {Cube Summing, Approximate Inference with Non-Local Features, and Dynamic Programming without Semirings},

year = {}

}


### Abstract

We introduce cube summing, a technique that permits dynamic programming algorithms for summing over structures (like the forward and inside algorithms) to be extended with non-local features that violate the classical structural independence assumptions. It is inspired by cube pruning (Chiang, 2007; Huang and Chiang, 2007) in its computation of non-local features dynamically using scored k-best lists, but also maintains additional residual quantities used in calculating approximate marginals. When restricted to local features, cube summing reduces to a novel semiring (k-best+residual) that generalizes many of the semirings of Goodman (1999). When non-local features are included, cube summing does not reduce to any semiring, but is compatible with generic techniques for solving dynamic programming equations.
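When only local features are present, the k-best+residual semiring described in the abstract can be sketched concretely. The snippet below is an illustrative sketch, not the authors' implementation, and every name in it is our own: a semiring value pairs a length-k scored list of proofs with a residual that absorbs the mass of every pruned proof, so kept mass plus residual always equals the exact inside sum.

```python
from itertools import product

K = 2  # list length; k = 1 with the residual ignored recovers Viterbi, k = 0 the inside score


def splus(a, b):
    """Semiring 'plus': merge the two k-best lists, truncate to K, and move
    the score mass of every dropped proof into the residual, so no mass is lost."""
    (list_a, res_a), (list_b, res_b) = a, b
    merged = sorted(list_a + list_b, key=lambda item: -item[0])
    kept, dropped = merged[:K], merged[K:]
    return kept, res_a + res_b + sum(s for s, _ in dropped)


def stimes(a, b):
    """Semiring 'times': cross product of proofs; the new residual is the exact
    total mass (kept + residual on each side, multiplied) minus the kept mass."""
    (list_a, res_a), (list_b, res_b) = a, b
    total_a = sum(s for s, _ in list_a) + res_a
    total_b = sum(s for s, _ in list_b) + res_b
    prods = sorted(((sa * sb, pa + pb)
                    for (sa, pa), (sb, pb) in product(list_a, list_b)),
                   key=lambda item: -item[0])
    kept = prods[:K]
    return kept, total_a * total_b - sum(s for s, _ in kept)


def total(v):
    """Approximate marginal: kept mass plus residual (exact under local features)."""
    lst, res = v
    return sum(s for s, _ in lst) + res
```

Setting k = 1 and ignoring the residual recovers the Viterbi semiring; setting k = 0 leaves only the residual, i.e. the inside score.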

### Citations

2309 | Conditional random fields: probabilistic models for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context: ...n-local features, but leaves open the question of how the feature weights or probabilities are learned. Meanwhile, some learning algorithms, like maximum likelihood for conditional log-linear models (Lafferty et al., 2001), unsupervised models (Pereira and Schabes, 1992), and models with hidden variables (Koo and Collins, 2005; Wang et al., 2007; Blunsom et al., 2008), require summing over the scores of many structure...

1164 | Error bounds for convolutional codes and an asymptotically optimum decoding algorithm
- Viterbi
- 1967
Citation Context: ...When restricted to local features, cube pruning and cube summing can be seen as proper semirings. [Figure 2 residue: Semirings generalized by k-best+residual: Viterbi (Viterbi, 1967) at k = 1 ignoring the residual, inside (Baum et al., 1970) at k = 0, k-best (Goodman, 1999) ignoring the residual, k-best+residual at k = ∞.] Cube pruning reduces to an implementation...

831 | An introduction to variational methods for graphical models
- Jordan, Ghahramani, et al.
- 1999
Citation Context: ...on (Sutton and McCallum, 2004; Smith and Eisner, 2008), Gibbs sampling (Finkel et al., 2005), sequential Monte Carlo methods such as particle filtering (Levy et al., 2008), and variational inference (Jordan et al., 1999; MacKay, 1997; Kurihara and Sato, 2006). Also relevant are stacked learning (Cohen and Carvalho, 2005), interpretable as approximation of non-local feature values (Martins et al., 2008), and M-estima...

772 | A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains
- Baum, Petrie, et al.
- 1970
Citation Context: ...es, cube pruning and cube summing can be seen as proper semirings. [Figure 2 residue: Semirings generalized by k-best+residual: Viterbi (Viterbi, 1967) at k = 1, inside (Baum et al., 1970) at k = 0, k-best (Goodman, 1999).] Cube pruning reduces to an implementation of the k-best semiring (Goodman, 1998), and cube summing reduces to a novel semiring we call the k...

383 | Incorporating non-local information into information extraction systems by gibbs sampling
- Finkel, Grenager, et al.
- 2005
Citation Context: ...decoding and summing problems with non-local features. Some stem from work on graphical models, including loopy belief propagation (Sutton and McCallum, 2004; Smith and Eisner, 2008), Gibbs sampling (Finkel et al., 2005), sequential Monte Carlo methods such as particle filtering (Levy et al., 2008), and variational inference (Jordan et al., 1999; MacKay, 1997; Kurihara and Sato,...

375 | Hierarchical phrase-based translation
- Chiang
- 2007
Citation Context: ...summing over structures (like the forward and inside algorithms) to be extended with non-local features that violate the classical structural independence assumptions. It is inspired by cube pruning (Chiang, 2007; Huang and Chiang, 2007) in its computation of non-local features dynamically using scored k-best lists, but also maintains additional residual quantities used in calculating approximate marginals. W...

272 | Inside-Outside Reestimation from Partially Bracketed Corpora
- Pereira, Schabes
- 1992
Citation Context: ...n of how the feature weights or probabilities are learned. Meanwhile, some learning algorithms, like maximum likelihood for conditional log-linear models (Lafferty et al., 2001), unsupervised models (Pereira and Schabes, 1992), and models with hidden variables (Koo and Collins, 2005; Wang et al., 2007; Blunsom et al., 2008), require summing over the scores of many structures to calculate marginals. We first review the sem...

165 | Principles and implementation of deductive parsing
- Shieber, Schabes, et al.
- 1995
Citation Context: ...2007; Blunsom et al., 2008), require summing over the scores of many structures to calculate marginals. We first review the semiring-weighted logic programming view of dynamic programming algorithms (Shieber et al., 1995) and identify an intuitive property of a program called proof locality that follows from feature locality in the underlying probability model (§2). We then provide an analysis of cube pruning as an a...

161 | Online Learning of Approximate Dependency Parsing Algorithms - McDonald, Pereira

148 | Better k-best Parsing - Huang, Chiang - 2005

116 | A linear programming formulation for global inference in natural language tasks. Defense Technical Information
- Roth, Yih
- 2004
Citation Context: ...Several other approaches used frequently in NLP are approximate methods for decoding only. These include beam search (Lowerre, 1976), cube pruning, which we discuss in §3, integer linear programming (Roth and Yih, 2004), in which arbitrary features can act as constraints on y, and approximate solutions like McDonald and Pereira (2006), in which an exact solution to a related decoding problem is found and then modif...

110 | A Differential Approach to Inference in Bayesian Networks
- Darwiche
- 2003
Citation Context: ...[Figure 1 residue: Combination operation for cube summing, where S = {1, 2, . . . , N′} and P(S) is the power set of S excluding ∅.] ...tool for performing probabilistic inference (Darwiche, 2003). In the directed graph, there are vertices corresponding to axioms (these are sinks in the graph), ⊕ vertices corresponding to theorems, and ⊗ vertices corresponding to summands in the dynamic progr...

82 | Parsing Inside-Out
- Goodman
- 1998
Citation Context: ...[Figure 2 residue: Semirings generalized by k-best+residual.] Cube pruning reduces to an implementation of the k-best semiring (Goodman, 1998), and cube summing reduces to a novel semiring we call the k-best+residual semiring. Binary instantiations of ⊗ and ⊕ can be iteratively reapplied to give the equivalent formulations in Eqs. 12 and 1...

79 | Ensemble learning for hidden Markov models
- MacKay
- 1997
Citation Context: ...um, 2004; Smith and Eisner, 2008), Gibbs sampling (Finkel et al., 2005), sequential Monte Carlo methods such as particle filtering (Levy et al., 2008), and variational inference (Jordan et al., 1999; MacKay, 1997; Kurihara and Sato, 2006). Also relevant are stacked learning (Cohen and Carvalho, 2005), interpretable as approximation of non-local feature values (Martins et al., 2008), and M-estimation (Smith et...

65 | Dependency parsing by belief propagation
- Smith, Eisner
- 2008
Citation Context: ...approximately solving instances of these decoding and summing problems with non-local features. Some stem from work on graphical models, including loopy belief propagation (Sutton and McCallum, 2004; Smith and Eisner, 2008), Gibbs sampling (Finkel et al., 2005), sequential Monte Carlo methods such as particle filtering (Levy et al., 2008), and variational inference (Jordan et al., 1999; MacKay, 1997; Kurihara and Sato,...

64 | Semiring parsing
- Goodman
- 1999
Citation Context: ...ddition to its probability. This is often done using backpointers, but can also be accomplished by representing the most probable proof for each theorem in its entirety as part of the semiring value (Goodman, 1999). For generality, we define a proof as a string that is constructed from strings associated with axioms, but the particular form of a proof is problem-dependent. The “Viterbi proof” semiring includes...

61 | The Harpy Speech Recognition System
- Lowerre
- 1976
Citation Context: ..., and M-estimation (Smith et al., 2007), which allows training without inference. Several other approaches used frequently in NLP are approximate methods for decoding only. These include beam search (Lowerre, 1976), cube pruning, which we discuss in §3, integer linear programming (Roth and Yih, 2004), in which arbitrary features can act as constraints on y, and approximate solutions like McDonald and Pereira (...

56 | Parsing and hypergraphs - Klein, Manning - 2001

51 | Forest rescoring: Faster decoding with integrated language models
- Huang, Chiang
- 2007
Citation Context: ...tructures (like the forward and inside algorithms) to be extended with non-local features that violate the classical structural independence assumptions. It is inspired by cube pruning (Chiang, 2007; Huang and Chiang, 2007) in its computation of non-local features dynamically using scored k-best lists, but also maintains additional residual quantities used in calculating approximate marginals. When restricted to local...

46 | A discriminative latent variable model for statistical machine translation
- Blunsom, Cohn, et al.
- 2008
Citation Context: ...imum likelihood for conditional log-linear models (Lafferty et al., 2001), unsupervised models (Pereira and Schabes, 1992), and models with hidden variables (Koo and Collins, 2005; Wang et al., 2007; Blunsom et al., 2008), require summing over the scores of many structures to calculate marginals. We first review the semiring-weighted logic programming view of dynamic programming algorithms (Shieber et al., 1995) and...

39 | Stacking dependency parsers
- Martins, Das, et al.
- 2008
Citation Context: ...onal inference (Jordan et al., 1999; MacKay, 1997; Kurihara and Sato, 2006). Also relevant are stacked learning (Cohen and Carvalho, 2005), interpretable as approximation of non-local feature values (Martins et al., 2008), and M-estimation (Smith et al., 2007), which allows training without inference. Several other approaches used frequently in NLP are approximate methods for decoding only. These include beam search...

35 | Automatic Differentiation of Algorithms
- Griewank, Corliss
- 1991
Citation Context: ...nt from each node to the nodes it depends on; ⊕ vertices depend on ⊗ vertices, which depend on ⊕ and axiom vertices. Arithmetic circuits are amenable to automatic differentiation in the reverse mode (Griewank and Corliss, 1991), commonly used in backpropagation algorithms. Importantly, this permits us to calculate the exact gradient of the approximate summation with respect to axiom values, following Eisner et al. (2005). ...
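The reverse-mode differentiation mentioned in this snippet can be illustrated on a toy arithmetic circuit of ⊕ (sum) and ⊗ (product) vertices. This is a generic, minimal autodiff sketch under our own naming, not the paper's implementation:

```python
class Node:
    """One vertex of an arithmetic circuit; `inputs` holds (child, local derivative) pairs."""
    def __init__(self, val, inputs=()):
        self.val, self.inputs, self.grad = val, list(inputs), 0.0

    def __add__(self, other):  # ⊕ vertex: d(a+b)/da = d(a+b)/db = 1
        return Node(self.val + other.val, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):  # ⊗ vertex: d(a*b)/da = b, d(a*b)/db = a
        return Node(self.val * other.val, [(self, other.val), (other, self.val)])


def backward(out):
    """Reverse-mode sweep: topologically order the circuit, then propagate
    d(out)/d(vertex) from the output back to the axiom leaves, as in backprop."""
    order, seen = [], set()

    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for child, _ in n.inputs:
                visit(child)
            order.append(n)

    visit(out)
    out.grad = 1.0
    for n in reversed(order):
        for child, local in n.inputs:
            child.grad += n.grad * local
```

With axioms x = Node(3.0) and y = Node(4.0) and output z = x * y + x, the sweep yields x.grad = 5.0 and y.grad = 3.0, the exact gradient of z with respect to the axiom values.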

34 | Forest reranking: Discriminative parsing with non-local features
- Huang
- 2008
Citation Context: ...e when 1 < k < ∞. For example, consider the probabilistic CKY algorithm as above, but using the cube decoding semiring with the non-local feature functions collectively known as “NGramTree” features (Huang, 2008) that score the string of terminals and nonterminals along the path from word j to word j + 1 when two constituents CY,i,j and CZ,j,k are combined. The semiring value associated with such a feature i...

30 | Hidden-Variable Models for Discriminative Reranking
- Koo, Collins
- 2005
Citation Context: ...nwhile, some learning algorithms, like maximum likelihood for conditional log-linear models (Lafferty et al., 2001), unsupervised models (Pereira and Schabes, 1992), and models with hidden variables (Koo and Collins, 2005; Wang et al., 2007; Blunsom et al., 2008), require summing over the scores of many structures to calculate marginals. We first review the semiring-weighted logic programming view of dynamic programmi...

26 | Stacked sequential learning
- Cohen, Carvalho
- 2005
Citation Context: ...uential Monte Carlo methods such as particle filtering (Levy et al., 2008), and variational inference (Jordan et al., 1999; MacKay, 1997; Kurihara and Sato, 2006). Also relevant are stacked learning (Cohen and Carvalho, 2005), interpretable as approximation of non-local feature values (Martins et al., 2008), and M-estimation (Smith et al., 2007), which allows training without inference. Several other approaches used freq...

24 | Variational Bayesian grammar induction for natural language
- Kurihara, Sato
- 2006
Citation Context: ...h and Eisner, 2008), Gibbs sampling (Finkel et al., 2005), sequential Monte Carlo methods such as particle filtering (Levy et al., 2008), and variational inference (Jordan et al., 1999; MacKay, 1997; Kurihara and Sato, 2006). Also relevant are stacked learning (Cohen and Carvalho, 2005), interpretable as approximation of non-local feature values (Martins et al., 2008), and M-estimation (Smith et al., 2007), which allows...

20 | Modeling the effects of memory on human online sentence processing with particle filters
- Levy, Reali, et al.
- 2009
Citation Context: ...phical models, including loopy belief propagation (Sutton and McCallum, 2004; Smith and Eisner, 2008), Gibbs sampling (Finkel et al., 2005), sequential Monte Carlo methods such as particle filtering (Levy et al., 2008), and variational inference (Jordan et al., 1999; MacKay, 1997; Kurihara and Sato, 2006). Also relevant are stacked learning (Cohen and Carvalho, 2005), interpretable as approximation of non-local fe...

19 | Probabilistic inference for machine translation - Blunsom, Osborne - 2008

13 | Compiling Comp Ling: Practical weighted dynamic programming and the Dyna language
- Eisner, Goldlust, et al.
- 2005
Citation Context: ...s formal framework was the basis for the Dyna programming language, which permits a declarative specification of the logic program and compiles it into an efficient, agenda-based, bottom-up procedure (Eisner et al., 2005). For our purposes, a DP consists of a set of recursive equations over a set of indexed variables. For example, the probabilistic CKY algorithm (run on sentence w1w2...wn) is written as CX,i−1,i = pX...
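The recursion truncated above (base case CX,i−1,i = p(X → wi), with larger spans built by summing p(X → Y Z) · CY,i,k · CZ,k,j over split points k) is the inside variant of probabilistic CKY. A minimal sketch, with a grammar encoding of our own choosing:

```python
from collections import defaultdict


def inside_cky(words, lexical, binary):
    """Inside algorithm over a binarized PCFG.
    C[X, i, j] sums the probabilities of all derivations of words[i:j] from X:
      base:      C[X, i-1, i]  = p(X -> w_i)
      recursion: C[X, i, j]   += p(X -> Y Z) * C[Y, i, k] * C[Z, k, j]
    """
    n = len(words)
    C = defaultdict(float)
    for i, w in enumerate(words, start=1):   # base case: lexical rules
        for X, p in lexical.get(w, {}).items():
            C[X, i - 1, i] += p
    for width in range(2, n + 1):            # build wider spans from narrower ones
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):        # split point
                for (X, Y, Z), p in binary.items():
                    if C[Y, i, k] and C[Z, k, j]:
                        C[X, i, j] += p * C[Y, i, k] * C[Z, k, j]
    return C
```

For a toy grammar with p(S → A B) = 0.5 and p(A → a) = p(B → b) = 1, running inside_cky on ["a", "b"] gives C[S, 0, 2] = 0.5, the total probability of all parses of the sentence.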

6 | Computationally efficient M-estimation of log-linear structure models
- Smith, Vail, et al.
- 2007
Citation Context: ...ay, 1997; Kurihara and Sato, 2006). Also relevant are stacked learning (Cohen and Carvalho, 2005), interpretable as approximation of non-local feature values (Martins et al., 2008), and M-estimation (Smith et al., 2007), which allows training without inference. Several other approaches used frequently in NLP are approximate methods for decoding only. These include beam search (Lowerre, 1976), cube pruning, which we...