## Learning Structured Classifiers with Dual Coordinate Ascent (2010)

Citations: 4 (2 self)

### BibTeX

```bibtex
@MISC{Martins10learningstructured,
  author = {Andre F. T. Martins and Kevin Gimpel and Noah A. Smith},
  title  = {Learning Structured Classifiers with Dual Coordinate Ascent},
  year   = {2010}
}
```


### Citations

3666 | Convex Optimization - Boyd, Vandenberghe - 2004 |

2309 | Conditional random fields: probabilistic models for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context: ...parameters. 1 Introduction. Learning structured classifiers discriminatively typically involves the minimization of a regularized loss function; the well-known cases of conditional random fields (CRFs, [Lafferty et al., 2001]) and structured support vector machines (SVMs, [Taskar et al., 2003, Tsochantaridis et al., 2004, Altun et al., 2003]) correspond to different choices of loss functions. For large-scale settings, th... |

732 | Gradient-based learning applied to document recognition
- LeCun, Bottou, et al.
Citation Context: ...optimization problem is often difficult to tackle in its batch form, increasing the popularity of online algorithms. Examples are the structured perceptron [Collins, 2002a], stochastic gradient descent (SGD) [LeCun et al., 1998], and the margin infused relaxed algorithm (MIRA) [Crammer et al., 2006]. This paper presents a unified representation for several convex loss functions of interest in structured classification (§2)... |
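The context above lists the online learners the paper compares (structured perceptron, SGD, MIRA). As a rough illustration of the simplest of these, here is a minimal structured-perceptron update on a toy tagging problem; the feature map `phi`, the per-token decoder, and the example sentence are hypothetical stand-ins, not the paper's setup.

```python
# Minimal structured perceptron sketch (toy stand-in, not the paper's model).

def phi(x, y):
    # Toy feature map: counts of (input token, label) pairs.
    feats = {}
    for xi, yi in zip(x, y):
        feats[(xi, yi)] = feats.get((xi, yi), 0.0) + 1.0
    return feats

def decode(w, x, labels):
    # Exhaustive per-token argmax; valid here because the toy features
    # decompose over positions (real decoders use Viterbi / MST).
    return [max(labels, key=lambda l: w.get((xi, l), 0.0)) for xi in x]

def perceptron_update(w, x, y_gold, labels):
    # If the current best output disagrees with gold, move the weights
    # toward the gold features and away from the predicted ones.
    y_hat = decode(w, x, labels)
    if y_hat != y_gold:
        for f, v in phi(x, y_gold).items():
            w[f] = w.get(f, 0.0) + v
        for f, v in phi(x, y_hat).items():
            w[f] = w.get(f, 0.0) - v
    return w

w = {}
x = ["John", "hit", "the", "ball"]
y = ["NOUN", "VERB", "DET", "NOUN"]
for _ in range(3):
    w = perceptron_update(w, x, y, ["NOUN", "VERB", "DET"])
print(decode(w, x, ["NOUN", "VERB", "DET"]))  # recovers y after the updates
```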

488 | Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10 - Collins - 2002 |

436 | Max-margin Markov networks
- Taskar, Guestrin, et al.
- 2003
Citation Context: ...discriminatively typically involves the minimization of a regularized loss function; the well-known cases of conditional random fields (CRFs, [Lafferty et al., 2001]) and structured support vector machines (SVMs, [Taskar et al., 2003, Tsochantaridis et al., 2004, Altun et al., 2003]) correspond to different choices of loss functions. For large-scale settings, the underlying optimization problem is often difficult to tackle in its... |

312 | Support vector machine learning for interdependent and structured output spaces
- Tsochantaridis, Hofmann, et al.
Citation Context: ...the minimization of a regularized loss function; the well-known cases of conditional random fields (CRFs, [Lafferty et al., 2001]) and structured support vector machines (SVMs, [Taskar et al., 2003, Tsochantaridis et al., 2004, Altun et al., 2003]) correspond to different choices of loss functions. For large-scale settings, the underlying optimization problem is often difficult to tackle in its batch form, increasing the p... |

292 | Online passive-aggressive algorithms - Crammer, Dekel, et al. |

271 | Non-projective dependency parsing using spanning tree algorithms
- McDonald, Pereira, et al.
- 2005
Citation Context: ...the representation extends naturally to non-projective parsing via the Chu-Liu-Edmonds (Chu and Liu, 1965; Edmonds, 1967) MST algorithm, yielding an O(n2)... Figure 1: An example dependency parse tree (adapted from [McDonald et al., 2005]). ...a bipartite graph with two types of nodes: variable nodes, which in our case link to the components of y, and a set C of factor nodes... |

256 | Parallel Optimization: Theory, Algorithms, and Applications
- Censor, Zenios
- 1997
Citation Context: ...add a half-space constraint for each. This procedure approximates the constraint set by a polyhedron and the resulting problem can be addressed using row-action methods, such as Hildreth's algorithm [Censor and Zenios, 1997]. This corresponds precisely to k-best MIRA. 5 Experiments. We report experiments on two tasks: named entity recognition and dependency parsing. For each, we compare DCA (Alg. 1) with SGD. We report... |
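The excerpt mentions Hildreth's row-action algorithm for quadratic programs with half-space constraints. The sketch below, under toy assumptions (a 2-D Euclidean projection with hand-picked constraints), cycles over the constraints and updates one dual variable at a time in the spirit of that method; it is not the paper's k-best MIRA solver.

```python
# Hildreth-style row-action sketch: project w0 onto the polyhedron
# {w : a_i . w <= b_i} by cycling over constraints, one dual variable
# per constraint (toy 2-D instance, chosen for illustration).

def hildreth(w0, A, b, sweeps=200):
    w = list(w0)
    lam = [0.0] * len(A)                 # one dual variable per row
    for _ in range(sweeps):
        for i, (a, bi) in enumerate(zip(A, b)):
            viol = sum(aj * wj for aj, wj in zip(a, w)) - bi
            step = viol / sum(aj * aj for aj in a)
            new_lam = max(0.0, lam[i] + step)   # keep the dual feasible
            delta = new_lam - lam[i]
            lam[i] = new_lam
            w = [wj - delta * aj for wj, aj in zip(w, a)]
    return w

# Project (2, 2) onto {w : w_x <= 1, w_y <= 1}; the answer is (1, 1).
w = hildreth([2.0, 2.0], A=[[1.0, 0.0], [0.0, 1.0]], b=[1.0, 1.0])
print([round(v, 6) for v in w])  # -> [1.0, 1.0]
```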

255 | Three New Probabilistic Models for Dependency Parsing: An Exploration
- Eisner
- 1996
Citation Context: ...otherwise. There is one hard factor connected to all variables (call it TREE), its potential being one if the arc configurations form a spanning tree and zero otherwise. In the arc-factored model [Eisner, 1996, McDonald et al., 2005], all soft factors are unary and the graph is a tree. More sophisticated models (e.g., with siblings and grandparents) include pairwise factors, creating loops [Smith and Eisner, 2008]... |

249 | CoNLL-X shared task on multilingual dependency parsing
- Buchholz, Marsi
- 2006
Citation Context: ...than the baselines. Dependency Parsing. We trained non-projective dependency parsers for three languages (Arabic, Danish, and English), using datasets from the CoNLL-X and CoNLL-2008 shared tasks [Buchholz and Marsi, 2006, Surdeanu et al., 2008]. Performance is assessed by the unlabeled attachment score (UAS), the fraction of non-punctuation words which were assigned the correct parent. We adapted TurboParser to han... |

189 | Hidden Markov support vector machines
- Altun, Tsochantaridis, et al.
Citation Context: ...regularized loss function; the well-known cases of conditional random fields (CRFs, [Lafferty et al., 2001]) and structured support vector machines (SVMs, [Taskar et al., 2003, Tsochantaridis et al., 2004, Altun et al., 2003]) correspond to different choices of loss functions. For large-scale settings, the underlying optimization problem is often difficult to tackle in its batch form, increasing the popularity of online... |

134 | On the generalization ability of on-line learning algorithms - Cesa-Bianchi, Conconi, Gentile |

65 | Dependency parsing by belief propagation
- Smith, Eisner
- 2008
Citation Context: ...model [Eisner, 1996, McDonald et al., 2005], all soft factors are unary and the graph is a tree. More sophisticated models (e.g., with siblings and grandparents) include pairwise factors, creating loops [Smith and Eisner, 2008]. 3 Variational Inference. 3.1 Polytopes and Duality. Let P = {P_θ(·|x) | θ ∈ R^d} be the family of all distributions of the form (5), and rewrite (4) as: φ(x, y) = Σ_{C ∈ C_soft} φ_C(x, y_C) = F(x) · χ(y)... |

59 | Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks
- Collins, Globerson, et al.
- 2008
Citation Context: ...the regularization parameter C = 1/(λm). To choose the learning rate for SGD, we use the formula ηt = η/(1 + (t−1)/m) [LeCun et al., 1998]. We choose η using dev-set validation after a single epoch [Collins et al., 2008]. Named Entity Recognition. We use the English data from the CoNLL 2003 shared task [Tjong Kim Sang and De Meulder, 2003], which consist of English news articles annotated with four entity types: per... |
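The context quotes the SGD step-size schedule ηt = η/(1 + (t−1)/m), with m the training-set size and η chosen by dev-set validation after one epoch. A one-function sketch of that schedule (the numeric values below are illustrative, not the paper's tuned settings):

```python
# SGD step-size schedule from the quoted context:
#   eta_t = eta0 / (1 + (t - 1) / m)
# m = training-set size; eta0 = base rate picked on a dev set.

def eta(t, eta0, m):
    return eta0 / (1.0 + (t - 1) / m)

m, eta0 = 1000, 0.1               # illustrative values only
print(eta(1, eta0, m))            # -> 0.1 (first update uses eta0)
print(eta(m + 1, eta0, m))        # -> 0.05 (rate halves after one epoch)
```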

54 | Minimum risk annealing for training log-linear models
- Smith, Eisner
- 2006
Citation Context: ...differences of losses in this family. By defining δL_{β,γ} = L_{β,γ} − L_{β,0}, the case β = 1 yields δL_{β,γ}(θ; x, y) = log E_θ exp ℓ(Y, y), which is an upper bound on E_θ ℓ(Y, y), used in minimum risk training [Smith and Eisner, 2006]. For β = ∞, δL_{β,γ} becomes a structured ramp loss [Collobert et al., 2006]. |

53 | The CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies
- Surdeanu, Johansson, et al.
- 2008
Citation Context: ...Dependency Parsing. We trained non-projective dependency parsers for three languages (Arabic, Danish, and English), using datasets from the CoNLL-X and CoNLL-2008 shared tasks [Buchholz and Marsi, 2006, Surdeanu et al., 2008]. Performance is assessed by the unlabeled attachment score (UAS), the fraction of non-punctuation words which were assigned the correct parent. We adapted TurboParser to handle any loss function L... |

52 | Trading convexity for scalability
- Collobert, Sinz, et al.
Citation Context: ...the case β = 1 yields δL_{β,γ}(θ; x, y) = log E_θ exp ℓ(Y, y), which is an upper bound on E_θ ℓ(Y, y), used in minimum risk training [Smith and Eisner, 2006]. For β = ∞, δL_{β,γ} becomes a structured ramp loss [Collobert et al., 2006]. |

45 | Structured prediction, dual extragradient and Bregman projections - Taskar, Lacoste-Julien, et al. - 2006 |

39 | Factorie: Probabilistic programming via imperatively defined factor graphs
- McCallum, Schultz, et al.
- 2009
Citation Context: ...of representations that correspond to valid outputs. The next step is to design how the feature vector φ(x, y) decomposes, which can be conveniently done via a factor graph [Kschischang et al., 2001, McCallum et al., 2009]. (Footnote 1: Some important non-convex losses can also be written as differences of losses in this family. By defining δL_{β,γ} = L_{β,γ} − L_{β,0}, the case β = 1 yields δL_{β,γ}(θ; x, y) = log E_θ exp ℓ(Y, y)...) |

37 | Concise integer linear programming formulations for dependency parsing
- Martins, Smith, et al.
Citation Context: ...have been recently proposed: a loopy belief propagation (BP) algorithm for computing pseudo-marginals [Smith and Eisner, 2008]; and an LP-relaxation method for approximating the most likely parse tree [Martins et al., 2009]. Although the two methods may look unrelated at first sight, both optimize over outer bounds of the marginal polytope. See [Martins et al., 2010] for further discussion. 4 Online Learning. We now pro... |

32 | Structured prediction models via the matrix-tree theorem
- Koo, Globerson, et al.
- 2007
Citation Context: ...L_{β,γ}(θ; x, y) and ∇L_{β,γ}(θ; x, y) may be computed exactly by modifying the log-potentials, invoking the matrix-tree theorem to compute the log-partition function and the marginals [Smith and Smith, 2007, Koo et al., 2007, McDonald and Satta, 2007], and using the fact that H(z̄) = log Z(θ, x) − θ⊤F(x)z̄. The marginal polytope is the same as the arborescence polytope in Martins et al. [2009]. For richer models where... |

29 | Turbo Parsers: Dependency Parsing by Approximate Variational Inference
- Martins, Smith, et al.
- 2010
Citation Context: ...method for approximating the most likely parse tree [Martins et al., 2009]. Although the two methods may look unrelated at first sight, both optimize over outer bounds of the marginal polytope. See [Martins et al., 2010] for further discussion. 4 Online Learning. We now propose a dual coordinate ascent approach to learn the model parameters θ. This approach extends the primal-dual view of online algorithms put forth... |

25 | Probabilistic models of nonprojective dependency trees
- Smith, Smith
- 2007
Citation Context: ...arc-factored model, L_{β,γ}(θ; x, y) and ∇L_{β,γ}(θ; x, y) may be computed exactly by modifying the log-potentials, invoking the matrix-tree theorem to compute the log-partition function and the marginals [Smith and Smith, 2007, Koo et al., 2007, McDonald and Satta, 2007], and using the fact that H(z̄) = log Z(θ, x) − θ⊤F(x)z̄. The marginal polytope is the same as the arborescence polytope in Martins et al. [2009]. For ri... |
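The two contexts above invoke the matrix-tree theorem to get the partition function over dependency trees: for an arc-factored model, Z is the determinant of the root-deleted Laplacian built from the exponentiated arc scores. A hand-checkable sketch on a two-word toy sentence (the arc weights are arbitrary, not learned scores):

```python
# Matrix-tree sketch: partition function over spanning arborescences
# of a 2-word "sentence" (nodes 1, 2) rooted at node 0.
# w[h][m] = exp(score of arc h -> m); values are arbitrary toy numbers.
w = {0: {1: 2.0, 2: 3.0}, 1: {2: 5.0}, 2: {1: 7.0}}

def weight(h, m):
    return w.get(h, {}).get(m, 0.0)

# Root-deleted Laplacian over nodes {1, 2}:
#   L[j][j] = total weight of arcs entering j; L[i][j] = -weight(i -> j).
L11 = weight(0, 1) + weight(2, 1)
L22 = weight(0, 2) + weight(1, 2)
L12 = -weight(1, 2)
L21 = -weight(2, 1)
Z_det = L11 * L22 - L12 * L21

# Brute force: the three arborescences rooted at 0.
Z_brute = (weight(0, 1) * weight(0, 2)     # 0 -> 1 and 0 -> 2
           + weight(0, 1) * weight(1, 2)   # 0 -> 1 -> 2
           + weight(0, 2) * weight(2, 1))  # 0 -> 2 -> 1
print(Z_det, Z_brute)  # both equal 2*3 + 2*5 + 3*7 = 37
```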

24 | Introduction to the CoNLL-2003 shared task: language-independent named entity recognition - Tjong Kim Sang, De Meulder - 2003 |

23 | Mind the duality gap: Logarithmic regret algorithms for online optimization
- Shalev-Shwartz, Kakade
- 2008
Citation Context: ...and, given a function f : R^n → R̄, we denote by f⋆ : R^n → R̄ its convex conjugate, f⋆(y) = sup_x x⊤y − f(x) (see Appendix A for background on convex analysis). The next proposition, proved in [Kakade and Shalev-Shwartz, 2008], states a generalized form of Fenchel duality, which involves a dual vector µ_i ∈ R^d for each instance. Proposition 2 ([Kakade and Shalev-Shwartz, 2008]): The Lagrange dual of min_θ P_t(θ) is max D_t(µ... |
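The context defines the convex conjugate f⋆(y) = sup_x x⊤y − f(x), the building block of the Fenchel dual used by the paper. A quick numeric check on a grid, for the one function whose conjugate is easiest to verify by hand, f(x) = x²/2 (which is its own conjugate):

```python
# Numeric convex conjugate f*(y) = sup_x (x*y - f(x)), approximated by
# maximizing over a grid. For f(x) = x^2/2 the closed form is f*(y) = y^2/2.

def conjugate(f, y, grid):
    return max(x * y - f(x) for x in grid)

f = lambda x: 0.5 * x * x
grid = [i / 100.0 for i in range(-500, 501)]   # x in [-5, 5], step 0.01
for y in (0.0, 1.0, 2.0):
    print(y, conjugate(f, y, grid))  # -> 0.0, 0.5, 2.0 (= y*y/2)
```

The grid contains the exact maximizer x = y for these test points, so the printed values match the closed form exactly; for general y the grid gives an approximation from below.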

20 | Dependency Parsing
- Kübler, McDonald, et al.
- 2009
Citation Context: ...model, the soft factors are of the form C = {i, i + 1}. To obtain a k-gram model, redefine each Yi to be the set of all contiguous (k − 1)-tuples of labels. Dependency parsing: In this parsing formalism [Kübler et al., 2009], each input is a sentence (i.e., a sequence of words), and the outputs to be predicted are the dependency arcs, which link heads to modifiers, and overall must define a spanning tree (see Fig. 1 for... |

20 | Online learning meets optimization in the dual - Shalev-Shwartz, Singer - 2006 |

19 | A new perceptron algorithm for sequence labeling with non-local features
- Kazama, Torisawa
- 2007
Citation Context: ...and De Meulder, 2003], which consist of English news articles annotated with four entity types: person, location, organization, and miscellaneous. We used a standard set of feature templates, as in [Kazama and Torisawa, 2007], with token shape features [Collins, 2002b] and simple gazetteer features; a feature was included iff it occurs at least once in the training set (total 1,312,255 features). The task is evaluated us... |

3 | Background on Convex Analysis: We briefly review some notions of convex analysis that are used throughout the paper. For more details, see e.g. - Boyd, Vandenberghe |