## Learning as search optimization: Approximate large margin methods for structured prediction (2005)

### Download Links

- [www.cs.utah.edu]
- [hal3.name]
- [www.umiacs.umd.edu]
- [arxiv.org]
- [www-ai.cs.uni-dortmund.de]
- [www.isi.edu]
- [imls.engr.oregonstate.edu]
- [www.machinelearning.org]
- DBLP

### Other Repositories/Bibliography

Venue: ICML

Citations: 38 (0 self)

### BibTeX

@INPROCEEDINGS{Iii05learningas,
  author = {Hal Daumé III and Daniel Marcu},
  title = {Learning as search optimization: Approximate large margin methods for structured prediction},
  booktitle = {ICML},
  year = {2005},
  pages = {169--176}
}

### Abstract

Mappings to structured output spaces (strings, trees, partitions, etc.) are typically learned using extensions of classification algorithms to simple graphical structures (e.g., linear chains) in which search and parameter estimation can be performed exactly. Unfortunately, in many complex problems, it is rare that exact search or parameter estimation is tractable. Instead of learning exact models and searching via heuristic means, we embrace this difficulty and treat the structured output problem in terms of approximate search. We present a framework for learning as search optimization, and two parameter updates with convergence theorems and bounds. Empirical evidence shows that our integrated approach to learning and decoding can outperform exact models at smaller computational cost.

### Citations

3697 | Artificial Intelligence: A Modern Approach - Russell, Norvig - 1995 |

2309 | Conditional random fields: probabilistic models for segmenting and labeling sequence data - Lafferty, McCallum, et al. - 2001
Citation Context: ...Most previous work on the structured outputs problem extends standard classifiers to linear chains. Among these are maximum entropy Markov models and conditional random fields (McCallum et al., 2000; Lafferty et al., 2001); case-factor diagrams (McAllester et al., 2004); sequential Gaussian process models (Altun et al., 2004); support vector machines for structured outputs (Tsochantaridis et al., 2004) and max-margin M...

788 | The perceptron: a probabilistic model for information storage and organization in the brain - Rosenblatt - 1958
Citation Context: ...by γ. In other words, γ is the minimum over all decisions of max_{g,b} |w⊤Φ(x, g) − w⊤Φ(x, b)|, where g is a y-good node and b is a y-bad node. Perceptron Updates. A simple perceptron-style update rule (Rosenblatt, 1958), given (w, x, sibs, nodes), is w ← w + ∆, where ∆ = (1/|sibs|) Σ_{n∈sibs} Φ(x, n) − (1/|nodes|) Σ_{n∈nodes} Φ(x, n). When an update is made, the feature vectors for the incorrect decisions are subtracted off, and...
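The perceptron-style update quoted in the context above reduces to a one-line difference of mean feature vectors. A minimal NumPy sketch, assuming a dense feature representation with one feature vector per row (the function name and array layout are illustrative, not from the paper):

```python
import numpy as np

def laso_perceptron_update(w, phi_sibs, phi_nodes):
    """Perceptron-style update from the snippet: move the weight vector w
    toward the mean feature vector of the y-good siblings (phi_sibs) and
    away from the mean feature vector of all queued nodes (phi_nodes).
    Each array holds one feature vector Phi(x, n) per row."""
    delta = phi_sibs.mean(axis=0) - phi_nodes.mean(axis=0)
    return w + delta
```

When `phi_sibs` and `phi_nodes` each contain a single row, this collapses to the standard perceptron update, as the context notes.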

488 | Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms - Collins - 2002
Citation Context: ...st list of best y or a marginal distribution across the graphical structure. One alternative that alleviates some of these issues is to use a perceptron algorithm, where only the arg max is required (Collins, 2002), but performance can be adversely affected by the fact that even the arg max cannot be computed exactly; see (McCallum & Wellner, 2004) for example. 3. Search Optimization We present the Learning as...

444 | Shallow parsing with conditional random fields - Sha, Pereira - 2003
Citation Context: ...state reserves] NP [to] PP [$ 217 million] NP . (4) Typical approaches to this problem recast it as a sequence labeling task and then solve it using any of the standard sequence labeling models; see (Sha & Pereira, 2002) for a prototypical example using CRFs. The reduction to sequence labeling is typically done through the “BIO” encoding, where the beginning of an X phrase is tagged B-X, the non-beginning (inside) o...

439 | Maximum entropy Markov models for information extraction and segmentation - McCallum, Freitag, et al. - 2000
Citation Context: ...able. 2. Previous Work Most previous work on the structured outputs problem extends standard classifiers to linear chains. Among these are maximum entropy Markov models and conditional random fields (McCallum et al., 2000; Lafferty et al., 2001); case-factor diagrams (McAllester et al., 2004); sequential Gaussian process models (Altun et al., 2004); support vector machines for structured outputs (Tsochantaridis et al....

436 | Max-margin Markov networks - Taskar, Guestrin, et al. - 2003 |

413 | Large margin classification using the perceptron algorithm - Freund, Schapire - 1999 |

354 | Generalization in reinforcement learning: successful examples using sparse coding - Sutton - 1995
Citation Context: ...& Anandan, 1985). Similar approaches attempt to predict value functions for generalization using techniques such as temporal difference (TD) or Q-learning (Bellman et al., 1963; Boyan & Moore, 1996; Sutton, 1996). More recently, Zhang and Dietterich (1997) applied RL techniques to solving combinatorial scheduling problems, but again focus on the standard TD(λ) framework. These frameworks, however, are not ex...

312 | Support vector machine learning for interdependent and structured output spaces - Tsochantaridis, Hofmann, et al. - 2004
Citation Context: ...McCallum et al., 2000; Lafferty et al., 2001); case-factor diagrams (McAllester et al., 2004); sequential Gaussian process models (Altun et al., 2004); support vector machines for structured outputs (Tsochantaridis et al., 2004) and max-margin Markov models (Taskar et al., 2003); and kernel dependency estimation models (Weston et al., 2002). These models learn distributions or weights on simple graphs (typically linear chain...

170 | Semi-Markov Conditional Random Fields for Information Extraction - Sarawagi, Cohen - 2004 |

140 | Incremental parsing with the perceptron algorithm - Collins, Roark - 2004 |

122 | Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data - Sutton, Rohanimanesh, et al. - 2004
Citation Context: ...as a preprocessing step. In this section, we discuss models in which part of speech tagging and chunking are performed jointly. This task has previously been used as a benchmark for factorized CRFs (Sutton et al., 2004). In that work, the authors discuss many approximate inference methods to deal with the fact that inference in such joint models is intractable. For this task, we do use the BIO encoding of the chunk...

121 | Conditional models of identity uncertainty with application to noun coreference - McCallum, Wellner - 2004
Citation Context: ...sues is to use a perceptron algorithm, where only the arg max is required (Collins, 2002), but performance can be adversely affected by the fact that even the arg max cannot be computed exactly; see (McCallum & Wellner, 2004) for example. 3. Search Optimization We present the Learning as Search Optimization (LaSO) framework for predicting structured outputs. The idea is to delve into Eq (1) to first reduce the requiremen...

91 | Nash convergence of gradient dynamics in general-sum games - Singh, Kearns, et al. - 2000 |

90 | The use of classifiers in sequential inference - Punyakanok, Roth - 2001
Citation Context: ...updates, this drops to 93.1 and using standard perceptron updates, it drops to 92.5. Our work also bears a resemblance to training local classifiers and combining them together with global inference (Punyakanok & Roth, 2001). The primary difference is that when learning local classifiers, one must assume to have access to all possible decisions and must rank them according to some loss function. Alternatively, in our mo...

87 | A new approximate maximal margin classification algorithm - Gentile - 2001
Citation Context: ...rvive if it is good by a large margin. This setup gives rise to a bound on the number of updates made (proof sketched in Appendix A) and two corollaries (proofs are nearly identical to Theorem 4 and (Gentile, 2001)): Theorem 4 For any training sequence that is separable with a margin of size γ, using the approximate large margin update rule with parameters α, B = √(8/α), C = √2, the number of errors made duri...

72 | Pattern recognizing stochastic learning automata - Barto, Anandan - 1985
Citation Context: ...decisions are added. Whenever |sibs| = |nodes| = 1, this looks exactly like the standard perceptron update. When there is only one sibling but many nodes, this resembles the gradi... (Barto et al., 1981; Barto & Anandan, 1985). Similar approaches attempt to predict value functions for generalization using techniques such as temporal difference (TD) or Q-learning (Bellman et al., 1963; Boyan & Moore, 1996; Sutton, 1996). M...

61 | A Family of Additive Online Algorithms for Category Ranking - Crammer, Singer - 2003
Citation Context: ...with Jensen’s inequality to turn it into a simple sum. When there is more than one correct next hypothesis, this update rule resembles that used in multi-label or ranking variants of the perceptron (Crammer & Singer, 2003). In that work, different “weighting” schemes are proposed, including, for instance, one that weights the nodes in the sums proportional to the loss suffered; such schemes are also possible in our fr...

49 | Associative search network: A reinforcement learning associative memory - Barto, Sutton, et al. - 1981
Citation Context: ...ll possible correct decisions are added. Whenever |sibs| = |nodes| = 1, this looks exactly like the standard perceptron update. When there is only one sibling but many nodes, this resembles the gradi... (Barto et al., 1981; Barto & Anandan, 1985). Similar approaches attempt to predict value functions for generalization using techniques such as temporal difference (TD) or Q-learning (Bellman et al., 1963; Boyan & Moore,...

44 | Exponentiated gradient algorithms for large-margin structured classification - Collins, McAllester - 2004
Citation Context: ...l, but training remains computationally expensive. Recent effort to reduce this computational demand considers employing constraints that the correct output only outweigh the k-best model hypotheses (Bartlett et al., 2004). Alternatively, an online algorithm for which only very small QPs are solved is also possible (McDonald et al., 2004). At the heart of all these algorithms, batch or online, likelihood- or margin-bas...

44 | Text chunking based on a generalization of winnow - Zhang, Damerau, et al. - 2002
Citation Context: ...sentences (212k words) and 2012 test sentences (47k words). We compare our proposed models against several baselines. The first baseline is denoted ZDJ02 and is the best system on this task to date (Zhang et al., 2002). The second baseline is the likelihood-trained model, denoted SemiCRF. We use 10% of the training data to tune model parameters. The third baseline is the standard structured perceptron algorithm, d...

34 | Gaussian process classification for segmenting and annotating sequences - Altun, Hofmann, et al. - 2004
Citation Context: ...ese are maximum entropy Markov models and conditional random fields (McCallum et al., 2000; Lafferty et al., 2001); case-factor diagrams (McAllester et al., 2004); sequential Gaussian process models (Altun et al., 2004); support vector machines for structured outputs (Tsochantaridis et al., 2004) and max-margin Markov models (Taskar et al., 2003); and kernel dependency estimation models (Weston et al., 2002). These...

32 | Polynomial approximation – a new computational technique in dynamic programming - Bellman, Kalaba, et al. - 1963
Citation Context: ...sembles the gradi... (Barto et al., 1981; Barto & Anandan, 1985). Similar approaches attempt to predict value functions for generalization using techniques such as temporal difference (TD) or Q-learning (Bellman et al., 1963; Boyan & Moore, 1996; Sutton, 1996). More recently, Zhang and Dietterich (1997) applied RL techniques to solving combinatorial scheduling problems, but again focus on the standard TD(λ) framework. Th...

21 | Simulation of self-organizing systems by digital computer. Institute of Radio Engineers - Farley, Clark - 1954
Citation Context: ...meter optimization within search resembles reinforcement learning without the confounding factor of “exploration.” Early research in reinforcement learning focused on arbitrary input/output mappings (Farley & Clark, 1954), though this was not framed as search. Later, associative RL was introduced, where a context input (akin to our input x) was given to a RL algorithm. Note that this al...

16 | Solving combinatorial optimization tasks by reinforcement learning: A general methodology applied to resource-constrained scheduling - Zhang, Dietterich - 1997 |

6 | Large margin online learning algorithms for scalable structured classification - McDonald, Crammer, et al. - 2004
Citation Context: ...ing constraints that the correct output only outweigh the k-best model hypotheses (Bartlett et al., 2004). Alternatively, an online algorithm for which only very small QPs are solved is also possible (McDonald et al., 2004). At the heart of all these algorithms, batch or online, likelihood- or margin-based, is the computation ŷ = arg max_{y∈Y} f(x, y; w) (1). This seemingly innocuous statement is necessary in all models...
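Eq (1), quoted in the context above, is a plain argmax over the output space. A brute-force sketch for a toy enumerable space, assuming a linear score f(x, y; w) = w·Φ(x, y) (the names `decode`, `phi`, and `candidates` are illustrative, not from the paper; real structured spaces Y are far too large to enumerate, which is exactly the intractability the snippet discusses):

```python
import numpy as np

def decode(x, candidates, phi, w):
    """Brute-force instance of Eq (1): return the y in `candidates`
    that maximizes the linear score w . phi(x, y). Only feasible when
    the output space can be explicitly enumerated."""
    scores = [float(np.dot(w, phi(x, y))) for y in candidates]
    return candidates[int(np.argmax(scores))]
```

Search-based approaches like LaSO sidestep this enumeration by scoring partial hypotheses during an (approximate) search instead.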

2 | Loss functions and discriminitive training of energy-based models. AI-Stats - LeCun, Huang - 2005 |

2 | Kernel dependency estimation. NIPS - Weston, Chapelle, et al. - 2002
Citation Context: ...s models (Altun et al., 2004); support vector machines for structured outputs (Tsochantaridis et al., 2004) and max-margin Markov models (Taskar et al., 2003); and kernel dependency estimation models (Weston et al., 2002). These models learn distributions or weights on simple graphs (typically linear chains). Probabilistic models are optimized by gradient descent on the log likelihood, which requires computable expec...

1 | Case-factor diagrams for structured probabilistic modeling - McAllester, Collins, et al. - 2004
Citation Context: ...roblem extends standard classifiers to linear chains. Among these are maximum entropy Markov models and conditional random fields (McCallum et al., 2000; Lafferty et al., 2001); case-factor diagrams (McAllester et al., 2004); sequential Gaussian process models (Altun et al., 2004); support vector machines for structured outputs (Tsochantaridis et al., 2004) and max-margin Markov models (Taskar et al., 2003); and kernel d...
