## Sequential Learning of Classifiers for Structured Prediction Problems

Citations: 6 (2 self)

### BibTeX

@MISC{Roth_sequentiallearning,
  author = {Dan Roth and Kevin Small and Ivan Titov},
  title = {Sequential Learning of Classifiers for Structured Prediction Problems},
  year = {}
}

### Abstract

Many classification problems with structured outputs can be regarded as a set of interrelated sub-problems where constraints dictate valid variable assignments. The standard approaches to these problems include either independent learning of individual classifiers for each of the sub-problems or joint learning of the entire set of classifiers with the constraints enforced during learning. We propose an intermediate approach where we learn these classifiers in a sequence, using previously learned classifiers to guide learning of the next classifier by enforcing constraints between their outputs. We provide a theoretical motivation to explain why this learning protocol is expected to outperform both alternatives when the individual problems have different 'complexity'. This analysis motivates an algorithm for choosing a preferred order of classifier learning. We evaluate our technique on artificial experiments and on the entity and relation identification problem, where the proposed method outperforms both joint and independent learning.
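The abstract's sequential-learning (SL) idea can be sketched on a toy instance: two related binary sub-problems whose labels must agree (a stand-in "constraint"), where the classifier for sub-problem 1 is learned first and its predictions then guide learning of the classifier for sub-problem 2. This is an illustrative sketch, not the authors' exact algorithm; all data and names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def perceptron(X, y, epochs=10):
    """Plain perceptron; returns the learned weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:  # mistake-driven update
                w += yi * xi
    return w

# Sub-problem 1: linearly separable with a margin (an "easy" task).
w_true = np.array([1.0, -2.0, 0.5])
X1 = rng.normal(size=(600, 3))
scores = X1 @ w_true
keep = np.abs(scores) > 0.5          # enforce a margin for easy learnability
X1, y1 = X1[keep][:200], np.sign(scores[keep][:200])

# Sub-problem 2: its own (uninformative) features; the toy constraint
# ties its label to sub-problem 1's label.
X2 = rng.normal(size=(200, 3))
y2 = y1.copy()

w1 = perceptron(X1, y1)              # step 1: learn the easier task first
guide = np.sign(X1 @ w1)             # step 2: the learned classifier's output...
X2_sl = np.column_stack([X2, guide]) # ...becomes an input for the next task
w2 = perceptron(X2_sl, y2)

acc = float((np.sign(X2_sl @ w2) == y2).mean())
```

Because the second classifier can lean on the first one's (accurate) output, it recovers the constrained label despite its own features carrying no signal.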

### Citations

701 | Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods
- Platt
- 1999
Citation Context: ...nal reasons or because the learner does not automatically induce a probability estimator. E.g., even though a linear classifier learned with Perceptron or SVM can be used to estimate the probability (Platt, 1999), the factor regulating sharpness of the distribution needs to be selected by hand and may even require adjustment during the course of learning. Therefore, in this case we propose replacing marginal...
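The "factor regulating sharpness" in this context can be made concrete with a sketch of Platt-style calibration: a raw classifier margin s(x) is mapped to a probability via sigma(s) = 1 / (1 + exp(A*s + B)). The slope A is exactly the sharpness parameter the text says must be set by hand (or fit on held-out data); the values below are illustrative, not fitted.

```python
import math

def platt_prob(score, A=-1.0, B=0.0):
    """Map a raw margin to a probability with a Platt-style sigmoid.

    A (slope) controls sharpness; B shifts the decision point.
    These defaults are illustrative, not fitted parameters.
    """
    return 1.0 / (1.0 + math.exp(A * score + B))

p_zero = platt_prob(0.0)    # a zero margin maps to probability 0.5
p_small = platt_prob(1.0)
p_large = platt_prob(3.0)   # larger positive margins give higher confidence
```

A sharper A concentrates the distribution near 0 and 1, which is why the text notes it may need adjustment during learning.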

488 | Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10
- Collins
- 2002
Citation Context: ...n x. In the LO and L+I scenarios, each classifier is trained independently, whereas with IBT all T classifiers are trained jointly to optimize a global objective. The structured Perceptron algorithm (Collins, 2002) is one example of an IBT training algorithm. ...

436 | Max-margin Markov networks
- Taskar, Guestrin, et al.
- 2003
Citation Context: ... less than 1. We cannot directly apply results of (Zhang, 2002) to structured prediction with constrained classifiers (IBT), but we can use a technique similar to that considered in (Collins, 2002; Taskar et al., 2004); the key point is to demonstrate that the set of predefined constraints does not influence ge... (Footnote 3: The average per-label loss is equal to the expected Hamming error divided by the length of the sequence.)

413 | Large margin classification using the perceptron algorithm
- Freund, Schapire
- 1999
Citation Context: ...for each problem was set to 100. The results are averaged over 10 different problem instances generated as explained above. The learning algorithm used for all experiments is the averaged Perceptron (Freund and Schapire, 1998). The SL protocol, except for preserving the averaged vector and running the training iteration more than once, was identical to the one shown in Figure 2. The size of the testing set was equal ...

163 | Learning structured prediction models: A large margin approach
- Taskar, Chatalbashev, et al.
- 2005
Citation Context: ...ail and we explain only the proof strategy. We bound the multi-error-level covering number N^mul_∞(F_c, ε, n) for the considered constrained function class F_c = {φ(y, x, w) : ∀w, |w^IBT_t| < b_t} (see (Taskar, 2004), def. A.1.9 for the multi-error-level covering number definition) by the product of the covering numbers for linear functions with bounded parameter-vector norms, ∏_{t=1}^{T} N_∞(F_L, ε, n). This is done...

131 | On convergence proofs on perceptrons
- Novikoff
- 1962
Citation Context: ...in Figure 1. For simplicity, we assume that only one pass over the training data is done when training each vector w_k, although more passes would normally be required. As motivated by the Novikoff Theorem (Novikoff, 1963), we use the number of Perceptron updates as a criterion when deciding which of the candidate classifiers is more reliable. ...

116 | A linear programming formulation for global inference in natural language tasks
- Roth, Yih
- 2004
Citation Context: ...-problems. This aspect of joint learning lies outside of the scope of this paper. We evaluate our approach with artificial experiments and a real problem, the entity and relation identification task (Roth and Yih, 2004). For artificial experiments, we consider sequence labeling with randomly generated constraints and variable complexity of the individual sub-problems. The model trained with the SL method achieves s...

113 | A unified architecture for natural language processing: Deep neural networks with multitask learning
- Weston
- 2008
Citation Context: ...ipeline approach. Results were mixed, with the best systems not using joint learning but other competitive systems getting improvement from using joint learning. Successful joint learning techniques (Collobert and Weston, 2008; Henderson et al., 2008) exploit shared internal representation, such as vectors of latent variables in graphical models or hidden layers in neural networks, to relate underlying properties of jointl...

107 | Introduction to the CoNLL-2004 shared task: Semantic role labeling
- Carreras, Màrquez
- 2004
Citation Context: ...al Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. Copyright 2009 by the authors. the semantic role labeling (SRL) task (Carreras and Màrquez, 2004a), where a model needs to predict positions of verbal and nominal predicates in a sentence, select their sense, and identify their arguments for each possible argument role. For many of these problem...

53 | The CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies
- Surdeanu, Johansson, et al.
- 2008
Citation Context: ...clusions dispute a general belief in NLP that joint modeling is preferable to pipelines. However, the most... (Footnote 1: The CoNLL 2008 shared task examined joint learning of syntactic and semantic structures (Surdeanu et al., 2008), hoping to use joint learning to improve upon the standard pipeline approach. Results were mixed, with the best systems not using joint learning but other competitive systems getting improvement fro...)

49 | The importance of syntactic parsing and inference in semantic role labeling
- Punyakanok, Roth, et al.
- 2008
Citation Context: ...bserve that a subject and an object of a verbal predicate always appear on opposite sides of the verb, or given a predicate some of its arguments are illegal, e.g., the verb say cannot have an object (Punyakanok et al., 2008). The most common approaches to solving these problems are either joint learning where global inference is used to enforce constraints during training (Inference Based Training, IBT) or independent l...

43 | Learning and inference over constrained output
- Punyakanok, Roth, et al.
- 2005
Citation Context: ... by enforcing constraints at test time (Learning + Inference, L+I). On a number of problems it has been observed that when individual sub-problems are easy to learn in isolation, L+I outperforms IBT (Punyakanok et al., 2005; Carreras and Màrquez, 2004b). Punyakanok et al. (2005) presented a theoretical analysis which suggests that if individual problems are linearly separable then LO should outperform IBT, assuming that...

32 | Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines
- Finkel, Manning, et al.
- 2006
Citation Context: ...er simple distributions, whereas joint decoding with SL predicts correct outputs. The proposed technique is related to classifier pipelines, standard in statistical natural language processing (NLP) (Finkel et al., 2006; Bunescu, 2008), where the output of a classifier trained for one subproblem is provided as an input to the subsequent classifier. As an example, consider the case where predicted part-of-speech tags...

20 | Generalization bounds and consistency for structured labeling
- McAllester
- 2007
Citation Context: ... IBT method (Theorem 2) is greater than the average over T bounds for the SL method (Theorem 3). This is the case when features predictive of y_t in windows x_{t′}... (Footnote 4: This corresponds to the analysis in (McAllester, 2007), where arbitrary loss functions are bounded using PAC-Bayesian bounds based upon the Hamming distance. Footnote 5: Here we assume that C is equal for both problems. Formally, we should explicitly define the C ...)

16 | A latent variable model of synchronous parsing for syntactic and semantic dependencies
- Henderson, Merlo, et al.
- 2008
Citation Context: ...re mixed, with the best systems not using joint learning but other competitive systems getting improvement from using joint learning. Successful joint learning techniques (Collobert and Weston, 2008; Henderson et al., 2008) exploit shared internal representation, such as vectors of latent variables in graphical models or hidden layers in neural networks, to relate underlying properties of jointly learned sub-problems. ...

10 | Online learning via global feedback for phrase recognition
- Carreras, Màrquez
- 2004
Citation Context: ... the semantic role labeling (SRL) task (Carreras and Màrquez, 2004a), where a model needs to predict positions of verbal and nominal predicates in a sentence, select their sense, and identify their arguments for each possible argument role. For many of these problem...

6 | Learning with probabilistic features for improved pipeline models
- Bunescu
- 2008
Citation Context: ...ns, whereas joint decoding with SL predicts correct outputs. The proposed technique is related to classifier pipelines, standard in statistical natural language processing (NLP) (Finkel et al., 2006; Bunescu, 2008), where the output of a classifier trained for one subproblem is provided as an input to the subsequent classifier. As an example, consider the case where predicted part-of-speech tags of words in a ...