Results 1–10 of 25
Minimal Loss Hashing for Compact Binary Codes
Abstract

Cited by 68 (3 self)
We propose a method for learning similarity-preserving hash functions that map high-dimensional data onto binary codes. The formulation is based on structured prediction with latent variables and a hinge-like loss function. It is efficient to train for large datasets, scales well to large code lengths, and outperforms state-of-the-art methods.
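The core idea of the abstract above — a linear map thresholded to bits, compared under Hamming distance — can be sketched as follows. This is a minimal illustration, not the paper's learned hash: the projection `W` here is random, standing in for the weights the paper would learn, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def hash_codes(X, W):
    """Map rows of X to binary codes via the thresholded linear map sign(Wx)."""
    return (X @ W.T > 0).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.sum(a != b))

# Toy data: two nearby points and one unrelated point in R^8, 16-bit codes.
W = rng.standard_normal((16, 8))
x1 = rng.standard_normal(8)
x2 = x1 + 0.01 * rng.standard_normal(8)   # near-duplicate of x1
x3 = rng.standard_normal(8)               # unrelated point

c1, c2, c3 = (hash_codes(v[None, :], W)[0] for v in (x1, x2, x3))
print(hamming(c1, c2), hamming(c1, c3))  # nearby points tend to get the smaller distance
```

Learning `W` so that this tendency holds for a given similarity labeling is exactly what the paper's structured-prediction formulation is for.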
Structured Ramp Loss Minimization for Machine Translation
Abstract

Cited by 35 (4 self)
This paper seeks to close the gap between training algorithms used in statistical machine translation and machine learning, specifically the framework of empirical risk minimization. We review well-known algorithms, arguing that they do not optimize the loss functions they are assumed to optimize when applied to machine translation. Instead, most have implicit connections to particular forms of ramp loss. We propose to minimize ramp loss directly and present a training algorithm that is easy to implement and that performs comparably to others. Most notably, our structured ramp loss minimization algorithm, RAMPION, is less sensitive to initialization and random seeds than standard approaches.
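One standard form of the structured ramp loss discussed above is the cost-augmented max minus the plain max of the model scores. The sketch below assumes a small, enumerable output space; RAMPION itself works with related variants of this objective over the full translation space, so treat this as an illustration of the loss, not the paper's algorithm.

```python
def ramp_loss(scores, costs):
    """Structured ramp loss over a small, enumerable output space.

    scores[y]: model score w . f(x, y) for each candidate output y.
    costs[y]:  task loss of y against the gold output (0 for the gold).
    ramp = max_y (score + cost) - max_y score
    """
    cost_aug = max(s + c for s, c in zip(scores, costs))
    plain = max(scores)
    return cost_aug - plain

# Toy example: 3 candidate translations; the model currently prefers y0 (the gold).
scores = [2.0, 1.5, 0.5]
costs = [0.0, 1.0, 3.0]
print(ramp_loss(scores, costs))  # max(2.0, 2.5, 3.5) - 2.0 = 1.5
```

Unlike the structured hinge loss (which subtracts the gold score), the ramp loss subtracts the current model's own maximum, which bounds the loss by the task cost and makes it a tighter, though non-convex, surrogate.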
Generalization bounds and consistency for latent structural probit and ramp loss
In Proc. of NIPS, 2011
Abstract

Cited by 21 (1 self)
We consider latent structural versions of probit loss and ramp loss. We show that these surrogate loss functions are consistent in the strong sense that for any feature map (finite or infinite dimensional) they yield predictors approaching the infimum task loss achievable by any linear predictor over the given features. We also give finite sample generalization bounds (convergence rates) for these loss functions. These bounds suggest that probit loss converges more rapidly. However, ramp loss is more easily optimized on a given sample.
Doubly Robust Policy Evaluation and Learning
Abstract

Cited by 18 (5 self)
We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including healthcare policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strengths and overcome the weaknesses of the two approaches by applying the doubly robust technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have either a good (but not necessarily consistent) model of rewards or a good (but not necessarily consistent) model of past policy. Extensive empirical comparison demonstrates that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice.
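The doubly robust combination described above — a reward model corrected by an importance-weighted residual on matching actions — can be sketched as follows for a deterministic target policy. All function names are hypothetical, and this is a simplified illustration of the estimator's shape, not the paper's full method.

```python
def dr_value(logged, reward_model, pi, propensity):
    """Doubly robust estimate of a deterministic policy's value from logged data.

    logged: list of (context, action_taken, observed_reward).
    reward_model(x, a): estimated reward r_hat(x, a) (may be biased).
    pi(x): action the new policy would take in context x.
    propensity(x, a): probability the logging policy took a in context x.
    """
    total = 0.0
    for x, a, r in logged:
        a_new = pi(x)
        est = reward_model(x, a_new)              # model term (direct method)
        if a_new == a:                            # importance-weighted correction
            est += (r - reward_model(x, a)) / propensity(x, a)
        total += est
    return total / len(logged)

# Toy check: with a perfect reward model the correction vanishes and the
# estimate equals the model-based value of the new policy.
reward_model = lambda x, a: 1.0 if a == x else 0.0   # hypothetical true rewards
pi = lambda x: x                                      # new policy: match the context
logged = [(0, 0, 1.0), (1, 0, 0.0)]                  # logging policy always took action 0
propensity = lambda x, a: 0.5
print(dr_value(logged, reward_model, pi, propensity))  # 1.0
```

The estimator stays unbiased if either the reward model or the propensities are correct, which is the "doubly robust" guarantee the abstract refers to.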
Imitation learning by coaching
In Advances in Neural Information Processing Systems 25, 2012
Abstract

Cited by 10 (0 self)
Imitation Learning has been shown to be successful in solving many challenging real-world problems. Some recent approaches give strong performance guarantees by training the policy iteratively. However, it is important to note that these guarantees depend on how well the policy we found can imitate the oracle on the training data. When there is a substantial difference between the oracle's ability and the learner's policy space, we may fail to find a policy that has low error on the training set. In such cases, we propose to use a coach that demonstrates easy-to-learn actions for the learner and gradually approaches the oracle. By a reduction of learning by demonstration to online learning, we prove that coaching can yield a lower regret bound than using the oracle. We apply our algorithm to cost-sensitive dynamic feature selection, a hard decision problem that considers a user-specified accuracy-cost tradeoff. Experimental results on UCI datasets show that our method outperforms state-of-the-art imitation learning methods in dynamic feature selection and two static feature selection methods.
Structured Prediction via Output Space Search
Journal of Machine Learning Research (JMLR), 2014
Abstract

Cited by 8 (4 self)
We consider a framework for structured prediction based on search in the space of complete structured outputs. Given a structured input, an output is produced by running a time-bounded search procedure guided by a learned cost function, and then returning the least cost output uncovered during the search. This framework can be instantiated for a wide range of search spaces and search procedures, and easily incorporates arbitrary structured-prediction loss functions. In this paper, we make two main technical contributions. First, we describe a novel approach to automatically defining an effective search space over structured outputs, which is able to leverage the availability of powerful classification learning algorithms. In particular, we define the limited-discrepancy search space and relate the quality of that space to the quality of learned classifiers. We also define a sparse version of the search space to improve the efficiency of our overall approach. Second, we give a generic cost function learning approach that is applicable to a wide range of search procedures. The key idea is to learn a cost function that attempts to mimic the behavior of conducting searches guided by the true loss function. Our experiments on six benchmark domains show that a small amount of search in the limited-discrepancy search space is often sufficient for significantly improving on state-of-the-art structured-prediction performance. We also demonstrate significant speed improvements for our approach using sparse search spaces with little or no loss in accuracy.
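The limited-discrepancy search space mentioned above can be illustrated on a toy sequence-labeling problem: follow a classifier's recommendation at each step, but allow up to k "discrepancies" that override it. The sketch below uses a hypothetical `heuristic_bit` function in place of the learned classifier and binary outputs in place of real structured outputs.

```python
def lds_outputs(n, heuristic_bit, max_discrepancies):
    """Enumerate binary outputs of length n reachable with at most k discrepancies.

    heuristic_bit(prefix): the bit a greedy classifier would emit next.
    A "discrepancy" overrides that recommendation; bounding the number of
    discrepancies keeps the search space small when the classifier is mostly right.
    """
    results = []
    def expand(prefix, k):
        if len(prefix) == n:
            results.append(tuple(prefix))
            return
        b = heuristic_bit(prefix)
        expand(prefix + [b], k)              # follow the heuristic
        if k > 0:
            expand(prefix + [1 - b], k - 1)  # spend one discrepancy
    expand([], max_discrepancies)
    return results

# Toy heuristic: always predict 0; allow 1 discrepancy over length-3 outputs.
outs = lds_outputs(3, lambda p: 0, 1)
print(len(outs))  # 4: the all-zeros output plus the three single-override variants
```

The point of the construction is that the size of this space is controlled by the discrepancy budget rather than by the full exponential output space, so a small budget suffices when the classifier is accurate.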
HC-Search: A learning framework for search-based structured prediction
 JAIR
Abstract

Cited by 7 (3 self)
Structured prediction is the problem of learning a function that maps structured inputs to structured outputs. Prototypical examples of structured prediction include part-of-speech tagging and semantic segmentation of images. Inspired by the recent successes of search-based structured prediction, we introduce a new framework for structured prediction called HC-Search. Given a structured input, the framework uses a search procedure guided by a learned heuristic H to uncover high quality candidate outputs and then employs a separate learned cost function C to select a final prediction among those outputs. The overall loss of this prediction architecture decomposes into the loss due to H not leading to high quality outputs, and the loss due to C not selecting the best among the generated outputs. Guided by this decomposition, we minimize the overall loss in a greedy stagewise manner by first training H to quickly uncover high quality outputs via imitation learning, and then training C to correctly rank the outputs generated via H according to their true losses. Importantly, this training procedure is sensitive to the particular loss function of interest and the time-bound allowed for predictions. Experiments on several benchmark domains show that our approach significantly outperforms several state-of-the-art methods.
Learning for Structured Prediction Using Approximate Subgradient Descent with Working Sets
Abstract

Cited by 4 (0 self)
We propose a working set based approximate subgradient descent algorithm to minimize the margin-sensitive hinge loss arising from the soft constraints in max-margin learning frameworks, such as the structured SVM. We focus on the setting of general graphical models, such as loopy MRFs and CRFs commonly used in image segmentation, where exact inference is intractable and the most violated constraints can only be approximated, voiding the optimality guarantees of the structured SVM's cutting plane algorithm as well as reducing the robustness of existing subgradient based methods. We show that the proposed method obtains better approximate subgradients through the use of working sets, leading to improved convergence properties and increased reliability. Furthermore, our method allows new constraints to be randomly sampled instead of computed using the more expensive approximate inference techniques such as belief propagation and graph cuts, which can be used to reduce learning time at only a small cost of performance. We demonstrate the strength of our method empirically on the segmentation of a new publicly available electron microscopy dataset as well as the popular MSRC dataset and show state-of-the-art results.
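The working-set idea above amounts to taking the most violated constraint from a small candidate pool rather than from exact loss-augmented inference. The sketch below treats the working set as a fixed list of candidate outputs, a deliberate simplification of the paper's sampled and accumulated constraints; all names are illustrative.

```python
import numpy as np

def hinge_subgradient(w, feats, costs, gold_idx, working_set):
    """Subgradient of the margin-rescaled structured hinge loss, restricted
    to a working set of candidate outputs (standing in for exact
    loss-augmented inference, which is intractable in loopy models).

    feats[y]: feature vector of candidate output y; costs[y]: its task loss.
    """
    # Most violated constraint within the working set.
    y_hat = max(working_set, key=lambda y: w @ feats[y] + costs[y])
    margin = w @ feats[y_hat] + costs[y_hat] - w @ feats[gold_idx]
    if margin <= 0:
        return np.zeros_like(w)          # constraint satisfied: zero subgradient
    return feats[y_hat] - feats[gold_idx]

# One subgradient step on a toy 2-feature problem with 3 candidate outputs.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
costs = [0.0, 2.0, 1.0]                   # output 0 is the gold
w = np.zeros(2)
g = hinge_subgradient(w, feats, costs, 0, working_set=[0, 1, 2])
w -= 0.1 * g                              # descend on the hinge loss
print(g, w)
```

Because the maximization only ranges over the working set, the subgradient is approximate; the paper's contribution is showing that maintained working sets make this approximation more reliable than one-shot approximate inference.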
Direct Error Rate Minimization of Hidden Markov Models
In Proceedings of INTERSPEECH, 2011
Abstract

Cited by 3 (0 self)
We explore discriminative training of HMM parameters that directly minimizes the expected error rate. In discriminative training one is interested in training a system to minimize a desired error function, like word error rate, phone error rate, or frame error rate. We review a recent method (McAllester, Hazan and Keshet, 2010), which introduces an analytic expression for the gradient of the expected error rate. The analytic expression leads to a perceptron-like update rule, which is adapted here for training of HMMs in an online fashion. While the proposed method can work with any type of error function used in speech recognition, we evaluated it on phoneme recognition of TIMIT, when the desired error function used for training was frame error rate. Except for the case of GMM with a single mixture per state, the proposed update rule provides lower error rates, both in terms of frame error rate and phone error rate, than other approaches, including MCE and large margin. Index Terms: hidden Markov models, online learning, direct error minimization, discriminative training, automatic speech recognition, minimum phone error, minimum frame error
Learning efficient random maximum a posteriori predictors with non-decomposable loss functions
In Advances in Neural Information Processing Systems
Abstract

Cited by 3 (2 self)
In this work we develop efficient methods for learning random MAP predictors for structured label problems. In particular, we construct posterior distributions over perturbations that can be adjusted via stochastic gradient methods. We show that every smooth posterior distribution would suffice to define a smooth PAC-Bayesian risk bound suitable for gradient methods. In addition, we relate the posterior distributions to computational properties of the MAP predictors. We suggest multiplicative posteriors to learn supermodular potential functions that accompany specialized MAP predictors such as graph-cuts. We also describe label-augmented posterior models that can use efficient MAP approximations, such as those arising from linear program relaxations.
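The "random MAP predictor" idea above rests on perturb-and-MAP: add random noise to the potentials and return the maximizer. On a tiny unstructured problem the exact version is the Gumbel-max trick, sketched below; the paper's contribution concerns learnable, low-dimensional perturbations that keep structured MAP solvers (e.g. graph-cuts) applicable, which this toy does not attempt.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_map_sample(potentials):
    """Draw a sample by perturbing potentials with Gumbel noise and maximizing.

    Adding i.i.d. Gumbel noise to every assignment's score and taking the
    argmax samples exactly from the Gibbs distribution over assignments.
    """
    gumbel = -np.log(-np.log(rng.random(len(potentials))))
    return int(np.argmax(potentials + gumbel))

# Empirical check on a tiny 3-state problem: sample frequencies should track
# the softmax of the potentials.
potentials = np.array([1.0, 0.0, -1.0])
counts = np.bincount([random_map_sample(potentials) for _ in range(5000)], minlength=3)
probs = np.exp(potentials) / np.exp(potentials).sum()
print(counts / 5000, probs)  # the two vectors should roughly agree
```

Making the noise distribution itself a learnable posterior, rather than fixed Gumbel noise, is what connects this construction to the PAC-Bayesian bounds in the abstract.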