Structured learning with approximate inference
- Advances in Neural Information Processing Systems
Cited by 31 (1 self)
In many structured prediction problems, the highest-scoring labeling is hard to compute exactly, leading to the use of approximate inference methods. However, when inference is used in a learning algorithm, a good approximation of the score may not be sufficient. We show in particular that learning can fail even with an approximate inference method with rigorous approximation guarantees. There are two reasons for this. First, approximate methods can effectively reduce the expressivity of an underlying model by making it impossible to choose parameters that reliably give good predictions. Second, approximations can respond to parameter changes in such a way that standard learning algorithms are misled. In contrast, we give two positive results in the form of learning bounds for the use of LP-relaxed inference in structured perceptron and empirical risk minimization settings. We argue that without understanding combinations of inference and learning, such as these, that are appropriately compatible, learning performance under approximate inference cannot be guaranteed.
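To make the interaction between inference and learning concrete, here is a minimal sketch of a perceptron update in which the inference routine is a pluggable argument. It is a toy, not the paper's setting: the output space is just a small label set, inference is exact enumeration, and the names `joint_features`, `exact_inference`, and `perceptron_epoch` are hypothetical. In a real structured problem, `infer` is where an approximate method such as LP-relaxed inference would be substituted.

```python
from typing import Callable, List, Sequence, Tuple


def joint_features(x: Sequence[float], y: int, num_labels: int) -> List[float]:
    """Hypothetical joint feature map: copy x into the block owned by label y."""
    phi = [0.0] * (len(x) * num_labels)
    start = y * len(x)
    phi[start:start + len(x)] = list(x)
    return phi


def exact_inference(w: Sequence[float], x: Sequence[float], num_labels: int) -> int:
    """Return the highest-scoring label by enumerating all labels."""
    scores = [
        sum(wi * pi for wi, pi in zip(w, joint_features(x, y, num_labels)))
        for y in range(num_labels)
    ]
    return max(range(num_labels), key=scores.__getitem__)


def perceptron_epoch(
    w: List[float],
    data: Sequence[Tuple[Sequence[float], int]],
    num_labels: int,
    infer: Callable[[Sequence[float], Sequence[float], int], int] = exact_inference,
) -> int:
    """Run one pass over the data, updating w in place; return the mistake count."""
    mistakes = 0
    for x, y in data:
        y_hat = infer(w, x, num_labels)
        if y_hat != y:
            phi_true = joint_features(x, y, num_labels)
            phi_hat = joint_features(x, y_hat, num_labels)
            # Move w toward the true output's features and away from the prediction's.
            for j in range(len(w)):
                w[j] += phi_true[j] - phi_hat[j]
            mistakes += 1
    return mistakes
```

Because the update depends only on what `infer` returns, any systematic error in the inference routine feeds directly into the learned weights, which is the kind of coupling the abstract is concerned with.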
Bundle Methods for Regularized Risk Minimization
Cited by 13 (2 self)
A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ɛ) steps to ɛ precision for general convex problems and in O(log(1/ɛ)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available datasets.
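As one concrete instance of a regularized risk functional, the sketch below evaluates the linear SVM case, J(w) = (λ/2)‖w‖² + (1/m) Σ max(0, 1 − yᵢ⟨w, xᵢ⟩), together with one subgradient. The function names, the λ/2 scaling, and the toy data in the usage note are assumptions made for illustration. This is not the paper's solver; it only shows the kind of objective value and subgradient that a bundle-style method would query at each step.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def regularized_risk(
    w: Sequence[float],
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],
    lam: float,
) -> Tuple[float, List[float]]:
    """Return (J(w), subgradient) for L2-regularized average hinge loss."""
    m = len(xs)
    hinge_total = 0.0
    grad = [lam * wi for wi in w]  # gradient of (lam / 2) * ||w||^2
    for x, y in zip(xs, ys):
        margin = y * dot(w, x)
        if margin < 1.0:
            # Example violates the margin: add its loss and its subgradient.
            hinge_total += 1.0 - margin
            for j, xj in enumerate(x):
                grad[j] -= y * xj / m
    value = 0.5 * lam * dot(w, w) + hinge_total / m
    return value, grad
```

For example, `regularized_risk([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [1, -1], 1.0)` evaluates the objective at the origin, where both examples violate the margin.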
Exponential family graph matching and ranking
- CoRR
Cited by 4 (0 self)
We present a method for learning max-weight matching predictors in bipartite graphs. The method consists of performing maximum a posteriori estimation in exponential families with sufficient statistics that encode permutations and data features. Although inference is in general hard, we show that for one very relevant application, document ranking, exact inference is efficient. For general model instances, an appropriate sampler is readily available. Contrary to existing max-margin matching models, our approach is statistically consistent and, in addition, experiments with increasing sample sizes indicate superior improvement over such models. We apply the method to graph matching in computer vision as well as to a standard benchmark dataset for learning document ranking, in which we obtain state-of-the-art results, in particular improving on max-margin variants. The drawback of this method with respect to max-margin alternatives is its runtime for large graphs, which is comparatively high.
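A max-weight matching predictor outputs the permutation that maximizes the total weight of the matched pairs. The sketch below does this by brute-force enumeration over permutations, which is only feasible for very small graphs. The function name and the weight-matrix representation are assumptions for illustration, and nothing here reproduces the paper's learning procedure or its sampler.

```python
from itertools import permutations
from typing import Sequence, Tuple


def max_weight_matching(
    weights: Sequence[Sequence[float]],
) -> Tuple[float, Tuple[int, ...]]:
    """Return (best total weight, best permutation) for a square weight matrix.

    perm[i] is the right-hand node matched to left-hand node i.
    """
    n = len(weights)
    best_score = float("-inf")
    best_perm: Tuple[int, ...] = ()
    for perm in permutations(range(n)):
        score = sum(weights[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm
```

The enumeration visits n! permutations, so a practical implementation would use a polynomial-time assignment solver instead; the brute-force form is kept here only because it makes the prediction rule explicit.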
Sequential Learning of Classifiers for Structured Prediction Problems
Cited by 4 (2 self)
Many classification problems with structured outputs can be regarded as a set of interrelated sub-problems where constraints dictate valid variable assignments. The standard approaches to these problems include either independent learning of individual classifiers for each of the sub-problems or joint learning of the entire set of classifiers with the constraints enforced during learning. We propose an intermediate approach where we learn these classifiers in a sequence using previously learned classifiers to guide learning of the next classifier by enforcing constraints between their outputs. We provide a theoretical motivation to explain why this learning protocol is expected to outperform both alternatives when individual problems have different ‘complexity’. This analysis motivates an algorithm for choosing a preferred order of classifier learning. We evaluate our technique on artificial experiments and on the entity and relation identification problem where the proposed method outperforms both joint and independent learning.
Generalization bounds and consistency for latent structural probit and ramp loss
- In Proc. of NIPS, 2011
Cited by 3 (1 self)
We consider latent structural versions of probit loss and ramp loss. We show that these surrogate loss functions are consistent in the strong sense that for any feature map (finite or infinite dimensional) they yield predictors approaching the infimum task loss achievable by any linear predictor over the given features. We also give finite sample generalization bounds (convergence rates) for these loss functions. These bounds suggest that probit loss converges more rapidly. However, ramp loss is more easily optimized on a given sample.
PAC-Bayesian Analysis of Co-clustering and Beyond
Cited by 3 (1 self)
We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish between two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization
Better Multiclass Classification via a Margin-Optimized Single Binary Problem
, 2008
Cited by 2 (1 self)
We develop a new multiclass classification method that reduces the multiclass problem to a single binary classifier (SBC). Our method constructs the binary problem by embedding smaller binary problems into a single space. A good embedding will allow for large margin classification. We show that the construction of such an embedding can be reduced to the task of learning linear combinations of kernels. We provide a bound on the generalization error of the multiclass classifier obtained with our construction and outline the conditions for its consistency. Our empirical examination of the new method indicates that it outperforms one-vs-all, all-pairs and the error-correcting output coding scheme at least when the number of classes is small.
A Learning Theory Framework for Association Rules and Sequential Events
Cited by 2 (2 self)
We present a framework and generalization analysis for the use of association rules in the setting of supervised learning. We are specifically interested in a sequential event prediction problem where data are revealed one by one, and the goal is to determine what will next be revealed. In the context of this problem, algorithms based on association rules have a distinct advantage over classical statistical and machine learning methods; however, to our knowledge there has not previously been a theoretical foundation established for using association rules in supervised learning. We present two simple algorithms that incorporate association rules. These algorithms can be used both for sequential event prediction and for supervised classification. We provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence” measure that provides a weaker minimum support condition that has advantages over the strict minimum support. The paper brings together ideas from statistical learning theory, association rule mining and Bayesian analysis.
Magic Moments for Structured Output Prediction
Cited by 1 (0 self)
Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: PAC bound, structured output prediction, discriminative learning, Z-score, discriminant analysis.
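To illustrate what "the first two moments of the scoring function over the output space" means, the sketch below enumerates every label sequence of a tiny problem, computes the mean and variance of the scores, and reports the Z-score of one reference output. Everything specific here is an assumption: the score is a sum of per-position terms only, outputs are weighted uniformly, and the names `sequence_score`, `score_moments`, and `z_score` are hypothetical. The abstract states that the moments are computed efficiently; the brute-force enumeration below does not reproduce that, nor the training objectives built from them.

```python
from itertools import product
from math import sqrt
from typing import Sequence, Tuple


def sequence_score(unary: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Score of a label sequence: sum of per-position label scores."""
    return sum(unary[t][label] for t, label in enumerate(y))


def score_moments(unary: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Mean and (population) variance of the score over all label sequences."""
    label_ranges = [range(len(position)) for position in unary]
    scores = [sequence_score(unary, y) for y in product(*label_ranges)]
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, variance


def z_score(unary: Sequence[Sequence[float]], y_ref: Sequence[int]) -> float:
    """How many standard deviations the reference output scores above the mean."""
    mean, variance = score_moments(unary)
    if variance == 0.0:
        raise ValueError("all outputs have the same score; Z-score is undefined")
    return (sequence_score(unary, y_ref) - mean) / sqrt(variance)
```

A large Z-score means the reference output stands well clear of the bulk of the score distribution, which is the intuition behind using the first two moments as a training signal.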
Simple Risk Bounds for Position-Sensitive Max-Margin Ranking Algorithms
We present risk bounds for position-sensitive max-margin ranking algorithms that follow straightforwardly from a structural result for Rademacher averages presented by [1]. We apply this result to pairwise and listwise hinge loss that are position-sensitive by virtue of rescaling the margin by a pairwise or listwise position-sensitive prediction loss.

