Results 1-10 of 20
Structured learning with approximate inference
 Advances in Neural Information Processing Systems
Cited by 48 (1 self)
Abstract
In many structured prediction problems, the highest-scoring labeling is hard to compute exactly, leading to the use of approximate inference methods. However, when inference is used in a learning algorithm, a good approximation of the score may not be sufficient. We show in particular that learning can fail even with an approximate inference method with rigorous approximation guarantees. There are two reasons for this. First, approximate methods can effectively reduce the expressivity of an underlying model by making it impossible to choose parameters that reliably give good predictions. Second, approximations can respond to parameter changes in such a way that standard learning algorithms are misled. In contrast, we give two positive results in the form of learning bounds for the use of LP-relaxed inference in structured perceptron and empirical risk minimization settings. We argue that without understanding such appropriately compatible combinations of inference and learning, learning performance under approximate inference cannot be guaranteed.
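The structured perceptron that the bounds apply to fits in a few lines. Everything below (the toy emission/transition feature map, the brute-force argmax standing in for exact inference) is an illustrative assumption, not the paper's setup:

```python
import itertools
import numpy as np

def feature_vector(x, y, n_labels):
    """Joint feature map: per-label emission sums plus transition counts.
    x: (T, d) array of observations, y: tuple of T labels."""
    T, d = x.shape
    emit = np.zeros((n_labels, d))
    trans = np.zeros((n_labels, n_labels))
    for t, label in enumerate(y):
        emit[label] += x[t]
        if t > 0:
            trans[y[t - 1], label] += 1
    return np.concatenate([emit.ravel(), trans.ravel()])

def predict(w, x, n_labels):
    """Exact inference by brute-force enumeration (exponential in T;
    Viterbi or an LP relaxation would replace this in practice)."""
    T = x.shape[0]
    return max(itertools.product(range(n_labels), repeat=T),
               key=lambda y: w @ feature_vector(x, y, n_labels))

def perceptron_step(w, x, y_true, n_labels):
    """Structured perceptron update: w += phi(x, y) - phi(x, y_hat)."""
    y_hat = predict(w, x, n_labels)
    if y_hat != tuple(y_true):
        w = w + feature_vector(x, y_true, n_labels) - feature_vector(x, y_hat, n_labels)
    return w
```

Swapping `predict` for an approximate method is exactly where the abstract's negative results bite: the update then depends on how the approximation responds to parameter changes.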
Bundle Methods for Regularized Risk Minimization
Cited by 36 (2 self)
Abstract
A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and the Lasso, amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework, we present tight convergence bounds, which show that our algorithm converges in O(1/ɛ) steps to ɛ precision for general convex problems and in O(log(1/ɛ)) steps for continuously differentiable problems. We demonstrate the performance of our general-purpose solver on a variety of publicly available datasets.
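The bundle idea can be sketched as a cutting-plane loop that lower-bounds the empirical risk by its tangent planes and minimizes the regularized model at each step. The hinge risk, the SLSQP inner solve, and all names below are assumptions for illustration, not the paper's BMRM implementation:

```python
import numpy as np
from scipy.optimize import minimize

def hinge_risk_and_subgradient(w, X, y):
    """Empirical hinge risk of a linear classifier and one subgradient."""
    margins = 1.0 - y * (X @ w)
    active = margins > 0
    risk = np.maximum(margins, 0.0).mean()
    grad = -(y[active, None] * X[active]).sum(axis=0) / len(y)
    return risk, grad

def bmrm(X, y, lam=0.1, n_iter=15):
    """Cutting-plane sketch of the bundle approach: accumulate tangent
    planes of the risk and minimize lam/2 ||w||^2 + max over cuts.
    The inner problem is solved here with SLSQP over z = (w, xi); a
    specialized QP solver would be used in practice."""
    d = X.shape[1]
    w = np.zeros(d)
    A, B = [np.zeros(d)], [0.0]  # trivial cut: risk >= 0
    for _ in range(n_iter):
        r, a = hinge_risk_and_subgradient(w, X, y)
        A.append(a)
        B.append(r - a @ w)  # tangent plane: risk(v) >= a @ v + b
        obj = lambda z: 0.5 * lam * (z[:d] @ z[:d]) + z[d]
        cons = [{"type": "ineq",
                 "fun": lambda z, ai=ai, bi=bi: z[d] - ai @ z[:d] - bi}
                for ai, bi in zip(A, B)]
        xi0 = max(ai @ w + bi for ai, bi in zip(A, B))  # feasible start
        res = minimize(obj, np.append(w, xi0), method="SLSQP", constraints=cons)
        w = res.x[:d]
    return w
```

Each iteration only needs one risk/subgradient evaluation, which is what makes the scheme modular across the different losses the abstract lists.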
PAC-Bayesian Analysis of Co-clustering and Beyond
Cited by 10 (5 self)
Abstract
We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish between two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization, and the results provide generalization bounds, regularization …
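For the discriminative task (predicting missing matrix entries), co-clustering-based prediction reduces to averaging within cluster blocks. The sketch below assumes the row/column cluster assignments are already fixed; the paper optimizes them under the mutual-information trade-off, so everything here is illustrative:

```python
import numpy as np

def cocluster_predict(M, observed, row_c, col_c, n_rc, n_cc):
    """Predict every entry of M by the mean of the observed entries in its
    (row-cluster, column-cluster) block. `observed` is a boolean mask of
    which entries of M were actually seen."""
    pred = np.zeros_like(M, dtype=float)
    for a in range(n_rc):
        for b in range(n_cc):
            block = (row_c[:, None] == a) & (col_c[None, :] == b)
            seen = block & observed
            pred[block] = M[seen].mean() if seen.any() else 0.0
    return pred
```

In the MovieLens setting the rows would be users and the columns movies; the quality of the predictions then hinges entirely on how the cluster variables are chosen.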
Generalization bounds and consistency for latent structural probit and ramp loss
 In Proc. of NIPS, 2011
Cited by 9 (1 self)
Abstract
We consider latent structural versions of probit loss and ramp loss. We show that these surrogate loss functions are consistent in the strong sense that for any feature map (finite or infinite dimensional) they yield predictors approaching the infimum task loss achievable by any linear predictor over the given features. We also give finite-sample generalization bounds (convergence rates) for these loss functions. These bounds suggest that probit loss converges more rapidly. However, ramp loss is more easily optimized on a given sample.
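The ramp loss is easy to state next to the hinge it clips. This sketch uses the standard textbook forms, which is an assumption about notation rather than the paper's exact latent structural definitions:

```python
import numpy as np

def hinge_loss(margin):
    """Convex hinge surrogate: grows without bound for large negative margins."""
    return np.maximum(0.0, 1.0 - margin)

def ramp_loss(margin):
    """Ramp loss: the hinge clipped at 1, so each outlier contributes at
    most 1. Non-convex; often written as a difference of two hinges."""
    return np.minimum(1.0, np.maximum(0.0, 1.0 - margin))
```

The boundedness of the ramp is what makes it behave like the task loss for badly misclassified points, at the cost of convexity.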
Sequential Learning of Classifiers for Structured Prediction Problems
Cited by 6 (2 self)
Abstract
Many classification problems with structured outputs can be regarded as a set of interrelated subproblems where constraints dictate valid variable assignments. The standard approaches to these problems include either independent learning of individual classifiers for each of the subproblems or joint learning of the entire set of classifiers with the constraints enforced during learning. We propose an intermediate approach in which we learn these classifiers in a sequence, using previously learned classifiers to guide learning of the next classifier by enforcing constraints between their outputs. We provide a theoretical motivation explaining why this learning protocol is expected to outperform both alternatives when the individual problems differ in ‘complexity’. This analysis motivates an algorithm for choosing a preferred order of classifier learning. We evaluate our technique on artificial experiments and on the entity and relation identification problem, where the proposed method outperforms both joint and independent learning.
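A loose sketch of the sequential protocol, with logistic regression as a hypothetical subproblem learner and constraint enforcement reduced to feeding the first classifier's decisions to the second as an input. The paper's actual constraint mechanism is richer; everything below is a simplifying assumption:

```python
import numpy as np

def train_logreg(X, y, lr=0.5, epochs=200):
    """Tiny logistic-regression learner by gradient descent, standing in
    for the individual subproblem classifiers (hypothetical choice)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def sequential_fit(X, y_first, y_second):
    """Learn the first (easier) classifier, then expose its decisions to
    the second learner, a crude stand-in for enforcing constraints
    between their outputs during learning."""
    w1 = train_logreg(X, y_first)
    pred1 = (X @ w1 > 0).astype(float)
    X2 = np.column_stack([X, pred1])  # second learner sees first's decisions
    w2 = train_logreg(X2, y_second)
    return w1, w2
```

The ordering question the abstract raises corresponds here to choosing which subproblem plays the role of `y_first`.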
Exponential family graph matching and ranking
 CoRR
Cited by 5 (0 self)
Abstract
We present a method for learning max-weight matching predictors in bipartite graphs. The method consists of performing maximum a posteriori estimation in exponential families with sufficient statistics that encode permutations and data features. Although inference is in general hard, we show that for one very relevant application, document ranking, exact inference is efficient. For general model instances, an appropriate sampler is readily available. Contrary to existing max-margin matching models, our approach is statistically consistent and, in addition, experiments with increasing sample sizes indicate growing improvement over such models. We apply the method to graph matching in computer vision as well as to a standard benchmark dataset for learning document ranking, on which we obtain state-of-the-art results, in particular improving on max-margin variants. The drawback of this method with respect to max-margin alternatives is its comparatively high runtime for large graphs.
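The prediction step, max-weight bipartite matching under learned pairwise scores, can be sketched with an off-the-shelf assignment solver; the score matrix below is made up for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(scores):
    """Max-weight bipartite matching (the MAP inference step for a
    matching predictor): scores[i, j] is the learned compatibility of
    pairing left node i with right node j."""
    rows, cols = linear_sum_assignment(scores, maximize=True)
    return dict(zip(rows, cols))

scores = np.array([[3.0, 1.0, 0.0],
                   [1.0, 4.0, 2.0],
                   [0.0, 2.0, 5.0]])
```

Learning then amounts to fitting the parameters that produce `scores`; the abstract's runtime caveat concerns repeating this solve (or sampling) inside training for large graphs.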
Regret analysis for performance metrics in multi-label classification: The case of Hamming and subset zero-one loss
 In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
Cited by 4 (0 self)
Abstract
In multi-label classification (MLC), each instance is associated with a subset of labels instead of a single class, as in conventional classification, and this generalization enables the definition of a multitude of loss functions. Indeed, a large number of losses have already been proposed and are commonly applied as performance metrics in experimental studies. However, even though these loss functions are of quite different natures, a concrete connection between the type of multi-label classifier used and the loss to be minimized is rarely established, implicitly giving the misleading impression that the same method can be optimal for different loss functions. In this paper, we elaborate on risk minimization and the connection between loss functions in MLC, both theoretically and empirically. In particular, we compare two important loss functions, namely the Hamming loss and the subset 0/1 loss. We perform a regret analysis showing how poor a classifier intended to minimize the subset 0/1 loss can become in terms of Hamming loss, and vice versa. The theoretical results are corroborated by experimental studies, and their implications for MLC methods are discussed in a broader context.
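The Hamming-vs-subset-0/1 gap comes from their different risk minimizers: the subset 0/1 loss is minimized by the joint mode of the label distribution, the Hamming loss by the vector of marginal modes. A small sketch with a hypothetical two-label distribution makes this concrete:

```python
import numpy as np
from itertools import product

def hamming_loss(y, yhat):
    """Fraction of label positions that disagree."""
    return float(np.mean(y != yhat))

def subset_zero_one(y, yhat):
    """1 unless the entire label vector is predicted exactly."""
    return float(not np.array_equal(y, yhat))

# Toy distribution over two binary labels (numbers are hypothetical):
dist = {(0, 0): 0.4, (0, 1): 0.3, (1, 1): 0.3}

def expected_loss(yhat, loss):
    return sum(p * loss(np.array(y), np.array(yhat)) for y, p in dist.items())

best_subset = min(product([0, 1], repeat=2),
                  key=lambda s: expected_loss(s, subset_zero_one))  # joint mode
best_hamming = min(product([0, 1], repeat=2),
                   key=lambda s: expected_loss(s, hamming_loss))    # marginal modes
```

Here the joint mode is (0, 0) while the marginal modes give (0, 1), so a classifier optimal for one loss is suboptimal for the other, which is the phenomenon the regret analysis quantifies.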
Better Multiclass Classification via a Margin-Optimized Single Binary Problem
, 2008
Cited by 2 (1 self)
Abstract
We develop a new multiclass classification method that reduces the multiclass problem to a single binary classifier (SBC). Our method constructs the binary problem by embedding smaller binary problems into a single space. A good embedding will allow for large-margin classification. We show that the construction of such an embedding can be reduced to the task of learning linear combinations of kernels. We provide a bound on the generalization error of the multiclass classifier obtained with our construction and outline the conditions for its consistency. Our empirical examination of the new method indicates that it outperforms one-vs-all, all-pairs, and the error-correcting output coding scheme, at least when the number of classes is small.
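A crude sketch of the single-binary-classifier reduction, using a fixed outer-product embedding and a plain perceptron. The paper instead learns the embedding via linear combinations of kernels, so everything below is a simplifying assumption:

```python
import numpy as np

def embed(x, c, n_classes):
    """Embed an (example, candidate-class) pair into a single space via an
    outer product with the class indicator: one fixed choice of embedding."""
    e = np.zeros(n_classes)
    e[c] = 1.0
    return np.outer(x, e).ravel()

def to_binary_dataset(X, y, n_classes):
    """Each multiclass example yields one positive pair (its true class)
    and one negative pair per wrong class, all fed to one binary learner."""
    Z, t = [], []
    for x, c in zip(X, y):
        for k in range(n_classes):
            Z.append(embed(x, k, n_classes))
            t.append(1.0 if k == c else -1.0)
    return np.array(Z), np.array(t)

def train_binary(Z, t, epochs=10):
    """Plain perceptron on the single binary problem."""
    w = np.zeros(Z.shape[1])
    for _ in range(epochs):
        for z, label in zip(Z, t):
            if label * (w @ z) <= 0:
                w += label * z
    return w

def predict_class(w, x, n_classes):
    """Multiclass decision: the candidate class whose pair scores highest."""
    return int(np.argmax([w @ embed(x, k, n_classes) for k in range(n_classes)]))
```

The margin of the single binary problem is what the paper's embedding is optimized for; with this fixed embedding the reduction still works, just without that optimization.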
A Learning Theory Framework for Association Rules and Sequential Events
Cited by 2 (2 self)
Abstract
We present a framework and generalization analysis for the use of association rules in the setting of supervised learning. We are specifically interested in a sequential event prediction problem where data are revealed one by one, and the goal is to determine what will be revealed next. In the context of this problem, algorithms based on association rules have a distinct advantage over classical statistical and machine learning methods; however, to our knowledge no theoretical foundation has previously been established for using association rules in supervised learning. We present two simple algorithms that incorporate association rules. These algorithms can be used both for sequential event prediction and for supervised classification. We provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence” measure that provides a weaker minimum support condition and has advantages over the strict minimum support. The paper brings together ideas from statistical learning theory, association rule mining, and Bayesian analysis.
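The adjusted-confidence idea can be sketched as a shrunk confidence estimate; the exact form and the parameter `K` below are assumptions based on the description (a Bayesian-style shrinkage of the confidence), not necessarily the paper's definition:

```python
def confidence(n_ab, n_a):
    """Classical rule confidence for a -> b: estimate of P(b | a) as
    #(a and b) / #a."""
    return n_ab / n_a if n_a else 0.0

def adjusted_confidence(n_ab, n_a, K=1.0):
    """Shrunk estimate #(a and b) / (#a + K): rules with tiny support are
    pulled toward zero, giving a soft alternative to a hard minimum-support
    cutoff. K is a prior-strength parameter (value here is illustrative)."""
    return n_ab / (n_a + K)
```

A rule observed once has confidence 1.0 but adjusted confidence 0.5 (with K = 1), so it ranks below a well-supported rule at 90/101, which is the kind of reordering a strict support threshold achieves only by discarding rules outright.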
PAC-Bayes-Bernstein inequality for martingales and its application to multi-armed bandits
 JMLR Workshop and Conference Proceedings
Cited by 2 (2 self)
Abstract
We develop a new tool for data-dependent analysis of the exploration-exploitation trade-off in learning under limited feedback. Our tool is based on two main ingredients. The first ingredient is a new concentration inequality that makes it possible to control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. The second ingredient is an application of this inequality to the exploration-exploitation trade-off via importance-weighted sampling. We apply the new tool to the stochastic multi-armed bandit problem; however, the main importance of this paper is the development and understanding of the new tool rather than improvement of existing algorithms for stochastic multi-armed bandits. In follow-up work we demonstrate that the new tool can improve over the state of the art in structurally richer problems, such as stochastic multi-armed bandits with side information (Seldin et al., 2011a).
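The importance-weighted sampling ingredient is simple to illustrate. The function below is a generic sketch of the standard unbiased estimator used under limited (bandit) feedback, not the paper's construction:

```python
import numpy as np

def importance_weighted_estimate(arm_played, reward, probs, n_arms):
    """Unbiased importance-weighted reward estimate under limited feedback:
    only the played arm's reward is observed, but dividing it by the
    probability of playing that arm makes the estimate unbiased for every
    arm (unplayed arms get an estimate of zero this round)."""
    est = np.zeros(n_arms)
    est[arm_played] = reward / probs[arm_played]
    return est
```

Averaging `probs[a] * importance_weighted_estimate(a, ...)` over arms recovers the true reward vector exactly, which is the unbiasedness the concentration inequality then controls; the price is the variance blow-up for rarely played arms.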