Results 1–10 of 24
Proximal stochastic dual coordinate ascent
CoRR
Cited by 29 (4 self)
Abstract: We introduce a proximal version of the dual coordinate ascent method. We demonstrate how the derived algorithmic framework can be used for numerous regularized loss minimization problems, including ℓ1 regularization and structured output SVM. The convergence rates we obtain match, and sometimes improve, state-of-the-art results.
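The key ingredient in such proximal frameworks is the proximal operator of the regularizer, which for ℓ1 is soft-thresholding. The sketch below is not the paper's dual coordinate ascent; it is a minimal primal proximal-gradient (ISTA) illustration of how a prox step is used for ℓ1-regularized least squares, with all problem data hypothetical:

```python
import numpy as np

def prox_l1(v, lam):
    """Proximal operator of lam*||.||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista(X, y, lam, steps=500):
    """Proximal gradient descent on 0.5*||Xw - y||^2 + lam*||w||_1,
    standing in (for illustration only) for the paper's dual method."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2   # Lipschitz constant of the smooth part
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y)    # gradient of the least-squares term
        w = prox_l1(w - grad / L, lam / L)
    return w
```

The same template applies to any regularizer with a cheap prox, which is the genericity the abstract emphasizes.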
Towards Minimax Policies for Online Linear Optimization with Bandit Feedback
In COLT, 2012
Cited by 24 (1 self)
Abstract: We address the online linear optimization problem with bandit feedback. Our contribution is twofold. First, we provide an algorithm (based on exponential weights) with a regret of order √(dn log N) for any finite action set with N actions, under the assumption that the instantaneous loss is bounded by 1. This shaves off an extraneous √d factor compared to previous works, and gives a regret bound of order d√(n log n) for any compact set of actions. Without further assumptions on the action set, this last bound is minimax optimal up to a logarithmic factor. Interestingly, our result also shows that the minimax regret for bandit linear optimization with expert advice in d dimensions is the same as for the basic d-armed bandit with expert advice. Our second contribution is to show how to use the Mirror Descent algorithm to obtain computationally efficient strategies with minimax optimal regret bounds in specific examples. More precisely, we study two canonical action sets: the hypercube and the Euclidean ball. In the former case, we obtain the first computationally efficient algorithm with a d√n regret, thus improving by a factor √(d log n) over the best known result for a computationally efficient algorithm. In the latter case, our approach gives the first algorithm with a √(dn log n) regret, again shaving off an extraneous √d compared to previous works.
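The exponential-weights idea underlying the first algorithm can be illustrated in the simpler full-information setting (the bandit version additionally builds loss estimates from partial feedback, which this sketch omits); losses here are hypothetical:

```python
import numpy as np

def exp_weights(loss_matrix, eta):
    """Full-information exponential weights over N actions.
    loss_matrix: shape (T, N), entries in [0, 1]. Returns expected regret."""
    T, N = loss_matrix.shape
    cum_loss = np.zeros(N)
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * cum_loss)      # weight decays with past loss
        p = w / w.sum()                  # play action i with probability p[i]
        total += p @ loss_matrix[t]      # expected loss this round
        cum_loss += loss_matrix[t]
    return total - cum_loss.min()        # regret vs. best fixed action
```

The classical guarantee for this scheme is regret at most log(N)/eta + eta*T/8, which after tuning eta gives the √(T log N) dependence that the bandit analysis refines.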
Structured sparsity and generalization
J. Machine Learning Research, 2012
Cited by 17 (4 self)
Abstract: We present a data-dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning, and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite-dimensional setting, such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels.
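For concreteness, the group Lasso regularizer covered by the bound sums Euclidean norms of coordinate blocks, and the same expression handles overlapping groups; a minimal sketch (group structure hypothetical):

```python
import numpy as np

def group_lasso_penalty(w, groups):
    """Sum of Euclidean norms over (possibly overlapping) index groups.
    groups: iterable of index lists partitioning or covering the coordinates."""
    return sum(np.linalg.norm(w[np.asarray(g)]) for g in groups)
```

With each group a singleton this reduces to the ordinary Lasso penalty, and with one group covering everything it reduces to the Euclidean norm, which is how the single bound covers the family of schemes listed above.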
Large-scale Multi-label Learning with Missing Labels
Cited by 16 (4 self)
Abstract: The multi-label classification problem has generated significant interest in recent years. However, existing approaches do not adequately address two key challenges: (a) scaling up to problems with a large number (say millions) of labels, and (b) handling data with missing labels. In this paper, we directly address both these problems by studying the multi-label problem in a generic empirical risk minimization (ERM) framework. Our framework, despite being simple, is surprisingly able to encompass several recent label-compression based methods, which can be derived as special cases of our method. To optimize the ERM problem, we develop techniques that exploit the structure of specific loss functions, such as the squared loss function, to obtain efficient algorithms. We further show that our learning framework admits excess risk bounds even in the presence of missing labels. Our bounds are tight and demonstrate better generalization performance for low-rank-promoting trace-norm regularization when compared to (rank-insensitive) Frobenius norm regularization. Finally, we present extensive empirical results on a variety of benchmark datasets and show that our methods perform significantly better than existing label-compression based methods and can scale up to very large datasets such as a Wikipedia dataset that has more than 200,000 labels.
Near-optimal algorithms for online matrix prediction
CoRR
Cited by 16 (6 self)
Abstract: In several online prediction problems of recent interest the comparison class is composed of matrices with bounded entries. For example, in the online max-cut problem, the comparison class is matrices which represent cuts of a given graph, and in online gambling the comparison class is matrices which represent permutations over n teams. Another important example is online collaborative filtering, in which a widely used comparison class is the set of matrices with a small trace norm. In this paper we isolate a property of matrices, which we call (β, τ)-decomposability, and derive an efficient online learning algorithm that enjoys a regret bound of Õ(√(βτT)) for all problems in which the comparison class is composed of (β, τ)-decomposable matrices. By analyzing the decomposability of cut matrices, triangular matrices, and low trace-norm matrices, we derive near-optimal regret bounds for online max-cut, online gambling, and online collaborative filtering. In particular, this resolves (in the affirmative) an open problem posed by Abernethy [2010] and Kleinberg et al. [2010]. Finally, we derive lower bounds for the three problems and show that our upper bounds are optimal up to logarithmic factors. In particular, our lower bound for the online collaborative filtering problem resolves another open problem posed by Shamir and Srebro [2011].
Online Learning in the Embedded Manifold of Low-rank Matrices
Cited by 14 (0 self)
Abstract: When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run-time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is costly to compute, and so is the projection operator that approximates it, we describe another retraction that can be computed efficiently. It has run-time and memory complexity of O((n+m)k) for a rank-k matrix of dimension m×n, when using an online procedure with rank-one gradients. We use this algorithm, LORETA, to learn a matrix-form similarity measure over pairs of documents represented as high-dimensional vectors. LORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained on pre-selected features using the same memory requirements. We further adapt LORETA to learn positive semidefinite low-rank matrices, providing an online algorithm for low-rank metric learning. LORETA also shows consistent improvement over standard weakly supervised methods in a large (1600 classes and 1 million images, using ImageNet) multi-label image classification task.
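The "prohibitively time consuming" baseline mentioned above is projecting back onto the rank-k manifold with a truncated SVD after each gradient step; a minimal sketch of that naive procedure (LORETA itself replaces the SVD with a cheaper second-order retraction, which this sketch does not implement):

```python
import numpy as np

def svd_retract(W, k):
    """Project W onto the manifold of rank-k matrices: the best rank-k
    approximation in Frobenius norm, via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def naive_step(W, grad, lr, k):
    """One naive online step: Euclidean gradient update, then retraction."""
    return svd_retract(W - lr * grad, k)
```

Each such step costs a full SVD, O(min(m, n) * m * n), which is exactly the bottleneck the paper's O((n+m)k) retraction avoids.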
Mixability is Bayes Risk Curvature Relative to Log Loss
"... Given K codes, a standard result from source coding tells us how to design a single universal code with codelengths within log(K) bits of the best code, on any data sequence. Translated to the online learning setting of prediction with expert advice, this result implies that for logarithmic loss one ..."
Abstract

Cited by 9 (6 self)
 Add to MetaCart
Given K codes, a standard result from source coding tells us how to design a single universal code with codelengths within log(K) bits of the best code, on any data sequence. Translated to the online learning setting of prediction with expert advice, this result implies that for logarithmic loss one can guarantee constant regret, which does not grow with the number of outcomes that need to be predicted. In this setting, it is known for which other losses the same guarantee can be given: these are the losses that are mixable. We show that among the mixable losses, log loss is special: in fact, one may understand the class of mixable losses as those that behave like log loss in an essential way. More specifically, a loss is mixable if and only if the curvature of its Bayes risk is at least as large as the curvature of the Bayes risk for log loss (for which the Bayes risk equals the entropy). 1.
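For reference, the standard definition of η-mixability assumed here: a loss ℓ is η-mixable if for every distribution q = (q_1, …, q_K) over predictions a_1, …, a_K there exists a single prediction a with

```latex
\ell(a, y) \;\le\; -\frac{1}{\eta}\,\log \sum_{i=1}^{K} q_i\, e^{-\eta\,\ell(a_i, y)}
\qquad \text{for all outcomes } y .
```

For log loss this holds with η = 1, which recovers the source-coding fact quoted at the start of the abstract: the aggregated prediction is within log(K)/η of the best expert.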
On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions
"... In this paper, we study the generalization properties of online learning based stochastic methods for supervised learning problems where the loss function is dependent on more than one training sample (e.g., metric learning, ranking). We present a generic decoupling technique that enables us to prov ..."
Abstract

Cited by 8 (3 self)
 Add to MetaCart
In this paper, we study the generalization properties of online learning based stochastic methods for supervised learning problems where the loss function is dependent on more than one training sample (e.g., metric learning, ranking). We present a generic decoupling technique that enables us to provide Rademacher complexitybased generalization error bounds. Our bounds are in general tighter than those obtained by Wang et al. (2012) for the same problem. Using our decoupling technique, we are further able to obtain fast convergence rates for strongly convex pairwise loss functions. We are also able to analyze a class of memory efficient online learning algorithms for pairwise learning problems that use only a bounded subset of past training samples to update the hypothesis at each step. Finally, in order to complement our generalization bounds, we propose a novel memory efficient online learning algorithm for higher order learning problems with bounded regret guarantees. 1.
Scalable Matrix-valued Kernel Learning for High-dimensional Nonlinear Multivariate Regression and Granger Causality
Cited by 7 (2 self)
Abstract: We propose a general matrix-valued multiple kernel learning framework for high-dimensional nonlinear multivariate regression problems. This framework allows a broad class of mixed-norm regularizers, including those that induce sparsity, to be imposed on a dictionary of vector-valued Reproducing Kernel Hilbert Spaces. We develop a highly scalable and eigendecomposition-free algorithm that orchestrates two inexact solvers for simultaneously learning both the input and output components of separable matrix-valued kernels. As a key application enabled by our framework, we show how high-dimensional causal inference tasks can be naturally cast as sparse function estimation problems, leading to novel nonlinear extensions of a class of Graphical Granger Causality techniques. Our algorithmic developments and extensive empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds.
A generalized online mirror descent with applications to classification and regression
2012
Cited by 7 (4 self)
Abstract: Online learning algorithms are fast, memory-efficient, easy to implement, and applicable to many prediction problems, including classification, regression, and ranking. Several online algorithms were proposed in the past few decades, some based on additive updates, like the Perceptron, and others on multiplicative updates, like Winnow. Online convex optimization is a general framework to unify both the design and the analysis of online algorithms using a single prediction strategy: online mirror descent. Different first-order online algorithms are obtained by choosing the regularization function in online mirror descent. We generalize online mirror descent to sequences of time-varying regularizers. Our approach allows us to recover as special cases many recently proposed second-order algorithms, such as the Vovk-Azoury-Warmuth, the second-order Perceptron, and the AROW algorithm. Moreover, we derive a new second-order adaptive p-norm algorithm, and improve bounds for some first-order algorithms, such as Passive-Aggressive (PA-I).
Keywords: Online learning, Convex optimization, Second-order algorithms
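A concrete special case of this framework: with the fixed regularizer ||w||²/2 the mirror map is the identity and online mirror descent reduces to plain online gradient descent. A minimal sketch (gradient sequence and step size hypothetical; time-varying regularizers would change the map between primal and dual at each step):

```python
import numpy as np

def online_mirror_descent(grads, eta):
    """Online mirror descent with regularizer ||w||^2 / 2: the mirror map is
    the identity, so the update is plain online gradient descent.
    grads: sequence of gradient vectors g_t observed after playing w_t.
    Returns the list of played iterates w_1, w_2, ..."""
    d = len(grads[0])
    w = np.zeros(d)
    iterates = []
    for g in grads:
        iterates.append(w.copy())
        w = w - eta * np.asarray(g)   # dual step; primal = dual here
    return iterates
```

Choosing an entropic regularizer instead would make the same loop multiplicative (Winnow-style), which is the unification the abstract describes.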