Results 1–10 of 36
Mind the Duality Gap: Logarithmic regret algorithms for online optimization
Abstract

Cited by 23 (0 self)
We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework to generalized strongly convex functions.
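The gradient descent side of the abstract can be sketched concretely: for λ-strongly convex round losses, online gradient descent with the decreasing step size 1/(λt) is the standard scheme that attains logarithmic regret. The sketch below is ours (function names and the quadratic toy losses are illustrative, not the paper's code):

```python
import numpy as np

def online_gd_strongly_convex(grad_fns, lam, dim):
    """Online gradient descent for lam-strongly convex round losses.

    A minimal sketch of the style of algorithm analyzed in the abstract:
    at round t we play the current iterate, observe the loss gradient,
    and step with size 1/(lam * t), the rate that yields O(log T) regret
    for strongly convex losses.
    """
    w = np.zeros(dim)
    iterates = []
    for t, grad in enumerate(grad_fns, start=1):
        iterates.append(w.copy())   # prediction played at round t
        eta = 1.0 / (lam * t)       # decreasing step size 1/(lam*t)
        w = w - eta * grad(w)       # gradient step on the round-t loss
    return iterates

# toy run: quadratic losses f_t(w) = (lam/2) * ||w - z_t||^2
lam = 1.0
zs = [np.array([1.0, -1.0]), np.array([0.5, 0.0]), np.array([1.5, -0.5])]
grads = [(lambda w, z=z: lam * (w - z)) for z in zs]
iters = online_gd_strongly_convex(grads, lam, dim=2)
```

For these quadratic losses the iterate at round t+1 is exactly the average of z_1, …, z_t, i.e. the Follow-The-Leader prediction, which illustrates why the two extremes in the abstract can be interpolated.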
Efficient Euclidean Projections in Linear Time
Abstract

Cited by 21 (7 self)
We consider the problem of computing the Euclidean projection of a vector of length n onto a closed convex set, including the ℓ1 ball and the specialized polyhedra employed in (Shalev-Shwartz & Singer, 2006). These problems have served as building blocks in solving several ℓ1-norm based sparse learning problems. Existing methods have a worst-case time complexity of O(n log n). In this paper, we propose to cast both Euclidean projections as root-finding problems associated with specific auxiliary functions, which can be solved in linear time via bisection. We further exploit the special structure of the auxiliary functions and propose an improved bisection algorithm. Empirical studies demonstrate that the proposed algorithms are much more efficient than competing ones for computing the projections.
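For the ℓ1-ball case, the root-finding reduction the abstract describes can be sketched as follows. The auxiliary function f(λ) = Σ_i max(|v_i| − λ, 0) − z is monotonically decreasing in λ, so plain bisection on [0, max|v_i|] finds the threshold; this is the basic variant, not the paper's improved bisection scheme, and the function names are ours:

```python
import numpy as np

def project_l1_ball(v, z=1.0, tol=1e-10):
    """Euclidean projection of v onto the l1 ball of radius z.

    Sketch of the bisection idea: the projection is
    w_i = sign(v_i) * max(|v_i| - lam, 0) for the lam solving
    f(lam) = sum_i max(|v_i| - lam, 0) - z = 0; f is monotone,
    so bisection on [0, max|v_i|] converges linearly.
    """
    u = np.abs(v)
    if u.sum() <= z:
        return v.copy()            # already inside the ball
    lo, hi = 0.0, u.max()
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if np.maximum(u - lam, 0.0).sum() > z:
            lo = lam               # threshold too small: still outside the ball
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return np.sign(v) * np.maximum(u - lam, 0.0)
```

Each bisection step costs O(n), and the number of steps depends only on the desired precision, which is the source of the linear-time claim.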
On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms
 In: Proceedings of the 21st Annual Conference on Computational Learning Theory
Abstract

Cited by 20 (6 self)
Boosting algorithms build highly accurate prediction mechanisms from a collection of low-accuracy predictors. To do so, they employ the notion of weak learnability. The starting point of this paper is a proof which shows that weak learnability is equivalent to linear separability with ℓ1 margin. While this equivalence is a direct consequence of von Neumann’s minimax theorem, we derive the equivalence directly using Fenchel duality. We then use our derivation to describe a family of relaxations of the weak-learnability assumption that readily translates into a family of relaxations of linear separability with margin. This alternative perspective sheds new light on known soft-margin boosting algorithms and also enables us to derive several new relaxations of the notion of linear separability. Lastly, we describe and analyze an efficient boosting framework that can be used for minimizing the loss functions derived from our family of relaxations. In particular, we obtain efficient boosting algorithms for maximizing hard and soft versions of the ℓ1 margin.
Fast rates for regularized objectives
 In Neural Information Processing Systems
, 2008
Abstract

Cited by 20 (7 self)
We study convergence properties of empirical minimization of a stochastic strongly convex objective, where the stochastic component is linear. We show that the value attained by the empirical minimizer converges to the optimal value at rate 1/n. The result applies, in particular, to the SVM objective. Thus, we obtain a 1/n rate of convergence of the SVM objective (with fixed regularization parameter) to its infinite-data limit. We demonstrate how this is essential for obtaining certain types of oracle inequalities for SVMs. The results also extend to approximate minimization, as well as to strong convexity with respect to an arbitrary norm, and hence to objectives regularized using other ℓp norms.
Stochastic Methods for ℓ1 Regularized Loss Minimization
Shai Shalev-Shwartz
Abstract

Cited by 17 (3 self)
We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration, while the second method updates the entire weight vector but uses only a single training example at each iteration. In both methods, the feature/example is chosen uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.
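The first method, one uniformly random feature per iteration, can be sketched for the Lasso with squared loss. The coordinate update used below is the standard coordinate-wise soft-thresholding rule, shown only to illustrate the access pattern; it is not necessarily the paper's exact update, and the names are ours:

```python
import numpy as np

def soft_threshold(x, t):
    """Shrinkage operator: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def stochastic_cd_lasso(X, y, lam, n_iter=2000, seed=0):
    """Stochastic coordinate descent sketch for the Lasso objective
    (1/2m) * ||Xw - y||^2 + lam * ||w||_1.

    One uniformly random feature is exactly minimized per iteration,
    touching only a single column of X, as in the abstract's first method.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    col_sq = (X ** 2).sum(axis=0) / m       # (1/m) * ||X_j||^2 per feature
    resid = y - X @ w                        # running residual y - Xw
    for _ in range(n_iter):
        j = rng.integers(n)                  # feature chosen uniformly at random
        rho = X[:, j] @ resid / m + col_sq[j] * w[j]
        w_new = soft_threshold(rho, lam) / col_sq[j]
        resid += X[:, j] * (w[j] - w_new)    # keep residual consistent with w
        w[j] = w_new
    return w
```

Because each step exactly minimizes the objective along the chosen coordinate, every iteration costs O(m) rather than O(mn), which is where the claimed advantage on large problems comes from.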
Smoothness, low noise and fast rates
 In NIPS
, 2010
Abstract

Cited by 15 (4 self)
We establish an excess risk bound of Õ(H·Rn² + √(H·L∗)·Rn) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity Rn, where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where Rn = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L∗ = 0) case and Õ(RH/n + √(L∗·RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.
Beating the adaptive bandit with high probability
, 2009
Abstract

Cited by 13 (4 self)
We provide a principled way of proving Õ(√T) high-probability guarantees for partial-information (bandit) problems over convex decision sets. First, we prove a regret guarantee for the full-information problem in terms of “local” norms, both for entropy and self-concordant barrier regularization, unifying these methods. Given one such algorithm as a black box, we can convert a bandit problem into a full-information problem using a sampling scheme. The main result states that a high-probability Õ(√T) bound holds whenever the black box, the sampling scheme, and the estimates of missing information satisfy a number of conditions, which are relatively easy to check. At the heart of the method is a construction of linear upper bounds on confidence intervals. As applications of the main result, we provide the first known efficient algorithm for the sphere with an Õ(√T) high-probability bound. We also derive the result for the n-simplex, improving the O(√(nT log(nT))) bound of Auer et al. [3] by replacing the log T term with log log T and closing the gap to the lower bound of Ω(√(nT)). The guarantees we obtain hold for adaptive adversaries (unlike the in-expectation results of [1]), and the algorithms are efficient, given that the linear upper bounds on confidence can be computed.
Predicting the Labelling of a Graph via Minimum p-Seminorm Interpolation
Abstract

Cited by 12 (1 self)
We study the problem of predicting the labelling of a graph. The graph is given, and a trial sequence of (vertex, label) pairs is then incrementally revealed to the learner. On each trial a vertex is queried and the learner predicts a boolean label. The true label is then returned. The learner’s goal is to minimise mistaken predictions. We propose minimum p-seminorm interpolation to solve this problem. To this end we give a p-seminorm on the space of graph labellings. Thus on every trial we predict using the labelling which minimises the p-seminorm and is also consistent with the revealed (vertex, label) pairs. When p = 2 this is the harmonic energy minimisation procedure of [22], also called (Laplacian) interpolated regularisation in [1]. In the limit as p → 1 this is equivalent to predicting with a label-consistent min-cut. We give mistake bounds relative to a label-consistent min-cut and a resistive cover of the graph. We say an edge is cut with respect to a labelling if the connected vertices have disagreeing labels. We find that minimising the p-seminorm with p = 1 + ɛ, where ɛ → 0 as the graph diameter D → ∞, gives a bound of O(Φ² log D), versus a bound of O(ΦD) when p = 2, where Φ is the number of cut edges.
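The p = 2 special case the abstract mentions (harmonic energy minimisation) has a particularly simple characterisation: the minimiser is harmonic at every unlabelled vertex, i.e. equal to the mean of its neighbours. The Gauss-Seidel sketch below exploits this; it is our own minimal illustration of that special case, not the paper's general-p procedure:

```python
import numpy as np

def harmonic_interpolation(adj, labels, n_sweeps=500):
    """Predict graph labels by minimum 2-seminorm (harmonic) interpolation.

    adj is a symmetric adjacency matrix; labels is a list of revealed
    (vertex, label) pairs. Revealed vertices are clamped, and every
    unlabelled value is repeatedly replaced by the mean of its
    neighbours, which converges to the harmonic energy minimiser.
    """
    n = adj.shape[0]
    f = np.zeros(n)
    known = dict(labels)
    for v, y in known.items():
        f[v] = y                              # clamp revealed labels
    deg = adj.sum(axis=1)
    for _ in range(n_sweeps):
        for v in range(n):
            if v not in known and deg[v] > 0:
                f[v] = adj[v] @ f / deg[v]    # harmonic: mean of neighbours
    return np.sign(f)                          # boolean prediction by sign
```

On a path graph labelled +1 and -1 at its two endpoints, the interpolant decays linearly between the endpoints, so the sign prediction splits the path at its midpoint.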
Online Multiple Kernel Learning: Algorithms and Mistake Bounds
Abstract

Cited by 10 (10 self)
Online learning and kernel learning are two active research topics in machine learning. Although each has been studied extensively, little effort has been devoted to their intersection. In this paper, we introduce a new research problem, termed Online Multiple Kernel Learning (OMKL), which aims to learn a kernel-based prediction function from a pool of predefined kernels in an online fashion. OMKL is generally more challenging than typical online learning because both the kernel classifiers and their linear combination weights must be learned simultaneously. In this work, we consider two setups for OMKL, i.e., combining binary predictions or real-valued outputs from multiple kernel classifiers, and we propose both deterministic and stochastic approaches for each setup. The deterministic approach updates all kernel classifiers for every misclassified example, while the stochastic approach randomly chooses one or more classifiers to update according to some sampling strategy. Mistake bounds are derived for all the proposed OMKL algorithms.
Keywords: Online learning and relative loss bounds, Kernels
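The deterministic, binary-prediction setup can be sketched as a pool of kernel Perceptrons combined by multiplicative weights. This is our own simplification in the spirit of the abstract (update every classifier on a combined mistake, down-weight kernels that individually erred); the update rule and names are not taken from the paper:

```python
import numpy as np

def omkl_deterministic(kernels, examples, eta=0.5):
    """Online multiple kernel learning, deterministic-style sketch.

    One kernel Perceptron per kernel; their +/-1 predictions are combined
    by a weighted vote, every classifier is updated on a misclassified
    example, and Hedge-style weights penalize individually wrong kernels.
    """
    m = len(kernels)
    weights = np.ones(m)
    supports = [[] for _ in range(m)]    # (x, y) support pairs per Perceptron
    mistakes = 0
    for x, y in examples:
        # each kernel Perceptron's score f_i(x) = sum_s y_s * k_i(x_s, x)
        scores = [sum(ys * k(xs, x) for xs, ys in sv)
                  for k, sv in zip(kernels, supports)]
        preds = [1 if s >= 0 else -1 for s in scores]
        combined = 1 if np.dot(weights, preds) >= 0 else -1
        if combined != y:
            mistakes += 1
            for i in range(m):
                supports[i].append((x, y))   # update every kernel classifier
        for i in range(m):
            if preds[i] != y:
                weights[i] *= (1 - eta)      # down-weight kernels that erred
        weights /= weights.sum()
    return mistakes, weights
```

On a stream that one kernel handles well and another does not, the combination weight of the useful kernel grows, which is the behaviour the mistake bounds quantify.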
Learnability, stability and uniform convergence
 JMLR
, 2010
Abstract

Cited by 8 (3 self)
The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting there are non-trivial learning problems where uniform convergence does not hold and empirical risk minimization fails, yet the problems are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. Moreover, we show that the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression.