Results 1–10 of 64
Efficient Euclidean Projections in Linear Time
Abstract

Cited by 42 (9 self)
We consider the problem of computing the Euclidean projection of a vector of length n onto a closed convex set including the ℓ1 ball and the specialized polyhedra employed in (Shalev-Shwartz & Singer, 2006). These problems have played building-block roles in solving several ℓ1-norm based sparse learning problems. Existing methods have a worst-case time complexity of O(n log n). In this paper, we propose to cast both Euclidean projections as root-finding problems associated with specific auxiliary functions, which can be solved in linear time via bisection. We further make use of the special structure of the auxiliary functions, and propose an improved bisection algorithm. Empirical studies demonstrate that the proposed algorithms are much more efficient than the competing ones for computing the projections.
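The root-finding view described in this abstract can be sketched with plain bisection for the ℓ1-ball case; the function name and tolerance below are illustrative, and this sketch omits the paper's improved bisection variant that exploits the auxiliary function's structure:

```python
import math

def project_l1_ball(v, z=1.0, tol=1e-10):
    """Euclidean projection of v onto the l1 ball {w : ||w||_1 <= z}.

    If v is already inside the ball it is its own projection. Otherwise the
    projection is w_i = sign(v_i) * max(|v_i| - lam, 0) for the root lam >= 0
    of phi(lam) = sum_i max(|v_i| - lam, 0) - z, found here by bisection.
    """
    a = [abs(x) for x in v]
    if sum(a) <= z:
        return list(v)
    lo, hi = 0.0, max(a)            # phi(0) > 0 and phi(max|v_i|) = -z < 0
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if sum(max(ai - lam, 0.0) for ai in a) > z:
            lo = lam                # phi(lam) > 0: root lies to the right
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in v]
```

Each bisection step is a single O(n) pass, matching the linear-time flavour of the abstract (up to the log(1/tol) bisection factor).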
Fast rates for regularized objectives
 In Neural Information Processing Systems
, 2008
Abstract

Cited by 41 (9 self)
We study convergence properties of empirical minimization of a stochastic strongly convex objective, where the stochastic component is linear. We show that the value attained by the empirical minimizer converges to the optimal value with rate 1/n. The result applies, in particular, to the SVM objective. Thus, we obtain a rate of 1/n on the convergence of the SVM objective (with fixed regularization parameter) to its infinite data limit. We demonstrate how this is essential for obtaining certain types of oracle inequalities for SVMs. The results extend also to approximate minimization as well as to strong convexity with respect to an arbitrary norm, and so also to objectives regularized using other ℓp norms.
Mind the Duality Gap: Logarithmic regret algorithms for online optimization
Abstract

Cited by 35 (0 self)
We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions.
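As a toy illustration of the gradient-descent end of the spectrum this abstract discusses (not the paper's interpolated algorithm): for λ-strongly convex losses, online gradient descent with step size η_t = 1/(λt) is the standard scheme attaining O(log T) regret. Names and setup here are illustrative assumptions:

```python
def online_gd(grad_fns, lam=1.0, w0=0.0):
    """Online gradient descent sketch for lam-strongly convex losses.

    grad_fns: a sequence of gradient oracles g_t(w), revealed one per round.
    Uses the step size eta_t = 1/(lam * t) that yields logarithmic regret.
    """
    w = w0
    for t, g in enumerate(grad_fns, start=1):
        w -= (1.0 / (lam * t)) * g(w)   # eta_t = 1/(lam * t)
    return w
```

For example, with the 1-strongly convex losses f_t(w) = (w - y_t)²/2 and gradients g_t(w) = w - y_t, the iterates track the running mean of the y_t values.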
On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms
 In Proceedings of the 21st Annual Conference on Computational Learning Theory
Abstract

Cited by 33 (7 self)
Boosting algorithms build highly accurate prediction mechanisms from a collection of low-accuracy predictors. To do so, they employ the notion of weak learnability. The starting point of this paper is a proof which shows that weak learnability is equivalent to linear separability with ℓ1 margin. While this equivalence is a direct consequence of von Neumann’s minimax theorem, we derive the equivalence directly using Fenchel duality. We then use our derivation to describe a family of relaxations of the weak-learnability assumption that readily translates to a family of relaxations of linear separability with margin. This alternative perspective sheds new light on known soft-margin boosting algorithms and also enables us to derive several new relaxations of the notion of linear separability. Last, we describe and analyze an efficient boosting framework that can be used for minimizing the loss functions derived from our family of relaxations. In particular, we obtain efficient boosting algorithms for maximizing hard and soft versions of the ℓ1 margin.
Smoothness, low noise and fast rates
 In NIPS
, 2010
Abstract

Cited by 31 (11 self)
We establish an excess risk bound of Õ(HRn² + √(HL∗) Rn) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity Rn, where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where Rn = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L∗ = 0) case and Õ(RH/n + √(L∗ · RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.
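In display form, the two rates stated in this abstract read (same symbols as above, with L∗ the best achievable risk and Rn the Rademacher complexity):

```latex
% Excess-risk bound for ERM with an H-smooth loss, as stated in the abstract
\[
  \text{excess risk} \;\le\; \tilde{O}\!\left( H R_n^{2} + \sqrt{H L^{*}}\, R_n \right),
\]
% and for the typical case R_n = sqrt(R/n):
\[
  \tilde{O}\!\left( \frac{RH}{n} \right) \;\text{when } L^{*}=0,
  \qquad
  \tilde{O}\!\left( \frac{RH}{n} + \sqrt{L^{*}\,\frac{RH}{n}} \right) \;\text{in general.}
\]
```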
Stochastic Methods for ℓ1 Regularized Loss Minimization
 Shai Shalev-Shwartz
Abstract

Cited by 29 (3 self)
We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature/example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.
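A minimal sketch of the first method (one uniformly random feature per step) for a Lasso-style objective; the objective scaling, helper names, and exact coordinate update below are illustrative assumptions, since the abstract does not spell them out:

```python
import random

def soft_threshold(x, t):
    """Shrink x toward zero by t: argmin_u (u - x)^2/2 + t*|u|."""
    return max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0)

def scd_lasso(X, y, lam, iters=500, seed=0):
    """Stochastic coordinate descent sketch for
    min_w (1/2m)||Xw - y||^2 + lam * ||w||_1,
    updating one randomly chosen feature per iteration."""
    rng = random.Random(seed)
    m, d = len(X), len(X[0])
    w = [0.0] * d
    # Maintain the residual r = y - Xw so each step costs O(m), not O(md).
    r = [y[i] - sum(X[i][k] * w[k] for k in range(d)) for i in range(m)]
    for _ in range(iters):
        j = rng.randrange(d)                          # feature chosen uniformly
        zj = sum(X[i][j] ** 2 for i in range(m)) / m
        if zj == 0.0:
            continue                                  # constant-zero feature
        rho = sum(X[i][j] * (r[i] + X[i][j] * w[j]) for i in range(m)) / m
        wj_new = soft_threshold(rho, lam) / zj        # exact 1-D minimizer
        delta = wj_new - w[j]
        for i in range(m):                            # keep residual in sync
            r[i] -= X[i][j] * delta
        w[j] = wj_new
    return w
```

Each iteration touches a single feature column, which is the source of the per-step cheapness the abstract's runtime analysis relies on.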
Regularization techniques for learning with matrices
 In Journal of Machine Learning Research
Learnability, Stability and Uniform Convergence
Abstract

Cited by 18 (4 self)
The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and longstanding answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting, there are nontrivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. Moreover, we show that the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression.
Predicting the Labelling of a Graph via Minimum p-Seminorm Interpolation
Abstract

Cited by 17 (1 self)
We study the problem of predicting the labelling of a graph. The graph is given and a trial sequence of (vertex, label) pairs is then incrementally revealed to the learner. On each trial a vertex is queried and the learner predicts a boolean label. The true label is then returned. The learner’s goal is to minimise mistaken predictions. We propose minimum p-seminorm interpolation to solve this problem. To this end we give a p-seminorm on the space of graph labellings. Thus on every trial we predict using the labelling which minimises the p-seminorm and is also consistent with the revealed (vertex, label) pairs. When p = 2 this is the harmonic energy minimisation procedure of [22], also called (Laplacian) interpolated regularisation in [1]. In the limit as p → 1 this is equivalent to predicting with a label-consistent min-cut. We give mistake bounds relative to a label-consistent min-cut and a resistive cover of the graph. We say an edge is cut with respect to a labelling if the connected vertices have disagreeing labels. We find that minimising the p-seminorm with p = 1 + ε, where ε → 0 as the graph diameter D → ∞, gives a bound of O(Φ² log D) versus a bound of O(ΦD) when p = 2, where Φ is the number of cut edges.
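For the p = 2 case mentioned in this abstract (harmonic energy minimisation), a minimal sketch via iterative averaging: each unlabelled vertex repeatedly takes the mean of its neighbours' values while revealed labels stay clamped. The function name and the fixed iteration count are illustrative shortcuts, not from the paper:

```python
def harmonic_predict(adj, labels, iters=200):
    """p = 2 sketch: harmonic energy minimisation by iterative averaging.

    adj: {vertex: [neighbours]} adjacency lists of the given graph.
    labels: {vertex: +1 or -1} for the revealed (vertex, label) pairs.
    Returns a predicted +/-1 label for every vertex.
    """
    f = {v: float(labels.get(v, 0.0)) for v in adj}
    for _ in range(iters):
        for v in adj:
            if v in labels:
                continue                     # revealed labels are clamped
            if adj[v]:
                f[v] = sum(f[u] for u in adj[v]) / len(adj[v])
    return {v: (1 if f[v] >= 0 else -1) for v in adj}
```

At the fixed point each unlabelled vertex equals its neighbourhood average, which is exactly the harmonic (Laplacian-interpolation) solution the p = 2 case refers to.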
Beating the adaptive bandit with high probability
, 2009
Abstract

Cited by 16 (6 self)
We provide a principled way of proving Õ(√T) high-probability guarantees for partial-information (bandit) problems over convex decision sets. First, we prove a regret guarantee for the full-information problem in terms of “local” norms, both for entropy and self-concordant barrier regularization, unifying these methods. Given one such algorithm as a black-box, we can convert a bandit problem into a full-information problem using a sampling scheme. The main result states that a high-probability Õ(√T) bound holds whenever the black-box, the sampling scheme, and the estimates of missing information satisfy a number of conditions, which are relatively easy to check. At the heart of the method is a construction of linear upper bounds on confidence intervals. As applications of the main result, we provide the first known efficient algorithm for the sphere with an Õ(√T) high-probability bound. We also derive the result for the n-simplex, improving the O(√(nT log(nT))) bound of Auer et al. [3] by replacing the log T term with log log T and closing the gap to the lower bound of Ω(√(nT)). The guarantees we obtain hold for adaptive adversaries (unlike the in-expectation results of [1]) and the algorithms are efficient, given that the linear upper bounds on confidence can be computed.