Results 1  10
of
39
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
, 2010
"... Stochastic subgradient methods are widely used, well analyzed, and constitute effective tools for optimization and online learning. Stochastic gradient methods ’ popularity and appeal are largely due to their simplicity, as they largely follow predetermined procedural schemes. However, most common s ..."
Abstract

Cited by 63 (1 self)
 Add to MetaCart
(Show Context)
Stochastic subgradient methods are widely used, well analyzed, and constitute effective tools for optimization and online learning. Stochastic gradient methods ’ popularity and appeal are largely due to their simplicity, as they largely follow predetermined procedural schemes. However, most common subgradient approaches are oblivious to the characteristics of the data being observed. We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradientbased learning. The adaptation, in essence, allows us to find needles in haystacks in the form of very predictive but rarely seenfeatures. Ourparadigmstemsfromrecentadvancesinstochasticoptimizationandonlinelearning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. In a companion paper, we validate experimentally our theoretical analysis and show that the adaptive subgradient approach outperforms stateoftheart, but nonadaptive, subgradient algorithms. 1
Adaptive forwardbackward greedy algorithm for learning sparse representations
 IEEE Trans. Inform. Theory
, 2011
"... Consider linear prediction models where the target function is a sparse linear combination of a set of basis functions. We are interested in the problem of identifying those basis functions with nonzero coefficients and reconstructing the target function from noisy observations. Two heuristics that ..."
Abstract

Cited by 55 (8 self)
 Add to MetaCart
(Show Context)
Consider linear prediction models where the target function is a sparse linear combination of a set of basis functions. We are interested in the problem of identifying those basis functions with nonzero coefficients and reconstructing the target function from noisy observations. Two heuristics that are widely used in practice are forward and backward greedy algorithms. First, we show that neither idea is adequate. Second, we propose a novel combination that is based on the forward greedy algorithm but takes backward steps adaptively whenever beneficial. We prove strong theoretical results showing that this procedure is effective in learning sparse representations. Experimental results support our theory. 1
Stochastic Methods for ℓ1 Regularized Loss Minimization Shai ShalevShwartz
"... We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteratio ..."
Abstract

Cited by 18 (3 self)
 Add to MetaCart
We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature/example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform stateoftheart deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets. 1.
Online Alternating Direction Method
 In ICML
, 2012
"... Online optimization has emerged as powerful tool in large scale optimization. In this paper, we introduce efficient online algorithms based on the alternating directions method (ADM). We introduce a new proof technique for ADM in the batch setting, which yields the O(1/T) convergence rate of ADM and ..."
Abstract

Cited by 9 (4 self)
 Add to MetaCart
(Show Context)
Online optimization has emerged as powerful tool in large scale optimization. In this paper, we introduce efficient online algorithms based on the alternating directions method (ADM). We introduce a new proof technique for ADM in the batch setting, which yields the O(1/T) convergence rate of ADM and forms the basis of regret analysis in the online setting. We consider two scenarios in the online setting, based on whether the solution needs to lie in the feasible set or not. In both settings, we establish regret bounds for both the objective function as well as constraint violation for general and strongly convex functions. Preliminary results are presented to illustrate the performance of the proposed algorithms. 1.
Sparsity regret bounds for individual sequences in online linear regression
 JMLR Workshop and Conference Proceedings, 19 (COLT 2011 Proceedings):377–396
, 2011
"... We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T. We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in th ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
(Show Context)
We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T. We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an onlinelearning algorithm called SeqSEW and based on exponential weighting and datadriven truncation. In a second part we apply a parameterfree version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design.
Sample Size Selection in Optimization Methods for Machine Learning
, 2012
"... This paper presents a methodology for using varying sample sizes in batchtype optimization methods for large scale machine learning problems. The first part of the paper deals with the delicate issue of dynamic sample selection in the evaluation of the function and gradient. We propose a criterion ..."
Abstract

Cited by 5 (1 self)
 Add to MetaCart
(Show Context)
This paper presents a methodology for using varying sample sizes in batchtype optimization methods for large scale machine learning problems. The first part of the paper deals with the delicate issue of dynamic sample selection in the evaluation of the function and gradient. We propose a criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient. We establish an O(1/ɛ) complexity bound on the total cost of a gradient method. The second part of the paper describes a practical Newton method that uses a smaller sample to compute Hessian vectorproducts than to evaluate the function and the gradient, and that also employs a dynamic sampling technique. The focus of the paper shifts in the third part of the paper to L1 regularized problems designed to produce sparse solutions. We propose a Newtonlike method that consists of two phases: a (minimalistic) gradient projection phase that identifies zero variables, and subspace phase that applies a subsampled Hessian Newton iteration in the free variables. Numerical tests on speech recognition problems illustrate the performance of the algorithms.
Dual averaging and proximal gradient descent for online alternating direction multiplier method
 In Proceedings of the 30th International Conference on Machine Learning
, 2013
"... We develop new stochastic optimization methods that are applicable to a wide range of structured regularizations. Basically our methods are combinations of basic stochastic optimization techniques and Alternating Direction Multiplier Method (ADMM). ADMM is a general framework for optimizing a comp ..."
Abstract

Cited by 4 (0 self)
 Add to MetaCart
(Show Context)
We develop new stochastic optimization methods that are applicable to a wide range of structured regularizations. Basically our methods are combinations of basic stochastic optimization techniques and Alternating Direction Multiplier Method (ADMM). ADMM is a general framework for optimizing a composite function, and has a wide range of applications. We propose two types of online variants of ADMM, which correspond to online proximal gradient descent and regularized dual averaging respectively. The proposed algorithms are computationally efficient and easy to implement. Our methods yield O(1/ T) convergence of the expected risk. Moreover, the online proximal gradient descent type method yields O(log(T)/T) convergence for a strongly convex loss. Numerical experiments show effectiveness of our methods in learning tasks with structured sparsity such as overlapped group lasso. 1.
RANDOMIZED SMOOTHING FOR STOCHASTIC OPTIMIZATION
, 2012
"... We analyze convergence rates of stochastic optimization algorithms for nonsmooth convex optimization problems. By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates of stochastic optimization procedures, both in expectation and with high probabi ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
We analyze convergence rates of stochastic optimization algorithms for nonsmooth convex optimization problems. By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates of stochastic optimization procedures, both in expectation and with high probability, that have optimal dependence on the variance of the gradient estimates. To the best of our knowledge, these are the first variancebased rates for nonsmooth optimization. We give several applications of our results to statistical estimation problems and provide experimental results that demonstrate the effectiveness of the proposed algorithms. We also describe how a combination of our algorithm with recent work on decentralized optimization yields a distributed stochastic optimization algorithm that is orderoptimal.
Interpreting prediction markets: a stochastic approach
 In Advances in Neural Information Processing Systems 25
, 2012
"... We strengthen recent connections between prediction markets and learning by showing that a natural class of market makers can be understood as performing stochastic mirror descent when trader demands are sequentially drawn from a fixed distribution. This provides new insights into how market prices ..."
Abstract

Cited by 3 (2 self)
 Add to MetaCart
(Show Context)
We strengthen recent connections between prediction markets and learning by showing that a natural class of market makers can be understood as performing stochastic mirror descent when trader demands are sequentially drawn from a fixed distribution. This provides new insights into how market prices (and price paths) may be interpreted as a summary of the market’s belief distribution by relating them to the optimization problem being solved. In particular, we show that under certain conditions the stationary point of the stochastic process of prices generated by the market is equal to the market’s Walrasian equilibrium of classic market analysis. Together, these results suggest how traditional market making mechanisms might be replaced with general purpose learning algorithms while still retaining guarantees about their behaviour. 1 Introduction and literature review This paper is part of an ongoing line of research, spanning several authors, into formal
A generalized online mirror descent with applications to classification and regression
, 2012
"... Online learning algorithms are fast, memoryefficient, easy to implement, and applicable to many prediction problems, including classification, regression, and ranking. Several online algorithms were proposed in the past few decades, some based on additive updates, like the Perceptron, and some othe ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
Online learning algorithms are fast, memoryefficient, easy to implement, and applicable to many prediction problems, including classification, regression, and ranking. Several online algorithms were proposed in the past few decades, some based on additive updates, like the Perceptron, and some other on multiplicative updates, like Winnow. Online convex optimization is a general framework to unify both the design and the analysis of online algorithms using a single prediction strategy: online mirror descent. Different firstorder online algorithms are obtained by choosing the regularization function in online mirror descent. We generalize online mirror descent to sequences of timevarying regularizers. Our approach allows us to recover as special cases many recently proposed secondorder algorithms, such as the VovkAzouryWarmuth, the secondorder Perceptron, and the AROW algorithm. Moreover, we derive a new second order adaptive pnorm algorithm, and improve bounds for some firstorder algorithms, such as PassiveAggressive (PAI). Keywords: Online learning, Convex optimization, Secondorder algorithms