Results 1–10 of 34
Efficient Online and Batch Learning using Forward Backward Splitting
Abstract

Cited by 134 (1 self)
We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two-phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural datasets.
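For the ℓ1 case, the two phases described above reduce to a gradient step followed by coordinate-wise soft-thresholding. A minimal NumPy sketch (the function name and toy values are illustrative, not from the paper):

```python
import numpy as np

def fobos_l1_step(w, grad, eta, lam):
    """One forward-backward iteration with an l1 regularizer.

    Phase 1: unconstrained gradient descent step.
    Phase 2: the instantaneous problem has a closed form for l1,
    namely coordinate-wise soft-thresholding at level eta * lam.
    """
    v = w - eta * grad  # phase 1: gradient step
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)  # phase 2

# With a zero gradient, the step just shrinks weights toward zero;
# weights smaller than eta * lam are truncated to exactly zero.
w = np.array([0.5, -0.05, 2.0])
print(fobos_l1_step(w, np.zeros(3), eta=1.0, lam=0.1))  # [0.4, 0.0, 1.9]
```

The truncation in phase 2 is what produces exact sparsity, which a plain subgradient step on the ℓ1 term would not.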
Tree-Guided Group Lasso for Multi-Task Regression with Structured Sparsity
Abstract

Cited by 117 (13 self)
We consider the problem of learning a sparse multi-task regression, where the structure in the outputs can be represented as a tree with leaf nodes as outputs and internal nodes as clusters of the outputs at multiple granularities. Our goal is to recover the common set of relevant inputs for each output cluster. Assuming that the tree structure is available as prior knowledge, we formulate this problem as a new multi-task regularized regression called tree-guided group lasso. Our structured regularization is based on a group-lasso penalty, where groups are defined with respect to the tree structure. We describe a systematic weighting scheme for the groups in the penalty such that each output variable is penalized in a balanced manner even if the groups overlap. We present an efficient optimization method that can handle large-scale problems. Using simulated and yeast datasets, we demonstrate that our method shows superior performance in terms of both prediction errors and recovery of true sparsity patterns compared to other methods for multi-task learning.
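The penalty described above can be sketched as a weighted sum of group norms, one group per tree node, applied per input variable. A minimal illustration (the uniform node weights below are hypothetical; the paper's contribution includes a specific balanced weighting scheme not reproduced here):

```python
import numpy as np

def tree_group_penalty(W, groups):
    """Tree-guided group-lasso penalty on a coefficient matrix W
    (rows = inputs, columns = outputs/leaves).

    `groups` is a list of (weight, output_indices) pairs, one per tree
    node, where output_indices are the leaves under that node.  The
    penalty sums, over inputs and nodes, the weighted l2 norm of the
    coefficients linking that input to the node's outputs.
    """
    penalty = 0.0
    for j in range(W.shape[0]):          # each input variable
        for s_v, idx in groups:          # each node of the output tree
            penalty += s_v * np.linalg.norm(W[j, idx])
    return penalty

# Two outputs under one internal node, plus the two singleton leaf groups.
W = np.array([[3.0, 4.0]])
groups = [(0.5, [0]), (0.5, [1]), (0.5, [0, 1])]
print(tree_group_penalty(W, groups))  # 0.5*3 + 0.5*4 + 0.5*5 = 6.0
```

Because the root group couples both outputs, an input is either shared by the whole cluster or pruned for it jointly.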
Smoothing Proximal Gradient Method for General Structured Sparse Learning
Abstract

Cited by 55 (7 self)
We study the problem of learning high-dimensional regression models regularized by a structured-sparsity-inducing penalty that encodes prior structural information on either the input or output side. We consider two widely adopted types of such penalties as our motivating examples: 1) the overlapping group lasso penalty, based on the ℓ1/ℓ2 mixed-norm penalty, and 2) the graph-guided fusion penalty. For both types of penalties, due to their non-separability, developing an efficient optimization method has remained a challenging problem. In this paper, we propose a general optimization approach, called the smoothing proximal gradient method, which can solve structured sparse regression problems with a smooth convex loss and a wide spectrum of structured-sparsity-inducing penalties. Our approach is based on a general smoothing technique of Nesterov [17]. It achieves a convergence rate faster than the standard first-order subgradient method, and is much more scalable than the most widely used interior-point methods. Numerical results are reported to demonstrate the efficiency and scalability of the proposed method.
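The simplest instance of Nesterov's smoothing is the coordinate-wise smoothed absolute value (the Huber function), obtained by adding a quadratic proximity term to the dual representation |x| = max over |a| ≤ 1 of a·x. A sketch for intuition only; the paper applies the same construction to the dual variable of the full structured penalty, not just to ℓ1:

```python
import numpy as np

def smoothed_l1(x, mu):
    """Nesterov-smoothed l1 norm, coordinate-wise:
    max_{|a|<=1} a*x - (mu/2)*a^2, which equals the Huber function:
    quadratic x^2/(2*mu) near zero, linear |x| - mu/2 elsewhere.
    Its gradient is Lipschitz with constant 1/mu, so fast proximal
    gradient methods apply where the raw l1 norm is non-smooth."""
    ax = np.abs(x)
    return np.sum(np.where(ax <= mu, ax**2 / (2 * mu), ax - mu / 2))

x = np.array([0.05, -2.0])
print(smoothed_l1(x, mu=0.1))  # 0.0125 + 1.95 = 1.9625
```

Shrinking mu tightens the approximation to the true ℓ1 norm at the cost of a larger gradient Lipschitz constant, which is the trade-off behind the method's convergence rate.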
Multi-Task Feature Learning Via Efficient ℓ2,1-Norm Minimization
, 2009
Abstract

Cited by 51 (3 self)
The problem of joint feature selection across a group of related tasks has applications in many areas including biomedical informatics and computer vision. We consider the ℓ2,1-norm regularized regression model for joint feature selection from multiple tasks, which can be derived in a probabilistic framework by assuming a suitable prior from the exponential family. One appealing feature of the ℓ2,1-norm regularization is that it encourages multiple predictors to share similar sparsity patterns. However, the resulting optimization problem is challenging to solve due to the non-smoothness of the ℓ2,1-norm regularization. In this paper, we propose to accelerate the computation by reformulating it as two equivalent smooth convex optimization problems, which are then solved via Nesterov's method, an optimal first-order black-box method for smooth convex optimization. A key building block in solving the reformulations is the Euclidean projection. We show that the Euclidean projection for the first reformulation can be computed analytically, while the Euclidean projection for the second one can be computed in linear time. Empirical evaluations on several data sets verify the efficiency of the proposed algorithms.
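The ℓ2,1 norm and its standard proximal operator (row-wise group soft-thresholding) make the shared-sparsity effect concrete. Note that this closed-form prox is the textbook operator, shown for intuition; it is related to, but not the same as, the Euclidean projections used in the paper's reformulations:

```python
import numpy as np

def l21_norm(W):
    """l2,1 norm: sum over rows (features) of the l2 norm across tasks."""
    return np.sum(np.linalg.norm(W, axis=1))

def prox_l21(W, t):
    """Proximal operator of t * l2,1: row-wise group soft-thresholding.
    Any row whose l2 norm is below t is zeroed jointly across all tasks,
    which is how the penalty induces a shared sparsity pattern."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return scale * W

W = np.array([[3.0, 4.0],    # strong feature, kept (and shrunk)
              [0.1, 0.1]])   # weak feature, removed for both tasks
print(l21_norm(W))
print(prox_l21(W, 1.0))
```

Rows survive or vanish as a unit, so the selected feature set is identical across tasks.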
Structured Sparsity through Convex Optimization
Abstract

Cited by 48 (7 self)
Sparse estimation methods aim at using or obtaining parsimonious representations of data or models. While naturally cast as a combinatorial optimization problem, variable or feature selection admits a convex relaxation through regularization by the ℓ1-norm. In this paper, we consider situations where we are not only interested in sparsity, but where some structural prior knowledge is available as well. We show that the ℓ1-norm can then be extended to structured norms built on either disjoint or overlapping groups of variables, leading to a flexible framework that can deal with various structures. We present applications to unsupervised learning, for structured sparse principal component analysis and hierarchical dictionary learning, and to supervised learning in the context of nonlinear variable selection. Key words and phrases: Sparsity, convex optimization.
Efficient spectral feature selection with minimum redundancy
In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 2010.
Abstract

Cited by 29 (7 self)
Spectral feature selection identifies relevant features by measuring their capability of preserving sample similarity. It provides a powerful framework for both supervised and unsupervised feature selection, and has been proven effective in many real-world applications. One common drawback associated with most existing spectral feature selection algorithms is that they evaluate features individually and cannot identify redundant features. Since redundant features can have a significant adverse effect on learning performance, it is necessary to address this limitation of spectral feature selection. To this end, we propose a novel spectral feature selection algorithm to handle feature redundancy, adopting an embedded model. The algorithm is derived from a formulation based on sparse multi-output regression with an ℓ2,1-norm constraint. We conduct theoretical analysis of the properties of its optimal solutions, paving the way for designing an efficient path-following solver. Extensive experiments show that the proposed algorithm performs well at both selecting relevant features and removing redundancy.
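The similarity-preserving criterion underlying spectral feature selection can be sketched with a Laplacian-style score that rates each feature individually. This per-feature scoring is precisely the baseline behavior the paper improves upon; the redundancy-aware ℓ2,1-constrained regression itself is not shown here:

```python
import numpy as np

def spectral_feature_scores(X, S):
    """Laplacian-style spectral relevance score: a feature that changes
    little between highly similar samples preserves the sample-similarity
    structure and is judged relevant (lower score = more relevant here).

    X: samples-by-features data matrix; S: symmetric similarity matrix.
    """
    D = np.diag(S.sum(axis=1))
    L = D - S                        # unnormalized graph Laplacian
    scores = []
    for f in X.T:                    # each feature scored individually
        disagreement = f @ L @ f     # variation across similar pairs
        scale = f @ D @ f + 1e-12    # normalization term
        scores.append(disagreement / scale)
    return np.array(scores)

# Two similar samples: feature 0 is constant across them (score 0),
# feature 1 differs and therefore scores worse.
X = np.array([[1.0, 0.0],
              [1.0, 5.0]])
S = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(spectral_feature_scores(X, S))
```

Because each feature is scored in isolation, two perfectly correlated relevant features receive identical scores, illustrating the redundancy problem the abstract raises.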
Heterogeneous Multitask Learning with Joint Sparsity Constraints
Abstract

Cited by 27 (0 self)
Multi-task learning addresses the problem of learning related tasks that presumably share some commonalities in their input-output mapping functions. Previous approaches to multi-task learning usually deal with homogeneous tasks, such as purely regression tasks or entirely classification tasks. In this paper, we consider the problem of learning multiple related tasks of predicting both continuous and discrete outputs from a common set of input variables that lie in a high-dimensional feature space. All of the tasks are related in the sense that they share the same set of relevant input variables, but the amount of influence of each input on different outputs may vary. We formulate this problem as a combination of linear regressions and logistic regressions, and model the joint sparsity as the ℓ1/ℓ∞ or ℓ1/ℓ2 norm of the model parameters. Among several possible applications, our approach addresses an important open problem in genetic association mapping, where the goal is to discover genetic markers that influence multiple correlated traits jointly. In our experiments, we demonstrate our method in this setting, using simulated and clinical asthma datasets, and we show that our method can effectively recover the relevant inputs with respect to all of the tasks.
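The ℓ1/ℓ∞ penalty above charges each input only once, for its largest effect across tasks, which is what couples the tasks' sparsity patterns. A minimal sketch (toy matrix and function name are illustrative):

```python
import numpy as np

def l1_linf_penalty(W):
    """l1/l_inf mixed norm on a parameter matrix W (rows = inputs,
    columns = tasks): per input, take the maximum absolute coefficient
    across tasks, then sum over inputs.  An input pays only for its
    single largest effect, so the penalty encourages whole rows to be
    zero (input irrelevant for every task) or free to vary below the max,
    yielding joint sparsity across heterogeneous tasks."""
    return np.sum(np.max(np.abs(W), axis=1))

W = np.array([[2.0, -3.0],   # input used by both tasks
              [0.0,  0.0]])  # input irrelevant for all tasks
print(l1_linf_penalty(W))    # 3.0: only the largest entry per row counts
```

The ℓ1/ℓ2 alternative mentioned in the abstract replaces the row maximum with the row's ℓ2 norm, giving a softer coupling.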
Boosting with Structural Sparsity
Abstract

Cited by 25 (2 self)
We derive generalizations of AdaBoost and related gradient-based coordinate descent methods that incorporate sparsity-promoting penalties for the norm of the predictor that is being learned. The end result is a family of coordinate descent algorithms that integrate forward feature induction and back-pruning through regularization and give an automatic stopping criterion for feature induction. We study penalties based on the ℓ1, ℓ2, and ℓ∞ norms of the predictor and introduce mixed-norm penalties that build upon the initial penalties. The mixed-norm regularizers facilitate structural sparsity in parameter space, which is a useful property in multiclass prediction and other related tasks. We report empirical results that demonstrate the power of our approach in building accurate and structurally sparse models.
Efficient learning using forward-backward splitting
 in Advances in Neural Information Processing Systems 23
, 2009
Abstract

Cited by 24 (0 self)
We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two-phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets.
Accelerated gradient method for multitask sparse learning problem
 in Proceedings of the International Conference on Data Mining
, 2009
Abstract

Cited by 23 (1 self)
Many real-world learning problems can be recast as multi-task learning problems, which exploit correlations among different tasks to obtain better generalization performance than learning each task individually. The feature selection problem in the multi-task setting has many applications in the fields of computer vision, text classification, and bioinformatics. Generally, it can be realized by solving an ℓ1/ℓ∞-regularized optimization problem, whose solution automatically yields joint sparsity among different tasks. However, due to the non-smooth nature of the ℓ1/ℓ∞ norm, an efficient training algorithm for solving such problems with general convex loss functions has been lacking. In this paper, we propose an accelerated gradient method based on an "optimal" first-order black-box method due to Nesterov and provide the convergence rate for smooth convex loss functions. For non-smooth convex loss functions, such as the hinge loss, our method still converges fast empirically. Moreover, by exploiting the structure of the ℓ1/ℓ∞ ball, we solve the black-box oracle in Nesterov's method by a simple sorting scheme. Our method is suitable for large-scale multi-task learning problems since it only uses first-order information and is very easy to implement. Experimental results show that our method significantly outperforms state-of-the-art methods in both convergence speed and learning accuracy. Keywords: multi-task learning; ℓ1/ℓ∞ regularization; optimal method; gradient descent.
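The paper's oracle projects onto the ℓ1/ℓ∞ ball; as an analogous illustration of the sort-and-threshold pattern (not the paper's exact routine), here is the classic O(n log n) sorting-based Euclidean projection onto the plain ℓ1 ball:

```python
import numpy as np

def project_l1_ball(v, z):
    """Euclidean projection of v onto the l1 ball of radius z, via
    sorting.  Sort the magnitudes, find the largest prefix whose entries
    stay positive after a uniform shift, then soft-threshold at that
    shift level.  Projections onto mixed-norm balls such as l1/l_inf
    follow a similar sort-and-threshold pattern."""
    if np.abs(v).sum() <= z:
        return v.copy()                           # already inside the ball
    u = np.sort(np.abs(v))[::-1]                  # magnitudes, descending
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (cssv - z))[0][-1]
    theta = (cssv[rho] - z) / (rho + 1.0)         # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

# Projecting (3, 1) onto the l1 ball of radius 2 gives (2, 0).
print(project_l1_ball(np.array([3.0, 1.0]), 2.0))
```

Only first-order information and a sort are needed, which is the property that makes such oracles cheap inside Nesterov-type accelerated methods.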