## Convex multi-task feature learning (2007)

### Download Links

- [www.cs.ucl.ac.uk]
- [www0.cs.ucl.ac.uk]
- [eprints.pascal-network.org]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 143 (16 self)

### BibTeX

@ARTICLE{Argyriou07convexmulti-task,
  author = {Andreas Argyriou and Theodoros Evgeniou and Massimiliano Pontil},
  title = {Convex multi-task feature learning},
  journal = {Machine Learning},
  year = {2007}
}

### Abstract

Summary. We present a method for learning sparse representations shared across multiple tasks. The method generalizes the well-known single-task 1-norm regularization. It is based on a novel non-convex regularizer which controls the number of learned features common across the tasks. We prove that the method is equivalent to solving a convex optimization problem, for which there is an iterative algorithm that converges to an optimal solution. The algorithm has a simple interpretation: it alternately performs a supervised and an unsupervised step, where in the former it learns task-specific functions and in the latter it learns common-across-tasks sparse representations of these functions. We also provide an extension of the algorithm which learns sparse nonlinear representations using kernels. We report experiments on simulated and real data sets which demonstrate that the proposed method can both improve performance relative to learning each task independently and lead to a few learned features common across related tasks. As a special case, our algorithm can also be used to simply select, rather than learn, a few common variables across the tasks.
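The alternating supervised/unsupervised scheme described in the abstract can be sketched numerically for the square-loss case. This is a minimal illustrative sketch, not the authors' reference implementation: the function names, the fixed iteration count, and the plain per-task ridge-style solve are our assumptions.

```python
import numpy as np

def sqrtm_psd(A):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def mtfl(X, Y, gamma=1.0, eps=1e-6, iters=50):
    """Alternating minimization sketch for multi-task feature learning.

    X: list of (m_t, d) design matrices, one per task; Y: list of targets.
    Returns W (d x T weight matrix) and D (shared d x d feature matrix).
    """
    d, T = X[0].shape[1], len(X)
    D = np.eye(d) / d            # feasible start: trace(D) = 1
    W = np.zeros((d, T))
    for _ in range(iters):
        Dinv = np.linalg.inv(D)  # D is strictly PD since eps > 0
        # Supervised step: independent square-loss solves, coupled via D.
        for t in range(T):
            W[:, t] = np.linalg.solve(X[t].T @ X[t] + gamma * Dinv,
                                      X[t].T @ Y[t])
        # Unsupervised step: closed-form update of the shared matrix D.
        C = sqrtm_psd(W @ W.T + eps * np.eye(d))
        D = C / np.trace(C)
    return W, D

# Illustrative usage on synthetic tasks that share one underlying feature.
rng = np.random.default_rng(0)
X = [rng.standard_normal((30, 5)) for _ in range(3)]
w_true = np.zeros(5); w_true[0] = 1.0
Y = [x @ w_true + 0.01 * rng.standard_normal(30) for x in X]
W, D = mtfl(X, Y, gamma=0.1)
assert abs(np.trace(D) - 1.0) < 1e-8   # D stays on the trace-one simplex
```

Note how the supervised step decouples across tasks once D is fixed, while the D update is the only place the tasks interact.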

### Citations

3703 | Convex optimization
- Boyd, Vandenberghe
- 2004

Citation Context: ...blem related to (14) has been presented in [2]. Theorem 4. If W is an optimal solution of problem (14) then for every t ∈ N_T there exists a vector c_t ∈ R^{mT} such that w_t = Σ_{s=1}^T Σ_{i=1}^m (c_t)_si ϕ(x_si). (15) Proof. Let L = span{ϕ(x_si) : s ∈ N_T, i ∈ N_m}. We can write w_t = p_t + n_t, t ∈ N_T, where p_t ∈ L and n_t ∈ L^⊥. Hence W = P + N, where P is the matrix with columns p_t and N the matrix with columns n_t. ...

878 | The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2001

Citation Context: ... many supervised learning tasks. In particular, we develop a novel non-convex multi-task generalization of the 1-norm regularization known to provide sparse variable selection in the single-task case [20, 27, 40]. Our method learns a few features common across the tasks using a novel regularizer which both couples the tasks and enforces sparsity. These features are orthogonal functions in a prescribed reprodu...

786 | Theory of reproducing kernels
- Aronszajn
- 1950

Citation Context: ...tion of Algorithm 1. 5.1 A Representer Theorem. We begin by restating our optimization problem in the more general case when the tasks’ functions belong to a reproducing kernel Hilbert space, see e.g. [7, 37, 44] and references therein. Formally, we now wish to learn T regression functions f_t, t ∈ N_T, of the form f_t(x) = 〈a_t, U^⊤ ϕ(x)〉 = 〈w_t, ϕ(x)〉, x ∈ R^d, where ϕ : R^d → R^M is a prescribed feature map. T...

472 | Multitask learning
- Caruana
- 1997

Citation Context: ...zation methods have been derived for the simpler problem of feature selection [31], prior work on multi-task feature learning has been based on more complex optimization problems which are not convex [3, 9, 18] and, so, these methods are not guaranteed to converge to a global minimum. In particular, in [9, 18] different neural networks with one or more hidden layers are trained for each task and they all sh...

343 | For most large underdetermined systems of linear equations, the minimal ℓ1 solution is also the sparsest solution
- Donoho
- 2006

Citation Context: ... many supervised learning tasks. In particular, we develop a novel non-convex multi-task generalization of the 1-norm regularization known to provide sparse variable selection in the single-task case [20, 27, 40]. Our method learns a few features common across the tasks using a novel regularizer which both couples the tasks and enforces sparsity. These features are orthogonal functions in a prescribed reprodu...

322 | A framework for learning predictive structures from multiple tasks and unlabeled data
- Ando, Zhang
- 2005

Citation Context: ...ed learning (e.g., using 1-norm regularization) or for unsupervised learning (e.g., using principal component analysis (PCA) or independent component analysis (ICA)), there has been only limited work [3, 9, 31, 48] in the multi-task supervised learning setting. In this paper, we present a novel method for learning sparse representations common across many supervised learning tasks. In particular, we develop a n...

313 | Relations between two sets of variates
- Hotelling
- 1936

Citation Context: ...y be pursued in the context of multi-task learning include multivariate linear models in statistics such as reduced rank regression [30], partial least squares [45] and canonical correlation analysis [29] (see also [16]). These methods are based on generalized eigenvalue problems – see, for example, [13, Chapter 4] for a nice review. They have also been extended in an RKHS setting, see, for example, [...

240 | Sharing features: efficient boosting procedures for multiclass object detection
- Torralba, Murphy, et al.
- 2004

Citation Context: ...ting a specific object in images is treated as a single supervised learning task. Images of different objects may share a number of features that are different from the pixel representation of images [28, 41, 43]. In modeling users/consumers’ preferences [1, 33], there may be common product features (e.g., for cars, books, webpages, consumer electronics, etc.) that are considered to be important by a number of...

166 | Canonical correlation analysis: an overview with application to learning methods
- Hardoon, Szedmak, et al.
- 2003

Citation Context: ...] (see also [16]). These methods are based on generalized eigenvalue problems – see, for example, [13, Chapter 4] for a nice review. They have also been extended in an RKHS setting, see, for example, [11, 26] and references therein. Although these methods have proved useful in practical applications, they require that the same input examples are shared by all the tasks. On the contrary, our approach does ...

158 | Learning multiple tasks with kernel methods
- Evgeniou, Micchelli, et al.

Citation Context: ...re we know what the underlying features used in all tasks are) and real datasets, also using our nonlinear generalization of the proposed method. The results show that, in agreement with previous work [3, 8, 9, 10, 19, 21, 31, 37, 38, 43, 46, 47, 48], multi-task learning improves performance relative to single-task learning when the tasks are related. More importantly, the results confirm that when the tasks are related in the way we define in thi...

154 | A rank minimization heuristic with application to minimum order system approximation
- Fazel, Hindi, et al.
- 2001

Citation Context: ...where we have defined ‖W‖_tr := trace(WW^⊤)^{1/2}. The expression ‖W‖_tr in the regularizer is called the trace norm. It can also be expressed as the sum of the singular values of W. As shown in [23], the trace norm is the convex envelope of rank(W) in the unit ball, which gives another interpretation of the relationship between the rank and γ in our experiments. Solving this problem directly is...
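The two expressions for the trace norm given in this context — trace(WW^⊤)^{1/2} and the sum of the singular values of W — can be checked numerically. A small sketch; the random matrix is just an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))

# Trace norm as the sum of singular values of W.
tr_svd = np.linalg.svd(W, compute_uv=False).sum()

# Equivalent definition: trace((W W^T)^{1/2}), via an eigendecomposition
# of the symmetric PSD matrix W W^T.
vals = np.linalg.eigvalsh(W @ W.T)
tr_eig = np.sqrt(np.clip(vals, 0.0, None)).sum()

assert np.isclose(tr_svd, tr_eig)  # the two definitions agree
```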

149 | Spline models for observational data, volume 59
- Wahba
- 1990

Citation Context: ...tion of Algorithm 1. 5.1 A Representer Theorem. We begin by restating our optimization problem in the more general case when the tasks’ functions belong to a reproducing kernel Hilbert space, see e.g. [7, 37, 44] and references therein. Formally, we now wish to learn T regression functions f_t, t ∈ N_T, of the form f_t(x) = 〈a_t, U^⊤ ϕ(x)〉 = 〈w_t, ϕ(x)〉, x ∈ R^d, where ϕ : R^d → R^M is a prescribed feature map. T...

146 | Maximum-margin matrix factorization
- Srebro, Rennie, et al.
- 2004

Citation Context: ...rnating minimization strategy of Algorithm 1, which is simple to implement and natural to interpret. We also note here that a similar problem has been studied in [42] for the particular case of an SVM loss function. It was shown there that the optimization problem can be solved through an equivalent semi-definite programming problem. We will further discuss relati...

144 | A Model of Inductive Bias Learning
- Baxter

Citation Context: ...zation methods have been derived for the simpler problem of feature selection [31], prior work on multi-task feature learning has been based on more complex optimization problems which are not convex [3, 9, 18] and, so, these methods are not guaranteed to converge to a global minimum. In particular, in [9, 18] different neural networks with one or more hidden layers are trained for each task and they all sh...

135 | Convex Analysis and Nonlinear Optimization: Theory and Examples
- Borwein, Lewis

Citation Context: ...Modifying slightly to account for the feature map, we obtain the problem min { Σ_{t=1}^T Σ_{i=1}^m L(y_ti, 〈w_t, ϕ(x_ti)〉) + γ ‖W‖²_tr : W ∈ R^{d×T} }. (14) This problem can be viewed as a generalization of the standard 2-norm regularization problem. Indeed, in the case T = 1 the trace norm ‖W‖_tr is simply equal to ‖w_1‖_2. In this case, it is well...
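The single-task reduction noted in this context is easy to verify numerically: a one-column matrix W has exactly one nonzero singular value, equal to the Euclidean norm of that column, so the trace norm collapses to the 2-norm. A tiny sketch; the vector is an arbitrary example:

```python
import numpy as np

w = np.array([[3.0], [4.0]])  # single-task case: W has one column

# Trace norm = sum of singular values; for one column this is its 2-norm.
trace_norm = np.linalg.svd(w, compute_uv=False).sum()

assert np.isclose(trace_norm, np.linalg.norm(w))  # both equal 5.0 here
```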

134 | Multi-task feature learning
- Argyriou, Evgeniou, et al.
- 2006

Citation Context: ... Key words: Collaborative Filtering, Inductive Transfer, Kernels, Multi-Task Learning, Regularization, Transfer Learning, Vector-Valued Functions. This is a longer version of the conference paper [4]. It includes new theoretical and experimental results. 1 Introduction. We study the problem of learning data representations that are co...

109 | Task clustering and gating for Bayesian multitask learning
- Bakker, Heskes

Citation Context: ...re we know what the underlying features used in all tasks are) and real datasets, also using our nonlinear generalization of the proposed method. The results show that, in agreement with previous work [3, 8, 9, 10, 19, 21, 31, 37, 38, 43, 46, 47, 48], multi-task learning improves performance relative to single-task learning when the tasks are related. More importantly, the results confirm that when the tasks are related in the way we define in thi...

99 | Multitask learning for classification with Dirichlet process priors
- Xue, Liao, et al.

Citation Context: ...re we know what the underlying features used in all tasks are) and real datasets, also using our nonlinear generalization of the proposed method. The results show that, in agreement with previous work [3, 8, 9, 10, 19, 21, 31, 37, 38, 43, 46, 47, 48], multi-task learning improves performance relative to single-task learning when the tasks are related. More importantly, the results confirm that when the tasks are related in the way we define in thi...

97 | Learning Gaussian processes from multiple tasks
- Yu, Tresp, et al.
- 2005

89 | Exploiting Task Relatedness for Multiple Task Learning
- Ben-David, Schuller

78 | The covariance problem in linear regression. The partial least squares (PLS) approach to generalized inverses
- Wold, Ruhe, et al.

Citation Context: ...n. Other interesting approaches which may be pursued in the context of multi-task learning include multivariate linear models in statistics such as reduced rank regression [30], partial least squares [45] and canonical correlation analysis [29] (see also [16]). These methods are based on generalized eigenvalue problems – see, for example, [13, Chapter 4] for a nice review. They have also been extended...

71 | Predicting multivariate responses in multiple linear regression
- Breiman, Friedman
- 1997

Citation Context: ... the context of multi-task learning include multivariate linear models in statistics such as reduced rank regression [30], partial least squares [45] and canonical correlation analysis [29] (see also [16]). These methods are based on generalized eigenvalue problems – see, for example, [13, Chapter 4] for a nice review. They have also been extended in an RKHS setting, see, for example, [11, 26] and ref...

65 | Multi-task feature and kernel selection for SVMs
- Jebara
- 2004

Citation Context: ...ed learning (e.g., using 1-norm regularization) or for unsupervised learning (e.g., using principal component analysis (PCA) or independent component analysis (ICA)), there has been only limited work [3, 9, 31, 48] in the multi-task supervised learning setting. In this paper, we present a novel method for learning sparse representations common across many supervised learning tasks. In particular, we develop a n...

65 | On learning vector–valued functions
- Micchelli, Pontil
- 2005

62 | Categorization by learning and combining object parts
- Heisele, Serre, et al.
- 2001

Citation Context: ...ting a specific object in images is treated as a single supervised learning task. Images of different objects may share a number of features that are different from the pixel representation of images [28, 41, 43]. In modeling users/consumers’ preferences [1, 33], there may be common product features (e.g., for cars, books, webpages, consumer electronics, etc.) that are considered to be important by a number of...

59 | A theory of object recognition: computations and circuits in the feedforward path of the ventral stream in primate visual cortex, CBCL Paper #259/AI Memo #2005-036
- Serre, Kouh, et al.

Citation Context: ...ting a specific object in images is treated as a single supervised learning task. Images of different objects may share a number of features that are different from the pixel representation of images [28, 41, 43]. In modeling users/consumers’ preferences [1, 33], there may be common product features (e.g., for cars, books, webpages, consumer electronics, etc.) that are considered to be important by a number of...

58 | Learning Multidimensional Signal Processing
- Borga
- 1998

Citation Context: ...ays a role similar to that of the barrier used in interior-point methods. In Appendix A, we prove that the optimal solution of problem (12) is given by D_ε(W) = (WW^⊤ + εI)^{1/2} / trace((WW^⊤ + εI)^{1/2}) (13) and the optimal value equals (trace((WW^⊤ + εI)^{1/2}))². In the same appendix, we also show that for ε = 0, equation (13) gives the minimizer of the function R(W, ·) subject to the constraints in pr...
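The closed form in equation (13), and the associated optimal value, can be verified numerically. A small sketch with an arbitrary random W; the helper name `d_eps` is ours, not from the paper:

```python
import numpy as np

def d_eps(W, eps):
    # Equation (13): D = (W W^T + eps*I)^{1/2} / trace((W W^T + eps*I)^{1/2}).
    A = W @ W.T + eps * np.eye(W.shape[0])
    vals, vecs = np.linalg.eigh(A)
    C = (vecs * np.sqrt(vals)) @ vecs.T  # PSD matrix square root of A
    return C / np.trace(C), A, C

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 3))
D, A, C = d_eps(W, 1e-3)

assert np.isclose(np.trace(D), 1.0)        # feasible: trace(D) = 1
assert np.all(np.linalg.eigvalsh(D) > 0)   # strictly positive definite

# At this minimizer, the objective trace(D^{-1}(W W^T + eps*I)) equals
# the stated optimal value (trace((W W^T + eps*I)^{1/2}))^2.
assert np.isclose(np.trace(np.linalg.inv(D) @ A), np.trace(C) ** 2)
```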

56 | Matrix Analysis. Graduate Texts in Mathematics
- Bhatia
- 1996

Citation Context: ...In the second step, we keep matrix W fixed, and minimize R_ε with respect to D. To this end, we solve the problem min { Σ_{t=1}^T 〈w_t, D^{-1} w_t〉 + ε trace(D^{-1}) : D ∈ S^d_++, trace(D) ≤ 1 }. (12) The term trace(D^{-1}) keeps the D-iterates of the algorithm at a certain distance from the boundary of S^d_+ and plays a role similar to that of the barrier used in interior-point methods. In Appendix ...

51 | Multi-task feature selection
- Obozinski, Taskar, et al.
- 2006

Citation Context: ..., measured according to a prescribed loss function L : R × R → R_+ which is convex in the second argument. A similar regularization function, but without matrix U, was also independently developed by [39] for the purpose of multi-task feature selection – see problem (5) below. ...

47 | Learning to learn with the informative vector machine - Lawrence, Platt - 2004

46 | Hierarchical Bayes Conjoint Analysis: Recovery of Partworth Heterogeneity from Reduced Experimental Designs
- Lenk, DeSarbo, et al.
- 1996

Citation Context: ... supervised learning task. Images of different objects may share a number of features that are different from the pixel representation of images [28, 41, 43]. In modeling users/consumers’ preferences [1, 33], there may be common product features (e.g., for cars, books, webpages, consumer electronics, etc.) that are considered to be important by a number of people (we consider modeling an individual’s pref...

45 | Learning multiple related tasks using latent independent component analysis
- Zhang, Ghahramani, et al.

Citation Context: ...ed learning (e.g., using 1-norm regularization) or for unsupervised learning (e.g., using principal component analysis (PCA) or independent component analysis (ICA)), there has been only limited work [3, 9, 31, 48] in the multi-task supervised learning setting. In this paper, we present a novel method for learning sparse representations common across many supervised learning tasks. In particular, we develop a n...

41 | A sparse representation for function approximation
- Poggio, Girosi
- 1998

Citation Context: ... many supervised learning tasks. In particular, we develop a novel non-convex multi-task generalization of the 1-norm regularization known to provide sparse variable selection in the single-task case [20, 27, 40]. Our method learns a few features common across the tasks using a novel regularizer which both couples the tasks and enforces sparsity. These features are orthogonal functions in a prescribed reprodu...

39 | Learning convex combinations of continuously parameterized basic kernels
- Argyriou, Micchelli, et al.
- 2005

Citation Context: ...hich is convex in the second argument. A similar regularization function, but without matrix U, was also independently developed by [39] for the purpose of multi-task feature selection – see problem (5) below. ...

37 | Fast rates for regularized least-squares algorithm
- Caponnetto, Vito
- 2005

Citation Context: ...rely on this assumption. Our work may be extended in different directions. First, it would be interesting to carry out a learning theory analysis of the algorithms presented in this paper. Results in [17, 35] may be useful for this purpose. Another interesting question is to study how the solution of our algorithms depends on the regularization parameter and investigate...

31 | Low-rank matrix factorization with attributes
- Abernethy, Bach, et al.
- 2006

Citation Context: ... this result to the more general form (14). Our proof is connected to the theory of operator monotone functions. We note that a representer theorem for a problem related to (14) has been presented in [2]. Theorem 4. If W is an optimal solution of problem (14) then for every t ∈ N_T there exists a vector c_t ∈ R^{mT} such that w_t = Σ_{s=1}^T Σ_{i=1}^m (c_t)_si ϕ(x_si). (15) Proof. Let L = span{ϕ(x_si) : s ∈ N_T, i ∈...

28 | The convex analysis of unitarily invariant matrix norms - Lewis - 1995

27 | Marketing Research
- Aaker, Kumar, et al.
- 1995

Citation Context: ... supervised learning task. Images of different objects may share a number of features that are different from the pixel representation of images [28, 41, 43]. In modeling users/consumers’ preferences [1, 33], there may be common product features (e.g., for cars, books, webpages, consumer electronics, etc.) that are considered to be important by a number of people (we consider modeling an individual’s pref...

26 | Reduced-rank regression for the multivariate linear model
- Izenman
- 1975

Citation Context: ...with any convex loss function. Other interesting approaches which may be pursued in the context of multi-task learning include multivariate linear models in statistics such as reduced rank regression [30], partial least squares [45] and canonical correlation analysis [29] (see also [16]). These methods are based on generalized eigenvalue problems – see, for example, [13, Chapter 4] for a nice review. ...

23 | Bounds for linear multi-task learning
- Maurer
- 2006

Citation Context: ...rely on this assumption. Our work may be extended in different directions. First, it would be interesting to carry out a learning theory analysis of the algorithms presented in this paper. Results in [17, 35] may be useful for this purpose. Another interesting question is to study how the solution of our algorithms depends on the regularization parameter and investigate...

20 | A machine learning approach to conjoint analysis
- Chapelle, Harchaoui
- 2005

11 | A convex optimization approach to modeling consumer heterogeneity in conjoint estimation
- Evgeniou, Pontil, et al.

Citation Context: ... know that W = UΣQ^⊤, where U ∈ R^{M×δ′}, Σ ∈ S^{δ′}_++ diagonal, Q ∈ R^{T×δ′} orthogonal, δ′ ≤ δ, and the columns of U are the significant features learned. From this and (21) we obtain that U = Φ̃BQΣ^{-1} (22) and Σ and Q can be computed from QΣ²Q^⊤ = W^⊤W = B^⊤Φ̃^⊤Φ̃B. Finally, the coefficient matrix A can be computed from W = UA, (21) and (22), yielding A = [ΣQ^⊤; 0]. The computational cost ...

7 | Nonparametric identification of population models via Gaussian processes
- Neve, Nicolao, et al.
- 2007

5 | An optimization perspective on partial least squares
- Bennett, Embrechts
- 2003

Citation Context: ...] (see also [16]). These methods are based on generalized eigenvalue problems – see, for example, [13, Chapter 4] for a nice review. They have also been extended in an RKHS setting, see, for example, [11, 26] and references therein. Although these methods have proved useful in practical applications, they require that the same input examples are shared by all the tasks. On the contrary, our approach does ...

5 | Variational problems arising from balancing several error criteria
- Micchelli, Pinkus
- 1994

Citation Context: ... many components of the learned vector a_t are zero, see [20] and references therein. Moreover, the number of nonzero components of a solution of problem (3) is typically a nonincreasing function of γ [36]. Since we do not simply want to select the features but also learn them, we further minimize the function E over U. Therefore, our approach for multi-task feature...
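The nonincreasing dependence of sparsity on γ mentioned in this context is easiest to see in the orthonormal-design special case, where the 1-norm regularized solution is componentwise soft-thresholding. A small illustrative sketch; the vector `z` is an arbitrary example, not data from the paper:

```python
import numpy as np

def soft_threshold(z, gamma):
    # Under an orthonormal design, argmin_a 0.5*||a - z||^2 + gamma*||a||_1
    # is given componentwise by the soft-thresholding operator.
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

z = np.array([3.0, -1.5, 0.8, 0.2, -0.05])
counts = [np.count_nonzero(soft_threshold(z, g))
          for g in (0.0, 0.5, 1.0, 2.0, 4.0)]

# The number of nonzero components shrinks as gamma grows.
assert counts == sorted(counts, reverse=True)
```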

4 | Multilevel modelling of survey data
- Goldstein
- 1991

Citation Context: ...kernel. 6.3 School Data. We have also tested our algorithms on the data from the Inner London Education Authority. This data set has been used in previous work on multi-task learning, for example in [8, 21, 24]. It consists of examination scores of 15362 students from 139 secondary schools in London during the years 1985, 1986 and 1987. Thus, there are 139 tasks, corresponding to predicting student performa...

4 | The Rademacher complexity of linear transformation classes
- Maurer
- 2006

Citation Context: ...ny convex loss function. Our work may be extended in different directions. First, it would be interesting to carry out a learning theory analysis of the algorithms presented in this paper. Results in [12, 25] may be useful for this purpose. Another interesting question is to study how the solutions of our algorithm depend on the regularization parameter and investigate conditions which ensure that the num...

1 | Representer theorems for spectral norms. Working paper
- Argyriou, Micchelli, et al.
- 2007

Citation Context: ...r. We also have that 〈w_t, ϕ(x_ti)〉 = 〈p_t, ϕ(x_ti)〉. Thus, we conclude that whenever W is optimal, N must be zero. ⊓⊔ We also note that this theorem can be extended to a general family of spectral norms [6]. An alternative way to write equation (15), using matrix notation, is to express W as a multiple of the input matrix. The latter is the matrix Φ ∈ R^{M×mT} whose (t, i)-th column is the vector ϕ(x_ti) ∈...