## Regularized multi-task learning (2004)

Citations: 170 (1 self)

### BibTeX

```bibtex
@inproceedings{Micchelli04regularizedmulti-task,
  author    = {Charles A. Micchelli and Massimiliano Pontil},
  title     = {Regularized multi-task learning},
  booktitle = {},
  year      = {2004},
  pages     = {109--117}
}
```

### Abstract

This paper provides a foundation for multi-task learning using reproducing kernel Hilbert spaces of vector-valued functions. In this setting, the kernel is a matrix-valued function. Some explicit examples will be described which go beyond our earlier results in [7]. In particular, we characterize classes of matrix-valued kernels which are linear and are of the dot product or the translation invariant type. We discuss how these kernels can be used to model relations between the tasks and present linear multi-task learning algorithms. Finally, we present a novel proof of the representer theorem for a minimizer of a regularization functional which is based on the notion of minimal norm interpolation.

### Citations

9946 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...) variables, so the total number of inputs for each student in each of the schools was 27. Since the goal is to predict the exam scores of the students we run regression using the SVM ε-loss function [25] for the multi-task learning method proposed. We consider each school to be "one task". Therefore we had 139 tasks. We made 10 random splits of the data into training (75% of the data, hence around 70...
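The SVM ε-loss mentioned in this context charges nothing for predictions inside an ε-tube around the target and a linear penalty outside it. A minimal NumPy sketch (the toy values are illustrative, not taken from the school data):

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """SVM epsilon-insensitive loss: errors within the eps tube cost nothing,
    errors outside it are charged linearly."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

# toy example (illustrative values only)
y = np.array([1.0, 2.0, 3.0])
f = np.array([1.05, 2.5, 2.0])
print(eps_insensitive_loss(y, f, eps=0.1))  # -> [0.  0.4 0.9]
```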

1418 | Spline Models for Observational Data
- Wahba
- 1990
Citation Context: ...f_t(x) = w_t · x, where "·" denotes the standard inner product in R^d. The generalization to nonlinear models will then be done through the use of Reproducing Kernel Hilbert Spaces (RKHS), see for example [20, 25, 26]. In the case of classification each y_it takes the values ±1, and f_t is the sign of w_t · x. Below we consider this case; regression can be treated similarly. All previously proposed frameworks and met...

850 | Theory of reproducing kernels
- Aronszajn
- 1950
Citation Context: ...h inner product ⟨·, ·⟩. We present two methods to enhance standard RKHS to vector-valued functions. 2.1 Matrix-valued kernels based on Aronszajn. The first approach extends the scalar case, Y = IR, in [2]. Definition 1. We say that H is a reproducing kernel Hilbert space (RKHS) of functions f : X → Y, when for any y ∈ Y and x ∈ X the linear functional which maps f ∈ H to (y, f(x)) is continuous on H. W...
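The definition quoted in this context pairs with a reproducing property; written out in the standard form from the vector-valued RKHS literature (a reconstruction, not a verbatim quote from the paper):

```latex
% Reproducing property of a vector-valued RKHS: evaluation against y in Y
% is represented by a linear operator K_x : Y -> H,
(y, f(x)) = \langle K_x y, f \rangle_{\mathcal{H}},
\qquad f \in \mathcal{H},\ x \in \mathcal{X},\ y \in \mathcal{Y},
% and the matrix-valued kernel is then recovered from these operators as
K(x, t)\, y := (K_t\, y)(x), \qquad x, t \in \mathcal{X},\ y \in \mathcal{Y}.
```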

785 | Laplacian eigenmaps for dimensionality reduction and data representation
- BELKIN, NIYOGI
- 1998
Citation Context: ...Σ_{ℓ,q∈IN_n} ‖u_ℓ − u_q‖² A_ℓq = Σ_{ℓ,q∈IN_n} (u_ℓ, u_q) L_ℓq (4.19), where L = D − A with D_ℓq = δ_ℓq Σ_{h∈IN_n} A_ℓh. The matrix A could be the weight matrix of a graph with n vertices and L the graph Laplacian, see e.g. [4]. The equation A_ℓq = 0 means that tasks ℓ and q are not related, whereas A_ℓq = 1 means strong relation. In order to derive the matrix-valued kernel we note that (4.19) can be written as (u, L̃u) wh...
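The Laplacian construction in this context (tasks as graph vertices, edges as task relations) can be sketched in a few lines of NumPy; the adjacency matrix and vector below are illustrative assumptions, not data from the paper:

```python
import numpy as np

# Adjacency for 4 tasks: A[l, q] = 1 means tasks l and q are strongly related.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))  # degree matrix: D_lq = delta_lq * sum_h A_lh
L = D - A                   # graph Laplacian

# The Laplacian quadratic form penalizes disagreement between related tasks:
# u^T L u = 1/2 * sum_{l,q} A_lq * (u_l - u_q)^2
u = np.array([1.0, 2.0, 0.5, 0.5])
lhs = u @ L @ u
rhs = 0.5 * sum(A[l, q] * (u[l] - u[q]) ** 2
                for l in range(4) for q in range(4))
assert np.isclose(lhs, rhs)
print(lhs)  # -> 1.25
```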

288 | Regularization networks and support vector machines - Evgeniou, Pontil, et al. - 2000

151 | A model of inductive bias learning
- Baxter
Citation Context: ...e from a variety of perspectives. Our main motivation is the practical problem of multi-task learning where we wish to learn many related regression or classification functions simultaneously, see e.g. [3, 5, 6]. For instance, image understanding requires the estimation of multiple binary classifiers simultaneously, where each classifier is used to detect a specific object. Specific examples include locating...

148 | A generalized representer theorem
- Schölkopf, Herbrich, et al.
Citation Context: ...scalar-valued kernel defined, for all (x, y), (t, z) ∈ X × Y, by the formula K¹((x, y), (t, z)) := (y, K(x, t)z). (2.6) 2.2 Feature map. The second approach uses the notion of feature map, see e.g. [9]. A feature map is a function Φ : X × Y → W where W is a Hilbert space. A feature map representation of a kernel K has the property that, for every x, t ∈ X and y, z ∈ Y there holds the equation (Φ(x,...
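The feature-map equation is truncated in this context; its standard form in the vector-valued kernel literature is (a reconstruction, not a verbatim quote):

```latex
% A feature map \Phi : X x Y -> W represents the matrix-valued kernel K via
(\Phi(x, y), \Phi(t, z))_{\mathcal{W}} = (y, K(x, t)\, z),
\qquad x, t \in \mathcal{X},\ y, z \in \mathcal{Y}.
```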

120 |
Marketing Models of Consumer Heterogeneity
- Allenby, Rossi
- 1999
Citation Context: ...sting models for predicting the value of many possibly related indicators simultaneously is often required; in marketing, modeling the preferences of many individuals simultaneously is common practice [1, 2]. When there are relations between the tasks to learn, it can be advantageous to learn all tasks simultaneously instead of following the more traditional approach of learning each task independently o...

114 | Task clustering and gating for bayesian multitask learning
- Bakker, Heskes
Citation Context: ...ning each task independently of the others. There has been a lot of experimental work showing the benefits of such multi-task learning relative to individual task learning when tasks are related, see [4, 11, 15, 22]. There have also been various attempts to theoretically study multi-task learning, see [4, 5, 6, 7, 8, 15, 23]. In this paper we develop methods for multi-task learning that are natural extensions of...

106 | Learning to Learn
- Thrun, Pratt
- 1998
Citation Context: ...ning each task independently of the others. There has been a lot of experimental work showing the benefits of such multi-task learning relative to individual task learning when tasks are related, see [4, 11, 15, 22]. There have also been various attempts to theoretically study multi-task learning, see [4, 5, 6, 7, 8, 15, 23]. In this paper we develop methods for multi-task learning that are natural extensions of...

92 | Exploiting task relatedness for multiple task learning
- Ben-David, Schuller
Citation Context: ...e from a variety of perspectives. Our main motivation is the practical problem of multi-task learning where we wish to learn many related regression or classification functions simultaneously, see e.g. [3, 5, 6]. For instance, image understanding requires the estimation of multiple binary classifiers simultaneously, where each classifier is used to detect a specific object. Specific examples include locating...

78 | A Bayesian/information theoretic model of learning to learn via multiple task sampling - Baxter - 1997

77 | Predicting Multivariate Responses in Multiple Linear Regression
- Breiman, Friedman
- 1997
Citation Context: ...idered in [6]) in the case that the learning tasks are related in a particular way defined. The problem of multi-task learning has been also studied in the statistics literature. Breiman and Friedman [9] propose the curds & whey method, where the relations between the various tasks are modeled in a post-processing fashion. Brown and Zidek [10] consider the case of regression and propose an extension o...

71 | On Learning Vector-Valued Functions
- Micchelli, Pontil
- 2005
Citation Context: ...reproducing kernel Hilbert spaces of vector-valued functions. In this setting, the kernel is a matrix-valued function. Some explicit examples will be described which go beyond our earlier results in [7]. In particular, we characterize classes of matrix-valued kernels which are linear and are of the dot product or the translation invariant type. We discuss how these kernels can be used to model rela...

66 | Categorization by learning and combining object parts
- Heisele, Serre, et al.
Citation Context: ...data for task t come from a space X_t × Y_t; this is for example the machine vision case of learning to recognize a face by first learning to recognize parts of the face, such as eyes, mouth, and nose [14]. Each of these related tasks can be learned using images of different size (or different representations). The methods we develop below may be extended to handle such scenarios, for example through t...

34 | Theory of Linear Operators in Hilbert Space, vol. I
- Akhiezer
- 1981
Citation Context: ...g kernel Hilbert space (RKHS) of functions f : X → Y, when for any y ∈ Y and x ∈ X the linear functional which maps f ∈ H to (y, f(x)) is continuous on H. We conclude from the Riesz Lemma (see, e.g., [1]) that, for every x ∈ X and y ∈ Y, there is a linear operator K_x : Y → H such that (y, f(x)) = ⟨K_x y, f⟩. (2.1) For every x, t ∈ X we also introduce the linear operator K(x, t) : Y → Y defined, for eve...

32 | Modeling consumer demand for variety - Kim, Allenby, et al. - 2002

31 | Clustering learning tasks and the selective cross–task transfer of knowledge
- Thrun, O’Sullivan
- 1998
Citation Context: ...its of such multi-task learning relative to individual task learning when tasks are related, see [4, 11, 15, 22]. There have also been various attempts to theoretically study multi-task learning, see [4, 5, 6, 7, 8, 15, 23]. In this paper we develop methods for multi-task learning that are natural extensions of existing kernel based learning methods for single task learning, such as Support Vector Machines (SVMs) [25]. ...

30 | The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. Connection Science - Silver, Mercer - 1996

30 | Fast Polyhedral Adaptive Conjoint Estimation. Marketing Sci
- Toubia, Simester, et al.
- 2003
Citation Context: ...below can be seen as results for a classification problem. We followed the basic simulation design used by other researchers in the past. In particular we simply replicated the experimental setup of [3, 12, 24]. For completeness we briefly describe that setup. We generated data describing products with 4 attributes (i.e. size, weight, functionality, and ease-of-use of a product), each attribute taking 4 val...

28 | Improving Parameter Estimates and Model Prediction by Aggregate Customization
- Arora, Huber
- 2001
Citation Context: ...below can be seen as results for a classification problem. We followed the basic simulation design used by other researchers in the past. In particular we simply replicated the experimental setup of [3, 12, 24]. For completeness we briefly describe that setup. We generated data describing products with 4 attributes (i.e. size, weight, functionality, and ease-of-use of a product), each attribute taking 4 val...

24 | Empirical bayes for learning to learn
- Heskes
- 2000
Citation Context: ...ning each task independently of the others. There has been a lot of experimental work showing the benefits of such multi-task learning relative to individual task learning when tasks are related, see [4, 11, 15, 22]. There have also been various attempts to theoretically study multi-task learning, see [4, 5, 6, 7, 8, 15, 23]. In this paper we develop methods for multi-task learning that are natural extensions of...

23 | A Hierarchical Bayes Model of Primary and Secondary Demand. Marketing Sci
- Arora, Allenby, et al.
- 1998
Citation Context: ...3.1 Simulated Data. We tested the proposed method using data that capture the preferences of individuals (consumers) when they choose among products. This is the standard problem of conjoint analysis [1, 2] for preference modeling. It turns out [12] that this problem is equivalent to solving a classification problem, therefore the results we report below can be seen as results for a classification probl...

21 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...ed kernel K : X × X → IR^{n×n} that reflects the interaction amongst the components of f. This paper provides a foundation for this approach. For example, in the case of support vector machines (SVMs) [10], appropriate choices of the matrix-valued kernel implement a trade-off between large margin of each per-task SVM and large margin of combinations of these SVMs, e.g. their average. The paper is organi...

19 | A theoretical framework for learning from a pool of disparate data sources
- Ben-David, Gehrke, et al.
- 2002
Citation Context: ...its of such multi-task learning relative to individual task learning when tasks are related, see [4, 11, 15, 22]. There have also been various attempts to theoretically study multi-task learning, see [4, 5, 6, 7, 8, 15, 23]. In this paper we develop methods for multi-task learning that are natural extensions of existing kernel based learning methods for single task learning, such as Support Vector Machines (SVMs) [25]. ...

9 | A function representation for learning in Banach spaces
- Micchelli, Pontil
- 2004
Citation Context: ...nique and admits the form f̂ = Σ_{j∈IN_m} K_{x_j} c_j. (4.13) We refer to [7] for a proof. This approach achieves both simplicity and generality. For example, it can be extended to normed linear spaces, see [8]. Our next result establishes that the form of any local minimizer indeed has the same form as in Lemma 3. This result improves upon [9] where it is proven only for a global minimizer. Theorem 1. If ...
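The representer form f̂ = Σ_j K_{x_j} c_j in this context can be sketched numerically for the common separable case K(x, t) = k(x, t)·B, where a scalar kernel is multiplied by a task-relation matrix B. The kernel choice, B, and all data below are illustrative assumptions, not the paper's:

```python
import numpy as np

def scalar_kernel(x, t):
    """Gaussian scalar kernel (an assumed choice for illustration)."""
    return np.exp(-np.sum((x - t) ** 2))

def f_hat(x, X_train, C, B):
    """Representer-theorem expansion f(x) = sum_j K(x, x_j) c_j for the
    separable kernel K(x, t) = k(x, t) * B; C[j] is the coefficient c_j."""
    return sum(scalar_kernel(x, xj) * (B @ cj) for xj, cj in zip(X_train, C))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 3))        # 5 training inputs in R^3
C = rng.normal(size=(5, 2))              # one coefficient vector per input
B = np.array([[1.0, 0.5], [0.5, 1.0]])   # assumed task-relation matrix
print(f_hat(X_train[0], X_train, C, B))  # vector of outputs for 2 tasks
```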

8 | Adaptive multivariate ridge regression
- Brown, Zidek
- 1980
Citation Context: ...studied in the statistics literature. Breiman and Friedman [9] propose the curds & whey method, where the relations between the various tasks are modeled in a post-processing fashion. Brown and Zidek [10] consider the case of regression and propose an extension of the standard ridge regression to multivariate ridge regression. Finally, a number of approaches for learning multiple tasks or for learning...

5 | On Learning Vector-Valued Functions, Research Note RN/03/08
- Micchelli, Pontil
- 2003
Citation Context: ...ng images of different size (or different representations). The methods we develop below may be extended to handle such scenarios, for example through the appropriate choice of a matrix-valued kernel [20] discussed in section 2.2. 2. METHODS FOR MULTI-TASK LEARNING. For simplicity we first assume that the function f_t for the t-th task is a hyperplane, that is f_t(x) = w_t · x, where "·" denotes the standard in...

3 | A framework for genomic data fusion and its application to membrane protein prediction - Lanckriet, Bie, et al. - 2003 |