## Kernel dimension reduction in regression (2006)

Citations: 28 (12 self)

### BibTeX

```bibtex
@TECHREPORT{Fukumizu06kerneldimension,
  author      = {Kenji Fukumizu and Francis R. Bach and Michael I. Jordan},
  title       = {Kernel dimension reduction in regression},
  institution = {},
  year        = {2006}
}
```

### Abstract

Acknowledgements. The authors thank the editor and anonymous referees for their helpful comments. The authors also thank Dr. Yoichi Nishiyama for his helpful comments on the uniform convergence of empirical processes. We would like to acknowledge support from JSPS KAKENHI 15700241,

### Citations

1283 | Spline models for observational data - Wahba - 1990 |

786 | Theory of reproducing kernels - Aronszajn - 1950
Citation Context: ...ase space is a topological space, the Borel σ-field is always assumed. Let (H_X, k_X) and (H_Y, k_Y) be RKHSs of functions on X and Y, respectively, with measurable positive definite kernels k_X and k_Y [1]. We consider a random vector (X, Y): Ω → X × Y with the law P_XY. The marginal distribution of X and Y are denoted by P_X and P_Y, respectively. It is always assumed that the positive definite kerne... |

574 | Convergence of Stochastic Processes - Pollard - 1984
Citation Context: ...i) ≤ δ holds for any θ ∈ Θ. We write N(δ) for N(δ, d, Θ) if there is no confusion. For δ > 0, the covering integral J(δ) for Θ is defined by J(δ) = ∫₀^δ (8 log(N(u)²/u))^{1/2} du. The chaining lemma [25], which plays a crucial role in the uniform central limit theorem, is readily extendable to a random process in a Banach space. Lemma 18 (Chaining Lemma). Let Θ be a set with semimetric d, and let {Z(... |

408 | Projection pursuit regression - Friedman, Stuetzle - 1981
Citation Context: ...ditional distribution P(Y | X) or the regression E(Y | X). These methods include ordinary least squares, partial least squares, canonical correlation analysis, ACE [4], projection pursuit regression [12], neural networks, and LASSO [29]. These methods can be effective if the modeling assumptions that they embody are met, but if these assumptions do not hold there is no guarantee of finding the centra... |

388 | Asymptotic Statistics - van der Vaart - 1998 |

329 | Regression shrinkage and selection via the lasso - Tibshirani - 1996 |

324 | Kernel independent component analysis - Bach, Jordan - 2002
Citation Context: ...at the Gaussian RBF kernel exp(−‖x−y‖²/σ²) and the so-called Laplacian kernel exp(−α ∑_{i=1}^m |x_i − y_i|) (α > 0) are characteristic on ℝ^m or on a compact subset of ℝ^m with respect to the Borel σ-field [2, 15, 28]. The following theorem improves Theorem 7 in [13], and is the theoretical basis of kernel dimension reduction. In the following, let P_B denote the probability on X induced from P_X by the projection B... |
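The two characteristic kernels quoted in this context are easy to instantiate. The sketch below is our own illustration (not code from the paper; the helper names are made up): it builds Gram matrices for the Gaussian RBF kernel exp(−‖x−y‖²/σ²) and the Laplacian kernel exp(−α ∑|x_i − y_i|), and checks positive semi-definiteness numerically.

```python
import numpy as np

def gaussian_rbf_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / sigma^2) over the rows of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

def laplacian_gram(X, alpha=1.0):
    """Gram matrix K[i, j] = exp(-alpha * sum_k |x_ik - x_jk|) over the rows of X."""
    l1_dists = np.sum(np.abs(X[:, None, :] - X[None, :, :]), axis=-1)
    return np.exp(-alpha * l1_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
G = gaussian_rbf_gram(X, sigma=1.0)
L = laplacian_gram(X, alpha=1.0)
# Both Gram matrices are symmetric with unit diagonal, and their smallest
# eigenvalues are non-negative, as positive definiteness requires.
```

Positive definiteness of the Gram matrix for every finite sample is exactly what makes these functions valid RKHS kernels; being *characteristic* is the stronger property discussed in the excerpt.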

280 | Probability in Banach Spaces - Ledoux, Talagrand - 1991 |

238 | Estimating Optimal Transformations for Multiple Regression and Correlation - Breiman, Friedman - 1985 |

188 | Sliced inverse regression for dimension reduction - Li - 1991 |

178 | Protocol Analysis - Ericsson, Simon - 1984 |

164 | On the influence of the kernel on the consistency of support vector machines - Steinwart - 2001 |


148 | Spline Models for Observational Data, volume 59 - Wahba - 1990
Citation Context: ...itive definite kernels are Hilbert spaces of smooth functions that are “small” enough to yield computationally-tractable procedures, but are rich enough to capture nonparametric phenomena of interest [32], and this computational focus is an important aspect of our work. On the other hand, whereas in nonparametric regression and classification the role of RKHS’s is to provide basis expansions of regres... |

147 | The Theory of Tikhonov Regularization for Fredholm Equations of the First Kind - Groetsch - 1984
Citation Context: ...denote the empirical conditional covariance Σ̂_YY|X^{B(n)} = Σ̂_YY^{B(n)} − Σ̂_YX^{B(n)} (Σ̂_XX^{B(n)} + ε_n I)^{−1} Σ̂_XY^{B(n)}. (15) The regularization term ε_n I (ε_n > 0) is required to enable operator inversion and is thus analogous to Tikhonov regularization [17]. We will see that the regularization term is also needed for consistency. We now define the KDR estimator B^{(n)} as any minimizer of Tr[Σ̂_YY|X^{B(n)}] on the manifold S^m_d(ℝ); that is, any matri... |
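For intuition about the regularized trace in this context: with empirically centered Gram matrices G̃_Z (for the projected covariate Z = Bᵀ X) and G̃_Y, a standard Gram-matrix reduction gives Tr[Σ̂_YY|X] = ε_n · Tr[(G̃_Z + n ε_n I)^{−1} G̃_Y], so minimizing the trace over projections B amounts to minimizing that matrix expression. The sketch below is our own hedged illustration of evaluating this contrast (not the authors' implementation; the function names, the Gaussian kernel choice, and the toy data are our assumptions).

```python
import numpy as np

def center(K):
    """Empirically center a Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return H @ K @ H

def rbf_gram(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma ** 2)

def kdr_contrast(Z, Y, eps=0.1, sigma=1.0):
    """Tr[(G_Z + n*eps*I)^{-1} G_Y] over centered Gram matrices: smaller values
    mean the projected covariate Z explains more of the variability of Y."""
    n = Z.shape[0]
    Gz = center(rbf_gram(Z, sigma))
    Gy = center(rbf_gram(Y, sigma))
    return np.trace(np.linalg.solve(Gz + n * eps * np.eye(n), Gy))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X[:, :1] ** 2 + 0.1 * rng.normal(size=(100, 1))  # Y depends on coordinate 1 only
good = kdr_contrast(X[:, :1], Y)  # projection onto the true direction
bad = kdr_contrast(X[:, 2:], Y)   # projection onto an irrelevant direction
# The contrast should come out smaller for the true direction, which is what
# the KDR estimator exploits when minimizing over projection matrices B.
```

Minimizing this quantity over d-dimensional projection matrices B (e.g. by gradient descent with orthonormality constraints) corresponds to the optimization over the manifold described in the excerpt.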

133 | Harmonic analysis on semigroups - Berg, Christensen, et al. - 1984 |

118 | Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces - Fukumizu, Bach, et al. - 2004
Citation Context: ...re of no practical method that attacks SDR directly by using nonparametric methodology to assess departures from conditional independence. We presented an earlier kernel dimension reduction method in [13]. The contrast function presented in that paper, however, was not derived as an estimator of a conditional covariance operator, and it was not possible to establish a consistency result for that appro... |

97 | Measuring statistical dependence with hilbert-schmidt norms - Gretton, Bousquet, et al. - 2005 |

82 | Functional Analysis - Lax - 2002 |

73 | An Introduction to Regression Graphics - Cook, Weisberg - 1994
Citation Context: .... Our methodology derives directly from the formulation of SDR in terms of the conditional independence of the covariate X from the response Y, given the projection of X on the central subspace (cf. [23, 6]). We show that this conditional independence assertion can be characterized in terms of conditional covariance operators on reproducing kernel Hilbert spaces and we show how this characterization lea... |

67 | On principal Hessian directions for data visualization and dimension reduction: Another application of Stein’s lemma - Li - 1992
Citation Context: ...ile others aim at finding a central mean subspace, which is a subspace of the central subspace that is effective only for the regression E[Y | X]. The latter include principal Hessian directions (pHd, [24]) and contour regression [22]. A particular focus of these more recent developments has been the exploitation of second moments within an inverse regression framework. While the inverse regression per... |

52 | Kernel measures of conditional dependence - Fukumizu, Gretton, et al.
Citation Context: ...at the Gaussian RBF kernel exp(−‖x−y‖²/σ²) and the so-called Laplacian kernel exp(−α ∑_{i=1}^m |x_i − y_i|) (α > 0) are characteristic on ℝ^m or on a compact subset of ℝ^m with respect to the Borel σ-field [2, 15, 28]. The following theorem improves Theorem 7 in [13], and is the theoretical basis of kernel dimension reduction. In the following, let P_B denote the probability on X induced from P_X by the projection B... |

43 | Joint measures and cross-covariance operators - Baker - 1973
Citation Context: ...E_X[k_X(X, X)] for f ∈ H_X. The cross-covariance operator of (X, Y) is an operator from H_X to H_Y so that ⟨g, Σ_YX f⟩_{H_Y} = E_XY[(f(X) − E_X[f(X)])(g(Y) − E_Y[g(Y)])] holds for all f ∈ H_X and g ∈ H_Y [3, 13]. Obviously, Σ_YX = Σ*_XY, where T* denotes the adjoint of an operator T. If Y is equal to X, the positive self-adjoint operator Σ_XX is called the covariance operator. For a random variable X : Ω →... |
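The cross-covariance operator defined in this context has a simple empirical Hilbert–Schmidt norm: with centered Gram matrices, ‖Σ̂_YX‖²_HS = (1/n²) Tr[G̃_X G̃_Y], the statistic studied in the Gretton et al. entry in this list. The sketch below is our own illustration under assumed Gaussian RBF kernels (helper names are made up, not from the paper).

```python
import numpy as np

def center(K):
    """Empirically center a Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return H @ K @ H

def rbf_gram(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma ** 2)

def hs_norm_sq(X, Y, sigma=1.0):
    """Empirical ||Sigma_YX||_HS^2 = (1/n^2) Tr[G_X G_Y] over centered Gram
    matrices; it vanishes when the empirical cross-covariance operator is zero."""
    n = X.shape[0]
    return np.trace(center(rbf_gram(X, sigma)) @ center(rbf_gram(Y, sigma))) / n ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
dep = hs_norm_sq(X, np.sin(3.0 * X))            # strongly dependent pair
ind = hs_norm_sq(X, rng.normal(size=(200, 1)))  # independent pair
# The squared HS norm should be larger for the dependent pair, since with a
# characteristic kernel it captures arbitrary (nonlinear) dependence.
```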

43 | Multivariate Statistics: A Practical Approach (pp. 181–233) - Flury, Riedwyl - 1988
Citation Context: ... dimension reduction methods. The first data set that we studied is Swiss bank notes, which has been previously studied in the dimension reduction context by Cook and Lee [7], with the data taken from [11]. The problem is that of classifying counterfeit and genuine Swiss bank notes. The data is a sample of 100 counterfeit and 100 genuine notes. There are six continuous explanatory variables that repres... |

36 | Injective Hilbert space embeddings of probability measures - Sriperumbudur, Gretton, et al. - 2008
Citation Context: ...at the Gaussian RBF kernel exp(−‖x−y‖²/σ²) and the so-called Laplacian kernel exp(−α ∑_{i=1}^m |x_i − y_i|) (α > 0) are characteristic on ℝ^m or on a compact subset of ℝ^m with respect to the Borel σ-field [2, 15, 28]. The following theorem improves Theorem 7 in [13], and is the theoretical basis of kernel dimension reduction. In the following, let P_B denote the probability on X induced from P_X by the projection B... |

25 | Dimension reduction and visualization in discriminant analysis (with discussion) - Cook, Yin |


24 | Exploring regression structure using nonparametric functional estimation - Samarov - 1993
Citation Context: ...he derivative of the regression function; these are based on the fact that the derivative of the conditional expectation g(x) = E[y | B^T x] with respect to x belongs to a dimension reduction subspace [27, 18]. The purpose of these methods is again to extract a central mean subspace; this differs from the central subspace which is the focus of KDR. The difference is clear, for example, if we consider the... |

22 | An adaptive estimation of dimension reduction space - Xia, Tong, et al.
Citation Context: ...bspace under weak conditions. There are alternatives to the inverse regression approach in the literature that have some similarities to KDR. In particular, minimum average variance estimation (MAVE, [33]) is based on nonparametric estimation of the conditional covariance of Y given X, an idea related to KDR. This method explicitly estimates the regressor, however, assuming an additive noise model Y =... |

20 | Dimension reduction for conditional mean in regression - Cook, Li - 2002
Citation Context: ...better performance than the other methods. In this case, pHd fails to find the true subspace; this is due to the fact that pHd is incapable of estimating a direction that only appears in the variance [8]. We note also that the results in [22] show that the contour regression methods SCR and GCR yield average norms larger than 1.3. Although the estimation of variance structure is generally more diffic... |

20 | Structure Adaptive Approach for Dimension Reduction - Hristache, Juditsky, et al. - 2001
Citation Context: ...he derivative of the regression function; these are based on the fact that the derivative of the conditional expectation g(x) = E[y | B^T x] with respect to x belongs to a dimension reduction subspace [27, 18]. The purpose of these methods is again to extract a central mean subspace; this differs from the central subspace which is the focus of KDR. The difference is clear, for example, if we consider the... |

16 | Statistical consistency of kernel canonical correlation analysis - Fukumizu, Bach, et al. - 2007 |

15 | Contour regression: a general approach to dimension reduction - Li, Zha, et al. - 2005
Citation Context: ...entral mean subspace, which is a subspace of the central subspace that is effective only for the regression E[Y | X]. The latter include principal Hessian directions (pHd, [24]) and contour regression [22]. A particular focus of these more recent developments has been the exploitation of second moments within an inverse regression framework. While the inverse regression perspective has been quite usefu... |

11 | Discussion of Li - Cook, Weisberg - 1991 |

11 | Consistency of kernel canonical correlation analysis - Fukumizu, Bach, et al. - 2006
Citation Context: ...roduct AB is trace-class with ‖AB‖_tr ≤ ‖A‖_HS ‖B‖_HS. It is known that cross-covariance operators and covariance operators are Hilbert–Schmidt and trace-class, respectively, under the assumption Eq. (2) [16, 14]. The Hilbert–Schmidt norm of Σ_YX is given by ‖Σ_YX‖²_HS = ‖E_YX[(k_X(·, X) − m_X)(k_Y(·, Y) − m_Y)]‖²_{H_X ⊗ H_Y}, (19) where H_X ⊗ H_Y is the direct product of H_X and H_Y, and the trace norm of Σ_XX... |

9 | Dimension reduction in regressions with a binary response - Cook, Lee - 1999
Citation Context: ...s only a one-dimensional subspace. Finally, in the binary classification setting, if the covariance matrices of the two classes are the same, SAVE and pHd also provide only a one-dimensional subspace [7]. The general problem in these cases is that the estimated subspace is smaller than the central subspace. One approach to tackling these limitations is to incorporate higher-order moments of Y | X [34]... |

7 | Sufficient dimension reduction and graphics in regression - Chiaromonte, Cook - 2002
Citation Context: ...ible to show that under weak conditions the intersection of dimension reduction subspaces is itself a dimension reduction subspace, in which case the intersection is referred to as a central subspace [6, 5]. As suggested in a seminal paper by Li [23], it is of great interest to develop procedures for estimating this subspace, quite apart from any interest in the conditional distribution P(Y | X) or the... |

6 | Direction estimation in single-index regressions - Yin, Cook
Citation Context: ...derivative. Also, there has also been some recent work on nonparametric methods for estimation of central subspaces. One such method estimates the central subspace based on an expected log likelihood [35]. This requires, however, an estimate of the joint probability density, and is limited to single-index regression. Finally, Zhu and Zeng [36] have proposed a method for estimating the central subspace... |

5 | Fourier methods for estimating the central subspace and the central mean subspace in regression - Zhu, Zeng - 2006
Citation Context: ...s the central subspace based on an expected log likelihood [35]. This requires, however, an estimate of the joint probability density, and is limited to single-index regression. Finally, Zhu and Zeng [36] have proposed a method for estimating the central subspace based on the Fourier transform. This method is similar to the KDR method in its use of Hilbert space methods and in its use of a contrast fu... |

4 | Moment-based dimension reduction for multivariate response regression - Yin, Bura - 2006
Citation Context: ... [7]. The general problem in these cases is that the estimated subspace is smaller than the central subspace. One approach to tackling these limitations is to incorporate higher-order moments of Y | X [34], but in practice the gains achievable by the use of higher-order moments are limited by robustness issues. In this paper we present a new methodology for SDR that is rather different from the approac... |