Results 1–10 of 57
Consistency of the group lasso and multiple kernel learning
Journal of Machine Learning Research, 2007
"... We consider the leastsquare regression problem with regularization by a block 1norm, i.e., a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the 1norm where all spaces have dimension one, where it ..."
Abstract
Cited by 281 (34 self)
We consider the least-squares regression problem with regularization by a block ℓ1-norm, i.e., a sum of Euclidean norms over spaces of dimension larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the ℓ1-norm, where all spaces have dimension one and the problem is commonly referred to as the Lasso. In this paper, we study the asymptotic model consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of the group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for nonlinear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite-dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non-adaptive scheme is not satisfied.
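To make the block-norm penalty concrete, here is a minimal sketch (not code from the paper) of proximal gradient descent for the group-Lasso objective (1/2n)||y − Xw||² + λ Σ_g ||w_g||₂; the group partition, the step-size rule, and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

def block_soft_threshold(w, groups, thresh):
    """Proximal operator of the group-Lasso penalty: shrink each block's
    Euclidean norm by `thresh`, zeroing the whole block if its norm is smaller."""
    w = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        w[g] = 0.0 if norm <= thresh else (1.0 - thresh / norm) * w[g]
    return w

def group_lasso(X, y, groups, lam, n_iter=500):
    """Minimise (1/2n)||y - Xw||^2 + lam * sum_g ||w_g||_2 by proximal gradient.
    `groups` is a list of column-index arrays partitioning the columns of X."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2      # 1/L for the smooth least-squares term
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n          # gradient of the least-squares term
        w = block_soft_threshold(w - step * grad, groups, step * lam)
    return w
```

With groups of size one the block operator reduces to ordinary soft-thresholding, i.e., the Lasso mentioned above.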
Kernel measures of conditional dependence
In Adv. NIPS, 2008
"... We propose a new measure of conditional dependence of random variables, based on normalized crosscovariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a ..."
Abstract
Cited by 87 (46 self)
We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure and demonstrate its application in experiments.
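As a rough illustration of the role of normalized cross-covariance operators, the sketch below computes the unconditional empirical statistic Tr[R_X R_Y] with R = G_c(G_c + nεI)⁻¹ built from centred Gaussian Gram matrices; the kernel width, the regularizer ε, and the restriction to the unconditional case are simplifications, not choices made in the paper.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gram matrix of a Gaussian kernel; x holds one sample per row."""
    x = x.reshape(len(x), -1)
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * x @ x.T
    return np.exp(-d2 / (2 * sigma ** 2))

def normalized_dependence(x, y, sigma=1.0, eps=1e-3):
    """Empirical normalized dependence statistic Tr[R_x R_y]; for large samples
    and small eps it approaches zero exactly when x and y are independent."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n                     # centring matrix
    Gx = H @ gaussian_gram(x, sigma) @ H
    Gy = H @ gaussian_gram(y, sigma) @ H
    Rx = Gx @ np.linalg.inv(Gx + n * eps * np.eye(n))
    Ry = Gy @ np.linalg.inv(Gy + n * eps * np.eye(n))
    return np.trace(Rx @ Ry)
```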
Nonlinear causal discovery with additive noise models
"... The discovery of causal relationships between a set of observed variables is a fundamental problem in science. For continuousvalued data linear acyclic causal models with additive noise are often used because these models are well understood and there are wellknown methods to fit them to data. In ..."
Abstract
Cited by 78 (30 self)
The discovery of causal relationships between a set of observed variables is a fundamental problem in science. For continuous-valued data, linear acyclic causal models with additive noise are often used because these models are well understood and there are well-known methods to fit them to data. In reality, of course, many causal relationships are more or less nonlinear, raising some doubts as to the applicability and usefulness of purely linear methods. In this contribution we show that the basic linear framework can in fact be generalized to nonlinear models. In this extended framework, nonlinearities in the data-generating process are a blessing rather than a curse, as they typically provide information on the underlying causal system and allow more aspects of the true data-generating mechanisms to be identified. In addition to theoretical results, we present simulations and some simple real-data experiments illustrating the identification power provided by nonlinearities.
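The fit-then-test-residuals recipe behind additive-noise causal discovery can be sketched compactly; the regression model (kernel ridge), the HSIC residual score, and the rule of preferring the direction with the lower score are illustrative stand-ins for the specific procedure and tests used in the paper.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC with Gaussian kernels (small values suggest independence)."""
    def gram(v):
        v = v.reshape(len(v), -1)
        sq = np.sum(v ** 2, axis=1)
        return np.exp(-(sq[:, None] + sq[None, :] - 2 * v @ v.T) / (2 * sigma ** 2))
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(gram(x) @ H @ gram(y) @ H) / n ** 2

def anm_direction(x, y):
    """Fit effect = f(cause) + noise in both directions and keep the direction
    whose residuals look more independent of the putative cause."""
    def score(cause, effect):
        model = KernelRidge(kernel="rbf", alpha=0.1).fit(cause.reshape(-1, 1), effect)
        residual = effect - model.predict(cause.reshape(-1, 1))
        return hsic(cause, residual)
    return "x -> y" if score(x, y) < score(y, x) else "y -> x"
```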
Hilbert Space Embeddings and Metrics on Probability Measures
"... A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseu ..."
Abstract
Cited by 65 (34 self)
A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk, indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk. First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring that the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains) and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on R^d, then it is characteristic if and only if the support of its Fourier transform is the entire R^d.
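For concreteness, a minimal numerical sketch of γk (not from the paper): the biased empirical estimate of the RKHS distance between the mean embeddings of two samples, under a Gaussian kernel, which is integrally strictly positive definite and therefore characteristic by the result quoted above. The kernel width is an arbitrary illustrative choice.

```python
import numpy as np

def gaussian_cross_gram(a, b, sigma=1.0):
    """Cross Gram matrix k(a_i, b_j) for a Gaussian kernel."""
    a, b = a.reshape(len(a), -1), b.reshape(len(b), -1)
    d2 = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd(x, y, sigma=1.0):
    """Biased empirical estimate of gamma_k(P, Q): the RKHS distance between
    the mean embeddings of the samples x ~ P and y ~ Q."""
    kxx = gaussian_cross_gram(x, x, sigma).mean()
    kyy = gaussian_cross_gram(y, y, sigma).mean()
    kxy = gaussian_cross_gram(x, y, sigma).mean()
    return np.sqrt(max(kxx + kyy - 2 * kxy, 0.0))
```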
Injective Hilbert space embeddings of probability measures
In COLT, 2008
"... A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). The emb ..."
Abstract
Cited by 56 (32 self)
A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). The embedding function has been proven to be injective when the reproducing kernel is universal. In this case, the embedding induces a metric on the space of probability distributions defined on compact metric spaces. In the present work, we consider more broadly the problem of specifying characteristic kernels, defined as kernels for which the RKHS embedding of probability measures is injective. In particular, characteristic kernels can include non-universal kernels. We restrict ourselves to translation-invariant kernels on Euclidean space and define the associated metric on probability measures in terms of the Fourier spectrum of the kernel and the characteristic functions of these measures. The support of the kernel spectrum is important in determining whether a kernel is characteristic: in particular, the embedding is injective if and only if the kernel spectrum has the entire domain as its support. Characteristic kernels may nonetheless have difficulty in distinguishing certain distributions on the basis of finite samples, again due to the interaction of the kernel spectrum and the characteristic functions of the measures.
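The Fourier characterization above can be written out explicitly (up to the paper's exact normalization conventions): if the translation-invariant kernel has the Bochner representation k(x, y) = ∫ e^{-i⟨ω, x−y⟩} dΛ(ω) for a finite nonnegative measure Λ, then

```latex
\gamma_k^2(\mathbb{P},\mathbb{Q})
  \;=\; \int_{\mathbb{R}^d} \bigl|\varphi_{\mathbb{P}}(\omega)-\varphi_{\mathbb{Q}}(\omega)\bigr|^2 \, d\Lambda(\omega),
```

where φ_P and φ_Q are the characteristic functions of the two measures. The statement that the embedding is injective exactly when supp(Λ) is all of R^d then says that the integral can vanish only when φ_P and φ_Q agree everywhere, i.e. when the two distributions coincide.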
Príncipe, "Correntropy: Properties and applications in non-Gaussian signal processing"
IEEE Transactions on Signal Processing, 2007
"... Abstract—The optimality of secondorder statistics depends heavily on the assumption of Gaussianity. In this paper, we elucidate further the probabilistic and geometric meaning of the recently defined correntropy function as a localized similarity measure. A close relationship between correntropy a ..."
Abstract
Cited by 40 (2 self)
The optimality of second-order statistics depends heavily on the assumption of Gaussianity. In this paper, we further elucidate the probabilistic and geometric meaning of the recently defined correntropy function as a localized similarity measure. A close relationship between correntropy and M-estimation is established. Connections and differences between correntropy and kernel methods are presented. As such, correntropy has properties vastly different from those of second-order statistics, which can be very useful in non-Gaussian signal processing, especially in impulsive noise environments. Examples are presented to illustrate the technique. Index Terms: generalized correlation function, information theoretic learning, kernel methods, metric, temporal principal component analysis (TPCA).
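A hedged numerical sketch of the empirical correntropy estimator between two discrete-time signals, using a Gaussian kernel with the normalization constant omitted; the kernel bandwidth is an illustrative parameter, not a value from the paper.

```python
import numpy as np

def correntropy(x, y, sigma=1.0):
    """Empirical correntropy: the sample mean of a Gaussian kernel evaluated at
    the pointwise differences x_i - y_i, i.e. a localized similarity measure
    that only rewards samples where the two signals are close."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.mean(np.exp(-diff ** 2 / (2 * sigma ** 2)))
```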
Causal inference using the algorithmic Markov condition
2008
"... Inferring the causal structure that links n observables is usually basedupon detecting statistical dependences and choosing simple graphs that make the joint measure Markovian. Here we argue why causal inference is also possible when only single observations are present. We develop a theory how to g ..."
Abstract
Cited by 26 (20 self)
Inferring the causal structure that links n observables is usually based upon detecting statistical dependences and choosing simple graphs that make the joint measure Markovian. Here we argue why causal inference is also possible when only single observations are present. We develop a theory of how to generate causal graphs explaining similarities between single objects. To this end, we replace the notion of conditional stochastic independence in the causal Markov condition with the vanishing of conditional algorithmic mutual information and describe the corresponding causal inference rules. We explain why a consistent reformulation of causal inference in terms of algorithmic complexity implies a new inference principle that also takes into account the complexity of conditional probability densities, making it possible to select among Markov-equivalent causal graphs. This insight provides a theoretical foundation for a heuristic principle proposed in earlier work. We also discuss how to replace Kolmogorov complexity with decidable complexity criteria. This can be seen as an algorithmic analogue of replacing the empirically undecidable question of statistical independence with practical independence tests that are based on implicit or explicit assumptions on the underlying distribution.
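The closing remark about decidable complexity criteria is often illustrated by substituting an off-the-shelf compressor for Kolmogorov complexity; the sketch below does exactly that with zlib. It is a generic illustration of the substitution, not the inference rules developed in the paper, and a general-purpose compressor is only a crude stand-in.

```python
import zlib

def c(data: bytes) -> int:
    """Compressed length: a crude, computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def compression_mutual_information(x: bytes, y: bytes) -> int:
    """Compression-based analogue of algorithmic mutual information,
    I(x : y) ~ C(x) + C(y) - C(x, y): bytes saved by compressing jointly."""
    return c(x) + c(y) - c(x + y)
```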
Undercomplete blind subspace deconvolution
JMLR, 2007
"... We introduce the blind subspace deconvolution (BSSD) problem, which is the extension of both the blind source deconvolution (BSD) and the independent subspace analysis (ISA) tasks. We examine the case of the undercomplete BSSD (uBSSD). Applying temporal concatenation we reduce this problem to ISA. T ..."
Abstract
Cited by 25 (17 self)
We introduce the blind subspace deconvolution (BSSD) problem, which is the extension of both the blind source deconvolution (BSD) and the independent subspace analysis (ISA) tasks. We examine the case of the undercomplete BSSD (uBSSD). Applying temporal concatenation, we reduce this problem to ISA. The associated 'high-dimensional' ISA problem can be handled by a recent technique called joint f-decorrelation (JFD). Similar decorrelation methods have been used previously for kernel independent component analysis (kernel-ICA). More precisely, the kernel canonical correlation (KCCA) technique is a member of this family, and, as is shown in this paper, the kernel generalized variance (KGV) method can also be seen as a decorrelation method in the feature space. These kernel-based algorithms are adapted to the ISA task. In the numerical examples, we (i) examine how efficiently the emerging higher-dimensional ISA tasks can be tackled, and (ii) explore the workings and advantages of the derived kernel-ISA methods.
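The temporal-concatenation step mentioned above admits a very small sketch: stack consecutive time slices of the observed mixture so that a convolutive (uBSSD) observation becomes an instantaneous one that an ISA algorithm can consume. The embedding depth is a user parameter here, and the ISA solver itself (JFD, KCCA, or KGV based) is not shown.

```python
import numpy as np

def temporal_concatenation(x, depth):
    """Stack `depth` consecutive time slices of the observation x (shape T x D)
    into one higher-dimensional instantaneous observation of shape
    (T - depth + 1) x (D * depth), ready for an ISA algorithm."""
    T, _ = x.shape
    return np.hstack([x[t0 : T - depth + 1 + t0] for t0 in range(depth)])
```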
Consistent Nonparametric Tests of Independence
2009
"... Three simple and explicit procedures for testing the independence of two multidimensional random variables are described. Two of the associated test statistics (L1, loglikelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statist ..."
Abstract
Cited by 16 (5 self)
Three simple and explicit procedures for testing the independence of two multidimensional random variables are described. Two of the associated test statistics (L1, log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distribution-free strongly consistent tests are derived on the basis of large-deviation bounds on the test statistics: these tests make, almost surely, no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed nonzero value α, and the Type II error drops to zero, with increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data.
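A minimal sketch of the first, partition-based statistic for scalar variables (the paper treats the multidimensional case and lets the partition refine with the sample size; the fixed equal-width grid here is only for illustration): the L1 distance between the empirical joint distribution and the product of the empirical marginals over the cells.

```python
import numpy as np

def l1_independence_statistic(x, y, bins=4):
    """L1 distance between the empirical joint distribution of (x, y) and the
    product of its empirical marginals, computed over a finite grid partition.
    Large values are evidence against independence."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint /= joint.sum()
    px = joint.sum(axis=1)          # empirical marginal of x over its cells
    py = joint.sum(axis=0)          # empirical marginal of y over its cells
    return np.abs(joint - np.outer(px, py)).sum()
```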
Consistent Feature Selection for Pattern Recognition in Polynomial Time
"... We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMALOPTIMAL) vs. finding all features relevant to the target variable (ALLRELEVANT). The latter problem is motivated by recent applications within bioinformatics, particularly gene exp ..."
Abstract
Cited by 14 (2 self)
We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL-OPTIMAL) vs. finding all features relevant to the target variable (ALL-RELEVANT). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL-RELEVANT is much harder than MINIMAL-OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks.