Results 1–10 of 31
Least Squares Linear Discriminant Analysis
Cited by 26 (6 self)
Abstract:
Linear Discriminant Analysis (LDA) is a well-known method for dimensionality reduction and classification. LDA in the binary-class case has been shown to be equivalent to linear regression with the class label as the output. This implies that LDA for binary-class classification can be formulated as a least squares problem. Previous studies have shown a certain relationship between multivariate linear regression and LDA for the multi-class case. Many of these studies show that multivariate linear regression with a specific class indicator matrix as the output can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case. In this paper, a novel formulation for multivariate linear regression is proposed. The equivalence relationship between the proposed least squares formulation and LDA for multi-class classification is rigorously established under a mild condition, which is shown empirically to hold in many applications involving high-dimensional data. Several LDA extensions based on the equivalence relationship are discussed.
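The binary-class equivalence this abstract builds on can be sketched in a few lines (an illustrative numpy check, not the paper's multi-class formulation; the toy data and label coding are assumptions):

```python
import numpy as np

# Binary-class LDA vs. least squares: regressing the class labels on the
# data (with an intercept) gives a weight vector proportional to Fisher's
# direction S_w^{-1}(m1 - m0). Toy data below are illustrative assumptions.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, scale=1.0, size=(40, 5))
X1 = rng.normal(loc=1.5, scale=1.0, size=(60, 5))
X = np.vstack([X0, X1])
y = np.concatenate([-np.ones(40), np.ones(60)])   # class labels as output

# Least squares with an intercept column.
A = np.hstack([X, np.ones((X.shape[0], 1))])
w_ls = np.linalg.lstsq(A, y, rcond=None)[0][:-1]

# Fisher's direction: within-class scatter inverse times mean difference.
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
w_lda = np.linalg.solve(Sw, m1 - m0)

# The two directions coincide up to scale, so |cos(angle)| = 1.
cos = w_ls @ w_lda / (np.linalg.norm(w_ls) * np.linalg.norm(w_lda))
```

The equality holds for any two distinct target values, which is why the label coding is a free choice in the binary case.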
Multiclass Discriminant Kernel Learning via Convex Programming
Cited by 22 (0 self)
Abstract:
Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as a convex program. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least squares problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper: the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems that are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case, and it also leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed.
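The MKL ingredient above, a candidate kernel built as a nonnegative combination of pre-specified kernel matrices, can be sketched as follows (the base kernels and weights are illustrative assumptions; the actual SDP/QCQP optimization of the weights is what the paper contributes):

```python
import numpy as np

# A nonnegative linear combination K = sum_k theta_k * K_k of valid kernel
# matrices is again a valid (positive semidefinite) kernel matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))

def linear_kernel(X):
    return X @ X.T

def rbf_kernel(X, gamma):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

kernels = [linear_kernel(X), rbf_kernel(X, 0.5), rbf_kernel(X, 2.0)]
theta = np.array([0.2, 0.5, 0.3])          # nonnegative weights summing to 1
K = sum(t * Kk for t, Kk in zip(theta, kernels))

# Sanity checks: the combined kernel is symmetric and PSD.
min_eig = np.linalg.eigvalsh(K).min()
```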
Reduced support vector machines: A statistical theory
 IEEE Trans. Neural Netw
, 2007
Cited by 16 (4 self)
Abstract:
In dealing with large datasets, the reduced support vector machine (RSVM) was proposed with the practical objective of overcoming computational difficulties as well as reducing model complexity. In this article, we study the RSVM from the viewpoint of robust design for model building and consider the nonlinear separating surface as a mixture of kernels. The RSVM uses a reduced model representation instead of a full one. Our main results center on two major themes. One is the robustness of the random subset mixture model, judged by a few criteria: (1) a model variation measure, (2) the model bias (deviation) between the reduced model and the full model, and (3) the test power in distinguishing the reduced model from the full one. The other is the spectral analysis of the reduced kernel. We compare the eigenstructures of the full kernel matrix and the approximation kernel matrix, where the approximation kernels are generated by uniform random subsets. The small discrepancies between them indicate that the approximation kernels can retain most of the relevant information for learning tasks in the full kernel. We focus on the statistical theory of the reduced set method mainly in the context of the RSVM, but the use of a uniform random subset is not limited to the RSVM: the approach can act as a supplemental algorithm on top of a basic optimization algorithm, wherein the actual optimization takes place on the subset-approximated data, and the statistical properties discussed in this paper remain valid. Key words and phrases: canonical angles, kernel methods, maximinity, minimaxity, model complexity, reduced set, Monte Carlo sampling, Nyström approximation.
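The subset-based kernel approximation studied above can be sketched via the Nyström construction the keywords point to (data, kernel choice, and subset size are illustrative assumptions):

```python
import numpy as np

# Nystrom-style reduced kernel: approximate the full n x n kernel matrix K
# from a uniform random subset of m columns, K ~ C W^+ C^T, where C is the
# n x m reduced kernel and W the m x m subset kernel.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))

def rbf(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

K = rbf(X, X)

def nystrom(X, idx):
    C = rbf(X, X[idx])           # n x m reduced kernel
    W = rbf(X[idx], X[idx])      # m x m subset kernel
    return C @ np.linalg.pinv(W) @ C.T

# With the full index set the construction reproduces K exactly; a random
# subset keeps most of the spectrum at a fraction of the cost.
K_full = nystrom(X, np.arange(50))
K_sub = nystrom(X, rng.choice(50, size=15, replace=False))
err = np.linalg.norm(K - K_sub) / np.linalg.norm(K)
```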
On Relevant Dimensions in Kernel Feature Spaces
Journal of Machine Learning Research
, 2008
Cited by 8 (4 self)
Abstract:
We show that the relevant information of a supervised learning problem is contained, up to negligible error, in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem, in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner that makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm that enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results.
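The central quantity above, how much of the label information lies in the span of the leading kernel PCA components, can be sketched numerically (data, kernel, and the energy criterion below are illustrative assumptions, not the paper's estimator):

```python
import numpy as np

# Fraction of the label vector's energy captured by the top-k kernel PCA
# directions: project y onto the eigenvectors of the centered kernel matrix.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=60))

def rbf(X, gamma=1.0):
    sq = np.sum(X ** 2, 1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

K = rbf(X)
n = K.shape[0]
H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
Kc = H @ K @ H
evals, evecs = np.linalg.eigh(Kc)         # ascending eigenvalues
evecs = evecs[:, ::-1]                    # leading components first

coef = evecs.T @ y
captured = np.cumsum(coef ** 2) / (y @ y)  # nondecreasing, reaches 1 at k = n
```

If a small k already gives `captured[k]` close to 1, the leading components carry essentially all the label-relevant information, which is the regime the paper characterizes.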
Optimising Kernel Parameters and Regularisation Coefficients for Non-Linear Discriminant Analysis
Journal of Machine Learning Research
, 2006
Cited by 7 (0 self)
Abstract:
In this paper we consider a novel Bayesian interpretation of Fisher's discriminant analysis. We relate Rayleigh's coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the 'regression to the labels' assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher's discriminant, and with the incorporation of a prior we can apply Bayes' rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher's discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
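Rayleigh's coefficient, the quantity this abstract reinterprets, is the between-to-within scatter ratio that Fisher's direction maximises; a minimal numeric sketch (toy data are illustrative assumptions):

```python
import numpy as np

# Rayleigh's coefficient J(w) = (w' Sb w) / (w' Sw w). Fisher's direction
# w = Sw^{-1}(m1 - m0) attains its maximum; any other direction scores <= it.
rng = np.random.default_rng(4)
X0 = rng.normal(0.0, 1.0, size=(50, 4))
X1 = rng.normal(1.0, 1.0, size=(50, 4))
m0, m1 = X0.mean(0), X1.mean(0)
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)  # within-class scatter
Sb = np.outer(m1 - m0, m1 - m0)                          # between-class scatter

def rayleigh(w):
    return (w @ Sb @ w) / (w @ Sw @ w)

w_fisher = np.linalg.solve(Sw, m1 - m0)
w_random = rng.normal(size=4)
```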
Solving the preimage problem in kernel machines: a direct method
 Proc. of the 19th IEEE Workshop on Machine Learning for Signal Processing
Cited by 6 (5 self)
Abstract:
In this paper, we consider the preimage problem in kernel machines, such as denoising with kernel PCA. For a given reproducing kernel Hilbert space (RKHS), solving the preimage problem means seeking a pattern whose image in the RKHS is approximately a given feature. Traditional techniques include an iterative technique (Mika et al.) and a multidimensional scaling (MDS) approach (Kwok et al.). In this paper, we propose a new technique to learn the preimage. In the RKHS, we construct a basis having an isometry with the input space with respect to the training data. Representing any feature in this basis then gives us information regarding its preimage in the input space. We show that computing a preimage can be done directly from the kernel values, without having to compute distances in either space as with the MDS approach. Simulation results illustrate the relevance of the proposed method in comparison with these techniques. Index Terms — kernel machines, preimage problem, kernel matrix regression, denoising.
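The iterative baseline cited above (Mika et al.) has a compact form for a Gaussian kernel; a minimal sketch of one fixed-point update (data and coefficients are illustrative assumptions, and this is the baseline technique, not the paper's direct method):

```python
import numpy as np

# Pre-image of a feature phi = sum_i gamma_i phi(x_i) for an RBF kernel,
# via the fixed-point update
#   z <- sum_i gamma_i k(z, x_i) x_i / sum_i gamma_i k(z, x_i).
rng = np.random.default_rng(5)
X = rng.normal(size=(20, 3))
gamma_rbf = 0.5

def k(z, x):
    return np.exp(-gamma_rbf * np.sum((z - x) ** 2))

def preimage_step(z, X, gamma):
    w = gamma * np.array([k(z, x) for x in X])
    return (w @ X) / w.sum()

# Sanity check: if the target feature is phi(x_7) itself (gamma = e_7),
# a single update maps any starting point straight to x_7.
gamma = np.zeros(20)
gamma[7] = 1.0
z = preimage_step(rng.normal(size=3), X, gamma)
```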
A LeastSquares Framework for Component Analysis
, 2009
Cited by 6 (0 self)
Abstract:
... (SC) have been extensively used as a feature extraction step for modeling, clustering, classification, and visualization. CA techniques are appealing because many can be formulated as eigenproblems, offering great potential for learning linear and nonlinear representations of data in closed form. However, the eigen-formulation often conceals important analytic and computational drawbacks of CA techniques, such as solving generalized eigenproblems with rank-deficient matrices (e.g., the small sample size problem), a lack of intuitive interpretation of normalization factors, and difficulty understanding commonalities and differences between CA methods. This paper proposes a unified least-squares framework to formulate many CA methods. We show how PCA, LDA, CCA, LE, SC, and their kernel and regularized extensions correspond to particular instances of least-squares weighted kernel reduced rank regression (LS-WKRRR). The LS-WKRRR formulation of CA methods has several benefits: (1) it provides a clean connection between many CA techniques and an intuitive framework for understanding normalization factors; (2) it yields efficient numerical schemes for solving CA techniques; (3) it overcomes the small sample size problem; and (4) it provides a framework for easily extending CA methods. We derive new weighted generalizations of PCA, LDA, CCA, and SC, as well as several novel CA techniques.
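The simplest instance of the least-squares view above is PCA: the best rank-k least-squares reconstruction of centered data (Eckart–Young) coincides with projection onto the top-k principal components. A minimal check (data are illustrative assumptions):

```python
import numpy as np

# PCA as least-squares reduced rank regression: the rank-k truncated SVD
# reconstruction equals projection onto the leading k principal directions.
rng = np.random.default_rng(6)
X = rng.normal(size=(40, 6))
X = X - X.mean(axis=0)        # center the data
k = 2

# Rank-k truncated SVD reconstruction.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_svd = (U[:, :k] * s[:k]) @ Vt[:k]

# Projection onto the top-k eigenvectors of the covariance (X'X).
evals, evecs = np.linalg.eigh(X.T @ X)
V = evecs[:, ::-1][:, :k]     # leading eigenvectors first
X_pca = X @ V @ V.T
```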
Non-sparse multiple kernel learning for Fisher discriminant analysis
 In International Conference on Data Mining
, 2009
Cited by 5 (4 self)
Abstract:
We consider the problem of learning a linear combination of pre-specified kernel matrices in the Fisher discriminant analysis setting. Existing methods for this task impose an ℓ1-norm regularisation on the kernel weights, which produces sparse solutions but may lead to a loss of information. In this paper, we propose to use ℓ2-norm regularisation instead. The resulting learning problem is formulated as a semi-infinite program and can be solved efficiently. Through experiments on both synthetic data and a very challenging object recognition benchmark, the relative advantages of the proposed method and its ℓ1 counterpart are demonstrated, and insights are gained as to how the choice of regularisation norm should be made.
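The sparse-versus-dense contrast above can be sketched with a toy linear objective over nonnegative kernel weights (the scores are illustrative assumptions, not the FDA objective itself):

```python
import numpy as np

# Maximising s' theta over nonnegative weights: under an l1 constraint the
# optimum sits on a vertex (one nonzero weight -> sparse, information from
# the other kernels discarded); under an l2 constraint the optimum is
# proportional to s (all kernels retained -> dense).
s = np.array([0.9, 0.7, 0.4, 0.1])        # per-kernel "importance" scores

# l1 ball, theta >= 0: all mass on the single best kernel.
theta_l1 = np.zeros_like(s)
theta_l1[np.argmax(s)] = 1.0

# l2 ball, theta >= 0: weights proportional to the scores.
theta_l2 = s / np.linalg.norm(s)

n_l1 = int(np.count_nonzero(theta_l1))
n_l2 = int(np.count_nonzero(theta_l2))
```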
A Unification of Component Analysis Methods
, 2009
Cited by 4 (1 self)
Abstract:
... extraction step for modeling, classification, visualization, and clustering. CA techniques are appealing because many can be formulated as eigenproblems, offering great potential for learning linear and nonlinear representations of data without local minima. However, the eigen-formulation often conceals important analytic and computational drawbacks of CA techniques, such as solving generalized eigenproblems with rank-deficient matrices, a lack of intuitive interpretation of normalization factors, and difficulty understanding relationships between CA methods. This chapter proposes a unified framework to formulate many CA methods as a least-squares estimation problem. We show how PCA, LDA, CCA, k-means, spectral graph methods, and kernel extensions correspond to particular instances of least-squares weighted kernel reduced rank regression (LS-KRRR). The least-squares formulation allows a better understanding of normalization factors, provides a clean framework for understanding the commonalities and differences between many CA methods, yields efficient optimization algorithms for many CA algorithms, suggests easy derivations of online learning methods, and provides an easier generalization of CA techniques. In particular, we derive the matrix expressions for weighted generalizations of PCA, LDA, SC, and CCA (including kernel extensions).
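Among the methods listed above, k-means has the most direct least-squares reading: with a binary cluster-indicator matrix G and the matrix M of cluster means, the k-means objective is exactly ||X − GM||²_F. A minimal check (the fixed partition below is an illustrative assumption, not a fitted clustering):

```python
import numpy as np

# k-means as matrix factorization: the within-cluster sum of squared
# errors equals the Frobenius reconstruction error ||X - G M||_F^2.
rng = np.random.default_rng(7)
X = rng.normal(size=(12, 3))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

G = np.zeros((12, 3))
G[np.arange(12), labels] = 1.0             # binary indicator matrix
M = np.array([X[labels == c].mean(axis=0) for c in range(3)])

# Two ways to write the same objective.
obj_matrix = np.linalg.norm(X - G @ M) ** 2
obj_sse = sum(((X[labels == c] - M[c]) ** 2).sum() for c in range(3))
```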
Developmental stage annotation of drosophila gene expression pattern images via an entire solution path for LDA
 ACM Transactions on Knowledge Discovery from Data
, 2008
Cited by 4 (3 self)
Abstract:
Gene expression in a developing embryo occurs in particular cells (spatial patterns) in a time-specific manner (temporal patterns), which leads to the differentiation of cell fates. Images of a Drosophila melanogaster embryo at a given developmental stage, showing a particular gene expression pattern revealed by a gene-specific probe, can be compared for spatial overlap. The comparison is fundamentally important to formulating and testing gene interaction hypotheses. Expression pattern comparison is most biologically meaningful when images from a similar time point (developmental stage) are compared. In this paper, we present LdaPath, a novel formulation of Linear Discriminant Analysis (LDA) for automatic developmental stage range classification. It employs multivariate linear regression with an L1-norm penalty controlled by a regularization parameter for feature extraction and visualization. LdaPath computes an entire solution path over all values of the regularization parameter with essentially the same computational cost as fitting one LDA model, and thus facilitates efficient model selection. It is based on the equivalence relationship between LDA and the least squares method for multi-class classification. This equivalence relationship is established under a mild condition, which we show empirically to hold for many high-dimensional datasets, such as expression pattern images. Our experiments on a collection of 2705
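The L1-penalised regression behind such a solution path has a closed form in the simplest setting, which shows why tracing the whole path can be cheap (orthonormal design, toy targets, and the soft-thresholding shortcut below are illustrative assumptions, not LdaPath's algorithm):

```python
import numpy as np

# For an orthonormal design Q, the lasso solution of
#   min_b 0.5 * ||y - Q b||^2 + lam * ||b||_1
# is the soft-thresholded least-squares coefficient, so the entire path
# over the regularisation parameter lam is available in closed form.
rng = np.random.default_rng(8)
Q, _ = np.linalg.qr(rng.normal(size=(30, 5)))    # orthonormal columns
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = Q @ beta_true + 0.01 * rng.normal(size=30)

beta_ols = Q.T @ y                                # least-squares solution

def lasso_orthonormal(lam):
    # soft-threshold each coefficient at lam
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

# A few points along the path: coefficients drop out as lam grows.
path = [lasso_orthonormal(lam) for lam in (0.0, 0.3, 0.8, 1.5, 3.0)]
nonzeros = [int(np.count_nonzero(b)) for b in path]
```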