## SVM Decision Boundary Based Discriminative Subspace Induction (2002)

Citations: 5 (1 self)

### BibTeX

@TECHREPORT{Zhang02svmdecision,
  author      = {Jiayong Zhang and Yanxi Liu},
  title       = {SVM Decision Boundary Based Discriminative Subspace Induction},
  institution = {},
  year        = {2002}
}

### Abstract

Dimensionality reduction is widely accepted as an analysis and modeling tool for dealing with high-dimensional spaces, although researchers from different disciplines hold different views of what properties should be preserved in the reduction process. We study the problem of linear dimension reduction for classification, with a focus on sufficient dimension reduction, i.e., inducing subspaces without loss of discriminative information. Decision boundary analysis (DBA), originally proposed by Lee & Landgrebe (1993), can directly find the smallest subspace with this property. However, existing DBA implementations are computationally expensive and sensitive to sample size. In this paper, we first formulate the problem of sufficient dimension reduction for classification in terms parallel to those used for regression. The connections thus disclosed lead to several meaningful observations. We then present a novel space reduction algorithm that combines SVM and DBA, thus inheriting several appealing properties of kernel machines, such as good generalization, weak assumptions, and efficient computation. In addition, the proposed method provides a natural way to reduce the complexity, and even improve the accuracy, of SVM itself. We demonstrate its superiority through comparative experiments on one simulated and four real-world benchmark datasets.

### Citations

9946 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ...es three important ideas into a very successful whole: quadratic programming for convex optimization, kernel representation, and maximum margin. For complete coverage, readers are referred to, e.g., [76, 19, 35]. The decision function of a two-class problem derived by SVM can be written as h(x) = w·φ(x) + b = Σ_{i=1}^n α_i y_i K(x, x_i) + b (14), where x_i ∈ R^d is the training sample and y_i ∈ {±1} is the c... |
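The decision function in eq. (14) above is straightforward to sketch numerically. The RBF kernel, the two hand-picked "support vectors", and the coefficients below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def svm_decision(x, support_vectors, alphas, labels, b=0.0, kernel=rbf_kernel):
    # h(x) = sum_i alpha_i * y_i * K(x, x_i) + b  -- eq. (14) in the excerpt
    return sum(a * y * kernel(x, sv)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

# Toy setup: one support vector per class, symmetric about the origin
svs    = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
alphas = [1.0, 1.0]
labels = [+1, -1]

h_pos = svm_decision(np.array([0.9, 0.0]), svs, alphas, labels)   # near the +1 class
h_neg = svm_decision(np.array([-0.9, 0.0]), svs, alphas, labels)  # near the -1 class
```

By the symmetry of the toy setup, h_pos and h_neg have opposite signs and equal magnitude.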

5369 | Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context ...odeling tool to deal with high-dimensional spaces. There are several reasons to keep the dimensionality as low as possible, such as to reduce system complexity, to alleviate "curse of dimensionality" [3, 6], and to enhance understanding of the data. In general, dimensionality reduction can be defined as the search for a low-dimensional linear or nonlinear subspace that preserves some intrinsic propertie... |

4457 | Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984
Citation Context ...o aspects: 1. All the datasets we selected have been extensively benchmarked (see Appendix B for further details). Benchmark error rates have been listed in Table 4, that is, the Bayes error for WAVE [11], the lowest error rates of more than 20 classifiers for PIMA, VEHICLE and LETTER reported in the StatLog Project [59], and the median result of 25 classifier combination schemes for MFEAT [44]. To fa... |

3085 | UCI repository of machine learning databases
- Blake, Merz
- 1998
Citation Context ...y facilitate such explorations. 4 Experiments We evaluate the proposed linear dimension reduction algorithm on one simulated and four real-world datasets drawn from the UCI Machine Learning Repository [7]. Their basic information is summarized in Table 4. On each dataset, we compare the goodness of the subspaces induced by SVM-DBA to those induced by PCA and LDA. Our comparative study differs from prio... |

2928 | Introduction to Statistical Pattern Recognition, Electrical Science Series
- Fukunaga
- 1972
Citation Context ...tion optimality is concerned. In this paper we study the problem of dimensionality reduction for classification, which is commonly referred to in the pattern recognition literature as feature extraction [28, 44]. In particular, we restrict ourselves to linear dimension reduction, i.e., inducing subspaces through linear mappings. Linear mapping is mathematically tractable and computationally simple, with certa... |

2334 | Algorithms for Clustering Data
- Jain, Dubes
- 1988
Citation Context ...tion, where the challenge is to embed a set of high-dimensional observations into a low-dimensional Euclidean space that preserves as closely as possible their intrinsic global/local metric structure [43, 71, 65]. 2. Regression, in which the goal is to reduce the dimension of the predictor vector with the minimum loss in its capacity to infer about the conditional distribution of the univariate response varia... |

1792 | A global geometric framework for nonlinear dimensionality reduction
- Tenenbaum, de Silva, et al.
- 2000
Citation Context ...tion, where the challenge is to embed a set of high-dimensional observations into a low-dimensional Euclidean space that preserves as closely as possible their intrinsic global/local metric structure [43, 71, 65]. 2. Regression, in which the goal is to reduce the dimension of the predictor vector with the minimum loss in its capacity to infer about the conditional distribution of the univariate response varia... |

1725 | Nonlinear dimensionality reduction by locally linear embedding
- Roweis, Saul
- 2000
Citation Context ...tion, where the challenge is to embed a set of high-dimensional observations into a low-dimensional Euclidean space that preserves as closely as possible their intrinsic global/local metric structure [43, 71, 65]. 2. Regression, in which the goal is to reduce the dimension of the predictor vector with the minimum loss in its capacity to infer about the conditional distribution of the univariate response varia... |

1697 | Independent Component Analysis
- Hyvärinen, Karhunen, et al.
- 2001
Citation Context ...size supervised algorithms that utilize the category information associated with each sample, and have thus ignored a number of well-known unsupervised methods such as PCA, Projection Pursuit [41] and ICA [42]. Although supervised methods generally perform better than unsupervised methods in the context of discrimination tasks, this is not always the case. For example, it was demonstrated [58] that, w... |

1145 | Nonlinear component analysis as a kernel eigenvalue problem
- Schölkopf, Smola, et al.
- 1998
Citation Context ...t sometimes makes it outperform nonlinear models. In addition, it may be nonlinearly extended, for example, through global coordination of local linear models (e.g. [15, 29]) or kernel mapping (e.g. [69, 60]). For recognition systems, inducing subspaces without any loss of discrimination power is especially attractive. We call this process sufficient dimension reduction, borrowing terminology from classica... |
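Reference [69] here is kernel PCA, whose nonlinear extension of a linear method reduces to an eigenproblem on a centred kernel matrix. The sketch below is a minimal illustration; the RBF kernel, its bandwidth, and the toy data are my own choices, not values from the paper:

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=1.0):
    # RBF Gram matrix: K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    # Centre the kernel matrix in feature space: Kc = (I - J) K (I - J), J = 11'/n
    n = len(X)
    J = np.ones((n, n)) / n
    Kc = K - J @ K - K @ J + J @ K @ J
    # Top eigenvectors of Kc give the nonlinear principal components
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
Z = kernel_pca(X, n_components=2)  # 40 samples embedded in 2 dimensions
```

The returned components are zero-mean by construction (the all-ones vector is in the null space of the centred kernel), with variance decreasing across components.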

1053 | An Introduction to Support Vector Machines
- Cristianini, Shawe-Taylor
- 2000
Citation Context ...es three important ideas into a very successful whole: quadratic programming for convex optimization, kernel representation, and maximum margin. For complete coverage, readers are referred to, e.g., [76, 19, 35]. The decision function of a two-class problem derived by SVM can be written as h(x) = w·φ(x) + b = Σ_{i=1}^n α_i y_i K(x, x_i) + b (14), where x_i ∈ R^d is the training sample and y_i ∈ {±1} is the c... |

752 | Statistical pattern recognition: a review
- Jain, Duin, et al.
Citation Context ...tion optimality is concerned. In this paper we study the problem of dimensionality reduction for classification, which is commonly referred to in the pattern recognition literature as feature extraction [28, 44]. In particular, we restrict ourselves to linear dimension reduction, i.e., inducing subspaces through linear mappings. Linear mapping is mathematically tractable and computationally simple, with certa... |

746 | Gene selection for cancer classification using support vector machines
- Guyon, Weston, et al.
Citation Context ...very and elaboration of kernel methods for classification and regression seem to suggest that learning in very high dimensions is not necessarily a terrible mistake. Several feature selection methods [10, 77, 31], as special cases of linear dimension reduction, have been shown to be successful, depending directly on the intrinsic generalization ability of kernel machines in high-dimensional spaces. In the s... |

701 | Pattern Recognition: A Statistical Approach
- Devijver, Kittler
- 1982
Citation Context ... Bhattacharyya bound, an upper bound on the Bayes error. The practical implementations are accompanied by a parametric estimation (unimodal Gaussian) of the densities at hand. Patrick-Fisher distance [22] can be defined nonparametrically using Parzen estimators with Gaussian kernels of the class-conditional densities of the projections. It induces an upper bound on Bayes error that is larger than thos... |

639 | Irrelevant Features and the Subset Selection Problem
- John, Kohavi, et al.
- 1994
Citation Context .... the SVM classifier) itself, and it may work "badly" for other evaluators. To the best of our knowledge, almost all existing linear dimension reduction algorithms can be labeled as so-called filters [46, 8], and we believe further study of the coupling effect between reduction methods and types of classifiers deserves attention. On the other hand, we believe that there are some common regularities in mos... |

466 | Selection of relevant features and examples in machine learning
- Blum, Langley
- 1997
Citation Context .... the SVM classifier) itself, and it may work "badly" for other evaluators. To the best of our knowledge, almost all existing linear dimension reduction algorithms can be labeled as so-called filters [46, 8], and we believe further study of the coupling effect between reduction methods and types of classifiers deserves attention. On the other hand, we believe that there are some common regularities in mos... |

394 | The Random Subspace Method for constructing decision forests
- Ho
- 1998
Citation Context ...re considered to possess some favorable properties over deterministic ones. It is not surprising that the combination of multiple random subset selections contains sufficient discriminative information [47, 36]. However, it seems less obvious how much separability can be preserved in a single random projection. In fact there are two relevant theoretical results [21]: 1) Data from a mixture of k Gaussia... |

358 | Fisher discriminant analysis with kernels
- Mika, Rätsch, et al.
- 1999
Citation Context ...t sometimes makes it outperform nonlinear models. In addition, it may be nonlinearly extended, for example, through global coordination of local linear models (e.g. [15, 29]) or kernel mapping (e.g. [69, 60]). For recognition systems, inducing subspaces without any loss of discrimination power is especially attractive. We call this process sufficient dimension reduction, borrowing terminology from classica... |

340 | Regularized discriminant analysis
- Friedman
- 1989
Citation Context ...of these problems. For example, besides the commonly used ML estimator of S_w, various regularization techniques are available to obtain robust estimates in situations of small sample size (e.g. RDA [26]) or high collinearity (e.g. PDA [32]). Okada and Tomita [61] lifted the (Q - 1) limitation by selecting the projection axes one at a time under an orthogonality constraint. Campbell [14] first showed ... |

322 | PCA versus LDA
- Martinez, Kak
- 2001
Citation Context ...and ICA [42]. Although supervised methods generally perform better than unsupervised methods in the context of discrimination tasks, this is not always the case. For example, it was demonstrated [58] that, when the training set was small, PCA could outperform LDA in appearance-based object recognition. Another example is our comparative results on the WAVE-21 dataset (see Figure 1(a)). 3. We restrict... |

278 | Adaptive Control Processes
- Bellman
- 1961
Citation Context ...odeling tool to deal with high-dimensional spaces. There are several reasons to keep the dimensionality as low as possible, such as to reduce system complexity, to alleviate "curse of dimensionality" [3, 6], and to enhance understanding of the data. In general, dimensionality reduction can be defined as the search for a low-dimensional linear or nonlinear subspace that preserves some intrinsic propertie... |

265 | Exploratory projection pursuit
- Friedman
- 1987
Citation Context ...nalytical simplification. Aladjem [1] studied the optimization of the highly nonlinear PF distance function in the case of two-class problems. Stimulated by an idea of Friedman called "structure removal" [25], he proposed a recursive optimization procedure for searching for the directions corresponding to several large local maxima of the PF distance. The main idea is to transform the data along a found direc... |

231 | Sliced inverse regression for dimension reduction (with discussion)
- Li
- 1991
Citation Context ... Regression, in which the goal is to reduce the dimension of the predictor vector with the minimum loss in its capacity to infer about the conditional distribution of the univariate response variable [53, 54, 17]. 3. Classification, where we should seek reductions that minimize the lowest attainable classification error in the transformed space, i.e., the Bayes error [13]. Such disparate interpretations might... |

221 | Feature selection for SVMs
- Weston, Mukherjee, et al.
Citation Context ...very and elaboration of kernel methods for classification and regression seem to suggest that learning in very high dimensions is not necessarily a terrible mistake. Several feature selection methods [10, 77, 31], as special cases of linear dimension reduction, have been shown to be successful, depending directly on the intrinsic generalization ability of kernel machines in high-dimensional spaces. In the s... |

159 | Discriminant Analysis by Gaussian Mixtures
- Hastie, Tibshirani
- 1996
Citation Context ...suited to the form of classifier being used [46]. MLLT turns out to be a particular case of Kumar's HDA when the dimensions of the original and the projected space are the same. Hastie and Tibshirani [33] assumed that each observed class is in fact a mixture of unobserved normally distributed subclasses with equal within-subclass covariance. The resultant problem, called Mixture Discriminant Analysis ... |

152 | Penalized discriminant analysis
- Hastie, Buja, et al.
- 1995
Citation Context ...es the commonly used ML estimator of S_w, various regularization techniques are available to obtain robust estimates in situations of small sample size (e.g. RDA [26]) or high collinearity (e.g. PDA [32]). Okada and Tomita [61] lifted the (Q - 1) limitation by selecting the projection axes one at a time under an orthogonality constraint. Campbell [14] first showed that reduced-rank LDA is equivalent t... |

140 | Transmission of Information: A Statistical Theory of Communications
- Fano
- 1961
Citation Context ...to data with deflated maxima of the PF distance, and then iterate to obtain the next direction. Shannon's mutual information determines a lower bound on the Bayes error according to Fano's inequality [24]. Hence the lower bound on error probability is minimized when the mutual information between feature vector and class label is maximized. However, the MMI criterion has not been in wide use due to... |

135 | Semiparametric estimation of index coefficients
- Powell, Stock, et al.
- 1989
Citation Context ... Such correspondences are summarized in Table 1. We also note that the DBA estimator is in essence similar to so-called average derivative estimation (ADE), which was first introduced in [70] and [63]. ADE has been used to estimate the central mean subspace of a single-index or multi-index regression model [66, 38]. The only difference between ADE and DBA is that, in ADE, surface normals are replac... |

118 | Flexible discriminant analysis by optimal scoring
- Hastie, Tibshirani, et al.
- 1994
Citation Context ...s of functionalities are added. By solving the M-step via weighted "optimal scoring", i.e. multiple linear regression of a blurred response matrix followed by an eigen analysis, both PDA [32] and FDA [34] adapt naturally to MDA. Another feature called subclass shrinkage is proposed to bias the decomposition, and the positions of the means themselves, towards classification. MDA can be viewed as a smoo... |

105 | Maximum likelihood modeling with Gaussian distributions for classification
- Gopinath
- 1998
Citation Context ...given by the maximization of the projected between-class scatter volume. If diagonal covariance modeling is adopted in the projected subspace, the so-called Maximum Likelihood Linear Transform (MLLT) [30] may be applied after HDA to minimize the loss in likelihood between the full and diagonal covariance Gaussian models, i.e., to make the diagonal constraint more valid as evidenced from the data. This... |

98 | Experiments with random projection
- Dasgupta
- 2000
Citation Context ...cient discriminative information [47, 36]. However, it seems less obvious how much separability can be preserved in a single random projection. In fact there are two relevant theoretical results [21]: 1) Data from a mixture of k Gaussians can be projected into O(log k) dimensions while still retaining the approximate level of separation between the clusters. The projected dimension is independent... |
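The first result quoted here (cluster separation surviving a random projection) is easy to probe empirically. The dimensions, cluster separation, and random seed in this sketch are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 100, 10          # original and projected dimensionality
n = 50                  # samples per cluster

# Two spherical Gaussian clusters in R^d, means 10 units apart along one axis
mu = np.zeros(d)
mu[0] = 10.0
X1 = rng.normal(size=(n, d))
X2 = rng.normal(size=(n, d)) + mu

# Random Gaussian projection, scaled so squared norms are preserved on average
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y1, Y2 = X1 @ R, X2 @ R

# The between-cluster gap largely survives the 10x dimension reduction
gap_orig = np.linalg.norm(X1.mean(0) - X2.mean(0))
gap_proj = np.linalg.norm(Y1.mean(0) - Y2.mean(0))
```

With the scaling above the projected gap concentrates around the original one (the Johnson-Lindenstrauss argument), which is why the separation between the projected clusters stays large.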

93 | Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition
- Kumar, Andreou
- 1998
Citation Context ...ss covariance constraint and the rank m constraint on the class centroids. In light of this connection, several generalizations of LDA have been proposed under the maximum likelihood framework. Kumar [48] removed the equal within-class covariance assumption when studying the case of diagonal covariance modeling, and proposed Heteroscedastic Discriminant Analysis (Kumar's HDA), which is a maximum lik... |

89 | Dimensionality reduction using genetic algorithms
- Raymer, Punch, et al.
Citation Context ...or taken as the result of another dimension-reducing algorithm. As can be imagined, the algorithm is computationally expensive due to the high complexity of the error estimation procedure. Raymer et al. [64] used the k-NN error as the fitness function in a genetic algorithm that simultaneously optimizes a real-valued scaling vector for feature transformation, a binary mask vector for feature selection, and t... |

85 | Projection pursuit (with discussion)
- Huber
- 1985
Citation Context .... 2. We emphasize supervised algorithms that utilize the category information associated with each sample, and have thus ignored a number of well-known unsupervised methods such as PCA, Projection Pursuit [41] and ICA [42]. Although supervised methods generally perform better than unsupervised methods in the context of discrimination tasks, this is not always the case. For example, it was demonstrated... |

82 | Maximum likelihood discriminant feature spaces
- Saon, Padmanabhan, et al.
- 2000
Citation Context ...roscedastic Discriminant Analysis (Kumar's HDA), which is a maximum likelihood solution to a Gaussian model with common covariances in the rejected subspace. A different version of HDA is proposed in [68], which directly maximizes the between-class separation in the projected subspace while keeping the product of individual within-class separations constant. It can be interpreted as a constrained maximu... |

79 | On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma
- Li
- 1992
Citation Context ... Regression, in which the goal is to reduce the dimension of the predictor vector with the minimum loss in its capacity to infer about the conditional distribution of the univariate response variable [53, 54, 17]. 3. Classification, where we should seek reductions that minimize the lowest attainable classification error in the transformed space, i.e., the Bayes error [13]. Such disparate interpretations might... |

74 | Complexity measures of supervised classification problems
- Ho, Basu
Citation Context ...n. On the other hand, we believe that there are some common regularities in most real-world datasets that distinguish them from pure random sets, as has been confirmed by the results of other authors [20, 37]. Hence a subspace that allows high performance of one classifier should also facilitate high performance of a different classifier. As a preliminary attempt, we replace the SVM evaluator with a 1NN cl... |

72 | Multiclass Linear Dimension Reduction by Weighted Pairwise Fisher Criteria
- Loog, Duin, et al.
- 2001
Citation Context ...s distances since the resulting transform preserves the distance of already well-separated classes while causing a large overlap of neighboring classes. Approximate Pairwise Accuracy Criterion (aPAC) [55] is a computationally inexpensive method to tackle this problem, in which the weighting is derived from an attempt to approximate the Bayes error for pairs of classes in the projected subspace. aPAC s... |

67 | Nonparametric discriminant analysis
- Fukunaga, Mantock
- 1983
Citation Context ...ion, and the positions of the means themselves, towards classification. MDA can be viewed as a smooth version of LVQ. Nonparametric Discriminant Analysis (NDA), first proposed by Fukunaga and Mantock [27], generalizes LDA by replacing the parametric between-class scatter S_b with a distribution-free version through k-nearest neighbor local density estimation. Here is roughly how NDA works for a two-cl... |
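A minimal sketch of the nonparametric between-class scatter idea described in this excerpt, assuming a two-class setting: each sample points at the local mean of its k nearest neighbours from the other class, and the outer products of those difference vectors are averaged. The function name, normalization, and default k = 1 are my own choices, not taken from [27]:

```python
import numpy as np

def nda_between_scatter(X1, X2, k=1):
    # Nonparametric S_b: accumulate outer products of the vectors from each
    # sample to the mean of its k nearest neighbours in the *other* class.
    d = X1.shape[1]
    S = np.zeros((d, d))
    for A, B in ((X1, X2), (X2, X1)):
        for x in A:
            dist = np.sum((B - x) ** 2, axis=1)          # squared distances to other class
            local_mean = B[np.argsort(dist)[:k]].mean(axis=0)
            diff = (x - local_mean)[:, None]
            S += diff @ diff.T
    return S / (len(X1) + len(X2))

# Degenerate example: one point per class, a unit apart along the first axis,
# so all between-class scatter lies in that direction.
S = nda_between_scatter(np.array([[0.0, 0.0]]), np.array([[1.0, 0.0]]))
```

With a parametric S_b this example would give the same answer; the nonparametric version differs once class boundaries are curved, which is the point of NDA.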

66 | Feature selection via mathematical programming
- Bradley, Mangasarian, et al.
- 1998
Citation Context ...very and elaboration of kernel methods for classification and regression seem to suggest that learning in very high dimensions is not necessarily a terrible mistake. Several feature selection methods [10, 77, 31], as special cases of linear dimension reduction, have been shown to be successful, depending directly on the intrinsic generalization ability of kernel machines in high-dimensional spaces. In the s... |

64 | Learning Kernel Classifiers
- Herbrich
- 2002
Citation Context ...es three important ideas into a very successful whole: quadratic programming for convex optimization, kernel representation, and maximum margin. For complete coverage, readers are referred to, e.g., [76, 19, 35]. The decision function of a two-class problem derived by SVM can be written as h(x) = w·φ(x) + b = Σ_{i=1}^n α_i y_i K(x, x_i) + b (14), where x_i ∈ R^d is the training sample and y_i ∈ {±1} is the c... |

59 | Feature extraction based on decision boundaries
- Lee, Landgrebe
- 1993
Citation Context ...ome projection indices that either approximate or bound the Bayes error [56, 67, 72], yet they involve time-consuming iterations and require a given output dimension. As an exception, Lee & Landgrebe [51] found that "the necessary feature vectors to achieve the same classification accuracy as in the original space" can be uniquely determined by the Bayes optimal decision boundary. Thus they suggested... |

56 | Regression Graphics: Ideas for Studying Regressions Through Graphics
- Cook
- 1998
Citation Context ... Regression, in which the goal is to reduce the dimension of the predictor vector with the minimum loss in its capacity to infer about the conditional distribution of the univariate response variable [53, 54, 17]. 3. Classification, where we should seek reductions that minimize the lowest attainable classification error in the transformed space, i.e., the Bayes error [13]. Such disparate interpretations might... |

51 | Direct Semiparametric Estimation of Single-Index Models With Discrete Covariates
- Horowitz, Härdle
- 1996
Citation Context ...he so-called average derivative estimation (ADE), which was first introduced in [70] and [63]. ADE has been used to estimate the central mean subspace of a single-index or multi-index regression model [66, 38]. The only difference between ADE and DBA is that, in ADE, surface normals are replaced by the gradient F(x) = ∇f(x) of the regression function at every point x. The main issue of ADE is again how to... |

50 | Mutual Information in Learning Feature Transformations
- Torkkola, Campbell
- 2000
Citation Context ...s from the problem", as mentioned by the authors themselves, "of overlapping mutual information contributed by each feature, which becomes worse as more features are extracted". Torkkola and Campbell [75, 73] combined Renyi's quadratic entropy and a Parzen window density estimator into a differentiable and relatively simple optimization criterion for (non)linear transformation design. The criterion only depe... |

49 | Consistent estimation of scaled coefficients
- Stoker
- 1986
Citation Context ...ximation. Such correspondences are summarized in Table 1. We also note that the DBA estimator is in essence similar to so-called average derivative estimation (ADE), which was first introduced in [70] and [63]. ADE has been used to estimate the central mean subspace of a single-index or multi-index regression model [66, 38]. The only difference between ADE and DBA is that, in ADE, surface normals a... |

37 | Fractional-step dimensionality reduction
- Lotlikar, Kothari
- 2000
Citation Context ... approximations introduced in order to arrive at a solution that is computationally simple. Various other weighting heuristics are available to tweak LDA performance (e.g., the iterative weighting in [57]). Lotlikar and Kothari [56] introduced an adaptive approach which is quite similar to aPAC, except that an exact pairwise Bayes error function without any approximation is minimized by iterative adju... |

34 | Hyperspectral data analysis and supervised feature reduction via projection pursuit
- Jimenez, Landgrebe
- 1999
Citation Context ... glance, this approach seems questionable because one of the motivations for dimension reduction is just that we cannot accurately estimate the boundary in high-dimensional space. Later studies (e.g. [45, 49]) also confirmed that DBA is expensive in time and sensitive to sample size. However, recent discovery and elaboration of kernel methods for classification and regression seem to suggest that learning... |

31 |
Exploring regression structure using nonparametric functional estimation
- Samarov
- 1993
(Show Context)
Citation Context ...he so-called average derivative estimation (ADE) which were first introduced in [70] and [63]. ADE has been used to estimate the central mean subspace of a single-index or multiindex regression model =-=[66, 38]-=-. The only di#erence between ADE and DBA is that, in ADE, surface normals are replaced by the gradient F(x) = # f (x) of the regression function at every point x. The main issue of ADE is again how to... |