Results 1 - 10 of 26
Core vector machines: Fast SVM training on very large data sets
Journal of Machine Learning Research, 2005
Cited by 61 (11 self)
Standard SVM training has O(m^3) time and O(m^2) space complexities, where m is the training set size. It is thus computationally infeasible on very large data sets. By observing that practical SVM implementations only approximate the optimal solution by an iterative strategy, we scale up kernel methods by exploiting such “approximateness” in this paper. We first show that many kernel methods can be equivalently formulated as minimum enclosing ball (MEB) problems in computational geometry. Then, by adopting an efficient approximate MEB algorithm, we obtain provably approximately optimal solutions with the idea of core sets. Our proposed Core Vector Machine (CVM) algorithm can be used with nonlinear kernels and has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and real-world data sets demonstrate that the CVM is as accurate as existing SVM implementations, but is much faster and can handle much larger data sets than existing scale-up methods. For example, CVM with the Gaussian kernel produces superior results on the KDDCUP-99 intrusion detection data, which has about five million training patterns, in only 1.4 seconds on a 3.2GHz Pentium–4 PC.
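To make the approximate MEB idea concrete, here is a minimal sketch of a simple (1 + ε)-approximate minimum enclosing ball iteration in the style of Bădoiu and Clarkson, in plain Euclidean space. It is not the CVM algorithm itself, which operates in the kernel-induced feature space and maintains an explicit core set; the function name `approx_meb` and the toy points are assumptions made for illustration.

```python
import math

def approx_meb(points, eps):
    """Return (center, radius) of a ball enclosing all points.

    Assumption: this is the simple farthest-point update rule, not CVM.
    After ceil(1/eps^2) updates the radius is at most (1 + eps) times
    the optimal MEB radius.
    """
    center = list(points[0])  # start from an arbitrary input point
    iterations = math.ceil(1.0 / eps ** 2)
    for i in range(1, iterations + 1):
        # The farthest point is the one that most violates the current ball.
        far = max(points, key=lambda p: math.dist(p, center))
        # Move the center a shrinking fraction of the way toward it.
        step = 1.0 / (i + 1)
        center = [c + step * (f - c) for c, f in zip(center, far)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius

# Toy usage: the four corners of the unit square.
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
center, radius = approx_meb(corners, eps=0.05)
```

Each iteration scans all points once, so the cost per iteration is linear in the number of points, and the number of iterations depends only on ε.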
Training a support vector machine in the primal
Neural Computation, 2007
Cited by 47 (5 self)
Most literature on Support Vector Machines (SVMs) concentrates on the dual optimization problem. In this paper, we would like to point out that the primal problem can also be solved efficiently, both for linear and non-linear SVMs, and that there is no reason for ignoring this possibility. On the contrary, from the primal point of view new families of algorithms for large scale SVM training can be investigated.
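As an illustration of working directly on the primal, the sketch below minimizes a linear SVM objective, (lam / 2) * ||w||^2 + (1 / n) * sum of max(0, 1 - y * w.x)^2, by plain gradient descent. The abstract does not specify an optimizer or loss, so the squared hinge loss, the absence of a bias term, the step size, and the names `train_primal_svm` and `predict` are all assumptions chosen for brevity.

```python
def train_primal_svm(xs, ys, lam=0.01, lr=0.05, steps=200):
    """Gradient descent on a linear primal SVM with squared hinge loss.

    Assumptions: labels are +1/-1, no bias term, fixed step size.
    """
    dim = len(xs[0])
    n = len(xs)
    w = [0.0] * dim
    for _ in range(steps):
        # Gradient of the regularizer (lam / 2) * ||w||^2.
        grad = [lam * wj for wj in w]
        for x, y in zip(xs, ys):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            slack = 1.0 - margin
            if slack > 0.0:
                # Only margin violators contribute to the loss gradient.
                for j in range(dim):
                    grad[j] -= (2.0 / n) * slack * y * x[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    """Sign of the linear decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0.0 else -1

# Toy usage on four linearly separable points.
xs = [(2.0, 1.0), (3.0, 2.0), (-2.0, -1.0), (-3.0, -2.0)]
ys = [1, 1, -1, -1]
w = train_primal_svm(xs, ys)
```

The squared hinge loss is used here because it is differentiable, which keeps the gradient step simple; other loss choices would need a different update.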
Fast gaussian process regression using kd-trees
In Advances in Neural Information Processing Systems 18, 2006
Cited by 24 (3 self)
The computation required for Gaussian process regression with n training examples is about O(n^3) during training and O(n) for each prediction. This makes Gaussian process regression too slow for large datasets. In this paper, we present a fast approximation method, based on kd-trees, that significantly reduces both the prediction and the training times of Gaussian process regression.
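The costs quoted above can be seen in a minimal sketch of exact Gaussian process regression: training factorizes an n × n matrix (cubic in n), and each predictive mean is a sum over n kernel evaluations. This is the exact baseline, not the kd-tree approximation from the paper. The squared-exponential kernel, the one-dimensional inputs, the noise variance, and all function names are assumptions made for illustration.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel on scalars (an assumed choice)."""
    return math.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))

def cholesky(A):
    """Lower-triangular L with A = L L^T; O(n^3) operations."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_fit(xs, ys, noise_var=1e-6):
    """Training: solve (K + noise_var * I) alpha = y."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve_cholesky(cholesky(K), ys)

def gp_predict_mean(xs, alpha, x_new):
    """Prediction: one kernel evaluation per training point, O(n)."""
    return sum(a * rbf(x, x_new) for a, x in zip(alpha, xs))

# Toy usage: fit four samples of sin(x) and predict between them.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [math.sin(x) for x in train_x]
alpha = gp_fit(train_x, train_y)
mean_at_1_5 = gp_predict_mean(train_x, alpha, 1.5)
```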
Approximation Methods for Gaussian Process Regression
2007
Cited by 9 (2 self)
A wealth of computationally efficient approximation methods for Gaussian process regression have been recently proposed. We give a unifying overview of sparse approximations, following Quiñonero-Candela and Rasmussen (2005), and a brief review of approximate matrix-vector multiplication methods.
Automatic online tuning for fast Gaussian summation
Cited by 9 (5 self)
Many machine learning algorithms require the summation of Gaussian kernel functions, an expensive operation if implemented straightforwardly. Several methods have been proposed to reduce the computational complexity of evaluating such sums, including tree-based and analysis-based methods. These achieve varying speedups depending on the bandwidth, dimension, and prescribed error, making the choice between methods difficult for machine learning tasks. We provide an algorithm that combines tree methods with the Improved Fast Gauss Transform (IFGT). As originally proposed, the IFGT suffers from two problems: (1) the Taylor series expansion does not perform well for very low bandwidths, and (2) parameter selection is not trivial and can drastically affect performance and ease of use. We address the first problem by employing a tree data structure, resulting in four evaluation methods whose performance varies based on the distribution of sources and targets and input parameters such as desired accuracy and bandwidth. To solve the second problem, we present an online tuning approach that results in a black box method that automatically chooses the evaluation method and its parameters to yield the best performance for the input data, desired accuracy, and bandwidth. In addition, the new IFGT parameter selection approach allows for tighter error bounds. Our approach chooses the fastest method at negligible additional cost, and has superior performance in comparisons with previous approaches.
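For reference, the straightforward evaluation that the abstract describes as expensive can be sketched as a double loop over targets and sources, costing one exponential per source-target pair. This is the direct baseline only, not the IFGT or any tree method. The bandwidth convention exp(-||x - y||^2 / h^2) and the name `gauss_sum_direct` are assumptions; other papers scale the exponent differently.

```python
import math

def gauss_sum_direct(sources, weights, targets, h):
    """Direct Gaussian summation.

    For each target y, returns sum_i q_i * exp(-||x_i - y||^2 / h^2).
    Cost is (number of sources) * (number of targets) kernel evaluations.
    """
    out = []
    for y in targets:
        total = 0.0
        for x, q in zip(sources, weights):
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
            total += q * math.exp(-sq_dist / (h * h))
        out.append(total)
    return out

# Toy usage: one weighted source, two targets.
values = gauss_sum_direct([(0.0, 0.0)], [2.0], [(0.0, 0.0), (1.0, 0.0)], h=1.0)
```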
Cross-validation optimization for large scale hierarchical classification kernel methods
In NIPS, 2007
Cited by 7 (1 self)
We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work.
A fast algorithm for learning large scale preference relations
Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007
Cited by 4 (3 self)
We consider the problem of learning the ranking function that maximizes a generalization of the Wilcoxon-Mann-Whitney statistic on training data. Relying on an ε-exact approximation for the error-function, we reduce the computational complexity of each iteration of a conjugate gradient algorithm for learning ranking functions from O(m^2) to O(m), where m is the size of the training data. Experiments on public benchmarks for ordinal regression and collaborative filtering show that the proposed algorithm is as accurate as the best available methods in terms of ranking accuracy, when trained on the same data, and is several orders of magnitude faster.
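For orientation, the classical two-class Wilcoxon-Mann-Whitney statistic is the fraction of (positive, negative) pairs that a scoring function orders correctly. The sketch below computes it with a naive loop over all pairs, so the work grows with the product of the two class sizes. It does not reproduce the paper's generalization or its fast approximation. Counting ties as one half is a common convention assumed here, and `wmw_statistic` is a hypothetical name.

```python
def wmw_statistic(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly.

    Assumption: a tie between a positive and a negative score counts 0.5.
    """
    correct = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                correct += 1.0
            elif sp == sn:
                correct += 0.5
    return correct / (len(scores_pos) * len(scores_neg))

# Toy usage: every positive outscores every negative.
perfect = wmw_statistic([3.0, 2.0], [1.0, 0.0])
```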
Fast large scale Gaussian process regression using approximate matrix-vector products
Presented at the Learning workshop, 2007
Cited by 4 (1 self)
Gaussian processes (GP) allow the treatment of non-linear non-parametric regression problems in a Bayesian framework [6]. Unfortunately, their nonparametric nature causes computational problems for large data sets, due to an unfavorable O(N^3) time and O(N^2) memory scaling for training. The key computational task involves inversion of an N × N covariance matrix K + σ^2 I, where [K]_ij = K(x_i, x_j), K is the covariance function of the GP, and σ^2 is the noise variance. Direct computation of the inverse requires O(N^3) operations and O(N^2) storage, which is impractical even for problems of moderate size (typically a few thousand). An important subfield of work in GP has attempted to bring this scaling down to O(m^2 N) by making sparse
Adaptive constraint reduction for training support vector machines
2007
Cited by 4 (2 self)
A support vector machine (SVM) determines whether a given observed pattern lies in a particular class. The decision is based on prior training of the SVM on a set of patterns with known classification, and training is achieved by solving a convex quadratic programming problem. Since there are typically a large number of training patterns, this can be expensive. In this work, we propose an adaptive constraint reduction primal-dual interior-point method for training a linear SVM with ℓ1 penalty (hinge loss) for misclassification. We reduce the computational effort by assembling the normal equation matrix using only a well-chosen subset of patterns. Starting with a large portion of the patterns, our algorithm excludes more and more unnecessary patterns as the iteration proceeds. We extend our approach to training nonlinear SVMs through Gram matrix approximation methods. We demonstrate the effectiveness of the algorithm on a variety of standard test problems.
Greedy forward selection algorithms to sparse Gaussian Process Regression
In Proceedings of 2006 International Joint Conference on Neural Networks (IJCNN 2006), 2006
Cited by 3 (2 self)
This paper considers the basis vector selection issue involved in forward selection algorithms for sparse Gaussian Process Regression (GPR). Firstly, we re-examine a previous basis vector selection criterion proposed by Smola and Bartlett [20], referred to as loss-smola, and give some new formulae to implement this criterion for the full-greedy strategy more efficiently in O(n^2 k_max) time instead of the original O(n^2 k_max^2), where n is the number of training examples and k_max ≪ n is the maximally allowed number of selected basis vectors. Secondly, in order to make the algorithm scale linearly in n, which is quite preferable for large datasets, we present an approximate version, loss-sun, of the loss-smola criterion. We compare the full greedy algorithms induced by the loss-sun and loss-smola criteria, respectively, on several medium-scale datasets. In contrast to loss-smola, the advantage associated with the loss-sun criterion is that it could lead to an algorithm which scales as O(n k_max^2) time and O(n k_max) memory if coupled with the sub-greedy scheme [20], [7]. Our criterion is similar to a matching pursuit approach, referred to as loss-keert, proposed very recently by Keerthi and Chu [7] but with different motivations. Numerical experiments on a number of large-scale datasets have demonstrated that our proposed method is always better than loss-keert in both generalization performance and running time. Finally, we discuss the drawbacks of the sub-greedy strategy and present two approximate full-greedy strategies, which can be applied to all three basis vector selection criteria discussed in this paper.

