Results 1–9 of 9
Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels. ICML, 2014
Cited by 4 (1 self)
Abstract We consider the problem of improving the efficiency of randomized Fourier feature maps to accelerate training and testing speed of kernel methods on large data sets. These approximate feature maps arise as Monte Carlo approximations to integral representations of shift-invariant kernel functions (e.g., the Gaussian kernel). In this paper, we propose to use Quasi-Monte Carlo (QMC) approximations instead, where the relevant integrands are evaluated on a low-discrepancy sequence of points as opposed to random point sets as in the Monte Carlo approach. We derive a new discrepancy measure, called box discrepancy, based on theoretical characterizations of the integration error with respect to a given sequence. We then propose to learn QMC sequences adapted to our setting based on explicit box discrepancy minimization. Our theoretical analyses are complemented with empirical results that demonstrate the effectiveness of classical and adaptive QMC techniques for this problem.
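The Monte Carlo feature map the abstract refers to can be sketched in a few lines; the QMC variant would replace the Gaussian draws for `W` with a low-discrepancy point set pushed through the Gaussian inverse CDF. All names here are illustrative, not the paper's code.

```python
import numpy as np

def gaussian_rff(X, D, sigma=1.0, rng=None):
    """Monte Carlo random Fourier feature map approximating the
    Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    The QMC variant of the paper draws W from a low-discrepancy
    sequence instead of i.i.d. Gaussian samples."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D))   # spectral samples
    b = rng.uniform(0.0, 2 * np.pi, size=D)          # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Inner products of feature maps approximate kernel evaluations:
X = np.random.default_rng(1).normal(size=(5, 3))
Z = gaussian_rff(X, D=5000, sigma=1.0, rng=0)
K_approx = Z @ Z.T
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
```

With D = 5000 features the entrywise error is small; the point of the paper is that QMC point sets drive this error down faster than i.i.d. sampling.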
How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets
Cited by 3 (0 self)
† and ‡: shared first and second coauthorships, respectively ¶: to whom questions and comments should be sent
Scale up nonlinear component analysis with doubly stochastic gradients. In NIPS, 2015
Cited by 1 (0 self)
Abstract Nonlinear component analysis such as kernel Principal Component Analysis (KPCA) and kernel Canonical Correlation Analysis (KCCA) are widely used in machine learning, statistics and data analysis, but they cannot scale up to big datasets. Recent attempts have employed random feature approximations to convert the problem to the primal form for linear computational complexity. However, to obtain high-quality solutions, the number of random features should be of the same order of magnitude as the number of data points, making such an approach not directly applicable to the regime with millions of data points. We propose a simple, computationally efficient, and memory-friendly algorithm based on "doubly stochastic gradients" to scale up a range of kernel nonlinear component analysis methods, such as kernel PCA, CCA and SVD. Despite the non-convex nature of these problems, our method enjoys theoretical guarantees that it converges at the rate Õ(1/t) to the global optimum, even for the top-k eigen subspace. Unlike many alternatives, our algorithm does not require explicit orthogonalization, which is infeasible on big datasets. We demonstrate the effectiveness and scalability of our algorithm on large-scale synthetic and real-world datasets.
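The "convert to the primal form" step the abstract mentions can be sketched as linear PCA on an explicit random feature map; the paper's doubly stochastic method additionally streams over both data points and random features rather than materializing the feature matrix. This is an illustrative sketch, not the paper's algorithm.

```python
import numpy as np

def kpca_random_features(X, k, D=2000, seed=0):
    """Approximate kernel PCA in the primal: map the data through an
    explicit random Fourier feature map for the Gaussian kernel, then
    run ordinary linear PCA on the features. Returns the top-k
    principal component scores."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)   # explicit feature map
    Z = Z - Z.mean(axis=0)                     # center in feature space
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:k].T                        # k principal components
```

The abstract's point is that this direct approach needs D on the order of n for high accuracy, which the doubly stochastic gradient scheme avoids.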
Improper Deep Kernels
Abstract Neural networks have recently re-emerged as a powerful hypothesis class, yielding impressive classification accuracy in multiple domains. However, their training is a non-convex optimization problem which poses theoretical and practical challenges. Here we address this difficulty by turning to "improper" learning of neural nets. In other words, we learn a classifier that is not a neural net but is competitive with the best neural net model given a sufficient number of training examples. Our approach relies on a novel kernel construction scheme in which the kernel is the result of integration over the set of all possible instantiations of neural models. It turns out that the corresponding integral can be evaluated in closed form via a simple recursion. Thus we translate the non-convex learning problem of a neural net into an SVM with an appropriate kernel. We also provide sample complexity results which depend on the stability of the optimal neural net.
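As a flavor of how integrating over random network weights can yield a closed-form layer-wise kernel recursion, here is the degree-1 arc-cosine kernel recursion of Cho and Saul, a related construction; the paper's own kernel and recursion differ in their details.

```python
import numpy as np

def arccos1_kernel(X, depth=3):
    """Depth-compositional kernel via the degree-1 arc-cosine
    recursion: each layer maps the current Gram matrix K through a
    closed-form angular formula, mimicking an infinitely wide random
    ReLU layer. Note the recursion preserves the diagonal ||x||^2."""
    K = X @ X.T
    for _ in range(depth):
        diag = np.sqrt(np.diag(K))
        norm = np.outer(diag, diag)
        cos_t = np.clip(K / norm, -1.0, 1.0)
        theta = np.arccos(cos_t)
        K = (norm / np.pi) * (np.sin(theta) + (np.pi - theta) * cos_t)
    return K
```

The resulting Gram matrix can be handed to any SVM solver, which is the "improper learning" step the abstract describes.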
Additive Approximations in High Dimensional Nonparametric Regression via the SALSA
Abstract High-dimensional nonparametric regression is an inherently difficult problem with known lower bounds depending exponentially on dimension. A popular strategy to alleviate this curse of dimensionality has been to use first-order additive models, which model the regression function as a sum of independent functions on each dimension. Though useful in controlling the variance of the estimate, such models are often too restrictive in practical settings. Between non-additive models, which often have large variance, and first-order additive models, which have large bias, there has been little work to exploit the tradeoff in the middle via additive models of intermediate order. In this work, we propose SALSA, which bridges this gap by allowing interactions between variables while controlling model capacity by limiting the order of interactions. SALSA minimises the residual sum of squares with squared RKHS norm penalties. Algorithmically, it can be viewed as Kernel Ridge Regression with an additive kernel. When the regression function is additive, the excess risk is only polynomial in dimension. Using the Girard-Newton formulae, we efficiently sum over a combinatorial number of terms in the additive expansion. Via a comparison on 15 real datasets, we show that our method is competitive against 21 other alternatives.
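The Girard-Newton step can be illustrated directly: the order-m additive term is the m-th elementary symmetric polynomial of the per-dimension base kernel values, computable from power sums without enumerating the C(d, m) subsets. A minimal sketch (names are illustrative, not the paper's code):

```python
import numpy as np

def additive_kernel(base_vals, order):
    """Sum of all interaction terms up to the given order, where the
    m-th order term is the elementary symmetric polynomial e_m of the
    per-dimension base kernel values k_i(x_i, y_i). The Newton-Girard
    identities recover e_m from power sums p_j in O(order^2) time."""
    # base_vals: shape (d,), the per-dimension base kernel evaluations
    p = [np.sum(base_vals ** j) for j in range(order + 1)]  # power sums
    e = [1.0]                                               # e_0 = 1
    for m in range(1, order + 1):
        s = sum((-1) ** (j - 1) * e[m - j] * p[j] for j in range(1, m + 1))
        e.append(s / m)
    return sum(e[1:order + 1])
```

For base values (1, 2, 3) and order 2 this returns e_1 + e_2 = 6 + 11 = 17, matching the brute-force sum over singletons and pairs.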
Towards More Efficient SPSD Matrix Approximation and CUR Matrix Decomposition, 2016
Abstract Symmetric positive semidefinite (SPSD) matrix approximation methods have been extensively used to speed up large-scale eigenvalue computation and kernel learning methods. The standard sketch-based method, which we call the prototype model, produces relatively accurate approximations, but is inefficient on large square matrices. The Nyström method is highly efficient, but can only achieve low accuracy. In this paper we propose a novel model that we call the fast SPSD matrix approximation model. The fast model is nearly as efficient as the Nyström method and as accurate as the prototype model. We show that the fast model can potentially solve eigenvalue problems and kernel learning problems in time linear in the matrix size n while achieving 1 + ε relative error, whereas both the prototype model and the Nyström method cost at least quadratic time to attain a comparable error bound. Empirical comparisons among the prototype model, the Nyström method, and our fast model demonstrate the superiority of the fast model. We also contribute new understandings of the Nyström method: it is a special instance of our fast model and is an approximation to the prototype model. Our technique can be straightforwardly applied to make the CUR matrix decomposition more efficient to compute without much affecting the accuracy.
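The Nyström method the abstract builds on admits a very short sketch: sample columns of the SPSD matrix, then reconstruct it via the pseudo-inverse of the sampled block. The paper's fast model refines this baseline; the code below is only the classical method, with illustrative names.

```python
import numpy as np

def nystrom(K, idx):
    """Classical Nyström approximation of an SPSD matrix from a column
    sample: K ≈ C W⁺ Cᵀ with C = K[:, idx] and W = K[idx][:, idx].
    Exact when the sampled columns span the column space of K."""
    C = K[:, idx]                      # n x m sampled columns
    W = K[np.ix_(idx, idx)]            # m x m intersection block
    return C @ np.linalg.pinv(W) @ C.T
```

For a rank-r matrix, sampling any r columns whose intersection block has rank r reconstructs K exactly; in general the approximation degrades with the spectral decay, which is the accuracy gap the paper targets.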
Utilize Old Coordinates: Faster Doubly Stochastic Gradients for Kernel Methods
Abstract To address the scalability issue of kernel methods, random features are commonly used for kernel approximation
Time-Accuracy Tradeoffs in Kernel Prediction: Controlling Prediction Quality, 2017
Abstract Kernel regression or classification (also referred to as weighted NN methods in Machine Learning) is appealing for its simplicity and is therefore ubiquitous in data analysis. However, practical implementations of kernel regression or classification consist of quantizing or subsampling data to improve time efficiency, often at the cost of prediction quality. While such tradeoffs are necessary in practice, their statistical implications are generally not well understood, hence practical implementations come with few performance guarantees. In particular, it is unclear whether it is possible to maintain the statistical accuracy of kernel prediction, crucial in some applications, while improving prediction time. The present work provides guiding principles for combining kernel prediction with data quantization so as to guarantee good tradeoffs between prediction time and accuracy, and in particular so as to approximately maintain the good accuracy of vanilla kernel prediction. Furthermore, our tradeoff guarantees are worked out explicitly in terms of a tuning parameter which acts as a knob that favors either time or accuracy depending on practical needs. On one end of the knob, prediction time is of the same order as that of single-nearest-neighbor prediction (which is statistically inconsistent) while maintaining consistency; on the other end of the knob, the prediction risk is nearly minimax-optimal (in terms of the original data size) while still reducing time complexity. The analysis thus reveals the interaction between the data quantization approach and the kernel prediction method, and most importantly gives explicit control of the tradeoff to the practitioner rather than fixing the tradeoff in advance or leaving it opaque. The theoretical results are validated on data from a range of real-world application domains; in particular we demonstrate that the theoretical knob performs as expected.
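A crude stand-in for the quantization knob is plain subsampling: evaluate Nadaraya-Watson kernel regression on m of the n points, with m trading time against accuracy. This is illustrative only; the paper's quantization scheme and its guarantees are more refined, and the names below are hypothetical.

```python
import numpy as np

def quantized_kernel_regression(X, y, x_query, h, m, seed=0):
    """Nadaraya-Watson prediction on a random subsample of m points,
    standing in for the paper's data quantization; m is the 'knob':
    small m is fast but biased, m = n recovers the vanilla estimator."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)  # "quantize" by subsampling
    Xs, ys = X[idx], y[idx]
    d2 = ((Xs - x_query) ** 2).sum(axis=1)           # squared distances
    w = np.exp(-d2 / (2 * h ** 2))                   # Gaussian kernel weights
    return (w @ ys) / w.sum()
```

Sweeping m from 1 to n traces out exactly the kind of time-accuracy curve whose two endpoints the abstract characterizes.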
Extended and Unscented Kitchen Sinks
Abstract We propose a scalable multiple-output generalization of unscented and extended Gaussian processes. These algorithms have been designed to handle general likelihood models by linearizing them using a Taylor series or the Unscented Transform in a variational inference framework. We build upon random feature approximations of Gaussian process covariance functions and show that, on small-scale single-task problems, our methods can attain similar performance to the original algorithms at lower computational cost. We also evaluate our methods at larger scale on MNIST and on a seismic inversion problem, which is inherently multi-task.
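The random feature approximation underlying these models reduces GP regression to Bayesian linear regression on "random kitchen sink" features. The sketch below handles only a Gaussian likelihood, whereas the paper's extended/unscented variants linearize general likelihoods; all names are illustrative.

```python
import numpy as np

def kitchen_sink_gp(X, y, Xtest, D=500, sigma=1.0, noise=0.1, seed=0):
    """GP regression posterior mean approximated by Bayesian linear
    regression on random Fourier features of the Gaussian kernel:
    with a standard normal prior on the weights, the posterior mean
    solves (ZᵀZ + noise² I) w = Zᵀy."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    phi = lambda A: np.sqrt(2.0 / D) * np.cos(A @ W + b)
    Z, Zt = phi(X), phi(Xtest)
    A = Z.T @ Z + noise ** 2 * np.eye(D)
    w_mean = np.linalg.solve(A, Z.T @ y)   # posterior mean weights
    return Zt @ w_mean                     # predictive mean at Xtest
```

Cost is O(nD² + D³) rather than the O(n³) of exact GP regression, which is the scalability the abstract leans on.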