Results 1–10 of 16
Recent Advances of Large-Scale Linear Classification
Cited by 32 (6 self)
Linear classification is a useful tool in machine learning and data mining. For some data in a rich dimensional space, the performance (i.e., testing accuracy) of linear classifiers has been shown to be close to that of nonlinear classifiers such as kernel methods, while training and testing are much faster. Recently, much research has developed efficient optimization methods to construct linear classifiers and applied them to large-scale applications. In this paper, we give a comprehensive survey of recent developments in this active research area.
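A minimal sketch of the kind of large-scale linear classifier the survey covers: logistic regression trained by plain SGD. The solver details and hyperparameter values below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, roughly linearly separable data: n samples in d dimensions.
n, d = 2000, 50
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.where(X @ w_true > 0, 1.0, -1.0)

# Logistic regression with plain SGD; lr and lam are assumed values.
w = np.zeros(d)
lr, lam = 0.1, 1e-4  # step size and L2 regularization strength
for epoch in range(5):
    for i in rng.permutation(n):
        margin = y[i] * (X[i] @ w)
        # Gradient of the logistic loss plus L2 penalty at one sample.
        grad = -y[i] * X[i] / (1.0 + np.exp(margin)) + lam * w
        w -= lr * grad

train_acc = float(np.mean(np.sign(X @ w) == y))
```

Each update touches only one row of X, which is what makes this style of solver attractive when n is in the millions.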
Fast and Scalable Polynomial Kernels via Explicit Feature Maps
Cited by 10 (0 self)
Approximation of nonlinear kernels using random feature mapping has been successfully employed in large-scale data analysis applications, accelerating the training of kernel machines. While previous random feature mappings run in O(ndD) time for n training samples in d-dimensional space and D random feature maps, we propose a novel randomized tensor product technique, called Tensor Sketching, for approximating any polynomial kernel in O(n(d + D log D)) time. We also introduce both absolute and relative error bounds for our approximation to guarantee the reliability of our estimation algorithm. Empirically, Tensor Sketching achieves higher accuracy and often runs orders of magnitude faster than the state-of-the-art approach on large-scale real-world datasets.
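A sketch of the core Tensor Sketching idea for the homogeneous polynomial kernel (x·y)^p: count-sketch the input once per tensor factor, then combine the sketches by FFT convolution. Dimensions and the test vectors are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, p = 100, 4096, 2  # input dim, sketch dim, polynomial degree (illustrative)

# One (hash, sign) pair per tensor factor, as in Count Sketch.
hashes = rng.integers(0, D, size=(p, d))
signs = rng.choice([-1.0, 1.0], size=(p, d))

def tensor_sketch(x):
    # Count-sketch x once per factor, then convolve the p sketches via FFT;
    # the log D factor in the O(n(d + D log D)) bound comes from these FFTs.
    prod = np.ones(D, dtype=complex)
    for j in range(p):
        cs = np.zeros(D)
        np.add.at(cs, hashes[j], signs[j] * x)
        prod *= np.fft.fft(cs)
    return np.real(np.fft.ifft(prod))

x = rng.normal(size=d)
y = x + 0.1 * rng.normal(size=d)      # correlated pair so (x.y)^p is non-trivial
exact = (x @ y) ** p
approx = float(tensor_sketch(x) @ tensor_sketch(y))
```

The inner product of the two sketches is an unbiased estimate of (x·y)^p, with variance shrinking as D grows.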
Efficient kernel clustering using random Fourier features
In Proceedings of ICDM ’12, 2012
Cited by 9 (1 self)
Kernel clustering algorithms have the ability to capture the nonlinear structure inherent in many real-world data sets and thereby achieve better clustering performance than Euclidean-distance-based clustering algorithms. However, their quadratic computational complexity renders them non-scalable to large data sets. In this paper, we employ random Fourier maps, originally proposed for large-scale classification, to accelerate kernel clustering. The key idea behind the use of random Fourier maps for clustering is to project the data into a low-dimensional space where the inner product of the transformed data points approximates the kernel similarity between them. An efficient linear clustering algorithm can then be applied to the points in the transformed space. We also propose an improved scheme which uses the top singular vectors of the transformed data matrix to perform clustering, and yields a better approximation of kernel clustering under appropriate conditions. Our empirical studies demonstrate that the proposed schemes can be efficiently applied to large data sets containing millions of data points, while achieving accuracy similar to that of state-of-the-art kernel clustering algorithms.
Keywords: Kernel clustering, Kernel k-means, Random Fourier features, Scalability
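The key idea can be sketched in a few lines: map the data with random Fourier features for a Gaussian kernel, then run ordinary (linear) k-means on the transformed points. Sizes, bandwidth, and the seeded-center initialization are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 10, 2000, 1.0  # data dim, number of random features, kernel bandwidth

W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

def rff(X):
    # z(x) = sqrt(2/D) cos(Wx + b), so z(x).z(y) ~ exp(-||x-y||^2 / (2 sigma^2))
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# Two well-separated blobs; linear k-means (Lloyd's iterations) on the features.
X = np.vstack([rng.normal(0.0, 0.3, size=(100, d)),
               rng.normal(3.0, 0.3, size=(100, d))])
Z = rff(X)
centers = Z[[0, 100]]  # seed one center per blob, purely for this sketch
for _ in range(5):
    labels = np.argmin(((Z[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([Z[labels == c].mean(0) for c in (0, 1)])
```

Because the clustering runs in the D-dimensional feature space, its cost is linear in the number of points rather than quadratic.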
Compact Random Feature Maps
Cited by 3 (0 self)
Kernel approximation using random feature maps has recently gained a lot of interest, mainly due to its applications in reducing the training and testing times of kernel-based learning algorithms. In this work, we identify that previous approaches for polynomial kernel approximation create maps that can be rank deficient, and therefore may not utilize the capacity of the projected feature space effectively. To address this challenge, we propose compact random feature maps (CRAFTMaps) to approximate polynomial kernels more concisely and accurately. We prove error bounds for CRAFTMaps demonstrating their superior kernel reconstruction performance compared to previous approximation schemes. We show how structured random matrices can be used to efficiently generate CRAFTMaps, and present a single-pass algorithm using CRAFTMaps to learn nonlinear multiclass classifiers. We present experiments on multiple standard datasets with performance competitive with state-of-the-art results.
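One way to read the compaction idea is as a two-stage map: generate a large, possibly rank-deficient random feature map, then compress it with a Johnson–Lindenstrauss projection. The sketch below uses a Rademacher (Random Maclaurin-style) first stage for the degree-2 polynomial kernel; this is an assumed stand-in, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, E, D = 50, 4000, 500  # up-project to E features, compress to D (illustrative)

# Stage 1: Rademacher random features whose inner products estimate (x.y)^2.
W1 = rng.choice([-1.0, 1.0], size=(E, d))
W2 = rng.choice([-1.0, 1.0], size=(E, d))
up = lambda x: (W1 @ x) * (W2 @ x) / np.sqrt(E)

# Stage 2: Johnson-Lindenstrauss down-projection to a compact D-dim map.
G = rng.normal(size=(D, E)) / np.sqrt(D)
craft = lambda x: G @ up(x)

x = rng.normal(size=d)
y = x + 0.1 * rng.normal(size=d)
exact = (x @ y) ** 2
approx = float(craft(x) @ craft(y))
```

Both stages are unbiased, so the composed D-dimensional map still estimates the kernel while spending its capacity more evenly than the raw E-dimensional map.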
Fast Flux Discriminant for Large-Scale Sparse Nonlinear Classification
Cited by 2 (0 self)
In this paper, we propose a novel supervised learning method, Fast Flux Discriminant (FFD), for large-scale nonlinear classification. Compared with other existing methods, FFD has unmatched advantages, as it attains the efficiency and interpretability of linear models as well as the accuracy of nonlinear models. It is also sparse and naturally handles mixed data types. It works by decomposing the kernel density estimation in the entire feature space into selected low-dimensional subspaces. Since there are many possible subspaces, we propose a submodular optimization framework for subspace selection. The selected subspace predictions are then transformed into new features on which a linear model can be learned. Moreover, since the transformed features naturally expect nonnegative weights, we require only smooth optimization even with ℓ1 regularization. Unlike other nonlinear models such as kernel methods, the FFD model is interpretable, as it gives importance weights on the original features. Its training and testing are also much faster than those of traditional kernel models. We carry out extensive empirical studies on real-world datasets and show that the proposed model achieves state-of-the-art classification results with sparsity, interpretability, and exceptional scalability. Our model can be learned in minutes on datasets with millions of samples, for which most existing nonlinear methods would be prohibitively expensive in space and time.
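The subspace-selection step can be sketched as greedy maximization of a diminishing-returns objective over candidate low-dimensional subspaces. The scoring function below is a hypothetical stand-in (class separation discounted by feature overlap), not FFD's actual density-based objective.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Toy data: the label depends strongly on feature 0 and on a (1,2) interaction.
n, d = 1000, 6
X = rng.normal(size=(n, d))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)

def gain(subspace, chosen):
    # Hypothetical surrogate score: between-class separation of the projection,
    # discounted by overlap with already-covered features, giving the
    # diminishing-returns flavor of a submodular objective.
    overlap = len(set(subspace) & {f for s in chosen for f in s})
    proj = X[:, list(subspace)].sum(axis=1)
    sep = abs(proj[y == 1].mean() - proj[y == 0].mean())
    return sep / (1.0 + overlap)

# Candidates: all 1-D and 2-D axis-aligned subspaces.
candidates = [(i,) for i in range(d)] + list(combinations(range(d), 2))

chosen = []
for _ in range(3):  # greedily pick 3 subspaces by marginal gain
    best = max((s for s in candidates if s not in chosen),
               key=lambda s: gain(s, chosen))
    chosen.append(best)
```

Greedy selection over a submodular objective carries the usual (1 − 1/e) approximation guarantee, which is what makes the combinatorial subspace search tractable.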
Subspace embeddings for the polynomial kernel
In NIPS, 2014
Cited by 2 (2 self)
Sketching is a powerful dimensionality reduction tool for accelerating statistical learning algorithms. However, its applicability has been limited to a certain extent, since the crucial ingredient, the so-called oblivious subspace embedding, can only be applied to data spaces with an explicit representation as the column span or row span of a matrix, while in many settings learning is done in a high-dimensional space implicitly defined by the data matrix via a kernel transformation. We propose the first fast oblivious subspace embeddings that are able to embed a space induced by a nonlinear kernel without explicitly mapping the data to the high-dimensional space. In particular, we propose an embedding for mappings induced by the polynomial kernel. Using these subspace embeddings, we obtain the fastest known algorithms for computing an implicit low-rank approximation of the higher-dimensional mapping of the data matrix, for computing an approximate kernel PCA of the data, and for approximate kernel principal component regression.
Scalable Kernel Methods via Doubly Stochastic Gradients
2014
Cited by 2 (0 self)
The general perception is that kernel methods are not scalable and that neural nets are the methods of choice for large-scale nonlinear learning problems. Or have we simply not tried hard enough for kernel methods? Here we propose an approach that scales up kernel methods using a novel concept called “doubly stochastic functional gradients”. Our approach relies on the fact that many kernel methods can be expressed as convex optimization problems, and we solve these problems by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random features associated with the kernel, and then descending using this noisy functional gradient. Our algorithm is simple, does not need to commit to a preset number of random features, and allows the flexibility of the function class to grow as we see more incoming data in the streaming setting. We show that a function learned by this procedure after t iterations converges to the optimal function in the reproducing kernel Hilbert space at a rate of O(1/t), and achieves a generalization performance of O(1/√t). Our approach can readily scale kernel methods up to regimes which are dominated by neural nets. We show that our method can achieve performance competitive with neural nets on datasets such as 2.3 million energy materials from MolecularSpace, 8 million handwritten digits from MNIST, and 1 million photos from ImageNet using convolution features.
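The two sources of stochasticity can be sketched directly: each iteration samples one training point and one fresh random Fourier feature, and the learned function is the running weighted sum of all features sampled so far. The regression task, step-size schedule, and squared loss below are illustrative simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression target for a Gaussian-kernel function class.
n = 500
X = rng.uniform(-1, 1, size=n)
Y = np.sin(3 * X)

sigma = 0.5
feat = lambda x, w, b: np.sqrt(2.0) * np.cos(w * x + b)  # one random Fourier feature

T, step = 500, 0.3
omegas = np.empty(T); phases = np.empty(T); alphas = np.zeros(T)

for t in range(T):
    i = rng.integers(n)                        # stochasticity 1: a random data point
    omegas[t] = rng.normal(scale=1.0 / sigma)  # stochasticity 2: a fresh random feature
    phases[t] = rng.uniform(0, 2 * np.pi)
    # Evaluate the current function at x_i using all features sampled so far.
    f_xi = float(np.sum(alphas[:t] * feat(X[i], omegas[:t], phases[:t])))
    # Squared-loss functional gradient step with a decaying step size.
    gamma = step / (1 + 0.01 * t)
    alphas[t] = -gamma * (f_xi - Y[i]) * feat(X[i], omegas[t], phases[t])

predict = lambda x: float(np.sum(alphas * feat(x, omegas, phases)))
preds = np.array([predict(x) for x in X])
mse = float(np.mean((preds - Y) ** 2))
```

Note the function class grows with t: no feature budget is fixed in advance, matching the streaming setting described in the abstract.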
How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets
Cited by 1 (0 self)
† and ‡: shared first and second co-authorships, respectively. ¶: to whom questions and comments should be sent.
Dual-Tree Fast Exact Max-Kernel Search
2013
The problem of max-kernel search arises everywhere: given a query point p_q, a set of reference objects S_r, and some kernel K, find arg max_{p_r ∈ S_r} K(p_q, p_r). Max-kernel search is ubiquitous and appears in countless domains of science, thanks to the wide applicability of kernels; a few such domains are image matching, information retrieval, bioinformatics, similarity search, and collaborative filtering. However, there are no generalized techniques for efficiently solving max-kernel search. This paper presents a single-tree algorithm, called single-tree FastMKS, which returns the max-kernel solution for a single query point in provably O(log N) time (where N is the number of reference objects), and also a dual-tree algorithm (dual-tree FastMKS) which is useful for max-kernel search with many query points. If the set of query points is of size O(N), this algorithm returns a solution in provably O(N) time, which is significantly better than the O(N²) linear scan solution; these bounds depend on the expansion constant of the data. These algorithms work for abstract objects, as they do not require an explicit representation of the points in kernel space. Empirical results for a variety of datasets show up to five orders of magnitude of speedup in some cases. In addition, we present approximate extensions of the FastMKS algorithms that can achieve further speedups.
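The problem statement, written out as code, with the O(N) linear scan that the tree-based FastMKS bounds improve on. The Gaussian kernel and the data sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Given a query p_q, references S_r, and a kernel K, find argmax_r K(p_q, p_r).
K = lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))  # example kernel

S_r = rng.normal(size=(1000, 5))   # N = 1000 reference points
p_q = rng.normal(size=5)

def max_kernel_linear_scan(query, refs, kernel):
    # The O(N) baseline: evaluate the kernel against every reference object.
    vals = np.array([kernel(query, r) for r in refs])
    best = int(np.argmax(vals))
    return best, float(vals[best])

best_idx, best_val = max_kernel_linear_scan(p_q, S_r, K)
```

FastMKS replaces this scan with tree traversals that prune references whose kernel value is provably below the current best, which is how the O(log N) per-query bound arises.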
Stochastic Optimization for Kernel PCA
Kernel Principal Component Analysis (PCA) is a popular extension of PCA which is able to find nonlinear patterns in data. However, the application of kernel PCA to large-scale problems remains a big challenge, due to its quadratic space complexity and cubic time complexity in the number of examples. To address this limitation, we utilize techniques from stochastic optimization to solve kernel PCA with linear space and time complexities per iteration. Specifically, we formulate it as a stochastic composite optimization problem, where a nuclear norm regularizer is introduced to promote low-rankness, and then develop a simple algorithm based on stochastic proximal gradient descent. During the optimization process, the proposed algorithm always maintains a low-rank factorization of the iterates that can be conveniently held in memory. Compared to previous iterative approaches, a remarkable property of our algorithm is that it is equipped with an explicit rate of convergence. Theoretical analysis shows that the solution of our algorithm converges to the optimal one at an O(1/T) rate, where T is the number of iterations.
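The workhorse of stochastic proximal gradient with a nuclear norm regularizer is the proximal operator: soft-thresholding the singular values, which both promotes low rank and yields the compact factorization the abstract mentions. The matrix sizes and threshold below are illustrative; this is one proximal step, not the paper's full algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def prox_nuclear(M, tau):
    # Proximal operator of tau * ||.||_* : soft-threshold the singular values.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    rank = int(np.sum(s > 0))
    # Return a low-rank factorization, which is what keeps memory linear.
    return U[:, :rank] * s[:rank], Vt[:rank]

# Illustration: one proximal step pulls a noisy matrix toward its low-rank part.
A = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 50))   # rank-2 signal
noisy = A + 0.1 * rng.normal(size=(50, 50))
L, R = prox_nuclear(noisy, tau=2.0)
approx = L @ R
```

Because the iterate is stored as the pair (L, R) rather than the full matrix, the per-iteration memory cost stays linear in the number of examples, as the abstract claims.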