Results 1–10 of 13
Recent Advances of Large-Scale Linear Classification
Cited by 32 (6 self)

Abstract
Linear classification is a useful tool in machine learning and data mining. For some data in a rich dimensional space, the performance (i.e., testing accuracy) of linear classifiers has been shown to be close to that of nonlinear classifiers such as kernel methods, while training and testing are much faster. Recently, many studies have developed efficient optimization methods to construct linear classifiers and applied them to large-scale applications. In this paper, we give a comprehensive survey of the recent development of this active research area.
Scalable Kernel Methods via Doubly Stochastic Gradients
, 2014
Cited by 9 (1 self)

Abstract
The general perception is that kernel methods are not scalable, and that neural nets are the methods of choice for large-scale nonlinear learning problems. Or have we simply not tried hard enough for kernel methods? Here we propose an approach that scales up kernel methods using a novel concept called “doubly stochastic functional gradients”. Our approach relies on the fact that many kernel methods can be expressed as convex optimization problems, which we solve by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random features associated with the kernel, and then descending along this noisy functional gradient. Our algorithm is simple, does not need to commit to a preset number of random features, and allows the flexibility of the function class to grow as we see more incoming data in the streaming setting. We show that a function learned by this procedure after t iterations converges to the optimal function in the reproducing kernel Hilbert space at rate O(1/t), and achieves a generalization performance of O(1/√t). Our approach can readily scale kernel methods up to regimes that are dominated by neural nets. We show that our method can achieve performance competitive with neural nets on datasets such as 2.3 million energy materials from MolecularSpace, 8 million handwritten digits from MNIST, and 1 million photos from ImageNet using convolution features.
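To make the "two unbiased stochastic approximations" concrete, here is a minimal single-machine sketch of the doubly stochastic scheme for least-squares regression with an RBF kernel. All parameter names, the step-size schedule, and the omission of regularization are simplifying assumptions, not the paper's exact algorithm:

```python
import numpy as np

def doubly_sgd(X, y, gamma=1.0, eta=1.0, T=100, seed=0):
    """Sketch of doubly stochastic functional gradient descent for
    kernel least squares with k(x, x') = exp(-gamma * ||x - x'||^2).
    Each iteration draws one random training point and one random
    Fourier feature, and stores one coefficient alpha[t]."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(T, d))  # random Fourier directions
    B = rng.uniform(0.0, 2.0 * np.pi, size=T)               # random phases
    alpha = np.zeros(T)

    def feat(t, x):  # single random cosine feature phi_{w_t}(x)
        return np.sqrt(2.0) * np.cos(W[t] @ x + B[t])

    for t in range(T):
        i = rng.integers(n)                                  # stochastic approx. 1: random point
        f_xi = sum(alpha[s] * feat(s, X[i]) for s in range(t))
        # stochastic approx. 2: feature t gives an unbiased estimate
        # of the functional gradient of the squared loss at (X[i], y[i])
        alpha[t] = -(eta / (t + 1)) * (f_xi - y[i]) * feat(t, X[i])

    return lambda x: sum(alpha[s] * feat(s, x) for s in range(T))
```

In the streaming setting described above, the feature directions need not be pre-allocated; the paper regenerates them from stored random seeds, which is what lets the number of features grow with the data.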
Subspace embeddings for the polynomial kernel
 In NIPS
, 2014
Cited by 5 (3 self)

Abstract
Sketching is a powerful dimensionality-reduction tool for accelerating statistical learning algorithms. However, its applicability has been limited to a certain extent, since the crucial ingredient, the so-called oblivious subspace embedding, can only be applied to data spaces with an explicit representation as the column span or row span of a matrix, while in many settings learning is done in a high-dimensional space implicitly defined by the data matrix via a kernel transformation. We propose the first fast oblivious subspace embeddings that are able to embed a space induced by a nonlinear kernel without explicitly mapping the data to the high-dimensional space. In particular, we propose an embedding for mappings induced by the polynomial kernel. Using these subspace embeddings, we obtain the fastest known algorithms for computing an implicit low-rank approximation of the higher-dimensional mapping of the data matrix, for computing an approximate kernel PCA of the data, and for approximate kernel principal component regression.
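A minimal sketch of the kind of embedding this line of work builds on, for the degree-q polynomial kernel (x·y)^q, is the TensorSketch construction: the CountSketch of the q-fold tensor product of x equals the circular convolution of q independent CountSketches of x, which can be computed in the Fourier domain. The hash-table layout below is an illustrative assumption:

```python
import numpy as np

def tensor_sketch(x, hashes, signs, m):
    """CountSketch of the q-fold tensor product of x, formed without
    materializing the d^q tensor: multiply the FFTs of q independent
    CountSketches of x (convolution == pointwise product in Fourier space)."""
    prod = np.ones(m, dtype=complex)
    for h, s in zip(hashes, signs):
        cs = np.zeros(m)
        np.add.at(cs, h, s * x)        # CountSketch of x into m buckets
        prod *= np.fft.fft(cs)
    return np.real(np.fft.ifft(prod))

# illustrative usage: <tensor_sketch(x), tensor_sketch(y)> estimates (x . y)^q
rng = np.random.default_rng(0)
d, m, q = 16, 128, 2
hashes = [rng.integers(m, size=d) for _ in range(q)]
signs  = [rng.choice([-1.0, 1.0], size=d) for _ in range(q)]
```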
Compact Random Feature Maps
Cited by 4 (0 self)

Abstract
Kernel approximation using random feature maps has recently gained a lot of interest, mainly due to its applications in reducing the training and testing times of kernel-based learning algorithms. In this work, we identify that previous approaches for polynomial kernel approximation create maps that can be rank-deficient, and therefore may not utilize the capacity of the projected feature space effectively. To address this challenge, we propose compact random feature maps (CRAFTMaps) to approximate polynomial kernels more concisely and accurately. We prove error bounds for CRAFTMaps demonstrating their superior kernel-reconstruction performance compared to previous approximation schemes. We show how structured random matrices can be used to efficiently generate CRAFTMaps, and present a single-pass algorithm using CRAFTMaps to learn nonlinear multi-class classifiers. We present experiments on multiple standard datasets with performance competitive with state-of-the-art results.
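The core up-project/down-project idea can be sketched as follows: generate E > m crude degree-q product features (Kar–Karnick-style Rademacher projections), then compress them to m dimensions with a second random projection. Names and scalings here are assumptions, not the paper's exact structured construction:

```python
import numpy as np

def craftmap(X, E, m, q=2, seed=0):
    """Sketch of a compact random feature map: an over-complete random
    polynomial feature map of width E, compressed to m dimensions so the
    final map better fills the projected feature space."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Z = np.ones((n, E))
    for _ in range(q):                        # product of q random +/-1 projections
        R = rng.choice([-1.0, 1.0], size=(d, E))
        Z *= X @ R
    Z /= np.sqrt(E)
    G = rng.choice([-1.0, 1.0], size=(E, m)) / np.sqrt(m)  # down-projection
    return Z @ G
```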
Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels. ICML
, 2014
Cited by 4 (1 self)

Abstract
We consider the problem of improving the efficiency of randomized Fourier feature maps to accelerate the training and testing speed of kernel methods on large datasets. These approximate feature maps arise as Monte Carlo approximations to integral representations of shift-invariant kernel functions (e.g., the Gaussian kernel). In this paper, we propose to use Quasi-Monte Carlo (QMC) approximations instead, where the relevant integrands are evaluated on a low-discrepancy sequence of points as opposed to the random point sets of the Monte Carlo approach. We derive a new discrepancy measure, called box discrepancy, based on theoretical characterizations of the integration error with respect to a given sequence. We then propose to learn QMC sequences adapted to our setting based on explicit box-discrepancy minimization. Our theoretical analyses are complemented with empirical results that demonstrate the effectiveness of classical and adaptive QMC techniques for this problem.
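The classical-QMC variant can be sketched by swapping the i.i.d. Gaussian frequencies of random Fourier features for a Halton sequence mapped to Gaussian frequencies. The paper uses inverse-CDF transforms and learned sequences; the Box–Muller mapping below is a numpy-only stand-in:

```python
import numpy as np

def halton(n, d):
    """First n points of a d-dimensional Halton sequence, a classical
    low-discrepancy (QMC) point set with prime bases."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37][:d]
    out = np.empty((n, d))
    for j, b in enumerate(primes):
        for i in range(n):
            f, r, k = 1.0, 0.0, i + 1
            while k:
                f /= b
                r += f * (k % b)
                k //= b
            out[i, j] = r
    return out

def qmc_fourier_features(X, m, sigma=1.0):
    """Sketch of QMC Fourier features for the Gaussian kernel
    exp(-||x - y||^2 / (2 sigma^2)): frequencies come from a Halton
    sequence pushed through a Box-Muller map instead of i.i.d. draws."""
    d = X.shape[1]
    U = halton(m, 2 * d)                      # 2d uniforms -> d normals
    U1, U2 = U[:, :d], U[:, d:]
    W = np.sqrt(-2.0 * np.log(U1)) * np.cos(2.0 * np.pi * U2) / sigma
    P = X @ W.T
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(m)
```

The inner product of two feature vectors then approximates the Gaussian kernel value, with the low-discrepancy structure reducing integration error relative to plain Monte Carlo.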
Fast Flux Discriminant for Large-Scale Sparse Nonlinear Classification
Cited by 2 (0 self)

Abstract
In this paper, we propose a novel supervised learning method, Fast Flux Discriminant (FFD), for large-scale nonlinear classification. Compared with other existing methods, FFD has unmatched advantages, as it attains the efficiency and interpretability of linear models as well as the accuracy of nonlinear models. It is also sparse and naturally handles mixed data types. It works by decomposing the kernel density estimation over the entire feature space into selected low-dimensional subspaces. Since there are many possible subspaces, we propose a submodular optimization framework for subspace selection. The selected subspace predictions are then transformed into new features on which a linear model can be learned. Moreover, since the transformed features naturally admit nonnegative weights, only smooth optimization is required even with ℓ1 regularization. Unlike other nonlinear models such as kernel methods, the FFD model is interpretable, as it gives importance weights on the original features. Its training and testing are also much faster than for traditional kernel models. We carry out extensive empirical studies on real-world datasets and show that the proposed model achieves state-of-the-art classification results with sparsity, interpretability, and exceptional scalability. Our model can be learned in minutes on datasets with millions of samples, for which most existing nonlinear methods would be prohibitively expensive in space and time.
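The subspace-selection step relies on the standard greedy algorithm for monotone submodular maximization, which enjoys a (1 - 1/e) approximation guarantee. Below is a generic sketch of that pattern; the paper's actual objective is its own KDE-based utility, so the coverage-style `score` here is only an illustrative stand-in:

```python
import numpy as np

def greedy_select(score, candidates, k):
    """Greedy submodular maximization: repeatedly add the candidate
    subspace with the largest marginal gain under `score`, which is
    assumed monotone submodular over sets of subspaces."""
    chosen, remaining = [], list(candidates)
    for _ in range(k):
        base = score(chosen)
        gains = [score(chosen + [c]) - base for c in remaining]
        chosen.append(remaining.pop(int(np.argmax(gains))))
    return chosen

# toy usage: coverage-style utility over low-dimensional feature subsets
subspaces = [(0,), (1,), (0, 1), (2,)]
cover = lambda S: len(set().union(*S)) if S else 0
```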
Fast and guaranteed tensor decomposition via sketching. In
 NIPS,
, 2015
Cited by 2 (1 self)

Abstract
Tensor CANDECOMP/PARAFAC (CP) decomposition has wide applications in statistical learning of latent variable models and in data mining. In this paper, we propose fast and randomized tensor CP decomposition algorithms based on sketching. We build on the idea of count sketches, but introduce many novel ideas that are unique to tensors. We develop novel methods for randomized computation of tensor contractions via FFTs, without explicitly forming the tensors. Such tensor contractions are encountered in decomposition methods such as tensor power iterations and alternating least squares. We also design novel colliding hashes for symmetric tensors to further save time in computing the sketches. We then combine these sketching ideas with existing whitening and tensor power iterative techniques to obtain the fastest algorithm on both sparse and dense tensors. The quality of approximation under our method does not depend on properties such as sparsity or uniformity of elements. We apply the method to topic modeling and obtain competitive results.
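The FFT trick referred to above can be illustrated on a rank-1 tensor: the CountSketch of u ⊗ v ⊗ w equals the circular convolution of the three per-mode CountSketches, so it can be formed in roughly O(d + m log m) without materializing the d³ tensor. The hash/sign tables below are plain CountSketch tables, a simplification of the paper's colliding-hash constructions:

```python
import numpy as np

def mode_sketch_fft(v, h, s, m):
    """FFT of a CountSketch of one tensor mode."""
    cs = np.zeros(m)
    np.add.at(cs, h, s * v)
    return np.fft.fft(cs)

def sketch_rank1(u, v, w, hashes, signs, m):
    """CountSketch of the rank-1 tensor u (x) v (x) w via pointwise
    products in the Fourier domain (convolution theorem)."""
    F = (mode_sketch_fft(u, hashes[0], signs[0], m)
         * mode_sketch_fft(v, hashes[1], signs[1], m)
         * mode_sketch_fft(w, hashes[2], signs[2], m))
    return np.real(np.fft.ifft(F))
```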
High-performance Kernel Machines with Implicit Distributed Optimization and Randomization
Cited by 1 (1 self)

Abstract
Complex machine learning tasks arising in several domains increasingly require “big models” to be trained on “big data”. Such models tend to grow with the complexity and size of the training data, and do not make strong parametric assumptions upfront about the nature of the underlying statistical dependencies. Kernel methods constitute a very popular, versatile, and principled statistical methodology for solving a wide range of nonparametric modelling problems. However, their storage requirements and high computational complexity pose a significant barrier to their widespread adoption in big-data applications. We propose an algorithmic framework for massive-scale training of kernel-based machine learning models. Our framework combines two key technical ingredients: (i) distributed general-purpose convex optimization for a class of problems involving very large but implicit datasets, and (ii) the use of randomization to significantly accelerate the training process as well as prediction speed for kernel-based models. Our approach is based on a block-splitting variant of the Alternating Direction Method of Multipliers (ADMM) which is carefully reconfigured to handle very large random feature matrices only implicitly, while exploiting hybrid parallelism in compute environments composed of loosely or tightly coupled clusters of multicore machines. Our implementation supports a variety of machine learning tasks by enabling several loss functions, regularization schemes, kernels, and layers of randomized approximations for both dense and sparse datasets, in a highly extensible framework. We study the scalability of our framework on commodity clusters as well as on BlueGene/Q, and provide a comparison against existing sequential and parallel libraries for such problems.
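The distributed optimization pattern underlying such a framework can be illustrated with consensus ADMM for ℓ2-regularized least squares: each data block solves a small local ridge problem (in the real system these run in parallel and the data matrix is an implicit random-feature matrix), and a global averaging step enforces agreement. This is a single-machine simulation, not the paper's block-splitting variant:

```python
import numpy as np

def consensus_admm(blocks, d, lam=0.1, rho=1.0, iters=100):
    """Consensus ADMM for l2-regularized least squares split across
    data blocks: local ridge solves, global z-update, dual ascent."""
    k = len(blocks)
    x = np.zeros((k, d)); u = np.zeros((k, d)); z = np.zeros(d)
    for _ in range(iters):
        for i, (A, b) in enumerate(blocks):              # parallelizable local solves
            x[i] = np.linalg.solve(A.T @ A + rho * np.eye(d),
                                   A.T @ b + rho * (z - u[i]))
        z = rho * (x + u).sum(axis=0) / (k * rho + lam)  # averaging + l2 regularization
        u += x - z                                        # dual update on consensus constraint
    return z
```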
Kernel clustering
Abstract
Highlights:
• Kernel competitive learning (KCL) cannot be applied to large-scale data problems.
• We propose a projection-based approximate KCL method for large-scale data.
• We provide theoretical analysis of why the approximate modelling works for KCL.
• A pseudo-parallel approximate computation framework for large-scale KCL is developed.
• We experimentally show the effectiveness and efficiency of the proposals.
piCholesky: Polynomial Interpolation of Multiple Cholesky Factors for Efficient Approximate Cross-Validation
Abstract
Performing k-fold cross-validation to avoid overfitting is a standard procedure in statistical learning. For least-squares problems, validating each fold requires solving a linear system of equations, for which Newton’s method is a popular choice. An efficient way to perform Newton’s method is to find the Cholesky decomposition of the Hessian matrix followed by forward and back substitution. In this work, we demonstrate that performing Cholesky factorization for a dense set of regularization parameter values can be the dominant cost in cross-validation, and therefore a significant bottleneck in large-scale learning. To overcome this challenge, we propose an efficient way to densely interpolate Cholesky factors computed over a sparsely sampled set of regularization parameter values. This enables us to optimally minimize the holdout error while incurring only a fraction of the computational cost. Our key insight is that Cholesky factors for different regularization parameter values lie on smooth curves that can be approximated using polynomial functions. We present a framework to learn these multiple polynomial functions simultaneously, and propose solutions to several efficiency challenges in the implementation of our framework. We show results on multiple datasets demonstrating that using interpolated Cholesky factors instead of exact ones yields a substantial speedup in cross-validation with no noticeable increase in holdout error.
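The interpolation idea can be sketched directly: factor H + λI exactly at a few sample values of the regularization parameter, fit one polynomial in λ per entry of the Cholesky factor, and evaluate those polynomials on the dense grid instead of refactorizing. The paper fits all polynomials jointly; this per-entry least-squares fit is a simplification:

```python
import numpy as np

def interpolated_cholesky(H, lam_samples, lam_dense, deg=3):
    """Approximate Cholesky factors of H + lam*I on a dense grid of
    lam values by polynomial interpolation of a few exact factors."""
    d = H.shape[0]
    Ls = np.array([np.linalg.cholesky(H + lam * np.eye(d))
                   for lam in lam_samples])            # (k, d, d) exact factors
    flat = Ls.reshape(len(lam_samples), -1)            # each entry traces a curve in lam
    coeffs = np.polynomial.polynomial.polyfit(lam_samples, flat, deg)
    vals = np.polynomial.polynomial.polyval(lam_dense, coeffs)  # (d*d, n_dense)
    return vals.T.reshape(len(lam_dense), d, d)
```

Each interpolated factor L(λ) can then be used for the usual forward/back substitution in place of an exact refactorization.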