Results 1–10 of 27
Fast and Scalable Polynomial Kernels via Explicit Feature Maps
Cited by 14 (0 self)
Abstract
Approximation of nonlinear kernels using random feature mapping has been successfully employed in large-scale data analysis applications, accelerating the training of kernel machines. While previous random feature mappings run in O(ndD) time for n training samples in d-dimensional space and D random feature maps, we propose a novel randomized tensor product technique, called Tensor Sketching, for approximating any polynomial kernel in O(n(d + D log D)) time. Also, we introduce both absolute and relative error bounds for our approximation to guarantee the reliability of our estimation algorithm. Empirically, Tensor Sketching achieves higher accuracy and often runs orders of magnitude faster than the state-of-the-art approach for large-scale real-world datasets.
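The construction behind this abstract (count sketches combined in the FFT domain, due to Pham and Pagh) can be illustrated in a few lines of numpy. This is a toy sketch, not the authors' code; the dimensions, seeds, and test vectors below are made up for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

def count_sketch(X, h, s, D):
    """Count sketch of each row of X: scatter signed coordinates into D buckets."""
    n, d = X.shape
    C = np.zeros((n, D))
    for j in range(d):
        C[:, h[j]] += s[j] * X[:, j]
    return C

def tensor_sketch(X, hashes, signs, D):
    """Degree-p tensor sketch: multiply the FFTs of p independent count
    sketches, so that <ts(x), ts(y)> is an unbiased estimate of (x . y)^p.
    FFT turns the circular convolution of sketches into O(D log D) work."""
    F = np.ones((X.shape[0], D), dtype=complex)
    for h, s in zip(hashes, signs):
        F *= np.fft.fft(count_sketch(X, h, s, D), axis=1)
    return np.real(np.fft.ifft(F, axis=1))

# Demo on the degree-2 polynomial kernel (x . y)^2 with unit vectors.
d, D, p = 10, 4096, 2
hashes = rng.integers(0, D, size=(p, d))       # one hash per degree factor
signs = rng.choice([-1.0, 1.0], size=(p, d))   # one sign vector per factor
x = rng.normal(size=(1, d)); x /= np.linalg.norm(x)
y = rng.normal(size=(1, d)); y /= np.linalg.norm(y)
exact = float(x @ y.T) ** p
approx = float(tensor_sketch(x, hashes, signs, D)
               @ tensor_sketch(y, hashes, signs, D).T)
```

Each input costs O(d + D log D) to sketch (versus O(dD) for explicit random projections), which is where the O(n(d + D log D)) total in the abstract comes from.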
Efficient kernel clustering using random Fourier features
 In Proceedings of ICDM’12
, 2012
Cited by 9 (1 self)
Abstract
Abstract—Kernel clustering algorithms have the ability to capture the nonlinear structure inherent in many real-world data sets and thereby achieve better clustering performance than Euclidean distance based clustering algorithms. However, their quadratic computational complexity renders them non-scalable to large data sets. In this paper, we employ random Fourier maps, originally proposed for large-scale classification, to accelerate kernel clustering. The key idea behind the use of random Fourier maps for clustering is to project the data into a low-dimensional space where the inner product of the transformed data points approximates the kernel similarity between them. An efficient linear clustering algorithm can then be applied to the points in the transformed space. We also propose an improved scheme which uses the top singular vectors of the transformed data matrix to perform clustering, and yields a better approximation of kernel clustering under appropriate conditions. Our empirical studies demonstrate that the proposed schemes can be efficiently applied to large data sets containing millions of data points, while achieving accuracy similar to that achieved by state-of-the-art kernel clustering algorithms.
Keywords: Kernel clustering, Kernel k-means, Random Fourier features, Scalability
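The key idea in this abstract, project with random Fourier features and then run a linear clustering algorithm, can be shown with a minimal numpy sketch (the SVD-based improved scheme is omitted; the Gaussian-kernel bandwidth, feature count, and blob data below are illustrative choices, and the deterministic k-means initialization is a simplification for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_fourier_features(X, D, gamma, rng):
    """Map X so that z(x) . z(y) ~= exp(-gamma * ||x - y||^2) (Gaussian kernel)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))  # frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def lloyd_kmeans(Z, init_idx, iters=20):
    """Plain Lloyd's iterations with a fixed (deterministic) initialization."""
    centers = Z[list(init_idx)].copy()
    for _ in range(iters):
        labels = np.argmin(((Z[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Two well-separated 2-D blobs, clustered linearly in the transformed space.
A = rng.normal(0.0, 0.3, size=(20, 2))
B = rng.normal(5.0, 0.3, size=(20, 2))
X = np.vstack([A, B])
Z = random_fourier_features(X, D=2000, gamma=0.5, rng=rng)
labels = lloyd_kmeans(Z, init_idx=(0, 20))

# The feature map really does approximate the kernel value for a pair:
exact01 = np.exp(-0.5 * ((X[0] - X[1]) ** 2).sum())
approx01 = float(Z[0] @ Z[1])
```

Because clustering happens on the D-dimensional features, the quadratic kernel-matrix cost the abstract mentions never arises.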
Scalable sparse subspace clustering
 CVPR
Cited by 6 (2 self)
Abstract
In this paper, we address two problems in the Sparse Subspace Clustering algorithm (SSC): the scalability issue and the out-of-sample problem. SSC constructs a sparse similarity graph for spectral clustering using ℓ1-minimization-based coefficients, and has achieved state-of-the-art results for image clustering and motion segmentation. However, the time complexity of SSC is cubic in the problem size, so it is inefficient to apply SSC in large-scale settings. Moreover, SSC does not handle out-of-sample data that were not used to construct the similarity graph. For each new datum, SSC needs to recalculate the cluster memberships of the whole data set, which makes SSC uncompetitive for fast online clustering. To address these problems, this paper proposes an out-of-sample extension of SSC, named Scalable Sparse Subspace Clustering (SSSC), which makes SSC feasible for clustering large-scale data sets. SSSC adopts a "sampling, clustering, coding, and classifying" strategy. Extensive experimental results on several popular data sets demonstrate the effectiveness and efficiency of our method compared with state-of-the-art algorithms.
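The "sampling, clustering, coding, and classifying" strategy can be sketched as follows. This is a loose illustration, not the paper's method: SSC codes each point with an ℓ1 penalty, whereas a ridge (ℓ2) code stands in here to keep the sketch dependency-free, and the in-sample clustering step is assumed already done. All data and parameters are made up:

```python
import numpy as np

rng = np.random.default_rng(2)

def code_and_classify(X, S, sample_labels, lam=0.1):
    """Code each point over the clustered in-sample set S, then assign it to
    the cluster whose sample members carry the most coefficient mass."""
    G = S @ S.T + lam * np.eye(len(S))          # regularized Gram matrix
    n_clusters = int(sample_labels.max()) + 1
    out = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        c = np.linalg.solve(G, S @ x)           # ridge code of x over the sample
        scores = [np.abs(c[sample_labels == k]).sum() for k in range(n_clusters)]
        out[i] = int(np.argmax(scores))
    return out

# Toy data: two well-separated groups of 20 points each.
A = rng.normal((5.0, 0.0), 0.3, size=(20, 2))
B = rng.normal((0.0, 5.0), 0.3, size=(20, 2))
X = np.vstack([A, B])
sample_idx = np.array([0, 1, 2, 20, 21, 22])    # sampling
sample_labels = np.array([0, 0, 0, 1, 1, 1])    # clustering (assumed done)
labels = code_and_classify(X, X[sample_idx], sample_labels)  # coding + classifying
```

Only the small sample is ever clustered, so the cubic cost is paid on m ≪ n points; everything else is per-point coding, which is what makes the scheme scale and handle out-of-sample data.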
Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce
Cited by 2 (1 self)
Abstract
Kernel k-means is an effective method for data clustering which extends the commonly used k-means algorithm to work with a similarity matrix over complex data structures. It is, however, computationally very expensive, as it requires the complete kernel matrix to be calculated and stored. Further, its kernelized nature hinders the parallelization of its computations on modern scalable infrastructures for distributed computing. In this paper, we define a family of kernel-based low-dimensional embeddings that allows for scaling kernel k-means on MapReduce via an efficient and unified parallelization strategy. We then propose two practical methods for low-dimensional embedding that adhere to our definition of this embedding family. Exploiting the proposed parallelization strategy, we present two scalable MapReduce algorithms for kernel k-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark datasets.
A Divide-and-Conquer Solver for Kernel Support Vector Machines
Cited by 2 (2 self)
Abstract
The kernel support vector machine (SVM) is one of the most widely used classification methods; however, the amount of computation required becomes the bottleneck when facing millions of samples. In this paper, we propose and analyze a novel divide-and-conquer solver for kernel SVMs (DC-SVM). In the division step, we partition the kernel SVM problem into smaller subproblems by clustering the data, so that each subproblem can be solved independently and efficiently. We show theoretically that the support vectors identified by the subproblem solutions are likely to be support vectors of the entire kernel SVM problem, provided that the problem ...
An improved bound for the Nyström method for large eigengap. arXiv preprint arXiv:1209.0001
, 2012
Dedications and acknowledgments
, 1983
Cited by 1 (0 self)
Abstract
I would like to thank Michelle Rheaume for her continued support.
Euler Clustering
Cited by 1 (0 self)
Abstract
By always mapping data from a lower-dimensional space into a higher- or even infinite-dimensional space, kernel k-means is able to organize data into groups when data of different clusters are not linearly separable. However, kernel k-means incurs large-scale computation due to the representer theorem, i.e., it keeps an extremely large kernel matrix in memory when using the popular Gaussian and spatial pyramid matching kernels, which largely limits its use for processing large-scale data. Also, existing kernel clustering can be overfitted by outliers. In this paper, we introduce Euler clustering, which can not only maintain the benefit of nonlinear modeling using a kernel function but also significantly alleviate the large-scale computational problem in kernel-based clustering. This is realized by incorporating the Euler kernel, which relies on a nonlinear and robust cosine metric that is less sensitive to outliers. More importantly, it intrinsically induces an empirical map which maps data onto a complex space of the same dimension. Euler clustering takes these advantages to measure the similarity between data in a robust way without increasing the dimensionality of the data, and thus solves the large-scale problem in kernel k-means. We evaluate Euler clustering and show its superiority against related methods on five publicly available datasets.
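The "empirical map onto a complex space of the same dimension" mentioned in the abstract is explicit, so the induced cosine kernel can be checked exactly; the sketch below assumes the standard Euler map phi(x)_j = exp(i·alpha·pi·x_j)/sqrt(2) (alpha and the test vectors are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def euler_map(X, alpha=1.0):
    """Complex-valued Euler feature map, same dimension as the input, with
    Re<phi(x), phi(y)> = 0.5 * sum_j cos(alpha * pi * (x_j - y_j))."""
    return np.exp(1j * alpha * np.pi * X) / np.sqrt(2.0)

# Unlike random feature maps, the kernel identity is exact, not approximate:
x = rng.uniform(size=5)
y = rng.uniform(size=5)
lhs = float(np.real(np.vdot(euler_map(x), euler_map(y))))   # <phi(x), phi(y)>
rhs = 0.5 * float(np.cos(np.pi * (x - y)).sum())            # cosine metric
```

Because phi is explicit and dimension-preserving, k-means can be run directly on the complex features (standard complex arithmetic for distances and means), with no n×n kernel matrix ever formed; the bounded cosine terms are what make the metric less sensitive to outliers.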
Date
, 2013
Abstract
AFIT-ENG-13-J-07. Artificial Immune Systems (AISs) are a type of statistical Machine Learning (ML) algorithm, based on the Biological Immune System (BIS), applied to classification problems. Inspired by increased performance in other ML algorithms when combined with kernel methods, this research explores using kernel methods as the distance measure for a specific AIS algorithm, the Real-valued Negative Selection Algorithm (RNSA). This research also demonstrates that the hard binary decision from the traditional RNSA can be relaxed to a continuous output, while maintaining the ability to map back to the original RNSA decision boundary if necessary. Continuous output is used in this research to generate Receiver Operating Characteristic (ROC) curves and calculate Areas Under Curves (AUCs), but can also be used as a basis of classification confidence or probability. The resulting Kernel Extended Real-valued Negative Selection Algorithm (KERNSA) offers performance improvements over a comparable RNSA implementation. Using the Sigmoid kernel in ...
Scalable Single Linkage Hierarchical Clustering For Big Data
Abstract
Abstract—Personal computing technologies are everywhere; hence, there is an abundance of staggeringly large data sets: the Library of Congress has stored over 160 terabytes of web data, and it is estimated that Facebook alone logs nearly a petabyte of data per day. Thus, there is a pertinent need for systems by which one can elucidate the similarity and dissimilarity among and between groups in these big data sets. Clustering is one way to find these groups. In this paper, we extend the scalable Visual Assessment of Tendency (sVAT) algorithm to return single-linkage partitions of big data sets. The sVAT algorithm is designed to provide visual evidence of the number of clusters in unloadable (big) data sets. The extension we describe enables sVAT to also efficiently return the data partition indicated by the visual evidence. The computational complexity and storage requirements of sVAT are (usually) significantly less than the O(n^2) requirement of the classic single-linkage hierarchical algorithm. We show that sVAT is a scalable instantiation of single-linkage clustering for data sets that contain c compact-separated clusters, where c ≪ n and n is the number of objects. For data sets that do not contain compact-separated clusters, we show that sVAT produces a good approximation of single-linkage partitions. Experimental results are presented for both synthetic and real data sets.
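The single-linkage baseline this abstract compares against has a classic equivalence worth making concrete: a single-linkage partition into c clusters is exactly what you get by cutting the c-1 longest edges of a minimum spanning tree. The sketch below shows only that baseline, not sVAT's sampling-and-reordering machinery, and the blob data is made up:

```python
import numpy as np

rng = np.random.default_rng(4)

def single_linkage_labels(X, k):
    """Single-linkage partition into k clusters: build an MST with Prim's
    algorithm (O(n^2) time, O(n) extra memory beyond the distance matrix),
    cut its k-1 longest edges, and label the remaining components."""
    n = len(X)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()                 # cheapest link from each node to tree
    parent = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((parent[j], j, best[j]))
        in_tree[j] = True
        closer = dist[j] < best
        best[closer] = dist[j][closer]
        parent[closer] = j
    edges.sort(key=lambda e: e[2])
    # Union-find over the n-k shortest MST edges; components = clusters.
    root = np.arange(n)
    def find(a):
        while root[a] != a:
            root[a] = root[root[a]]
            a = root[a]
        return a
    for a, b, _ in edges[: n - k]:
        root[find(a)] = find(b)
    comp = {}
    return np.array([comp.setdefault(find(i), len(comp)) for i in range(n)])

# Two compact, well-separated blobs are recovered exactly.
A = rng.normal(0.0, 0.2, size=(15, 2))
B = rng.normal(4.0, 0.2, size=(15, 2))
labels = single_linkage_labels(np.vstack([A, B]), k=2)
```

The n×n distance matrix here is precisely what becomes unloadable for big data, which is the cost sVAT's sampling scheme is designed to avoid for compact-separated clusters.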