k-means++: The advantages of careful seeding.
In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, 2007
Cited by 478 (8 self)
The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
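The abstract above does not spell out the seeding rule. As a minimal sketch, assuming the D²-weighted sampling usually associated with k-means++ (first center uniform at random, each later center drawn with probability proportional to its squared distance from the nearest center chosen so far), the idea might look like this. The names `sq_dist` and `kmeanspp_seed` are illustrative, and the sketch assumes the input contains at least k distinct points.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seed(points, k, seed=None):
    """Pick k initial centers by D^2-weighted sampling (illustrative sketch)."""
    rng = random.Random(seed)
    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Draw the next center with probability proportional to d2.
        nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(d, sq_dist(p, nxt)) for d, p in zip(d2, points)]
    return centers
```

Points that are already centers have weight zero, so they are not drawn again; the returned centers would then be handed to an ordinary k-means iteration.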
Data Clustering: 50 Years Beyond K-means
2008
Cited by 294 (7 self)
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of algorithms and methods for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory: to find structure in data. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. Although K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general-purpose clustering algorithm and to the ill-posed nature of the clustering problem. We provide a brief overview of clustering, summarize well-known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large-scale data clustering.
FINDING STRUCTURE WITH RANDOMNESS: PROBABILISTIC ALGORITHMS FOR CONSTRUCTING APPROXIMATE MATRIX DECOMPOSITIONS
Cited by 253 (6 self)
Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed, either explicitly or implicitly, to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition
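The two-stage scheme described above (random sampling to find a subspace, then a deterministic factorization of the compressed matrix) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's reference implementation: the Gaussian test matrix, the `oversample` parameter, and the function name `randomized_low_rank` are choices made here for the example.

```python
import numpy as np

def randomized_low_rank(A, k, oversample=10, seed=None):
    """Approximate rank-k SVD of A via a randomized range finder (sketch)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    ell = min(n, k + oversample)  # sample a few extra directions (assumption)
    # Stage 1: random sampling identifies a subspace capturing most of A's action.
    Omega = rng.standard_normal((n, ell))
    Q, _ = np.linalg.qr(A @ Omega)  # orthonormal basis for the sampled range
    # Stage 2: compress A to that subspace and factor the small matrix deterministically.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k, :]
```

For a matrix whose rank is at most k, the product `U @ np.diag(s) @ Vt` should reproduce the input up to rounding; for general matrices the error depends on how quickly the singular values decay.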
Computing communities in large networks using random walks
J. of Graph Alg. and App., 2004
Cited by 226 (3 self)
Dense subgraphs of sparse graphs (communities), which appear in most real-world complex networks, play an important role in many contexts. Computing them, however, is generally expensive. We propose here a measure of similarities between vertices based on random walks which has several important advantages: it captures well the community structure in a network, it can be computed efficiently, and it can be used in an agglomerative algorithm to compute efficiently the community structure of a network. We propose such an algorithm, called Walktrap, which runs in time $O(mn^2)$ and space $O(n^2)$ in the worst case, and in time $O(n^2 \log n)$ and space $O(n^2)$ in most real-world cases (n and m are respectively the number of vertices and edges in the input graph). Extensive comparison tests show that our algorithm surpasses previously proposed ones concerning the quality of the obtained community structures and that it stands among the best ones concerning the running time.
Improved approximation algorithms for large matrices via random projections.
In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, 2006
Sampling from large matrices: an approach through geometric functional analysis
Journal of the ACM, 2006
Cited by 132 (5 self)
We study random submatrices of a large matrix A. We show how to approximately compute A from its random submatrix of the smallest possible size O(r log r) with a small error in the spectral norm, where $r = \|A\|_F^2 / \|A\|_2^2$ is the numerical rank of A. The numerical rank is always bounded by, and is a stable relaxation of, the rank of A. This yields an asymptotically optimal guarantee in an algorithm for computing low-rank approximations of A. We also prove asymptotically optimal estimates on the spectral norm and the cut-norm of random submatrices of A. The result for the cut-norm yields a slight improvement on the best known sample complexity for an approximation algorithm for MAX-2-CSP problems. We use methods of probability in Banach spaces, in particular the law of large numbers for operator-valued random variables.
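As a small worked example of the numerical rank defined above (the squared Frobenius norm divided by the squared spectral norm), consider a diagonal matrix chosen here purely for illustration:

```latex
A = \operatorname{diag}(2, 1, 1), \qquad
\|A\|_F^2 = 2^2 + 1^2 + 1^2 = 6, \qquad
\|A\|_2^2 = 2^2 = 4, \qquad
r = \frac{\|A\|_F^2}{\|A\|_2^2} = \frac{6}{4} = 1.5 \le \operatorname{rank}(A) = 3.
```

The value 1.5 sits below the exact rank 3, consistent with the bound stated in the abstract, and it changes only slightly under small perturbations of the entries, whereas the exact rank can jump.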
Matrix Approximation and Projective Clustering via Volume Sampling
2006
Cited by 90 (3 self)
Frieze, Kannan, and Vempala (JACM 2004) proved that a small sample of rows of a given matrix A spans the rows of a low-rank approximation D that minimizes $\|A - D\|_F$ within a small additive error, and the sampling can be done efficiently using just two passes over the matrix. In this paper, we generalize this result in two ways. First, we prove that the additive error drops exponentially by iterating the sampling in an adaptive manner (adaptive sampling). Using this result, we give a pass-efficient algorithm for computing a low-rank approximation with reduced additive error. Our second result is that there exist k rows of A whose span contains the rows of a multiplicative (k + 1)-approximation to the best rank-k matrix; moreover, this subset can be found by sampling k-subsets of rows from a natural distribution (volume sampling). Combining volume sampling with adaptive sampling yields the existence of a set of k + k(k + 1)/ε rows whose span contains the rows of a multiplicative (1 + ε)-approximation. This leads to a PTAS for the following NP-hard
The effectiveness of Lloyd-type methods for the k-means problem
In FOCS, 2006
Cited by 84 (3 self)
We investigate variants of Lloyd’s heuristic for clustering high-dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd’s heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd’s heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd’s method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.
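For orientation, the baseline that these variants modify is the standard Lloyd iteration: assign each point to its nearest center, then move each center to the mean of its assigned points. The sketch below shows only that plain iteration; it does not implement the paper's variants or its seeding process, and the function name `lloyd` and the rule of leaving an empty cluster's center in place are assumptions made for the example.

```python
def lloyd(points, centers, iters=10):
    """Plain Lloyd iteration on tuples of floats (illustrative sketch)."""
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = []
        for c, members in zip(centers, clusters):
            if members:
                dim = len(c)
                mean = tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                new_centers.append(mean)
            else:
                new_centers.append(c)  # assumption: an empty cluster keeps its center
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers
```

The starting `centers` argument is where a seeding procedure would plug in; the quality of the result depends heavily on that choice.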
The Planar k-means Problem is NP-hard
2009
Cited by 45 (0 self)
In the k-means problem, we are given a finite set S of points in $\mathbb{R}^m$ and an integer k ≥ 1, and we want to find k points (centers) so as to minimize the sum of the squares of the Euclidean distances from each point in S to its nearest center. We show that this well-known problem is NP-hard even for instances in the plane, answering an open question posed by Dasgupta [7].
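The objective defined above can be written down directly. The following is a minimal sketch that only evaluates the cost of a given set of centers (finding the best centers is the hard part); the function name `kmeans_cost` is illustrative.

```python
def kmeans_cost(points, centers):
    """Sum over all points of the squared Euclidean distance to the nearest center."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )

# Three points on a line, two centers: costs are 1 + 1 + 0.
print(kmeans_cost([(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)], [(1.0, 0.0), (10.0, 0.0)]))  # prints 2.0
```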