Results 1–10 of 54
k-means++: the advantages of careful seeding
 In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms
, 2007
"... The kmeans method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting kmeans with a very simple, randomized se ..."
Abstract

Cited by 234 (6 self)
The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
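The randomized seeding the abstract refers to is D²-sampling: pick the first center uniformly at random, then pick each subsequent center with probability proportional to its squared distance from the nearest center chosen so far. A minimal Python sketch, assuming points are tuples of floats (the function name and representation are illustrative, not from the paper):

```python
import random

def kmeans_pp_seed(points, k):
    """D^2 seeding: first center uniform at random; each further center
    is sampled with probability proportional to the squared distance
    to the nearest already-chosen center."""
    centers = [random.choice(points)]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen center
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
              for x in points]
        total = sum(d2)
        # sample an index with probability d2[i] / total
        r = random.uniform(0.0, total)
        acc = 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(x)
                break
        else:
            # numerical edge case: fall back to the last point
            centers.append(points[-1])
    return centers
```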
Smooth sensitivity and sampling in private data analysis
 In STOC
, 2007
"... We introduce a new, generic framework for private data analysis. The goal of private data analysis is to release aggregate information about a data set while protecting the privacy of the individuals whose information the data set contains. Our framework allows one to release functions f of the data ..."
Abstract

Cited by 111 (13 self)
We introduce a new, generic framework for private data analysis. The goal of private data analysis is to release aggregate information about a data set while protecting the privacy of the individuals whose information the data set contains. Our framework allows one to release functions f of the data with instance-based additive noise. That is, the noise magnitude is determined not only by the function we want to release, but also by the database itself. One of the challenges is to ensure that the noise magnitude does not leak information about the database. To address that, we calibrate the noise magnitude to the smooth sensitivity of f on the database x, a measure of variability of f in the neighborhood of the instance x. The new framework greatly expands the applicability of output perturbation, a technique for protecting individuals' privacy by adding a small amount of random noise to the released statistics. To our knowledge, this is the first formal analysis of the effect of instance-based noise in the context of data privacy. Our framework raises many interesting algorithmic questions. Namely, to apply the framework one must compute or approximate the smooth sensitivity of f on x. We show how to do this efficiently for several different functions, including the median and the cost of the minimum spanning tree. We also give a generic procedure based on sampling that allows one to release f(x) accurately on many databases x. This procedure is applicable even when no efficient algorithm for approximating the smooth sensitivity of f is known or when f is given as a black box. We illustrate the procedure by applying it to k-SED (k-means) clustering and learning mixtures of Gaussians.
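The output-perturbation idea the abstract builds on can be sketched with the standard Laplace mechanism: release f(x) plus noise whose scale is a sensitivity bound divided by the privacy parameter. The sketch below uses a caller-supplied sensitivity bound; the paper's contribution is to replace the worst-case (global) sensitivity with the smooth sensitivity of f at the actual database, which can be far smaller. Names are illustrative, not the paper's:

```python
import math
import random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) via the inverse CDF of a uniform draw."""
    u = max(random.random(), 1e-12) - 0.5   # u in (-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def perturbed_release(f_value, sensitivity, eps):
    """Output perturbation: release f(x) + Laplace(sensitivity / eps) noise.
    Here `sensitivity` is whatever bound the analyst can justify; smooth
    sensitivity (the paper's notion) is an instance-based such bound."""
    return f_value + laplace_noise(sensitivity / eps)
```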
Approximate Clustering without the Approximation
"... Approximation algorithms for clustering points in metric spaces is a flourishing area of research, with much research effort spent on getting a better understanding of the approximation guarantees possible for many objective functions such as kmedian, kmeans, and minsum clustering. This quest for ..."
Abstract

Cited by 37 (18 self)
Approximation algorithms for clustering points in metric spaces are a flourishing area of research, with much research effort spent on getting a better understanding of the approximation guarantees possible for many objective functions such as k-median, k-means, and min-sum clustering. This quest for better approximation algorithms is further fueled by the implicit hope that these better approximations also give us more accurate clusterings. E.g., for many problems such as clustering proteins by function, or clustering images by subject, there is some unknown "correct" target clustering, and the implicit hope is that approximately optimizing these objective functions will in fact produce a clustering that is close (in symmetric difference) to the truth. In this paper, we show that if we make this implicit assumption explicit, that is, if we assume that any c-approximation to the given clustering objective F is ε-close to the target, then we can produce clusterings that are O(ε)-close to the target, even for values c for which obtaining a c-approximation is NP-hard. In particular, for the k-median and k-means objectives, we show that we can achieve this guarantee for any constant c > 1, and for the min-sum objective we can do this for any constant c > 2. Our results also highlight a somewhat surprising conceptual difference between assuming that the optimal solution to, say, the k-median objective is ε-close to the target, and assuming that any approximately optimal solution is ε-close to the target, even for approximation factor say c = 1.01. In the former case, the problem of finding a solution that is O(ε)-close to the target remains computationally hard, and yet for the latter we have an efficient algorithm.
Secure two-party k-means clustering
 In CCS ’07: Proceedings of the 14th ACM conference on Computer and communications security
, 2007
"... The kMeans Clustering problem is one of the mostexplored problems in data mining to date. With the advent of protocols that have proven to be successful in performing single database clustering, the focus has changed in recent years to the question of how to extend the single database protocols to ..."
Abstract

Cited by 19 (0 self)
The k-means clustering problem is one of the most-explored problems in data mining to date. With the advent of protocols that have proven to be successful in performing single-database clustering, the focus has changed in recent years to the question of how to extend the single-database protocols to a multiple-database setting. To date there have been numerous attempts to create specific multiparty k-means clustering protocols that protect the privacy of each database, but according to the standard cryptographic definitions of "privacy protection," so far all such attempts have fallen short of providing adequate privacy. In this paper we describe a two-party k-means clustering protocol that guarantees privacy, and is more efficient than utilizing a general multiparty "compiler" to achieve the same task. In particular, a main contribution of our result is a way to efficiently compute multiple iterations of k-means clustering without revealing the intermediate values. To achieve this, we use novel techniques to perform two-party division and to sample uniformly at random from an unknown domain size. Our techniques are quite general and can be realized based on the existence of any semantically secure homomorphic encryption scheme. For concreteness, we describe our protocol based on the Paillier homomorphic encryption scheme (see [23]). We also demonstrate that our protocol is efficient in terms of communication, remaining competitive with existing protocols (such as [15]) that fail to protect privacy.
On Centroidal Voronoi Tessellation: Energy Smoothness and Fast Computation
, 2008
"... Centroidal Voronoi tessellation (CVT) is a fundamental geometric structure that finds many applications in ..."
Abstract

Cited by 19 (9 self)
Centroidal Voronoi tessellation (CVT) is a fundamental geometric structure that finds many applications in …
A PTAS for k-means clustering based on weak coresets
 DELIS – Dynamically Evolving, Large-Scale Information Systems
, 2007
"... Given a point set P ⊆ R d the kmeans clustering problem is to find a set C = {c1,..., ck} of k points and a partition of P into k clusters C1,..., Ck such that the sum of squared errors �k � i=1 p∈C �p − ci� i 2 2 is minimized. For given centers this cost function is minimized by assigning points t ..."
Abstract

Cited by 19 (9 self)
Given a point set P ⊆ ℝ^d, the k-means clustering problem is to find a set C = {c_1, ..., c_k} of k points and a partition of P into k clusters C_1, ..., C_k such that the sum of squared errors ∑_{i=1}^{k} ∑_{p ∈ C_i} ‖p − c_i‖₂² is minimized. For given centers this cost function is minimized by assigning points to the nearest center. The k-means cost function is probably the most widely used cost function in the area of clustering. In this paper we show that every unweighted point set P has a weak (ε, k)-coreset of size poly(k, 1/ε) for the k-means clustering problem, i.e., its size is independent of the cardinality |P| of the point set and the dimension d of the Euclidean space ℝ^d. A weak coreset is a weighted set S ⊆ P together with a set T such that T contains a (1 + ε)-approximation for the optimal cluster centers from P, and for every set of k centers from T the cost of the centers for S is a (1 ± ε)-approximation of the cost for P. We apply our weak coreset to obtain a PTAS for the k-means clustering problem with running time O(nkd + d · poly(k/ε) + …
The Planar k-means Problem is NP-hard
, 2009
"... In the kmeans problem, we are given a finite set S of points in ℜ m, and integer k ≥ 1, and we want to find k points (centers) so as to minimize the sum of the square of the Euclidean distance of each point in S to its nearest center. We show that this wellknown problem is NPhard even for instanc ..."
Abstract

Cited by 17 (0 self)
In the k-means problem, we are given a finite set S of points in ℝ^m and an integer k ≥ 1, and we want to find k points (centers) so as to minimize the sum of the squared Euclidean distances of each point in S to its nearest center. We show that this well-known problem is NP-hard even for instances in the plane, answering an open question posed by Dasgupta [7].
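The objective defined in the abstracts above (sum over all points of the squared Euclidean distance to the nearest center) translates directly into code; a small sketch, with illustrative names:

```python
def kmeans_cost(points, centers):
    """k-means objective: for each point, take the squared Euclidean
    distance to its nearest center, and sum over all points."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
        for x in points
    )
```

Evaluating this cost is easy; the hardness result concerns finding the k centers that minimize it.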
Stability yields a PTAS for k-Median and k-Means Clustering
, 2010
"... We consider kmedian clustering in finite metric spaces and kmeans clustering in Euclidean spaces, in the setting where k is part of the input (not a constant). For the kmeans problem, Ostrovsky et al. [ORSS06] show that if the input satisfies the condition that the optimal (k − 1)means clusterin ..."
Abstract

Cited by 13 (8 self)
We consider k-median clustering in finite metric spaces and k-means clustering in Euclidean spaces, in the setting where k is part of the input (not a constant). For the k-means problem, Ostrovsky et al. [ORSS06] show that if the input satisfies the condition that the optimal (k − 1)-means clustering is more expensive than the optimal k-means clustering by a factor of max{100, 1/α²}, then one can achieve a (1 + f(α))-approximation to the k-means optimum in time polynomial in n and k by using a variant of Lloyd's algorithm. In this work we substantially improve this approximation guarantee. We show that given only the condition that the (k − 1)-means optimum is more expensive than the k-means optimum by a factor 1 + α for some constant α > 0, we can obtain a PTAS. In particular, under this assumption, for any ε > 0 we achieve a (1 + ε)-approximation to the k-means optimum in time polynomial in n and k, and exponential in 1/ε and 1/α. We thus decouple the strength of the assumption from the quality of the approximation ratio. We also give a PTAS for the k-median problem in finite metrics under the analogous assumption. For k-means, we in addition give a randomized algorithm with improved running time n^O(1) · (k log n)^poly(1/ε, 1/α). Our technique also obtains a PTAS under the assumption of Balcan et al. [BBG09] that all (1 + α)-approximations are δ-close to a desired target clustering, when all target clusters have size greater than 2δn and α > 0 is constant. Note that the motivation of [BBG09] is that the true goal in clustering is often to get the points right, with objective values serving just as a proxy, and [BBG09] already get O(δ/α)-close for general α and arbitrary target cluster sizes. So the primary advance here is in further elucidating the approximation implications and in formally relating the assumptions. In particular, both results are based on a new notion of clustering stability that extends both the notion of [ORSS06] and that of [BBG09].
k-means has polynomial smoothed complexity
 In Proc. of the 50th FOCS (Atlanta, USA)
, 2009
"... The kmeans method is one of the most widely used clustering algorithms, drawing its popularity from its speed in practice. Recently, however, it was shown to have exponential worstcase running time. In order to close the gap between practical performance and theoretical analysis, the kmeans metho ..."
Abstract

Cited by 12 (3 self)
The k-means method is one of the most widely used clustering algorithms, drawing its popularity from its speed in practice. Recently, however, it was shown to have exponential worst-case running time. In order to close the gap between practical performance and theoretical analysis, the k-means method has been studied in the model of smoothed analysis. But even the smoothed analyses so far are unsatisfactory, as the bounds are still super-polynomial in the number n of data points. In this paper, we settle the smoothed running time of the k-means method. We show that the smoothed number of iterations is bounded by a polynomial in n and 1/σ, where σ is the standard deviation of the Gaussian perturbations. This means that if an arbitrary input data set is randomly perturbed, then the k-means method will run in expected polynomial time on that input set.
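The iterations whose smoothed number this abstract bounds are Lloyd steps: assign every point to its nearest center, then move each center to the mean of its assigned points. A minimal sketch of one such step (naming and point representation are our own, not from the paper):

```python
def lloyd_step(points, centers):
    """One iteration of the k-means method (Lloyd's algorithm):
    nearest-center assignment followed by recentering. The smoothed
    analysis bounds how many such iterations are needed in expectation."""
    k = len(centers)
    clusters = [[] for _ in range(k)]
    # assignment step: each point joins its nearest center's cluster
    for x in points:
        i = min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
        clusters[i].append(x)
    # update step: each center moves to the mean of its cluster
    new_centers = []
    for i, cluster in enumerate(clusters):
        if cluster:
            new_centers.append(tuple(sum(col) / len(cluster)
                                     for col in zip(*cluster)))
        else:
            # an empty cluster keeps its old center
            new_centers.append(centers[i])
    return new_centers
```

The method iterates this step until the centers stop moving.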
Clusterability: A Theoretical Study
 Proceedings of AISTATS 09, JMLR: W&CP 5
, 2009
"... We investigate measures of the clusterability of data sets. Namely, ways to define how ‘strong ’ or ‘conclusive ’ is the clustering structure of a given data set. We address this issue with generality, aiming for conclusions that apply regardless of any particular clustering algorithm or any specifi ..."
Abstract

Cited by 12 (1 self)
We investigate measures of the clusterability of data sets, namely, ways to define how 'strong' or 'conclusive' the clustering structure of a given data set is. We address this issue with generality, aiming for conclusions that apply regardless of any particular clustering algorithm or any specific data-generation model. We survey several notions of clusterability that have been discussed in the literature, as well as propose a new notion of data clusterability. Our comparison of these notions reveals that, although they all attempt to evaluate the same intuitive property, they are pairwise inconsistent. Our analysis discovers an interesting phenomenon: although most of the common clustering tasks are NP-hard, finding a close-to-optimal clustering for well-clusterable data sets is computationally easy. We prove instances of this general claim with respect to the various clusterability notions that we discuss. Finally, we investigate how hard it is to determine the clusterability value of a given data set. In most cases, it turns out that this is an NP-hard problem.