Results 1–10 of 81
k-means++: the advantages of careful seeding
 In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms
, 2007
Abstract

Cited by 459 (8 self)
The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
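The seeding technique the abstract refers to is D²-weighted sampling: each new center is a data point drawn with probability proportional to its squared distance from the nearest center chosen so far. A minimal sketch (function and helper names are illustrative, not from the paper; assumes at least k distinct points):

```python
import random

def d2(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_seed(points, k, rng=random):
    """k-means++ seeding: first center uniform at random; each later
    center is drawn with probability proportional to its squared
    distance to the nearest center already picked."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(d2(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

Points already chosen get weight zero, so the sampling naturally avoids duplicate centers.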
Smooth sensitivity and sampling in private data analysis
 In STOC
, 2007
Abstract

Cited by 168 (16 self)
We introduce a new, generic framework for private data analysis. The goal of private data analysis is to release aggregate information about a data set while protecting the privacy of the individuals whose information the data set contains. Our framework allows one to release functions f of the data with instance-based additive noise. That is, the noise magnitude is determined not only by the function we want to release, but also by the database itself. One of the challenges is to ensure that the noise magnitude does not leak information about the database. To address that, we calibrate the noise magnitude to the smooth sensitivity of f on the database x — a measure of variability of f in the neighborhood of the instance x. The new framework greatly expands the applicability of output perturbation, a technique for protecting individuals' privacy by adding a small amount of random noise to the released statistics. To our knowledge, this is the first formal analysis of the effect of instance-based noise in the context of data privacy. Our framework raises many interesting algorithmic questions. Namely, to apply the framework one must compute or approximate the smooth sensitivity of f on x. We show how to do this efficiently for several different functions, including the median and the cost of the minimum spanning tree. We also give a generic procedure based on sampling that allows one to release f(x) accurately on many databases x. This procedure is applicable even when no efficient algorithm for approximating the smooth sensitivity of f is known or when f is given as a black box. We illustrate the procedure by applying it to k-SED (k-means) clustering and learning mixtures of Gaussians.
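For the median case the abstract mentions, smooth sensitivity can be computed directly: S(x) = max_k e^(−βk) · LS⁽ᵏ⁾(x), where LS⁽ᵏ⁾ is the local sensitivity at Hamming distance k. A hedged sketch (the function name and the assumption that values lie in a known interval [lo, hi] are ours; the paper's construction clamps out-of-range indices to the interval endpoints):

```python
import math

def smooth_sensitivity_median(xs, beta, lo=0.0, hi=1.0):
    """Smooth sensitivity of the median on a bounded database.

    Sketch of the local-sensitivity-at-distance-k construction:
    LS^(k)(x) = max_t (x[m+t] - x[m+t-k-1]) over t in 0..k+1,
    with out-of-range entries clamped to lo / hi, and
    S(x) = max_k exp(-beta * k) * LS^(k)(x).
    Assumes odd len(xs) for a unique median index.
    """
    xs = sorted(xs)
    n = len(xs)
    m = n // 2  # 0-based median index

    def val(i):
        # Clamp out-of-range neighbors to the data domain bounds.
        return lo if i < 0 else hi if i >= n else xs[i]

    best = 0.0
    for k in range(n + 1):
        ls_k = max(val(m + t) - val(m + t - k - 1) for t in range(k + 2))
        best = max(best, math.exp(-beta * k) * ls_k)
    return best
```

With β = 0 no smoothing occurs and the result collapses to the worst-case (global) sensitivity hi − lo; larger β discounts far-away neighboring databases exponentially.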
Approximate Clustering without the Approximation
Abstract

Cited by 55 (19 self)
Approximation algorithms for clustering points in metric spaces is a flourishing area of research, with much research effort spent on getting a better understanding of the approximation guarantees possible for many objective functions such as k-median, k-means, and min-sum clustering. This quest for better approximation algorithms is further fueled by the implicit hope that these better approximations also give us more accurate clusterings. E.g., for many problems such as clustering proteins by function, or clustering images by subject, there is some unknown “correct” target clustering and the implicit hope is that approximately optimizing these objective functions will in fact produce a clustering that is close (in symmetric difference) to the truth. In this paper, we show that if we make this implicit assumption explicit — that is, if we assume that any c-approximation to the given clustering objective F is ε-close to the target — then we can produce clusterings that are O(ε)-close to the target, even for values c for which obtaining a c-approximation is NP-hard. In particular, for the k-median and k-means objectives, we show that we can achieve this guarantee for any constant c > 1, and for the min-sum objective we can do this for any constant c > 2. Our results also highlight a somewhat surprising conceptual difference between assuming that the optimal solution to, say, the k-median objective is ε-close to the target, and assuming that any approximately optimal solution is ε-close to the target, even for an approximation factor of, say, c = 1.01. In the former case, the problem of finding a solution that is O(ε)-close to the target remains computationally hard, and yet for the latter we have an efficient algorithm.
The Planar k-means Problem is NP-hard
, 2009
Abstract

Cited by 39 (0 self)
In the k-means problem, we are given a finite set S of points in ℝ^m and an integer k ≥ 1, and we want to find k points (centers) so as to minimize the sum of the squared Euclidean distances of each point in S to its nearest center. We show that this well-known problem is NP-hard even for instances in the plane, answering an open question posed by Dasgupta [7].
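The objective defined in this abstract can be written down directly; `kmeans_cost` is an illustrative name, not from the paper:

```python
def kmeans_cost(points, centers):
    """k-means objective: for each point, take the squared Euclidean
    distance to its nearest center, and sum over all points."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )
```

The hardness result says that minimizing this quantity over the choice of k centers is NP-hard already for points in the plane; evaluating it for fixed centers, as above, is easy.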
On Centroidal Voronoi Tessellation – Energy Smoothness and Fast Computation
, 2008
Abstract

Cited by 35 (16 self)
Centroidal Voronoi tessellation (CVT) is a fundamental geometric structure that finds many applications in …
Secure two-party k-means clustering
 In CCS ’07: Proceedings of the 14th ACM conference on Computer and communications security
, 2007
Abstract

Cited by 30 (0 self)
The k-Means Clustering problem is one of the most explored problems in data mining to date. With the advent of protocols that have proven to be successful in performing single-database clustering, the focus has changed in recent years to the question of how to extend the single-database protocols to a multiple-database setting. To date there have been numerous attempts to create specific multiparty k-means clustering protocols that protect the privacy of each database, but according to the standard cryptographic definitions of “privacy protection,” so far all such attempts have fallen short of providing adequate privacy. In this paper we describe a Two-Party k-Means Clustering Protocol that guarantees privacy, and is more efficient than utilizing a general multiparty “compiler” to achieve the same task. In particular, a main contribution of our result is a way to efficiently compute multiple iterations of k-means clustering without revealing the intermediate values. To achieve this, we use novel techniques to perform two-party division and to sample uniformly at random from an unknown domain size. Our techniques are quite general and can be realized based on the existence of any semantically secure homomorphic encryption scheme. For concreteness, we describe our protocol based on the Paillier homomorphic encryption scheme (see [23]). We will also demonstrate that our protocol is efficient in terms of communication, remaining competitive with existing protocols (such as [15]) that fail to protect privacy.
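The additive homomorphism of Paillier encryption that such protocols build on — multiplying ciphertexts adds plaintexts — can be illustrated with a toy implementation (textbook Paillier with the common g = n + 1 simplification; the tiny hard-coded primes are for demonstration only and provide no security):

```python
import math
import random

def paillier_keygen(p=101, q=113):
    """Toy Paillier key generation. Real deployments use primes of
    1024+ bits; these parameters only illustrate the arithmetic."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because we fix g = n + 1
    return (n,), (lam, mu)

def encrypt(pk, m, rng=random):
    (n,) = pk
    n2 = n * n
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:  # r must be a unit mod n
        r = rng.randrange(1, n)
    # g^m = (1 + n)^m = 1 + m*n (mod n^2)
    return (1 + m * n) % n2 * pow(r, n, n2) % n2

def decrypt(pk, sk, c):
    (n,) = pk
    lam, mu = sk
    n2 = n * n
    # L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) / n
    return (pow(c, lam, n2) - 1) // n * mu % n

# Homomorphism: decrypt(E(a) * E(b) mod n^2) == a + b (mod n), and
# decrypt(E(a)^s mod n^2) == s * a (mod n).
```

In a two-party setting each party can add its own (encrypted) cluster sums into the other's ciphertexts without learning them; the division and random-sampling sub-protocols described in the paper require additional machinery that this sketch does not show.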
A PTAS for k-means clustering based on weak coresets
 DELIS – Dynamically Evolving, Large-Scale Information Systems
, 2007
Abstract

Cited by 30 (11 self)
Given a point set P ⊆ ℝ^d, the k-means clustering problem is to find a set C = {c_1, …, c_k} of k points and a partition of P into k clusters C_1, …, C_k such that the sum of squared errors ∑_{i=1}^{k} ∑_{p ∈ C_i} ‖p − c_i‖_2^2 is minimized. For given centers this cost function is minimized by assigning points to the nearest center. The k-means cost function is probably the most widely used cost function in the area of clustering. In this paper we show that every unweighted point set P has a weak (ε, k)-coreset of size poly(k, 1/ε) for the k-means clustering problem, i.e. its size is independent of the cardinality |P| of the point set and the dimension d of the Euclidean space ℝ^d. A weak coreset is a weighted set S ⊆ P together with a set T such that T contains a (1 + ε)-approximation for the optimal cluster centers from P, and for every set of k centers from T the cost of the centers for S is a (1 ± ε)-approximation of the cost for P. We apply our weak coreset to obtain a PTAS for the k-means clustering problem with running time O(nkd + d · poly(k/ε) + …
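The coreset guarantee compares a weighted cost against the full unweighted cost. A small sketch of the weighted objective being bounded (the function name is illustrative; the (1 ± ε) comparison itself is the paper's theorem, not reproduced here):

```python
def weighted_cost(weighted_points, centers):
    """Cost of a weighted point set S: each point contributes its
    weight times the squared Euclidean distance to its nearest center.
    For a weak (eps, k)-coreset S of P, this value is within a
    (1 +/- eps) factor of the unweighted cost of all of P, for every
    candidate set of k centers drawn from the accompanying set T."""
    return sum(
        w * min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p, w in weighted_points
    )
```

Because |S| is poly(k, 1/ε), candidate centers can be evaluated against S instead of the full data set, which is what makes the PTAS's running time independent of |P| in the search phase.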
Are stable instances easy?
, 2008
Abstract

Cited by 30 (2 self)
We introduce the notion of a stable instance for a discrete optimization problem, and argue that in many practical situations only sufficiently stable instances are of interest. The question then arises whether stable instances of NP-hard problems are easier to solve: in particular, whether there exist algorithms that solve correctly and in polynomial time all sufficiently stable instances of some NP-hard problem. The paper focuses on the Max-Cut problem, for which we show that this is indeed the case.
Scalable K-Means++
Abstract

Cited by 23 (2 self)
Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means‖ obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means‖ outperforms k-means++ in both sequential and parallel settings.
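The oversampling phase of k-means‖ can be sketched sequentially (names are illustrative; `ell` is the oversampling factor, typically Θ(k), and the final reweight-and-recluster step is only noted in a comment):

```python
import random

def d2(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_parallel_oversample(points, ell, rounds, rng=random):
    """k-means||-style oversampling: in each of a logarithmic number of
    rounds, every point joins the candidate set C independently with
    probability min(1, ell * d^2(p, C) / cost(C)). Each round's
    per-point work is embarrassingly parallel."""
    C = [rng.choice(points)]
    for _ in range(rounds):
        dists = [min(d2(p, c) for c in C) for p in points]
        cost = sum(dists)
        if cost == 0:  # all points already coincide with candidates
            break
        C += [p for p, d in zip(points, dists)
              if rng.random() < min(1.0, ell * d / cost)]
    # The paper then weights each candidate by the number of points
    # closest to it and runs k-means++ on this small weighted set to
    # obtain the final k centers.
    return C
```

The point of the construction is that a few such rounds leave only O(ell · rounds) candidates, so the sequential k passes of plain k-means++ are replaced by a constant-to-logarithmic number of parallel passes plus a cheap reclustering of the small candidate set.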
Clusterability: A Theoretical Study
 Proceedings of AISTATS 09, JMLR: W&CP 5
, 2009
Abstract

Cited by 22 (4 self)
We investigate measures of the clusterability of data sets, namely, ways to define how ‘strong’ or ‘conclusive’ the clustering structure of a given data set is. We address this issue with generality, aiming for conclusions that apply regardless of any particular clustering algorithm or any specific data generation model. We survey several notions of clusterability that have been discussed in the literature, as well as propose a new notion of data clusterability. Our comparison of these notions reveals that, although they all attempt to evaluate the same intuitive property, they are pairwise inconsistent. Our analysis discovers an interesting phenomenon: although most of the common clustering tasks are NP-hard, finding a close-to-optimal clustering for well-clusterable data sets is computationally easy. We prove instances of this general claim with respect to the various clusterability notions that we discuss. Finally, we investigate how hard it is to determine the clusterability value of a given data set. In most cases, it turns out that this is an NP-hard problem.