Smooth sensitivity and sampling in private data analysis
In STOC, 2007
Cited by 59 (7 self)
We introduce a new, generic framework for private data analysis. The goal of private data analysis is to release aggregate information about a data set while protecting the privacy of the individuals whose information the data set contains. Our framework allows one to release functions f of the data with instance-based additive noise. That is, the noise magnitude is determined not only by the function we want to release, but also by the database itself. One of the challenges is to ensure that the noise magnitude does not leak information about the database. To address that, we calibrate the noise magnitude to the smooth sensitivity of f on the database x, a measure of variability of f in the neighborhood of the instance x. The new framework greatly expands the applicability of output perturbation, a technique for protecting individuals’ privacy by adding a small amount of random noise to the released statistics. To our knowledge, this is the first formal analysis of the effect of instance-based noise in the context of data privacy. Our framework raises many interesting algorithmic questions. Namely, to apply the framework one must compute or approximate the smooth sensitivity of f on x. We show how to do this efficiently for several different functions, including the median and the cost of the minimum spanning tree. We also give a generic procedure based on sampling that allows one to release f(x) accurately on many databases x. This procedure is applicable even when no efficient algorithm for approximating smooth sensitivity of f is known or when f is given as a black box. We illustrate the procedure by applying it to k-SED (k-means) clustering and learning mixtures of Gaussians.
Approximate Clustering without the Approximation
Cited by 22 (14 self)
Approximation algorithms for clustering points in metric spaces are a flourishing area of research, with much research effort spent on getting a better understanding of the approximation guarantees possible for many objective functions such as k-median, k-means, and min-sum clustering. This quest for better approximation algorithms is further fueled by the implicit hope that these better approximations also give us more accurate clusterings. E.g., for many problems such as clustering proteins by function, or clustering images by subject, there is some unknown “correct” target clustering and the implicit hope is that approximately optimizing these objective functions will in fact produce a clustering that is close (in symmetric difference) to the truth. In this paper, we show that if we make this implicit assumption explicit, that is, if we assume that any c-approximation to the given clustering objective F is ε-close to the target, then we can produce clusterings that are O(ε)-close to the target, even for values c for which obtaining a c-approximation is NP-hard. In particular, for k-median and k-means objectives, we show that we can achieve this guarantee for any constant c > 1, and for the min-sum objective we can do this for any constant c > 2. Our results also highlight a somewhat surprising conceptual difference between assuming that the optimal solution to, say, the k-median objective is ε-close to the target, and assuming that any approximately optimal solution is ε-close to the target, even for an approximation factor of, say, c = 1.01. In the former case, the problem of finding a solution that is O(ε)-close to the target remains computationally hard, and yet for the latter we have an efficient algorithm.
A PTAS for k-means clustering based on weak coresets
DELIS – Dynamically Evolving, Large-Scale Information Systems, 2007
Cited by 11 (4 self)
Given a point set $P \subseteq \mathbb{R}^d$, the k-means clustering problem is to find a set $C = \{c_1, \dots, c_k\}$ of k points and a partition of P into k clusters $C_1, \dots, C_k$ such that the sum of squared errors $\sum_{i=1}^{k} \sum_{p \in C_i} \|p - c_i\|_2^2$ is minimized. For given centers this cost function is minimized by assigning points to the nearest center. The k-means cost function is probably the most widely used cost function in the area of clustering. In this paper we show that every unweighted point set P has a weak (ε, k)-coreset of size poly(k, 1/ε) for the k-means clustering problem, i.e. its size is independent of the cardinality |P| of the point set and the dimension d of the Euclidean space $\mathbb{R}^d$. A weak coreset is a weighted set S ⊆ P together with a set T such that T contains a (1 + ε)-approximation for the optimal cluster centers from P and for every set of k centers from T the cost of the centers for S is a (1 ± ε)-approximation of the cost for P. We apply our weak coreset to obtain a PTAS for the k-means clustering problem with running time O(nkd + d · poly(k/ε) +
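The k-means cost described in the abstract above can be illustrated with a small sketch. It assumes points and centers are given as tuples of floats; the function name `kmeans_cost` and the sample data are hypothetical and do not come from the paper. Each point is charged its squared Euclidean distance to the nearest center, which is the assignment that minimizes the cost for fixed centers.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, centers):
    """Sum of squared errors when every point is assigned to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Hypothetical toy data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
cost = kmeans_cost(points, centers)
print(cost)
```

Each of the four points lies at distance 1 from its nearest center, so the cost here is 4.0. This computes only the objective value; it is not the coreset construction or the PTAS from the paper.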
k-means has polynomial smoothed complexity
In Proc. of the 50th FOCS (Atlanta, USA), 2009
Cited by 7 (3 self)
The k-means method is one of the most widely used clustering algorithms, drawing its popularity from its speed in practice. Recently, however, it was shown to have exponential worst-case running time. In order to close the gap between practical performance and theoretical analysis, the k-means method has been studied in the model of smoothed analysis. But even the smoothed analyses so far are unsatisfactory as the bounds are still superpolynomial in the number n of data points. In this paper, we settle the smoothed running time of the k-means method. We show that the smoothed number of iterations is bounded by a polynomial in n and 1/σ, where σ is the standard deviation of the Gaussian perturbations. This means that if an arbitrary input data set is randomly perturbed, then the k-means method will run in expected polynomial time on that input set. Keywords: k-means; clustering; smoothed analysis
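The smoothed-analysis setting in the abstract above can be sketched as follows: perturb each input coordinate with Gaussian noise of standard deviation σ, then run the k-means method (Lloyd's iterations) and count the rounds until the assignment stops changing. This is a minimal illustration under assumptions, not the paper's analysis: the helper names `perturb` and `lloyd`, the choice of initial centers, and the toy data are all hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def perturb(points, sigma, rng):
    """Add independent Gaussian noise N(0, sigma^2) to every coordinate."""
    return [tuple(x + rng.gauss(0.0, sigma) for x in p) for p in points]


def lloyd(points, centers, max_iter=1000):
    """Run the k-means method; return (centers, labels, completed update rounds)."""
    centers = [tuple(c) for c in centers]
    labels = None
    for iteration in range(max_iter):
        # Assignment step: each point goes to its nearest current center.
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            return centers, labels, iteration
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels, max_iter


# Hypothetical toy input: two groups of three points, perturbed with sigma = 0.1.
rng = random.Random(0)
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
noisy = perturb(data, sigma=0.1, rng=rng)
centers, labels, iters = lloyd(noisy, centers=noisy[:2])
print(labels, iters)
```

The iteration count returned here is the quantity the abstract bounds in expectation by a polynomial in n and 1/σ; a single run on a toy input demonstrates only the setup, not the bound.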
On Centroidal Voronoi Tessellation - Energy Smoothness and Fast Computation
Cited by 7 (4 self)
Centroidal Voronoi tessellation (CVT) is a fundamental geometric structure that finds many applications
Clusterability: A Theoretical Study
Proceedings of AISTATS-09, JMLR: W&CP 5, 2009
Cited by 7 (1 self)
We investigate measures of the clusterability of data sets, namely, ways to define how ‘strong’ or ‘conclusive’ the clustering structure of a given data set is. We address this issue with generality, aiming for conclusions that apply regardless of any particular clustering algorithm or any specific data generation model. We survey several notions of clusterability that have been discussed in the literature, and also propose a new notion of data clusterability. Our comparison of these notions reveals that, although they all attempt to evaluate the same intuitive property, they are pairwise inconsistent. Our analysis discovers an interesting phenomenon: although most of the common clustering tasks are NP-hard, finding a close-to-optimal clustering for well-clusterable data sets is computationally easy. We prove instances of this general claim with respect to the various clusterability notions that we discuss. Finally, we investigate how hard it is to determine the clusterability value of a given data set. In most cases, it turns out that this is an NP-hard problem.
On Centroidal Voronoi Tessellation - Energy Smoothness and Fast Computation
ACM Trans. Graph, 2009
Cited by 5 (0 self)
Centroidal Voronoi tessellation (CVT) is a particular type of Voronoi tessellation that has many applications in computational sciences and engineering, including computer graphics. The prevailing method for computing CVT is Lloyd’s method, which has linear convergence and is inefficient in practice. We develop new efficient methods for CVT computation and demonstrate the fast convergence of these methods. Specifically, we show that the CVT energy function has 2nd order smoothness for convex domains with smooth density, as well as in most situations encountered in optimization. Due to the 2nd order smoothness, it is possible to minimize the CVT energy functions using Newton-like optimization methods and expect fast convergence. We propose a quasi-Newton method to compute CVT and demonstrate its faster convergence than Lloyd’s method with various numerical examples. It is also significantly faster and more robust than the Lloyd-Newton method, a previous attempt to accelerate CVT. We also demonstrate surface remeshing as a possible application.
Secure two-party k-means clustering
In CCS ’07: Proceedings of the 14th ACM conference on Computer and communications security, 2007
Cited by 4 (0 self)
The k-Means Clustering problem is one of the most-explored problems in data mining to date. With the advent of protocols that have proven to be successful in performing single database clustering, the focus has changed in recent years to the question of how to extend the single database protocols to a multiple database setting. To date there have been numerous attempts to create specific multiparty k-means clustering protocols that protect the privacy of each database, but according to the standard cryptographic definitions of “privacy-protection,” so far all such attempts have fallen short of providing adequate privacy. In this paper we describe a Two-Party k-Means Clustering Protocol that guarantees privacy, and is more efficient than utilizing a general multiparty “compiler” to achieve the same task. In particular, a main contribution of our result is a way to compute efficiently multiple iterations of k-means clustering without revealing the intermediate values. To achieve this, we use novel techniques to perform two-party division and sample uniformly at random from an unknown domain size. Our techniques are quite general and can be realized based on the existence of any semantically secure homomorphic encryption scheme. For concreteness, we describe our protocol based on the Paillier Homomorphic Encryption scheme (see [23]). We will also demonstrate that our protocol is efficient in terms of communication, remaining competitive with existing protocols (such as [15]) that fail to protect privacy.
Stability yields a PTAS for k-Median and k-Means Clustering
2010
Cited by 4 (3 self)
We consider k-median clustering in finite metric spaces and k-means clustering in Euclidean spaces, in the setting where k is part of the input (not a constant). For the k-means problem, Ostrovsky et al. [ORSS06] show that if the input satisfies the condition that the optimal (k − 1)-means clustering is more expensive than the optimal k-means clustering by a factor of $\max\{100, 1/\alpha^2\}$, then one can achieve a (1 + f(α))-approximation to the k-means optimal in time polynomial in n and k by using a variant of Lloyd’s algorithm. In this work we substantially improve this approximation guarantee. We show that given only the condition that the (k − 1)-means optimal is more expensive than the k-means optimal by a factor 1 + α for some constant α > 0, we can obtain a PTAS. In particular, under this assumption, for any ε > 0 we achieve a (1 + ε)-approximation to the k-means optimal in time polynomial in n and k, and exponential in 1/ε and 1/α. We thus decouple the strength of the assumption from the quality of the approximation ratio. We also give a PTAS for the k-median problem in finite metrics under the analogous assumption. For k-means, we in addition give a randomized algorithm with improved running time of $n^{O(1)} (k \log n)^{\mathrm{poly}(1/\epsilon, 1/\alpha)}$. Our technique also obtains a PTAS under the assumption of Balcan et al. [BBG09] that all (1 + α) approximations are δ-close to a desired target clustering, when all target clusters have size greater than 2δn and α > 0 is constant. Note that the motivation of [BBG09] is that the true goal in clustering is often to get the points right, with objective values serving just as a proxy, and [BBG09] already get O(δ/α)-close for general α and arbitrary target cluster sizes. So the primary advance here is in further elucidating the approximation implications and in formally relating the assumptions. In particular, both results are based on a new notion of clustering stability that extends both the notions of [ORSS06] and of [BBG09].
Center-based clustering under perturbation stability
Cited by 3 (2 self)
Optimal clustering under most popular objective functions is NP-hard, and therefore unlikely to be efficiently solvable in the worst case. Recently, Bilu and Linial [11] suggested an approach aimed instead at understanding the complexity of clustering instances which arise in practice. They argue that such instances should be stable to perturbations in the metric space and give an efficient algorithm for clustering instances which are stable to perturbations of size $O(n^{1/2})$ for Max-Cut based clustering. In addition, they conjecture that instances stable to as little as O(1) perturbations should be solvable in polynomial time. In this paper we prove that this conjecture is true for any center-based clustering objective (such as k-median, k-means, and k-center). I.e., we can efficiently find the optimal clustering assuming only stability to constant-magnitude perturbations of the underlying metric. Specifically, we show that for center-based clustering instances which are stable to O(1) perturbations, the popular Single-Linkage algorithm combined with dynamic programming will find the optimal clustering. Keywords: Clustering, k-median, k-means, Stability Conditions

