Results 1–5 of 5
Clustering under Approximation Stability
, 2009
Abstract

Cited by 7 (3 self)
A common approach to clustering data is to view data objects as points in a metric space, and then to optimize a natural distance-based objective such as the k-median, k-means, or min-sum score. For applications such as clustering proteins by function or clustering images by subject, the implicit hope in taking this approach is that the optimal solution to the chosen objective will closely match the desired “target” clustering (e.g., a correct clustering of proteins by function or of images by who is in them). However, most distance-based objectives, including those above, are NP-hard to optimize. So, this assumption by itself is not sufficient, assuming P ≠ NP, to achieve low-error clusterings via polynomial-time algorithms. In this paper, we show that we can bypass this barrier if we slightly extend this assumption to ask that for some small constant c, not only the optimal solution, but also all c-approximations to the optimal solution, differ from the target on at most some ε fraction of points; we call this (c, ε)-approximation-stability. We show that under this condition, it is possible to efficiently obtain low-error clusterings even if the property holds only for values c for which the objective is known to be NP-hard to approximate. Specifically, for any constant c > 1, (c, ε)-approximation-stability of k-median or k-means objectives can be used to efficiently produce a clustering of error O(ε), as …
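As a concrete illustration (not from the paper itself), the objectives and the error notion used above can be sketched in a few lines of Python; the function names are hypothetical, and the error measure is the usual fraction of misclustered points minimized over label matchings:

```python
from itertools import permutations

def kmedian_cost(points, centers, dist):
    """k-median objective: sum over points of distance to nearest center."""
    return sum(min(dist(x, c) for c in centers) for x in points)

def kmeans_cost(points, centers, dist):
    """k-means objective: sum over points of squared distance to nearest center."""
    return sum(min(dist(x, c) for c in centers) ** 2 for x in points)

def clustering_error(labels_a, labels_b, k):
    """Fraction of points on which two k-clusterings disagree, minimized
    over all matchings of cluster labels (brute force over permutations,
    fine for small k)."""
    n = len(labels_a)
    best = n
    for perm in permutations(range(k)):
        mismatches = sum(1 for a, b in zip(labels_a, labels_b) if perm[a] != b)
        best = min(best, mismatches)
    return best / n
```

In these terms, (c, ε)-approximation-stability says that any centers whose `kmedian_cost` (or `kmeans_cost`) is within a factor c of optimal induce a clustering whose `clustering_error` against the target is at most ε.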
On Lloyd’s algorithm: new theoretical insights for clustering in practice
Abstract
A paradox for “k-means clustering.” The k-means objective φ of C = {c_i, i ∈ [k]} on a dataset X is φ_X(C) = Σ_{x∈X} ‖x − C(x)‖², where C(x) = argmin_{c∈C} ‖x − c‖. Even though approximation algorithms exist, they are rarely used in applications. Instead, a few heuristics, most notably Lloyd’s algorithm, are preferred and often successful in practice. Lloyd’s algorithm (a.k.a. the “k-means” algorithm). Input: dataset X with |X| = n; number of clusters k; sample size m, m > k.
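The heuristic named above is the classic alternating procedure: assign each point to its nearest center, then move each center to the mean of its assigned points. A minimal pure-Python sketch (plain Lloyd iteration with random-point initialization, not the sampled variant the abstract's parameter m suggests):

```python
import random

def lloyd(points, k, iters=100, seed=0):
    """Lloyd's algorithm for k-means on points in R^d (given as tuples):
    alternate between assigning each point to its nearest center and
    moving each center to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize with k random data points
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k),
                    key=lambda j: sum((xi - ci) ** 2
                                      for xi, ci in zip(x, centers[j])))
            clusters[j].append(x)
        # update step: mean of each nonempty cluster
        new_centers = [
            tuple(sum(coords) / len(cl) for coords in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # reached a fixed point
            break
        centers = new_centers
    return centers
```

Each iteration can only decrease φ_X(C), which is why the procedure converges, though possibly to a local optimum; that gap between practical success and weak worst-case guarantees is the paradox the paper addresses.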
Clustering under Perturbation Resilience
, 2012
Abstract
Recently, Bilu and Linial [8] formalized an implicit assumption often made when choosing a clustering objective: that the optimum clustering to the objective should be preserved under small multiplicative perturbations to distances between points. They showed that for max-cut clustering it is possible to circumvent NP-hardness and obtain polynomial-time algorithms for instances resilient to large (factor O(√n)) perturbations, and subsequently Awasthi et al. [2] considered center-based objectives, giving algorithms for instances resilient to O(1)-factor perturbations. In this paper, we greatly advance this line of work. For the k-median objective, we present an algorithm that can optimally cluster instances resilient to (1 + √2)-factor perturbations, solving an open problem of Awasthi et al. [2]. We additionally give algorithms for a more relaxed assumption in which we allow the optimal solution to change on a small ε fraction of the points after perturbation. We give the first bounds known for this more realistic and more general setting. We also provide positive results for min-sum clustering, which is generally a much harder objective than k-median (and also non-center-based). Our algorithms are based on new linkage criteria that may be of independent interest. Additionally, we give sublinear-time algorithms that can return an implicit clustering from only access to a small random sample.
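For context on the "linkage criteria" family the abstract refers to, the classic template is single-linkage agglomerative clustering: repeatedly merge the two clusters whose closest pair of points is nearest. The paper's own criteria are more refined; this is only the standard baseline, sketched here as an assumption-labeled illustration:

```python
def single_linkage(points, k, dist):
    """Standard single-linkage agglomerative clustering: start with each
    point as its own cluster and repeatedly merge the two clusters whose
    closest pair of points is nearest, until k clusters remain.
    (Only the classic template, not the paper's linkage criteria.)"""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with smallest single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The naive double loop above is O(n³) per run; practical implementations maintain a priority queue of inter-cluster distances instead.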