Results 1 – 9 of 9
Approximate Clustering without the Approximation
Abstract

Cited by 55 (19 self)
Approximation algorithms for clustering points in metric spaces are a flourishing area of research, with much effort spent on better understanding the approximation guarantees possible for objective functions such as k-median, k-means, and min-sum clustering. This quest for better approximation algorithms is further fueled by the implicit hope that better approximations also give more accurate clusterings. E.g., for many problems such as clustering proteins by function, or clustering images by subject, there is some unknown "correct" target clustering, and the implicit hope is that approximately optimizing these objective functions will in fact produce a clustering that is close (in symmetric difference) to the truth. In this paper, we show that if we make this implicit assumption explicit, that is, if we assume that any c-approximation to the given clustering objective F is ε-close to the target, then we can produce clusterings that are O(ε)-close to the target, even for values of c for which obtaining a c-approximation is NP-hard. In particular, for the k-median and k-means objectives, we show that we can achieve this guarantee for any constant c > 1, and for the min-sum objective we can do so for any constant c > 2. Our results also highlight a somewhat surprising conceptual difference between assuming that the optimal solution to, say, the k-median objective is ε-close to the target and assuming that every approximately optimal solution is ε-close to the target, even for an approximation factor as small as c = 1.01. In the former case, the problem of finding a solution that is O(ε)-close to the target remains computationally hard, yet in the latter we have an efficient algorithm.
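To make the stability condition concrete, here is a toy Python sketch of our own (not code from the paper): on a tiny 1-D instance we enumerate all choices of k centers by brute force, and check that every set of centers whose k-median cost is within a factor c of optimal induces a clustering that agrees with the target on every point. The instance and all function names are invented for illustration.

```python
# Toy illustration of (c, eps)-approximation-stability for k-median.
from itertools import combinations

def kmedian_cost(points, centers):
    """Sum of distances from each point to its nearest center (1-D metric)."""
    return sum(min(abs(p - c) for c in centers) for p in points)

def assignment(points, centers):
    """Label each point by the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            for p in points]

def fraction_misclustered(labels_a, labels_b):
    """Fraction of points whose labels differ (label renaming is ignored,
    which is fine on this tiny, well-behaved example)."""
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)

points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]   # two well-separated groups
target = assignment(points, [0.1, 10.1])      # the "true" clustering
k, c = 2, 1.5

opt = min(kmedian_cost(points, cs) for cs in combinations(points, k))
# The stability assumption made explicit: every c-approximate choice of
# centers induces a clustering close to the target.
worst = max(
    fraction_misclustered(assignment(points, cs), target)
    for cs in combinations(points, k)
    if kmedian_cost(points, cs) <= c * opt
)
print(worst)  # 0.0 on this well-separated instance
```

On instances without such separation, `worst` can be large even for c close to 1, which is exactly the situation the paper's assumption rules out.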
Clustering Partially Observed Graphs via Convex Optimization
Abstract

Cited by 42 (12 self)
This paper considers the problem of clustering a partially observed unweighted graph, i.e., one where for some node pairs we know there is an edge between them, for some others we know there is no edge, and for the remaining pairs we do not know whether or not there is an edge. We want to organize the nodes into disjoint clusters so that connectivity is relatively dense (among observed pairs) within clusters and sparse across clusters. We take a novel yet natural approach to this problem, focusing on finding the clustering that minimizes the number of "disagreements", i.e., the sum of the number of (observed) missing edges within clusters and (observed) present edges across clusters. Our algorithm uses convex optimization; its basis is a reduction of disagreement minimization to the problem of recovering an (unknown) low-rank matrix and an (unknown) sparse matrix from their partially observed sum. We show that our algorithm succeeds under certain natural assumptions on the optimal clustering and its disagreements. Our results significantly strengthen existing matrix-splitting results for the special case of our clustering problem, and directly enhance solutions to the problem of Correlation Clustering (Bansal et al., 2002) with partial observations.
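The combinatorial objective being relaxed can be stated in a few lines. The sketch below is our own illustration of the disagreement count, not the paper's convex-optimization algorithm: observations are encoded as +1 (observed edge), -1 (observed non-edge), and 0 (unobserved).

```python
# Toy illustration of the "disagreements" objective on a partially
# observed graph; the paper minimizes this via a convex relaxation.
def disagreements(obs, labels):
    """Observed missing edges within clusters plus observed present edges
    across clusters, for a candidate labeling."""
    n = len(labels)
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            same = labels[i] == labels[j]
            if obs[i][j] == -1 and same:        # missing edge inside a cluster
                count += 1
            elif obs[i][j] == 1 and not same:   # present edge across clusters
                count += 1
    return count

# pair (0,1) observed present, pair (0,2) observed absent, (1,2) unobserved
obs = [[0, 1, -1],
       [1, 0, 0],
       [-1, 0, 0]]
print(disagreements(obs, [0, 0, 1]))  # 0: the labeling matches both observations
```

Note that unobserved pairs contribute nothing, which is what distinguishes this setting from fully observed correlation clustering.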
Stability yields a PTAS for k-Median and k-Means Clustering
, 2010
Abstract

Cited by 18 (8 self)
We consider k-median clustering in finite metric spaces and k-means clustering in Euclidean spaces, in the setting where k is part of the input (not a constant). For the k-means problem, Ostrovsky et al. [ORSS06] show that if the input satisfies the condition that the optimal (k−1)-means clustering is more expensive than the optimal k-means clustering by a factor of max{100, 1/α²}, then one can achieve a (1+f(α))-approximation to the k-means optimum in time polynomial in n and k by using a variant of Lloyd's algorithm. In this work we substantially improve this approximation guarantee. We show that, given only the condition that the (k−1)-means optimum is more expensive than the k-means optimum by a factor 1+α for some constant α > 0, we can obtain a PTAS. In particular, under this assumption, for any ε > 0 we achieve a (1+ε)-approximation to the k-means optimum in time polynomial in n and k, and exponential in 1/ε and 1/α. We thus decouple the strength of the assumption from the quality of the approximation ratio. We also give a PTAS for the k-median problem in finite metrics under the analogous assumption. For k-means, we additionally give a randomized algorithm with improved running time n^{O(1)} · (k log n)^{poly(1/ε, 1/α)}. Our technique also yields a PTAS under the assumption of Balcan et al. [BBG09] that all (1+α)-approximations are δ-close to a desired target clustering, when all target clusters have size greater than 2δn and α > 0 is constant. Note that the motivation of [BBG09] is that the true goal in clustering is often to get the points right, with objective values serving just as a proxy, and [BBG09] already get O(δ/α)-close for general α and arbitrary target cluster sizes. So the primary advance here is in further elucidating the approximation implications and in formally relating the assumptions. In particular, both results are based on a new notion of clustering stability that extends both the notion of [ORSS06] and that of [BBG09].
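For readers who want the algorithmic backdrop made concrete, a minimal vanilla Lloyd iteration in numpy looks like the sketch below. This is our own background illustration, not the [ORSS06] variant the abstract refers to, and the toy data is invented.

```python
# Minimal Lloyd's k-means iteration: alternate nearest-center assignment
# and center recomputation.
import numpy as np

def lloyd(points, centers, iters=10):
    labels = None
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean)
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its cluster
        # (assumes no cluster becomes empty, true on this toy input)
        centers = np.array([points[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return centers, labels

pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
centers, labels = lloyd(pts, pts[[0, 2]].copy())
```

Lloyd's heuristic only converges to a local optimum in general; the stability condition above is what lets the paper turn such local-search ideas into a true PTAS.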
Clustering under Approximation Stability
, 2009
Abstract

Cited by 7 (3 self)
A common approach to clustering data is to view data objects as points in a metric space, and then to optimize a natural distance-based objective such as the k-median, k-means, or min-sum score. For applications such as clustering proteins by function or clustering images by subject, the implicit hope in taking this approach is that the optimal solution to the chosen objective will closely match the desired "target" clustering (e.g., a correct clustering of proteins by function or of images by who is in them). However, most distance-based objectives, including those above, are NP-hard to optimize. So, assuming P ≠ NP, this assumption by itself is not sufficient to achieve low-error clusterings via polynomial-time algorithms. In this paper, we show that we can bypass this barrier if we slightly extend this assumption to ask that, for some small constant c, not only the optimal solution but also all c-approximations to the optimal solution differ from the target on at most some ε fraction of points; we call this (c, ε)-approximation-stability. We show that under this condition it is possible to efficiently obtain low-error clusterings even if the property holds only for values c for which the objective is known to be NP-hard to approximate. Specifically, for any constant c > 1, (c, ε)-approximation-stability of the k-median or k-means objective can be used to efficiently produce a clustering of error O(ε), as …
Min-sum clustering of protein sequences with limited distance information
 In Proc. of the 1st International Workshop on Similarity-Based Pattern Analysis and Recognition (SIMBAD)
, 2011
Abstract

Cited by 2 (1 self)
We study the problem of efficiently clustering protein sequences in a limited-information setting. We assume that we do not know the distances between the sequences in advance, and must query them during the execution of the algorithm. Our goal is to find an accurate clustering using few queries. We model the problem as a point set S with an unknown metric d on S, and assume that we have access to one-versus-all distance queries that, given a point s ∈ S, return the distances between s and all other points. Our one-versus-all query represents an efficient sequence-database search program such as BLAST, which compares an input sequence to an entire data set. Given a natural assumption about the approximation stability of the min-sum objective function for clustering, we design a provably accurate clustering algorithm that uses few one-versus-all queries. In our empirical study we show that our method compares favorably to well-established clustering algorithms when we compare computationally derived clusterings to gold-standard manual classifications.
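The query model can be sketched in a few lines. The code below is our own toy stand-in, not the paper's algorithm: `make_one_vs_all` simulates the BLAST-like oracle over a hidden 1-D metric, and `landmark_cluster` picks landmarks by farthest-first traversal, issuing exactly k one-versus-all queries before assigning every point to its nearest landmark.

```python
# Toy illustration of clustering with few one-versus-all distance queries.
def make_one_vs_all(points):
    """Simulate the query oracle over a hidden 1-D metric: query(i) returns
    the distances from point i to every point (like a BLAST search)."""
    def query(i):
        return [abs(points[i] - p) for p in points]
    return query

def landmark_cluster(n, query, k):
    """Farthest-first landmarks: each new landmark is the point farthest
    from those chosen so far. Uses only k one-versus-all queries."""
    dists = [query(0)]                      # start from point 0
    while len(dists) < k:
        far = max(range(n), key=lambda i: min(d[i] for d in dists))
        dists.append(query(far))
    # assign every point to its nearest landmark
    return [min(range(k), key=lambda j: dists[j][i]) for i in range(n)]

points = [0.0, 0.2, 0.1, 5.0, 5.1, 5.2]    # two hidden groups
labels = landmark_cluster(len(points), make_one_vs_all(points), k=2)
```

On this separated toy input the two hidden groups are recovered; the paper's contribution is showing that approximation stability of the min-sum objective makes a comparably query-light strategy provably accurate on realistic inputs.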
Weighted Graph Clustering with Non-Uniform Uncertainties
Abstract

Cited by 1 (1 self)
We study the graph clustering problem where each observation (edge or no-edge between a pair of nodes) may have a different level of confidence/uncertainty. We propose a clustering algorithm that is based on optimizing an appropriate weighted objective, where larger weights are given to observations with lower uncertainty. Our approach leads to a convex optimization problem that is efficiently solvable. We analyze our approach under a natural generative model, and establish theoretical guarantees for recovering the underlying clusters. Our main result is a general theorem that applies to any given weighting and any distribution for the uncertainty. By optimizing over the weights, we derive a provably optimal weighting scheme, which matches the information-theoretic lower bound up to logarithmic factors and leads to strong performance bounds in several specific settings. By optimizing over the uncertainty distribution, we show that non-uniform uncertainties can actually help. In particular, if the graph is built by spending a limited amount of resource to take a measurement on each node pair, then it is beneficial to allocate the resource in a non-uniform fashion: obtaining accurate measurements on a few pairs of nodes is better than obtaining inaccurate measurements on many pairs. We provide simulation results that validate our theoretical findings.
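The weighted objective can be illustrated directly. This sketch is our own notation (the paper optimizes a convex relaxation, not this combinatorial form): each observed pair carries a weight, larger when the measurement is more certain, and a clustering pays that weight whenever it contradicts the observation.

```python
# Toy illustration of a weighted disagreement objective for graph clustering.
def weighted_disagreements(obs, weights, labels):
    """Sum of weights over observed pairs where the clustering contradicts
    the observation (+1 = edge, -1 = no-edge)."""
    total = 0.0
    for (i, j), o in obs.items():
        same = labels[i] == labels[j]
        if (o == 1 and not same) or (o == -1 and same):
            total += weights[(i, j)]
    return total

obs     = {(0, 1): 1, (0, 2): -1, (1, 2): 1}
weights = {(0, 1): 2.0, (0, 2): 0.5, (1, 2): 1.0}   # higher = more certain
print(weighted_disagreements(obs, weights, [0, 0, 1]))  # only (1,2) disagrees -> 1.0
```

With uniform weights this reduces to ordinary disagreement minimization; the paper's point is that choosing the weights as a function of the per-pair uncertainty provably improves recovery.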
Why Do We Want a Good Ratio Anyway? Approximation Stability and Proxy Objectives
Abstract
When real-world problems are abstracted as optimization problems, it is often the case that the formal objective used in the optimization problem serves as a proxy for some other underlying goal. For example, given a clustering problem such as clustering proteins by function, we might represent our data (protein sequences) in some natural way as points in a metric space, and then abstract this as a k-median problem where we aim to find k clusters, along with a center point for each cluster, such that the sum of distances of the data points to their cluster centers is minimized. Here, the k-median objective is serving as a proxy. Our hope is that we have represented the data in such a way that a good k-median solution will translate to a good solution to the true goal (having clusters that actually match the proteins' functions). In the field of approximation algorithms, we typically ignore this latter aspect and do not aim to model it: instead, we consider the optimization problem as given, and aim to understand the best worst-case approximation ratios achievable. Here, we consider: what happens if we incorporate the connection between the two objectives into the theory? That is, suppose that the reason we want a good approximation ratio is that we believe our instance satisfies the promise that a good approximation to the formal objective (e.g., k-median) will imply a near-optimal solution to our …
Beyond Worst-Case Analysis in Privacy and Clustering: Exploiting Explicit and Implicit Assumptions
, 2013
Center Based Clustering: A Foundational Perspective
, 2013
Abstract
In the first part of this chapter we present existing work on center-based clustering methods. In particular, we focus on k-means and k-median clustering, which are two of the most widely used clustering objectives. We describe popular heuristics for these methods and the theoretical guarantees associated with them. We also describe how to design worst-case approximately optimal algorithms for these problems. In the second part of the chapter we describe recent work on how to improve on these worst-case algorithms even further by using insights from the nature of real-world clustering problems and data sets. Finally, we also summarize theoretical work on clustering data generated from mixture models such as a mixture of Gaussians.

1 Approximation algorithms for k-means and k-median

One of the most popular approaches to clustering is to define an objective function over the data points and find a partitioning which achieves the optimal solution, or an approximately optimal solution, to the given objective function. Common objective functions include center-based objectives such as k-median and k-means, where one selects k center points and the clustering is obtained by assigning each data point to its closest center point. In …
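One of the popular heuristics alluded to here is D²-style ("k-means++"-like) seeding, where each new center is sampled with probability proportional to its squared distance from the centers chosen so far. Below is a minimal 1-D sketch of our own; for reproducibility it deterministically takes the first point as the initial center, unlike k-means++ proper, which picks the first center uniformly at random.

```python
# D^2-style seeding sketch (simplified: the first center is fixed,
# not chosen uniformly at random as in k-means++ proper).
import random

def d2_seed(points, k, rng):
    """Sample each new center with probability proportional to squared
    distance from the nearest center chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        r, acc = rng.random() * sum(d2), 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:          # inverse-CDF sampling over the d2 weights
                centers.append(p)
                break
    return centers

pts = [0.0, 0.1, 9.9, 10.0]       # two tight groups far apart
centers = d2_seed(pts, 2, random.Random(1))
# the squared-distance bias makes the second center land in the far group
```

Seeding of this kind is what gives the well-known O(log k)-approximation guarantee in expectation for k-means, and it is the usual starting point before running Lloyd-style refinement.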