Results 1 - 10 of 35
Incremental Clustering and Dynamic Information Retrieval, 1997
Cited by 129 (3 self)
Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. We propose a model called incremental clustering, which is based on a careful analysis of the requirements of the information retrieval application and which should also be useful in other applications. The goal is to efficiently maintain clusters of small diameter as new points are inserted. We analyze several natural greedy algorithms and demonstrate that they perform poorly. We propose new deterministic and randomized incremental clustering algorithms with provably good performance. We complement our positive results with lower bounds on the performance of incremental algorithms. Finally, we consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters.
Geometric approximation via coresets
- Combinatorial and Computational Geometry, MSRI, 2005
Cited by 47 (7 self)
The paradigm of coresets has recently emerged as a powerful tool for efficiently approximating various extent measures of a point set P. Using this paradigm, one quickly computes a small subset Q of P, called a coreset, that approximates the original set P, and then solves the problem on Q using a relatively inefficient algorithm. The solution for Q is then translated to an approximate solution to the original point set P. This paper describes the ways in which this paradigm has been successfully applied to various optimization and extent measure problems.
Matrix approximation and projective clustering via volume sampling
- In SODA, 2006
Cited by 46 (2 self)
We present two new results for the problem of approximating a given real $m \times n$ matrix $A$ by a rank-$k$ matrix $D$, where $k < \min\{m, n\}$, so as to minimize $\|A - D\|_F^2$. It is known that by sampling $O(k/\varepsilon)$ rows of the matrix, one can find a low-rank approximation with additive error $\varepsilon \|A\|_F^2$. Our first result shows that with adaptive sampling in $t$ rounds and $O(k/\varepsilon)$ samples in each round, the additive error drops exponentially as $\varepsilon^t$; the computation time is nearly linear in the number of nonzero entries. This demonstrates that multiple passes can be highly beneficial for a natural (and widely studied) algorithmic problem. Our second result is that there exists a subset of $O(k^2/\varepsilon)$ rows such that their span contains a rank-$k$ approximation with multiplicative $(1 + \varepsilon)$ error (i.e., the sum of squares distance has a small "core-set" whose span determines a good approximation). This existence theorem leads to a PTAS for the following projective clustering problem: Given a set of points $P$ in $\mathbb{R}^d$, and integers $k$, $j$, find a set of $j$ subspaces $F_1, \ldots, F_j$, each of dimension at most $k$, that minimize $\sum_{p \in P} \min_i d(p, F_i)^2$.
Computing communities in large networks using random walks
- J. of Graph Alg. and App., 2004
Cited by 43 (1 self)
Dense subgraphs of sparse graphs (communities), which appear in most real-world complex networks, play an important role in many contexts. Computing them, however, is generally expensive. We propose here a measure of similarities between vertices based on random walks which has several important advantages: it captures well the community structure in a network, it can be computed efficiently, and it can be used in an agglomerative algorithm to compute efficiently the community structure of a network. We propose such an algorithm, called Walktrap, which runs in time $O(mn^2)$ and space $O(n^2)$ in the worst case, and in time $O(n^2 \log n)$ and space $O(n^2)$ in most real-world cases ($n$ and $m$ are respectively the number of vertices and edges in the input graph). Extensive comparison tests show that our algorithm surpasses previously proposed ones concerning the quality of the obtained community structures and that it stands among the best ones concerning the running time.
Coresets for k-Means and k-Median Clustering and their Applications
- In Proc. 36th Annu. ACM Sympos. Theory Comput., 2003
Cited by 41 (13 self)
In this paper, we show the existence of small coresets for the problems of computing k-median and k-means clustering for points in low dimension. In other words, we show that given a point set $P$ in $\mathbb{R}^d$, one can compute a weighted set $S \subseteq P$, of size $O(k \varepsilon^{-d} \log n)$, such that one can compute the k-median/means clustering on $S$ instead of on $P$, and get a $(1 + \varepsilon)$-approximation.
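To make the role of the weighted set concrete, here is a minimal Python sketch of how a k-means cost could be evaluated on a weighted point set instead of on the full input. It does not implement the coreset construction from the paper; the function name `kmeans_cost` and the toy data are assumptions introduced only for illustration, and the "coreset" in the example is the trivial one obtained by merging duplicate points.

```python
def kmeans_cost(points, centers, weights=None):
    """Weighted k-means cost: sum of w * (squared distance to nearest center).

    points  -- list of coordinate tuples
    centers -- list of coordinate tuples (the candidate k centers)
    weights -- optional list of non-negative weights, one per point
               (defaults to weight 1 for every point)
    """
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for p, w in zip(points, weights):
        # Squared Euclidean distance from p to its nearest center.
        nearest_sq = min(
            sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers
        )
        total += w * nearest_sq
    return total


# Hypothetical toy data: P contains duplicates, S merges them with weights.
P = [(0, 0), (0, 0), (4, 0), (4, 0)]
S = [(0, 0), (4, 0)]
S_weights = [2.0, 2.0]
candidate_centers = [(1, 0), (4, 1)]

cost_on_P = kmeans_cost(P, candidate_centers)
cost_on_S = kmeans_cost(S, candidate_centers, S_weights)
```

In this toy case the two costs agree exactly because S only merges duplicates; a coreset in the sense of the abstract would instead guarantee agreement up to a $(1 + \varepsilon)$ factor.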
The effectiveness of Lloyd-type methods for the k-means problem
- In 47th IEEE Symposium on the Foundations of Computer Science (FOCS), 2006
Cited by 32 (3 self)
We investigate variants of Lloyd’s heuristic for clustering high dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd’s heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd’s heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd’s method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.
Approximate Clustering without the Approximation
Cited by 22 (14 self)
Approximation algorithms for clustering points in metric spaces is a flourishing area of research, with much research effort spent on getting a better understanding of the approximation guarantees possible for many objective functions such as k-median, k-means, and min-sum clustering. This quest for better approximation algorithms is further fueled by the implicit hope that these better approximations also give us more accurate clusterings. E.g., for many problems such as clustering proteins by function, or clustering images by subject, there is some unknown “correct” target clustering and the implicit hope is that approximately optimizing these objective functions will in fact produce a clustering that is close (in symmetric difference) to the truth. In this paper, we show that if we make this implicit assumption explicit (that is, if we assume that any c-approximation to the given clustering objective F is ε-close to the target), then we can produce clusterings that are O(ε)-close to the target, even for values c for which obtaining a c-approximation is NP-hard. In particular, for k-median and k-means objectives, we show that we can achieve this guarantee for any constant c > 1, and for min-sum objective we can do this for any constant c > 2. Our results also highlight a somewhat surprising conceptual difference between assuming that the optimal solution to, say, the k-median objective is ε-close to the target, and assuming that any approximately optimal solution is ε-close to the target, even for approximation factor say c = 1.01. In the former case, the problem of finding a solution that is O(ε)-close to the target remains computationally hard, and yet for the latter we have an efficient algorithm.
How fast is the k-means method
- Algorithmica, 2005
Cited by 20 (0 self)
We present polynomial upper and lower bounds on the number of iterations performed by the k-means method (a.k.a. Lloyd’s method) for k-means clustering. Our upper bounds are polynomial in the number of points, number of clusters, and the spread of the point set. We also present a lower bound, showing that in the worst case the k-means heuristic needs to perform Ω(n) iterations, for n points on the real line and two centers. Surprisingly, the spread of the point set in this construction is polynomial. This is the first construction showing that the k-means heuristic requires more than a polylogarithmic number of iterations. Furthermore, we present two alternative algorithms, with guaranteed performance, which are simple variants of the k-means method. Results of our experimental studies on these algorithms are also presented.
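For readers unfamiliar with the quantity being bounded, the following is a minimal Python sketch of the standard k-means (Lloyd's) iteration on points on the real line, instrumented to count iterations. It is the textbook method, not the paper's lower-bound construction or either of its alternative algorithms; the function name `lloyd_1d`, the `max_iter` cap, and the sample data are assumptions made for illustration.

```python
def lloyd_1d(points, centers, max_iter=1000):
    """Run Lloyd's method on 1-D points.

    Returns (final_centers, iterations), where an iteration is one
    assignment step followed by one center-update step, and the final
    iteration is the one that detects no change.
    """
    centers = list(centers)
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point joins its nearest center
        # (ties go to the lower-indexed center).
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            return centers, iteration
        centers = new_centers
    return centers, max_iter


# Hypothetical input: two well-separated groups and a poor initialization.
final_centers, iterations = lloyd_1d([0, 1, 2, 10, 11, 12], [0, 1])
```

On this easy input the method settles quickly; the abstract's lower bound concerns carefully constructed inputs on which the iteration count grows linearly in n.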
Smaller coresets for k-median and k-means clustering, 2005
Cited by 18 (5 self)
In this paper, we show that there exists a (k, ε)-coreset for k-median and k-means clustering of n points in $\mathbb{R}^d$, which is of size independent of n. In particular, we construct a (k, ε)-coreset of size $O(k^2/\varepsilon^d)$ for k-median clustering, and of size $O(k^3/\varepsilon^{d+1})$ for k-means clustering.
Correlation clustering with a fixed number of clusters
- Theory of Computing, 2006
Cited by 16 (0 self)
We continue the investigation of problems concerning correlation clustering or clustering with qualitative information, which is a clustering formulation that has been studied recently (Bansal, Blum, Chawla (2004), Charikar, Guruswami, Wirth (FOCS’03), Charikar, Wirth (FOCS’04), Alon et al. (STOC’05)). In this problem, we are given a complete graph on n nodes (which correspond to nodes to be clustered) whose edges are labeled + (for similar pairs of items) and − (for dissimilar pairs of items). Thus our input consists of only qualitative information on similarity and no quantitative distance measure between items. The quality of a clustering is measured in terms of its number of agreements, which is simply the number of edges it correctly classifies, that is, the sum of the number of − edges whose endpoints it places in different clusters plus the number of + edges both of whose endpoints it places within the same cluster. In this paper, we study the problem of finding clusterings that maximize the number of agreements, and the complementary minimization version where we seek clusterings that minimize the number of disagreements. We focus on the situation when the number of clusters is stipulated to be a small constant k. Our main result is that for every k, there is a polynomial time approximation scheme for both maximizing agreements and minimizing disagreements.
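The agreement and disagreement counts defined in this abstract can be sketched in a few lines of Python. This only evaluates a given clustering; it is not the approximation scheme from the paper. The function names, the dictionary-based input encoding, and the four-node example are all assumptions chosen for illustration.

```python
from itertools import combinations


def count_agreements(edge_labels, cluster_of):
    """Count the edges a clustering classifies correctly.

    edge_labels -- dict mapping frozenset({u, v}) to '+' or '-'
                   for every pair of nodes (a complete graph)
    cluster_of  -- dict mapping each node to its cluster id
    """
    agreements = 0
    for u, v in combinations(sorted(cluster_of), 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        sign = edge_labels[frozenset((u, v))]
        # A '+' edge agrees when both endpoints share a cluster;
        # a '-' edge agrees when the endpoints are separated.
        if (sign == '+' and same_cluster) or (sign == '-' and not same_cluster):
            agreements += 1
    return agreements


def count_disagreements(edge_labels, cluster_of):
    """Disagreements are the remaining edges of the complete graph."""
    n = len(cluster_of)
    return n * (n - 1) // 2 - count_agreements(edge_labels, cluster_of)


# Hypothetical instance on four nodes with k = 2 clusters.
example_labels = {
    frozenset(('a', 'b')): '+',
    frozenset(('c', 'd')): '+',
    frozenset(('a', 'c')): '-',
    frozenset(('a', 'd')): '-',
    frozenset(('b', 'c')): '-',
    frozenset(('b', 'd')): '+',
}
example_clustering = {'a': 0, 'b': 0, 'c': 1, 'd': 1}

agreements = count_agreements(example_labels, example_clustering)
disagreements = count_disagreements(example_labels, example_clustering)
```

Here the only misclassified edge is the + edge between b and d, which the clustering splits across clusters, so maximizing agreements and minimizing disagreements pick out the same clusterings even though their approximation behavior differs.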

