## Polynomial Time Approximation Schemes for Geometric k-Clustering (2001)

### Download Links

- [www.argreenhouse.com]
- [www.cs.technion.ac.il]
- DBLP

### Other Repositories/Bibliography

Venue: J. of the ACM

Citations: 32 (5 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Ostrovsky01polynomialtime,
  author    = {Rafail Ostrovsky and Yuval Rabani},
  title     = {Polynomial Time Approximation Schemes for Geometric k-Clustering},
  booktitle = {J. OF THE ACM},
  year      = {2001},
  pages     = {349--358},
  publisher = {IEEE}
}
```

### Abstract

The Johnson-Lindenstrauss lemma states that n points in a high dimensional Hilbert space can be embedded with small distortion of the distances into an O(log n) dimensional space by applying a random linear transformation. We show that similar (though weaker) properties hold for certain random linear transformations over the Hamming cube. We use these transformations to solve NP-hard clustering problems in the cube as well as in geometric settings. More specifically, we address the following clustering problem. Given n points in a larger set (for example, R^d) endowed with a distance function (for example, L² distance), we would like to partition the data set into k disjoint clusters, each with a "cluster center", so as to minimize the sum over all data points of the distance between the point and the center of the cluster containing the point. The problem is provably NP-hard in some high dimensional geometric settings, even for k = 2. We give polynomial time approximation schemes for this problem in several settings, including the binary cube {0, 1}^d with Hamming distance, and R^d either with L¹ distance, or with L² distance, or with the square of L² distance. In all these settings, the best previous results were constant factor approximation guarantees. We note that our problem is similar in flavor to the k-median problem (and the related facility location problem), which has been considered in graph-theoretic and fixed dimensional geometric settings, where it becomes hard when k is part of the input. In contrast, we study the problem when k is fixed, but the dimension is part of the input.
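The objective in the abstract can be made concrete with a small brute-force sketch for the binary cube with Hamming distance (illustrative only: the helper names are hypothetical, the exhaustive search over {0, 1}^d is exponential in d, and the paper's approximation schemes exist precisely to avoid it):

```python
from itertools import product

def hamming(x, y):
    """Hamming distance between two 0/1 tuples."""
    return sum(a != b for a, b in zip(x, y))

def clustering_cost(points, centers, dist=hamming):
    """The objective from the abstract: each point pays its distance
    to the nearest of the k centers; the cost is the total."""
    return sum(min(dist(p, c) for c in centers) for p in points)

def best_k_clustering(points, k, d, dist=hamming):
    """Exhaustive search over all k-tuples of centers in {0,1}^d.
    Exponential in d -- only meant to make the objective concrete."""
    cube = list(product([0, 1], repeat=d))
    best = None
    for centers in product(cube, repeat=k):
        cost = clustering_cost(points, centers, dist)
        if best is None or cost < best[0]:
            best = (cost, centers)
    return best

points = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
cost, centers = best_k_clustering(points, k=2, d=3)
print(cost)  # -> 2
```

For the four example points the optimum splits them into the two obvious pairs; each cluster's best center is the coordinate-wise majority of its points, for a total cost of 2.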

### Citations

3025 | Indexing by latent semantic analysis
- Deerwester, Dumais, et al.
- 1990
Citation Context: ...ording to a small number of topics that can be inspected by a person in order to assist further classification and searching. Representation of documents using methods such as latent semantic indexing [20, 16, 19, 9] leads to high dimensional data. Systems like the "scatter/gather" project [14, 13] or the "Manjara" project [30, 24] (a web meta-search engine that clusters the results of a search according to sev...

2999 | Authoritative sources in a hyperlinked environment
- Kleinberg
- 1999
Citation Context: ...ee, for example, [37, 10, 38] and references therein). The recent interest in clustering problems can be attributed to applications such as the classification of web pages retrieved by a search engine [40, 33, 30], or the study of gene expression in computational molecular biology [31]. In many applications, the goal is to cluster data into several clusters according to some measure, where the data has many in...

1717 | The Probabilistic Method - Alon, Spencer - 1992

759 | Approximate nearest neighbors: Towards removing the curse of dimensionality
- Indyk, Motwani
- 1998
Citation Context: ...ection onto a random subspace) for nearly isometric dimension reduction of finite subsets. This lemma has found recent applications in combinatorics [22], graph algorithms [36], nearest neighbor search [27], and learning mixtures of Gaussians [15]. It does not seem to be useful in our case. In the low dimensional cube, we can enumerate over the possible center locations and compute a candidate clusterin...

680 | Scatter/gather: A cluster-based approach to browsing large document collections
- Cutting, Pedersen, et al.
- 1992
Citation Context: ...her classification and searching. Representation of documents using methods such as latent semantic indexing [20, 16, 19, 9] leads to high dimensional data. Systems like the "scatter/gather" project [14, 13] or the "Manjara" project [30, 24] (a web meta-search engine that clusters the results of a search according to several topics) require clustering of such data into a relatively small number of cluste...

574 | Using linear algebra for intelligent information retrieval
- Berry, Dumais, et al.
- 1996
Citation Context: ...ording to a small number of topics that can be inspected by a person in order to assist further classification and searching. Representation of documents using methods such as latent semantic indexing [20, 16, 19, 9] leads to high dimensional data. Systems like the "scatter/gather" project [14, 13] or the "Manjara" project [30, 24] (a web meta-search engine that clusters the results of a search according to sev...

455 | The Geometry of Graphs and some of Its Algorithmic Applications
- Linial, London, et al.
- 1995
Citation Context: ... linear transformation (a projection onto a random subspace) for nearly isometric dimension reduction of finite subsets. This lemma has found recent applications in combinatorics [22], graph algorithms [36], nearest neighbor search [27], and learning mixtures of Gaussians [15]. It does not seem to be useful in our case. In the low dimensional cube, we can enumerate over the possible center locations and...

438 | Syntactic clustering of the web - Broder, Glassman, et al. - 1997

425 | Extensions of Lipschitz mappings into a Hilbert space. Conference in modern analysis and probability
- Johnson, Lindenstrauss
- 1982
Citation Context: ...h and it doesn't expand small distances too much. We believe that this observation might be of independent interest. We note that in Hilbert space (e.g., (R^d, L²)) the Johnson-Lindenstrauss Lemma [29] uses a random linear transformation (a projection onto a random subspace) for nearly isometric dimension reduction of finite subsets. This lemma has found recent applications in combinatorics [22], gra...
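The Johnson-Lindenstrauss construction mentioned in this context can be sketched in a few lines. This illustrates the lemma itself, not the paper's Hamming-cube transformations; the target dimension and the data below are arbitrary illustrative choices:

```python
import math
import random

def random_projection(points, k, seed=0):
    """Multiply each point by a k x d matrix of independent N(0, 1)
    entries scaled by 1/sqrt(k); pairwise L2 distances are then
    preserved up to small distortion with high probability."""
    rng = random.Random(seed)
    d = len(points[0])
    R = [[rng.gauss(0, 1) / math.sqrt(k) for _ in range(d)] for _ in range(k)]
    return [tuple(sum(row[j] * p[j] for j in range(d)) for row in R)
            for p in points]

def l2(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Five points in R^200, projected down to R^50.
rng = random.Random(42)
pts = [tuple(rng.gauss(0, 1) for _ in range(200)) for _ in range(5)]
proj = random_projection(pts, k=50)
ratios = [l2(proj[i], proj[j]) / l2(pts[i], pts[j])
          for i in range(5) for j in range(i + 1, 5)]
# Each ratio stays close to 1: distances survive the reduction.
```

Taking k = O(log n / ε²) suffices for distortion 1 ± ε over n points, which is the form of the lemma quoted in the abstract.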

326 | Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation
- Jain, Vazirani
Citation Context: ... Shmoys, and Tardos [11] gave a constant factor approximation algorithm, based on a rounding procedure for a natural linear programming relaxation. The constant has been improved by Jain and Vazirani [28], and further by Charikar and Guha [12], using the primal-dual method. Similarly, in fixed dimension d, the k-clustering problem has a polynomial time solution. To illustrate this for k = 2 (in R^d with...

245 | Improving the retrieval of information from external sources
- Dumais
- 1991
Citation Context: ...ording to a small number of topics that can be inspected by a person in order to assist further classification and searching. Representation of documents using methods such as latent semantic indexing [20, 16, 19, 9] leads to high dimensional data. Systems like the "scatter/gather" project [14, 13] or the "Manjara" project [30, 24] (a web meta-search engine that clusters the results of a search according to sev...

220 | A constant-factor approximation algorithm for the k-median problem
- Charikar, Guha, et al.
Citation Context: ...possible choices for the centers. For arbitrary k, in finite metrics, the k-median problem was shown to be APX-hard by Guha and Khuller [25]. A breakthrough result by Charikar, Guha, Shmoys, and Tardos [11] gave a constant factor approximation algorithm, based on a rounding procedure for a natural linear programming relaxation. The constant has been improved by Jain and Vazirani [28], and further by Cha...

209 | Improved combinatorial algorithms for facility location problems
- Charikar, Guha
Citation Context: ...t factor approximation algorithm, based on a rounding procedure for a natural linear programming relaxation. The constant has been improved by Jain and Vazirani [28], and further by Charikar and Guha [12], using the primal-dual method. Similarly, in fixed dimension d, the k-clustering problem has a polynomial time solution. To illustrate this for k = 2 (in R^d with Euclidean distances), notice that the ...

196 | Efficient search for approximate nearest neighbor in high dimensional spaces - Kushilevitz, Ostrovsky, et al.

190 | Greedy strikes back: Improved facility location algorithms
- Guha, Khuller
- 1999
Citation Context: ...ivially has a polynomial time solution: Simply enumerate over all possible choices for the centers. For arbitrary k, in finite metrics, the k-median problem was shown to be APX-hard by Guha and Khuller [25]. A breakthrough result by Charikar, Guha, Shmoys, and Tardos [11] gave a constant factor approximation algorithm, based on a rounding procedure for a natural linear programming relaxation. The consta...

181 | Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations - Frieze, Kannan, et al. - 1998

179 | Two algorithms for nearest-neighbor search in high dimensions
- Kleinberg
- 1997
Citation Context: ...re complicated. The location of cluster centers induces for every data point a tournament among the clusters. We assign a data point to an apex of its tournament, an idea previously used by Kleinberg [32] in the context of nearest neighbor search. Our other results are derived essentially by reducing the problem to clustering in the Hamming cube. This is not a "black-box" reduction, as we have to modi...

177 | Learning mixtures of Gaussians
- Dasgupta
- 1999
Citation Context: ... isometric dimension reduction of finite subsets. This lemma has found recent applications in combinatorics [22], graph algorithms [36], nearest neighbor search [27], and learning mixtures of Gaussians [15]. It does not seem to be useful in our case. In the low dimensional cube, we can enumerate over the possible center locations and compute a candidate clustering for each possibility. The value of a ca...

172 | Approximation schemes for dense instances of np-hard problems
- Arora, Karpinski
- 1995
Citation Context: ...l these results use one form or another of sampling. Sampling is not a common tool in the design of polynomial time approximation schemes. It has been used successfully in the context of dense graphs [5, 23]. In geometric settings (and in general), the ubiquitous method is dynamic programming (see [10]). One example in our context is the k-center algorithms of Agarwal and Procopiuc [1]. They give an n^O(...

143 | Clustering Algorithms
- Rasmussen
- 1992
Citation Context: ...limited to fixed dimension. Clustering of data has significant importance in many fields, including operations research, data mining, statistics, computer vision and pattern recognition (see, for example, [37, 10, 38] and references therein). The recent interest in clustering problems can be attributed to applications such as the classification of web pages retrieved by a search engine [40, 33, 30], or the study of...

126 | Constant interaction-time scatter/gather browsing of very large document collections
- Cutting, Karger, et al.
- 1993
Citation Context: ...her classification and searching. Representation of documents using methods such as latent semantic indexing [20, 16, 19, 9] leads to high dimensional data. Systems like the "scatter/gather" project [14, 13] or the "Manjara" project [30, 24] (a web meta-search engine that clusters the results of a search according to several topics) require clustering of such data into a relatively small number of cluste...

117 | Approximation schemes for Euclidean k-medians and related problems
- Arora, Raghavan, et al.
- 1998
Citation Context: ...la, and Vinay show it for R^d with squared Euclidean distances. The NP-hardness of the Euclidean distances case is still open. We note that in fixed dimension, for arbitrary k, Arora, Raghavan, and Rao [6] give a polynomial time approximation scheme, using dynamic programming. Our measure of the quality of our clustering is by no means the obvious choice. In fact, other measures have been proposed in t...

103 | Fast and intuitive clustering of web documents
- Zamir, Etzioni, et al.
- 1997
Citation Context: ...ee, for example, [37, 10, 38] and references therein). The recent interest in clustering problems can be attributed to applications such as the classification of web pages retrieved by a search engine [40, 33, 30], or the study of gene expression in computational molecular biology [31]. In many applications, the goal is to cluster data into several clusters according to some measure, where the data has many in...

102 | The Johnson-Lindenstrauss lemma and the sphericity of some graphs
- Frankl, Maehara
- 1987
Citation Context: ...emma [29] uses a random linear transformation (a projection onto a random subspace) for nearly isometric dimension reduction of finite subsets. This lemma has found recent applications in combinatorics [22], graph algorithms [36], nearest neighbor search [27], and learning mixtures of Gaussians [15]. It does not seem to be useful in our case. In the low dimensional cube, we can enumerate over the possib...

100 | Efficient algorithms for geometric optimization - Agarwal, Sharir - 1998

88 | Clustering in large graphs and matrices
- Drineas, Frieze, et al.
- 1999
Citation Context: ...pplications, the goal is to cluster data into several clusters according to some measure, where the data has many incomparable attributes and thus can be cast as a high dimensional clustering problem [33, 18, 7, 39]. In this paper, we consider the case where the dimension is very large but the number of clusters that we need to produce is relatively small. A typical case is when a large collection of documents m...

81 | Approximation algorithms for geometric problems, Approximation Algorithms for NP-hard Problems
- Bern, Eppstein
- 1997
Citation Context: ...limited to fixed dimension. Clustering of data has significant importance in many fields, including operations research, data mining, statistics, computer vision and pattern recognition (see, for example, [37, 10, 38] and references therein). The recent interest in clustering problems can be attributed to applications such as the classification of web pages retrieved by a search engine [40, 33, 30], or the study of...

78 | The regularity lemma and approximation schemes for dense problems
- Frieze, Kannan
- 1996
Citation Context: ...l these results use one form or another of sampling. Sampling is not a common tool in the design of polynomial time approximation schemes. It has been used successfully in the context of dense graphs [5, 23]. In geometric settings (and in general), the ubiquitous method is dynamic programming (see [10]). One example in our context is the k-center algorithms of Agarwal and Procopiuc [1]. They give an n^O(...

69 | Segmentation Problems
- Kleinberg, Papadimitriou, et al.
- 1998
Citation Context: ...rial complexity of the problem grows exponentially with the dimension. Indeed, the k-clustering problem was shown to be NP-hard even for k = 2 in several cases. Kleinberg, Papadimitriou, and Raghavan [34] show it for the binary cube, and Drineas, Frieze, Kannan, Vempala, and Vinay show it for R^d with squared Euclidean distances. The NP-hardness of the Euclidean distances case is still open. We note t...

60 | Fast hierarchical clustering and other applications of dynamic closest pairs - Eppstein

59 | Exact and approximation algorithms for clustering
- Agarwal
Citation Context: ...f dense graphs [5, 23]. In geometric settings (and in general), the ubiquitous method is dynamic programming (see [10]). One example in our context is the k-center algorithms of Agarwal and Procopiuc [1]. They give an n^{O(k^{1-1/d})}-time exact algorithm and a polynomial time approximation scheme with running time O(n log k) + (k/ε)^{O(k^{1-1/d})} for the k-center problem in R^d with L^p distances, for a...

52 | Using latent semantic analysis to improve information retrieval
- Dumais, Furnas, et al.
- 1988

39 | A sublinear time approximation scheme for clustering in metric spaces
- Indyk
- 1999
Citation Context: ...ings (including squared Euclidean distances), provided that the dimension d = o(log n / log log n). His algorithm works in higher dimension too, but the running time degrades to n^{O(log log n)}. Indyk [26] gives a polynomial time approximation scheme for min-sum clustering in finite metric spaces (when two clusters are needed), based on the polynomial time approximation scheme of de la Vega and Kenyon [1...

31 | Clustering for Edge-Cost Minimization
- Schulman
- 2002
Citation Context: ...pplications, the goal is to cluster data into several clusters according to some measure, where the data has many incomparable attributes and thus can be cast as a high dimensional clustering problem [33, 18, 7, 39]. In this paper, we consider the case where the dimension is very large but the number of clusters that we need to produce is relatively small. A typical case is when a large collection of documents m...

29 | A randomized approximation scheme for Metric MAX-CUT
- de la Vega, Kenyon
Citation Context: ...6] gives a polynomial time approximation scheme for min-sum clustering in finite metric spaces (when two clusters are needed), based on the polynomial time approximation scheme of de la Vega and Kenyon [17] for metric MAX CUT. Alon and Sudakov [4] give a polynomial time approximation scheme for the maximization version of our problem in the binary cube (i.e., when the objective is to find a partition and ...

18 | Subquadratic approximation algorithms for clustering problems in high dimensional spaces
- Borodin, Ostrovsky, et al.
- 1999
Citation Context: ...pplications, the goal is to cluster data into several clusters according to some measure, where the data has many incomparable attributes and thus can be cast as a high dimensional clustering problem [33, 18, 7, 39]. In this paper, we consider the case where the dimension is very large but the number of clusters that we need to produce is relatively small. A typical case is when a large collection of documents m...

17 | Efficient search for approximate nearest neighbor in high dimensional spaces
- Kushilevitz, Ostrovsky, et al.
- 2000
Citation Context: ...nor dynamic programming, nor the singular value decomposition. For the Hamming cube, we use random linear transformations to reduce the dimension. More specifically, Kushilevitz, Ostrovsky, and Rabani [35] show that a certain random linear transformation into a low dimensional cube can be used to test for a specific Hamming distance. We strengthen their analysis to show that this transformation guarante...
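The flavor of such random linear transformations over the Hamming cube can be illustrated with a toy GF(2) sketch (the dimensions and sparsity parameter below are hypothetical choices, not the construction analyzed in the paper): each output bit is the parity of a random sparse subset of input coordinates, so images of two points tend to differ in more output bits the larger their original Hamming distance.

```python
import random

def random_gf2_map(d, k, p, seed=0):
    """A k x d random 0/1 matrix over GF(2): each entry is 1 with
    probability p, so each output bit is the parity of a random
    sparse subset of the input coordinates."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for _ in range(d)] for _ in range(k)]

def apply_map(A, x):
    """Map x in {0,1}^d to {0,1}^k by matrix multiplication mod 2."""
    return tuple(sum(a * xi for a, xi in zip(row, x)) % 2 for row in A)

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

# Two points at original Hamming distance h have images that differ
# in each output bit with probability (1 - (1 - 2p)^h) / 2, which is
# increasing in h: the low-dimensional distance reflects the original.
d, k, p = 300, 128, 0.02
A = random_gf2_map(d, k, p, seed=1)
x = tuple([0] * d)
near = tuple([1] * 5 + [0] * (d - 5))     # Hamming distance 5 from x
far = tuple([1] * 100 + [0] * (d - 100))  # Hamming distance 100 from x
near_dist = hamming(apply_map(A, x), apply_map(A, near))
far_dist = hamming(apply_map(A, x), apply_map(A, far))
```

With p tuned to a distance scale of interest (here roughly 1/p = 50), the map concentrates its discriminating power around that scale, which is the sense in which such a transformation can "test for" a specific Hamming distance.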

15 | Efficient algorithms for geometric optimization
- Agarwal, Sharir
- 1996
Citation Context: ...roximation scheme with running time O(n log k) + (k/ε)^{O(k^{1-1/d})} for the k-center problem in R^d with L^p distances, for all p, using dynamic programming. (See also the survey of Agarwal and Sharir [2] for previous and related work.) A different idea is advocated by Drineas, Frieze, Kannan, Vempala, and Vinay [18]. They give a 2-approximation for k-clustering (fixed k) for the case of squared Euclide...

13 | Pattern recognition
- O’Rourke, Toussaint
- 1997
Citation Context: ...limited to fixed dimension. Clustering of data has significant importance in many fields, including operations research, data mining, statistics, computer vision and pattern recognition (see, for example, [37, 10, 38] and references therein). The recent interest in clustering problems can be attributed to applications such as the classification of web pages retrieved by a search engine [40, 33, 30], or the study of...

11 | On two segmentation problems
- Alon, Sudakov
- 1999
Citation Context: ...cheme for min-sum clustering in finite metric spaces (when two clusters are needed), based on the polynomial time approximation scheme of de la Vega and Kenyon [17] for metric MAX CUT. Alon and Sudakov [4] give a polynomial time approximation scheme for the maximization version of our problem in the binary cube (i.e., when the objective is to find a partition and centers that maximize the sum over all da...

11 | Fast Monte-Carlo Algorithms for Low-Rank Approximations
- Frieze, Kannan, et al.
- 1998
Citation Context: ...epresentation of documents using methods such as latent semantic indexing [20, 16, 19, 9] leads to high dimensional data. Systems like the "scatter/gather" project [14, 13] or the "Manjara" project [30, 24] (a web meta-search engine that clusters the results of a search according to several topics) require clustering of such data into a relatively small number of clusters. When k is part of the input, t...

2 | The genomics revolution and its challenges for algorithmic research
- Karp
- 2000
Citation Context: ...stering problems can be attributed to applications such as the classification of web pages retrieved by a search engine [40, 33, 30], or the study of gene expression in computational molecular biology [31]. In many applications, the goal is to cluster data into several clusters according to some measure, where the data has many incomparable attributes and thus can be cast as a high dimensional clusteri...

1 | The Manjara Meta-Search Engine. http://cluster.cs.yale.edu/about.html
- Kannan, Vinay
Citation Context: ...ee, for example, [37, 10, 38] and references therein). The recent interest in clustering problems can be attributed to applications such as the classification of web pages retrieved by a search engine [40, 33, 30], or the study of gene expression in computational molecular biology [31]. In many applications, the goal is to cluster data into several clusters according to some measure, where the data has many in...