
## K-means++: The advantages of careful seeding (2007)


Venue: In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07)

Citations: 478 (8 self)

### Citations

1362 | Least squares quantization in PCM
- Lloyd
- 1982
Citation Context: ...to minimize φ, the sum of the squared distances between each point and its closest center. Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it “is by far the most popular clust...

408 | Survey of Clustering Data Mining Techniques
- Berkhin
- 2002
Citation Context: ...today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it “is by far the most popular clustering algorithm used in scientific and industrial applications” [5]. Usually referred to simply as k-means, Lloyd’s algorithm begins with k arbitrary centers, typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center...
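The excerpt above describes Lloyd's algorithm in full: start from k centers chosen uniformly at random from the data, assign each point to its nearest center, recompute each center as the center of mass of its assigned points, and repeat until the process stabilizes. A minimal Python sketch of that loop (an illustrative helper, not the paper's implementation; the function name and the fixed iteration cap are our assumptions):

```python
import random

def lloyd(points, k, iters=20, seed=0):
    """Illustrative Lloyd's algorithm: k initial centers chosen
    uniformly at random from the data points, then repeated
    assignment and center-of-mass steps."""
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Update step: each non-empty cluster's center moves to its mean.
        for j, members in enumerate(clusters):
            if members:
                centers[j] = tuple(sum(coord) / len(members)
                                   for coord in zip(*members))
    return centers
```

With k equal to the number of distinct points, every point becomes its own center and the error φ drops to zero, matching the monotone-decrease argument quoted later in this listing.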

157 | Clustering data streams: Theory and practice.
- Guha, Meyerson, et al.
- 2003
Citation Context: ...they prove k-means++ is O(1)-competitive in the case where φ_OPT,k / φ_OPT,k−1 ≤ ɛ². The intuition here is that if this condition does not hold, then the data is not well suited for clustering with the given value for k. Combining this result with ours gives a strong characterization of the algorithm’s performance. In particular, k-means++ is never worse than O(log k)-competitive, and on very well formed data sets, it improves to being O(1)-competitive. Overall, the seeding technique we propose is similar in spirit to that used by Meyerson [23] for online facility location, and Mishra et al. [12] and Charikar et al. [6] in the context of k-median clustering. However, our analysis is quite different from those works. 2 Preliminaries In this section, we formally define the k-means problem, as well as the k-means and k-means++ algorithms. For the k-means problem, we are given an integer k and a set of n data points X ⊂ R^d. We wish to choose k centers C so as to minimize the potential function φ = ∑_{x∈X} min_{c∈C} ‖x − c‖². Choosing these centers implicitly defines a clustering – for each center, we set one cluster to be the set of data points that are closer to that center than to any othe...
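The potential φ defined in this excerpt is straightforward to compute directly from a point set and a set of centers. A short sketch (the helper name is ours):

```python
def potential(points, centers):
    """phi = sum over x in X of (min over c in C of ||x - c||^2)."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )
```

For example, with X = {(0, 0), (2, 0)} and the single center (0, 0), φ = 0 + 4 = 4; adding (2, 0) as a second center drives φ to 0.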

113 | A local search approximation algorithm for k-means clustering.
- Kanungo, Mount, et al.
- 2002
Citation Context: ...insights into the structure of the clustering problem, their algorithms are highly exponential (or worse) in k, and are unfortunately impractical even for relatively small n, k and d. Kanungo et al. [17] proposed an O(n^3 ɛ^−d) algorithm that is (9 + ɛ)-competitive. However, n^3 compares unfavorably with the almost linear running time of Lloyd’s method, and the exponential dependence on d can also b...

112 | Clustering large graphs via the singular value decomposition
- Drineas, Frieze, et al.
- 2004
Citation Context: ...The goal is to choose k centers so as to minimize φ, the sum of the squared distances between each point and its closest center. Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states th...

106 | Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering,”
- Inaba, Katoh, et al.
- 1994
Citation Context: ...ccuracy. The convergence time of Lloyd’s method has been the subject of a recent series of papers [2, 4, 8, 14]; in this work we focus on improving its accuracy. In the theory community, Inaba et al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of O(n^(kd)). Since then, a number of polynomial time approximation schemes have been developed (see [9, 13, 19...

94 | Better streaming algorithms for clustering problems.
- Charikar, O’Callaghan, et al.
- 2003
Citation Context: ...proves to being O(1)-competitive. Overall, the seeding technique we propose is similar in spirit to that used by Meyerson [23] for online facility location, and Mishra et al. [12] and Charikar et al. [6] in the context of k-median clustering. However, our analysis is quite different from those works. 2 Preliminaries In this section, we formally define the k-means problem, as well as the k-means and k...

83 | The Effectiveness of Lloyd-Type Methods for the k-Means Problem.” FOCS
- Ostrovsky, Rabani, et al.
- 2006
Citation Context: ...arantees that k-means cannot. We propose using it to seed the initial centers for k-means, leading to a combined algorithm we call k-means++. This complements a very recent result of Ostrovsky et al. [24], who independently proposed much the same algorithm. Whereas they showed this randomized seeding is O(1)-competitive on data sets following a certain separation condition, we show it is O(log k)-comp...

71 | Online facility location.
- Meyerson
- 2001
Citation Context: ...never worse than O(log k)-competitive, and on very well formed data sets, it improves to being O(1)-competitive. Overall, the seeding technique we propose is similar in spirit to that used by Meyerson [23] for online facility location, and Mishra et al. [12] and Charikar et al. [6] in the context of k-median clustering. However, our analysis is quite different from those works. 2 Preliminaries In this...

70 | On coresets for k-means and k-median clustering.
- Har-Peled, Mazumdar
- 2004
Citation Context: ...t al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of O(n^(kd)). Since then, a number of polynomial time approximation schemes have been developed (see [9, 13, 19, 21] and the references therein). While the authors develop interesting insights into the structure of the clustering problem, their algorithms are highly exponential (or worse) in k, and are unfortunatel...

65 | Approximation schemes for clustering problems.
- Vega, Karpinski, et al.
- 2003
Citation Context: ...a substantial margin. 1.1 Related Work There have been a number of recent papers that describe (1 + ɛ)-competitive algorithms for the k-means problem that are essentially unrelated to Lloyd’s method [4, 6, 10, 12]. These algorithms are all highly exponential in k, however, and are not at all viable in practice. Kanungo et al. [9] recently proposed an O(n^3 ɛ^−d) algorithm for the k-means problem that is (9 + ...

52 | How slow is the k-means method?
- Arthur, Vassilvitskii
- 2006
Citation Context: ...served speed, Lloyd’s method [20] remains the most popular approach in practice, despite its limited accuracy. The convergence time of Lloyd’s method has been the subject of a recent series of papers [2, 4, 8, 14]; in this work we focus on improving its accuracy. In the theory community, Inaba et al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of O(n^(kd)). Sinc...

50 | A fast hybrid k-means level set algorithm
- Gibou, Fedkiw
- 2005
Citation Context: ...Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it “is by far the most popular clustering algorithm used in scientific and industrial applications” [5]. Usually referred to simply as ...

44 | Clustering data streams: Theory and practice.
- Guha, Meyerson, et al.
- 2003
Citation Context: ...l formed data sets, it improves to being O(1)-competitive. Overall, the seeding technique we propose is similar in spirit to that used by Meyerson [23] for online facility location, and Mishra et al. [12] and Charikar et al. [6] in the context of k-median clustering. However, our analysis is quite different from those works. 2 Preliminaries In this section, we formally define the k-means problem, as w...

44 | Large-scale clustering of cDNA-fingerprinting data
- Herwig, Poustka, et al.
- 1999
Citation Context: ...Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it “is by far the most popular clustering algorithm used in scientific and industrial applications” [5]. Usually referred to simply as ...

39 | k-means projective clustering
- Agarwal, Mustafa
- 2004
Citation Context: ...Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it “is by far the most popular clustering algorithm used in scientific and industrial applications” [5]. Usually referred to simply as ...

36 | How fast is the k-means method?
- Har-Peled, Sadri
- 2005
Citation Context: ...served speed, Lloyd’s method [20] remains the most popular approach in practice, despite its limited accuracy. The convergence time of Lloyd’s method has been the subject of a recent series of papers [2, 4, 8, 14]; in this work we focus on improving its accuracy. In the theory community, Inaba et al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of O(n^(kd)). Sinc...

36 | Optimal time bounds for approximate clustering.
- Mettu, Plaxton
- 2002
Citation Context: ...suggested a way of combining their techniques with Lloyd’s algorithm, but in order to avoid the exponential dependence on d, their approach sacrifices all approximation guarantees. Mettu and Plaxton [22] also achieved a constant-probability O(1) approximation using a technique called successive sampling. They match our running time of O(nkd), but only if k is sufficiently large and the spread is suffi...

29 | On approximate geometric k-clustering.
- Matousek
- 2000
Citation Context: ...t al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of O(n^(kd)). Since then, a number of polynomial time approximation schemes have been developed (see [9, 13, 19, 21] and the references therein). While the authors develop interesting insights into the structure of the clustering problem, their algorithms are highly exponential (or worse) in k, and are unfortunatel...

28 | A simple linear time (1+ɛ)-approximation algorithm for k-means clustering in any dimensions
- Kumar, Sen, et al.
- 2004
Citation Context: ...t al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of O(n^(kd)). Since then, a number of polynomial time approximation schemes have been developed (see [9, 13, 19, 21] and the references therein). While the authors develop interesting insights into the structure of the clustering problem, their algorithms are highly exponential (or worse) in k, and are unfortunatel...

23 | Survey of Clustering Data Mining Techniques, Technical Report, Accrue Software
- Berkhin
- 2002
Citation Context: ...n the popular k-means formulation, one is given an integer k and a set of n data points in R^d. The goal is to choose k centers so as to minimize φ, the sum of the squared distances between each point and its closest center. Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it “is by far the most popular clustering algorithm used in scientific and industrial applications” [5]. Usually referred to simply as k-means, Lloyd’s algorithm begins with k arbitrary centers, typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These two steps (assignment and center calculation) are repeated until the process stabilizes. One can check that the total error φ is monotonically decreasing, which ensures that no clustering is repeated during the course of the algorithm. Since there are at most k^n possible clusterings, the process will always t...

20 | Worst-case and smoothed analysis of the ICP algorithm, with an application to the k-means method
- Arthur, Vassilvitskii
- 2006
Citation Context: ...served speed, Lloyd’s method [20] remains the most popular approach in practice, despite its limited accuracy. The convergence time of Lloyd’s method has been the subject of a recent series of papers [2, 4, 8, 14]; in this work we focus on improving its accuracy. In the theory community, Inaba et al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of O(n^(kd)). Sinc...

16 | A simple linear time (1+ɛ)-approximation algorithm for k-means clustering in any dimensions.
- Kumar, Sen, et al.
- 2004
Citation Context: ...et al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of O(n^(kd)). Since then, a number of polynomial time approximation schemes have been developed (see [9, 13, 19, 21] and the references therein). While the authors develop interesting insights into the structure of the clustering problem, their algorithms are highly exponential (or worse) in k, and are unfortunatel...

7 | How fast is k-means?
- Dasgupta
- 2003

3 | How fast is k-means?
- Dasgupta
- 2003
Citation Context: ...can also get a simple O(log k) approximation algorithm for the k-median objective. Furthermore, we provide preliminary experimental data showing that in practice, k-means++ really does outperform k-means in terms of both accuracy and speed, often by a substantial margin. 1.1 Related work As a fundamental problem in machine learning, k-means has a rich history. Because of its simplicity and its observed speed, Lloyd’s method [20] remains the most popular approach in practice, despite its limited accuracy. The convergence time of Lloyd’s method has been the subject of a recent series of papers [2, 4, 8, 14]; in this work we focus on improving its accuracy. In the theory community, Inaba et al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of O(n^(kd)). Since then, a number of polynomial time approximation schemes have been developed (see [9, 13, 19, 21] and the references therein). While the authors develop interesting insights into the structure of the clustering problem, their algorithms are highly exponential (or worse) in k, and are unfortunately impractical even for relatively small n, k and d. Kanungo et al. [17] proposed an O(n^3 ɛ^−d) algorithm t...

1 | k-means++ test code. http://www.stanford.edu/~darthur/kMeansppTest.zip
- Arthur, Vassilvitskii
Citation Context: ...n the corresponding potential function φ^[ℓ] satisfies E[φ^[ℓ]] ≤ 2^(2ℓ) (ln k + 2) φ_OPT^[ℓ]. 6 Empirical results In order to evaluate k-means++ in practice, we have implemented and tested it in C++ [3]. In this section, we discuss the results of these preliminary experiments. We found that D² seeding substantially improves both the running time and the accuracy of k-means. 6.1 Datasets We evaluate...
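The D² seeding evaluated in this excerpt picks the first center uniformly at random and every later center from the data points, with probability proportional to D(x)², the squared distance from x to the nearest center already chosen. A sketch in Python rather than the paper's C++ test code (the function names are ours):

```python
import random

def dist2(p, q):
    # Squared Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def d2_seed(points, k, seed=0):
    """k-means++ (D^2) seeding: each new center is a data point x
    sampled with probability proportional to D(x)^2, its squared
    distance to the nearest center chosen so far."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    weights = [dist2(p, centers[0]) for p in points]
    while len(centers) < k:
        # Weighted draw: walk the prefix sums until we pass r.
        r = rng.random() * sum(weights)
        acc = 0.0
        for p, w in zip(points, weights):
            acc += w
            if acc >= r:
                centers.append(p)
                break
        # Each point keeps its distance to the nearest chosen center.
        weights = [min(w, dist2(p, centers[-1]))
                   for p, w in zip(points, weights)]
    return centers
```

The resulting centers can then seed the usual Lloyd iterations, which is exactly the combination the paper calls k-means++.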

1 | Collard’s cloud cover database. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/undocumented/taylor/cloud.data
- Collard
Citation Context: ...the true centers providing a good approximation to the optimal clustering. We chose the remaining datasets from real-world examples off the UC-Irvine Machine Learning Repository. The Cloud dataset [7] consists of 1024 points in 10 dimensions, and it is Philippe Collard’s first cloud cover database. The Intrusion dataset [18] consists of 494019 points in 35 dimensions, and it represents features av...