Clustering data streams: Theory and practice
IEEE TKDE, 2003
"... Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little ..."
Cited by 154 (4 self)
Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams. Index Terms—Clustering, data streams, approximation algorithms.
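The single-pass, small-memory requirement described in this abstract can be illustrated with a chunk-and-recluster sketch: cluster each chunk of the stream down to k weighted centers, then cluster the retained centers. The sketch below is a toy version under simplifying assumptions — 1-D points, and a crude random-restart k-median in place of the paper's actual subroutines; all function names are illustrative, not from the paper.

```python
import random

def weighted_kmedian(points, weights, k, restarts=40):
    # crude weighted k-median: best of several random center sets (illustrative only)
    best_cost, best = float("inf"), None
    for _ in range(restarts):
        centers = random.sample(points, k)
        cost = sum(w * min(abs(p - c) for c in centers)
                   for p, w in zip(points, weights))
        if cost < best_cost:
            best_cost, best = cost, centers
    return best

def assigned_weights(points, weights, centers):
    # total weight of the points assigned to each center
    agg = [0.0] * len(centers)
    for p, w in zip(points, weights):
        j = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        agg[j] += w
    return agg

def stream_cluster(stream, k, chunk_size):
    reps, rep_w = [], []          # weighted representatives retained so far
    chunk = []
    for x in stream:              # a single pass over the stream
        chunk.append(x)
        if len(chunk) == chunk_size:
            centers = weighted_kmedian(chunk, [1.0] * len(chunk), k)
            reps += centers
            rep_w += assigned_weights(chunk, [1.0] * len(chunk), centers)
            chunk = []
    if chunk:                     # leftover partial chunk kept verbatim
        reps += chunk
        rep_w += [1.0] * len(chunk)
    # final clustering of the small weighted summary
    return weighted_kmedian(reps, rep_w, k)
```

Memory stays proportional to the chunk size plus the retained representatives, never the full stream — which is the point of the streaming model.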
Greedy Facility Location Algorithms Analyzed Using Dual Fitting with Factor-Revealing LP
Journal of the ACM, 2001
"... We present a natural greedy algorithm for the metric uncapacitated facility location problem and use the method of dual fitting to analyze its approximation ratio, which turns out to be 1.861. The running time of our algorithm is O(m log m), where m is the total number of edges in the underlying c ..."
Cited by 148 (12 self)
We present a natural greedy algorithm for the metric uncapacitated facility location problem and use the method of dual fitting to analyze its approximation ratio, which turns out to be 1.861. The running time of our algorithm is O(m log m), where m is the total number of edges in the underlying complete bipartite graph between cities and facilities. We use our algorithm to improve recent results for some variants of the problem, such as the fault tolerant and outlier versions. In addition, we introduce a new variant which can be seen as a special case of the concave cost version of this problem.
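The "natural greedy algorithm" for uncapacitated facility location repeatedly picks the most cost-effective "star" — a facility together with a set of still-unconnected cities — minimizing (opening fee + connection costs) per city; an already-open facility has fee zero. A minimal brute-force sketch of that greedy rule (without the paper's O(m log m) data structures, which this naive version does not attempt):

```python
def greedy_facility_location(open_cost, dist):
    """open_cost[i] = cost of opening facility i; dist[i][j] = facility i to city j."""
    n_f = len(open_cost)
    unconnected = set(range(len(dist[0])))
    opened, assignment = set(), {}
    while unconnected:
        best = None  # (cost per city, facility, star of cities)
        for i in range(n_f):
            fee = 0.0 if i in opened else open_cost[i]  # reopening is free
            nearest = sorted(unconnected, key=lambda j: dist[i][j])
            total = fee
            for t, j in enumerate(nearest, start=1):
                total += dist[i][j]          # star = facility i + its t nearest cities
                ratio = total / t
                if best is None or ratio < best[0]:
                    best = (ratio, i, nearest[:t])
        _, i, star = best
        opened.add(i)
        for j in star:
            assignment[j] = i
        unconnected.difference_update(star)
    return opened, assignment
```

Dual fitting then interprets each city's share of its star's cost as a (scaled) dual variable, and the factor-revealing LP bounds the worst-case scaling — here yielding the 1.861 ratio.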
A local search approximation algorithm for k-means clustering
2004
"... In kmeans clustering we are given a set of n data points in ddimensional space ℜd and an integer k, and the problem is to determine a set of k points in ℜd, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomialtime algorithms are kno ..."
Cited by 105 (1 self)
In k-means clustering we are given a set of n data points in d-dimensional space ℜ^d and an integer k, and the problem is to determine a set of k points in ℜ^d, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the very high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance. We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9 + ε)-approximation algorithm. We present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least (9 − ε) in all sufficiently high dimensions. Thus, our approximation factor is almost tight for algorithms based on performing a fixed number of swaps. To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with
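The "swapping centers in and out" heuristic is easy to sketch: start from arbitrary centers, and while some single swap of a current center for a non-center improves the cost, perform the best such swap. This toy version works in 1-D and restricts candidate centers to the input points (a common discrete simplification, not the paper's exact setting):

```python
def kmeans_cost(points, centers):
    # total squared distance from each point to its nearest center (1-D)
    return sum(min((p - c) ** 2 for c in centers) for p in points)

def single_swap_local_search(points, k):
    centers = list(points[:k])                # arbitrary initial centers
    while True:
        cost = kmeans_cost(points, centers)
        best = None
        for i in range(k):                    # try swapping each center out ...
            for p in points:                  # ... for each non-center point
                if p in centers:
                    continue
                trial = centers[:i] + [p] + centers[i + 1:]
                t = kmeans_cost(points, trial)
                if t < cost and (best is None or t < best[0]):
                    best = (t, trial)
        if best is None:                      # no improving swap: local optimum
            return centers
        centers = best[1]
```

The paper's (9 + ε) guarantee is about exactly such single-swap local optima; allowing multi-swaps tightens the constant.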
Self-improving algorithms
In SODA ’06: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms
"... We investigate ways in which an algorithm can improve its expected performance by finetuning itself automatically with respect to an arbitrary, unknown input distribution. We give such selfimproving algorithms for sorting and computing Delaunay triangulations. The highlights of this work: (i) an al ..."
Cited by 34 (6 self)
We investigate ways in which an algorithm can improve its expected performance by fine-tuning itself automatically with respect to an arbitrary, unknown input distribution. We give such self-improving algorithms for sorting and computing Delaunay triangulations. The highlights of this work: (i) an algorithm to sort a list of numbers with optimal expected limiting complexity; and (ii) an algorithm to compute the Delaunay triangulation of a set of points with optimal expected limiting complexity. In both cases, the algorithm begins with a training phase during which it adjusts itself to the input distribution, followed by a stationary regime in which the algorithm settles to its optimized incarnation.
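The train-then-run structure can be sketched for sorting: the training phase learns approximate quantile boundaries from inputs drawn from the (unknown) distribution, and the stationary regime bucket-sorts new inputs against those boundaries, so elements land in small, nearly balanced buckets. This is a simplified caricature of the paper's sorter (which learns per-position predictors), with illustrative names:

```python
import bisect

class SelfImprovingSorter:
    def __init__(self, n_buckets=16):
        self.n_buckets = n_buckets
        self.boundaries = None

    def train(self, training_inputs):
        # training phase: learn approximate quantiles of the input distribution
        sample = sorted(x for inp in training_inputs for x in inp)
        step = max(1, len(sample) // self.n_buckets)
        self.boundaries = sample[step::step][: self.n_buckets - 1]

    def sort(self, xs):
        # stationary regime: bucket by learned boundaries, sort small buckets
        buckets = [[] for _ in range(len(self.boundaries) + 1)]
        for x in xs:
            buckets[bisect.bisect_left(self.boundaries, x)].append(x)
        out = []
        for b in buckets:
            b.sort()
            out.extend(b)
        return out
```

The output is correct for any boundaries (buckets partition the line in order); the learned boundaries only affect speed, which is the self-improving idea.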
Optimal Time Bounds for Approximate Clustering
2002
"... Clusteringisafundamentalprobleminunsupervised learning, andhasbeenstudiedwidelyboth asaproblemoflearningmixture modelsandasanoptimizationproblem. Inthispaper, we studyclusteringwithrespectthe kmedian objectivefunction, anaturalformulationofclusteringin whichweattempttominimize the average distance ..."
Cited by 34 (2 self)
Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique that we call successive sampling that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(k log(n/k))) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say
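The successive-sampling idea can be sketched as: draw a small sample, use it to "cover" the half of the remaining points closest to the sample, discard that covered half, and repeat on the rest; the union of the samples is the summary. Since the working set halves each round, roughly log(n/k) rounds of O(k)-sized samples give an O(k log(n/k))-sized summary. The sketch below uses 1-D points and an arbitrary constant 3k for the sample size — both illustrative choices, not the paper's:

```python
import random

def successive_sample(points, k):
    sample_size = 3 * k                      # "small" sample; the constant is illustrative
    remaining = list(points)
    summary = []
    while len(remaining) > sample_size:
        s = random.sample(remaining, sample_size)
        summary.extend(s)
        # order remaining points by distance to the nearest sample point (1-D)
        by_dist = sorted(remaining, key=lambda p: min(abs(p - q) for q in s))
        remaining = by_dist[len(by_dist) // 2:]   # discard the well-covered half
    summary.extend(remaining)                     # final few points kept verbatim
    return summary
```

The full algorithm then runs a k-median routine on the weighted summary instead of on all n points, which is where the O(nk) bound comes from.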
Fast clustering using MapReduce
In KDD, 2011
"... Clustering problems have numerous applications and are becoming more challenging with the growing size of data available. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the ..."
Cited by 23 (4 self)
Clustering problems have numerous applications and are becoming more challenging with the growing size of data available. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems k-center and k-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis showing several clustering algorithms are in MRC^0, a theoretical MapReduce class introduced by Karloff et al. [26]. Our algorithms use sampling to decrease the data size and run a time consuming clustering algorithm such as local search or Lloyd’s algorithm on the reduced data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that our algorithms’ solutions are similar or better than the other algorithms, while running faster than any other parallel algorithm that was tested for sufficiently large data sets.
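The "sample, then run an expensive algorithm on the reduced set" pattern from this abstract can be sketched in a few lines: a map phase where each partition is independently thinned by sampling, and a reduce phase running Lloyd's algorithm on the small surviving set. This is a single-process caricature of the MapReduce structure, with 1-D points, a quantile seeding choice, and a 0.2 sample rate that are all illustrative assumptions:

```python
import random

def map_phase(partitions, rate):
    # each "mapper" independently thins its partition by random sampling
    return [x for part in partitions for x in part if random.random() < rate]

def lloyd(points, k, iters=25):
    # Lloyd's algorithm on the reduced data (1-D; seeded at quantiles)
    pts = sorted(points)
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            clusters[min(range(k), key=lambda j: abs(p - centers[j]))].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def mapreduce_cluster(partitions, k, rate=0.2):
    # one map round + one reduce round: a constant number of rounds, as in the paper
    return lloyd(map_phase(partitions, rate), k)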
On the Implementation of a Swap-Based Local Search Procedure for the p-Median Problem
2002
"... We present a new implementation of a widely used swapbased local search procedure for the pmedian problem. It produces the same output as the best implementation described in the literature and has the same worstcase complexity, but, through the use of extra memory, it can be significantly faster ..."
Cited by 19 (6 self)
We present a new implementation of a widely used swap-based local search procedure for the p-median problem. It produces the same output as the best implementation described in the literature and has the same worst-case complexity, but, through the use of extra memory, it can be significantly faster in practice: speedups of up to three orders of magnitude were observed.
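The "extra memory" trade-off behind such implementations is typically a cache of each user's closest and second-closest open facility: with that cache, the cost change of any single swap can be evaluated in one pass over the users, without recomputing full assignments. A minimal sketch of that idea (not the paper's actual data structures, which are more elaborate):

```python
def total_cost(dist, centers):
    # dist[u][f] = distance from user u to facility f
    return sum(min(row[c] for c in centers) for row in dist)

def precompute(dist, centers):
    # cache each user's closest and second-closest open facility (needs p >= 2)
    info = []
    for row in dist:
        (d1, f1), (d2, f2) = sorted((row[c], c) for c in centers)[:2]
        info.append((d1, f1, d2))
    return info

def swap_delta(dist, info, f_out, f_in):
    # cost change of closing f_out and opening f_in, from the cache alone
    delta = 0.0
    for u, (d1, f1, d2) in enumerate(info):
        d_in = dist[u][f_in]
        if f1 == f_out:
            delta += min(d_in, d2) - d1    # user lost its closest facility
        else:
            delta += min(0.0, d_in - d1)   # user may switch to the new facility
    return delta
```

After a swap is accepted the cache must be repaired, of course; keeping that repair cheap is where most of the engineering in such implementations goes.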
A 2-Approximation Algorithm for the Soft-Capacitated Facility Location Problem
Proceedings of the 6th International Workshop on Approximation Algorithms for Combinatorial Optimization (APPROX), LNCS 2764, 2003
"... This paper is divided into two parts. In the first part of this paper, we present a 2approximation algorithm for the softcapacitated facility location problem. This achieves the integrality gap of the natural LP relaxation of the problem. The algorithm is based on an improved analysis of an algo ..."
Cited by 18 (4 self)
This paper is divided into two parts. In the first part of this paper, we present a 2-approximation algorithm for the soft-capacitated facility location problem. This achieves the integrality gap of the natural LP relaxation of the problem. The algorithm is based on an improved analysis of an algorithm for the linear facility location problem, and a bifactor approximate reduction from this problem to the soft-capacitated facility location problem. We will define and use the concept of bifactor approximate reductions to improve the approximation factor of several other variants of the facility location problem. In the second part of the paper, we present an alternative analysis of the authors' 1.52-approximation algorithm for the uncapacitated facility location problem, using a single factor-revealing LP. This answers an open question of [18]. Furthermore, this analysis, combined with a recent result of Thorup [25], shows that our algorithm can be implemented in quasi-linear time, achieving the best known approximation factor in the best possible running time.
Clustering for Metric and Non-Metric Distance Measures
2009
"... We study a generalization of the kmedian problem with respect to an arbitrary dissimilarity measure D. Given a finite set P of size n, our goal is to find a set C of size k such that the sum of errors D(P, C) = ∑ D(p, c) is minimized. The main result in this paper can be p∈P minc∈C stated as follo ..."
Cited by 8 (1 self)
We study a generalization of the k-median problem with respect to an arbitrary dissimilarity measure D. Given a finite set P of size n, our goal is to find a set C of size k such that the sum of errors D(P, C) = ∑_{p∈P} min_{c∈C} D(p, c) is minimized. The main result in this paper can be stated as follows: There exists a (1+ɛ)-approximation algorithm for the k-median problem with respect to D, if the 1-median problem can be approximated within a factor of (1+ɛ) by taking a random sample of constant size and solving the 1-median problem on the sample exactly. This algorithm requires time n · 2^{O(mk log(mk/ɛ))}, where m is a constant that depends only on ɛ and D. Using this characterization, we obtain the first linear time (1+ɛ)-approximation algorithms for the k-median problem in an arbitrary metric space with bounded doubling dimension, for the Kullback-Leibler divergence (relative entropy), for the Itakura-Saito divergence, for Mahalanobis distances, and for some special cases of Bregman divergences. Moreover, we obtain previously known results for the Euclidean k-median problem and the Euclidean k-means problem in a simplified manner. Our results are based on a new analysis of an algorithm of Kumar, Sabharwal, and Sen from FOCS 2004.
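The sampling condition at the heart of this characterization — approximating the 1-median from a constant-size random sample — can be sketched in a few lines. The variant below scores each sampled candidate against the full point set rather than solving the 1-median exactly on the sample as the paper requires, so it is a simplification; the default dissimilarity and sample size are illustrative:

```python
import random

def sample_1median(points, dissim=lambda a, b: abs(a - b), sample_size=10):
    # score only a small random sample of candidate centers against all points
    candidates = random.sample(points, min(sample_size, len(points)))
    return min(candidates, key=lambda c: sum(dissim(p, c) for p in points))
```

The paper's point is that whenever a dissimilarity D admits such a sample-based 1-median approximation — as the Kullback-Leibler, Itakura-Saito, and Mahalanobis cases do — the full k-median problem inherits a (1+ɛ)-approximation.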