Results 1–10 of 12
Clustering data streams: Theory and practice
IEEE TKDE, 2003
Cited by 106 (2 self)
The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams. Index Terms: Clustering, data streams, approximation algorithms.
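As a rough illustration of the single-pass idea, the divide-and-conquer scheme commonly used for stream clustering can be sketched in a few lines: cluster each incoming chunk, retain only its centers weighted by cluster size, and cluster the retained weighted centers at the end. The Python sketch below is a toy version on 1-D data using weighted Lloyd iterations; it is not the paper's algorithm (which comes with k-median-style guarantees), and all names are hypothetical.

```python
import random

def nearest(p, centers):
    # Index of the center closest to point p (1-D, squared distance).
    return min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)

def lloyd(points, weights, k, iters=20):
    # Weighted Lloyd's k-means on 1-D points (kept 1-D for brevity).
    centers = random.sample(points, k)
    for _ in range(iters):
        sums, tots = [0.0] * k, [0.0] * k
        for p, w in zip(points, weights):
            j = nearest(p, centers)
            sums[j] += p * w
            tots[j] += w
        centers = [sums[j] / tots[j] if tots[j] else centers[j] for j in range(k)]
    return centers

def stream_cluster(stream, k, chunk=200):
    # One-pass divide-and-conquer: cluster each chunk, retain its k centers
    # weighted by cluster size, then cluster the retained weighted centers.
    kept_p, kept_w, buf = [], [], []
    def flush():
        if not buf:
            return
        cs = lloyd(buf, [1.0] * len(buf), k)
        counts = [0.0] * k
        for p in buf:
            counts[nearest(p, cs)] += 1
        kept_p.extend(cs)
        kept_w.extend(counts)
        buf.clear()
    for x in stream:
        buf.append(x)
        if len(buf) == chunk:
            flush()
    flush()
    return lloyd(kept_p, kept_w, k)
```

Only one chunk is ever held in memory; with well-separated clusters the recovered centers land near the true means.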
Greedy Facility Location Algorithms Analyzed Using Dual Fitting with Factor-Revealing LP
Journal of the ACM, 2001
Cited by 100 (13 self)
We present a natural greedy algorithm for the metric uncapacitated facility location problem and use the method of dual fitting to analyze its approximation ratio, which turns out to be 1.861. The running time of our algorithm is O(m log m), where m is the total number of edges in the underlying complete bipartite graph between cities and facilities. We use our algorithm to improve recent results for some variants of the problem, such as the fault tolerant and outlier versions. In addition, we introduce a new variant which can be seen as a special case of the concave cost version of this problem.
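The flavor of the greedy algorithm can be conveyed by the classic "cheapest star" heuristic for metric facility location: repeatedly open the facility-plus-client-subset whose average opening-plus-connection cost is smallest. The Python sketch below is a simplified illustration only; it omits the reassignment of already-connected cities that the paper's algorithm allows, and the 1.861 dual-fitting guarantee belongs to the paper's analysis, not to this toy version. All names are hypothetical.

```python
def greedy_facility_location(open_cost, dist):
    # Cheapest-star greedy: dist[i][j] is the cost of connecting city j to
    # facility i; open_cost[i] is the cost of opening facility i.
    n_fac, n_city = len(open_cost), len(dist[0])
    unconnected = set(range(n_city))
    opened, assignment = set(), {}
    while unconnected:
        best = None  # (average cost, facility, chosen cities)
        for i in range(n_fac):
            fee = 0.0 if i in opened else open_cost[i]  # opening cost paid once
            cities = sorted(unconnected, key=lambda j: dist[i][j])
            run = fee
            # best prefix of the sorted city list gives the cheapest star for i
            for t, j in enumerate(cities, 1):
                run += dist[i][j]
                if best is None or run / t < best[0]:
                    best = (run / t, i, cities[:t])
        _, i, chosen = best
        opened.add(i)
        for j in chosen:
            assignment[j] = i
        unconnected.difference_update(chosen)
    return opened, assignment
```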
A local search approximation algorithm for k-means clustering
2004
Cited by 71 (1 self)
In k-means clustering we are given a set of n data points in d-dimensional space ℜ^d and an integer k, and the problem is to determine a set of k points in ℜ^d, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the very high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance. We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9 + ε)-approximation algorithm. We present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least (9 − ε) in all sufficiently high dimensions. Thus, our approximation factor is almost tight for algorithms based on performing a fixed number of swaps. To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with …
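The swap heuristic described above can be sketched as a first-improvement single-swap local search. In the Python sketch below (1-D points for brevity) the candidate centers are simply the data points themselves, whereas the paper draws candidates from a carefully chosen discretization; the (9 + ε) guarantee belongs to the paper's analysis, not to this toy version.

```python
def cost(points, centers):
    # k-means objective: sum of squared distances to the nearest center.
    return sum(min((p - c) ** 2 for c in centers) for p in points)

def swap_local_search(points, candidates, k):
    # First-improvement single-swap local search: replace one current center
    # with one candidate outside the solution whenever the swap strictly
    # lowers the k-means cost; stop at a local optimum.
    centers = list(candidates[:k])
    cur = cost(points, centers)
    improved = True
    while improved:
        improved = False
        for out in centers:
            for inc in candidates:
                if inc in centers:
                    continue
                trial = [inc if c == out else c for c in centers]
                trial_cost = cost(points, trial)
                if trial_cost < cur:
                    centers, cur = trial, trial_cost
                    improved = True
                    break
            if improved:
                break
    return centers
```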
Optimal Time Bounds for Approximate Clustering
2002
Cited by 32 (2 self)
Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique that we call successive sampling that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(k log(n/k))) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say …
Self-improving algorithms
in SODA ’06: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms
Cited by 26 (4 self)
We investigate ways in which an algorithm can improve its expected performance by fine-tuning itself automatically with respect to an arbitrary, unknown input distribution. We give such self-improving algorithms for sorting and computing Delaunay triangulations. The highlights of this work: (i) an algorithm to sort a list of numbers with optimal expected limiting complexity; and (ii) an algorithm to compute the Delaunay triangulation of a set of points with optimal expected limiting complexity. In both cases, the algorithm begins with a training phase during which it adjusts itself to the input distribution, followed by a stationary regime in which the algorithm settles to its optimized incarnation.
A 2-Approximation Algorithm for the Soft-Capacitated Facility Location Problem
Proceedings of the 6th International Workshop on Approximation Algorithms for Combinatorial Optimization (APPROX), LNCS 2764, 2003
Cited by 15 (4 self)
This paper is divided into two parts. In the first part, we present a 2-approximation algorithm for the soft-capacitated facility location problem. This achieves the integrality gap of the natural LP relaxation of the problem. The algorithm is based on an improved analysis of an algorithm for the linear facility location problem, and a bifactor approximate reduction from this problem to the soft-capacitated facility location problem. We will define and use the concept of bifactor approximate reductions to improve the approximation factor of several other variants of the facility location problem. In the second part of the paper, we present an alternative analysis of the authors' 1.52-approximation algorithm for the uncapacitated facility location problem, using a single factor-revealing LP. This answers an open question of [18]. Furthermore, this analysis, combined with a recent result of Thorup [25], shows that our algorithm can be implemented in quasi-linear time, achieving the best known approximation factor in the best possible running time.
On the Implementation of a Swap-Based Local Search Procedure for the p-Median Problem
2002
Cited by 13 (5 self)
We present a new implementation of a widely used swap-based local search procedure for the p-median problem. It produces the same output as the best implementation described in the literature and has the same worst-case complexity, but, through the use of extra memory, it can be significantly faster in practice: speedups of up to three orders of magnitude were observed.
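A rough Python sketch of the kind of bookkeeping that trades memory for speed in a swap-based search: cache, for every user, the closest (d1) and second-closest (d2) open facilities, so the profit of every (insert g, remove f) swap can be accumulated in a single pass over the users instead of re-evaluating each swap from scratch. This is an illustrative reconstruction under an assumed data layout (dist[u][f] as a distance matrix), not the paper's actual implementation; all names are hypothetical.

```python
def best_swap(sol, dist):
    # Evaluate every swap (insert facility g, remove facility f) in one pass.
    # save[g]: saving from inserting g alone (users move to g if it is closer).
    # loss[f]: extra cost of removing f alone (f's users fall back to d2).
    # extra[g][f]: correction when g is inserted AND f removed together,
    # since f's users then pay min(d2, dist to g) rather than d2.
    n_users, n_fac = len(dist), len(dist[0])
    sol = set(sol)
    save = [0.0] * n_fac
    loss = {f: 0.0 for f in sol}
    extra = [dict.fromkeys(sol, 0.0) for _ in range(n_fac)]
    for u in range(n_users):
        f1, f2 = sorted(sol, key=lambda f: dist[u][f])[:2]
        d1, d2 = dist[u][f1], dist[u][f2]
        loss[f1] += d2 - d1
        for g in range(n_fac):
            if g in sol:
                continue
            dug = dist[u][g]
            save[g] += max(0.0, d1 - dug)
            extra[g][f1] += d2 - min(d2, dug) - max(0.0, d1 - dug)
    # profit of swapping g in and f out = save[g] - loss[f] + extra[g][f]
    return max((save[g] - loss[f] + extra[g][f], g, f)
               for g in range(n_fac) if g not in sol for f in sol)
```

A full local search would apply the best swap while its profit is positive, updating the cached closest/second-closest entries incrementally.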
Hierarchical Traffic Grooming in Large-Scale WDM Networks
2005
Cited by 3 (2 self)
The advances in fiber optics and wavelength division multiplexing (WDM) technology are viewed as the key to satisfying the data-driven bandwidth demand of today’s Internet. The mismatch between user bandwidth needs and wavelength capacity makes it clear that some multiplexing should be done to use the wavelength capacity efficiently, which reduces the cost of line terminating equipment (LTE). This technique is referred to as traffic grooming. Previous studies have concentrated on different objectives, or on special network topologies such as rings. In our study, we aim to minimize the LTE cost, thereby directly targeting the network cost. As a starting point, we examine the grooming problem in elemental topologies. First, we prove that traffic grooming in path, ring, and star topology networks with the cost function we consider is NP-Complete. We also show the same complexity results for a Min-Max objective that has not been considered before, on the two elementary topologies. We then design polynomial-time heuristic algorithms for the grooming problem in rings (and thus implicitly paths) and stars for networks of larger size. Experiments on various network sizes and traffic patterns …
Fast clustering using MapReduce
In KDD, 2011
Cited by 3 (1 self)
Clustering problems have numerous applications and are becoming more challenging with the growing size of data available. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems k-center and k-median. We develop fast clustering algorithms with constant-factor approximation guarantees. From a theoretical perspective, we give the first analysis showing that several clustering algorithms are in MRC^0, a theoretical MapReduce class introduced by Karloff et al. [26]. Our algorithms use sampling to decrease the data size and run a time-consuming clustering algorithm, such as local search or Lloyd’s algorithm, on the reduced data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that our algorithms’ solutions are similar to or better than those of the other algorithms, while running faster than any other parallel algorithm tested, for sufficiently large data sets.
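The sample-then-cluster pipeline described above can be mimicked in a few lines: each "mapper" down-samples its partition, and a single "reducer" runs a sequential clusterer (here, Lloyd's algorithm on 1-D points) on the small union of samples, giving a constant number of rounds. This toy Python emulation is written under assumed names; it is not the paper's algorithms and carries none of their approximation guarantees.

```python
import random

def mapper(partition, rate):
    # Each mapper independently samples its partition to shrink the data.
    return [x for x in partition if random.random() < rate]

def reducer(sampled, k, iters=20):
    # The reducer runs an expensive sequential clusterer (plain Lloyd's
    # algorithm on 1-D points) on the small sampled set.
    centers = random.sample(sampled, k)
    for _ in range(iters):
        sums, cnts = [0.0] * k, [0] * k
        for p in sampled:
            j = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            sums[j] += p
            cnts[j] += 1
        centers = [sums[j] / cnts[j] if cnts[j] else centers[j] for j in range(k)]
    return centers

def mapreduce_cluster(data, k, n_mappers=4, rate=0.1):
    # Constant number of rounds: one map phase (sampling) + one reduce
    # phase (clustering the union of the samples).
    parts = [data[i::n_mappers] for i in range(n_mappers)]
    sampled = [x for part in parts for x in mapper(part, rate)]
    return reducer(sampled, k)
```

In a real MapReduce job the partitions would live on separate machines; here they are slices of one list purely to show the round structure.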