## Clustering Data Streams (2000)

Citations: 252 (13 self)

### BibTeX

@INPROCEEDINGS{Guha00clusteringdata,
    author = {Sudipto Guha and Nina Mishra and Rajeev Motwani and Liadan O'Callaghan},
    title = {Clustering Data Streams},
    booktitle = {},
    year = {2000},
    pages = {359--366}
}

### Abstract

We study clustering under the data stream model of computation: given a sequence of points, the objective is to maintain a consistently good clustering of the sequence observed so far, using a small amount of memory and time. The data stream model is relevant to new classes of applications involving massive data sets, such as web click stream analysis and multimedia data analysis. We give constant-factor approximation algorithms for the k-Median problem in the data stream model of computation in a single pass. We also show negative results implying that our algorithms cannot be improved in a certain sense.
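
To make the single-pass, small-memory setting concrete, the following is a minimal Python sketch of a generic chunk-and-summarize pattern for k-Median on a stream. It is an illustration under stated assumptions and should not be read as the authors' algorithm: the function names (`kmedian_cost`, `greedy_kmedian`, `summarize`, `stream_kmedian`), the `chunk_size` parameter, and the Euclidean distance are all hypothetical choices, and the greedy subroutine is a simple heuristic that carries no constant-factor guarantee. The cost is written as a weighted sum of distances, which matches the average distance up to a factor of 1/n.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple

Point = Tuple[float, ...]
Weighted = Tuple[Point, float]  # (point, weight)


def kmedian_cost(points: Sequence[Weighted], centers: Sequence[Point]) -> float:
    """Weighted sum of distances from each point to its nearest center."""
    return sum(w * min(math.dist(p, c) for c in centers) for p, w in points)


def greedy_kmedian(points: Sequence[Weighted], k: int) -> List[Point]:
    """Hypothetical offline subroutine: repeatedly add the input point that
    lowers the cost most. A heuristic with no approximation guarantee."""
    candidates = list(dict.fromkeys(p for p, _ in points))  # distinct points
    centers: List[Point] = []
    while len(centers) < k and len(centers) < len(candidates):
        remaining = [c for c in candidates if c not in centers]
        best = min(remaining, key=lambda c: kmedian_cost(points, centers + [c]))
        centers.append(best)
    return centers


def summarize(points: Sequence[Weighted], k: int) -> List[Weighted]:
    """Replace the points by at most k centers, each weighted by the total
    weight of the points assigned to it."""
    centers = greedy_kmedian(points, k)
    weights: Dict[Point, float] = {c: 0.0 for c in centers}
    for p, w in points:
        nearest = min(centers, key=lambda c: math.dist(p, c))
        weights[nearest] += w
    return list(weights.items())


def stream_kmedian(stream: Iterable[Sequence[float]], k: int, chunk_size: int) -> List[Point]:
    """Single pass over the stream, holding fewer than 2 * chunk_size
    weighted points in memory at any time."""
    if k < 1 or chunk_size <= k:
        raise ValueError("need k >= 1 and chunk_size > k")
    summary: List[Weighted] = []  # weighted centers kept from earlier chunks
    buffer: List[Weighted] = []   # raw points of the current chunk
    for p in stream:
        buffer.append((tuple(p), 1.0))
        if len(buffer) == chunk_size:
            summary.extend(summarize(buffer, k))
            buffer = []
            if len(summary) >= chunk_size:  # re-compress the summary itself
                summary = summarize(summary, k)
    if buffer:  # leftover partial chunk
        summary.extend(summarize(buffer, k))
    return greedy_kmedian(summary, k)
```

Because the summary plus the current buffer always describe the whole prefix seen so far, the final `greedy_kmedian` step could in principle be run at any point to report a clustering of the stream observed up to that moment.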

### Citations

319 | Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation - Jain, Vazirani
Citation Context ...Theorem 2.1). Throughout the paper we also assume that the input points are drawn from a metric space. In the recent past, several approximation algorithms have been proposed for the k-Median problem [3, 10, 2]. These algorithms require O(n^2) space to compute the dual variables or primal constraints. We will be interested in algorithms which use more than k medians but run in linear space [12, 2, 9]. Char... |

212 | A constant-factor approximation algorithm for the k-median problem (extended abstract) - Charikar, Guha, et al. - 1999 |

205 | Improved combinatorial algorithms for facility location problems - Charikar, Guha |

155 | Computing on data streams - Henzinger, Raghavan, et al. - 1998
Citation Context ... Paterson [16], where they studied the space requirement of selection and sorting as a function of the number of passes over the data. The model was formalized by Henzinger, Raghavan, and Rajagopalan [7], who gave several algorithms and complexity results related to graph-theoretic problems and their applications. Other recent results on data streams can be found in [4, 13, 14, 6]. Related Work on ... |

153 | Incremental clustering and dynamic information retrieval - Charikar, Chekuri, et al. |

146 | Analysis of a local search heuristic for facility location problems - Korupolu, Plaxton, et al.
Citation Context ...oblem [3, 10, 2]. These algorithms require O(n^2) space to compute the dual variables or primal constraints. We will be interested in algorithms which use more than k medians but run in linear space [12, 2, 9]. Charikar, Chekuri, Feder, and Motwani [1] gave a constant-factor algorithm for the incremental k-Center problem, which is also a single-pass algorithm requiring O(nk log k) time and O(k) space. Ther... |

118 | Selection and sorting with limited storage - Munro, Paterson - 1980
Citation Context ...ic Õ(nk)-time, polylog(n)-approximation single-pass algorithm that uses n space, for s1. Related Work on Data Streams. One of the first results in data streams was the result of Munro and Paterson [16], where they studied the space requirement of selection and sorting as a function of the number of passes over the data. The model was formalized by Henzinger, Raghavan, and Rajagopalan [7], who gave ... |

114 | Approximate medians and other quantiles in one pass and with limited memory - Manku, Rajagopalan, et al. - 1998
Citation Context ...r, Raghavan, and Rajagopalan [7], who gave several algorithms and complexity results related to graph-theoretic problems and their applications. Other recent results on data streams can be found in [4, 13, 14, 6]. Related Work on Clustering. In this paper we shall consider models in which clusters have a distinguished point, or "center." In the k-Median problem, the objective is to minimize the average distanc... |

98 | Random sampling techniques for space efficient online computation of order statistics of large datasets - Manku, Rajagopalan, et al. - 1999 |

87 | An approximate L1-difference algorithm for massive data streams - Feigenbaum, Kannan, et al. - 2000 |

81 | A constant-factor approximation algorithm for the multicommodity rent-or-buy problem - Kumar, Gupta, et al. - 2002 |

79 | Sublinear time algorithms for metric space problems - Indyk - 1999 |

67 | Probabilistic counting - Flajolet, Martin - 1983 |

54 | Randomized query processing in robot path planning - Kavraki, Latombe, et al. - 1995
Citation Context ...factor k-clustering can be computed with t queries to the distance function iff a graph k-partition can be computed with t queries to the adjacency matrix of G. Kavraki, Latombe, Motwani, and Raghavan [8] show that any deterministic algorithm that finds a Graph k-Partition requires Ω(nk) queries to the adjacency matrix of G. This result establishes a deterministic lower bound for k-Median. Theorem 5.1 A de... |

28 | Hardness of approximating MAX k-CUT and its dual - Kann, Khanna, et al. - 1997 |

11 | Über den Standort der Industrien, Erster Teil: Reine Theorie des Standortes. Tübingen - Weber - 1909
Citation Context ...nguished point, or "center." In the k-Median problem, the objective is to minimize the average distance from data points to their closest cluster centers. The 1-median problem was first posed by Weber [17]. In the k-Center problem, the objective is to minimize the maximum radius of a cluster. The above problems are all NP-hard, so we will be concerned with approximation algorithms. We will assume that ... |

10 | An approximate L1-difference algorithm for massive data streams - Feigenbaum, Kannan, et al. - 1999 |

10 | Fast Monte Carlo algorithms for finding low-rank approximations - Frieze, Kannan, et al. |

9 | On the hardness of approximating max-k-cut and its dual - Kann, Khanna, et al. - 1997 |

8 | Random sampling techniques for space efficient online computation of order statistics of large datasets - Manku, Rajagopalan, et al. - 1999 |

4 | WaySublinear Time Approximate (PAC) Clustering - Mishra, Oblinger, et al. - 2000
Citation Context ... algorithm with the local search tradeoff results in [2] reduces the space requirement to O(sqrt(nk)). Alternate sampling-based results exist for the k-Median measure that do extend to the weighted case [15], however these results assume Euclidean space. 4.2 Extension to the Weighted Case. We need this sampling-based algorithm to work on weighted input. It is necessary to draw a random sample based on the... |

2 | Incremental clustering and dynamic information retrieval - Charikar, Chekuri, et al. - 1997
Citation Context ... space to compute the dual variables or primal constraints. We will be interested in algorithms which use more than k medians but run in linear space [12, 2, 9]. Charikar, Chekuri, Feder, and Motwani [1] gave a constant-factor algorithm for the incremental k-Center problem, which is also a single-pass algorithm requiring O(nk log k) time and O(k) space. There is a large difference, however, between th... |

2 | Fast Monte Carlo algorithms for low-rank approximations. submitted - Frieze, Kannan, et al.
Citation Context ...hese outliers. If the points have weights, however, in the first step we may only eliminate k points. Therefore sampling according to weights does not carry through. Contrast this with the algorithm in [5] where the points were in Euclidean space and the measure was sum of squares of distances. Both these facts were crucial for their algorithm. We suggest the following modification. The basic idea is sc... |