## Clustering data streams: Theory and practice (2003)

Venue: IEEE TKDE

Citations: 106 (2 self)

### BibTeX

@ARTICLE{Guha03clusteringdata,

author = {Sudipto Guha and Adam Meyerson and Nina Mishra and Rajeev Motwani},

title = {Clustering data streams: Theory and practice},

journal = {IEEE TKDE},

year = {2003},

volume = {15},

pages = {515--528}

}

### Abstract

The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams.

Index Terms: Clustering, data streams, approximation algorithms.

### Citations

2145 | Algorithms for Clustering Data
- Jain, Dubes
- 1988
Citation Context ...relaxation of the k-Median problem, where the number of centers is unrestricted, but there is an additional cost for each center included in the solution. There is abundant literature on these, books [45], [62], [55], provable algorithms [41], [49], [54], [53], [71], [31], [16], [15], [6], [52], [47], [14], [59], [7], [46], the running time of provable clustering heuristics [21], [10], [42], [34], [73... |

1326 | Finding Groups in Data: An Introduction to Cluster Analysis
- Kaufman, Rousseeuw
- 1990
Citation Context ...ustive study of related work. k-Means is a widely used heuristic, but, since the mean of a cluster is not always defined, alternate medoid-based algorithms have been developed. For example, k-Medoids [50] selects k initial centers and repeatedly replaces an existing center with a randomly chosen point, if doing so improves the sum of squared assignment distances. 2 CLARA [50] proposed sampling to reduce... |
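
The swap heuristic described in the citation context above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from [50] or from the paper: the function names (`squared_dist`, `assignment_cost`, `k_medoids_random_swap`), the fixed `trials` budget, the `seed` parameter, and the use of Euclidean points stored as tuples are all hypothetical choices made for the example.

```python
import random


def squared_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assignment_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_dist(p, c) for c in centers) for p in points)


def k_medoids_random_swap(points, k, trials=200, seed=0):
    """Randomized swap search: start from k random centers, then try
    replacing one center with a randomly chosen point and keep the swap
    only if the total squared assignment distance decreases.

    `trials` (number of attempted swaps) is an assumed stopping rule.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # k distinct initial centers
    cost = assignment_cost(points, centers)
    for _ in range(trials):
        i = rng.randrange(k)                 # index of the center to replace
        candidate = rng.choice(points)       # randomly chosen replacement
        if candidate in centers:
            continue                         # swap would not change the set
        trial = centers[:i] + [candidate] + centers[i + 1:]
        trial_cost = assignment_cost(points, trial)
        if trial_cost < cost:                # accept only improving swaps
            centers, cost = trial, trial_cost
    return centers, cost


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    print(k_medoids_random_swap(data, k=2))
```

Each attempted swap re-scans every point, so one trial costs O(nk) distance evaluations. That cost is what motivates the sampling ideas attributed to CLARA and CLARANS in the neighbouring contexts.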

1094 | A density-based algorithm for discovering clusters in large spatial databases with noise
- Ester, Kriegel, et al.
- 1996
Citation Context ...p, but, whereas CURE is geared toward robustness and clustering arbitrary shapes, the algorithm presented here is designed to produce a provably good clustering. Other known approaches such as DBSCAN [19], OPTICS [5], DENCLUE [39], STING [75], CLIQUE [3], Wave-Cluster [70], and OPTIGRID [40], are not designed to optimize the k-Median objective. 3 A PROVABLE STREAM CLUSTERING FRAMEWORK 3.1 Clustering in... |

896 | Approximation Algorithms
- Vazirani
- 2004
Citation Context ... be labeled with the clusters they belong to. These algorithms output the cluster centers only. More involved definitions of clustering based on other graph theoretic notions exist; cliques [9], cuts [74], conductance [48]. References [18], [69], [2] consider clustering defined by projections onto subspaces. 2.3 Existing Large Scale Clustering Algorithms Clustering is very well-studied in the applied ... |

701 | The space complexity of approximating the frequency moments
- Alon, Matias, et al.
- 1996
Citation Context ...orm by Henzinger et al. [38], although the works of Munro and Paterson [64] and of Flajolet and Martin [23] predate this definition. The interest in the model started from the results of Alon et al. [4], who proved upper and lower bounds for the memory requirements of one-pass algorithms computing statistics over data streams. Clustering, a useful and ubiquitous tool in data analysis, is, in broad s... |

593 | Efficient and effective clustering methods for spatial data mining
- Ng, Han
- 1994
Citation Context ... sum of squared assignment distances. 2 CLARA [50] proposed sampling to reduce the number of exchanges considered since choosing a new medoid among all the remaining points is time-consuming; CLARANS [66] draws a fresh sample of feasible centers before each calculation of improvement. We will later see a slightly differing nonrepeated 2. In Section 4.3, we will see a scheme which at every step will no... |

565 | CURE: an efficient clustering algorithm for large databases
- Guha, Rastogi, et al.
- 1998
Citation Context ...lustering (HAC) heuristics exist. Under the celebrated SLINK heuristic, the distance between clusters A and B is defined by the closest pair of points a ∈ A, b ∈ B. Another hierarchical technique is CURE [35] which represents a cluster by multiple points that are initially well-scattered in the cluster and then shrunk toward the cluster center by a certain fraction. Depending on the values of the CURE par... |

561 | Automatic subspace clustering of high dimensional data for data mining applications
- Agrawal, Gehrke, et al.
- 1998
Citation Context ...d clustering arbitrary shapes, the algorithm presented here is designed to produce a provably good clustering. Other known approaches such as DBSCAN [19], OPTICS [5], DENCLUE [39], STING [75], CLIQUE [3], Wave-Cluster [70], and OPTIGRID [40], are not designed to optimize the k-Median objective. 3 A PROVABLE STREAM CLUSTERING FRAMEWORK 3.1 Clustering in Small Space Data stream algorithms must not have ... |

436 | BIRCH: an efficient data clustering method for very large databases - Zhang, Ramakrishnan, et al. - 1996 |

345 | OPTICS: Ordering points to identify the clustering structure
- Ankerst, Breunig, et al.
- 1999
Citation Context ...as CURE is geared toward robustness and clustering arbitrary shapes, the algorithm presented here is designed to produce a provably good clustering. Other known approaches such as DBSCAN [19], OPTICS [5], DENCLUE [39], STING [75], CLIQUE [3], Wave-Cluster [70], and OPTIGRID [40], are not designed to optimize the k-Median objective. 3 A PROVABLE STREAM CLUSTERING FRAMEWORK 3.1 Clustering in Small Space... |

340 | Probabilistic counting algorithms for data base applications
- Flajolet, Martin
- 1985
Citation Context ...finition of the streaming model, inclusive of multiple passes, was first characterized in this form by Henzinger et al. [38], although the works of Munro and Paterson [64] and of Flajolet and Martin [23] predate this definition. The interest in the model started from the results of Alon et al. [4], who proved upper and lower bounds for the memory requirements of one-pass algorithms computing statisti... |

319 | Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation
- Jain, Vazirani
Citation Context ...additional cost for each center included in the solution. There is abundant literature on these, books [45], [62], [55], provable algorithms [41], [49], [54], [53], [71], [31], [16], [15], [6], [52], [47], [14], [59], [7], [46], the running time of provable clustering heuristics [21], [10], [42], [34], [73], [60], and special metric spaces [6], [52], [43], [68]. The k-Median problem is also relevant i... |

262 | Stable distributions, pseudorandom generators, embeddings and data stream computation - Indyk - 2000 |

261 | Approximation algorithms for facility location problems - Shmoys, Tardos, et al. - 1997 |

255 | On clusterings: Good, bad and spectral - Kannan, Vempala, et al. - 2004 |

252 | Clustering data streams - Guha, Mishra, et al. - 2000 |

246 | Scaling Clustering Algorithms to Large Databases (Microsoft Research)
- Bradley, Fayyad, et al.
Citation Context ...roach. k-medoid approaches, including PAM, CLARA, and CLARANS, are known not to be scalable and, thus, are inappropriate for stream analysis. Other partitioning methods include that of Bradley et al. [11], and its subsequent improvement by Farnstrom et al. [20], which repeatedly takes k weighted centers (initially chosen randomly with weight 1) and as much data as can fit in main memory and computes a... |

245 | Fast algorithms for projected clustering
- Aggarwal, Wolf, et al.
- 1999
Citation Context ...hese algorithms output the cluster centers only. More involved definitions of clustering based on other graph theoretic notions exist; cliques [9], cuts [74], conductance [48]. References [18], [69], [2] consider clustering defined by projections onto subspaces. 2.3 Existing Large Scale Clustering Algorithms Clustering is very well-studied in the applied literature, and the following is by no means a... |

233 | STING: a statistical information grid approach to spatial data mining
- Wang, Yang, et al.
- 1997
Citation Context ...robustness and clustering arbitrary shapes, the algorithm presented here is designed to produce a provably good clustering. Other known approaches such as DBSCAN [19], OPTICS [5], DENCLUE [39], STING [75], CLIQUE [3], Wave-Cluster [70], and OPTIGRID [40], are not designed to optimize the k-Median objective. 3 A PROVABLE STREAM CLUSTERING FRAMEWORK 3.1 Clustering in Small Space Data stream algorithms mu... |

229 | Local search heuristic for k-median and facility location problems
- Arya, Garg, et al.
- 2001
Citation Context ...r each center included in the solution. There is abundant literature on these, books [45], [62], [55], provable algorithms [41], [49], [54], [53], [71], [31], [16], [15], [6], [52], [47], [14], [59], [7], [46], the running time of provable clustering heuristics [21], [10], [42], [34], [73], [60], and special metric spaces [6], [52], [43], [68]. The k-Median problem is also relevant in the context of ... |

227 | Maintaining stream statistics over sliding windows
- Datar, Gionis, et al.
Citation Context ... the data is a stream include: frequency estimation [36], [12], norm estimation [4], [22], [44], order statistics [64], [57], [56], [30], [28], synopsis structures [26], time indexed data [25], [32], [17], [8], signal reconstructions [24], [18], [1], [33], [27], [29], [72]. 2.2 Theoretical Analysis of Clustering For most natural clustering objective functions, the optimization problems turn out to be ... |

214 | An efficient data clustering method for very large databases
- Zhang, Ramakrishnan, et al.
- 1996
Citation Context ...SLINK (HAC) to k-Medoid. Both HAC and CURE are designed to discover clusters of arbitrary shape and, thus, do not necessarily optimize the k-Median objective. Hierarchical algorithms, including BIRCH [76] are known to suffer from the problem that hierarchical merge or split operations are irrevocable [37]. The stream clustering algorithm we present is somewhat similar to CURE. The algorithm given here... |

212 | A constant-factor approximation algorithm for the k-median problem (extended abstract) - Charikar, Guha, et al. - 1999 |

207 | An efficient approach to clustering large multimedia databases with noise
- Hinneburg, Keim
Citation Context ...ared toward robustness and clustering arbitrary shapes, the algorithm presented here is designed to produce a provably good clustering. Other known approaches such as DBSCAN [19], OPTICS [5], DENCLUE [39], STING [75], CLIQUE [3], Wave-Cluster [70], and OPTIGRID [40], are not designed to optimize the k-Median objective. 3 A PROVABLE STREAM CLUSTERING FRAMEWORK 3.1 Clustering in Small Space Data stream a... |

206 | An algorithmic approach to network location problems. I: The p-centers
- Kariv, Hakimi
Citation Context ...e number of centers is unrestricted, but there is an additional cost for each center included in the solution. There is abundant literature on these, books [45], [62], [55], provable algorithms [41], [49], [54], [53], [71], [31], [16], [15], [6], [52], [47], [14], [59], [7], [46], the running time of provable clustering heuristics [21], [10], [42], [34], [73], [60], and special metric spaces [6], [52]... |

205 | Improved combinatorial algorithms for facility location problems
- Charikar, Guha
Citation Context ...onal cost for each center included in the solution. There is abundant literature on these, books [45], [62], [55], provable algorithms [41], [49], [54], [53], [71], [31], [16], [15], [6], [52], [47], [14], [59], [7], [46], the running time of provable clustering heuristics [21], [10], [42], [34], [73], [60], and special metric spaces [6], [52], [43], [68]. The k-Median problem is also relevant in the ... |

189 | A best possible heuristic for the k-center problem
- Hochbaum, Shmoys
- 1985
Citation Context ...ere the number of centers is unrestricted, but there is an additional cost for each center included in the solution. There is abundant literature on these, books [45], [62], [55], provable algorithms [41], [49], [54], [53], [71], [31], [16], [15], [6], [52], [47], [14], [59], [7], [46], the running time of provable clustering heuristics [21], [10], [42], [34], [73], [60], and special metric spaces [6]... |

183 | Greedy strikes back: improved facility location algorithms
- Guha, Khuller
- 1998
Citation Context ...nrestricted, but there is an additional cost for each center included in the solution. There is abundant literature on these, books [45], [62], [55], provable algorithms [41], [49], [54], [53], [71], [31], [16], [15], [6], [52], [47], [14], [59], [7], [46], the running time of provable clustering heuristics [21], [10], [42], [34], [73], [60], and special metric spaces [6], [52], [43], [68]. The k-Medi... |

183 | Space-efficient online computation of quantile summaries - Greenwald, Khanna - 2001 |

176 | Fast Monte-Carlo algorithms for finding low-rank approximations
- Frieze, Kannan, et al.
- 1998
Citation Context ...quency estimation [36], [12], norm estimation [4], [22], [44], order statistics [64], [57], [56], [30], [28], synopsis structures [26], time indexed data [25], [32], [17], [8], signal reconstructions [24], [18], [1], [33], [27], [29], [72]. 2.2 Theoretical Analysis of Clustering For most natural clustering objective functions, the optimization problems turn out to be NP hard. Therefore, most theoretic... |

171 | WaveCluster: A multi-resolution clustering approach for very large spatial databases
- Sheikholeslami, Chatterjee, et al.
- 1998
Citation Context ...rary shapes, the algorithm presented here is designed to produce a provably good clustering. Other known approaches such as DBSCAN [19], OPTICS [5], DENCLUE [39], STING [75], CLIQUE [3], Wave-Cluster [70], and OPTIGRID [40], are not designed to optimize the k-Median objective. 3 A PROVABLE STREAM CLUSTERING FRAMEWORK 3.1 Clustering in Small Space Data stream algorithms must not have large space require... |

153 | Incremental clustering and dynamic information retrieval
- Charikar, Chekuri, et al.
Citation Context ...eometric mean of the cluster corresponding to it. This is the k-Median objective function defined over real spaces in which assignment costs (distances) are replaced by their squares. Charikar et al. [13] gave a constant-factor, single-pass k-Center algorithm using O(nk log k) time and O(k) space. For k-Median, we give a constant-factor, single-pass approximation in time Õ(nk) and sublinear n space ... |

138 | Discrete Location Theory - Mirchandani, Francis - 1990 |

129 | Data-streams and histograms - Guha, Koudas, et al. - 2001 |

122 | Improved approximation algorithms for uncapacitated facility location
- Chudak
- 1998
Citation Context ...icted, but there is an additional cost for each center included in the solution. There is abundant literature on these, books [45], [62], [55], provable algorithms [41], [49], [54], [53], [71], [31], [16], [15], [6], [52], [47], [14], [59], [7], [46], the running time of provable clustering heuristics [21], [10], [42], [34], [73], [60], and special metric spaces [6], [52], [43], [68]. The k-Median pro... |

118 | Selection and sorting with limited storage - Munro, Paterson - 1980 |

115 | A new greedy approach for facility location problems - Jain, Mahdian, et al. - 2002 |

114 | Approximation schemes for euclidean k-medians and related problems
- Arora, Raghavan, et al.
- 1998
Citation Context ...here is an additional cost for each center included in the solution. There is abundant literature on these, books [45], [62], [55], provable algorithms [41], [49], [54], [53], [71], [31], [16], [15], [6], [52], [47], [14], [59], [7], [46], the running time of provable clustering heuristics [21], [10], [42], [34], [73], [60], and special metric spaces [6], [52], [43], [68]. The k-Median problem is als... |

114 | Sampling-based estimation of the number of distinct values of an attribute
- Haas, Naughton, et al.
- 1995
Citation Context ... Algorithms A rich body of fundamental research has emerged in the data stream model of computation. Problems that can be solved in small space when the data is a stream include: frequency estimation [36], [12], norm estimation [4], [22], [44], order statistics [64], [57], [56], [30], [28], synopsis structures [26], time indexed data [25], [32], [17], [8], signal reconstructions [24], [18], [1], [33],... |

113 | Approximate medians and other quantiles in one pass and with limited memory - Manku, Rajagopalan, et al. - 1998 |

108 | Synopsis data structures for massive data sets
- Gibbons, Matias
- 1999
Citation Context ...at can be solved in small space when the data is a stream include: frequency estimation [36], [12], norm estimation [4], [22], [44], order statistics [64], [57], [56], [30], [28], synopsis structures [26], time indexed data [25], [32], [17], [8], signal reconstructions [24], [18], [1], [33], [27], [29], [72]. 2.2 Theoretical Analysis of Clustering For most natural clustering objective functions, the o... |

105 | Fast, small-space algorithms for approximate histogram maintenance - Gilbert, Guha, et al. - 2002 |

104 | How to summarize the universe: Dynamic maintenance of quantiles - Gilbert, Kotidis, et al. - 2002 |

98 | Random sampling techniques for space efficient online computation of order statistics of large datasets
- Manku, Rajagopalan, et al.
- 1999
Citation Context ...ata stream model of computation. Problems that can be solved in small space when the data is a stream include: frequency estimation [36], [12], norm estimation [4], [22], [44], order statistics [64], [57], [56], [30], [28], synopsis structures [26], time indexed data [25], [32], [17], [8], signal reconstructions [24], [18], [1], [33], [27], [29], [72]. 2.2 Theoretical Analysis of Clustering For most n... |

94 | Sampling from a moving window over streaming data
- Babcock, Datar, et al.
- 2002
Citation Context ...ata is a stream include: frequency estimation [36], [12], norm estimation [4], [22], [44], order statistics [64], [57], [56], [30], [28], synopsis structures [26], time indexed data [25], [32], [17], [8], signal reconstructions [24], [18], [1], [33], [27], [29], [72]. 2.2 Theoretical Analysis of Clustering For most natural clustering objective functions, the optimization problems turn out to be NP ha... |

87 | An approximate L1-difference algorithm for massive data streams
- Feigenbaum, Kannan, et al.
- 2000
Citation Context ...ental research has emerged in the data stream model of computation. Problems that can be solved in small space when the data is a stream include: frequency estimation [36], [12], norm estimation [4], [22], [44], order statistics [64], [57], [56], [30], [28], synopsis structures [26], time indexed data [25], [32], [17], [8], signal reconstructions [24], [18], [1], [33], [27], [29], [72]. 2.2 Theoretica... |

87 | Dynamic multidimensional histograms - Thaper, Guha, et al. - 2002 |

85 | Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering
- Hinneburg, Keim
- 1999
Citation Context ...gorithm presented here is designed to produce a provably good clustering. Other known approaches such as DBSCAN [19], OPTICS [5], DENCLUE [39], STING [75], CLIQUE [3], Wave-Cluster [70], and OPTIGRID [40], are not designed to optimize the k-Median objective. 3 A PROVABLE STREAM CLUSTERING FRAMEWORK 3.1 Clustering in Small Space Data stream algorithms must not have large space requirements and, so, our ... |

84 | A Monte Carlo algorithm for fast projective clustering
- Procopiuc, Jones, et al.
- 2002
Citation Context ... to. These algorithms output the cluster centers only. More involved definitions of clustering based on other graph theoretic notions exist; cliques [9], cuts [74], conductance [48]. References [18], [69], [2] consider clustering defined by projections onto subspaces. 2.3 Existing Large Scale Clustering Algorithms Clustering is very well-studied in the applied literature, and the following is by no me... |

84 | Clustering in large graphs and matrices - Drineas, Frieze, et al. - 1999 |