## Conquering the divide: Continuous clustering of distributed data streams (2007)

Venue: Intl. Conf. on Data Engineering

Citations: 23 (3 self)

### BibTeX

@INPROCEEDINGS{Cormode07conqueringthe,
    author = {Graham Cormode},
    title = {Conquering the divide: Continuous clustering of distributed data streams},
    booktitle = {Intl. Conf. on Data Engineering},
    year = {2007}
}

### Abstract

Data is often collected over a distributed network, but in many cases it is so voluminous that collecting it in a central location is impractical and undesirable. Instead, we must perform distributed computations over the data, guaranteeing high-quality answers even as new data arrives. In this paper, we formalize and study the problem of maintaining a clustering of such distributed data that is continuously evolving. In particular, our goal is to minimize the communication and computational cost while still providing guaranteed accuracy of the clustering. We focus on k-center clustering and provide a suite of algorithms that vary based on which centralized algorithm they derive from, and on whether they maintain a single global clustering or many local clusterings that can be merged together. We show that these algorithms can be designed to give accuracy guarantees that are close to the best possible even in the centralized case. In our experiments, we see clear trends among these algorithms, showing that the choice of algorithm is crucial, and that we can achieve a clustering that is as good as the best centralized clustering with only a small fraction of the communication required to collect all the data in a single location.
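To make the k-center objective in the abstract concrete, the following is a minimal Python sketch of the classic greedy farthest-point heuristic (Gonzalez, 1985, listed in the citations below), which is a well-known 2-approximation for centralized k-center. The `merge_local_centers` helper is a hypothetical illustration of the "many local clusterings that can be merged" idea: it simply re-clusters the union of per-site centers. It is not the paper's protocol, all function names are assumptions made for this example, and no accuracy guarantee is claimed for the merge step.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def farthest_point_centers(points: List[Point], k: int) -> List[Point]:
    """Greedy farthest-point traversal: pick up to k centers from points."""
    if not points or k <= 0:
        return []
    centers = [points[0]]  # arbitrary first center
    # nearest[i] = distance from points[i] to its closest chosen center
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        i = max(range(len(points)), key=nearest.__getitem__)
        if nearest[i] == 0:
            break  # every point already coincides with some center
        centers.append(points[i])
        nearest = [min(d, math.dist(p, points[i])) for d, p in zip(nearest, points)]
    return centers


def radius(points: List[Point], centers: List[Point]) -> float:
    """k-center cost: the largest distance from any point to its nearest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)


def merge_local_centers(sites: List[List[Point]], k: int) -> List[Point]:
    """Hypothetical merge step: cluster the union of each site's local centers."""
    local = [c for site in sites for c in farthest_point_centers(site, k)]
    return farthest_point_centers(local, k)


if __name__ == "__main__":
    sites = [[(0, 0), (0, 1), (10, 10)], [(10, 11), (0, 0.5), (20, 20)]]
    centers = merge_local_centers(sites, k=3)
    all_points = [p for site in sites for p in site]
    print(centers, radius(all_points, centers))
```

In this sketch each site would ship at most k points to the coordinator instead of its full data, which is the kind of communication saving the abstract describes; how and when sites should resend their summaries as data evolves is left out entirely.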

### Citations

2161 | Algorithms for Clustering Data
- Jain, Dubes
- 1988
Citation Context ... problem that is NP-hard to optimize [21, 13]. Hence, most clustering work focuses either on efficient algorithms that give good results in practice, such as BIRCH [30], CURE [19], DBSCAN [12], k-means [23] and so on; or on giving guaranteed approximations to particular clustering optimization criteria, such as k-center (minimizing the maximum radius/diameter of any cluster) [17] and k-median (minimizin...

1106 | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
- Ester
- 1996
Citation Context ...ition yield a problem that is NP-hard to optimize [21, 13]. Hence, most clustering work focuses either on efficient algorithms that give good results in practice, such as BIRCH [30], CURE [19], DBSCAN [12], k-means [23] and so on; or on giving guaranteed approximations to particular clustering optimization criteria, such as k-center (minimizing the maximum radius/diameter of any cluster) [17] and k-medi...

568 | Cure: An Efficient Clustering Algorithm for Large Databases
- Guha, Rastogi, et al.
- 1998
Citation Context ...atical definition yield a problem that is NP-hard to optimize [21, 13]. Hence, most clustering work focuses either on efficient algorithms that give good results in practice, such as BIRCH [30], CURE [19], DBSCAN [12], k-means [23] and so on; or on giving guaranteed approximations to particular clustering optimization criteria, such as k-center (minimizing the maximum radius/diameter of any cluster) [17...

439 | BIRCH: An efficient data clustering method for very large databases
- Zhang, Ramakrishnan, et al.
- 1996
Citation Context ...cise mathematical definition yield a problem that is NP-hard to optimize [21, 13]. Hence, most clustering work focuses either on efficient algorithms that give good results in practice, such as BIRCH [30], CURE [19], DBSCAN [12], k-means [23] and so on; or on giving guaranteed approximations to particular clustering optimization criteria, such as k-center (minimizing the maximum radius/diameter of any c...

282 | Clustering to minimize the maximum intercluster distance
- Gonzalez
- 1985
Citation Context ...19], DBSCAN [12], k-means [23] and so on; or on giving guaranteed approximations to particular clustering optimization criteria, such as k-center (minimizing the maximum radius/diameter of any cluster) [17] and k-median (minimizing the sum of distances from each point to its cluster center) [4]. All of these methods assume full access to (static) data, and hence do not naturally adapt to the continuous ...

253 | Clustering data streams
- Guha, Mishra, et al.
- 2001
Citation Context ...factor of 8, using merging techniques and storing only k points. We use similar ideas, but are able to get much better approximations with slightly more storage. The k-median objective was studied in [18], but much more space and much worse approximation factors make this algorithm impractical in our setting. Similarly, more recent work using ε-nets [20] and gridding techniques [22, 15] have addressed ...

243 | A framework for clustering evolving data streams
- Aggarwal, Han, et al.
- 2003
Citation Context ...g in lower-dimensional spaces, but the complexity of the algorithms makes them unusable for our purposes. Prior work on clustering evolving data considered capturing historic trends at a central site [2], whereas our goal is to cluster the current set of point streams. 3 Centralized Clustering Algorithms In this work we concentrate on the k-center objective for clustering, since this is a simple and ...

231 | Local search heuristics for k-median and facility location problems
- Arya, Garg, et al.
- 2004
Citation Context ...ar clustering optimization criteria, such as k-center (minimizing the maximum radius/diameter of any cluster) [17] and k-median (minimizing the sum of distances from each point to its cluster center) [4]. All of these methods assume full access to (static) data, and hence do not naturally adapt to the continuous distributed model. Some prior work touches on aspects of our setting – for example, the D...

194 | Adaptive Filters for Continuous Queries over Distributed Data Streams
- Olston, Jiang, et al.
- 2003
Citation Context ... Section 6, and conclude in Section 7. 2 Preliminaries 2.1 Continuous Distributed Model The Continuous Distributed model of computation has been refined over recent years through a sequence of papers [5, 26, 10, 8, 7]. It abstracts the key features of a broad variety of scenarios and identifies the principal dimensions. In the model there are a set of m different remote sites, each of which observes an update stre...

174 | Underwater acoustic sensor networks: Research challenges
- Akyildiz, Pompili, et al.
- 2005
Citation Context ... quality of the clustering. Motivating Example. Underwater sensor networks are a particularly resource constrained setting because of physical conditions (reduced channel capacity, harsh environment) [3]. A typical problem is when m remote tracking stations are deployed in an underwater acoustic monitoring system. Each station keeps track of certain schools of fishes based on a given wave length, and...

170 | Distributed top-k monitoring
- Babcock, Olston
- 2003
Citation Context ... Section 6, and conclude in Section 7. 2 Preliminaries 2.1 Continuous Distributed Model The Continuous Distributed model of computation has been refined over recent years through a sequence of papers [5, 26, 10, 8, 7]. It abstracts the key features of a broad variety of scenarios and identifies the principal dimensions. In the model there are a set of m different remote sites, each of which observes an update stre...

169 | Optimal algorithms for approximate clustering
- Feder, Greene
- 1988
Citation Context ... different clusters are dissimilar. The definition can be formalized in various ways; however, many attempts to formulate a precise mathematical definition yield a problem that is NP-hard to optimize [21, 13]. Hence, most clustering work focuses either on efficient algorithms that give good results in practice, such as BIRCH [30], CURE [19], DBSCAN [12], k-means [23] and so on; or on giving guaranteed appro...

154 | Incremental Clustering and Dynamic Information Retrieval
- Charikar, Chekuri, et al.
- 1997
Citation Context ...ause such algorithms keep small memory state, this state can sometimes be used as a summary of the much larger data, and shared or merged with others to allow clustering of the union of streams. In [6], the “doubling algorithm” allows the k-center objective to be approximated over a single stream, up to a factor of 8, using merging techniques and storing only k points. We use similar ideas, but are...

95 | A data-clustering algorithm on distributed memory multiprocessors
- Dhillon, Modha
- 2000
Citation Context ...e continuous distributed model. Some prior work touches on aspects of our setting – for example, the DEMON project [16] considers the case when (centralized) data can be somewhat dynamic. In [29] and [11] the authors consider how to perform clustering on parallel processors, which shares some concerns with our setting. However, they have the freedom to preprocess and distribute the data to processors,...

79 | Holistic aggregates in a networked world: distributed tracking of approximate quantiles
- Cormode, Garofalakis, et al.
- 2005
Citation Context ... Section 6, and conclude in Section 7. 2 Preliminaries 2.1 Continuous Distributed Model The Continuous Distributed model of computation has been refined over recent years through a sequence of papers [5, 26, 10, 8, 7]. It abstracts the key features of a broad variety of scenarios and identifies the principal dimensions. In the model there are a set of m different remote sites, each of which observes an update stre...

65 | Sketching streams through the net: Distributed approximate query tracking
- Cormode, Garofalakis
- 2005

55 | DEMON: Mining and Monitoring Evolving Data
- Ganti, Gehrke, et al.
- 2000
Citation Context ...e methods assume full access to (static) data, and hence do not naturally adapt to the continuous distributed model. Some prior work touches on aspects of our setting – for example, the DEMON project [16] considers the case when (centralized) data can be somewhat dynamic. In [29] and [11] the authors consider how to perform clustering on parallel processors, which shares some concerns with our setting...

47 | Distributed data clustering can be efficient and exact
- Forman, Zhang
Citation Context ...ess and distribute the data to processors, whereas in our setting, the allocation of data to sites is fixed. From the Distributed Data Mining community (DDM), distributed clustering has been explored [14]. This motivated the problem of clustering distributed static data, and gave results based on collecting sufficient statistics for density based clustering algorithms from remote sites. [24] extends D...

47 | Coresets for k-means and k-median clustering and their applications
- Har-Peled, Mazumdar
- 2004
Citation Context ...e storage. The k-median objective was studied in [18], but much more space and much worse approximation factors make this algorithm impractical in our setting. Similarly, more recent work using ε-nets [20] and gridding techniques [22, 15] have addressed clustering in lower-dimensional spaces, but the complexity of the algorithms makes them unusable for our purposes. Prior work on clustering evolving da...

41 | Algorithms for dynamic geometric problems over data streams
- Indyk
Citation Context ...tive was studied in [18], but much more space and much worse approximation factors make this algorithm impractical in our setting. Similarly, more recent work using ε-nets [20] and gridding techniques [22, 15] have addressed clustering in lower-dimensional spaces, but the complexity of the algorithms makes them unusable for our purposes. Prior work on clustering evolving data considered capturing historic ...

29 | Easy and hard bottleneck location problems
- Hsu, Nemhauser
- 1979
Citation Context ... different clusters are dissimilar. The definition can be formalized in various ways; however, many attempts to formulate a precise mathematical definition yield a problem that is NP-hard to optimize [21, 13]. Hence, most clustering work focuses either on efficient algorithms that give good results in practice, such as BIRCH [30], CURE [19], DBSCAN [12], k-means [23] and so on; or on giving guaranteed appro...

28 | Distributed set-expression cardinality estimation
- Das, Ganguly, et al.
- 2004

24 | Coresets in dynamic geometric data streams
- Frahling, Sohler
- 2005
Citation Context ...tive was studied in [18], but much more space and much worse approximation factors make this algorithm impractical in our setting. Similarly, more recent work using ε-nets [20] and gridding techniques [22, 15] have addressed clustering in lower-dimensional spaces, but the complexity of the algorithms makes them unusable for our purposes. Prior work on clustering evolving data considered capturing historic ...

22 | What's Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams
- Cormode, Muthukrishnan, et al.
- 2006
Citation Context ...deling radio networks). A variety of problems have been studied within this model, including: monitoring the top-k most frequent items [5]; tracking set expressions and duplicate resilient quantities [10, 9]; monitoring quantiles of a distribution [8]; and finding accurate sketch summaries of the data [7, 9]. In most cases, it is possible to write the communication cost of the protocol in terms of the ac...

22 | A fast parallel clustering algorithm for large spatial databases
- Xu, Jäger, et al.
- 1999
Citation Context ...apt to the continuous distributed model. Some prior work touches on aspects of our setting – for example, the DEMON project [16] considers the case when (centralized) data can be somewhat dynamic. In [29] and [11] the authors consider how to perform clustering on parallel processors, which shares some concerns with our setting. However, they have the freedom to preprocess and distribute the data to pr...

12 | Towards effective and efficient distributed clustering
- Januzaj, Kriegel, et al.
- 2003
Citation Context ... explored [14]. This motivated the problem of clustering distributed static data, and gave results based on collecting sufficient statistics for density based clustering algorithms from remote sites. [24] extends DBSCAN to a distributed setting, again assuming static data. Each one of these touches on some aspects of continuous, distributed clustering, but none fully solve this problem. Also relevant ...

1 | Pseudo periodic synthetic time series. http://kdd.ics.uci.edu/databases/synthetic/synthetic.html
- Keogh, Pazzani
Citation Context ...t to show the effect of high dimensional data (where the size of the points is significantly larger than other data that may be transmitted). Each point is randomly assigned to a site. Synthetic Data [25]. We also ran experiments on a synthetic aperiodic time series consisting of 1 million entries. We again made this into high-dimensional data points by using a sliding window of default size d = 100. ...

1 | Sound Sightings
- Holden
- 2006
Citation Context ...tation. This station maintains a k-clustering of the schools so k attracting or dispelling acoustic devices can be deployed near the k center points to use the minimum energy to cover the whole region [20]. Our Contributions. Our main contributions are as follows: 1. We introduce and motivate the problem of clustering evolving distributed data, minimize the communication between sites and obtain cluste...