## A Data-Clustering Algorithm On Distributed Memory Multiprocessors (2000)

Venue: In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence

Citations: 97 - 1 self

### BibTeX

@INPROCEEDINGS{Dhillon00adata-clustering,
    author = {Inderjit Dhillon and Dharmendra Modha},
    title = {A Data-Clustering Algorithm On Distributed Memory Multiprocessors},
    booktitle = {Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence},
    year = {2000},
    pages = {245--260}
}

### Abstract

To cluster increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message passing model. The proposed algorithm exploits the inherent data-parallelism in the k-means algorithm. We analytically show that the speedup and the scaleup of our algorithm approach the optimal as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups, for example, 15.62 on 16 nodes, and essentially linear scaleup in the size of the data set and in the number of clusters desired. For a 2 gigabyte test data set, our implementation drives the 16 node SP2 at more than 1.8 gigaflops.

Keywords: k-means, data mining, massive data sets, message-passing, text mining.

1 Introduction

Data sets measuring in gigabytes and even terabytes are now quite common in data and text minin...
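The data-parallel idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (which, according to the citation contexts on this page, is written in C and MPI); it is a single-process Python simulation under stated assumptions. The names `assign_and_accumulate`, `allreduce_sum`, and `parallel_kmeans` are hypothetical, the strided split of the data is an arbitrary choice, and `allreduce_sum` is only a stand-in for a sum all-reduce such as `MPI_Allreduce`.

```python
import math


def assign_and_accumulate(block, centroids):
    """Local work of one simulated process: assign each point in its block
    to the nearest centroid and accumulate per-cluster partial sums."""
    k, d = len(centroids), len(centroids[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    sse = 0.0  # local sum of squared distances to the nearest centroid
    for x in block:
        dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids]
        j = min(range(k), key=dists.__getitem__)
        counts[j] += 1
        sse += dists[j]
        for t in range(d):
            sums[j][t] += x[t]
    return sums, counts, sse


def allreduce_sum(partials):
    """Stand-in (assumption) for a sum all-reduce: after this step every
    process would hold the same global sums, counts, and squared error."""
    k, d = len(partials[0][0]), len(partials[0][0][0])
    sums = [[sum(p[0][j][t] for p in partials) for t in range(d)] for j in range(k)]
    counts = [sum(p[1][j] for p in partials) for j in range(k)]
    sse = sum(p[2] for p in partials)
    return sums, counts, sse


def parallel_kmeans(points, centroids, num_procs, tol=1e-9, max_iter=100):
    """Simulate data-parallel k-means: the points are split across
    num_procs blocks, while every process keeps a full copy of the centroids."""
    blocks = [points[p::num_procs] for p in range(num_procs)]
    prev_mse = math.inf
    mse = math.inf
    for _ in range(max_iter):
        partials = [assign_and_accumulate(b, centroids) for b in blocks]
        sums, counts, sse = allreduce_sum(partials)
        # Identical global totals let each process update centroids identically;
        # an empty cluster keeps its previous centroid.
        centroids = [
            [s / counts[j] for s in sums[j]] if counts[j] else centroids[j]
            for j in range(len(centroids))
        ]
        mse = sse / len(points)  # error measured against the pre-update centroids
        if prev_mse - mse <= tol:
            break
        prev_mse = mse
    return centroids, mse
```

For example, `parallel_kmeans([[0.0], [1.0], [10.0], [11.0]], [[0.0], [10.0]], num_procs=2)` converges to centroids `[[0.5], [10.5]]`, and the result does not depend on `num_procs`, because only the partial sums are combined. As a point of arithmetic on the abstract's figures, a relative speedup of 15.62 on 16 nodes corresponds to a parallel efficiency of 15.62/16 ≈ 0.976.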

### Citations

4074 |
Pattern classification and scene analysis
- Duda, Hart
- 1973
Citation Context ..., on a shared-nothing parallel machine, and analytically and empirically validate our parallelization strategy. Specifically, we propose a parallel version of the popular k-means clustering algorithm =-=[31, 13]-=- based on the message-passing model of parallel computing [32, 33]. To the best of our knowledge, a parallel implementation of the k-means clustering algorithm has not been reported in the literature.... |

1719 | Vector quantization and signal compression - Gersho, Gray - 1992 |

1609 |
A k-means clustering algorithm
- Hartigan, Wong
- 1979
Citation Context ...uping of similar objects is one of the most widely used procedures in data mining [Fayyad et al.1996]. Practical applications of clustering include unsupervised classification and taxonomy generation =-=[Hartigan1975]-=-, nearest neighbor searching [Fukunaga and Narendra1975], scientific discovery [Smyth et al.1997], vector quantization [Gersho and Gray1992], time series analysis [Shaw and King1992], multidimensional... |

1080 | Using MPI: Portable Parallel Programming with the Message Passing Interface, 2nd edition - Gropp, Lusk, et al. - 1999 |

658 | MPI: the Complete Reference - Snir, Otto, et al. - 1996 |

655 | A Cluster-Based Approach to Browsing Large Document Collections - Tukey - 1992 |

499 |
Bayesian classification (AutoClass): Theory and results. chapter 6
- Cheeseman, Stutz
- 1996
Citation Context ...idely used procedures in data mining [14]. Practical applications of clustering include unsupervised classification and taxonomy generation [13], nearest neighbor searching [15], scientific discovery =-=[16, 17]-=-, vector quantization [18], time series analysis [19], and multidimensional visualization [20, 21]. Our interest in clustering stems from the need to mine and analyze heaps of unstructured text docume... |

464 | Birch: An efficient data clustering method for very large databases
- Zhang, Ramakrishnan, et al.
- 1996
Citation Context ...(1). Reasons for popularity of k-means are ease of interpretation, simplicity of implementation, scalability, speed of convergence, adaptability to sparse data, and ease of out-of-core implementation =-=[30, 35, 36]-=-. We present this algorithm in Figure 1, and intuitively explain it below: 1. (Initialization) Select a set of k starting points $\{m_j\}_{j=1}^{k}$ in $\mathbb{R}^d$ (line 5 in Figure 1). The selection may be done in a r... |

432 |
Syntactic Clustering of the Web
- Broder, Glassman, et al.
- 1997
Citation Context ...p for routing new information such as that arriving from newsfeeds and new scientific publications. For experiments describing a certain syntactic clustering of the whole web and its applications, see =-=[22]-=-. For detailed review of various classical text clustering algorithms such as the k-means algorithm and its variants, hierarchical agglomerative clustering, and graph-theoretic methods, see [23, 24]. ... |

347 | Web Document clustering: A feasibility demonstration - Zamir, Etzioni - 1998 |

315 | Concept decompositions for large sparse text data using clustering
- Dhillon, Modha
- 2001
Citation Context ...s, see [23, 24]. Recently, there has been a flurry of activity in this area, see [25-29]. For our recent work on matrix approximations using a variant of the k-means algorithm applied to text data, see =-=[30]-=-. Our results have been extremely promising; their applicability to extremely large collections of text documents requires a highly scalable implementation, and, hence, the motivation for this work. I... |

263 | SPRINT: A scalable parallel classifier for data mining - Shafer, Agrawal, et al. - 1996 |

171 |
Pattern classification and scene analysis
- Duda, Hart
- 1973
Citation Context ...s, on a shared-nothing parallel machine, and analytically and empirically validate our parallelization strategy. Specifically, we propose a parallel version of the popular k-means clustering algorithm =-=[31, 13]-=- based on the message-passing model of parallel computing [32, 33]. To the best of our knowledge, a parallel implementation of the k-means clustering algorithm has not been reported in the literature.... |

156 | Scalable Parallel Data Mining for Association Rules - Han, Karypis, et al. - 1997 |

135 |
Clustering algorithms
- Rasmussen
- 1992
Citation Context ..., see [22]. For detailed review of various classical text clustering algorithms such as the k-means algorithm and its variants, hierarchical agglomerative clustering, and graph-theoretic methods, see =-=[23, 24]-=-. Recently, there has been a flurry of activity in this area, see [25-29]. For our recent work on matrix approximations using a variant of the k-means algorithm applied to text data, see [30]. Our resul... |

124 | A branch and bound algorithm for computing k-nearest neighbors - Fukunaga, Narendra - 1975 |

90 | Document categorization and query generation on the world wide web using webace - Boley, Gini, et al. - 1999 |

86 | Convergence Properties of the K-Means Algorithms - Bottou, Bengio - 1995 |

83 | LogP: A Practical Model of Parallel Computation - Culler, Karp, et al. - 1996 |

81 |
Recent trends in hierarchic document clustering: a critical review
- Willett
- 1988
Citation Context ..., see [22]. For detailed review of various classical text clustering algorithms such as the k-means algorithm and its variants, hierarchical agglomerative clustering, and graph-theoretic methods, see =-=[23, 24]-=-. Recently, there has been a flurry of activity in this area, see [25-29]. For our recent work on matrix approximations using a variant of the k-means algorithm applied to text data, see [30]. Our resul... |

71 | Mining Very Large Databases with Parallel Processing - Freitas, Lavington - 1998 |

63 | Parallel Mining of Association Rules: Design, Implementation and Experience - Agrawal, Shafer - 1996 |

62 | ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets
- Joshi, Karypis, et al.
- 1998
Citation Context ...association rules and classification, see, for example, Agrawal and Shafer [1], Chattratichat et al. [2], Cheung and Xiao [3], Han, Karypis, and Kumar [4], Joshi, Karypis, and Kumar =-=[5]-=-, Kargupta, Hamzaoglu, and Stafford [6], Shafer, Agrawal, and Mehta [7], Srivastava, et al. [8], Zaki, Ho, and Agrawal [9], and Zaki et al. [10]. Also, see Stolorz and Musick [11] and Freitas and Lavi... |

58 | An efficient k-means clustering algorithm - Alsabti, S, et al. - 1998 |

49 | New parallel algorithms for fast discovery of association rules. Data Mining and Knowledge Discovery 1 - Zaki, Parthasarathy, et al. - 1997 |

44 | SONIA: A Service for Organizing Networked Information Autonomously - Sahami, Yusufali, et al. - 1998 |

41 | The complexity of the generalized Lloyd-Max problem - Garey, Johnson, et al. - 1982 |

35 | Parallel Formulations of Decision-Tree Classification Algorithms - Srivastava, Han, et al. - 1999 |

30 | Visualizing Class Structure of Multidimensional Data - Dhillon, Modha, et al. - 1998 |

28 | Parallel Classification for Data Mining on Shared-Memory Multiprocessors - Zaki, Ho, et al. - 1999 |

27 |
Bayesian classification (AutoClass): Theory and results
- Cheeseman, Stutz
- 1996
Citation Context ... widely used procedures in data mining [14]. Practical applications of clustering include unsupervised classification and taxonomy generation [13], nearest neighbor searching [15], scientific discovery =-=[16, 17]-=-, vector quantization [18], time series analysis [19], and multidimensional visualization [20, 21]. Our interest in clustering stems from the need to mine and analyze heaps of unstructured text docume... |

27 |
Eicken, LogP - a practical model of parallel computing
- Culler, Karp, et al.
- 1996
Citation Context ... $T_P^{\text{reduce}}$; (5) where $T_P^{\text{reduce}}$ denotes the time (in seconds) required to "MPI_Allreduce" a floating point number on P processors. On most architectures, one may assume that $T_P^{\text{reduce}} = O(\log P)$ =-=[38]-=-. Line 27 in Figure 2 ensures that each of the P processes has a local copy of the total mean-squared-error "MSE", hence each process can independently decide on the convergence condition, that is, wh... |

24 | BIRCH: An efficient data clustering method for very large databases - Zhang, Ramakrishnan, et al. - 1996 |

22 | The communication software and parallel environment of the IBM SP2 - Snir, Hochschild, et al. - 1995 |

19 |
SPRINT: A scalable parallel classifier for data mining
- Agrawal, Mehta, et al.
- 1996
Citation Context ...afer [1], Chattratichat et al. [2], Cheung and Xiao [3], Han, Karypis, and Kumar [4], Joshi, Karypis, and Kumar [5], Kargupta, Hamzaoglu, and Stafford [6], Shafer, Agrawal, and Mehta =-=[7]-=-, Srivastava, et al. [8], Zaki, Ho, and Agrawal [9], and Zaki et al. [10]. Also, see Stolorz and Musick [11] and Freitas and Lavington [12] for recent books on scalable and parallel data mining. In th... |

19 | Almost-constant-time clustering of arbitrary corpus subsets - Silverstein, Pedersen - 1997 |

17 | scale data mining: Challenges and responses - Chattratichat, Darlington, et al. - 1997 |

13 | PADMA: Parallel data mining agents for scalable text classification - Kargupta, Hamzaoglu, et al. - 1997 |

13 | The EM Algorithm and Extensions - McLachlan, Krishnan - 1996 |

13 | Effect of Data Distribution in Parallel Mining of Associations
- Cheung, Xiao
- 1999
Citation Context ...rallel data mining algorithms have been recently considered for tasks such as association rules and classification, see, for example, Agrawal and Shafer [1], Chattratichat et al. [2], Cheung and Xiao =-=[3]-=-, Han, Karypis, and Kumar [4], Joshi, Karypis, and Kumar [5], Kargupta, Hamzaoglu, and Stafford [6], Shafer, Agrawal, and Mehta [7], Srivastava, et al. [8], Zaki, Ho, and Agrawal [9]... |

11 |
Programming with UNIX Threads
- Northrup
- 1996
Citation Context ... on a different processor while ensuring that each processor has access to a separate copy of the centroids $\{m_j\}_{j=1}^{k}$. Such an algorithm can be implemented on a shared memory machine using threads =-=[Northrup1996]-=-. It is well known that the k-means algorithm is a hard thresholded version of the expectation-maximization (EM) algorithm [McLachlan and Krishnan1996]. We believe that the EM algorithm can be effectiv... |

10 |
The communication software and parallel environment
- Snir, Hochschild, et al.
- 1995
Citation Context ... The processors all run AIX level 4.2.1 and communicate with each other through the High-Performance Switch with HPS-2 adapters. The entire system runs PSSP 2.3 (Parallel System Support Program). See =-=[39]-=- for further information about the SP2 architecture. Our implementation is in C and MPI. All the timing measurements are done using the routine "MPI_Wtime()" described in Table 1. Our timing measureme... |

8 | Using cluster analysis to classify time series - Shaw, King - 1992 |

8 | Detecting atmospheric regimes using cross-validated clustering - Smyth, Ghil, et al. - 1997 |

6 | ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets - Joshi, Karypis, et al. - 1998 |

4 |
An algorithm for creating artificial test clusters. Psychometrika 50
- Milligan
- 1985
Citation Context ...ta point is to be interpreted as an average over five measurements. For a given number of data points n and number of dimensions d, we generated a test data set with 8 clusters using the algorithm in =-=[Milligan1985]-=-. A public domain implementation of this algorithm "clusgen.c" is available from Dave Dubin (http://alexia.lis.uiuc.edu/dubin/). The advantage of such data generation is that we can generate as many d... |

4 |
Parallel classification for data mining on shared-memory multiprocessors
- Zaki, Ho, et al.
- 1999
Citation Context ... [3], Han, Karypis, and Kumar [4], Joshi, Karypis, and Kumar [5], Kargupta, Hamzaoglu, and Stafford [6], Shafer, Agrawal, and Mehta [7], Srivastava, et al. [8], Zaki, Ho, and Agrawal =-=[9]-=-, and Zaki et al. [10]. Also, see Stolorz and Musick [11] and Freitas and Lavington [12] for recent books on scalable and parallel data mining. In this paper, we consider parallel clustering. Clusteri... |

3 | Scalable High Performance Computing for Knowledge Discovery and Data Mining - Stolorz, Musick - 1997 |

2 |
Visualizing class structure of high-dimensional data with applications
- Dhillon, Modha, et al.
- 1999
Citation Context ...ised classification and taxonomy generation [13], nearest neighbor searching [15], scientific discovery [16, 17], vector quantization [18], time series analysis [19], and multidimensional visualization =-=[20, 21]-=-. Our interest in clustering stems from the need to mine and analyze heaps of unstructured text documents. Clustering has been used to discover "latent concepts" in sets of unstructured text documents... |

1 | Effect of data distribution in parallel mining of associations. Data Mining and Knowledge Discovery - Cheung, Xiao - 1999 |