Results 1 - 5 of 5
Parallel Algorithms for Hierarchical Clustering
Parallel Computing, 1995
Cited by 80 (1 self)
Hierarchical clustering is a common method used to determine clusters of similar data points in multidimensional spaces. O(n^2) algorithms are known for this problem [3, 4, 10, 18]. This paper reviews important results for sequential algorithms and describes previous work on parallel algorithms for hierarchical clustering. Parallel algorithms to perform hierarchical clustering using several distance metrics are then described. Optimal PRAM algorithms using n/log n processors are given for the average link, complete link, centroid, median, and minimum variance metrics. Optimal butterfly and tree algorithms using n/log n processors are given for the centroid, median, and minimum variance metrics. Optimal asymptotic speedups are achieved for the best practical algorithm to perform clustering using the single link metric on an n/log n processor PRAM, butterfly, or tree.
Keywords. Hierarchical clustering, pattern analysis, parallel algorithm, butterfly network, PRAM algorithm. 1 In...
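As a point of reference for the single link metric named in the abstract above, the following is a minimal sequential sketch in Python. It is a naive, roughly cubic-time agglomerative loop, not one of the O(n^2) algorithms the paper cites and not any of its parallel algorithms. The function name `single_link` and its parameters are illustrative assumptions.

```python
from itertools import combinations
from math import dist


def single_link(points, num_clusters):
    """Naive agglomerative clustering with the single link metric.

    Illustrative only: repeatedly merges the two closest clusters
    until num_clusters remain. Returns (clusters, merges), where
    clusters holds lists of point indices and merges records each
    merge as (members_a, members_b, distance).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single link: cluster distance is the closest pair of members.
            d = min(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        # a < b always holds here, so deleting b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters], merges
```

Swapping `min` for `max` in the inner distance gives the complete link metric; the other metrics in the abstract need centroid or variance bookkeeping that this sketch leaves out.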
Clustering and Classification of Large Document Bases in a Parallel Environment
1997
Cited by 14 (6 self)
Development of cluster-based search systems has been hampered by the prohibitive times involved in clustering large document sets. Once clustering is completed, maintaining the cluster organization is difficult in dynamic file environments. We propose the use of parallel computing systems to overcome the computationally intensive clustering process. Two operations are examined. The first is clustering a document set and the second is classifying the document set. A subset of the TIPSTER corpus, specifically articles from the Wall Street Journal, is used. Document set classification was performed without the large storage requirement (potentially as high as 522M) for ancillary data matrices. In all cases, the time performance of the parallel system was an improvement over sequential system times, and the parallel system produced the same clustering and classification scheme. Some results show near-linear speedup in higher threshold clustering applications.
Keywords: Parallel Information Retrieval, Document Clustering, Docume...
Clustering in Massive Data Sets
Handbook of Massive Data Sets, 1999
Cited by 11 (0 self)
We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Sections 2 to 6 relate to nearest neighbor searching, an elemental form of clustering, and a basis for the clustering algorithms that follow. Sections 7 to 11 review a number of families of clustering algorithms. Sections 12 to 14 relate to visual or image representations of data sets, from which a number of interesting algorithmic developments arise.
k/h-means Clustering for Large Data Sets
This paper describes the realization of a parallel version of the k/h-means clustering algorithm. This is one of the basic algorithms used in a wide range of data mining tasks. We show how a database can be distributed and how the algorithm can be applied to this distributed database. The tests conducted on a network of 32 PCs showed nearly ideal speedup for large data sets.
1 Introduction
Clustering, the process of grouping similar objects, is a well-known and well-studied problem. Some of the early work was done in statistics (e.g. [2][7]). In more recent years clustering was identified as a key technique in data mining tasks [3]. This fundamental operation can be applied to many common tasks such as unsupervised classification, segmentation and dissection. We are focusing here on one specific algorithm for clustering, namely k/h-means clustering [1]. The original version of the k/h-means algorithm was designed for numerical data [4][6][5]. Our contribution in this pape...
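The idea of applying a means-based algorithm to a distributed database can be sketched as follows. This is an assumption-laden illustration of one standard batch k-means iteration over partitioned data, not the paper's actual procedure: each partition computes per-centroid sums and counts locally, and only those small summaries are combined. The names `local_stats` and `kmeans_step` are hypothetical, and the partitions are processed in a plain loop rather than on separate machines.

```python
from math import dist


def local_stats(chunk, centroids):
    """Work done on one partition: per-centroid coordinate sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in chunk:
        # Assign the point to its nearest centroid.
        k = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += p[d]
    return sums, counts


def kmeans_step(chunks, centroids):
    """One iteration: merge the partial statistics into new centroids."""
    dim = len(centroids[0])
    total = [[0.0] * dim for _ in centroids]
    count = [0] * len(centroids)
    for sums, counts in (local_stats(ch, centroids) for ch in chunks):
        for k in range(len(centroids)):
            count[k] += counts[k]
            for d in range(dim):
                total[k][d] += sums[k][d]
    # Keep the old centroid when no point was assigned to it.
    return [
        tuple(t / count[k] for t in total[k]) if count[k] else tuple(centroids[k])
        for k in range(len(centroids))
    ]
```

Because each partition returns only k sums and k counts, the data exchanged per iteration does not grow with the number of points, which is one plausible reason such schemes can scale well on a network of PCs.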
A Fast Parallel Clustering Algorithm for Large Spatial Databases
Data Mining and Knowledge Discovery, 3, 263–290 (1999). © 1999 Kluwer Academic Publishers. Manufactured in The Netherlands.
The clustering algorithm DBSCAN relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape as well as to distinguish noise. In this paper, we present PDBSCAN, a parallel version of this algorithm. We use the ‘shared-nothing’ architecture with multiple computers interconnected through a network. A fundamental component of a shared-nothing system is its distributed data structure. We introduce the dR*-tree, a distributed spatial index structure in which the data is spread among multiple computers and the indexes of the data are replicated on every computer. We implemented our method using a number of workstations connected via Ethernet (10 Mbit). A performance evaluation shows that PDBSCAN offers nearly linear speedup and has excellent scaleup and sizeup behavior.
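To make the density-based notion of clusters concrete, here is a minimal sequential sketch of DBSCAN in Python. It is not PDBSCAN: there is no distribution across machines, and neighborhood queries are brute-force scans rather than lookups in a spatial index such as the dR*-tree. The parameter names `eps` and `min_pts` follow common usage and are assumptions here, as is the `NOISE` label value.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster (assumed convention)


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""

    def neighbors(i):
        # Brute-force region query; a spatial index would replace this scan.
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep expanding
        cluster += 1
    return labels
```

Each call to `neighbors` costs a full scan here, so the region query is the natural place for an index and for distributing the data.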