## Distributed Clustering Using Collective Principal Component Analysis (1999)

Venue: Knowledge and Information Systems

Citations: 49 (9 self)

### BibTeX

```bibtex
@ARTICLE{Kargupta99distributedclustering,
  author  = {Hillol Kargupta and Weiyun Huang and Krishnamoorthy Sivakumar and Erik Johnson},
  title   = {Distributed Clustering Using Collective Principal Component Analysis},
  journal = {Knowledge and Information Systems},
  year    = {1999},
  volume  = {3},
  pages   = {2001}
}
```

### Abstract

This paper considers distributed clustering of high dimensional heterogeneous data using a distributed Principal Component Analysis (PCA) technique called the Collective PCA. It presents the Collective PCA technique that can be used independent of the clustering application. It shows a way to integrate the Collective PCA with a given off-the-shelf clustering algorithm in order to develop a distributed clustering technique. It also presents experimental results using different test data sets including an application for web mining.

### Citations

4673 |
Matrix Analysis
- Horn, Johnson
- 1985
Citation Context ... singular value decomposition (SVD) of a real, symmetric, positive semi-definite matrix (the covariance matrix Σx in our case) is equivalent to the orthogonal decomposition in terms of eigenvalues/eigenvectors [26]. Therefore, algorithms for computing the SVD can also be used for PCA. The power method and its variants are some of the simplest techniques for finding a few of the dominant eigenvalues/eigenvectors of ...
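The power method mentioned in this context can be sketched briefly. This is a minimal illustration of the general technique, not the paper's implementation; the function name, tolerance, and toy matrix are invented for the example:

```python
import numpy as np

def power_method(A, num_iters=500, tol=1e-10, seed=0):
    """Estimate the dominant eigenvalue/eigenvector of a symmetric
    positive semi-definite matrix A by repeated multiplication."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(num_iters):
        w = A @ v
        v_new = w / np.linalg.norm(w)
        lam_new = v_new @ A @ v_new  # Rayleigh quotient estimate
        if abs(lam_new - lam) < tol:
            lam, v = lam_new, v_new
            break
        lam, v = lam_new, v_new
    return lam, v

# Toy covariance matrix whose dominant eigenvalue is 5 (eigenvector ±e1).
A = np.diag([5.0, 2.0, 1.0])
lam, v = power_method(A)
```

Convergence is geometric in the ratio of the two largest eigenvalues, which is why the method only recovers a few dominant components cheaply.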

3016 |
Probability and Measure
- Billingsley
- 1995
Citation Context ...me that all the observations are statistically independent and are drawn from a common underlying probability distribution. It then follows from the law of large numbers and the central limit theorem [3, 4] that the covariance matrix Σ̂x estimated from data is related to the true covariance matrix Σx by Var[‖Σ̂x − Σx‖] = O(1/m), where m is the number of observations transmitted to the central sit...
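The O(1/m) behavior quoted above can be checked empirically. A minimal sketch, assuming standard-normal data so the true covariance is the identity; the helper name, dimension, and sample sizes are arbitrary choices for illustration:

```python
import numpy as np

def cov_estimation_error(m, dim=5, seed=0):
    """Frobenius-norm error between the sample covariance of m i.i.d.
    standard-normal draws and the true covariance (the identity)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, dim))
    cov_hat = np.cov(X, rowvar=False)
    return np.linalg.norm(cov_hat - np.eye(dim))

# The error should shrink roughly like 1/sqrt(m) as more observations arrive.
errors = [cov_estimation_error(m) for m in (100, 10_000)]
```

With 100x more observations the error drops by roughly a factor of 10, consistent with a variance that decays like 1/m.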

1771 |
Perturbation Theory for linear operators
- Kato
- 1980
Citation Context ...l basis for these subspaces. A detailed analysis of this error measure based on distance between subspaces is important. Indeed, let U, V ⊆ R^n be two subspaces. The gap between U and V is defined as [49, 33] γ(U, V) = max{ sup_{u∈U, ‖u‖=1} inf_{v∈V} ‖u − v‖, sup_{v∈V, ‖v‖=1} inf_{u∈U} ‖u − v‖ }, where ‖·‖ is a norm on R^n. The gap function is a metric for the important special case where ‖·‖ is the Euclidean norm. The...
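For the Euclidean norm, this gap equals the sine of the largest canonical angle between the subspaces, which for two subspaces of equal dimension can be computed as the spectral norm of the difference of their orthogonal projectors. A minimal sketch under that equal-dimension assumption; the function name and the example planes are invented:

```python
import numpy as np

def subspace_gap(U, V):
    """Gap between the subspaces spanned by the orthonormal columns of U
    and V, computed as the spectral norm of the projector difference.
    For the Euclidean norm this is the sine of the largest canonical angle."""
    P_U = U @ U.T  # orthogonal projector onto span(U)
    P_V = V @ V.T  # orthogonal projector onto span(V)
    return np.linalg.norm(P_U - P_V, ord=2)

# Planes in R^3: the xy-plane vs. a plane tilted by theta about the x-axis.
theta = np.pi / 6
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
V = np.array([[1.0, 0.0],
              [0.0, np.cos(theta)],
              [0.0, np.sin(theta)]])
gap = subspace_gap(U, V)  # expected: sin(theta) = 0.5
```

The shared direction contributes a canonical angle of zero, so the gap is governed entirely by the tilt angle theta.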

522 |
Analysis of Complex Statistical Variables into Principal Components
- Hotelling
Citation Context ...er feature selection and feature construction may not produce desirable data clusters. Moreover, this is important for the scalability of the clustering algorithms. Principal component analysis (PCA) [27, 28] is a popular technique to construct a representation of the data that captures the maximally variant dimensions of the data. It computes a representation with a set of basis vectors that are the dominant ...

451 | The Geometry of Graphs and Some of its Algorithmic Applications
- Linial, London, et al.
- 1995
Citation Context ...pace to a Hilbert space. Theorem 1 ([7]) Every n-point metric space of dimension n can be mapped to a O(log n) Hilbert space with an O(log n) distortion. 17 This result was further explored elsewhere [38] which produced the following theorem. Theorem 2 ([39]) In random polynomial time, every n-point metric space of n dimensions can be embedded in ℓ_p^{O(log n)} (for any p ≥ 1), with distortion O(log n), ...

299 |
A User’s Guide to Principal Components
- Jackson
- 1991
Citation Context ...er feature selection and feature construction may not produce desirable data clusters. Moreover, this is important for the scalability of the clustering algorithms. Principal component analysis (PCA) [27, 28] is a popular technique to construct a representation of the data that captures the maximally variant dimensions of the data. It computes a representation with a set of basis vectors that are the dominant ...

247 | Fast algorithms for projected clustering
- Aggarwal, Wolf, et al.
- 1999
Citation Context ...bsets of the data and computing the minimum distance between the point being projected and the subsets. Related work for constructing projections of n points in Euclidean space can be found elsewhere [9, 12]. We are currently exploring this possibility. The assignment of the representative points to every member of the data set can be made more efficient by storing the data set using similarity preserving i...

246 | Scaling Clustering Algorithms to Large Databases, Microsoft Research
- Bradley, Fayyad, et al.
Citation Context ...on describes a distributed clustering algorithm that makes use of the CPCA. 5 Distributed Clustering Using the CPCA Clustering is an important technique that is often used in data mining applications [8, 21, 44, 57]. Clustering high dimensional data often requires application of PCA-like techniques for constructing features that capture the maximum variance in the data in a small number of components. When the d...

214 |
BIRCH: An efficient data clustering method for very large databases
- Zhang, Ramakrishnan, et al.
- 1996
Citation Context ...on describes a distributed clustering algorithm that makes use of the CPCA. 5 Distributed Clustering Using the CPCA Clustering is an important technique that is often used in data mining applications [8, 21, 44, 57]. Clustering high dimensional data often requires application of PCA-like techniques for constructing features that capture the maximum variance in the data in a small number of components. When the d...

157 |
Fundamentals of Matrix Computations
- Watkins
- 2002
Citation Context ... of the data to obtain the transformation matrix A for PCA [3]. 4 In recent years, the QR algorithm has been the most widely used algorithm for calculating the complete set of eigenvalues of a matrix [52, 18]. Cyclic Jacobi methods are particularly suited for implementation in a parallel computer [52, 18]. The divide-and-conquer method of Cuppen is a relatively new method for calculating the complete eige...

156 |
Multivariate Statistical Methods
- Morrison
- 1976
Citation Context ...f its ease of implementation, we have adopted this method in our experiments. 2.2 PCA and Data Analysis Principal component analysis has found wide applications in various disciplines like psychology [43, 5], genetics [23], pattern recognition [55], remote sensing [35], and seismic data analysis [24, 29], among others. PCA is also a popular choice for data mining applications. PCA has been used for detec...

114 |
Error and perturbation bounds for subspaces associated with certain eigenvalue problems
- Stewart
Citation Context ...{u_1, ..., u_k} and {v_1, ..., v_k}. In fact, for some cases with repeated eigenvalues, the former error may be large even when the subspaces are close in some appropriate metric. To quote from Stewart [49]: "... one cannot expect the eigenvectors of nearby matrices to lie near one another when their corresponding eigenvalues belong to clusters of poorly separated eigenvalues." "Although the eigenvector...

112 | A Transformation for Ordering Multispectral Data in Terms of Image Quality with Implications for Noise Removal - Green, Berman, et al. - 1988

112 | Similarity indexing: Algorithms and performance
- White, Jain
- 1996
Citation Context ...e currently exploring this possibility. The assignment of the representative points to every member of the data set can be made more efficient by storing the data set using similarity preserving indices [54]. We are currently integrating such techniques with the distributed clustering technique. 8 Conclusions Distributed data analysis is playing an increasingly important role in KDD applications from dat...

83 | Collective data mining: A new perspective toward distributed data mining
- Kargupta, Park, et al.
- 1999
Citation Context ... retrieve huge amounts of data has necessitated the development of algorithms that can extract useful information from these databases. KDD addresses this issue. Distributed knowledge discovery (DKD) [10, 13, 20, 22, 25, 32, 46, 37, 51, 56] takes KDD to a new platform. It embraces the growing trend of merging computation with communication and explores all facets of the KDD process in the context of the emerging distributed computing en...

80 | Parallel algorithms for hierarchical clustering
- OLSON
- 1995
Citation Context ...rated with standard off-the-shelf clustering algorithms in order to generate a distributed clustering technique. There are numerous recent efforts directed towards scaling up clustering algorithms. In [45], the author shows an adaptation of the SLINK [48] and other agglomerative hierarchical clustering algorithms to a multiprocessor environment to parallelize the clustering process. The PADMA system [3...

51 |
Clustering methodology in exploratory data analysis
- DUBES, JAIN
- 1980
Citation Context ...sed Distributed Clustering This section presents the experimental results with CPCA-based clustering technique for two experimental test suites. In our experiments we use the K-means clustering algorithm [16] as the clustering module. We compare the results of centralized clustering with those of the CPCA-based distributed clustering. In the centralized case, we also use PCA to extract the features and pe...

49 | Scalable, Distributed Data Mining Using an Agent Based Architecture
- Kargupta, Hamzaoglu, et al.
- 1997
Citation Context ...5], the author shows an adaptation of the SLINK [48] and other agglomerative hierarchical clustering algorithms to a multiprocessor environment to parallelize the clustering process. The PADMA system [31] offers a distributed clustering system for homogeneous text data. In [15], the authors adapt the K-Means algorithm to run in a parallel/distributed environment. The Collective Hierarchical Clustering ...

45 | Collective, hierarchical clustering from distributed, heterogeneous data
- Johnson, Kargupta
- 2000
Citation Context ...CA and distributed PCA-based clustering of heterogeneous and distributed data. Earlier efforts on classifier and hierarchical cluster learning from heterogeneous distributed data can be found elsewhere [25, 30, 32, 51]. We will not assume any restriction on the number of data sites. By definition, we will assume that there exists at least one key feature (e.g. the feature "City" in Tables 3 & 4) associated with the ...

41 |
Principal direction divisive partitioning. Data Mining and Knowledge Discovery
- Boley
- 1998
Citation Context ...s of the covariance matrix generated by the data. Clustering algorithms equipped with PCA-based representation are quite useful in many applications including knowledge discovery from databases (KDD) [6]. Both PCA and PCA-based clustering algorithms are reasonably well understood when the data sets are centrally located. However, the emergence of network-based computation has offered a new challenge t...

35 |
Enhancement of High Spectral Resolution Remote-Sensing Data by a Noise-Adjusted Principal Components Transform
- Lee, Woodyatt, et al.
- 1990
Citation Context ...experiments. 2.2 PCA and Data Analysis Principal component analysis has found wide applications in various disciplines like psychology [43, 5], genetics [23], pattern recognition [55], remote sensing [35], and seismic data analysis [24, 29], among others. PCA is also a popular choice for data mining applications. PCA has been used for detecting linear associative rules [17]. Application of PCA-based t...

34 | LAPACK Users' Guide. Society for Industrial and Applied Mathematics, third edition - Anderson, Bai, et al.

34 |
Efficient and effective clustering methods for spatial data mining
- Ng, Han
- 1994
Citation Context ...on describes a distributed clustering algorithm that makes use of the CPCA. 5 Distributed Clustering Using the CPCA Clustering is an important technique that is often used in data mining applications [8, 21, 44, 57]. Clustering high dimensional data often requires application of PCA-like techniques for constructing features that capture the maximum variance in the data in a small number of components. When the d...

30 | Sharing learned models among remote database partitions by local metalearning
- Chan, Stolfo
- 1996
Citation Context ... retrieve huge amounts of data has necessitated the development of algorithms that can extract useful information from these databases. KDD addresses this issue. Distributed knowledge discovery (DKD) [10, 13, 20, 22, 25, 32, 46, 37, 51, 56] takes KDD to a new platform. It embraces the growing trend of merging computation with communication and explores all facets of the KDD process in the context of the emerging distributed computing en...

30 | Non-expansive hashing
- Linial, Sasson
- 1996
Citation Context ... metric space of dimension n can be mapped to a O(log n) Hilbert space with an O(log n) distortion. 17 This result was further explored elsewhere [38] which produced the following theorem. Theorem 2 ([39]) In random polynomial time, every n-point metric space of n dimensions can be embedded in ℓ_p^{O(log n)} (for any p ≥ 1), with distortion O(log n), where ℓ_p^m is a norm on the Euclidean space R^m defined...

28 |
The Rotation of Eigenvectors by a Perturbation: III
- Davis, Kahan
- 1970
Citation Context ...set of suitably constructed orthonormal basis vectors for the respective subspaces. The angle θ_i between the i-th basis vectors of U and V is defined as the i-th canonical angle between the subspaces [14]. In this case, the sine of the largest canonical angle is the gap γ(U, V) between the subspaces. We are actively investigating the idea of characterizing the CPCA error in terms of the relation betwe...

28 | A Principal Components Approach to Combining Regression Estimates
- Merz, Pazzani
- 1999
Citation Context ...ive rules [17]. Application of PCA-based techniques for large scale text can be found in [6]. PCA has also found applications in ensemble learning and aggregation of multiple models. Merz and Pazzani [42] have reported a PCA-based technique for combining regression estimates. A maximum-likelihood-based framework for constructing mixture models of PCA is proposed by Tipping and Bishop [50]. The techniq...

28 |
Pattern recognition by means of disjoint principalcomponents models
- Wold
- 1976
Citation Context ...d this method in our experiments. 2.2 PCA and Data Analysis Principal component analysis has found wide applications in various disciplines like psychology [43, 5], genetics [23], pattern recognition [55], remote sensing [35], and seismic data analysis [24, 29], among others. PCA is also a popular choice for data mining applications. PCA has been used for detecting linear associative rules [17]. Appli...

25 |
A Data Clustering Algorithm on Distributed Memory Multiprocessors
- Dhillon, Modha
- 1996
Citation Context ...ive hierarchical clustering algorithms to a multiprocessor environment to parallelize the clustering process. The PADMA system [31] offers a distributed clustering system for homogeneous text data. In [15], the authors adapt the K-Means algorithm to run in a parallel/distributed environment. The Collective Hierarchical Clustering algorithm was proposed elsewhere [30] for generating hierarchical cluster...

25 |
CURE: an efficient clustering algorithm for large databases
- Guha, Rastogi, et al.
- 1998

25 | Distributed Multivariate Regression Using Wavelet-based Collective Data Mining
- Hershberger, Kargupta
- 1999
Citation Context ... retrieve huge amounts of data has necessitated the development of algorithms that can extract useful information from these databases. KDD addresses this issue. Distributed knowledge discovery (DKD) [10, 13, 20, 22, 25, 32, 46, 37, 51, 56] takes KDD to a new platform. It embraces the growing trend of merging computation with communication and explores all facets of the KDD process in the context of the emerging distributed computing en...

24 | BIRCH: An efficient data clustering method for very large databases - Zhang, Ramakrishnan, et al. - 1996

23 | The preliminary design of papyrus: A system for high performance
- Grossman, Baily, et al.
- 1998

20 | Probabilistic visualisation of high-dimensional binary data
- Tipping
- 1999
Citation Context ... and Pazzani [42] have reported a PCA-based technique for combining regression estimates. A maximum-likelihood-based framework for constructing mixture models of PCA is proposed by Tipping and Bishop [50]. The technique developed by Tipping and Bishop develops a collection of PCA models by analyzing different horizontal partitions of the data. The objective is to generate a better quality model that is ...

18 |
Numerical Analysis: A Practical Approach
- Maron
- 1982
Citation Context ...erefore, algorithms for computing the SVD can also be used for PCA. The power method and its variants are some of the simplest techniques for finding a few of the dominant eigenvalues/eigenvectors of Σx [40, 41]. Because of its ease of implementation, we have adopted this method in our experiments. 2.2 PCA and Data Analysis Principal component analysis has found wide applications in various disciplines like ...

18 |
Distributed cooperative Bayesian learning strategies
- Yamanishi
- 1997

16 |
Distributed data mining of probabilistic knowledge
- Lam, Segre
- 1997

13 |
On lipschitz embedding of metric spaces in hilbert space
- Bourgain
- 1985
Citation Context ... spaces. We say that the mapping φ is ε-nearly isometric if ρx(x1, x2) ≤ ρy(φ(x1), φ(x2)) ≤ ε ρx(x1, x2). In this case we may say that the mapping has an ε distortion. The following theorem developed elsewhere [7] provides an interesting result about near-isometric mappings of a metric space to a Hilbert space. Theorem 1 ([7]) Every n-point metric space of dimension n can be mapped to a O(log n) Hilbert space w...

13 | 2000, ‘Robust Order Statistics based Ensemble for Distributed Data Mining
- Tumer, Ghosh

11 |
A Fourier Analysis-Based Approach to Learn Classifier from
- Park, R, et al.
- 2001

10 | A survey of methods for multivariate data projection, visualization and interactive analysis - Konig - 1998

10 |
Analysis of multiblock and hierarchical PCA and PLS models
- Westerhuis, Kourti, et al.
- 1998
Citation Context ...ctor. 4 partitions may not share the same feature set. Another approach for aggregating multiple data partitions into a single hierarchical PCA model is developed by Westerhuis, Kourti, and Macgregor [53]. This technique was primarily developed for process control applications in chemical engineering where data sets are collected periodically. This approach iteratively extracts one dominant eigenvecto...

7 |
Distributed learning with knowledge probing: A new framework for distributed data mining
- Guo, Sutiwaraphun
- 2000

7 |
On lines and planes of closest fit to systems of points in
- Pearson
- 1901
Citation Context ...th any classification label. The following section presents a brief review of PCA. 2.1 PCA: A Brief Review Principal Component Analysis (PCA) is a statistical technique for analyzing multivariate data [47, 27, 28]. It involves linear transformation of a collection of related (statistically correlated) variables into a set of transformed variables, usually referred to as principal components. All the principal...

5 | Quantifiable Data Mining Using Principal Component Analysis
- Faloutsos, Korn, et al.
- 1997
Citation Context ...nition [55], remote sensing [35], and seismic data analysis [24, 29], among others. PCA is also a popular choice for data mining applications. PCA has been used for detecting linear associative rules [17]. Application of PCA-based techniques for large scale text can be found in [6]. PCA has also found applications in ensemble learning and aggregation of multiple models. Merz and Pazzani [42] have repo...

4 |
Mining Decentralized Data Repositories
- Crestana, Soparkar
- 1999

4 |
Slink: An optimally efficient algorithm for the single-link cluster method. The Computer Journal
- Sibson
- 1973
Citation Context ...ithms in order to generate a distributed clustering technique. There are numerous recent efforts directed towards scaling up clustering algorithms. In [45], the author shows an adaptation of the SLINK [48] and other agglomerative hierarchical clustering algorithms to a multiprocessor environment to parallelize the clustering process. The PADMA system [31] offers a distributed clustering system for homog...

3 | Randomized non-linear projections uncover high-dimensional structure
- Cowen, Priebe
- 1997
Citation Context ...bsets of the data and computing the minimum distance between the point being projected and the subsets. Related work for constructing projections of n points in Euclidean space can be found elsewhere [9, 12]. We are currently exploring this possibility. The assignment of the representative points to every member of the data set can be made more efficient by storing the data set using similarity preserving i...

3 |
The use of Karhunen-Loeve transform in seismic data prospecting
- Hemon, Mace
- 1978
Citation Context ...alysis Principal component analysis has found wide applications in various disciplines like psychology [43, 5], genetics [23], pattern recognition [55], remote sensing [35], and seismic data analysis [24, 29], among others. PCA is also a popular choice for data mining applications. PCA has been used for detecting linear associative rules [17]. Application of PCA-based techniques for large scale text can b...

2 |
Analysis of WAIS subtests in relation to age and education
- Birren, Morrison
- 1961
Citation Context ...f its ease of implementation, we have adopted this method in our experiments. 2.2 PCA and Data Analysis Principal component analysis has found wide applications in various disciplines like psychology [43, 5], genetics [23], pattern recognition [55], remote sensing [35], and seismic data analysis [24, 29], among others. PCA is also a popular choice for data mining applications. PCA has been used for detec...