## Scalable model-based clustering for large databases based on data summarization (2005)

Venue: IEEE Transactions on Pattern Analysis and Machine Intelligence

Citations: 11 (3 self)

### BibTeX

@ARTICLE{Jin05scalable,
  author = {Huidong Jin and Man-leung Wong and Kwong-sak Leung},
  title = {Scalable model-based clustering for large databases based on data summarization},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year = {2005},
  volume = {27},
  pages = {1710--1719}
}

### Abstract

The scalability problem in data mining involves developing methods for handling large databases with limited computational resources such as memory and computation time. In this paper, two scalable clustering algorithms, bEMADS and gEMADS, are presented based on the Gaussian mixture model. Both summarize data into subclusters and then generate Gaussian mixtures from their data summaries. Their core algorithm, EMADS, is defined on data summaries and approximates the aggregate behavior of each subcluster of data under the Gaussian mixture model. EMADS is provably convergent. Experimental results substantiate that both algorithms can run several orders of magnitude faster than expectation-maximization with little loss of accuracy.

Index Terms: Scalable clustering, Gaussian mixture model, expectation-maximization, data summary, maximum likelihood.
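The setting the abstract describes — a K-component Gaussian mixture with each record assigned crisply to the component of maximal posterior — can be sketched in a few lines. This is a minimal illustration of the mixture model itself, not the paper's EMADS algorithm; the function names and toy parameters below are my own:

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of one multivariate Gaussian component, phi(x | theta_k)."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm)

def mixture_density(x, weights, means, covs):
    """p(x | Phi) = sum_k p_k * phi(x | theta_k)."""
    return sum(w * gaussian_pdf(x, m, c)
               for w, m, c in zip(weights, means, covs))

def crisp_assign(x, weights, means, covs):
    """Crisp clustering: k = argmax_l p_l * phi(x | theta_l).

    The shared denominator p(x | Phi) cancels, so comparing the
    weighted component densities is equivalent to comparing posteriors.
    """
    scores = [w * gaussian_pdf(x, m, c)
              for w, m, c in zip(weights, means, covs)]
    return int(np.argmax(scores))
```

For example, with two well-separated unit-covariance components at (0, 0) and (5, 5), a point near the origin is assigned to the first component.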

### Citations

9033 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...chniques have attracted much research interest [2], [6], [7], [9], [10], [12]. They have solid probabilistic foundations [12]–[16], and can handle clusters of various shapes and complicated databases [14], [16], [17]. They, especially the Gaussian mixture model, have been successfully applied to various real applications [17]–[20]. Due to its theoretical and practical significance, we focus on scalabl...

2132 | Data mining: Concepts and techniques
- Han, Kamber
- 2000
Citation Context: ...ited memory and computation time. A data mining algorithm is said to be scalable when its running time grows linearly or sub-linearly with data size, given computational resources such as main memory [1]–[5]. It bridges the gap between the limited computational resources and large databases. Due to its wide applications, scalable clustering has drawn much attention recently [2], [4], [6]–[10]. Model-...

1076 | The EM Algorithm and Extensions
- McLachlan, Krishnan
- 1997
Citation Context: ...Among many clustering techniques [4], [8], [11], model-based clustering techniques have attracted much research interest [2], [6], [7], [9], [10], [12]. They have solid probabilistic foundations [12]–[16], and can handle clusters of various shapes and complicated databases [14], [16], [17]. They, especially the Gaussian mixture model, have been successfully applied to various real applications [17]–[2...

514 | Bayesian classification (AutoClass): Theory and results
- Cheeseman, Stutz
- 1996
Citation Context: ...e attracted much research interest [2], [6], [7], [9], [10], [12]. They have solid probabilistic foundations [12]–[16], and can handle clusters of various shapes and complicated databases [14], [16], [17]. They, especially the Gaussian mixture model, have been successfully applied to various real applications [17]–[20]. Due to its theoretical and practical significance, we focus on scalable clustering...

304 | Unsupervised learning on finite mixture models
- Figueiredo, Jain
- 2002
Citation Context: ...bility is maximal, i.e., k = arg max_l {p_l φ(x_i|θ_l)}. Among many clustering techniques [4], [8], [11], model-based clustering techniques have attracted much research interest [2], [6], [7], [9], [10], [12]. They have solid probabilistic foundations [12]–[16], and can handle clusters of various shapes and complicated databases [14], [16], [17]. They, especially the Gaussian mixture model, have been succ...

96 | Very fast EM-based mixture model clustering using multiresolution kd-trees
- Moore
- 1999
Citation Context: ...posterior probability is maximal, i.e., k = arg max_l {p_l φ(x_i|θ_l)}. Among many clustering techniques [4], [8], [11], model-based clustering techniques have attracted much research interest [2], [6], [7], [9], [10], [12]. They have solid probabilistic foundations [12]–[16], and can handle clusters of various shapes and complicated databases [14], [16], [17]. They, especially the Gaussian mixture mode...

69 | BIRCH: A new data clustering algorithm and its applications
- Zhang, Ramakrishnan, et al.
- 1997
Citation Context: ...ch as main memory [1]–[5]. It bridges the gap between the limited computational resources and large databases. Due to its wide applications, scalable clustering has drawn much attention recently [2], [4], [6]–[10]. Model-based clustering techniques assume a record x_i ∈ ℝ^D (i = 1, ..., N) is drawn from a K-component mixture model Φ with probability p(x_i|Φ) = Σ_{k=1}^{K} p_k φ(x_i|θ_k). The component density ...

61 | Algorithms for model-based Gaussian hierarchical clustering - Fraley - 1999

60 | Transformation-invariant clustering using the EM algorithm - Frey, Jojic

59 | Density Biased Sampling: An Improved Method for Data Mining and Clustering
- Palmer, Faloutsos
Citation Context: ...1) Given Φ, a crisp clustering is obtained by assigning each record x_i to cluster k where its posterior probability is maximal, i.e., k = arg max_l {p_l φ(x_i|θ_l)}. Among many clustering techniques [4], [8], [11], model-based clustering techniques have attracted much research interest [2], [6], [7], [9], [10], [12]. They have solid probabilistic foundations [12]–[16], and can handle clusters of various ...

59 | Compressed data cubes for OLAP aggregate query approximation on continuous dimensions
- Bradley, Fayyad, et al.
- 1999
Citation Context: ...6], and can handle clusters of various shapes and complicated databases [14], [16], [17]. They, especially the Gaussian mixture model, have been successfully applied to various real applications [17]–[20]. Due to its theoretical and practical significance, we focus on scalable clustering based on the Gaussian mixture model hereafter. Expectation-Maximization (EM) effectively estimates maximum likeliho...

52 | Mining very large databases - Ganti, Gehrke, et al. - 1999

46 | An experimental comparison of model-based clustering methods
- Meila, Heckerman
- 2001
Citation Context: ...erior probability is maximal, i.e., k = arg max_l {p_l φ(x_i|θ_l)}. Among many clustering techniques [4], [8], [11], model-based clustering techniques have attracted much research interest [2], [6], [7], [9], [10], [12]. They have solid probabilistic foundations [12]–[16], and can handle clusters of various shapes and complicated databases [14], [16], [17]. They, especially the Gaussian mixture model, ha...

34 | Accelerating EM for Large Databases - Thiesson, Meek, et al. - 2001

27 | Clustering by Committee - Pantel - 2003

18 | Clustering very large databases using EM mixture models
- Bradley, Fayyad, et al.
- 2000
Citation Context: ...es such as main memory [1]–[5]. It bridges the gap between the limited computational resources and large databases. Due to its wide applications, scalable clustering has drawn much attention recently [2], [4], [6]–[10]. Model-based clustering techniques assume a record x_i ∈ ℝ^D (i = 1, ..., N) is drawn from a K-component mixture model Φ with probability p(x_i|Φ) = Σ_{k=1}^{K} p_k φ(x_i|θ_k). The component den...

5 | Scalable model-based cluster analysis using clustering features
- Jin, Leung, et al.
- 2005
Citation Context: ...r, i.e., the covariance matrix Σ_k of each Gaussian distribution is a diagonal matrix, the clustering accuracy generated by ExEM is significantly worse than EM and EMACF as shown in our previous study [15]. EMACF is a simplified version of EMADS where data summaries are replaced with clustering features. EMACF can only run when attributes are independent within each cluster [15]. A. Performance on Thre...

4 | Scalable model-based clustering algorithms for large databases and their applications
- Jin
- 2002
Citation Context: ...n memory [1]–[5]. It bridges the gap between the limited computational resources and large databases. Due to its wide applications, scalable clustering has drawn much attention recently [2], [4], [6]–[10]. Model-based clustering techniques assume a record x_i ∈ ℝ^D (i = 1, ..., N) is drawn from a K-component mixture model Φ with probability p(x_i|Φ) = Σ_{k=1}^{K} p_k φ(x_i|θ_k). The component density φ(x_i|θ_k) i...

3 | Scalable model-based clustering by working on data summaries
- Jin, Wong, et al.
- 2003
Citation Context: ...memory and computation time. A data mining algorithm is said to be scalable when its running time grows linearly or sub-linearly with data size, given computational resources such as main memory [1]–[5]. It bridges the gap between the limited computational resources and large databases. Due to its wide applications, scalable clustering has drawn much attention recently [2], [4], [6]–[10]. Model-base...

2 | Learning Mixture Models with the Latent Maximum Entropy Principal
- Wang, Schuurmans, et al.
- 2003
Citation Context: ...el Φ. The pseudo model approximates the aggregate behavior of each subcluster under Φ. Thus, we can get good Gaussian mixtures Φ through finding good estimates of Ψ. To filter out degenerate mixtures [13], i.e., |Σ_k| ≈ 0 for some k, we choose a conjugate prior of each covariance matrix Σ_k. The conjugate prior for Σ_k⁻¹ is a Wishart distribution W_k = |Ω_k|^{α_k/2} |Σ_k⁻¹|^{(α_k−D−1)/2} / (2^{α_k D/2} π^{D(D−1)/4} ∏_{d=1}^{D} ...
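The excerpt above concerns degenerate mixture components, where some |Σ_k| collapses toward zero; the paper addresses this with a Wishart conjugate prior on Σ_k⁻¹. A much simpler stand-in for that safeguard — shrinking each estimated covariance toward a scaled identity — can be sketched as follows. This is an illustrative substitute, not the paper's conjugate-prior scheme, and `prior_scale` is an assumed hyperparameter:

```python
import numpy as np

def regularize_cov(sample_cov, prior_scale=1e-3):
    """Add a small scaled identity to a sample covariance so that its
    determinant stays bounded away from zero, keeping the mixture
    non-degenerate. A crude stand-in for a Wishart prior on Sigma_k^{-1}."""
    d = sample_cov.shape[0]
    return sample_cov + prior_scale * np.eye(d)
```

Even a covariance estimated from collinear records (singular, determinant zero) becomes invertible after this floor is applied, which is what keeps an EM M-step from producing a degenerate component.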