## Scaling EM (Expectation-Maximization) Clustering to Large Databases (1999)

Citations: 40 (0 self)

### BibTeX

@MISC{Bradley99scalingem,
    author = {Paul S. Bradley and Usama M. Fayyad and Cory A. Reina},
    title = {Scaling EM (Expectation-Maximization) Clustering to Large Databases},
    year = {1999}
}

### Abstract

Practical statistical clustering algorithms typically center on an iterative refinement procedure that computes a locally optimal clustering solution maximizing the fit to the data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
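To make the cost described in the abstract concrete, here is a minimal sketch of plain (non-scalable) EM for a one-dimensional Gaussian mixture. It is not the paper's scalable method; it only illustrates why standard EM touches every record on every iteration, and that the parameter update depends on the data solely through a few per-cluster sums. The function names (`gaussian_pdf`, `em_step`, `fit`), the variance floor, and the toy data are assumptions made for this illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture.

    The loop over `data` is the full scan: every record is visited once.
    Returns updated parameters and the log-likelihood of the data under
    the parameters that were passed in.
    """
    k = len(weights)
    s0 = [0.0] * k  # per-cluster sum of responsibilities
    s1 = [0.0] * k  # per-cluster sum of responsibility * x
    s2 = [0.0] * k  # per-cluster sum of responsibility * x^2
    log_likelihood = 0.0

    for x in data:
        # E-step for one record: posterior membership in each cluster.
        dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(dens)  # assumed > 0; a robust version would work in log space
        log_likelihood += math.log(total)
        for j in range(k):
            r = dens[j] / total
            s0[j] += r
            s1[j] += r * x
            s2[j] += r * x * x

    # M-step: uses only the accumulated sums, never the raw records.
    n = len(data)
    new_weights = [s0[j] / n for j in range(k)]
    new_means = [s1[j] / s0[j] for j in range(k)]
    # Variance floor (an assumption of this sketch) avoids collapse to zero.
    new_vars = [max(s2[j] / s0[j] - new_means[j] ** 2, 1e-6) for j in range(k)]
    return new_weights, new_means, new_vars, log_likelihood


def fit(data, weights, means, variances, iterations=50):
    """Run a fixed number of EM iterations; each one is a full data scan."""
    log_likelihood = float("-inf")
    for _ in range(iterations):
        weights, means, variances, log_likelihood = em_step(
            data, weights, means, variances
        )
    return weights, means, variances, log_likelihood


if __name__ == "__main__":
    toy = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]  # two well-separated groups
    w, m, v, ll = fit(toy, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
    print("weights:", w)
    print("means:", m)
    print("variances:", v)
```

With 50 iterations the sketch makes 50 passes over `data`, which is the cost that becomes prohibitive when the records live in a large database table rather than in memory.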