## Scaling EM (Expectation-Maximization) Clustering to Large Databases (1999)

Citations: 40 (0 self)

### BibTeX

@MISC{Bradley99scalingem,

author = {Paul S. Bradley and Usama M. Fayyad and Cory A. Reina},

title = {Scaling EM (Expectation-Maximization) Clustering to Large Databases},

year = {1999}

}

### Abstract

Practical statistical clustering algorithms typically center on an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
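The abstract's point that EM only needs a small set of basic statistics can be illustrated with a minimal sketch. This is not the paper's scalable EM (which additionally compresses and purges regions of the data); it is plain EM for a one-dimensional Gaussian mixture, written under the assumption that the data arrives in chunks. The function names (`accumulate`, `m_step`, `em_over_chunks`) and the `min_var` floor are hypothetical choices made for this illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def accumulate(chunk, weights, means, variances, stats=None):
    """E-step over one chunk: add responsibility-weighted sums to `stats`.

    Only per-cluster sums are kept (s0 = sum of responsibilities,
    s1 = weighted sum of x, s2 = weighted sum of x^2), so the chunk
    itself can be discarded after this call.
    """
    k = len(weights)
    if stats is None:
        stats = {"n": 0, "loglik": 0.0,
                 "s0": [0.0] * k, "s1": [0.0] * k, "s2": [0.0] * k}
    for x in chunk:
        joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                 for j in range(k)]
        total = sum(joint)  # mixture density at x (assumed to be > 0)
        stats["n"] += 1
        stats["loglik"] += math.log(total)
        for j in range(k):
            r = joint[j] / total  # membership probability of x in cluster j
            stats["s0"][j] += r
            stats["s1"][j] += r * x
            stats["s2"][j] += r * x * x
    return stats

def m_step(stats, min_var=1e-6):
    """M-step: re-estimate the mixture from the accumulated sums alone."""
    weights, means, variances = [], [], []
    for s0, s1, s2 in zip(stats["s0"], stats["s1"], stats["s2"]):
        mean = s1 / s0
        weights.append(s0 / stats["n"])
        means.append(mean)
        # min_var is an illustrative floor that guards against collapse.
        variances.append(max(s2 / s0 - mean * mean, min_var))
    return weights, means, variances

def em_over_chunks(chunks, weights, means, variances, iterations=25):
    """Run EM, touching the data one chunk at a time in each iteration."""
    history = []  # log likelihood of the model used in each E-step
    for _ in range(iterations):
        stats = None
        for chunk in chunks:
            stats = accumulate(chunk, weights, means, variances, stats)
        history.append(stats["loglik"])
        weights, means, variances = m_step(stats)
    return weights, means, variances, history
```

Because the M-step reads only the accumulated sums, splitting the data into chunks gives the same estimates as a single pass over all records. Note that this sketch still rescans every record in every iteration, which is exactly the cost the paper sets out to avoid.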

### Citations

8134 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

4848 |
Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context ...antified by the log likelihood of the data under the computed mixture model (Equation 4). This measure quantifies the likelihood that the database records were generated by the computed mixture model [B95]. It was empirically determined that SEM running time was 5 times ... [Figure 5.1; axes: Number of Records (k), Running Time (sec); series: Scalable EM, Vanilla EM] ... |

3927 |
Pattern classification and scene analysis
- Duda, Hart
- 1973
Citation Context ...portion p of the data that is best represented by the current model. This is achieved by purging records from RS that have greatest log likelihoods. This is equivalent to finding a Mahalanobis radius [DH73] r # from the center of each cluster l and purging the data items within that radius as illustrated in Figure 3.1 (shaded regions go to DS # ). The subset of data records that is summarized near the m... |

2662 | Introduction to Statistical Pattern Recognition, 2nd ed - Fukunaga - 1990 |

2185 | Density Estimation for Statistics and Data Analysis - Silverman - 1986 |

1866 | Some methods for classification and analysis of multivariate observations - MacQueen - 1967 |

716 | Cluster Analysis for Applications - Anderberg - 1973 |

643 | Knowledge Acquisition Via Incremental Conceptual Clustering
- Fisher
- 1987
Citation Context ...NH99]. The clustering problem has been formulated in various ways in the statistics [KR89,BR93,B95,S92,S86], pattern recognition [DH73,F90], optimization [BMS97,SI84], and machine learning literature [F87]. The fundamental problem is that of grouping together (clustering) data items that are similar to each other. The most general view places clustering in the framework of density estimation [S86, S92,... |

594 | Efficient and effective clustering method for spatial data mining
- Ng, Han
- 1994
Citation Context ...t in the statistical literature [PE96,GMPS97,CS96] that EM modeling results in better quality models than other simpler alternatives like k-Means (upon which algorithms like BIRCH [ZRL97] and CLARANS [NH94]). There is no prior work on scaling EM so we compare against de facto standard practices for dealing with large databases: sampling-based and on-line algorithms. Other scalable clustering algorithms ... |

563 | CURE: An efficient clustering algorithm for large databases
- Guha, Rastogi, et al.
- 1998
Citation Context ...-line algorithms. Other scalable clustering algorithms exist, but do not produce mixture model representations of the database probability density function (e.g. BIRCH, DBSCAN [SEKX98], CLARANS, CURE [GRS98], etc.) Three techniques estimating mixture model parameters over large databases are evaluated: scalable EM (SEM), standard or "vanilla" EM run over random samples of the database (VEM), and an onlin... |

562 | Automatic subspace clustering of high dimensional data for data mining applications - Agrawal, Gehrke, et al. - 1998 |

479 | Bayesian classification (AutoClass): theory and results - Cheeseman, Stutz - 1996 |

370 | Multivariate Density Estimation - Scott - 1992 |

313 | Model-based Gaussian and non-Gaussian clustering - Banfield, Raftery - 1993 |

246 | Scaling clustering algorithms to large databases
- Bradley, Fayyad, et al.
- 1998
Citation Context ... and that it outperforms traditional ways for coping with large databases involving data sampling. The framework we present can accommodate many iterative refinement type algorithms including k-Means [BFR98], and also supports clustering over data with mixed continuous-discrete attributes. The localized data access properties of SEM have additional benefits relating to improved utilization of fast caches... |

235 | Refining initial points for K-Means clustering - Bradley, Fayyad - 1998 |

182 | Finding Groups in Data - Kaufman, Rousseeuw - 1990 |

144 | Clustering categorical data: An approach based on dynamical systems
- Gibson, Kleinberg, et al.
- 1998
Citation Context ...tance measures and readily admits categorical and continuous attributes (which is untrue of other clustering algorithms that either operate on continuous, e.g. k-Means-type algorithms, or categorical [GKR98] data exclusively). EM has been shown to be superior to other alternatives for statistical modeling purposes [GMPS97,PE96,B95,CS96,NH99]. The clustering problem has been formulated in various ways in ... |

133 |
Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery
- Sander, Ester, et al.
- 1998
Citation Context ...s: sampling-based and on-line algorithms. Other scalable clustering algorithms exist, but do not produce mixture model representations of the database probability density function (e.g. BIRCH, DBSCAN [SEKX98], CLARANS, CURE [GRS98], etc.) Three techniques estimating mixture model parameters over large databases are evaluated: scalable EM (SEM), standard or "vanilla" EM run over random samples of the datab... |

121 | K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality - Selim, Ismail - 1984 |

83 | Convergence properties of the K-means algorithm - Bottou, Bengio - 1995 |

79 | An experimental comparison of several clustering and initialization methods - Meilă, Heckerman - 1998 |

65 | Clustering using Monte Carlo cross validation - Smyth - 1996 |

65 | BIRCH: A New Data Clustering Algorithm and Its Applications
- Zhang, Ramakrishnan, et al.
- 1998
Citation Context ...ustering to Large Databases Bradley, Fayyad, and Reina 1 1 Introduction Data clustering is important in many fields, including data mining [FPSU96], statistical data analysis [KR89,BR93], compression [ZRL97], and vector quantization. Applications include data analysis and modeling [FDW97,FHS96], image segmentation, marketing, fraud detection, predictive modeling, data summarization, and general data repo... |

54 | Compressed data cubes for OLAP aggregate query approximation on continuous dimensions - Shanmugasundaram, Fayyad, et al. - 1999 |

49 | Clustering via Concave Minimization - Bradley, Mangasarian, et al. - 1997 |

41 |
Towards faster stochastic gradient search. Advances in neural information processing systems
- Darken, Moody
Citation Context ...e (VEM), and an online EM implementation (OEM) [NH99, DM92]. The online EM algorithm is a stochastic gradient descent approach that operates by updating the initial mixture model one record at a time [DM92]. A single record is read and its membership probabilities in each of the k clusters is computed. The cluster parameters are then updated and the record is purged from memory. SEM has three major para... |

41 | A Statistical Perspective on Knowledge Discovery in Databases - John, Pregibon - 1996 |

36 | Clustering algorithms. In Information Retrieval: Data Structures and Algorithms - Rasmussen - 1992 |

33 | Density-based indexing for approximate nearest-neighbor queries - Bennett, Fayyad, et al. - 1999 |

32 | Statistical Themes and Lessons for Data Mining - Glymour, Madigan, et al. - 1997 |

24 | Density-Based Clustering - Sander, Ester, et al. - 1998 |

23 |
A View of the EM Algorithm that
- Neal, Hinton
- 1998
Citation Context ...ndidate sub-clusters takes place over RS, post primary compression purging. Candidate sub-clusters are determined by applying either the standard k-Means clustering algorithm [F90,DH73] or "harsh" EM [NH99] to RS. The choice of these "hard" assignment clustering algorithms is justified by the fact to fully compress data records into sufficient statistics, records must have full membership in their repre... |

13 | Mining Science Data - Fayyad, Haussler, et al. - 1996 |

13 | Fast Density Estimation Using CF-Kernel for Very Large Databases - Zhang, Ramakrishnan, et al. - 1999 |

8 | Industrial Applications of Data Mining and Knowledge Discovery - Brachman, Khabaza, et al. - 1996 |

5 | Fast density and probability estimation using CF-Kernel method for very large databases
- Livny, Ramakrishnan, et al.
- 1996
Citation Context ...o support the k-Means algorithm and not the probabilistic EM algorithm. However, the CF-tree structure can be used to do nonparametric (kernel-based) density estimation with a large number of kernels [LRZ96]. A set of related clustering algorithms, which for the same reasons cannot be compared with our density estimation approach, are DBSCAN, GDBSCAN [SEKX98], and CLARANS [NH94] which are designed primar... |

4 | Clustering Algorithms. In Information Retrieval: Data Structures and Algorithms - Rasmussen - 1992 |

3 | Application of Classification and Clustering to Sky Survey Cataloging and Analysis - Fayyad, Djorgovski, et al. - 1997 |

2 |
Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications
- Agrawal, Gehrke, Gunopulos, Raghavan
- 1998
Citation Context ...ch are designed primarily for spatial data clustering. DBSCAN and GDBSCAN are region growing algorithms and do not aim to probabilistically model the density. The same applies to the CLIQUE algorithm [AGGR98] which grows dense data regions by attempting to find all clustered subspaces of the original data space and present the result to the user in a minimal DNF expression. CLIQUE requires many data scans... |

2 | Refining Initialization of Clustering Algorithms - Fayyad, Reina, et al. - 1998 |