Results 1 - 2 of 2
Scaling EM (Expectation-Maximization) Clustering to Large Databases
1999
Cited by 40 (0 self)
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
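For orientation, here is a minimal sketch of textbook EM for a one-dimensional Gaussian mixture. It is not the scalable method the abstract describes; it only illustrates the two points the abstract relies on: every iteration scans every record, and the parameter update depends on the data only through a few accumulated statistics. The names `em_step` and `fit`, the variance floor, and the starting values are assumptions made for this illustration.

```python
import math


def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture (one full pass over data)."""
    k = len(weights)
    s0 = [0.0] * k  # sum of responsibilities per cluster
    s1 = [0.0] * k  # responsibility-weighted sum of x
    s2 = [0.0] * k  # responsibility-weighted sum of x^2
    log_likelihood = 0.0
    for x in data:
        # E-step for one record: weighted density of x under each cluster.
        dens = [
            weights[j]
            * math.exp(-((x - means[j]) ** 2) / (2 * variances[j]))
            / math.sqrt(2 * math.pi * variances[j])
            for j in range(k)
        ]
        total = sum(dens)
        log_likelihood += math.log(total)
        for j in range(k):
            r = dens[j] / total  # posterior membership of x in cluster j
            s0[j] += r
            s1[j] += r * x
            s2[j] += r * x * x
    # M-step: the new parameters use only s0, s1, s2, not the raw records.
    n = len(data)
    new_weights = [s0[j] / n for j in range(k)]
    new_means = [s1[j] / s0[j] for j in range(k)]
    # Assumed variance floor (1e-6) to avoid a degenerate zero-width cluster.
    new_variances = [
        max(s2[j] / s0[j] - new_means[j] ** 2, 1e-6) for j in range(k)
    ]
    return new_weights, new_means, new_variances, log_likelihood


def fit(data, weights, means, variances, iterations=50):
    """Run a fixed number of EM iterations; each one rescans all of data."""
    log_likelihood = None
    for _ in range(iterations):
        weights, means, variances, log_likelihood = em_step(
            data, weights, means, variances
        )
    return weights, means, variances, log_likelihood
```

Because the update needs only the three running sums per cluster, a record can in principle be replaced by its contribution to those sums once its membership is settled, which is the kind of decomposition the abstract alludes to.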
dNumber: A Fast Clustering Algorithm for Very Large Categorical Datasets
2002
In this paper, we present dNumber, a new algorithm for clustering categorical data that is very fast while still producing acceptable clustering results. It reads each tuple in sequence, and a mapping function that incorporates the existing distribution of attribute values produces a number indicating the tuple's group. Due to its characteristics, the proposed algorithm is extremely suitable for clustering data streams, where, given a sequence of points, the objective is to maintain a consistently good clustering of the sequence so far, using a small amount of memory and time. Experimental results on real-life and synthetic datasets verify the superiority of dNumber.