## Scalability for Clustering Algorithms Revisited (2000)

Venue: SIGKDD Explorations

Citations: 52 (4 self)

### BibTeX

```bibtex
@article{Farnstrom00scalabilityfor,
  author  = {Fredrik Farnstrom and James Lewis and Charles Elkan},
  title   = {Scalability for Clustering Algorithms Revisited},
  journal = {SIGKDD Explorations},
  year    = {2000},
  volume  = {2},
  pages   = {51--57}
}
```

### Abstract

This paper presents a simple new algorithm that performs k-means clustering in one scan of a dataset, using a fixed-size buffer for points from the dataset. Experiments show that the new method is several times faster than standard k-means, and that it produces clusterings of equal or almost equal quality. The new method is a simplification of an algorithm due to Bradley, Fayyad, and Reina that uses several data compression techniques in an attempt to improve speed and clustering quality. Unfortunately, the overhead of these techniques makes the original algorithm several times slower than standard k-means on materialized datasets, even though standard k-means scans a dataset multiple times. Also, lesion studies show that the compression techniques do not improve clustering quality. All results hold for 400 megabyte synthetic datasets and for a dataset created from the real-world data used in the 1998 KDD data mining contest. All algorithm implementations and experiments are designed so that results generalise to datasets of many gigabytes and larger.
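The one-scan scheme the abstract describes can be sketched as follows: points are read into a fixed-size buffer, weighted k-means is run over the buffered points together with centroid summaries carried over from earlier buffers, and the whole buffer is then compressed into k weighted centroids. This is a minimal illustration under stated assumptions, not the authors' implementation; the function name, the chunked `stream` interface, and the simplification of pinning each carried-over summary to its own cluster are assumptions of this sketch.

```python
import numpy as np

def single_pass_kmeans(stream, k, buffer_size, iters=20, seed=0):
    """Hedged sketch of one-scan k-means with a fixed-size buffer.

    `stream` yields (m, d) arrays with m <= buffer_size. After each
    buffer is clustered, all buffered points are discarded and replaced
    by k weighted centroids carried into the next round.
    """
    rng = np.random.default_rng(seed)
    centers = None           # (k, d) array of current centroids
    weights = np.zeros(k)    # number of points each centroid summarizes
    for chunk in stream:
        assert len(chunk) <= buffer_size, "chunk must fit in the buffer"
        if centers is None:
            # initialize centroids from the first buffer
            centers = chunk[rng.choice(len(chunk), k, replace=False)]
        for _ in range(iters):
            # assign buffered points to their nearest centroid
            d2 = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = d2.argmin(1)
            new_centers = np.empty_like(centers)
            new_weights = np.empty(k)
            for j in range(k):
                pts = chunk[labels == j]
                w = weights[j] + len(pts)
                if w == 0:
                    new_centers[j], new_weights[j] = centers[j], 0.0
                    continue
                # weighted mean of the carried summary and new points
                s = weights[j] * centers[j] + pts.sum(0)
                new_centers[j], new_weights[j] = s / w, w
            centers = new_centers
        # compress the entire buffer into the k weighted centroids
        weights = new_weights
    return centers, weights
```

The key property is constant memory: only the k weighted centroids survive between buffers, so memory use is independent of the dataset size, at the cost of irrevocably committing each buffered point to a cluster.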

### Citations

575 | Cure: An Efficient Clustering Algorithm for Large Databases
- Guha, Rastogi, et al.
- 1998
Citation Context: ...to a solution, with each element needing to be accessed on each iteration. Therefore, considerable recent research has focused on designing clustering algorithms that use only one pass over a dataset [9; 6]. These methods all assume that only a portion of the dataset can reside in memory, and require only a single pass through the dataset. The starting point of this paper is a single pass k-means algori...

562 | Approximate statistical tests for comparing supervised classification learning algorithms
- Dietterich
- 1998
Citation Context: ...ts are not reported because their precision could be misleading, since the assumptions on which standard tests are based are often not valid when comparing performance metrics for data mining methods [3]. Figure 2 shows surprisingly that the standard k-means algorithm is not significantly more reliable than random sampling k-means. This fact indicates that the standard algorithm has difficulty escapi...

266 | Clustering data streams
- Guha, Mishra, et al.
- 2000
Citation Context: ...sets of low dimensionality (d < 8). Our simple single pass method is effective regardless of dimensionality. The results here are also complementary to those of Guha, Mishra, Motwani, and O'Callaghan [5], who present single pass clustering algorithms that are guaranteed to achieve clusterings with quality within a constant factor of optimal. We have not tested other single pass clustering algorithms,...

254 | Scaling clustering algorithms to large databases
- Bradley, Fayyad, et al.
- 1998
Citation Context: ...ion of the dataset can reside in memory, and require only a single pass through the dataset. The starting point of this paper is a single pass k-means algorithm proposed by Bradley, Fayyad, and Reina [1]. This method uses several types of compression to limit memory usage. However, the compression techniques make the algorithm complicated. We investigate the tradeoffs involved by comparing several va...

223 | BIRCH: An Efficient Data Clustering Method for Very Large Databases
- Zhang, Ramakrishnan, et al.
- 1996
Citation Context: ...to a solution, with each element needing to be accessed on each iteration. Therefore, considerable recent research has focused on designing clustering algorithms that use only one pass over a dataset [9; 6]. These methods all assume that only a portion of the dataset can reside in memory, and require only a single pass through the dataset. The starting point of this paper is a single pass k-means algori...

108 | Accelerating Exact k-means Algorithms with Geometric Reasoning
- Pelleg, Moore
- 1999
Citation Context: ...speed is likely to be still the same. The results of this paper are complementary to those of Pelleg and Moore [8], who show how to use a sophisticated data structure to increase the speed of k-means clustering for datasets of low dimensionality (d < 8). Our simple single pass method is effective regardless of di...

83 | An experimental comparison of several clustering and initialization methods
- Meila, Heckerman
- 1998
Citation Context: ...satisfy a tightness criterion, meaning that its standard deviation in each dimension must be below a certain threshold β. Secondary clusters are combined using hierarchical agglomerative clustering [7], as long as the combined clusters satisfy the tightness criterion. After primary and secondary compression, the space in the buffer that has become available is filled with new points, and the whole ...

50 | Mining Very Large Databases
- Ganti, Gehrke, et al.
- 1999
Citation Context: ...hanges in the re-estimated cluster parameters. Under the assumption that datasets tend to be small, research on clustering algorithms has traditionally focused on improving the quality of clusterings [4]. However, many datasets now are large and cannot fit into main memory. Scanning a dataset stored on disk or tape repeatedly is time-consuming, but the standard k-means algorithm typically requires man...

21 | Scaling EM (expectation maximization) clustering to large databases
- Bradley, Fayyad, et al.
- 1998
Citation Context: ...H would be interesting. Also, all the single pass methods discussed in this paper can be extended to apply to other iterative clustering approaches, and in particular to expectation maximization (EM) [2]. It would be interesting to repeat the experiments of this paper in the EM context. Acknowledgments: The authors are grateful to Nina Mishra and Bin Zhang of Hewlett Packard Laboratories and to the a...