Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and were successfully applied to real-life data mining problems. They are the subject of the survey.