## Clustering in Massive Data Sets (1999)

Venue: | Handbook of massive data sets |

Citations: | 11 - 0 self |

### BibTeX

@INPROCEEDINGS{Murtagh99clusteringin,
    author = {Fionn Murtagh},
    title = {Clustering in Massive Data Sets},
    booktitle = {Handbook of massive data sets},
    year = {1999},
    pages = {501--543},
    publisher = {Kluwer Academic Publishers}
}

### Abstract

We review the time and storage costs of search and clustering algorithms, illustrating them with case studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Sections 2 to 6 cover nearest neighbor searching, an elemental form of clustering and a basis for the clustering algorithms that follow. Sections 7 to 11 review a number of families of clustering algorithms. Sections 12 to 14 cover visual or image representations of data sets, from which a number of interesting algorithmic developments arise.

### Citations

8090 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

3124 | Introduction to Modern Information Retrieval - Salton, McGill - 1983 |

2307 | Estimating the dimension of a model - Schwarz - 1978 |

2152 | Algorithms for Clustering Data - Jain, Dubes - 1988 |

1872 | Randomized Algorithms - Motwani, Raghavan - 1995 |

983 | Bayes factors - Kass, Raftery - 1995 |

842 | Least squares quantization in PCM - Lloyd - 1982 |

414 | Syntactic clustering of the Web - Broder, Glassman, et al. - 1997 |

341 | On the resemblance and containment of documents - Broder - 1997 |

295 | When is “nearest neighbor” meaningful? - Beyer, Goldstein, et al. - 1999 |

277 | How many clusters? Which clustering method? Answers via model-based Cluster Analysis - Fraley, Raftery - 1998 |

245 | Graph-theoretical methods for detecting and describing gestalt clusters - Zahn - 1971 |

118 | A branch and bound algorithm for computing k-nearest neighbors - Fukunaga, Narendra - 1975 |

110 | Matrices, vector spaces, and information retrieval - Berry, Drmac, et al. - 1999 |

106 | Accelerating exact k-means algorithms with geometric reasoning - Pelleg, Moore - 1999 |

97 | Gaussian parsimonious clustering models - Celeux, Govaert - 1995 |

89 | Very fast EM-based mixture model clustering using multiresolution kd-trees - Moore - 1999 |

82 | Detecting features in spatial point processes with clutter via model-based clustering - Dasgupta, Raftery - 1998 |

75 | Efficient algorithms for agglomerative hierarchical clustering methods - Day, Edelsbrunner - 1984 |

72 | Dotplot: a program for exploring self-similarity in millions of lines of text and code - Church, Helfman - 1992 |

68 | Nearest Neighbor (NN) Norms - Dasarathy - 1991 |

67 | Finding minimum spanning trees - Cheriton, Tarjan - 1976 |

Citation Context: ...ity can be assumed or if the data can be broken down into regions of approximately uniformly distributed points. [Figure 1: Example of simple binning in the plane, with x_min, y_min = 0, 0; x_max, y_max = 50, 40; r = 10. Point (22,32) is mapped onto cell (2,3); point (8,13) is mapped onto cell (0,1).] Simple Fortran code for this approach is listed, and discussed, in Schreiber (1993...

66 | Multidimensional Clustering Algorithms, Compstat lectures - Murtagh - 1985 |

56 | Algorithms for model-based Gaussian hierarchical clustering - Fraley - 1997 |

52 | A view of the EM algorithm that justifies incremental, sparse, and other variants - Neal, Hinton - 1998 |

50 | Nearest-neighbor clutter removal for estimating features in spatial point processes - Byers, Raftery - 1998 |

46 | Applied Combinatorics - Tucker - 1980 |

42 | Note on learning rate schedules for stochastic optimization - Darken, Moody - 1990 |

42 | Learning rate schedules for faster stochastic gradient search - Darken, Chang, et al. - 1992 |

42 | Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples - Spath - 1985 |

41 | Towards faster stochastic gradient search, Advances in Neural Information Processing Systems - Darken, Moody |

41 | Multivariate data analysis, D - Murtagh, Heck - 1987 |

37 | An efficient algorithm for a complete link method - Defays - 1977 |

34 | Semantic Road Maps for Literature Searchers, The - Doyle - 1961 |

34 | Some Competitive Learning Methods - Fritzke - 1997 |

33 | Density-based indexing for approximate nearest-neighbor queries - Bennett, Fayyad, et al. - 1999 |

30 | Fast algorithms for constructing minimal spanning trees in coordinate spaces - Bentley, Friedman - 1978 |

30 | Fitting straight lines to point patterns - Murtagh, Raftery - 1984 |

29 | An overview of combinatorial data analysis, in: Arabie - Arabie - 1996 |

Citation Context: ... data can be broken down into regions of approximately uniformly distributed points. [Figure 1: Example of simple binning in the plane, with x_min, y_min = 0, 0; x_max, y_max = 50, 40; r = 10. Point (22,32) is mapped onto cell (2,3); point (8,13) is mapped onto cell (0,1).] Simple Fortran code for this approach is listed, and discussed, in Schreiber (1993). The search through adja...

28 | Complexities of hierarchic clustering algorithms: the state of the art - Murtagh - 1984 |

27 | Nonparametric maximum likelihood estimation of features in spatial point processes using Voronoï tessellation - Allard, Fraley - 1997 |

Citation Context: ...e further implementation details can be found in Murtagh (1993a). A powerful theoretical result regarding this approach is as follows. For uniformly distributed points, the NN of a point is found in O(1), or constant, expected time (see Delannoy, 1980, or Bentley et al., 1980, for proof). Therefore this approach will work well if approximate uniformity can be assumed or if the data can be broken down...

26 | Some methods for classification and analysis of multivariate observations - MacQueen - 1967 |

24 | Sparse matrix reordering schemes for browsing hypertext - Berry, Hendrickson, et al. - 1996 |

23 | Optimal expected time algorithms for closest point problems - Bentley, Weide, et al. |

Citation Context: ...broken down into regions of approximately uniformly distributed points. [Figure 1: Example of simple binning in the plane, with x_min, y_min = 0, 0; x_max, y_max = 50, 40; r = 10. Point (22,32) is mapped onto cell (2,3); point (8,13) is mapped onto cell (0,1).] Simple Fortran code for this approach is listed, and discussed, in Schreiber (1993). The search through adjacent cells req...

22 | Hierarchic agglomerative clustering methods for automatic document classification - Griffiths, Robinson, et al. - 1984 |

21 | The Kohonen Self-Organizing Map Method: An Assessment - Murtagh, Hernández-Pajares - 1995 |

19 | Nonparametric estimation of gamma-ray burst intensities using Haar wavelets - Kolaczyk - 1997 |

18 | Subquadratic approximation algorithms for clustering problems in high dimensional spaces - Borodin, Ostrovsky, et al. - 1999 |

18 | An algorithm for best matches in logarithmic expected time - Friedman, Bentley, et al. - 1977 |

18 | Image and Data Analysis: The Multiscale Approach - Starck, Murtagh, et al. - 1998 |
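
### Illustration: simple binning for nearest neighbor search

The citation contexts above describe mapping planar points onto grid cells (Figure 1) and searching adjacent cells for the nearest neighbor (NN). The following is a minimal Python sketch of that idea, not the Fortran code of Schreiber (1993) or the author's implementation. It assumes that `r` is the side length of a square cell and that the grid origin is (x_min, y_min) = (0, 0), which is consistent with the Figure 1 example. The function names `cell_of`, `build_grid`, and `nearest_neighbor` are hypothetical, chosen for illustration.

```python
import math
from collections import defaultdict


def cell_of(point, origin=(0.0, 0.0), r=10.0):
    """Map a 2-D point to its (column, row) cell; r is the assumed cell side."""
    return (int((point[0] - origin[0]) // r),
            int((point[1] - origin[1]) // r))


def build_grid(points, origin=(0.0, 0.0), r=10.0):
    """Bin every point into the list stored for its cell."""
    grid = defaultdict(list)
    for p in points:
        grid[cell_of(p, origin, r)].append(p)
    return grid


def nearest_neighbor(query, grid, origin=(0.0, 0.0), r=10.0):
    """Return the stored point closest to `query`, excluding points equal to
    `query` itself, by scanning square rings of cells outward from its cell."""
    if not grid:
        return None
    qc, qr = cell_of(query, origin, r)
    # Farthest ring that can contain any occupied cell.
    max_ring = max(max(abs(c - qc), abs(w - qr)) for c, w in grid)
    best, best_d = None, math.inf
    for k in range(max_ring + 1):
        for c in range(qc - k, qc + k + 1):
            for w in range(qr - k, qr + k + 1):
                if max(abs(c - qc), abs(w - qr)) != k:
                    continue  # interior cells were scanned in earlier rings
                for p in grid.get((c, w), ()):
                    d = math.dist(p, query)
                    if p != query and d < best_d:
                        best, best_d = p, d
        # Points in rings beyond k lie at distance >= k * r, so stop early.
        if best_d <= k * r:
            break
    return best


points = [(22, 32), (8, 13), (25, 30), (48, 5)]
grid = build_grid(points)
print(cell_of((22, 32)))  # (2, 3), as in Figure 1
print(cell_of((8, 13)))   # (0, 1), as in Figure 1
print(nearest_neighbor((22, 32), grid))
```

When points are roughly uniformly distributed, each cell holds few points and the ring scan usually stops after one or two rings, which is the intuition behind the constant expected time result quoted in the citation context. The early-stopping test is one possible choice made for this sketch; the source does not spell out its stopping rule.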