Results 1-10 of 18
Model-Based Clustering, Discriminant Analysis, and Density Estimation
 JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
, 2000
Abstract

Cited by 319 (26 self)
Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as "How many clusters are there?", "Which clustering method should be used?" and "How should outliers be handled?". We outline a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology, a...
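The "How many clusters?" question is the one model-based clustering answers most directly: fit a mixture model for each candidate number of components and compare the fits with a criterion such as BIC. The paper's methodology is much richer (covariance parameterizations, noise components, outlier handling); the sketch below is only a minimal one-dimensional illustration with a hand-rolled EM loop, and the function names are illustrative, not the authors'.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_loglik(data, k, iters=200):
    # EM for a 1-D Gaussian mixture with deterministic quantile initialization
    pts = sorted(data)
    n = len(pts)
    mus = [pts[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    mean = sum(pts) / n
    vars_ = [sum((x - mean) ** 2 for x in pts) / n] * k
    pis = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for x in pts:
            w = [pis[j] * normal_pdf(x, mus[j], vars_[j]) for j in range(k)]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: re-estimate weights, means, variances (variance floored
        # to keep a component from collapsing onto a single point)
        for j in range(k):
            nj = sum(r[j] for r in resp)
            pis[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, pts)) / nj
            vars_[j] = max(sum(r[j] * (x - mus[j]) ** 2
                               for r, x in zip(resp, pts)) / nj, 1e-3)
    return sum(math.log(sum(pis[j] * normal_pdf(x, mus[j], vars_[j])
                            for j in range(k))) for x in pts)

def bic(data, k):
    # 3k - 1 free parameters in 1-D: k means, k variances, k - 1 weights
    return (3 * k - 1) * math.log(len(data)) - 2 * gmm_loglik(data, k)

# two well-separated groups: BIC should favor k = 2 over k = 1
data = [0.0, 0.1, -0.1, 0.2, -0.2, 10.0, 10.1, 9.9, 10.2, 9.8]
best_k = min((1, 2), key=lambda k: bic(data, k))
```

With the toy data above, the two-component fit pays the extra parameter penalty easily and BIC selects k = 2.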
Comparing Clusterings
, 2002
Abstract

Cited by 52 (4 self)
This paper proposes an information theoretic criterion for comparing two clusterings of the same data set. The criterion, called variation of information (VI), measures the amount of information that is lost or gained in changing from clustering C to clustering C'. The criterion makes no assumptions about how the clusterings were generated and applies to both soft and hard clusterings. The basic properties of VI are presented and discussed from the point of view of comparing clusterings. In particular, the VI is positive, symmetric and satisfies the triangle inequality and thus, surprisingly enough, is a true metric on the space of clusterings.
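The criterion itself is compact: VI(C, C') = H(C) + H(C') - 2 I(C, C'), where H is the entropy of the cluster-size distribution and I is the mutual information of the joint membership counts. A minimal plug-in estimate from hard label vectors (pure Python; the function name is illustrative):

```python
import math
from collections import Counter

def variation_of_information(labels_a, labels_b):
    """VI(C, C') = H(C) + H(C') - 2 I(C, C'), in nats."""
    n = len(labels_a)
    pa = Counter(labels_a)                 # cluster sizes in C
    pb = Counter(labels_b)                 # cluster sizes in C'
    joint = Counter(zip(labels_a, labels_b))  # co-occurrence counts
    h_a = -sum((c / n) * math.log(c / n) for c in pa.values())
    h_b = -sum((c / n) * math.log(c / n) for c in pb.values())
    mi = sum((c / n) * math.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
             for (a, b), c in joint.items())
    return h_a + h_b - 2 * mi

c1 = [0, 0, 1, 1, 2, 2]
c2 = [0, 0, 1, 1, 1, 1]   # c1 with clusters 1 and 2 merged
```

Identical clusterings give VI = 0; merging two of c1's clusters, as in c2, yields a strictly positive distance, and the measure is symmetric in its two arguments.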
Learning recursive Bayesian multinets for data clustering by means of constructive induction
, 2001
Abstract

Cited by 19 (7 self)
This paper introduces and evaluates a new class of knowledge model, the recursive Bayesian multinet (RBMN), which encodes the joint probability distribution of a given database. RBMNs extend Bayesian networks (BNs) as well as partitional clustering systems. Briefly, an RBMN is a decision tree with component BNs at the leaves. An RBMN is learnt using a greedy, heuristic approach akin to that used by many supervised decision tree learners, but where BNs are learnt at leaves using constructive induction. A key idea is to treat expected data as real data. This allows us to complete the database and to take advantage of a closed form for the marginal likelihood of the expected complete data that factorizes into separate marginal likelihoods for each family (a node and its parents). Our approach is evaluated on synthetic and real-world databases.
Clustering in Massive Data Sets
 Handbook of massive data sets
, 1999
Abstract

Cited by 12 (0 self)
We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Sections 2 to 6 relate to nearest neighbor searching, an elemental form of clustering, and a basis for the clustering algorithms to follow. Sections 7 to 11 review a number of families of clustering algorithms. Sections 12 to 14 relate to visual or image representations of data sets, from which a number of interesting algorithmic developments arise.
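The baseline against which the surveyed nearest-neighbor structures are measured is a linear scan, O(nd) per query for n points in d dimensions. A minimal sketch of that elemental operation (names are illustrative, not from the handbook):

```python
def nearest_neighbor(query, points):
    # brute-force scan: O(n * d) time per query, the baseline that
    # index structures (trees, grids, hashing) try to beat
    def sqdist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return min(points, key=lambda p: sqdist(p, query))

pts = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.2)]
nearest_neighbor((1.0, 1.0), pts)  # -> (1.0, 1.2)
```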
Overcoming the Curse of Dimensionality in Clustering by means of the Wavelet Transform
 The Computer Journal
, 2000
Abstract

Cited by 11 (3 self)
We use a redundant wavelet transform analysis to detect clusters in high-dimensional data spaces. We overcome Bellman's "curse of dimensionality" in such problems by (i) using some canonical ordering of observation and variable (document and term) dimensions in our data, (ii) applying a wavelet transform to such canonically ordered data, (iii) modeling the noise in wavelet space, (iv) defining significant component parts of the data as opposed to insignificant or noisy component parts, and (v) reading off the resultant clusters. The overall complexity of this innovative approach is linear in the data dimensionality. We describe a number of examples and test cases, including the clustering of high-dimensional hypertext data. 1 Introduction Bellman's (1961) [1] "curse of dimensionality" refers to the exponential growth of hypervolume as a function of dimensionality. All problems become tougher as the dimensionality increases. Nowhere is this more evident than in problems related to ...
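Steps (ii)-(v) can be illustrated in the simplest possible setting: a one-dimensional decimated Haar transform with hard thresholding of the detail coefficients, after which contiguous constant runs are the recoverable clusters. This is only a toy stand-in for the paper's redundant transform, with illustrative function names:

```python
def haar_step(signal):
    # one level of the Haar transform: pairwise averages and differences
    avg = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    det = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return avg, det

def denoise(signal, threshold):
    # assumes len(signal) is a power of two
    if len(signal) == 1:
        return signal
    avg, det = haar_step(signal)
    # step (iii)/(iv): treat small detail coefficients as noise, zero them
    det = [d if abs(d) > threshold else 0.0 for d in det]
    smooth = denoise(avg, threshold)
    # inverse step: recombine the smoothed averages with surviving details
    out = []
    for a, d in zip(smooth, det):
        out.extend([a + d, a - d])
    return out
```

With threshold 0 the transform is exactly invertible; with a modest threshold the within-cluster wiggle is flattened while the large jump between clusters survives, so `denoise([1.0, 1.1, 0.9, 1.0, 9.0, 9.1, 8.9, 9.0], 0.2)` comes back as four values near 1 followed by four near 9.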
Latent Variable Discovery in Classification Models
, 2004
Abstract

Cited by 11 (2 self)
The naive Bayes model makes the often unrealistic assumption that feature variables are mutually independent given the class variable. We interpret the violation of this assumption as an indication of the presence of latent variables and show how latent variables can be detected. Latent variable discovery is interesting, especially for medical applications, because it can lead to better understanding of application domains. It can also improve classification accuracy and boost user confidence in classification models.
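One way to surface such a violation, in the spirit of the abstract, is to test conditional independence of feature pairs given the class: conditional mutual information I(X; Y | C) is zero when naive Bayes holds and positive when a latent common cause links X and Y. A rough plug-in estimate from counts (illustrative code, not the paper's algorithm):

```python
import math
from collections import Counter

def cond_mutual_info(xs, ys, cs):
    """Plug-in estimate of I(X; Y | C) in nats from discrete samples."""
    n = len(cs)
    p_c = Counter(cs)
    p_xc = Counter(zip(xs, cs))
    p_yc = Counter(zip(ys, cs))
    p_xyc = Counter(zip(xs, ys, cs))
    mi = 0.0
    for (x, y, c), nxyc in p_xyc.items():
        # p(x,y|c) / (p(x|c) p(y|c)) written with raw counts
        mi += (nxyc / n) * math.log((nxyc * p_c[c]) / (p_xc[(x, c)] * p_yc[(y, c)]))
    return mi

cs = [0, 0, 0, 0, 1, 1, 1, 1]
dep = cond_mutual_info([0, 1, 0, 1, 0, 1, 0, 1],
                       [0, 1, 0, 1, 0, 1, 0, 1], cs)  # X == Y within class
ind = cond_mutual_info([0, 0, 1, 1, 0, 0, 1, 1],
                       [0, 1, 0, 1, 0, 1, 0, 1], cs)  # independent given class
```

Here `dep` is about log 2 nats (the two features are copies within each class, hinting at a hidden common cause), while `ind` is exactly zero.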
Clustering Massive Datasets With Applications in Software Metrics and Tomography
 Technometrics
, 1998
Abstract

Cited by 9 (2 self)
Clustering datasets is not an easy problem in general, and the difficulty is compounded for a massive dataset. Restricting attention to a sample from the data ignores minority groups and hence compromises on the available riches. This paper develops, under Gaussian assumptions, a multistage clustering algorithm. After clustering an initial sample, observations that can be reasonably classified in the identified groups are filtered out using a series of likelihood ratio tests. The remainder are again sampled, clustered and the procedure iterated until all cases have either been clustered or classified. Final estimates of the class probabilities and the dispersions are obtained after an initial classification of the complete dataset into the identified clusters. Class membership of the observations in the dataset is finally assigned using these estimated probabilities and dispersions. Results on several test experiments indicated good performance. The procedure was also implemented on t...
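The filtering step can be pictured as follows: given clusters found in a sample, a point is set aside as "reasonably classified" when a likelihood-ratio style comparison of its best-fitting cluster against the alternatives clears a threshold; everything else stays in the remainder for the next round. A simplified one-dimensional stand-in (Gaussian densities, illustrative names), not the paper's exact test:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def filter_classified(data, clusters, threshold):
    """Split data into (classified, remainder).

    `clusters` is a list of (mean, variance) pairs; a point is kept as
    classified when its best cluster's density dominates the combined
    density of the alternatives by more than `threshold`."""
    classified, remainder = [], []
    for x in data:
        dens = [normal_pdf(x, mu, var) for mu, var in clusters]
        best = max(dens)
        rest = sum(dens) - best
        if rest == 0 or best / rest > threshold:
            classified.append(x)   # unambiguous: filter it out of the loop
        else:
            remainder.append(x)    # ambiguous: re-sample and re-cluster later
    return classified, remainder

classified, remainder = filter_classified(
    [0.1, 5.0, 9.9], [(0.0, 1.0), (10.0, 1.0)], 10.0)
```

Points near a cluster center are filtered out immediately, while the point midway between the two clusters stays in the remainder for the next iteration.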
Accurate and efficient curve detection in images: the importance sampling Hough transform
 PATTERN RECOGNITION
, 2002
An algorithm for non-distance based clustering in high dimensional spaces
, 2002
Abstract

Cited by 7 (5 self)
Abstract. The clustering problem, which aims at identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity clusters, has been widely studied. Traditional clustering algorithms use distance functions to measure similarity and are not suitable for high dimensional spaces. In this paper, we propose the CoFD algorithm, a non-distance based clustering algorithm for high dimensional spaces. Based on the maximum likelihood principle, CoFD optimizes parameters to maximize the likelihood between data points and the model generated by the parameters. Experimental results on both synthetic data sets and a real data set show the efficiency and effectiveness of CoFD.