Results 1 - 7 of 7
Finding the number of clusters in a data set: An information theoretic approach
- Journal of the American Statistical Association, 2003
Cited by 28 (1 self)
One of the most difficult problems in cluster analysis is the identification of the number of groups in a data set. Most previously suggested approaches to this problem are either somewhat ad hoc or require parametric assumptions and complicated calculations. In this paper we develop a simple yet powerful non-parametric method for choosing the number of clusters based on distortion, a quantity that measures the average distance, per dimension, between each observation and its closest cluster center. Our technique is computationally efficient and straightforward to implement. We demonstrate empirically its effectiveness, not only for choosing the number of clusters but also for identifying underlying structure, on a wide range of simulated and real world data sets. In addition, we give a rigorous theoretical justification for the method based on information theoretic ideas. Specifically, results from the subfield of electrical engineering known as rate distortion theory allow us to describe the behavior of the distortion in both the presence and absence of clustering. Finally, we note that these ideas potentially can be extended to a wide range of other statistical model selection problems.
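The distortion described in this abstract can be illustrated with a minimal sketch. It assumes squared Euclidean distance and a plain k-means fit, neither of which the abstract specifies; the helper names (`sq_dist`, `kmeans`, `distortion`) and the toy data are hypothetical, and the paper's actual rule for picking the number of clusters from the distortion curve is not reproduced here.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns the list of k cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assign every point to its closest center.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[j].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        new_centers = []
        for c, g in zip(centers, groups):
            if g:
                new_centers.append([sum(col) / len(g) for col in zip(*g)])
            else:
                new_centers.append(c)
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def distortion(points, centers):
    """Average squared distance to the closest center, divided by the dimension."""
    dim = len(points[0])
    total = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return total / (len(points) * dim)

# Toy data (hypothetical): two well-separated blobs in two dimensions.
rng = random.Random(1)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(30)]
points += [(10 + rng.uniform(-1, 1), 10 + rng.uniform(-1, 1)) for _ in range(30)]

# Distortion as a function of the number of clusters k.
curve = {k: distortion(points, kmeans(points, k)) for k in range(1, 5)}
for k, d in curve.items():
    print(k, round(d, 3))
```

On data like this the distortion drops sharply when k reaches the true number of groups and changes little afterwards, which is the kind of behavior a distortion-based selection rule can exploit.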
Interpreting and Extending Classical Agglomerative Clustering Algorithms Using a Model-Based Approach
- In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), 2002
Cited by 27 (0 self)
We present two results which arise from a model-based approach to hierarchical agglomerative clustering. First, we show formally that the common heuristic agglomerative clustering algorithms -- Ward's method, single-link, complete-link, and a variant of group-average -- are each equivalent to a hierarchical model-based method. This interpretation gives a theoretical explanation of the empirical behavior of these algorithms, as well as a principled approach to resolving practical issues, such as number of clusters or the choice of method. Second, we show how a model-based viewpoint can suggest variations on these basic agglomerative algorithms.
An Observation-Constrained Generative Approach for Probabilistic Classification of Image Regions
- 2003
Cited by 10 (0 self)
In this paper, we propose a probabilistic region classification scheme for natural scene images. In conventional generative methods, a generative model is learnt for each class using all the available training data belonging to that class. However, if an input image has been generated from only a subset of the model support, use of the full model to assign generative probabilities can produce serious artifacts in the probability assignments. This problem arises mainly when the different classes have multimodal distributions with considerable overlap in the feature space. We propose an approach to constrain the class generative probability of a set of newly observed data by exploiting the distribution of the new data itself and using linear weighted mixing. A Kullback-Leibler divergence-based fast model selection procedure is also proposed for learning mixture models in a low dimensional feature space. The preliminary results on the natural scene images support the effectiveness of the proposed approach.
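The abstract does not say which distributions are compared or how the divergence drives model selection. As a reminder of the quantity itself, the sketch below evaluates the standard closed-form Kullback-Leibler divergence between two univariate Gaussians; the function name `kl_gaussian` is hypothetical and this is not the paper's procedure.

```python
import math

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) for univariate Gaussians."""
    return (
        math.log(sigma2 / sigma1)
        + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
        - 0.5
    )

# Identical distributions have zero divergence.
same = kl_gaussian(0.0, 1.0, 0.0, 1.0)

# The divergence is not symmetric in its two arguments.
forward = kl_gaussian(0.0, 1.0, 0.0, 2.0)
backward = kl_gaussian(0.0, 2.0, 0.0, 1.0)
print(same, forward, backward)
```

The asymmetry matters in practice: which distribution plays the role of the reference changes the value, so any selection criterion built on the divergence has to fix a direction.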
Probabilistic Classification of Image Regions using an Observation-Constrained Generative Approach
- Proc. Int. Workshop on Generative-Model-Based Vision, 2002
In generic image understanding applications, one of the goals is to interpret the semantic context of the scene (e.g., beach, office, etc.). In this paper, we propose a probabilistic region classification scheme for natural scene images as a priming step for the problem of context interpretation. In conventional generative methods, a generative model is learnt for each class using all the available training data belonging to that class. However, if a set of newly observed data has been generated from only a subset of the model support, using the full model to assign generative probabilities can produce serious artifacts in the probability assignments. This problem arises mainly when the different classes have multimodal distributions with considerable overlap in the feature space. We propose an approach to constrain the class generative probability of a set of newly observed data by exploiting the distribution of the new data itself and using linear weighted mixing. A KL-divergence-based fast model selection procedure is also proposed for learning mixture models in a sparse feature space. The preliminary results on the natural scene images support the effectiveness of the proposed approach.
Random Projection for High Dimensional Data Clustering:
- 2003
We investigate how random projection can best be used for clustering high dimensional data. Random projection has been shown to have promising theoretical properties. In practice, however, we find that it results in highly unstable clustering performance. Our solution is to use random projection in a cluster ensemble approach. Empirical results show that the proposed approach achieves better and more robust clustering performance compared to not only single runs of random projection/clustering but also clustering with PCA, a traditional data reduction method for high dimensional data. To gain insights into the performance improvement obtained by our ensemble method, we analyze and identify the influence of the quality and the diversity of the individual clustering solutions on the final ensemble performance.
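One way to picture the ensemble idea in this abstract is sketched below. Everything specific here is an assumption rather than the paper's method: Gaussian random projection matrices, k-means as the base clusterer, and a co-association matrix (the fraction of runs in which two points share a cluster) as the way of combining runs. All function names are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def random_projection(points, target_dim, rng):
    """Project points with a matrix whose entries are i.i.d. standard normal."""
    dim = len(points[0])
    matrix = [[rng.gauss(0, 1) for _ in range(target_dim)] for _ in range(dim)]
    return [
        [sum(p[i] * matrix[i][j] for i in range(dim)) for j in range(target_dim)]
        for p in points
    ]

def kmeans_labels(points, k, rng, iters=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def co_association(points, k, target_dim, runs, seed=0):
    """Fraction of projected clustering runs in which each pair shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(runs):
        projected = random_projection(points, target_dim, rng)
        labels = kmeans_labels(projected, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / runs for c in row] for row in counts]

# Toy data (hypothetical): two groups of points in ten dimensions.
data_rng = random.Random(2)
points = [[data_rng.gauss(0, 1) for _ in range(10)] for _ in range(15)]
points += [[5 + data_rng.gauss(0, 1) for _ in range(10)] for _ in range(15)]

similarity = co_association(points, k=2, target_dim=3, runs=10)
```

A final partition could then be obtained by clustering the rows of `similarity`; averaging over many projections is what damps the instability of any single run.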
Variable Selection and Updating in Model-Based . . .
- 2008
A model-based discriminant analysis method that includes variable selection is presented. The discriminant analysis model is fitted in a semi-supervised manner using both labeled and unlabeled data. The method is shown to give excellent classification performance on several high-dimensional multiclass datasets with more variables than observations. The variables selected by the proposed method provide information about which variables are meaningful for classification purposes. A headlong search strategy for variable selection is shown to be efficient in terms of computation and achieves excellent classification performance. In applications to several food classification datasets, our proposed method outperformed default implementations of Random Forests, AdaBoost and Bayesian Multinomial Regression by
A Survey of Evolutionary Algorithms for Clustering
This paper presents a survey of evolutionary algorithms designed for clustering tasks. It tries to reflect the profile of this area by focusing more on those subjects that have been given more importance in the literature. In this context, most of the paper is devoted to partitional algorithms that look for hard clusterings of data, though overlapping (i.e., soft and fuzzy) approaches are also covered in the manuscript. The paper is original in two main respects. First, it provides an up-to-date overview that is fully devoted to evolutionary algorithms for clustering, is not limited to any particular kind of evolutionary approach, and comprises advanced topics, like multi-objective and ensemble-based evolutionary clustering. Second, it provides a taxonomy that highlights some very important aspects in the context of evolutionary data clustering, namely, fixed or variable number of clusters, cluster-oriented or non-oriented operators, context-sensitive or context-insensitive operators, guided or unguided operators, binary, integer or real encodings, centroid-based, medoid-based, label-based, tree-based or graph-based representations, among others. A number of references are provided that describe applications of evolutionary algorithms for clustering in different domains, such as image processing, computer security, and bioinformatics. The paper ends by addressing some important issues and open questions that can be the subject of future research. Index Terms: evolutionary algorithms, clustering, applications.

