Comparing subspace clusterings
IEEE Transactions on Knowledge and Data Engineering, 2004
Abstract

Cited by 15 (1 self)
Abstract—We present the first framework for comparing subspace clusterings. We propose several distance measures for subspace clusterings, including generalizations of well-known distance measures for ordinary clusterings. We describe a set of important properties for any measure for comparing subspace clusterings and give a systematic comparison of our proposed measures in terms of these properties. We validate the usefulness of our subspace clustering distance measures by comparing clusterings produced by the algorithms FastDOC, HARP, PROCLUS, ORCLUS, and SSPC. We show that our distance measures can also be used to compare partial clusterings, overlapping clusterings, and patterns in binary data matrices. Index Terms—Subspace clustering, projected clustering, distance, feature selection, cluster validation.
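The abstract does not spell the measures out; as an illustrative sketch only, one natural generalization of a set-overlap clustering distance represents each subspace cluster as the set of object×feature cells it covers and averages best-match Jaccard distances between the two clusterings. The cell representation and best-match averaging are assumptions of this sketch, not the paper's definitions:

```python
from itertools import product

def cells(cluster):
    """A subspace cluster = (objects, features); its cells are the object×feature pairs it covers."""
    objects, features = cluster
    return set(product(objects, features))

def jaccard_dist(a, b):
    """Jaccard distance between two cell sets (0 for identical sets)."""
    u = len(a | b)
    return 1.0 - (len(a & b) / u if u else 1.0)

def subspace_clustering_distance(C1, C2):
    """Symmetrized average best-match Jaccard distance between the cell sets
    of two subspace clusterings (a hypothetical stand-in for the paper's measures)."""
    A = [cells(c) for c in C1]
    B = [cells(c) for c in C2]
    d1 = sum(min(jaccard_dist(a, b) for b in B) for a in A) / len(A)
    d2 = sum(min(jaccard_dist(b, a) for a in A) for b in B) / len(B)
    return 0.5 * (d1 + d2)

# Two identical subspace clusterings: objects {1,2,3} in subspace {f1,f2}, objects {4,5} in {f3}
C1 = [({1, 2, 3}, {"f1", "f2"}), ({4, 5}, {"f3"})]
C2 = [({1, 2, 3}, {"f1", "f2"}), ({4, 5}, {"f3"})]
print(subspace_clustering_distance(C1, C2))  # → 0.0
```

Because each cluster carries its own feature set, the same machinery applies unchanged to partial and overlapping clusterings, matching the claim in the abstract.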
Subset clustering of binary sequences, with an application to genomic abnormality data. Biometrics, 2005
Local Semantic Kernels for Text Document Clustering
In Workshop on Text Mining, SIAM International Conference on Data Mining, 2007
Abstract

Cited by 2 (2 self)
Document clustering is a fundamental task of text mining, by which efficient organization, navigation, summarization, and retrieval of documents can be achieved. The clustering of documents presents difficult challenges due to the sparsity and high dimensionality of text data, and to the complex semantics of natural language. Subspace clustering is an extension of traditional clustering that is designed to capture local feature relevance and to group documents with respect to the features (or words) that matter the most. This paper presents a subspace clustering technique based on a Locally Adaptive Clustering (LAC) algorithm. To improve the subspace clustering of documents and the identification of keywords achieved by LAC, kernel methods and semantic distances are deployed. The basic idea is to define a local kernel for each cluster by which semantic distances between pairs of words are computed to derive the clustering and the local term weightings. The proposed approach, called Semantic LAC, is evaluated using benchmark datasets. Our experiments show that Semantic LAC is capable of improving the clustering quality.
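The local term weightings at the heart of LAC can be sketched as follows: each cluster gets per-dimension weights that grow for dimensions along which the cluster is tight. The exponential weighting below is one common LAC-style formulation, assumed for illustration; the kernelized Semantic LAC variant would replace the plain Euclidean per-word dispersions with semantic distances:

```python
import numpy as np

def lac_weights(X, labels, centers, h=1.0):
    """Per-cluster feature weights in the LAC spirit: for each cluster,
    weight each dimension by exp(-average dispersion / h), normalized so the
    weights of a cluster sum to 1. Tight dimensions get large weights."""
    K, d = centers.shape
    W = np.zeros((K, d))
    for k in range(K):
        pts = X[labels == k]
        disp = ((pts - centers[k]) ** 2).mean(axis=0)  # average dispersion per dimension
        W[k] = np.exp(-disp / h)
        W[k] /= W[k].sum()
    return W

def lac_distance(x, center, w):
    """Weighted squared distance used to assign points to clusters."""
    return float(np.sum(w * (x - center) ** 2))

# One cluster, tight along dimension 0 and spread along dimension 1:
X = np.array([[0.0, -4.0], [0.0, 4.0], [0.1, 0.0]])
labels = np.array([0, 0, 0])
centers = np.array([[0.0, 0.0]])
W = lac_weights(X, labels, centers)
print(W)  # dimension 0 gets nearly all the weight
```

In the full algorithm, assignment (via `lac_distance`), center updates, and weight updates alternate until convergence.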
Clustering based on Dirichlet mixtures of attribute ensembles, 2004
Abstract

Cited by 2 (2 self)
We discuss a model-based approach to identifying clusters of objects based on subsets of attributes, so that the attributes that distinguish a cluster from the rest of the population may depend on the cluster being considered. The method is based on a Pólya urn cluster model for multivariate means and variances, resulting in a multivariate Dirichlet process mixture model. This particular model-based approach accommodates outliers and allows for the incorporation of application-specific data features into the clustering scheme. For example, in an analysis of genetic CGH array data we are able to design a clustering method that accounts for spatial dependence of chromosomal abnormalities. Key words: nonparametric Bayes, unsupervised learning, subspace clustering, variable selection.
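The Pólya urn representation underlying Dirichlet process mixtures is the Chinese restaurant process: each item joins an existing cluster with probability proportional to its size, or starts a new cluster with probability proportional to the concentration parameter. A minimal sampler of that prior over partitions (the full model in the paper additionally ties cluster-specific means and variances to these assignments):

```python
import random

def crp_partition(n, alpha, seed=0):
    """Draw a random partition of n items from the Chinese restaurant process
    with concentration alpha. Returns a cluster label per item; cluster ids
    are assigned in order of first appearance (item 0 always gets label 0)."""
    rng = random.Random(seed)
    sizes = []          # sizes[c] = number of items currently in cluster c
    labels = []
    for _ in range(n):
        # existing clusters weighted by size, plus one "new table" weighted by alpha
        weights = sizes + [alpha]
        c = rng.choices(range(len(weights)), weights=weights)[0]
        if c == len(sizes):
            sizes.append(1)     # item opens a new cluster
        else:
            sizes[c] += 1
        labels.append(c)
    return labels

print(crp_partition(10, alpha=1.0))
```

Larger `alpha` yields more, smaller clusters; the number of clusters grows roughly as `alpha * log(n)`, so the model need not fix the number of clusters in advance.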
Pairwise Variable Selection for High-dimensional Model-based Clustering
Abstract

Cited by 1 (0 self)
Variable selection for clustering is an important and challenging problem in high-dimensional data analysis. Existing variable selection methods for model-based clustering select informative variables in a “one-in-all-out” manner; that is, a variable is selected if at least one pair of clusters is separable by this variable and removed if it cannot separate any of the clusters. In many applications, however, it is of interest to further establish exactly which clusters are separable by each informative variable. To address this question, we propose a pairwise variable selection method for high-dimensional model-based clustering. The method is based on a new pairwise penalty. Results on simulated and real data show that the new method performs better than alternative approaches that use ℓ1 and ℓ∞ penalties and offers better interpretation.
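The pairwise idea can be made concrete with a toy check: a variable is informative for a *specific* pair of clusters when their means differ on that variable, and the same variable may separate some pairs but not others. The threshold test below is a placeholder for illustration, not the paper's penalized estimator:

```python
import numpy as np

def separable_pairs(means, tol=1e-6):
    """For each variable j, list the cluster pairs (k, l) whose means differ on j
    by more than tol. Illustrates the pairwise view: 'one-in-all-out' selection
    would only record whether each list is non-empty."""
    K, p = means.shape
    out = {}
    for j in range(p):
        out[j] = [(k, l) for k in range(K) for l in range(k + 1, K)
                  if abs(means[k, j] - means[l, j]) > tol]
    return out

# Three clusters, two variables: variable 0 separates all pairs,
# variable 1 separates only cluster 2 from clusters 0 and 1.
mu = np.array([[0.0, 0.0],
               [2.0, 0.0],
               [4.0, 3.0]])
print(separable_pairs(mu))  # → {0: [(0, 1), (0, 2), (1, 2)], 1: [(0, 2), (1, 2)]}
```

Both variables would be kept by a one-in-all-out method; the pairwise output additionally records that variable 1 cannot tell clusters 0 and 1 apart.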
High-Dimensional Clustering with Sparse Gaussian Mixture Models
Abstract
We consider the problem of clustering high-dimensional data using Gaussian Mixture Models (GMMs) with unknown covariances. In this context, the Expectation-Maximization (EM) algorithm, which is typically used to learn GMMs, fails to cluster the data accurately due to the large number of free parameters in the covariance matrices. We address this weakness by assuming that the mixture model consists of sparse Gaussian distributions and leveraging this assumption in a novel algorithm for learning GMMs. Our approach incorporates the graphical lasso procedure for sparse covariance estimation into the EM algorithm for learning GMMs, and by encouraging sparsity, it avoids the problems faced by traditional GMMs. We guarantee convergence of our algorithm and show through experimentation that this procedure outperforms the traditional Expectation-Maximization algorithm and other clustering algorithms in the high-dimensional clustering setting.
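The described combination can be sketched with scikit-learn's `graphical_lasso`: run a standard EM loop, but in the M-step fit each component's covariance with the graphical lasso instead of using the raw weighted empirical covariance. This is a minimal sketch of the idea under assumed details (farthest-point initialization, a fixed ridge for numerical safety), not the paper's exact algorithm or its convergence-guaranteed variant:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.covariance import graphical_lasso

def sparse_gmm_em(X, K, alpha=0.1, n_iter=30):
    """EM for a GMM whose M-step replaces the usual covariance update with a
    graphical-lasso fit on each cluster's weighted empirical covariance,
    encouraging sparse precision matrices."""
    n, d = X.shape
    # farthest-point initialization of the means (deterministic, for the sketch)
    means = [X[0]]
    for _ in range(K - 1):
        d2 = np.min([((X - m) ** 2).sum(axis=1) for m in means], axis=0)
        means.append(X[int(np.argmax(d2))])
    means = np.array(means, dtype=float)
    covs = np.array([np.eye(d)] * K)
    weights = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability
        logR = np.column_stack([
            np.log(weights[k])
            + multivariate_normal.logpdf(X, means[k], covs[k], allow_singular=True)
            for k in range(K)
        ])
        logR -= logR.max(axis=1, keepdims=True)
        R = np.exp(logR)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted means, then sparse covariance via graphical lasso
        Nk = R.sum(axis=0) + 1e-12
        weights = Nk / n
        for k in range(K):
            means[k] = R[:, k] @ X / Nk[k]
            diff = X - means[k]
            emp_cov = (R[:, k, None] * diff).T @ diff / Nk[k]
            emp_cov += 1e-6 * np.eye(d)          # small ridge for numerical safety
            covs[k], _ = graphical_lasso(emp_cov, alpha=alpha)
    return R.argmax(axis=1), means, covs
```

The ℓ1 penalty `alpha` controls how aggressively off-diagonal precision entries are driven to zero, which is what tames the parameter count in high dimensions.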
SSC: Statistical Subspace Clustering, 2010
Abstract
Abstract. Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. This is a particularly important challenge with high-dimensional data, where the curse of dimensionality occurs. It also has the benefit of providing smaller descriptions of the clusters found. Existing methods only consider numerical databases and do not propose any method for cluster visualization. Besides, they require input parameters that are difficult for the user to set. The aim of this paper is to propose a new subspace clustering algorithm, able to tackle databases that may contain continuous as well as discrete attributes, requiring as few user parameters as possible, and producing an interpretable output. We present a method based on the use of the well-known EM algorithm on a probabilistic model designed under some specific hypotheses, allowing us to present the result as a set of rules, each one defined with as few relevant dimensions as possible. Experiments, conducted on artificial as well as real databases, show that our algorithm gives robust results in terms of classification and interpretability of the output.
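The rule-style output can be illustrated independently of the EM model: once cluster memberships are known, describe each cluster by an interval rule over only its most discriminative dimensions. The deviation-in-std-units scoring below is an assumed heuristic for the sketch, not the paper's model-derived relevance:

```python
import numpy as np

def cluster_rules(X, labels, names, keep=1):
    """Describe each cluster by a rule over its `keep` most discriminative
    dimensions: those whose cluster mean deviates most (in global-std units)
    from the overall mean. Each selected dimension is reported as a min-max
    interval, giving a compact, interpretable description."""
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-12
    rules = {}
    for k in np.unique(labels):
        pts = X[labels == k]
        score = np.abs(pts.mean(axis=0) - mu) / sd      # deviation in std units
        dims = np.argsort(score)[::-1][:keep]
        rules[int(k)] = {names[j]: (float(pts[:, j].min()), float(pts[:, j].max()))
                         for j in dims}
    return rules

# Two clusters that differ only on dimension "x0"; "x1" is noise for both.
X = np.array([[0.0, 5.0], [0.1, 6.0], [10.0, 5.0], [10.1, 6.0]])
labels = np.array([0, 0, 1, 1])
print(cluster_rules(X, labels, ["x0", "x1"], keep=1))
```

Both rules mention only `x0`, matching the goal of defining each rule with as few relevant dimensions as possible.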
Cascade evaluation of clustering algorithms
Abstract
Abstract. This paper is about the evaluation of the results of clustering algorithms, and the comparison of such algorithms. We propose a new method based on the enrichment of a set of independent labeled datasets by the results of clustering, and the use of a supervised method to evaluate the value of adding such new information to the datasets. We thus adapt the cascade generalization [1] paradigm to the case where we combine an unsupervised and a supervised learner. We also consider the case where independent supervised learners are trained on the different groups of data objects created by the clustering [2]. We then conduct experiments using different supervised algorithms to compare various clustering algorithms, and show that our proposed method exhibits coherent behavior, pointing out, for example, that algorithms based on complex probabilistic models outperform algorithms based on simpler models.
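The enrichment step can be sketched with scikit-learn: append the clustering's output as extra features and compare a supervised learner's cross-validated accuracy with and without them. The specific clusterer, classifier, and dataset below are illustrative choices, and the paper's protocol (e.g. clustering inside each fold) is more careful than this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def cascade_score(X, y, clusterer, clf, cv=5):
    """Score a clustering by how much its output helps a supervised learner:
    append one-hot cluster memberships to the features and compare
    cross-validated accuracy against the plain features."""
    base = cross_val_score(clf, X, y, cv=cv).mean()
    labels = clusterer.fit_predict(X)
    onehot = np.eye(labels.max() + 1)[labels]       # one-hot cluster membership
    enriched = cross_val_score(clf, np.hstack([X, onehot]), y, cv=cv).mean()
    return base, enriched

X, y = load_iris(return_X_y=True)
base, enriched = cascade_score(
    X, y,
    KMeans(n_clusters=3, n_init=10, random_state=0),
    DecisionTreeClassifier(random_state=0),
)
print(f"plain features: {base:.3f}  with cluster feature: {enriched:.3f}")
```

A clustering whose enrichment consistently raises accuracy across several labeled datasets and several supervised learners scores well under this evaluation.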