Simultaneous feature selection and clustering using mixture models
- IEEE Trans. Pattern Anal. Mach. Intell.
, 2004
Cited by 51 (0 self)
Clustering is a common unsupervised learning technique used to discover group structure in a set of data. While there exist many algorithms for clustering, the important issue of feature selection, that is, what attributes of the data should be used by the clustering algorithms, is rarely touched upon. Feature selection for clustering is difficult because, unlike in supervised learning, there are no class labels for the data and, thus, no obvious criteria to guide the search. Another important problem in clustering is the determination of the number of clusters, which clearly impacts and is influenced by the feature selection issue. In this paper, we propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm to estimate it, in the context of mixture-based clustering. Due to the introduction of a minimum message length model selection criterion, the saliency of irrelevant features is driven toward zero, which corresponds to performing feature selection. The criterion and algorithm are then extended to simultaneously estimate the feature saliencies and the number of clusters.
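As a rough illustration of the feature saliency idea in the abstract above, the sketch below evaluates a mixture density in which each feature is drawn from a cluster-specific distribution with probability equal to its saliency, and otherwise from a distribution shared by all clusters. The Gaussian form, the diagonal (per-feature) structure, and every name used (`saliency_mixture_density`, `common_mean`, `common_var`, and so on) are assumptions made for illustration; the abstract does not give the model's equations, and the EM updates and the minimum message length criterion are not shown.

```python
import math


def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density (an assumed choice of feature distribution)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def saliency_mixture_density(y, weights, means, variances,
                             common_mean, common_var, saliency):
    """Density of one data point y (a list of feature values).

    weights[j]          mixing proportion of cluster j
    means[j][l]         mean of feature l in cluster j
    variances[j][l]     variance of feature l in cluster j
    common_mean[l]      mean of feature l under the shared (irrelevant) density
    common_var[l]       variance of feature l under the shared density
    saliency[l]         value in [0, 1]; 0 means feature l ignores the clusters
    """
    total = 0.0
    for j, weight in enumerate(weights):
        product = 1.0
        for l, y_l in enumerate(y):
            relevant = gaussian_pdf(y_l, means[j][l], variances[j][l])
            irrelevant = gaussian_pdf(y_l, common_mean[l], common_var[l])
            # Blend the cluster-specific and shared densities by the saliency.
            product *= saliency[l] * relevant + (1.0 - saliency[l]) * irrelevant
        total += weight * product
    return total
```

Under these assumptions, a saliency of 1 for every feature recovers an ordinary diagonal Gaussian mixture, while a saliency of 0 makes a feature contribute the same factor in every cluster, so it no longer influences cluster membership.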
An Information-Theoretic External Cluster-Validity Measure
- Research Report RJ 10219, IBM
, 2001
Cited by 48 (2 self)
In this paper we propose a measure of similarity/association between two partitions of a set of objects. Our motivation is the desire to use the measure to characterize the quality or accuracy of clustering algorithms by somehow comparing the clusters they produce with "ground truth" consisting of classes assigned to the patterns by manual means or some other means in whose veracity there is confidence. Such measures are referred to as "external". Our measure also allows clusterings with different numbers of clusters to be compared in a quantitative and principled way. Our evaluation scheme quantitatively measures how useful the cluster labels of the patterns are as predictors of their class labels. When all clusterings to be compared have the same number of clusters, the measure is equivalent to the mutual information between the cluster labels and the class labels. In cases where the numbers of clusters are different, however, it computes the reduction in the number of bits that w...
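The equal-number-of-clusters case mentioned in the abstract above reduces to the mutual information between cluster labels and class labels. A minimal sketch of that quantity, estimated from label counts, might look as follows; the function name `mutual_information` and the choice of base-2 logarithms (bits) are assumptions, and the adjustment the report makes for clusterings with different numbers of clusters is not attempted here.

```python
import math
from collections import Counter


def mutual_information(cluster_labels, class_labels):
    """Empirical mutual information, in bits, between two labelings of the same items."""
    if len(cluster_labels) != len(class_labels):
        raise ValueError("both labelings must cover the same items")
    n = len(cluster_labels)
    joint = Counter(zip(cluster_labels, class_labels))
    cluster_counts = Counter(cluster_labels)
    class_counts = Counter(class_labels)
    mi = 0.0
    for (cluster, cls), n_joint in joint.items():
        # p(c, k) * log2( p(c, k) / (p(c) * p(k)) ), written with raw counts.
        mi += (n_joint / n) * math.log2(
            n_joint * n / (cluster_counts[cluster] * class_counts[cls])
        )
    return mi
```

For example, `mutual_information([0, 0, 1, 1], ["a", "a", "b", "b"])` gives 1 bit, because the cluster label determines the class exactly, while `mutual_information([0, 0, 1, 1], ["a", "b", "a", "b"])` gives 0 bits, because the two labelings are independent.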
Variable Selection for Model-Based Clustering
- Journal of the American Statistical Association
, 2006
Cited by 27 (4 self)
We consider the problem of variable or feature selection for model-based clustering. We recast the problem of comparing two nested subsets of variables as a model comparison problem, and address it using approximate Bayes factors. We develop a greedy search algorithm for finding a local optimum in model space. The resulting method selects variables (or features), the number of clusters, and the clustering model simultaneously. We applied the method to several simulated and real examples, and found that removing irrelevant variables often improved performance. Compared to methods based on all the variables, our variable selection method consistently yielded more accurate estimates of the number of clusters, and lower classification error rates, as well as more parsimonious clustering models and easier visualization of results.
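The abstract above does not say which approximation to the Bayes factor is used. One widely used choice, assumed here purely for illustration, is the BIC approximation, under which twice the log Bayes factor for two models is approximated by the difference of their BIC values. The function names and the sign convention (larger BIC is better) are also assumptions.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """BIC in the 'larger is better' convention: 2 log L - k log n."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)


def approx_two_log_bayes_factor(bic_model_1, bic_model_2):
    """Approximate 2 * log(Bayes factor) of model 1 against model 2.

    A positive value favours model 1 under this approximation.
    """
    return bic_model_1 - bic_model_2
```

In a greedy search over variable subsets, a comparison of this kind could decide whether adding or removing one variable improves the model; the actual search procedure in the paper is not reproduced here.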
A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections
, 2002
Cited by 22 (3 self)
This paper presents a probabilistic mixture modeling framework for the hierarchic organisation of document collections. It is demonstrated that the probabilistic corpus model which emerges from the automatic or unsupervised hierarchical organisation of a document collection can be further exploited to create a kernel which boosts the performance of state-of-the-art Support Vector Machine document classifiers. It is shown that the performance of such a classifier is further enhanced when employing the kernel derived from an appropriate hierarchic mixture model used for partitioning a document corpus rather than the kernel associated with a flat non-hierarchic mixture model. This has important implications for document classification when a hierarchic ordering of topics exists. This can be considered as the effective combination of documents with no topic or class labels (unlabeled data), labeled documents, and prior domain knowledge (in the form of the known hierarchic structure), in providing enhanced document classification performance.
Model-Based Hierarchical Clustering
- In Proc. 16th Conf. Uncertainty in Artificial Intelligence
, 2000
Cited by 21 (0 self)
We present an approach to model-based hierarchical clustering by formulating an objective function based on a Bayesian analysis. This model organizes the data into a cluster hierarchy while specifying a complex feature-set partitioning that is a key component of our model. Features can have either a unique distribution in every cluster or a common distribution over some (or even all) of the clusters. The cluster subsets over which these features have such a common distribution correspond to the nodes (clusters) of the tree representing the hierarchy. We apply this general model to the problem of document clustering for which we use a multinomial likelihood function and Dirichlet priors. Our algorithm consists of a two-stage process wherein we first perform a flat clustering followed by a modified hierarchical agglomerative merging process that includes determining the features that will have common distributions over the merged clusters. The regularization induced...
Feature Selection in Mixture-Based Clustering
, 2002
Cited by 15 (0 self)
While there exist many approaches to clustering, the important issue of feature selection, that is, what attributes of the data are relevant, is rarely addressed. Feature selection for clustering is made difficult by the absence of class labels to guide the search. In this paper, we propose two approaches to deal with this problem. In the first one, instead of making hard selections, we estimate how salient each feature is. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami's mutual-information-based feature relevance criterion to the unsupervised case. Implementation is carried out by a backward search scheme. The resulting algorithm can be classified as a "wrapper", since it wraps mixture estimation in an outer layer that performs feature selection. Experimental results on synthetic and real data show that both methods have promising performance.
A Probabilistic Hierarchical Clustering Method for Organising Collections of Text Documents
- Proceedings of the 15th International Conference on Pattern Recognition (ICPR’2000)
, 2000
Cited by 8 (4 self)
In this paper a generic probabilistic framework for the unsupervised hierarchical clustering of large-scale sparse high-dimensional data collections is proposed. The framework is based on a hierarchical probabilistic mixture methodology. Two classes of models emerge from the analysis, and these have been termed symmetric and asymmetric models. For text data specifically, both asymmetric and symmetric models based on the multinomial and binomial distributions are most appropriate. An Expectation Maximisation parameter estimation method is provided for all of these models. An experimental comparison of the models is obtained for two extensive online document collections.
Non-redundant clustering
, 2005
Cited by 2 (0 self)
Data mining and knowledge discovery attempt to reveal concepts, patterns, relationships, and structures of interest in data. Typically, data may have many such structures. Most existing data mining techniques allow the user little say in which structure will be returned from the search. Those techniques which do allow the user control over the search typically require supervised information in the form of knowledge about a target solution. In the spirit of exploratory data mining, we consider the setting where the user does not have information about a target solution. Instead we suppose the user can provide information about solutions which are not desired. These undesired solutions may be previously obtained from data mining algorithms, or they may be known to the user a priori. The goal is then to discover novel structure in the dataset which is not redundant with respect to the known structure. Techniques should guide the search away from this known structure and towards novel, interesting structures. We describe and formally define the task of non-redundant clustering. Three different algorithmic approaches are derived for non-redundant clustering. Their performance is experimentally evaluated on data sets containing multiple clusterings. We explore how these techniques may be extended to systematically enumerate clusterings in a data set. Finally, we also investigate whether non-redundant approaches may be incorporated to enhance state-of-the-art supervised techniques.
Feature Saliency in Unsupervised Learning
, 2002
Cited by 2 (1 self)
Clustering is a common unsupervised learning technique to discover the structure of a set of multidimensional data. While there exist many algorithms for clustering, the important issue of feature selection, that is, what attributes of the data should be used by the clustering algorithms, is rarely touched upon.
The Organisation and Retrieval of Document Collections: A Machine Learning Approach
, 2003
By Alexei Vinokourov. Doctor of Philosophy thesis, School of Information and Communication Technologies, University of Paisley, Paisley, Scotland, 2003. The enormous growth of (online) text information available in digital form has raised the problem of automatic structuring and processing of large document collections. Consequently, the need for automatic organization of large text collections has become an important issue in modern text information access systems. This problem is identified as the Information Organisation problem. In this thesis we present a method termed Multinomial ASymmetric Hierarchical Analysis (MASHA) that allows one to automate the structuring of a large document collection into a hierarchy of topics. We also explore the use of the obtained structure to improve performance in document retrieval and classification applications. In addition to other similar works, we also present a method for the deduction of hierarchies from text corpora or, in other words, for finding a most appropriate (for a given document collection) topic hierarchy that would reflect the structure of the textual data in terms of interrelationships between hypothetical topics that presumably underlie the collection or are most appropriate to categorise documents in the collection. Unfortunately, the cost of learning probabilistic models is, at best, proportional to the size of the collection, size of vocabulary and size of derived hierarchy. It appears, however, that for some tasks, particularly for crosslingual information retrieval, the computational cost can be reduced by employing other methods. One such method, the kernel Canonical Correlation Analysis (KCCA), learns a semantic represen...

