Results 1 – 10 of 32
Simultaneous feature selection and clustering using mixture models
 IEEE Trans. Pattern Anal. Mach. Intell.
, 2004
Abstract

Cited by 75 (1 self)
Clustering is a common unsupervised learning technique used to discover group structure in a set of data. While there exist many algorithms for clustering, the important issue of feature selection, that is, what attributes of the data should be used by the clustering algorithms, is rarely touched upon. Feature selection for clustering is difficult because, unlike in supervised learning, there are no class labels for the data and, thus, no obvious criteria to guide the search. Another important problem in clustering is the determination of the number of clusters, which clearly impacts and is influenced by the feature selection issue. In this paper, we propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm to estimate it, in the context of mixture-based clustering. Due to the introduction of a minimum message length model selection criterion, the saliency of irrelevant features is driven toward zero, which corresponds to performing feature selection. The criterion and algorithm are then extended to simultaneously estimate the feature saliencies and the number of clusters.
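The EM scheme this abstract describes can be sketched in simplified form. The following is a minimal illustration, not the authors' implementation: it assumes diagonal Gaussian cluster components, a single fitted "common" density per feature for the irrelevant part, and omits the minimum message length penalty, so saliencies are updated by plain maximum likelihood; `saliency_em` and all variable names are our own.

```python
import numpy as np

def gauss(x, mu, var):
    """Univariate normal density, broadcast over arrays."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def saliency_em(X, K, iters=60):
    """EM for a Gaussian mixture with per-feature saliencies rho (no MML penalty)."""
    n, d = X.shape
    alpha = np.full(K, 1.0 / K)                       # mixing proportions
    rho = np.full(d, 0.5)                             # feature saliencies
    mu = np.quantile(X, (np.arange(K) + 1.0) / (K + 1.0), axis=0)  # (K, d) init
    var = np.ones((K, d))
    mu0, var0 = X.mean(axis=0), X.var(axis=0)         # common ("irrelevant") density
    for _ in range(iters):
        # E-step: responsibilities over clusters and per-feature relevance.
        p_rel = gauss(X[:, None, :], mu[None], var[None])   # (n, K, d)
        p_irr = gauss(X, mu0, var0)[:, None, :]             # (n, 1, d)
        mix = rho * p_rel + (1.0 - rho) * p_irr             # (n, K, d)
        w = alpha * mix.prod(axis=2)                        # (n, K)
        w /= w.sum(axis=1, keepdims=True)
        u = w[:, :, None] * rho * p_rel / mix               # "feature relevant" mass
        v = w[:, :, None] - u                               # "feature irrelevant" mass
        # M-step.
        alpha = w.mean(axis=0)
        su = u.sum(axis=0)                                  # (K, d)
        mu = (u * X[:, None, :]).sum(axis=0) / su
        var = (u * (X[:, None, :] - mu) ** 2).sum(axis=0) / su + 1e-6
        sv = v.sum(axis=(0, 1))                             # (d,)
        mu0 = (v * X[:, None, :]).sum(axis=(0, 1)) / sv
        var0 = (v * (X[:, None, :] - mu0) ** 2).sum(axis=(0, 1)) / sv + 1e-6
        rho = su.sum(axis=0) / n                            # saliency update
    return rho, mu, w
```

On synthetic data with one cluster-bearing feature and one pure-noise feature, the saliency of the informative feature is driven toward 1 while the noise feature tends to stay near its 0.5 initialisation; the paper's MML term is what then drives irrelevant saliencies toward 0.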
Feature Selection in Mixture-Based Clustering
, 2002
Abstract

Cited by 17 (0 self)
While there exist many approaches to clustering, the important issue of feature selection, that is, what attributes of the data are relevant, is rarely addressed. Feature selection for clustering is made difficult by the absence of class labels to guide the search. In this paper, we propose two approaches to deal with this problem. In the first one, instead of making hard selections, we estimate how salient each feature is. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami's mutual-information-based feature relevance criterion to the unsupervised case. Implementation is carried out by a backward search scheme. The resulting algorithm can be classified as a "wrapper", since it wraps mixture estimation in an outer layer that performs feature selection.  Experimental results on synthetic and real data show that both methods have promising performance.
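The backward search "wrapper" scheme mentioned here can be sketched generically. This is a hypothetical skeleton, not the paper's algorithm: `score` stands in for the wrapped criterion (in the paper, mixture estimation with a Koller-and-Sahami-style relevance measure), and a feature is greedily dropped whenever removing it does not hurt the score.

```python
# Generic backward-search wrapper for unsupervised feature selection.
# `score(subset)` is a caller-supplied stand-in for the wrapped criterion.
def backward_search(features, score, tol=1e-6):
    selected = set(features)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        base = score(selected)
        # Try dropping each feature; find the drop that hurts the score least.
        best_drop, best_score = None, None
        for f in selected:
            s = score(selected - {f})
            if best_score is None or s > best_score:
                best_drop, best_score = f, s
        # Commit the drop only if the score does not (meaningfully) decrease.
        if best_score is not None and best_score >= base - tol:
            selected -= {best_drop}
            improved = True
    return selected
```

With a toy score that simply sums per-feature relevance weights, the search strips exactly the zero-weight features and keeps the rest.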
Repairing Faulty Mixture Models using Density Estimation
 In Proceedings of the 18th International Conf. on Machine Learning
, 2001
Abstract

Cited by 15 (1 self)
Previous work in mixture model clustering has focused primarily on the issue of model selection. Model scoring functions (including penalized likelihood and Bayesian approximations) can guide a search of the model parameter and structure space. Relatively little research has addressed the issue of how to move through this space. Local optimization techniques, such as expectation maximization, solve only part of the problem; we still need to move between different local optima.
A Case Study in Knowledge Discovery and Elicitation in an Intelligent Tutoring Application
, 2001
Abstract

Cited by 10 (3 self)
Most successful Bayesian network (BN) applications to date have been built through knowledge elicitation from experts. This is difficult and time consuming, which has led to recent interest in automated methods for learning BNs from data. We present a case study in the construction of a BN in an intelligent tutoring application, specifically decimal misconceptions. We describe the BN construction using expert elicitation and then investigate how certain existing automated knowledge discovery methods might support the BN knowledge engineering process.
Minimum Message Length Grouping of Ordered Data
 Proceedings of the Eleventh International Conference on Algorithmic Learning Theory (ALT 2000), LNAI
, 2000
Abstract

Cited by 7 (4 self)
Explicit segmentation is the partitioning of data into homogeneous regions by specifying cutpoints. W. D. Fisher (1958) gave an early example of explicit segmentation based on the minimisation of squared error. Fisher called this the grouping problem and came up with a polynomial-time Dynamic Programming Algorithm (DPA). Oliver, Baxter and colleagues (1996, 1997, 1998) have applied the information-theoretic Minimum Message Length (MML) principle to explicit segmentation. Given a series of multivariate data, the task is to approximate it by a piecewise-constant function. How many cutpoints are there? What are the means and variances of each segment? Where should the cutpoints be placed? The simplest model is a single segment. The most complex model has one segment per data point. The best model is generally somewhere between these extremes. Only by considering model complexity can a reasonable inference be made.
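Fisher's grouping problem admits a direct dynamic-programming sketch. The following is an illustrative implementation of the squared-error DPA (function and variable names are ours; the MML scoring of cutpoints discussed in the paper is not included): it partitions an ordered series into k contiguous segments minimising total within-segment squared error, in O(n²k) time using prefix sums.

```python
def fisher_grouping(x, k):
    """Split the ordered sequence x into k segments minimising squared error."""
    n = len(x)
    # Prefix sums of x and x^2 give each segment's squared error in O(1).
    pre, pre2 = [0.0], [0.0]
    for v in x:
        pre.append(pre[-1] + v)
        pre2.append(pre2[-1] + v * v)
    def sse(i, j):
        s = pre[j + 1] - pre[i]
        m = j - i + 1
        return pre2[j + 1] - pre2[i] - s * s / m
    INF = float("inf")
    # cost[g][j]: best error for x[0..j] split into g segments.
    cost = [[INF] * n for _ in range(k + 1)]
    back = [[0] * n for _ in range(k + 1)]
    for j in range(n):
        cost[1][j] = sse(0, j)
    for g in range(2, k + 1):
        for j in range(g - 1, n):
            for i in range(g - 1, j + 1):       # last segment starts at i
                c = cost[g - 1][i - 1] + sse(i, j)
                if c < cost[g][j]:
                    cost[g][j], back[g][j] = c, i
    # Recover the cutpoints (segment start indices) by walking back.
    cuts, j = [], n - 1
    for g in range(k, 1, -1):
        i = back[g][j]
        cuts.append(i)
        j = i - 1
    return sorted(cuts), cost[k][n - 1]
```

On a series with three constant runs, the DPA recovers the run boundaries exactly with zero residual error.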
Semi-supervised learning of hierarchical latent trait models for data visualisation
 IEEE Transactions on Knowledge and Data Engineering
, 2005
Abstract

Cited by 7 (1 self)
Recently, we have developed the hierarchical Generative Topographic Mapping (HGTM), an interactive method for visualisation of large high-dimensional real-valued data sets. In this paper, we propose a more general visualisation system by extending HGTM in three ways, which allow the user to visualise a wider range of datasets and better support the model development process. (i) We integrate HGTM with noise models from the exponential family of distributions. The basic building block is the Latent Trait Model (LTM). This enables us to visualise data of an inherently discrete nature, e.g. collections of documents, in a hierarchical manner. (ii) We give the user a choice of initialising the child plots of the current plot in either interactive or automatic mode. In the interactive mode the user selects “regions of interest”, whereas in the automatic mode an unsupervised minimum message length (MML)-inspired construction of a mixture of LTMs is employed. The unsupervised construction is particularly useful when high-level plots are covered with dense clusters of highly overlapping data projections, making it difficult to use the interactive mode. Such a situation often arises when visualising large data sets. (iii) We derive general formulas for magnification factors in latent trait models. Magnification factors are a useful tool to improve our understanding of the visualisation plots, since they can highlight the boundaries between data clusters. We illustrate our approach on a toy example and evaluate it on three more complex real data sets.
Unsupervised Learning of Gamma Mixture Models Using Minimum Message Length
 Proc. Third IASTED Conf. Artificial Intelligence and Applications
, 2003
Abstract

Cited by 6 (3 self)
Mixture modelling or unsupervised classification is the problem of identifying and modelling components in a body of data. Earlier work in mixture modelling using Minimum Message Length (MML) includes the multinomial and Gaussian distributions (Wallace and Boulton, 1968), the von Mises circular and Poisson distributions (Wallace and Dowe, 1994, 2000) and the distribution (Agusta and Dowe, 2002a, 2002b). In this paper, we extend this research by considering MML mixture modelling using the Gamma distribution. Point estimation of the distribution was performed using the MML approximation proposed by Wallace and Freeman (1987) and gives impressive results compared to Maximum Likelihood (ML). We then considered mixture modelling on artificially generated datasets and compared the results with two other criteria, AIC and BIC. In terms of the resulting number of components, the results were again impressive. Application to the Heming Pike dataset was then examined and the results were compared in terms of the probability bit-costings, showing that the proposed MML method performs better than AIC and BIC. A further application also shows that our method works well with datasets containing left-skewed components, such as the Palm Valley (Australia) image dataset.
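The paper's MML estimator for the Gamma is not reproduced here. As a sketch of the Maximum Likelihood baseline it compares against, the shape parameter can be estimated with the standard closed-form approximation to the ML equations (which avoids iterating on the digamma function); `fit_gamma_ml` and its variable names are ours.

```python
import math

def fit_gamma_ml(x):
    """Approximate ML fit of a Gamma(shape k, scale theta) to positive data x."""
    n = len(x)
    mean = sum(x) / n
    mean_log = sum(math.log(v) for v in x) / n
    s = math.log(mean) - mean_log            # >= 0 by Jensen's inequality
    # Closed-form approximation to the ML shape estimate.
    k = (3.0 - s + math.sqrt((s - 3.0) ** 2 + 24.0 * s)) / (12.0 * s)
    theta = mean / k                         # ML scale given the shape
    return k, theta
```

On a large synthetic sample the recovered shape and scale land close to the generating values.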
Minimum Message Length Clustering of Spatially-Correlated Data with Varying Inter-Class Penalties
 6th IEEE International Conference on Computer and Information Science (ICIS 2007)
, 2007
Abstract

Cited by 5 (3 self)
We present here some applications of the Minimum Message Length (MML) principle to spatially correlated data. Discrete-valued Markov Random Fields are used to model spatial correlation. The models for spatial correlation used here are a generalisation of the model used in (Wallace 1998) [14] for unsupervised classification of spatially correlated data (such as image segmentation). We discuss how our work can be applied to that type of unsupervised classification. We make three new contributions. First, the rectangular grid used in (Wallace 1998) [14] is generalised to an arbitrary graph of arbitrary edge distances. Second, we refine (Wallace 1998) [14] slightly by including a discarded message length term that is important for small data sets and for a simpler problem presented here. Finally, we show how the MML principle can be used to test for the presence of spatial correlation and how it can be used to choose between models of varying complexity to infer details of the nature of the spatial correlation.
Compression and intelligence: social environments and communication
Abstract

Cited by 4 (4 self)
Compression has been advocated as one of the principles which pervades inductive inference and prediction and, from there, it has also been recurrent in definitions and tests of intelligence. However, this connection is less explicit in new approaches to intelligence. In this paper, we advocate that the notion of compression can appear again in definitions and tests of intelligence through the concepts of ‘mind-reading’ and ‘communication’ in the context of multi-agent systems and social environments. Our main position is that two-part Minimum Message Length (MML) compression is not only more natural and effective for agents with limited resources, but it is also much more appropriate for agents in (cooperative) social environments than one-part compression schemes, particularly those using a posterior-weighted mixture of all available models following Solomonoff’s theory of prediction. We think that the realisation of these differences is important to avoid a naive view of ‘intelligence as compression’ in favour of a better understanding of how, why and where (one-part or two-part, lossless or lossy) compression is needed.
Summarising contextual activity and detecting unusual inactivity in a supportive home environment
 Pattern Analysis and Applications
, 2004
Abstract

Cited by 4 (0 self)
Interpretation of human activity and the detection of associated events are eased if appropriate models of context are available. A method is presented for automatically learning a context-specific spatial model in terms of semantic regions, specifically inactivity zones and entry zones. Maximum a posteriori estimation of Gaussian mixtures is used in conjunction with minimum description length for selection of the number of mixture components. Learning is performed using EM algorithms to maximise penalised likelihood functions that incorporate prior knowledge of the size and shape of the semantic regions. This encourages a one-to-one correspondence between the Gaussian mixture components and the regions. The resulting contextual model enables human-readable summaries of activity to be produced and unusual inactivity to be detected. Results are presented using overhead camera sequences tracked using a particle filter. The method is developed and described within the context of supportive home environments, which have as their aim the extension of independent, quality living for older people.
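The component-selection idea in this abstract (penalised likelihood with a description-length criterion) can be illustrated in one dimension. This sketch is not the paper's method: it uses plain maximum-likelihood EM rather than MAP estimation with priors on zone size and shape, and a generic MDL-style penalty of (p/2) log n with p = 3K − 1 free parameters; all function names are illustrative.

```python
import math

import numpy as np

def em_gmm_1d(x, K, iters=100):
    """Fit a K-component 1-D Gaussian mixture by EM; return the log-likelihood."""
    n = len(x)
    w = np.full(K, 1.0 / K)
    mu = np.quantile(x, (np.arange(K) + 1.0) / (K + 1.0))   # deterministic init
    var = np.full(K, np.var(x) / K + 1e-3)
    for _ in range(iters):
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)          # responsibilities
        nk = r.sum(axis=0) + 1e-12
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-4  # variance floor
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

def pick_K_mdl(x, K_max=4):
    """Choose the number of components minimising -log L + (p/2) log n."""
    n = len(x)
    best_K, best_score = None, None
    for K in range(1, K_max + 1):
        p = 3 * K - 1                                       # means, variances, weights
        score = -em_gmm_1d(x, K) + 0.5 * p * math.log(n)
        if best_score is None or score < best_score:
            best_K, best_score = K, score
    return best_K
```

On clearly bimodal data the penalty rejects both the single-component fit (poor likelihood) and over-complex fits (small likelihood gain, larger code length), selecting two components.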