Results 1–10 of 10
Survey of clustering algorithms
 IEEE TRANSACTIONS ON NEURAL NETWORKS
, 2005
Cited by 483 (4 self)
Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure, and cluster validation, are also discussed.
Simultaneous Gene Clustering and Subset Selection for Classification via MDL
 Bioinformatics
, 2003
Cited by 29 (0 self)
Motivation: The microarray technology allows for the simultaneous monitoring of thousands of genes for each sample. The high-dimensional gene expression data can be used to study similarities of gene expression profiles across different samples to form a gene clustering. The clusters may be indicative of genetic pathways. Parallel to gene clustering is the important application of sample classification based on all or selected gene expressions. The gene clustering and sample classification are often undertaken separately, or in a directional manner (one as an aid for the other). However, such separation of these two tasks may occlude informative structure in the data. Here we present an algorithm for the simultaneous clustering of genes and subset selection of gene clusters for …
Semisupervised learning of hierarchical latent trait models for data visualisation
 IEEE Transactions on Knowledge and Data Engineering
, 2005
Cited by 11 (1 self)
Recently, we have developed the hierarchical Generative Topographic Mapping (HGTM), an interactive method for visualisation of large high-dimensional real-valued data sets. In this paper, we propose a more general visualisation system by extending HGTM in three ways, which allow the user to visualise a wider range of datasets and better support the model development process. (i) We integrate HGTM with noise models from the exponential family of distributions. The basic building block is the Latent Trait Model (LTM). This enables us to visualise data of inherently discrete nature, e.g. collections of documents, in a hierarchical manner. (ii) We give the user a choice of initialising the child plots of the current plot in either interactive or automatic mode. In the interactive mode the user selects “regions of interest”, whereas in the automatic mode an unsupervised minimum message length (MML)-inspired construction of a mixture of LTMs is employed. The unsupervised construction is particularly useful when high-level plots are covered with dense clusters of highly overlapping data projections, making it difficult to use the interactive mode. Such a situation often arises when visualising large data sets. (iii) We derive general formulas for magnification factors in latent trait models. Magnification factors are a useful tool to improve our understanding of the visualisation plots, since they can highlight the boundaries between data clusters. We illustrate our approach on a toy example and evaluate it on three more complex real data sets.
Hidden Markov Models That Use Predicted Local Structure for Fold Recognition: Alphabets of Backbone Geometry
 PROTEINS: Structure, Function, and Genetics 51:504–514
, 2003
An important problem in computational biology is predicting the structure of the large number of putative proteins discovered by genome sequencing projects. Fold-recognition methods attempt to solve the problem by relating the target proteins to known structures, searching for template proteins homologous to the target. Remote homologs that may have significant structural similarity are often not detectable by sequence similarities alone. To address this, we incorporated predicted local structure, a generalization of secondary structure, into two-track profile hidden Markov models (HMMs). We did not rely on a simple helix-strand-coil definition of secondary structure, but experimented with a variety of local structure descriptions, following a principled protocol to establish which descriptions are most useful for improving fold recognition and alignment quality. On a test set of 1298 non-homologous proteins, HMMs incorporating a 3-letter STRIDE alphabet improved fold recognition accuracy by 15% over amino-acid-only HMMs and 23% over PSI-BLAST, measured by ROC65 numbers. We compared two-track HMMs to amino-acid-only HMMs on a difficult alignment test set of 200 protein pairs (structurally similar with 3–24% sequence identity). HMMs with a 6-letter STRIDE secondary track improved alignment quality by 62%, relative to DALI structural alignments, while HMMs with an STR track (an expanded DSSP alphabet that subdivides strands into six states) improved by 40% relative to CE. Proteins 2003;51:504–514. © 2003 Wiley-Liss, Inc. Key words: protein structure prediction; two-track HMM; multi-track HMM; information theory; neural network; alignment; secondary structure
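The ROC65 numbers cited in this abstract are an instance of the ROC-n statistic common in fold-recognition benchmarks: true positives are accumulated over the top-ranked hits until n false positives have been seen. A minimal sketch of that computation follows; the paper's exact scoring protocol may differ in details such as tie handling.

```python
def roc_n(scores, labels, n_fp=65):
    """ROC-n statistic: rank all hits by score, sum the true-positive
    count at each of the first n_fp false positives, and normalise by
    n_fp times the total number of positives.  labels: 1 = true hit,
    0 = false hit.  If fewer than n_fp negatives exist, the sum is
    taken over however many appear (a simplification)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = area = 0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
            area += tp  # true positives seen before this false positive
            if fp == n_fp:
                break
    total_pos = sum(labels)
    return area / (n_fp * total_pos) if total_pos else 0.0
```

A perfect ranking (all true hits above all false hits) gives 1.0; a ranking with all false hits first gives 0.0.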
A New Nonparametric Pairwise Clustering Algorithm Based on Iterative Estimation of Distance Profiles
We present a novel pairwise clustering method. Given a proximity matrix of pairwise relations (i.e. pairwise similarity or dissimilarity estimates) between data points, our algorithm extracts the two most prominent clusters in the data set. The algorithm, which is completely nonparametric, iteratively employs a two-step transformation on the proximity matrix. The first step of the transformation represents each point by its relation to all other data points, and the second step re-estimates the pairwise distances using a statistically motivated proximity measure on these representations. Using this transformation, the algorithm iteratively partitions the data points, until it finally converges to two clusters. Although the algorithm is simple and intuitive, it generates a complex dynamics of the proximity matrices. Based on this bipartition procedure we devise a hierarchical clustering algorithm, which employs the basic bipartition algorithm in a straightforward divisive manner. The hierarchical clustering algorithm copes with the model validation problem using a general cross-validation approach, which may be combined with various hierarchical clustering methods. We further present an experimental study of this algorithm. We examine some of the algorithm’s properties …
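The two-step transformation this abstract describes can be sketched in a few lines: each point is represented by its row of the proximity matrix (step 1), and pairwise dissimilarities are re-estimated from those row representations (step 2). The sketch below substitutes a simple correlation-based dissimilarity for the paper's statistically motivated proximity measure, so it illustrates the iteration scheme only, not the published algorithm.

```python
import math

def row_correlation(a, b):
    """Pearson correlation between two equal-length vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb) if va and vb else 0.0

def bipartition(dist, iterations=10):
    """Iteratively transform a distance matrix: represent each point by
    its row of distances to all points, then re-estimate dissimilarity as
    1 - correlation of those rows.  Within-cluster rows correlate
    positively (entries -> 0); across-cluster rows anti-correlate
    (entries -> 2), so thresholding the first row splits the data."""
    d = [row[:] for row in dist]
    for _ in range(iterations):
        d = [[1.0 - row_correlation(d[i], d[j]) for j in range(len(d))]
             for i in range(len(d))]
    return [0 if d[0][j] < 1.0 else 1 for j in range(len(d))]
```

On two well-separated groups of 1-D points, the transformed matrix quickly polarises into a block structure and the bipartition recovers the groups.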
Selection and Estimation of a Finite Generalized Dirichlet Mixture Model Based on Minimum Message Length
We consider the problem of determining the structure of high-dimensional data without prior knowledge of the number of clusters. Data are represented by a finite mixture model based on the generalized Dirichlet distribution. The generalized Dirichlet distribution has a more general covariance structure than the Dirichlet distribution and offers high flexibility and ease of use for the approximation of both symmetric and asymmetric distributions. This makes the generalized Dirichlet distribution more practical and useful. An important problem in mixture modeling is the determination of the number of clusters. Indeed, a mixture with too many or too few components may not be appropriate to approximate the true model. Here, we consider the application of the minimum message length (MML) principle to determine the number of clusters. The MML is derived so as to choose the number of clusters in the mixture model that best describes the data. A comparison with other selection criteria is performed. The validation involves synthetic data, real data clustering, and two interesting real applications: classification of Web pages, and texture database summarization for efficient retrieval.
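The MML principle used here trades off the cost of stating the model against the cost of stating the data given the model, and picks the number of clusters minimising the total. The toy sketch below conveys that trade-off with a crude MDL-flavoured stand-in (k-means centres plus a residual-coding term); it is not the paper's generalized-Dirichlet derivation, and the kmeans_1d helper and cost formulas are illustrative assumptions.

```python
import math

def kmeans_1d(xs, k, iters=25):
    """Plain 1-D k-means with deterministic quantile initialisation."""
    data = sorted(xs)
    n = len(data)
    centers = [data[n * (2 * i + 1) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in data:
            groups[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    sse = sum(min((x - c) ** 2 for c in centers) for x in data)
    return centers, sse

def message_length(xs, k):
    """Crude two-part 'message length': cost of encoding k centres plus
    cost of encoding residuals given the centres.  More clusters lower
    the residual term but raise the model term."""
    n = len(xs)
    _, sse = kmeans_1d(xs, k)
    model_cost = k * math.log2(n)                     # state the model
    data_cost = (n / 2) * math.log2(sse / n + 1e-9)   # state the residuals
    return model_cost + data_cost
```

For data with two tight clusters, the total length is smaller at k = 2 than at k = 1 (underfit) or k = 4 (overfit), which is the qualitative behaviour an MML criterion is after.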
MML Inference of Finite State Automata for Probabilistic Spam Detection
MML (Minimum Message Length) has emerged as a powerful tool in inductive inference of discrete, continuous and hybrid structures. The Probabilistic Finite State Automaton (PFSA) is one such discrete structure that needs to be inferred for classes of problems in the field of computer science, including artificial intelligence, pattern recognition and data mining. MML has also served as a viable tool in many classes of problems in the field of machine learning, including both supervised and unsupervised learning; the classification problem is the most common among them. This research is a twofold solution: the first part focuses on inferring the best PFSA using MML, and the second part focuses on the classification problem of spam detection. Using the best PFSA inferred in part 1, the spam-detection approach was tested using MML on the publicly available Enron Spam dataset. The filter was evaluated on performance parameters such as precision and recall. The evaluation also took into consideration the cost of misclassification, in terms of weighted accuracy rate and weighted error rate. The results of our empirical evaluation indicate a classification accuracy of around 93%, which outperforms well-known established spam filters.
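The weighted accuracy mentioned in this abstract reflects that losing a legitimate message is worse than letting spam through. One common formulation (the paper's exact weighting may differ) counts each misclassified legitimate message lambda times:

```python
def spam_metrics(y_true, y_pred, lam=9):
    """Precision and recall for the spam class, plus a cost-weighted
    accuracy in which a legitimate message misclassified as spam counts
    lam times as much as the reverse.  Labels: 1 = spam, 0 = legitimate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    n_spam, n_legit = tp + fn, tn + fp
    weighted_acc = (lam * tn + tp) / (lam * n_legit + n_spam)
    return precision, recall, weighted_acc
```

With lam = 1 this reduces to plain accuracy; raising lam makes false positives (blocked legitimate mail) dominate the score.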
Image Segmentation With Modified KMeans Clustering Method
Image segmentation is used to recognise objects or regions that are more meaningful and easier to analyse. In this paper we focus on K-means clustering for segmentation of the image. K-means clustering is the most widely used clustering algorithm for positioning radial basis function (RBF) centres; its simplicity and ability to perform online clustering motivate this choice. However, the k-means clustering algorithm can be sensitive to the initial centres, and the search for the optimum centre locations may result in poor local minima. Many attempts have been made to minimise these problems. In this paper, two updating rules are suggested as alternatives or improvements to the standard adaptive k-means clustering algorithm. The updating methods are proposed to give better overall RBF network performance rather than good clustering performance alone. However, there is a strong correlation between good clustering and the performance of the RBF network. The sensitivity of the RBF network to the centre locations will also be studied. We therefore test the modified K-means on different sets of images.
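The baseline this paper modifies can be sketched directly: run standard 1-D k-means on pixel intensities and label each pixel with its nearest centre. The sketch below uses a deterministic quantile initialisation rather than the paper's updating rules, so it illustrates only the unmodified baseline.

```python
def segment_gray(image, k=2, iters=20):
    """Segment a grayscale image (list of rows of intensities) by
    clustering pixel values with plain k-means and returning a label
    map of the same shape.  Baseline sketch, not the paper's modified
    updating rules."""
    distinct = sorted({v for row in image for v in row})
    # deterministic init: spread centres over the distinct intensities
    centers = [distinct[len(distinct) * (2 * i + 1) // (2 * k)]
               for i in range(k)]
    flat = [v for row in image for v in row]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in flat:
            groups[min(range(k), key=lambda j: abs(v - centers[j]))].append(v)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return [[min(range(k), key=lambda j: abs(v - centers[j])) for v in row]
            for row in image]
```

On a tiny two-intensity image the label map separates dark from bright pixels; the sensitivity to initial centres that the paper addresses shows up once the initialisation is random instead of quantile-based.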
Are There Subgroups within the Autistic Spectrum? A Cluster Analysis of a Group of Children with Autistic Spectrum Disorders
Comprehensive data on the developmental history and current behaviours of a large sample of high-functioning individuals with diagnoses of autism, Asperger’s syndrome, or other related disorder were collected via parent interviews. This provided the basis for a taxonomic analysis to search for subgroups. Most participants also completed theory of mind tasks. Three clusters or subgroups were obtained; these differed on theory of mind performance and on verbal abilities. Although subgroups were identified which bore some relationship to clinical differentiation of autistic, Asperger syndrome, and Pervasive Developmental Disorder Not Otherwise Specified (PDD-NOS) cases, the nature of the differences between them appeared strongly related to ability variables. Examination of the kinds of behaviours that differentiated the groups suggested that a spectrum of autistic disorders on which children differ primarily in terms of degrees of social and cognitive impairments could explain …