Results 1–6 of 6
Survey of clustering algorithms
IEEE Transactions on Neural Networks, 2005
Cited by 248 (3 self)
Abstract:
Data analysis plays an indispensable role in understanding various phenomena. Cluster analysis, a primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications on some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several closely related topics, proximity measures and cluster validation, are also discussed.
Semisupervised learning of hierarchical latent trait models for data visualisation
IEEE Transactions on Knowledge and Data Engineering, 2005
Cited by 7 (1 self)
Abstract:
Recently, we have developed the hierarchical Generative Topographic Mapping (HGTM), an interactive method for visualisation of large high-dimensional real-valued data sets. In this paper, we propose a more general visualisation system by extending HGTM in three ways, which allow the user to visualise a wider range of data sets and better support the model development process. (i) We integrate HGTM with noise models from the exponential family of distributions. The basic building block is the Latent Trait Model (LTM). This enables us to visualise data of inherently discrete nature, e.g. collections of documents, in a hierarchical manner. (ii) We give the user a choice of initialising the child plots of the current plot in either interactive or automatic mode. In the interactive mode the user selects “regions of interest”, whereas in the automatic mode an unsupervised minimum message length (MML)-inspired construction of a mixture of LTMs is employed. The unsupervised construction is particularly useful when high-level plots are covered with dense clusters of highly overlapping data projections, making it difficult to use the interactive mode. Such a situation often arises when visualising large data sets. (iii) We derive general formulas for magnification factors in latent trait models. Magnification factors are a useful tool for improving our understanding of the visualisation plots, since they can highlight the boundaries between data clusters. We illustrate our approach on a toy example and evaluate it on three more complex real data sets.
A New Nonparametric Pairwise Clustering Algorithm Based on Iterative Estimation of Distance Profiles
Abstract:
We present a novel pairwise clustering method. Given a proximity matrix of pairwise relations (i.e. pairwise similarity or dissimilarity estimates) between data points, our algorithm extracts the two most prominent clusters in the data set. The algorithm, which is completely nonparametric, iteratively employs a two-step transformation on the proximity matrix. The first step of the transformation represents each point by its relation to all other data points, and the second step re-estimates the pairwise distances using a statistically motivated proximity measure on these representations. Using this transformation, the algorithm iteratively partitions the data points until it finally converges to two clusters. Although the algorithm is simple and intuitive, it generates complex dynamics of the proximity matrices. Based on this bipartition procedure, we devise a hierarchical clustering algorithm, which employs the basic bipartition algorithm in a straightforward divisive manner. The hierarchical clustering algorithm copes with the model validation problem using a general cross-validation approach, which may be combined with various hierarchical clustering methods. We further present an experimental study of this algorithm and examine some of the algorithm’s properties.
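The two-step transformation described in this abstract can be sketched roughly as follows. This is a hypothetical illustration of the general idea only (profile representation followed by re-estimated pairwise distances, here simply Euclidean distance between distance profiles), not the paper's exact algorithm or its statistically motivated proximity measure:

```python
import numpy as np

def bipartition(D, n_iter=10):
    """Iteratively re-estimate pairwise distances from 'distance profiles'
    (each point represented by its row of distances to all other points),
    then split the points into two clusters. Illustrative sketch only."""
    D = np.asarray(D, dtype=float)
    for _ in range(n_iter):
        # Step 1: each point's representation is its distance profile (its row of D).
        # Step 2: re-estimate pairwise distances between those profiles
        # (here plain Euclidean distance, as a stand-in proximity measure).
        diff = D[:, None, :] - D[None, :, :]
        D = np.sqrt((diff ** 2).sum(axis=-1))
        m = D.max()
        if m > 0:
            D /= m  # normalise to keep the iteration numerically stable
    # Crude bipartition: points whose re-estimated distance to point 0's
    # profile exceeds the median form the second cluster.
    labels = (D[0] > np.median(D[0])).astype(int)
    return labels
```

On data with two well-separated groups, the iteration sharpens the block structure of the proximity matrix until the two clusters are cleanly separable.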
Signature:................................. Date:......................
2007
Abstract:
I, the undersigned, hereby declare that the work contained in this dissertation is my own original work and that I have not previously in its entirety or in part submitted it at any university for a degree.
Selection and Estimation of a Finite Generalized Dirichlet Mixture Model Based on Minimum Message Length
Abstract:
We consider the problem of determining the structure of high-dimensional data without prior knowledge of the number of clusters. Data are represented by a finite mixture model based on the generalized Dirichlet distribution. The generalized Dirichlet distribution has a more general covariance structure than the Dirichlet distribution and offers high flexibility and ease of use for the approximation of both symmetric and asymmetric distributions. This makes the generalized Dirichlet distribution more practical and useful. An important problem in mixture modeling is the determination of the number of clusters: a mixture with too many or too few components may not appropriately approximate the true model. Here, we consider the application of the minimum message length (MML) principle to determine the number of clusters. The MML criterion is derived so as to choose the number of clusters in the mixture model that best describes the data. A comparison with other selection criteria is performed. The validation involves synthetic data, real data clustering, and two interesting real applications: classification of Web pages and texture database summarization for efficient retrieval.
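MML is related in spirit to other penalised-likelihood selection criteria such as BIC: candidate numbers of components are scored by fit minus a complexity penalty. As a rough, hypothetical stand-in (the paper derives a proper MML criterion for generalized Dirichlet mixtures; here we score crude one-dimensional Gaussian groupings with a BIC-style penalty), the selection loop might look like:

```python
import numpy as np

def score_k(x, k):
    """BIC-style penalised log-likelihood for a crude 1-D hard clustering
    into k contiguous groups via quantile splits. Illustrative stand-in
    for a proper MML mixture criterion."""
    n = len(x)
    groups = np.array_split(np.sort(x), k)
    ll = 0.0
    for g in groups:
        mu, sd = g.mean(), g.std() + 1e-6
        # Gaussian log-likelihood of the group under its own fit
        ll += np.sum(-0.5 * np.log(2 * np.pi * sd**2) - (g - mu)**2 / (2 * sd**2))
        # mixing-weight term for hard assignments
        ll += len(g) * np.log(len(g) / n)
    p = 3 * k - 1  # free parameters: k means, k std devs, k-1 weights
    return ll - 0.5 * p * np.log(n)

def select_k(x, kmax=5):
    """Pick the number of components with the best penalised score."""
    return max(range(1, kmax + 1), key=lambda k: score_k(x, k))
```

On clearly separated data the penalty rules out both under- and over-fitted models, which is the behaviour the MML criterion formalises.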
Hidden Markov Models That Use Predicted Local Structure for Fold Recognition: Alphabets of Backbone Geometry
PROTEINS: Structure, Function, and Genetics 51:504–514, 2003
Abstract:
An important problem in computational biology is predicting the structure of the large number of putative proteins discovered by genome sequencing projects. Fold-recognition methods attempt to solve the problem by relating the target proteins to known structures, searching for template proteins homologous to the target. Remote homologs that may have significant structural similarity are often not detectable by sequence similarities alone. To address this, we incorporated predicted local structure, a generalization of secondary structure, into two-track profile hidden Markov models (HMMs). We did not rely on a simple helix-strand-coil definition of secondary structure, but experimented with a variety of local structure descriptions, following a principled protocol to establish which descriptions are most useful for improving fold recognition and alignment quality. On a test set of 1298 nonhomologous proteins, HMMs incorporating a 3-letter STRIDE alphabet improved fold recognition accuracy by 15% over amino-acid-only HMMs and 23% over PSI-BLAST, measured by ROC65 numbers. We compared two-track HMMs to amino-acid-only HMMs on a difficult alignment test set of 200 protein pairs (structurally similar with 3–24% sequence identity). HMMs with a 6-letter STRIDE secondary track improved alignment quality by 62%, relative to DALI structural alignments, while HMMs with an STR track (an expanded DSSP alphabet that subdivides strands into six states) improved by 40% relative to CE. Proteins 2003;51:504–514. © 2003 Wiley-Liss, Inc.

Key words: protein structure prediction; two-track HMM; multi-track HMM; information theory; neural network; alignment; secondary structure
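A two-track HMM scores paired amino-acid and local-structure sequences. Assuming each state emits independently on the two tracks, so the joint emission probability at a position is the product of the per-track emission probabilities (a simplification: real profile HMMs also include insert and delete states), the forward algorithm might be sketched as:

```python
import numpy as np

def forward_two_track(trans, start, emit_aa, emit_ss, aa_seq, ss_seq):
    """Log-likelihood of paired (amino-acid, local-structure) sequences
    under a toy two-track HMM. Hypothetical sketch, not the paper's model.
    trans: (S, S) transition matrix; start: (S,) initial distribution;
    emit_aa: (S, A) amino-acid emissions; emit_ss: (S, L) structure emissions."""
    # Joint per-state emission at position 0 = product of the two track emissions.
    alpha = start * emit_aa[:, aa_seq[0]] * emit_ss[:, ss_seq[0]]
    for a, s in zip(aa_seq[1:], ss_seq[1:]):
        # Standard forward recursion, with the two-track joint emission.
        alpha = (alpha @ trans) * emit_aa[:, a] * emit_ss[:, s]
    return np.log(alpha.sum())
```

Because the two tracks are assumed conditionally independent given the state, the model reduces to an ordinary HMM over the product alphabet, and the forward probabilities over all length-1 observation pairs sum to one.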