Results 1-10 of 16
Entropy-Based Criterion in Categorical Clustering
 Proc. of Intl. Conf. on Machine Learning (ICML)
, 2004
Cited by 22 (3 self)
Entropy-type measures for the heterogeneity of clusters have been used for a long time. This paper studies the entropy-based criterion in clustering categorical data. It first shows that the entropy-based criterion can be derived in the formal framework of probabilistic clustering models and establishes the connection between the criterion and the approach based on dissimilarity coefficients.
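As an illustration of what an entropy-type heterogeneity measure can look like, the sketch below scores a partition of categorical records by the size-weighted sum of per-attribute Shannon entropies within each cluster. This is one common form of such a criterion and is an assumption here, not necessarily the exact definition used in the paper; the function names `attribute_entropy` and `expected_entropy` are hypothetical.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of one attribute."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def expected_entropy(rows, labels):
    """Size-weighted sum of per-attribute entropies over all clusters.

    Assumed form of the criterion: lower values mean more homogeneous clusters.
    """
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    n = len(rows)
    total = 0.0
    for members in clusters.values():
        # zip(*members) iterates over attribute columns within the cluster.
        within = sum(attribute_entropy(col) for col in zip(*members))
        total += (len(members) / n) * within
    return total


rows = [("red", "small"), ("red", "small"), ("blue", "large"), ("blue", "large")]
pure = expected_entropy(rows, [0, 0, 1, 1])   # identical records grouped together
mixed = expected_entropy(rows, [0, 1, 0, 1])  # each cluster mixes both record types
```

Under this assumed form, a partition that groups identical records together scores lower than one that mixes them, which is the behaviour a heterogeneity measure is meant to capture.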
Genetic Algorithms for Large Scale Clustering Problems
 Comput. J
, 1997
Cited by 11 (7 self)
We consider the clustering problem in a case where the distances of elements are metric and both the number of attributes and the number of the clusters are large.
Comparing Bayesian Model Class Selection Criteria by Discrete Finite Mixtures
 Information, Statistics and Induction in Science, pages 364-374, Proceedings of the ISIS'96 Conference
, 1996
Cited by 10 (5 self)
We investigate the problem of computing the posterior probability of a model class, given a data sample and a prior distribution for possible parameter settings. By a model class we mean a group of models which all share the same parametric form. In general this posterior may be very hard to compute for high-dimensional parameter spaces, which is usually the case with real-world applications. In the literature several methods for computing the posterior approximately have been proposed, but the quality of the approximations may depend heavily on the size of the available data sample. In this work we are interested in testing how well the approximative methods perform in real-world problem domains. In order to conduct such a study, we have chosen the model family of finite mixture distributions. With certain assumptions, we are able to derive the model class posterior analytically for this model family. We report a series of model class selection experiments on real-world data sets, w...
Probabilistic Models for Bacterial Taxonomy
 INTERNATIONAL STATISTICAL REVIEW
, 2000
Cited by 8 (3 self)
We give a survey of different probabilistic partitioning methods that have been applied to bacterial taxonomy. We introduce a theoretical framework, which makes it possible to treat the various models in a unified way. The key concepts of our approach are prediction and storing of microbiological information in a Bayesian forecasting setting. We show that there is a close connection between classification and probabilistic identification and that, in fact, our approach ties these two concepts together in a coherent way.
A unified view on clustering binary data
 Machine Learning
Cited by 7 (1 self)
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. Binary data have been occupying a special place in the domain of data analysis. A unified view of binary data clustering is presented by examining the connections among various clustering criteria. Experimental studies are conducted to empirically verify the relationships.
Randomised Local Search Algorithm for the Clustering Problem
, 2000
Cited by 5 (2 self)
We consider clustering as a combinatorial optimisation problem. Local search provides a simple and effective approach to many
Pairwise Nearest Neighbor Method Revisited
, 2004
Cited by 4 (0 self)
The pairwise nearest neighbor (PNN) method, also known as Ward's method, belongs to the class of agglomerative clustering methods. The PNN method generates hierarchical clustering using a sequence of merge operations until the desired number of clusters is obtained. This method selects the cluster pair to be merged so that it increases the given objective function value least. The main drawback of the PNN method is its slowness because the time complexity of the fastest known exact implementation of the PNN method is lower bounded by O(N²), where N is the number of data objects. We consider several speedup methods for the PNN method in the first publication. These methods maintain the precision of the method. Another method for speeding up the PNN method is investigated in the second publication, where we utilize a k-neighborhood graph for reducing distance calculations and operations. A remarkable speedup is achieved at the cost of a slight increase in distortion. The PNN method can also be adapted for multilevel thresholding, which can be seen as
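The merge rule described in this abstract can be sketched as a naive agglomerative loop. The sketch assumes the objective is the total squared error, so that merging clusters a and b increases it by n_a·n_b/(n_a+n_b)·‖c_a − c_b‖² (the usual Ward cost); that choice, the exhaustive pair search, and the names `merge_cost` and `pnn` are illustrative assumptions, not the faster variants the abstract refers to.

```python
def merge_cost(size_a, mean_a, size_b, mean_b):
    """Increase in total squared error caused by merging two clusters (Ward cost)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))
    return size_a * size_b / (size_a + size_b) * squared_distance


def pnn(points, target_clusters):
    """Naive PNN: start from singletons and repeatedly merge the cheapest pair.

    Assumes 1 <= target_clusters <= len(points).
    Returns a list of clusters, each a list of indices into `points`.
    """
    sizes = [1] * len(points)
    means = [[float(x) for x in p] for p in points]
    members = [[i] for i in range(len(points))]
    while len(means) > target_clusters:
        best = None
        # Exhaustive search over all cluster pairs for the smallest cost.
        for a in range(len(means)):
            for b in range(a + 1, len(means)):
                cost = merge_cost(sizes[a], means[a], sizes[b], means[b])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        total = sizes[a] + sizes[b]
        # The merged centroid is the size-weighted mean of the two centroids.
        means[a] = [(sizes[a] * x + sizes[b] * y) / total
                    for x, y in zip(means[a], means[b])]
        sizes[a] = total
        members[a].extend(members[b])
        del sizes[b], means[b], members[b]  # b > a, so index a stays valid
    return members
```

Each pass scans every pair of remaining clusters, which is the cost that the speedup methods mentioned above aim to reduce.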
Minimizing stochastic complexity using local search and GLA with applications to classification of bacteria
, 2000
Cited by 3 (2 self)
In this paper, we compare the performance of two iterative clustering methods when applied to an extensive data set describing strains of the bacterial family Enterobacteriaceae. In both methods, the classification (i.e. the number of classes and the partitioning) is determined by minimizing stochastic complexity. The first method performs the minimization by repeated application of the generalized Lloyd algorithm (GLA). The second method uses an optimization technique known as local search (LS). The method modifies the current solution by making global changes to the class structure and then performs local fine-tuning to find a local optimum. It is observed that if we fix the number of classes, the LS finds a classification with a lower stochastic complexity value than GLA. In addition, the variance of the solutions is much smaller for the LS due to its more systematic method of searching. Overall, the two algorithms produce similar classifications but they merge certain natural classes with microbiological relevance in different ways. © 2000 Elsevier Science Ireland Ltd. All rights reserved.
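For readers unfamiliar with the generalized Lloyd algorithm, a minimal sketch of its alternating structure follows. Note the substitution: the paper minimizes stochastic complexity, whereas this sketch uses plain squared error as a stand-in objective because the abstract does not give the stochastic complexity formula. The function name `gla` and its parameters are hypothetical.

```python
def gla(vectors, initial_centroids, max_iters=100):
    """Generalized Lloyd iteration with squared error as a stand-in objective.

    Alternates between (1) assigning each vector to its nearest centroid and
    (2) recomputing each centroid as the mean of its assigned vectors, until
    the assignment stops changing or max_iters is reached.
    """
    centroids = [list(map(float, c)) for c in initial_centroids]
    labels = None
    for _ in range(max_iters):
        # Partition step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(v, centroids[k])))
            for v in vectors
        ]
        if new_labels == labels:
            break  # converged: no vector changed class
        labels = new_labels
        # Centroid step: mean of the members; an empty class keeps its centroid.
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

A local search in the sense of the abstract would wrap a routine like this: perturb the class structure, run a few fine-tuning iterations, and keep the change only if the objective improves.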
Applying the EM-algorithm to Classification of Bacteria
 Proceedings of the International ICSC Congress on Intelligent Systems and Applications
, 2000
Cited by 2 (1 self)
In the present paper we study the use of the expectation maximization (EM) algorithm in classification. The EM-algorithm is used to calculate the probability of each vector belonging to each class. If we assign each vector to the class of maximal probability we get a classification minimizing a certain log-likelihood function. By analyzing these probabilities we get a clearer picture of how well data fits to the classification than by traditional classification methods. We define a vector to be well classified in the classification if its probability of belonging to some class is above a prescribed value 1 - ε. Then we set up the experimental procedure to filter out elements that are not well classified in a large data set describing strains of bacteria belonging to the family Enterobacteriaceae. We compare classifications with a subset of the data (containing only well classified elements) to classifications done with randomly chosen subsets. We note that classifications done with w...
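The "well classified" filter can be illustrated with a short sketch. It assumes binary vectors modelled by a mixture of independent Bernoulli components whose parameters have already been fitted (for example by EM), and it shows only the class-probability computation and the 1 - ε threshold. The model form and the names `posteriors`, `well_classified`, `weights`, and `thetas` are assumptions for illustration, not details taken from the paper.

```python
from math import prod


def posteriors(vector, weights, thetas):
    """Class membership probabilities of one binary vector.

    weights[k] is the mixing proportion of class k; thetas[k][j] is the assumed
    probability that attribute j equals 1 in class k (strictly between 0 and 1).
    """
    joint = [
        w * prod(t if bit else 1.0 - t for bit, t in zip(vector, theta))
        for w, theta in zip(weights, thetas)
    ]
    total = sum(joint)
    return [j / total for j in joint]


def well_classified(vectors, weights, thetas, epsilon):
    """Keep the vectors whose largest class probability exceeds 1 - epsilon."""
    return [
        v for v in vectors
        if max(posteriors(v, weights, thetas)) > 1.0 - epsilon
    ]
```

With two equally weighted classes and parameters [0.9, 0.9] and [0.1, 0.1], the vector (1, 0) receives probability 0.5 for each class and is filtered out for any ε below 0.5, while (1, 1) is retained for ε = 0.05.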
BinClass: A Software Package for Classifying Binary Vectors User's Guide
Cited by 2 (1 self)
In this document we introduce a software package BinClass for the classification of binary vectors and analysis of the classification results. First we will give a brief introduction to the mathematical foundations and theory of clustering, cumulative classification and mixture classification. We also introduce methods for analysis of the classifications including trees (dendrograms), comparison of the classifications and bootstrapping. A few pseudo-algorithms are presented. These methods are included in the software package. The third and fourth chapters are the user's guide to the actual software package. Finally a short sample session is presented to give insight into how the software actually works and to illustrate the function of some of the many parameters. Apart from being a user's guide to the software package, this document can be seen as a review and tutorial to classification methodology of binary data. This is due to extensive research done on the subject at our department.