Results 1-10 of 60
Empirical and theoretical comparisons of selected criterion functions for document clustering
 Machine Learning
Abstract

Cited by 110 (7 self)
This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets. Our study involves a total of seven different criterion functions, three of which are introduced in this paper and four that have been proposed in the past. We present a comprehensive experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that there is a set of criterion functions that consistently outperforms the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis shows that the relative performance of the criterion functions depends on (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.
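Criterion-driven partitional clustering of the kind this paper studies scores a k-way split with a single objective and searches for a partition that optimizes it. A minimal sketch of one cosine-based cohesion criterion (the function name `i2_criterion` and the toy vectors are illustrative, not the paper's exact definitions):

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def i2_criterion(clusters):
    """Score a partition as the sum over clusters of the length of the
    composite (summed) document vector -- larger means tighter clusters
    under cosine similarity. Illustrative, not the paper's exact criterion."""
    total = 0.0
    for docs in clusters:
        composite = [sum(col) for col in zip(*docs)]
        total += norm(composite)
    return total

# Two toy partitions of four unit-length "documents":
docs = [(1.0, 0.0), (0.96, 0.28), (0.0, 1.0), (0.28, 0.96)]
good = [[docs[0], docs[1]], [docs[2], docs[3]]]   # similar docs together
bad  = [[docs[0], docs[2]], [docs[1], docs[3]]]   # similar docs split apart
assert i2_criterion(good) > i2_criterion(bad)
```

Maximizing such a criterion favors partitions whose clusters have long composite vectors, i.e. internally similar documents; the paper's analysis concerns how different objectives of this kind behave under unequal cluster tightness and imbalance.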
Comparing Clusterings
, 2002
Abstract

Cited by 67 (4 self)
This paper proposes an information theoretic criterion for comparing two clusterings of the same data set. The criterion, called variation of information (VI), measures the amount of information that is lost or gained in changing from clustering C to clustering C'. The criterion makes no assumptions about how the clusterings were generated and applies to both soft and hard clusterings. The basic properties of VI are presented and discussed from the point of view of comparing clusterings. In particular, the VI is positive, symmetric and obeys the triangle inequality and thus, surprisingly enough, is a true metric on the space of clusterings.
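The criterion has the closed form VI(C, C') = H(C) + H(C') - 2 I(C; C'). A small sketch computing it from two hard label assignments (hard clusterings only, a simpler setting than the paper's full treatment):

```python
import math
from collections import Counter

def variation_of_information(labels_a, labels_b):
    """VI(C, C') = H(C) + H(C') - 2 I(C; C'), computed from the
    contingency counts of two hard clusterings of the same n points."""
    n = len(labels_a)
    pa = Counter(labels_a)
    pb = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h = lambda counts: -sum(c / n * math.log(c / n) for c in counts.values())
    mi = sum(c / n * math.log((c / n) / ((pa[i] / n) * (pb[j] / n)))
             for (i, j), c in joint.items())
    return h(pa) + h(pb) - 2 * mi

# Identical clusterings are at distance 0, and VI is symmetric:
a = [0, 0, 1, 1, 2, 2]
b = [0, 0, 0, 1, 1, 1]
assert abs(variation_of_information(a, a)) < 1e-12
assert abs(variation_of_information(a, b) - variation_of_information(b, a)) < 1e-12
```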
Alternatives to the k-Means Algorithm That Find Better Clusterings
Abstract

Cited by 62 (5 self)
We investigate here the behavior of the standard k-means clustering algorithm and several alternatives to it: the k-harmonic means algorithm due to Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two new variants of k-harmonic means. Our aim is to find which aspects of these algorithms contribute to finding good clusterings, as opposed to converging to a low-quality local optimum. We describe each algorithm in a unified framework that introduces separate cluster membership and data weight functions.
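The unified framework mentioned in the last sentence can be sketched as a single center-update routine parameterized by a membership function and a data-weight function; plain k-means falls out as hard membership with unit weights (function names are illustrative, not the paper's notation):

```python
def update_centers(points, centers, membership, weight):
    """One batch update in a unified framework: each algorithm is
    characterized by a membership function m(j|x) and a data-weight
    function w(x); centers move to the weighted mean of their points."""
    k, dim = len(centers), len(points[0])
    num = [[0.0] * dim for _ in range(k)]
    den = [0.0] * k
    for x in points:
        m = membership(x, centers)   # m[j] sums to 1 over clusters
        w = weight(x, centers)       # per-point influence
        for j in range(k):
            den[j] += m[j] * w
            for d in range(dim):
                num[j][d] += m[j] * w * x[d]
    return [[num[j][d] / den[j] for d in range(dim)] for j in range(k)]

def dist2(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

# Plain k-means as the special case: hard membership, unit weight.
def km_membership(x, centers):
    best = min(range(len(centers)), key=lambda j: dist2(x, centers[j]))
    return [1.0 if j == best else 0.0 for j in range(len(centers))]

pts = [[0.0], [0.2], [3.8], [4.0]]
centers = update_centers(pts, [[1.0], [3.0]], km_membership, lambda x, c: 1.0)
assert abs(centers[0][0] - 0.1) < 1e-9 and abs(centers[1][0] - 3.9) < 1e-9
```

k-harmonic means and fuzzy k-means would plug in soft membership and non-uniform weight functions instead of the hard/unit pair used here.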
Generative model-based document clustering: a comparative study
 Knowledge and Information Systems
, 2005
Abstract

Cited by 48 (0 self)
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations when the set of labels available in labeled data is not complete, i.e., unlabeled data contain new classes that are not present in labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is the best when available labels are complete, whereas the feedback-based approach excels when available labels are incomplete.
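The seeding idea behind Seeded damnl (inherited from seeded k-means) amounts to initializing each cluster center from the labeled documents rather than at random; the deterministic-annealing refinement itself is omitted in this sketch, and all names are illustrative:

```python
from collections import defaultdict

def seeded_centroids(seed_points, seed_labels):
    """Seeding step in the style of seeded k-means: initialize each
    cluster center at the mean of its labeled seed documents instead
    of at random, then hand off to the usual iterative refinement."""
    groups = defaultdict(list)
    for x, y in zip(seed_points, seed_labels):
        groups[y].append(x)
    return {y: [sum(col) / len(xs) for col in zip(*xs)]
            for y, xs in groups.items()}

# Toy 2-D "document" vectors with class seeds:
centers = seeded_centroids([[0.0, 0.0], [0.0, 2.0], [4.0, 4.0]],
                           ["sports", "sports", "politics"])
assert centers["sports"] == [0.0, 1.0]
assert centers["politics"] == [4.0, 4.0]
```

The constrained variant would additionally keep the seed documents clamped to their labeled clusters during refinement, rather than only using them for initialization.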
A Comparative Study of Generative Models for Document Clustering
 SIAM Knowledge and Information Systems
, 2002
Abstract

Cited by 45 (5 self)
Generative models based on the multivariate Bernoulli and multinomial distributions have been widely used for text classification. Recently, the spherical k-means algorithm, which has desirable properties for text clustering, has been shown to be a special case of a generative model based on a mixture of von Mises-Fisher (vMF) distributions. This paper compares these three probabilistic models for text clustering, both theoretically and empirically, using a general model-based clustering framework. For each model, we investigate three strategies for assigning documents to models: maximum likelihood (k-means) assignment, stochastic assignment, and soft assignment. Our experimental results over a large number of datasets show that, in terms of clustering quality, (a) the Bernoulli model is the worst for text clustering; (b) the vMF model produces better clustering results than both Bernoulli and multinomial models; (c) soft assignment leads to comparable or slightly better results than hard assignment. We also use deterministic annealing (DA) to improve the vMF-based soft clustering and compare all the model-based algorithms with the state-of-the-art discriminative approach to document clustering based on graph partitioning (CLUTO) and a spectral co-clustering method. Overall, CLUTO and DA perform the best but are also the most computationally expensive; the spectral co-clustering algorithm fares worse than the vMF-based methods.
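Spherical k-means, the hard-assignment special case of the vMF mixture mentioned above, fits in a few lines: documents live on the unit sphere, assignment maximizes cosine similarity (a dot product), and each concept vector is the re-normalized sum of its documents. A hedged sketch of one iteration (variable names are illustrative):

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def spherical_kmeans_step(docs, concepts):
    """One spherical k-means iteration: normalize documents to the unit
    sphere, assign each to the concept vector with the largest dot
    product (cosine similarity), then recompute each concept as the
    normalized sum of its members. Assumes no cluster goes empty."""
    docs = [unit(d) for d in docs]
    assign = [max(range(len(concepts)),
                  key=lambda j: sum(a * b for a, b in zip(d, concepts[j])))
              for d in docs]
    new = []
    for j in range(len(concepts)):
        members = [d for d, a in zip(docs, assign) if a == j]
        comp = [sum(col) for col in zip(*members)]
        new.append(unit(comp))
    return assign, new

assign, concepts = spherical_kmeans_step(
    [[2.0, 0.0], [3.0, 1.0], [0.0, 5.0], [1.0, 4.0]],
    [unit([1.0, 0.1]), unit([0.1, 1.0])])
assert assign == [0, 0, 1, 1]
```

The full vMF mixture replaces the hard argmax with posterior probabilities whose sharpness is controlled by the concentration parameter, which is how the soft-assignment strategy in the paper arises.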
Convex clustering with exemplar-based models
 In Advances in Neural Information Processing Systems (NIPS
, 2007
Abstract

Cited by 35 (0 self)
Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering.
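The convexity argument can be illustrated in miniature: if every data point is a candidate exemplar with a fixed component centered on it, the only free parameters are the mixture weights, and the weight-only EM update is a simple fixed point whose result does not depend on initialization. The sketch below is a simplification under that assumption, not the paper's exact algorithm (the names and the `beta` parameter are illustrative):

```python
import math

def exemplar_mixture_weights(points, beta=1.0, iters=200):
    """Exemplar-based approximate mixture fitting: components are fixed
    isotropic kernels centered at the data points themselves, so only
    the mixture weights are optimized -- a convex problem solved here
    by the standard EM fixed-point update on weights alone."""
    n = len(points)
    # f[i][j] = density of component j (centered at point j) at point i.
    f = [[math.exp(-beta * sum((a - b) ** 2 for a, b in zip(x, e)))
          for e in points] for x in points]
    w = [1.0 / n] * n
    for _ in range(iters):
        w = [w[j] / n * sum(f[i][j] / sum(w[k] * f[i][k] for k in range(n))
                            for i in range(n))
             for j in range(n)]
    return w

pts = [[0.0], [0.1], [5.0], [5.1]]
w = exemplar_mixture_weights(pts, beta=2.0)
assert abs(sum(w) - 1.0) < 1e-6
# Weight concentrates within each tight group of points:
assert w[0] + w[1] > 0.4 and w[2] + w[3] > 0.4
```

Points whose weight survives the iteration act as exemplars; the soft responsibilities p(j|x_i) give the probabilistic mapping the abstract describes.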
Robust hierarchical clustering
, 2010
Abstract

Cited by 17 (2 self)
One of the most widely used techniques for data clustering is agglomerative clustering. Such algorithms have long been used across many different fields, ranging from computational biology to social sciences to computer vision, in part because their output is easy to interpret. Unfortunately, it is well known that many of the classic agglomerative clustering algorithms are not robust to noise [14]. In this paper we propose and analyze a new robust algorithm for bottom-up agglomerative clustering. We show that our algorithm can be used to cluster accurately in cases where the data satisfies a number of natural properties and where the traditional agglomerative algorithms fail. We also show how to adapt our algorithm to the inductive setting where our given data is only a small random sample of the entire data set.
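For contrast with the robust variant the paper proposes, the classic bottom-up procedure it improves on looks like this: start with singletons and repeatedly merge the closest pair of clusters. A plain average-linkage sketch (O(n^3), names illustrative; this is the noise-sensitive baseline, not the paper's algorithm):

```python
def average_linkage(points, k):
    """Classic bottom-up agglomerative clustering with average linkage:
    repeatedly merge the pair of clusters with the smallest mean
    pairwise distance until k clusters remain."""
    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                link = (sum(d(a, b) for a in clusters[i] for b in clusters[j])
                        / (len(clusters[i]) * len(clusters[j])))
                if best is None or link < best[0]:
                    best = (link, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters

out = average_linkage([[0.0], [0.3], [9.0], [9.2], [5.0]], 2)
assert sorted(len(c) for c in out) == [2, 3]
```

Because each merge is greedy and irreversible, a few noisy points can chain unrelated clusters together early on, which is exactly the failure mode robust variants are designed to avoid.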
Scalable model-based clustering for large databases based on data summarization
 IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2005
Abstract

Cited by 16 (3 self)
The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources such as memory and computation time. In this paper, two scalable clustering algorithms, bEMADS and gEMADS, are presented based on the Gaussian mixture model. Both summarize data into subclusters and then generate Gaussian mixtures from their data summaries. Their core algorithm, EMADS, is defined on data summaries and approximates the aggregate behavior of each subcluster of data under the Gaussian mixture model. EMADS is provably convergent. Experimental results substantiate that both algorithms can run several orders of magnitude faster than expectation-maximization with little loss of accuracy. Index Terms: Scalable clustering, Gaussian mixture model, expectation-maximization, data summary, maximum
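The summary-based idea can be illustrated with one EM iteration that sees only (count, mean) pairs instead of raw points: each subcluster contributes through its mean, weighted by how many points it summarizes. Real EMADS also carries per-summary variances, so this 1-D sketch with illustrative names is a deliberate simplification:

```python
import math

def em_step_on_summaries(summaries, means, variances, weights):
    """One EM iteration on data summaries (count, mean) rather than raw
    points: responsibilities are computed at each summary's mean and
    scaled by its count, so the cost depends on the number of summaries,
    not the number of points. Simplified 1-D sketch of summary-based EM."""
    k = len(means)
    # E-step: responsibilities of each Gaussian component per summary.
    resp = []
    for count, m in summaries:
        dens = [weights[j] / math.sqrt(2 * math.pi * variances[j])
                * math.exp(-(m - means[j]) ** 2 / (2 * variances[j]))
                for j in range(k)]
        z = sum(dens)
        resp.append([count * d / z for d in dens])  # scaled by summary size
    # M-step: weighted parameter updates from the scaled responsibilities.
    n = sum(c for c, _ in summaries)
    new_means, new_weights = [], []
    for j in range(k):
        rj = sum(r[j] for r in resp)
        new_weights.append(rj / n)
        new_means.append(sum(r[j] * m for r, (_, m) in zip(resp, summaries)) / rj)
    return new_means, new_weights

# Two tight groups of 300 points summarized into just three subclusters:
summaries = [(100, 0.0), (50, 0.2), (150, 6.0)]
means, weights = em_step_on_summaries(summaries, [1.0, 5.0], [1.0, 1.0], [0.5, 0.5])
assert abs(weights[0] + weights[1] - 1.0) < 1e-9
assert means[0] < 1.0 and means[1] > 5.0
```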
Protein solubility: sequence based prediction and experimental verification
Abstract

Cited by 14 (2 self)
Motivation: Obtaining soluble proteins in sufficient concentrations is a recurring limiting factor in various experimental studies. Solubility is an individual trait of proteins which, under a given set of experimental conditions, is determined by their amino acid sequence. Accurate theoretical prediction of solubility from sequence is instrumental for setting priorities on targets in large-scale proteomics projects. Results: We present a machine-learning approach called PROSO to assess the chance of a protein being soluble upon heterologous expression in E. coli based on its amino acid composition. The classification algorithm is organized as a two-layered structure in which the output of primary support vector machine classifiers serves as input for a secondary Naive Bayes classifier. Experimental progress information from the TargetDB database as well as previously published datasets were used as the source of training data. In comparison with previously published methods, our classification algorithm possesses improved discriminatory capacity, characterized by a Matthews Correlation Coefficient of 0.434 between predicted and known solubility states and an overall prediction accuracy of 72% (75% and 68% for the positive and negative class, respectively). We also provide experimental verification of our predictions using solubility measurements for 31 mutational variants of two different proteins. Availability: A Web server for protein solubility prediction is available
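The two-layered design described above is an instance of stacking: the primary classifiers' output scores become the feature vector of a secondary Naive Bayes model. A generic pure-Python sketch of that second layer (not PROSO's actual pipeline; function names and toy scores are invented for illustration):

```python
import math
from collections import defaultdict

def train_gaussian_nb(score_vectors, labels):
    """Secondary-layer classifier in a stacked ('two-layered') design:
    fit a Gaussian Naive Bayes model where each feature is the score of
    one primary classifier (e.g. an SVM margin)."""
    by_label = defaultdict(list)
    for s, y in zip(score_vectors, labels):
        by_label[y].append(s)
    model = {}
    n = len(labels)
    for y, rows in by_label.items():
        dims = list(zip(*rows))
        stats = []
        for d in dims:
            mu = sum(d) / len(d)
            var = max(sum((x - mu) ** 2 for x in d) / len(d), 1e-6)
            stats.append((mu, var))   # per-feature mean and variance
        model[y] = (len(rows) / n, stats)
    return model

def nb_predict(model, s):
    def loglik(prior, stats):
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
            for x, (mu, var) in zip(s, stats))
    return max(model, key=lambda y: loglik(*model[y]))

# Toy primary-classifier scores for four training proteins:
scores = [[0.9, 0.8], [0.8, 0.7], [0.1, 0.2], [0.2, 0.3]]
labels = ["soluble", "soluble", "insoluble", "insoluble"]
model = train_gaussian_nb(scores, labels)
assert nb_predict(model, [0.85, 0.75]) == "soluble"
assert nb_predict(model, [0.15, 0.25]) == "insoluble"
```

Stacking a simple probabilistic model on top of stronger primary classifiers is a common way to calibrate and combine their outputs; PROSO's specific feature set and training procedure are as described in the abstract, not reproduced here.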