Solving Cluster Ensemble Problems by Bipartite Graph Partitioning
In Proceedings of the International Conference on Machine Learning, 2004
Abstract

Cited by 109 (3 self)
A critical problem in cluster ensemble research is how to combine multiple clusterings to yield a final superior clustering result. Leveraging advanced graph partitioning techniques, we solve this problem by reducing it to a graph partitioning problem. We introduce a new reduction method that constructs a bipartite graph from a given cluster ensemble. The resulting graph models both instances and clusters of the ensemble simultaneously as vertices in the graph. Our approach retains all of the information provided by a given ensemble, allowing the similarity among instances and the similarity among clusters to be considered collectively in forming the final clustering. Further, the resulting graph partitioning problem can be solved efficiently. We empirically evaluate the proposed approach against two commonly used graph formulations and show that it is more robust and achieves comparable or better performance than its competitors.
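The bipartite reduction described in this abstract can be sketched in a few lines (the helper name `bipartite_edges` and the toy ensemble are illustrative, not from the paper): each instance and each cluster of each base clustering becomes a vertex, with an edge connecting an instance to every cluster that contains it.

```python
# Minimal sketch of the bipartite-graph construction for a cluster ensemble.
# Vertices are the n instances plus every cluster from every base clustering;
# an edge links an instance to each cluster containing it.

def bipartite_edges(ensemble):
    """ensemble: list of label lists, one per base clustering.
    Returns edges (instance_index, (clustering_index, cluster_label))."""
    edges = []
    for c_idx, labels in enumerate(ensemble):
        for i, lab in enumerate(labels):
            edges.append((i, (c_idx, lab)))
    return edges

# Two base clusterings of 4 instances.
ensemble = [[0, 0, 1, 1], [0, 1, 1, 1]]
print(len(bipartite_edges(ensemble)))  # 8: one edge per (instance, clustering)
```

Partitioning this graph (e.g. with a spectral or multilevel partitioner) cuts instances and clusters jointly, which is what lets the method use instance-instance and cluster-cluster similarity at the same time.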
Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary?
Abstract

Cited by 101 (5 self)
Information theoretic based measures form a fundamental class of similarity measures for comparing clusterings, beside the classes of pair-counting based and set-matching based measures. In this paper, we discuss the necessity of correction for chance for information theoretic based measures for clusterings comparison. We observe that the baseline for such measures, i.e., the average value between random partitions of a data set, does not take on a constant value, and tends to have larger variation when the ratio between the number of data points and the number of clusters is small. This effect is similar in some other non-information theoretic based measures, such as the well-known Rand Index. Assuming a hypergeometric model of randomness, we derive the analytical formula for the expected mutual information value between a pair of clusterings, and then propose the adjusted version for several popular information theoretic based measures. Some examples are given to demonstrate the need and usefulness of the adjusted measures.
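The correction-for-chance scheme discussed here has the general form (index − expected index) / (max index − expected index). As a concrete, self-contained illustration, the sketch below applies that form to the pair-counting Rand-style index, giving the well-known adjusted Rand index; the paper applies the same adjustment to mutual information, whose expected value under the hypergeometric model is more involved.

```python
import math
from collections import Counter

def adjusted_rand_index(a, b):
    """Chance-adjusted agreement between two labelings of the same items,
    using the general adjusted form (index - E[index]) / (max - E[index])."""
    n = len(a)
    nij = Counter(zip(a, b))           # contingency-table cells
    ai, bj = Counter(a), Counter(b)    # row and column marginals
    sum_ij = sum(math.comb(c, 2) for c in nij.values())
    sum_a = sum(math.comb(c, 2) for c in ai.values())
    sum_b = sum(math.comb(c, 2) for c in bj.values())
    expected = sum_a * sum_b / math.comb(n, 2)  # E[index] under randomness
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 for identical partitions
```

Unlike the raw index, the adjusted value is 0 in expectation for random partitions and can go negative for worse-than-chance agreement, which is exactly the baseline problem the abstract raises.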
Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms
, 2003
Abstract

Cited by 101 (2 self)
Many clustering and segmentation algorithms suffer from the limitation that the number of clusters/segments is specified by a human user. It is often impractical to expect a human with sufficient domain knowledge to be available to select the number of clusters/segments to return. In this paper, we investigate techniques to determine the number of clusters or segments to return from hierarchical clustering and segmentation algorithms. We propose an efficient algorithm, the L method, that finds the “knee” in a “number of clusters vs. clustering evaluation metric” graph. Using the knee is a well-known, but not particularly well-understood, method to determine the number of clusters. We explore the feasibility of this method, and attempt to determine in which situations it will and will not work. We also compare the L method to existing methods based on both the accuracy of the determined number of clusters and on efficiency. Our results show favorable performance on these criteria compared to the existing methods that were evaluated.
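The L method can be sketched directly from this description (the curve below is made-up illustration data, not from the paper): for each candidate knee, fit one least-squares line to the points on its left and one to the points on its right, and pick the split that minimizes the length-weighted fit error.

```python
def rmse_of_line_fit(xs, ys):
    """RMSE of the least-squares line through the given points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sq_err = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return (sq_err / n) ** 0.5

def l_method(xs, ys):
    """Return the x-value of the knee of an evaluation-metric curve."""
    n = len(xs)
    best_err, best_split = float("inf"), None
    for c in range(2, n - 1):  # each segment keeps at least two points
        err = (c / n) * rmse_of_line_fit(xs[:c], ys[:c]) \
            + ((n - c) / n) * rmse_of_line_fit(xs[c:], ys[c:])
        if err < best_err:
            best_err, best_split = err, c
    return xs[best_split - 1]

# A curve that drops steeply, then flattens: the knee is at x = 4.
xs = list(range(1, 11))
ys = [40, 30, 20, 10, 8, 7.5, 7, 6.5, 6, 5.5]
print(l_method(xs, ys))  # 4
```

The weighting by segment length keeps a few outlying points at one end of the curve from dominating the choice of knee.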
Meta clustering
In Proceedings IEEE International Conference on Data Mining, 2006
Abstract

Cited by 41 (1 self)
Clustering is ill-defined. Unlike supervised learning, where labels lead to crisp performance criteria such as accuracy and squared error, clustering quality depends on how the clusters will be used. Devising clustering criteria that capture what users need is difficult. Most clustering algorithms search for optimal clusterings based on a pre-specified clustering criterion. Our approach differs. We search for many alternate clusterings of the data, and then allow users to select the clustering(s) that best fit their needs. Meta clustering first finds a variety of clusterings and then clusters this diverse set of clusterings so that users must only examine a small number of qualitatively different clusterings. We present methods for automatically generating a diverse set of alternate clusterings, as well as methods for grouping clusterings into meta clusters. We evaluate meta clustering on four test problems and two case studies. Surprisingly, the clusterings of most interest to users often are not very compact clusterings.
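The grouping-of-clusterings step can be sketched with any clustering-similarity measure; the sketch below (helper names and the greedy grouping rule are illustrative assumptions, not the paper's algorithm) uses the pair-counting Rand index as the similarity between two clusterings and groups near-duplicates together.

```python
def rand_index(a, b):
    """Fraction of instance pairs on which two clusterings agree
    (both place the pair together, or both place it apart)."""
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / (n * (n - 1) // 2)

def meta_clusters(clusterings, threshold=0.9):
    """Greedy grouping: each clustering joins the first group whose
    representative it resembles closely enough, else starts a new group."""
    groups = []
    for c in clusterings:
        for g in groups:
            if rand_index(c, g[0]) >= threshold:
                g.append(c)
                break
        else:
            groups.append([c])
    return groups

base = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
print(len(meta_clusters(base)))  # 2: two near-identical clusterings collapse
```

A user then inspects one representative per group instead of every clustering in the library, which is the point of the meta-clustering step.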
Model order selection for biomolecular data clustering
BMC Bioinformatics, 2007
Abstract

Cited by 32 (5 self)
Background: Cluster analysis has been widely applied for investigating structure in biomolecular data. A drawback of most clustering algorithms is that they cannot automatically detect the “natural” number of clusters underlying the data, and in many cases we do not have enough “a priori” biological knowledge to evaluate both the number of clusters and their validity. Recently, several methods based on the concept of stability have been proposed to estimate the “optimal” number of clusters, but despite their successful application to the analysis of complex biomolecular data, the assessment of the statistical significance of the discovered clustering solutions and the detection of multiple structures simultaneously present in high-dimensional biomolecular data are still major problems. Results: We propose a stability method based on randomized maps that exploits the high dimensionality and relatively low cardinality that characterize biomolecular data, by selecting subsets of randomized linear combinations of the input variables, and by using stability indices based on the overall distribution of similarity measures between multiple pairs of clusterings performed on the randomly projected data. A χ²-based statistical test is proposed to assess the significance of the clustering solutions and to detect significant and, if possible, multi-level structures simultaneously present in the data (e.g. hierarchical structures).
Cluster Ensemble Selection
, 2008
Abstract

Cited by 29 (1 self)
This paper studies the ensemble selection problem for unsupervised learning. Given a large library of different clustering solutions, our goal is to select a subset of solutions that forms a smaller but better-performing cluster ensemble than using all available solutions. We design our ensemble selection methods based on quality and diversity, the two factors that have been shown to influence cluster ensemble performance. Our investigation revealed that using quality or diversity alone may not consistently achieve improved performance. Based on our observations, we designed three different selection approaches that jointly consider these two factors. We empirically evaluated their performance in comparison with both full ensembles and a random selection strategy. Our results indicate that by explicitly considering both quality and diversity in ensemble selection, we can achieve statistically significant performance improvements over full ensembles.
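One way a joint quality-and-diversity selection can work is sketched below. This greedy trade-off rule is an illustrative assumption, not one of the paper's three approaches: it seeds with the highest-quality clustering, then repeatedly adds the candidate whose quality plus distance from the already-chosen members is largest.

```python
def select_ensemble(quality, diversity, k):
    """quality: per-clustering quality scores (higher is better).
    diversity: symmetric matrix; diversity[i][j] higher = more different.
    Greedily pick k members, trading quality against redundancy."""
    chosen = [max(range(len(quality)), key=quality.__getitem__)]
    while len(chosen) < k:
        def gain(i):
            # distance to the nearest already-chosen member: penalizes
            # candidates that duplicate something we already have
            nearest = min(diversity[i][j] for j in chosen)
            return quality[i] + nearest
        rest = [i for i in range(len(quality)) if i not in chosen]
        chosen.append(max(rest, key=gain))
    return chosen

quality = [0.9, 0.5, 0.8]
diversity = [[0, 0.1, 0.9],
             [0.1, 0, 0.8],
             [0.9, 0.8, 0]]
print(select_ensemble(quality, diversity, 2))  # [0, 2]
```

In practice, quality might come from an internal validity index and diversity from pairwise (adjusted) agreement between clusterings; using either term alone reproduces the failure mode the abstract describes.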
Learning States and Rules for Time Series Anomaly Detection
, 2003
Abstract

Cited by 28 (6 self)
The normal operation of a device can be characterized by different temporal states. To identify these states, we introduce a clustering algorithm called Gecko that can determine a reasonable number of clusters using our proposed L method. We then use the RIPPER classification algorithm to describe these states in logical rules. Finally, transitional logic between the states is added to create a finite state automaton. Our empirical results, on data obtained from the NASA shuttle program, indicate that the Gecko clustering algorithm is comparable to a human expert in identifying states, and that our overall system can track normal behavior and detect anomalies.
Bayesian cluster ensembles
In Proceedings of the 9th SIAM International Conference on Data Mining, 2009
Abstract

Cited by 26 (2 self)
Cluster ensembles provide a framework for combining multiple base clusterings of a dataset to generate a stable and robust consensus clustering. There are important variants of the basic cluster ensemble problem, notably including cluster ensembles with missing values, as well as row-distributed or column-distributed cluster ensembles. Existing cluster ensemble algorithms are applicable only to a small subset of these variants. In this paper, we propose Bayesian Cluster Ensembles (BCE), a mixed-membership model for learning cluster ensembles that is applicable to all the primary variants of the problem. We propose two methods, based respectively on variational approximation and Gibbs sampling, for learning a Bayesian cluster ensemble. We compare BCE extensively with several other cluster ensemble algorithms, and demonstrate that BCE is not only versatile in terms of its applicability, but also outperforms the other algorithms in terms of stability and accuracy.