T: Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data (0)

by S Monti, P Tamayo, J Mesirov, Golub
Venue: Mach Learn

Results 1 - 10 of 61

Solving Cluster Ensemble Problems by Bipartite Graph Partitioning

by Xiaoli Zhang Fern, Carla E. Brodley - IN PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON MACHINE LEARNING , 2004
Cited by 42 (3 self)
A critical problem in cluster ensemble research is how to combine multiple clusterings to yield a final superior clustering result. Leveraging advanced graph partitioning techniques, we solve this problem by reducing it to a graph partitioning problem. We introduce a new reduction method that constructs a bipartite graph from a given cluster ensemble. The resulting graph models both instances and clusters of the ensemble simultaneously as vertices in the graph. Our approach retains all of the information provided by a given ensemble, allowing the similarity among instances and the similarity among clusters to be considered collectively in forming the final clustering. Further, the resulting graph partitioning problem can be solved efficiently. We empirically evaluate the proposed approach against two commonly used graph formulations and show that it is more robust and achieves comparable or better performance in comparison to its competitors.
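
As a rough illustration of the construction this abstract describes, the sketch below builds the vertex and edge lists of such a bipartite graph in Python. The function name `build_bipartite_graph`, the label-list input format, and the vertex numbering are assumptions made for this example, not details taken from the paper. The graph partitioning step itself is not shown.

```python
def build_bipartite_graph(ensemble):
    """Build an instance/cluster bipartite graph from a cluster ensemble.

    ensemble: list of label lists, one per base clustering (assumed format);
    ensemble[r][i] is the cluster label of instance i in clustering r.
    Returns (n_instances, cluster_ids, edges).
    """
    n = len(ensemble[0])
    cluster_ids = {}  # (clustering index, label) -> cluster vertex offset
    edges = []        # (instance vertex, cluster vertex) pairs
    for r, labels in enumerate(ensemble):
        if len(labels) != n:
            raise ValueError("all clusterings must label the same instances")
        for i, label in enumerate(labels):
            key = (r, label)
            if key not in cluster_ids:
                cluster_ids[key] = len(cluster_ids)
            # Cluster vertices are numbered after the n instance vertices.
            edges.append((i, n + cluster_ids[key]))
    return n, cluster_ids, edges


# Two base clusterings of three instances.
n, cluster_ids, edges = build_bipartite_graph([[0, 0, 1], [0, 1, 1]])
```

Every edge joins an instance vertex to a cluster vertex, so instances are linked to each other only through the clusters they share, which is what lets one partition of the graph group instances and clusters together.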

Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms

by Stan Salvador, Philip Chan , 2003
Cited by 36 (1 self)
Many clustering and segmentation algorithms both suffer from the limitation that the number of clusters/segments are specified by a human user. It is often impractical to expect a human with sufficient domain knowledge to be available to select the number of clusters/segments to return. In this paper, we investigate techniques to determine the number of clusters or segments to return from hierarchical clustering and segmentation algorithms. We propose an efficient algorithm, the L method, that finds the “knee” in a ‘# of clusters vs. clustering evaluation metric’ graph. Using the knee is well-known, but is not a particularly well-understood method to determine the number of clusters. We explore the feasibility of this method, and attempt to determine in which situations it will and will not work. We also compare the L method to existing methods based on the accuracy of the number of clusters that are determined and efficiency. Our results show favorable performance for these criteria compared to the existing methods that were evaluated.
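
The abstract does not spell out how the knee is located. One plausible reading, sketched below in Python, fits one straight line to the points left of a candidate split and another to the points right of it, and keeps the split with the lowest size-weighted root-mean-square error. The function names, the weighting, and the sample data are assumptions for illustration and may differ from the authors' L method.

```python
from math import sqrt


def _line_fit_rmse(xs, ys):
    """RMSE of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    sq_err = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sq_err / n)


def find_knee(xs, ys):
    """Return the x value where a two-line fit has the lowest weighted RMSE."""
    n = len(xs)
    if n < 4:
        raise ValueError("need at least four points (two per line)")
    best_x, best_err = None, float("inf")
    for c in range(2, n - 1):  # left = first c points, right = the rest
        left = _line_fit_rmse(xs[:c], ys[:c])
        right = _line_fit_rmse(xs[c:], ys[c:])
        err = (c / n) * left + ((n - c) / n) * right
        if err < best_err:
            best_err, best_x = err, xs[c - 1]
    return best_x


# Hypothetical evaluation-metric values for 1..8 clusters.
num_clusters = list(range(1, 9))
metric = [100, 70, 40, 10, 5, 4.5, 4, 3.5]
knee = find_knee(num_clusters, metric)
```

In this made-up curve the metric falls steeply up to four clusters and then flattens, so the split after the fourth point is the one that both lines fit best.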

Stability selection

by Nicolai Meinshausen, Peter Bühlmann
Cited by 18 (2 self)
Proofs subject to correction. Not to be reproduced without permission. Contributions to the discussion must not exceed 400 words. Contributions longer than 400 words will be cut by the editor.

Combining Multiple Clustering Systems

by Constantinos Boulis, Mari Ostendorf - In 8th European conference on Principles and Practice of Knowledge Discovery in Databases(PKDD), LNAI 3202 , 2004
Cited by 16 (0 self)
Three methods for combining multiple clustering systems are presented and evaluated, focusing on the problem of finding the correspondence between clusters of different systems. In this work, the clusters of individual systems are represented in a common space and their correspondence estimated by either "clustering clusters" or with Singular Value Decomposition. The approaches are evaluated for the task of topic discovery on three major corpora and eight different clustering algorithms and it is shown experimentally that combination schemes almost always offer gains compared to single systems, but gains from using a combination scheme depend on the underlying clustering systems.

Meta clustering

by Rich Caruana, Mohamed Elhawary, Nam Nguyen - In Proceedings IEEE International Conference on Data Mining , 2006
Cited by 14 (1 self)
Clustering is ill-defined. Unlike supervised learning where labels lead to crisp performance criteria such as accuracy and squared error, clustering quality depends on how the clusters will be used. Devising clustering criteria that capture what users need is difficult. Most clustering algorithms search for optimal clusterings based on a pre-specified clustering criterion. Our approach differs. We search for many alternate clusterings of the data, and then allow users to select the clustering(s) that best fit their needs. Meta clustering first finds a variety of clusterings and then clusters this diverse set of clusterings so that users must only examine a small number of qualitatively different clusterings. We present methods for automatically generating a diverse set of alternate clusterings, as well as methods for grouping clusterings into meta clusters. We evaluate meta clustering on four test problems and two case studies. Surprisingly, clusterings that would be of most interest to users often are not very compact clusterings.

Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary?

by Nguyen Xuan Vinh, James Bailey
Cited by 14 (0 self)
Information theoretic based measures form a fundamental class of similarity measures for comparing clusterings, beside the class of pair-counting based and set-matching based measures. In this paper, we discuss the necessity of correction for chance for information theoretic based measures for clusterings comparison. We observe that the baseline for such measures, i.e. average value between random partitions of a data set, does not take on a constant value, and tends to have larger variation when the ratio between the number of data points and the number of clusters is small. This effect is similar in some other non-information theoretic based measures such as the well-known Rand Index. Assuming a hypergeometric model of randomness, we derive the analytical formula for the expected mutual information value between a pair of clusterings, and then propose the adjusted version for several popular information theoretic based measures. Some examples are given to demonstrate the need and usefulness of the adjusted measures.

Model order selection for bio-molecular data clustering

by Alberto Bertoni, Giorgio Valentini - BMC BIOINFORMATICS , 2007
Cited by 10 (2 self)
Background: Cluster analysis has been widely applied for investigating structure in bio-molecular data. A drawback of most clustering algorithms is that they cannot automatically detect the “natural” number of clusters underlying the data, and in many cases we have no enough “a priori” biological knowledge to evaluate both the number of clusters as well as their validity. Recently several methods based on the concept of stability have been proposed to estimate the “optimal” number of clusters, but despite their successful application to the analysis of complex bio-molecular data, the assessment of the statistical significance of the discovered clustering solutions and the detection of multiple structures simultaneously present in high-dimensional bio-molecular data are still major problems. Results: We propose a stability method based on randomized maps that exploits the high-dimensionality and relatively low cardinality that characterize bio-molecular data, by selecting subsets of randomized linear combinations of the input variables, and by using stability indices based on the overall distribution of similarity measures between multiple pairs of clusterings performed on the randomly projected data. A χ²-based statistical test is proposed to assess the significance of the clustering solutions and to detect significant and if possible multi-level structures simultaneously present in the data (e.g. hierarchical structures).

Cluster Ensemble Selection

by Xiaoli Z. Fern, Wei Lin , 2008
Cited by 8 (1 self)
This paper studies the ensemble selection problem for unsupervised learning. Given a large library of different clustering solutions, our goal is to select a subset of solutions to form a smaller but better performing cluster ensemble than using all available solutions. We design our ensemble selection methods based on quality and diversity, the two factors that have been shown to influence cluster ensemble performance. Our investigation revealed that using quality or diversity alone may not consistently achieve improved performance. Based on our observations, we designed three different selection approaches that jointly consider these two factors. We empirically evaluated their performances in comparison with both full ensembles and a random selection strategy. Our results indicated that by explicitly considering both quality and diversity in ensemble selection, we can achieve statistically significant performance improvement over full ensembles.

Average Parameterization and Partial Kernelization for Computing Medians

by Nadja Betzler, Jiong Guo, Christian Komusiewicz, Rolf Niedermeier - PROC. 9TH LATIN , 2010
Cited by 6 (5 self)
We propose an effective polynomial-time preprocessing strategy for intractable median problems. Developing a new methodological framework, we show that if the input instances of generally intractable problems exhibit a sufficiently high degree of similarity between each other on average, then there are efficient exact solving algorithms. In other words, we show that the median problems Swap Median Permutation, Consensus Clustering, Kemeny Score, and Kemeny Tie Score all are fixed-parameter tractable with respect to the parameter “average distance between input objects”. To this end, we develop the new concept of “partial kernelization” and identify interesting polynomial-time solvable special cases for the considered problems.

Are approximation algorithms for consensus clustering worthwhile?

by Michael Bertolacci, Anthony Wirth
Cited by 5 (0 self)
Abstract not found