## Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary?

Citations: 34 (1 self)

### BibTeX

```bibtex
@MISC{Vinh_informationtheoretic,
  author = {Nguyen Xuan Vinh and James Bailey},
  title  = {Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary?},
  year   = {}
}
```

### Abstract

Information theoretic based measures form a fundamental class of similarity measures for comparing clusterings, besides the classes of pair-counting based and set-matching based measures. In this paper, we discuss the necessity of correction for chance for information theoretic based measures for clusterings comparison. We observe that the baseline for such measures, i.e. the average value between random partitions of a data set, does not take on a constant value, and tends to vary more when the ratio between the number of data points and the number of clusters is small. A similar effect appears in some non-information theoretic based measures, such as the well-known Rand Index. Assuming a hypergeometric model of randomness, we derive the analytical formula for the expected mutual information value between a pair of clusterings, and then propose adjusted versions of several popular information theoretic based measures. Examples are given to demonstrate the need for and usefulness of the adjusted measures.
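
The adjustment the abstract describes can be sketched in a few dozen lines of Python. This is an illustrative transcription, not the paper's reference code: it computes the expected mutual information under the hypergeometric model via log-gamma weights, and normalizes by the larger of the two entropies (one of several normalizations one might choose); all function names are mine.

```python
import math
from collections import Counter

def entropy_from_counts(counts, n):
    """Shannon entropy (natural log) of a partition with given cluster sizes."""
    return -sum((c / n) * math.log(c / n) for c in counts)

def mutual_info(U, V):
    """Mutual information between two label vectors of equal length."""
    n = len(U)
    a, b = Counter(U), Counter(V)
    nij = Counter(zip(U, V))  # contingency-table cell counts
    return sum((c / n) * math.log(n * c / (a[u] * b[v]))
               for (u, v), c in nij.items())

def expected_mutual_info(a, b, n):
    """E[MI] between random partitions with fixed cluster sizes a, b,
    under the hypergeometric model of randomness."""
    emi = 0.0
    for ai in a:
        for bj in b:
            for nij in range(max(ai + bj - n, 1), min(ai, bj) + 1):
                # log-probability of observing cell count nij
                logp = (math.lgamma(ai + 1) + math.lgamma(bj + 1)
                        + math.lgamma(n - ai + 1) + math.lgamma(n - bj + 1)
                        - math.lgamma(n + 1) - math.lgamma(nij + 1)
                        - math.lgamma(ai - nij + 1) - math.lgamma(bj - nij + 1)
                        - math.lgamma(n - ai - bj + nij + 1))
                emi += (nij / n) * math.log(n * nij / (ai * bj)) * math.exp(logp)
    return emi

def adjusted_mutual_info(U, V):
    """AMI = (MI - E[MI]) / (max(H(U), H(V)) - E[MI]); 1 for identical
    partitions, around 0 for independent random ones."""
    n = len(U)
    a = list(Counter(U).values())
    b = list(Counter(V).values())
    emi = expected_mutual_info(a, b, n)
    h = max(entropy_from_counts(a, n), entropy_from_counts(b, n))
    return (mutual_info(U, V) - emi) / (h - emi)
```

Because the expected value is subtracted out, relabeling the clusters leaves the score unchanged, and identical partitions score exactly 1 regardless of the number of clusters.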

### Citations

8563 | Elements of Information Theory
- Cover, Thomas
- 1991

Citation Context ...sure, information theoretic based measures have received increasing attention for their strong theoretical background. Let us first review some of the very fundamental concepts of information theory (Cover & Thomas, 1991) and then see how those concepts might be used toward assessing clusterings agreement. Definition 2.1 The information entropy of a discrete random variable X, that can take on possible values in its ...
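
The entropy definition this context quotes is straightforward to compute for a clustering, treating each point's cluster label as a draw of a discrete random variable. A minimal sketch (natural log; function name is mine):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(X) = -sum_x p(x) * log p(x), with p(x) estimated
    from the empirical frequencies of the cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())
```

For example, a clustering that splits the data into two equal halves has entropy log 2, while putting every point in one cluster gives entropy 0.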

521 | Comparing partitions
- Hubert, Arabie
- 1985

Citation Context ...area which has also received much attention. Various clustering comparison measures have been proposed: besides the class of pair-counting based measures including the well-known Adjusted Rand Index (Hubert & Arabie, 1985), and set-matching based measures, such as the H criterion (Meilǎ, 2005), information theoretic based measures, such as the Mutual Information (Strehl & Ghosh, 2002) and the Variation of Information ...

455 | Objective criteria for the evaluation of clustering methods
- Rand
- 1971

Citation Context ... 1). Intuitively, N11 and N00 can be used as indicators of agreement between U and V, while N01 and N10 can be used as disagreement indicators. A well known index of this class is the Rand Index (Rand, 1971), defined straightforwardly as: $RI(U, V) = (N_{00} + N_{11}) / \binom{N}{2}$ (1). The Rand Index lies between 0 and 1. It takes the value of 1 when the two clusterings are identical, and 0 when no pair of points a...
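
The Rand Index in this context is simple to compute directly by pair counting. A minimal sketch, enumerating all N(N-1)/2 point pairs (function name is mine; this O(N^2) form is for clarity, not speed):

```python
from itertools import combinations

def rand_index(U, V):
    """Fraction of point pairs on which clusterings U and V agree:
    both place the pair in the same cluster (N11) or both place it
    in different clusters (N00)."""
    assert len(U) == len(V)
    pairs = list(combinations(range(len(U)), 2))
    agree = sum((U[i] == U[j]) == (V[i] == V[j]) for i, j in pairs)
    return agree / len(pairs)
```

Identical clusterings score 1; two clusterings that disagree on every pair score 0, matching the bounds quoted in the context.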

396 | Cluster ensembles — a knowledge reuse framework for combining multiple partitions
- Strehl, Ghosh

Citation Context ...e well-known Adjusted Rand Index (Hubert & Arabie, 1985), and set-matching based measures, such as the H criterion (Meilǎ, 2005), information theoretic based measures, such as the Mutual Information (Strehl & Ghosh, 2002) and the Variation of Information (Meilǎ, 2005), form another fundamental class of clustering comparison measures. In this paper, we aim to improve the usability of the class of information theoretic...

157 | Consensus Clustering: A Resampling-Based Method for Class Discovery and
- Monti, Tamayo, et al.
- 2003

Citation Context ...of clusters via Consensus Clustering: We start by first providing some background on Consensus Clustering. In an era where a huge number of clustering algorithms exist, the Consensus Clustering idea (Monti et al., 2003; Strehl & Ghosh, 2002; Yu et al., 2007) has recently received increasing interest. Consensus Clustering is not just another clustering algorithm: it rather provides a framework for unifying the knowl...

84 | Clustering on the unit hypersphere using von mises-fisher distributions
- Banerjee, Dhillon, et al.

Citation Context ...h are information theoretic based, have also been employed more recently in the clustering literature (Banerjee et al., 2005; Strehl & Ghosh, 2002; Meilǎ, 2005). Although there is currently no consensus on which is the best measure, information theoretic based measures have received increasing attention for their strong th...

45 | The chi-Squared Distribution
- Lancaster
- 1969

8 | On similarity indices and correction for chance agreement
- Albatineh, Niewiadomska-Bugaj, et al.
- 2006

Citation Context ...ts expected value (under the generalized hypergeometric distribution assumption for randomness). Besides the Adjusted Rand Index, there are many other, possibly less popular, measures in this class. (Albatineh et al., 2006) discussed correction for chance for a comprehensive list of 28 different indices in this class, a number which is large enough to make the task of choosing an appropriate measure difficult and confu...

4 | A novel approach for automatic number of clusters detection in microarray data based on consensus clustering
- Vinh, Epps
- 2009

Citation Context ...f the obtained cluster structure. To quantify this diversity we have recently developed a novel index (Vinh & Epps, 2009), namely the Consensus Index (CI), which is built upon a suitable clustering similarity measure. Given a value of K, suppose we have generated a set of B clustering solutions UK = {U1, U2, . . . , UB...
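
The Consensus Index definition is truncated in the context above; one plausible reading is the average pairwise agreement over the B clustering solutions, with the underlying similarity measure left pluggable. The sketch below assumes that reading and is labeled as such (function name is mine):

```python
from itertools import combinations

def consensus_index(clusterings, similarity):
    """Average pairwise similarity over a set of clustering solutions
    U_K = {U1, ..., UB}. NOTE: averaging all B*(B-1)/2 pairs is an
    illustrative assumption, not the paper's exact definition;
    `similarity` can be any clustering agreement measure (e.g. an
    adjusted information theoretic one)."""
    pairs = list(combinations(clusterings, 2))
    return sum(similarity(u, v) for u, v in pairs) / len(pairs)
```

Under this reading, a K for which the B solutions are all identical yields the maximal index value, which is why a chance-corrected similarity measure matters here: an uncorrected one would inflate the index for large K.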

4 | On similarity coefficients for 2x2 tables and correction for chance
- Warrens
- 2008

Citation Context ...ensive list of 28 different indices in this class, a number which is large enough to make the task of choosing an appropriate measure difficult and confusing. Their work, and subsequent extension of (Warrens, 2008), however, showed that after correction for chance, many of these measures become equivalent, facilitating the task of choosing a measure. 2.2. Information Theoretic based Indices Another class of cl...

1 | Comparing clusterings: an axiomatic view. ICML ’05
- Meilǎ
- 2005

Citation Context ...s have been proposed: besides the class of pair-counting based measures including the well-known Adjusted Rand Index (Hubert & Arabie, 1985), and set-matching based measures, such as the H criterion (Meilǎ, 2005), information theoretic based measures, such as the Mutual Information (Strehl & Ghosh, 2002) and the Variation of Information (Meilǎ, 2005), form another fundamental class of clustering comparison m...