Results 1 - 10
of
20
Comparing clusterings: an axiomatic view
- In ICML ’05: Proceedings of the 22nd international conference on Machine learning
, 2005
"... This paper views clusterings as elements of a lattice. Distances between clusterings are analyzed in their relationship to the lattice. From this vantage point, we first give an axiomatic characterization of some criteria for comparing clusterings, including the variation of information and the unad ..."
Abstract
-
Cited by 51 (3 self)
- Add to MetaCart
This paper views clusterings as elements of a lattice. Distances between clusterings are analyzed in their relationship to the lattice. From this vantage point, we first give an axiomatic characterization of some criteria for comparing clusterings, including the variation of information and the unadjusted Rand index. Then we study other distances between partitions w.r.t these axioms and prove an impossibility result: there is no “sensible” criterion for comparing clusterings that is simultaneously (1) aligned with the lattice of partitions, (2) convexely additive, and (3) bounded. 1.
Comparing Clusterings
, 2002
"... This paper proposes an information theoretic criterion for comparing two clusterings of the same data set. The criterion, called variation of information measures the amount of information that is lost or gained in changing from clustering C to dustering C '. The criterion makes no assumptions about ..."
Abstract
-
Cited by 43 (2 self)
- Add to MetaCart
This paper proposes an information theoretic criterion for comparing two clusterings of the same data set. The criterion, called variation of information measures the amount of information that is lost or gained in changing from clustering C to dustering C '. The criterion makes no assumptions about how the dusterings were generated and applies to both soft and hard dusterings.The basic properties of VI are presented and discussed from the point of view of comparing c!usterings. In particular, the VI is positive, symmetric and transitive and thus, surprisingly enough, is a true metric on the space of c1usterings.
A distributed approach to node clustering in decentralized peer-to-peer networks
- IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 2005
"... Connectivity-based node clustering has wide-ranging applications in decentralized peer-to-peer (P2P) networks such as P2P file sharing systems, mobile ad-hoc networks, P2P sensor networks, and so forth. This paper describes a Connectivity-based Distributed Node Clustering scheme (CDC). This scheme ..."
Abstract
-
Cited by 15 (1 self)
- Add to MetaCart
Connectivity-based node clustering has wide-ranging applications in decentralized peer-to-peer (P2P) networks such as P2P file sharing systems, mobile ad-hoc networks, P2P sensor networks, and so forth. This paper describes a Connectivity-based Distributed Node Clustering scheme (CDC). This scheme presents a scalable and efficient solution for discovering connectivity-based clusters in peer networks. In contrast to centralized graph clustering algorithms, the CDC scheme is completely decentralized and it only assumes the knowledge of neighbor nodes instead of requiring a global knowledge of the network (graph) to be available. An important feature of the CDC scheme is its ability to cluster the entire network automatically or to discover clusters around a given set of nodes. To cope with the typical dynamics of P2P networks, we provide mechanisms to allow new nodes to be incorporated into appropriate existing clusters and to gracefully handle the departure of nodes in the clusters. These mechanisms enable the CDC scheme to be extensible and adaptable in the sense that the clustering structure of the network adjusts automatically as nodes join or leave the system. We provide detailed experimental evaluations of the CDC scheme, addressing its effectiveness in discovering good quality clusters and handling the node dynamics. We further study the types of topologies that can benefit best from the connectivitybased distributed clustering algorithms like CDC. Our experiments show that utilizing message-based connectivity structure can considerably reduce the messaging cost and provide better utilization of resources, which in turn improves the quality of service of the applications executing over decentralized peer-to-peer networks.
Consensus Clusterings
"... In this paper we address the problem of combining multiple clusterings without access to the underlying features of the data. This process is known in the literature as clustering ensembles, clustering aggregation, or consensus clustering. Consensus clustering yields a stable and robust final cluste ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
In this paper we address the problem of combining multiple clusterings without access to the underlying features of the data. This process is known in the literature as clustering ensembles, clustering aggregation, or consensus clustering. Consensus clustering yields a stable and robust final clustering that is in agreement with multiple clusterings. We find that an iterative EM-like method is remarkably effective for this problem. We present three iterative algorithms for finding clustering consensus. An extensive empirical study compares our proposed algorithms with eleven other consensus clustering methods on four data sets using six different clustering performance metrics. The experimental results show that the new ensemble clustering methods produce clusterings that are as good as, and often better than, these other methods. 1.
The NVI Clustering Evaluation Measure
"... Clustering is crucial for many NLP tasks and applications. However, evaluating the results of a clustering algorithm is hard. In this paper we focus on the evaluation setting in which a gold standard solution is available. We discuss two existing information theory based measures, V and VI, and show ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
Clustering is crucial for many NLP tasks and applications. However, evaluating the results of a clustering algorithm is hard. In this paper we focus on the evaluation setting in which a gold standard solution is available. We discuss two existing information theory based measures, V and VI, and show that they are both hard to use when comparing the performance of different algorithms and different datasets. The V measure favors solutions having a large number of clusters, while the range of scores given by VI depends on the size of the dataset. We present a new measure, NVI, which normalizes VI to address the latter problem. We demonstrate the superiority of NVI in a large experiment involving an important NLP application, grammar induction, using real corpus data in English, German and Chinese. 1
Comparing Clusterings in Space
"... This paper proposes a new method for comparing clusterings both partitionally and geometrically. Our approach is motivated by the following observation: the vast majority of previous techniques for comparing clusterings are entirely partitional, i.e., they examine assignments of points in set theore ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
This paper proposes a new method for comparing clusterings both partitionally and geometrically. Our approach is motivated by the following observation: the vast majority of previous techniques for comparing clusterings are entirely partitional, i.e., they examine assignments of points in set theoretic terms after they have been partitioned. In doing so, these methods ignore the spatial layout of the data, disregarding the fact that this information is responsible for generating the clusterings to begin with. We demonstrate that this leads to a variety of failure modes. Previous comparison techniques often fail to differentiate between significant changes made in data being clustered. We formulate a new measure for comparing clusterings that combines spatial and partitional information into a single measure using optimization theory. Doing so eliminates pathological conditions in previous approaches. It also simultaneously removes common limitations, such as that each clustering must have the same number of clusters or they are over identical datasets. This approach is stable, easily implemented, and has strong intuitive appeal. spatial properties as well as their cluster membership assignments. We view a clustering as a partition of a set of points located in a space with an associated distance function. This view is natural, since popular clustering algorithms, e.g., k-means, spectral clustering, affinity propagation, etc., take as input not only a collection of points to be clustered but also a distance function on the space in which the points lie. This distance function may be specified implicitly and it may be transformed by a kernel, but it must be defined one way or another and its properties are crucial to a clustering algorithm’s output. In contrast, almost all existing clustering comparison techniques ignore the distances between points, treating clusterings as partitions of disembodied atoms. While this approach has merit under some circumstances, it seems surprising to ignore the distance func-
EXTRACTING CLUSTERS FROM LARGE DATASETS WITH MULTIPLE SIMILARITY MEASURES USING IMSCAND
"... Abstract. We consider the problem of how to group information when multiple similarities are known. For a group of people, we may know their education, geographic location and family connections and want to cluster the people by treating all three of these similarities simultaneously. Our approach i ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Abstract. We consider the problem of how to group information when multiple similarities are known. For a group of people, we may know their education, geographic location and family connections and want to cluster the people by treating all three of these similarities simultaneously. Our approach is to store each similarity as a slice in a tensor. The similarity measures are generated by comparing features. Generally, the object similarity matrix is dense. However it can be stored implicitly as the product of a sparse matrix, representing the object-feature matrix, and its transpose. For this new type of tensor where dense slices are stored implicitly, we have created a new decomposition called Implicit Slice Canonical Decomposition (IMSCAND). Our decomposition is equivalent to the tensor CANDECOMP/PARAFAC decomposition, which is a higher-order analogue of the matrix Singular Value decomposition (SVD) and Principal Component Analysis (PCA). From IMSCAND we obtain compilation feature vectors which are clustered using k-means. We demonstrate the applicability of IMSCAND on a set of journal articles with multiple similarities.
Type Level Clustering Evaluation: New Measures and a POS Induction Case Study
"... Clustering is a central technique in NLP. Consequently, clustering evaluation is of great importance. Many clustering algorithms are evaluated by their success in tagging corpus tokens. In this paper we discuss type level evaluation, which reflects class membership only and is independent of the tok ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Clustering is a central technique in NLP. Consequently, clustering evaluation is of great importance. Many clustering algorithms are evaluated by their success in tagging corpus tokens. In this paper we discuss type level evaluation, which reflects class membership only and is independent of the token statistics of a particular reference corpus. Type level evaluation casts light on the merits of algorithms, and for some applications is a more natural measure of the algorithm’s quality. We propose new type level evaluation measures that, contrary to existing measures, are applicable when items are polysemous, the common case in NLP. We demonstrate the benefits of our measures using a detailed case study, POS induction. We experiment with seven leading algorithms, obtaining useful insights and showing that token and type level measures can weakly or even negatively correlate, which underscores the fact that these two approaches reveal different aspects of clustering quality. 1
A Stochastic Uncoupling Process for Graphs
- NATIONAL RESEARCH INSTITUTE FOR MATHEMATICS AND COMPUTER SCIENCE IN THE
, 2000
"... A discrete stochastic uncoupling process for finite spaces is introduced, called the Markov Cluster Process. The process takes a stochastic matrix as input, and then alternates flow expansion and flow inflation, each step defining a stochastic matrix in terms of the previous one. Flow expansion corr ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
A discrete stochastic uncoupling process for finite spaces is introduced, called the Markov Cluster Process. The process takes a stochastic matrix as input, and then alternates flow expansion and flow inflation, each step defining a stochastic matrix in terms of the previous one. Flow expansion corresponds with taking the k th power of a stochastic matrix, where k 2 IN . Flow inflation corresponds with a parametrized operator \Gamma r , r 0, which maps the set of (column) stochastic matrices onto itself. The image \Gamma r M is obtained by raising each entry in M to the r th power and rescaling each column to have sum 1 again. In practice the process converges very fast towards a limit which is idempotent under both matrix multiplication and inflation, with quadratic convergence around the limit points. The limit is in general extremely sparse and the number of components of its associated graph may be larger than the number associated with the input matrix. This uncoupli...

