Results 1 – 6 of 6
Consensus Clustering Algorithms: Comparison and Refinement
"... Consensus clustering is the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. Cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NPcomplete. A nu ..."
Abstract

Cited by 19 (0 self)
 Add to MetaCart
(Show Context)
Consensus clustering is the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. Cast as an optimization problem, consensus clustering is known as median partition and has been shown to be NP-complete. A number of heuristics have been proposed as approximate solutions, some with performance guarantees. In practice, the problem is apparently easy to approximate, but guidance is necessary as to which heuristic to use depending on the number of elements and clusterings given. We have implemented a number of heuristics for the consensus clustering problem, and here we compare their performance, independent of data size, in terms of efficacy and efficiency on both simulated and real data sets. We find that, based on the underlying algorithms and their behavior in practice, the heuristics fall into two distinct groups, with ramifications as to which one to use in a given situation, and that a hybrid solution is the best bet in general. We have also developed a refined consensus clustering heuristic for occasions when the given clusterings are too disparate for their consensus to be representative of any one of them, and we show that in practice the refined consensus clusterings can be much superior to the general consensus clustering.
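To make the setting concrete, here is a minimal sketch (not one of the paper's implementations) of the simplest heuristic usually compared in this literature: "best of k", which returns the input clustering with the smallest total Mirkin distance to all the others. Encoding each partition as a dict mapping elements to cluster labels is an assumption made for illustration.

```python
from itertools import combinations

def mirkin_distance(p, q):
    """Number of element pairs that p and q cluster differently.

    p and q map each element to a cluster label; only co-membership
    matters, so the labels themselves are irrelevant.
    """
    return sum(1 for i, j in combinations(p.keys(), 2)
               if (p[i] == p[j]) != (q[i] == q[j]))

def best_of_k(partitions):
    """'Best of k' consensus heuristic: pick the input partition with
    the smallest total Mirkin distance to all input partitions."""
    return min(partitions,
               key=lambda p: sum(mirkin_distance(p, q) for q in partitions))
```

Because the chosen partition is itself one of the inputs, this baseline is a 2-approximation for median partition by the triangle inequality, which makes it a natural yardstick for the more elaborate heuristics the paper evaluates.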
Average Parameterization and Partial Kernelization for Computing Medians
 In Proc. 9th LATIN
, 2010
"... We propose an effective polynomialtime preprocessing strategy for intractable median problems. Developing a new methodological framework, we show that if the input instances of generally intractable problems exhibit a sufficiently high degree of similarity between each other on average, then there ..."
Abstract

Cited by 8 (6 self)
 Add to MetaCart
We propose an effective polynomial-time preprocessing strategy for intractable median problems. Developing a new methodological framework, we show that if the input instances of generally intractable problems exhibit a sufficiently high degree of similarity to each other on average, then there are efficient exact solving algorithms. In other words, we show that the median problems Swap Median Permutation, Consensus Clustering, Kemeny Score, and Kemeny Tie Score are all fixed-parameter tractable with respect to the parameter “average distance between input objects”. To this end, we develop the new concept of “partial kernelization” and identify interesting polynomial-time solvable special cases for the considered problems.
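For intuition about the parameter “average distance between input objects”: in the Kemeny Score problem the inputs are rankings and the standard distance between them is the Kendall tau distance (the number of candidate pairs ordered oppositely). A hypothetical helper, not the paper's code, for computing that parameter:

```python
from itertools import combinations

def kendall_tau(r1, r2):
    """Number of candidate pairs that rankings r1 and r2 order oppositely."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    return sum(1 for a, b in combinations(r1, 2)
               if (pos1[a] < pos1[b]) != (pos2[a] < pos2[b]))

def average_distance(rankings):
    """The FPT parameter: average pairwise distance between input objects."""
    pairs = list(combinations(rankings, 2))
    return sum(kendall_tau(a, b) for a, b in pairs) / len(pairs)
```

When this average is small, the inputs largely agree, which is exactly the regime where the paper's preprocessing can shrink the instance before exact solving.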
Bounding and comparing methods for correlation clustering beyond ILP
 In NAACL-HLT Workshop on Integer Linear Programming for Natural Language Processing (ILP-NLP 2009)
, 2009
"... We evaluate several heuristic solvers for correlation clustering, the NPhard problem of partitioning a dataset given pairwise affinities between all points. We experiment on two practical tasks, document clustering and chat disentanglement, to which ILP does not scale. On these datasets, we show th ..."
Abstract

Cited by 4 (1 self)
 Add to MetaCart
(Show Context)
We evaluate several heuristic solvers for correlation clustering, the NP-hard problem of partitioning a dataset given pairwise affinities between all points. We experiment on two practical tasks, document clustering and chat disentanglement, to which ILP does not scale. On these datasets, we show that the clustering objective often, but not always, correlates with external metrics, and that local search always improves over greedy solutions. We use semidefinite programming (SDP) to provide a tighter bound, showing that simple algorithms are already close to optimality.
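The abstract does not specify which greedy solvers were evaluated; a classic example of the kind of greedy baseline that local search can then improve is the randomized Pivot heuristic of Ailon, Charikar, and Newman. The sketch below assumes a hypothetical boolean oracle `together(x, y)` that thresholds the pairwise affinities.

```python
import random

def pivot_cluster(items, together, seed=0):
    """Randomized Pivot heuristic for correlation clustering: repeatedly
    pick a random pivot and put every remaining item with a positive
    affinity to the pivot into the pivot's cluster."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    items = list(items)
    clusters = []
    while items:
        pivot = items.pop(rng.randrange(len(items)))
        clusters.append([pivot] + [x for x in items if together(pivot, x)])
        items = [x for x in items if not together(pivot, x)]
    return clusters
```

Pivot runs in time quadratic in the number of items, which is what makes this family of heuristics usable on datasets where the cubically-sized ILP does not scale.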
On the Parameterized Complexity of Consensus Clustering
, 2011
"... Given a collection C of partitions of a base set S, the NPhard Consensus Clustering problem asks for a partition of S which has a total Mirkin distance of at most t to the partitions in C, where t is a nonnegative integer. We present a parameterized algorithm for Consensus Clustering with running ..."
Abstract

Cited by 1 (1 self)
 Add to MetaCart
Given a collection C of partitions of a base set S, the NP-hard Consensus Clustering problem asks for a partition of S which has a total Mirkin distance of at most t to the partitions in C, where t is a nonnegative integer. We present a parameterized algorithm for Consensus Clustering with running time O(4.24^k · k^3 + |C| · |S|^2), where k := t/|C| is the average Mirkin distance of the solution partition to the partitions of C. Furthermore, we strengthen previous hardness results for Consensus Clustering, showing that it remains NP-hard even when all input partitions contain at most two subsets. Finally, we study a local search variant of Consensus Clustering, showing W[1]-hardness for the parameter “radius of the Mirkin-distance neighborhood”. In the process, we also consider a local search variant of the related Cluster Editing problem, showing W[1]-hardness for the parameter “radius of the edge-modification neighborhood”.
Generalizing Local Coherence Modeling
, 2011
"... A wellwritten text follows an overall structure, with each sentence following naturally from the ones before and leading into the ones which come afterwards. We call this structure “coherence”; without it, a document becomes a confusing series of non sequiturs. Understanding the principles that mak ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
A well-written text follows an overall structure, with each sentence following naturally from the ones before and leading into the ones that come afterwards. We call this structure “coherence”; without it, a document becomes a confusing series of non sequiturs. Understanding the principles that make a text coherent is an important goal of natural language processing. These principles can be applied to the design of systems that create new documents, like summaries, or make changes to existing documents. Coherence is a universal principle of language, but typical approaches to evaluation focus on the application of multi-document summarization. We test the generality of our models by applying them to a new task, chat disentanglement, in which we distinguish independent conversational threads in a crowded chat room. To study this task, we create our own corpus and evaluation metrics, propose a baseline model with basic coherence features, and then test the performance of our own and others' more sophisticated models of local coherence. We present evidence that despite the significant differences between this task setting and conventional summarization-inspired evaluations, many of these models generalize fairly well, improving over the baseline. Problems with lexicalized models are mostly the fault of insufficient in-domain training data, rather than representing weaknesses in the models themselves. Thus we conclude that many of the same basic principles ...
Improved Consensus Clustering via Linear Programming
"... We consider the problem of Consensus Clustering. Given a finite set of input clusterings over some data items, a consensus clustering is a partitioning of the items which matches as closely as possible the given input clusterings. The best exact approach to tackling this problem is by modelling it ..."
Abstract
 Add to MetaCart
(Show Context)
We consider the problem of Consensus Clustering. Given a finite set of input clusterings over some data items, a consensus clustering is a partitioning of the items which matches as closely as possible the given input clusterings. The best exact approach to tackling this problem is by modelling it as a Boolean Integer Program (BIP). Unfortunately, the size of the BIP grows cubically in the number of data items, hence this method is applicable to only small sets of items. In this paper we show how to tackle the problem progressively, leading to much improved solution times and far less memory usage than previously. For the case where approximate clusterings are acceptable, we show a number of heuristic techniques for extracting good clusterings from the solutions of the linear relaxation of the BIP, and on several very large data sets we demonstrate much higher quality approximations than previously possible. When optimal solutions are desired, the problem is much harder, and we present some novel and existing techniques that can assist in finding candidate answers and proving the optimality thereof. For the first time we present optimal Consensus Clusterings for several complete, albeit small, data sets.
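The cubic growth mentioned above comes from the transitivity (triangle) constraints of the BIP: one pair variable x[i,j] per item pair, and for every triple of items the requirement that joining two pairs forces joining the third. A tiny brute-force sketch, an illustration rather than the paper's progressive method, makes this concrete: enumerate all 0/1 assignments of the pair variables, discard those violating transitivity, and keep the cheapest.

```python
from itertools import combinations, product

def exact_consensus(n, disagree):
    """Brute-force the consensus clustering BIP for tiny instances.

    disagree[(i, j)] = (# input clusterings separating i and j)
                     - (# input clusterings joining them), so setting
    x[(i, j)] = 1 (i.e. joining i and j) costs disagree[(i, j)]
    relative to separating them, up to an additive constant.
    """
    pairs = list(combinations(range(n), 2))
    best_cost, best_x = None, None
    for bits in product((0, 1), repeat=len(pairs)):
        x = dict(zip(pairs, bits))
        # Transitivity: in any triple, exactly one separated pair
        # (i.e. exactly two joined pairs) is an invalid clustering.
        if any(x[(i, j)] + x[(j, k)] + x[(i, k)] == 2
               for i, j, k in combinations(range(n), 3)):
            continue
        cost = sum(disagree[p] * x[p] for p in pairs)
        if best_cost is None or cost < best_cost:
            best_cost, best_x = cost, x
    return best_x
```

With O(n^2) variables and O(n^3) triangle constraints, the exact model explodes quickly; relaxing the integrality of x to 0 ≤ x ≤ 1 gives the linear relaxation whose fractional solutions the paper's heuristics round into clusterings.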