Results 1 - 10 of 14
Simultaneous Unsupervised Learning of Disparate Clusterings
Cited by 14 (0 self)
Most clustering algorithms produce a single clustering for a given data set even when the data can be clustered naturally in multiple ways. In this paper, we address the difficult problem of uncovering disparate clusterings from the data in a totally unsupervised manner. We propose two new approaches for this problem. In the first approach, we aim to find good clusterings of the data that are also decorrelated with one another. To this end, we give a new and tractable characterization of decorrelation between clusterings, and present an objective function to capture it. We provide an iterative “decorrelated” k-means type algorithm to minimize this objective function. In the second approach, we model the data as a sum of mixtures and associate each mixture with a clustering. This approach leads us to the problem of learning a convolution of mixture distributions. Though the latter problem can be formulated as one of factorial learning [8, 13, 16], the existing formulations and methods do not perform well on many real high-dimensional data sets. We propose a new regularized factorial learning framework that is more suitable for capturing the notion of disparate clusterings in modern, high-dimensional data sets. The resulting algorithm does well in uncovering multiple clusterings and is much improved over existing methods. We evaluate our methods on two real-world data sets: a music data set from the text mining domain and a portrait data set from the computer vision domain. Our methods achieve substantially higher accuracy than existing factorial learning as well as traditional clustering algorithms.
Multiple Non-Redundant Spectral Clustering Views
Cited by 8 (1 self)
Many clustering algorithms only find one clustering solution. However, data can often be grouped and interpreted in many different ways. This is particularly true in the high-dimensional setting where different subspaces reveal different possible groupings of the data. Instead of committing to one clustering solution, here we introduce a novel method that can provide several non-redundant clustering solutions to the user. Our approach simultaneously learns non-redundant subspaces that provide multiple views and finds a clustering solution in each view. We achieve this by augmenting a spectral clustering objective function to incorporate dimensionality reduction and multiple views and to penalize for redundancy between the views.
... in several different ways for different purposes. For example, images of faces of people can be grouped based on their pose or identity. Web pages collected from universities can be clustered based on the type of webpage’s owner, {faculty, student, staff}, field, {physics, math, engineering, computer science}, or identity of the university. In some cases, a data analyst wishes to find a single clustering, but this may require an algorithm to consider multiple clusterings and discard those that are not of interest. In other cases, one may wish to summarize and organize the data according to multiple possible clustering views. In either case, it is important to find multiple clustering solutions which are non-redundant.
Generation of Alternative Clusterings Using the CAMI Approach
Cited by 4 (1 self)
Exploratory data analysis aims to discover and generate multiple views of the structure within a dataset. Conventional clustering techniques, however, are designed to provide only a single grouping or clustering of a dataset. In this paper, we introduce a novel algorithm called CAMI that can uncover alternative clusterings from a dataset. CAMI takes a mathematically appealing approach, using mutual information to distinguish between alternative clusterings, coupled with an expectation maximization framework to ensure clustering quality. We experimentally test CAMI on both synthetic and real-world datasets, comparing it against a variety of state-of-the-art algorithms. We demonstrate that CAMI’s performance is high and that its formulation provides a number of advantages compared to existing techniques.
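The abstract above does not spell out CAMI's objective, so the following is only a minimal sketch of the general idea of using mutual information to tell two clusterings apart. It computes the standard empirical mutual information between two hard label assignments of the same items; the function name `mutual_information` and the toy labels are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from math import log


def mutual_information(labels_a, labels_b):
    """Empirical mutual information (in nats) between two hard clusterings.

    labels_a[i] and labels_b[i] are the cluster labels that the two
    clusterings assign to item i. This is the textbook quantity, not
    CAMI's full objective (an assumption for illustration only).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    count_a = Counter(labels_a)                 # cluster sizes in clustering A
    count_b = Counter(labels_b)                 # cluster sizes in clustering B
    count_ab = Counter(zip(labels_a, labels_b)) # sizes of pairwise overlaps
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) )
        mi += (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Identical clusterings share all their information; "crossed" ones share none.
same = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
crossed = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Under this measure, a low value between two clusterings suggests they capture different structure, which is the sense in which an alternative clustering differs from a reference one.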
Avoiding Bias in Text Clustering Using Constrained K-means and May-Not-Links
Cited by 2 (2 self)
In this paper we present a new clustering algorithm that extends traditional batch k-means, enabling the introduction of domain knowledge in the form of Must, Cannot, May and May-Not rules between the data points. We also apply the presented method to the task of avoiding bias in clustering. Evaluation on standard collections showed considerable improvements in effectiveness over previous constrained and non-constrained algorithms for the given task.
Variational Inference for Nonparametric Multiple Clustering
Cited by 2 (0 self)
Most clustering algorithms produce a single clustering solution. Similarly, feature selection for clustering tries to find one feature subset where one interesting clustering solution resides. However, a single data set may be multi-faceted and can be grouped and interpreted in many different ways, especially for high-dimensional data, where feature selection is typically needed. Moreover, different clustering solutions are interesting for different purposes. Instead of committing to one clustering solution, in this paper we introduce a probabilistic nonparametric Bayesian model that can simultaneously discover several possible clustering solutions and the feature subset views that generated each cluster partitioning. We provide a variational inference approach to learn the features and clustering partitions in each view. Our model allows us not only to learn the multiple clusterings and views but also to automatically learn the number of views and the number of clusters in each view.
Keywords: multiple clustering, non-redundant/disparate clustering, feature selection, nonparametric Bayes, variational inference
An Architecture and Algorithms for Multi-Run Clustering
Cited by 1 (1 self)
This paper addresses two main challenges for clustering which require extensive human effort: selecting appropriate parameters for an arbitrary clustering algorithm and identifying alternative clusters. We propose an architecture and a concrete system, MR-CLEVER, for multi-run clustering that integrates active learning with clustering algorithms. The key hypothesis of this work is that better clustering results can be obtained by combining clusters that originate from multiple runs of clustering algorithms. By defining states that represent parameter settings of a clustering algorithm, the proposed architecture actively learns a state utility function. The utility of a parameter setting is assessed based on clustering run-time, quality and novelty of the obtained clusters. Furthermore, the utility function plays an important role in guiding the clustering algorithm to seek novel solutions. Cluster novelty measures are introduced for this purpose. Finally, we also contribute a cluster summarization algorithm that assembles a final clustering as a combination of high-quality clusters originating from multiple runs. The merits of our proposed system are that it is generic and can therefore be used in conjunction with different clustering algorithms, and that it reduces human effort for selecting the parameters, for comparing clustering results and for assembling clustering results. We evaluate the proposed system in conjunction with a representative-based clustering algorithm, namely CLEVER, on a challenging data mining task involving an earthquake dataset. The obtained results demonstrate that, in comparison to the best single-run clustering, multi-run clustering discovers solutions of higher quality.
Uncovering Many Views of Biological Networks Using Ensembles of Near-Optimal Partitions
- 1ST INTL WORKSHOP ON DISCOVERING, SUMMARIZING, AND USING MULTIPLE CLUSTERINGS, KDD, 2010
Cited by 1 (0 self)
Densely interacting regions of biological networks often correspond to functional modules such as protein complexes. Most algorithms proposed to uncover modules, however, produce one clustering that only reveals a single view of how the cell is organized. We describe two new methods to find ensembles of provably near-optimal modularity partitions that lie within a heuristically constrained search space. We also show how to count the number of solutions in this space that exist within a bounded modularity range. We apply our algorithms to a protein interaction network for S. cerevisiae and show how fine-grained differences between near-optimal partitions can be used to define robust communities. We also propose a technique to find structurally diverse near-optimal solutions and show that these different partitions are enriched for different biological functions. Our results indicate that near-optimal solutions can represent alternative and complementary views of the network’s structure.
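The abstract above refers to the modularity of a partition without defining it. As a hedged illustration, the sketch below evaluates the standard Newman-Girvan modularity of a given partition of an undirected, unweighted graph; it does not reproduce the paper's search or counting methods. The function name `modularity`, the edge-list input format, and the toy graph are assumptions made for this example.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman-Girvan modularity of a partition (illustrative sketch).

    edges: list of (u, v) pairs, each undirected edge listed once,
           with no self-loops (an assumption of this sketch).
    community: dict mapping each node to its community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = defaultdict(int)    # edges with both endpoints in community c
    degree_sum = defaultdict(int)  # total degree of the nodes in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over communities of (internal edge fraction - expected fraction)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph: two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
all_in_one = {node: "a" for node in range(6)}
q_split = modularity(toy_edges, by_triangle)
q_single = modularity(toy_edges, all_in_one)
```

Comparing the scores of several candidate partitions in this way is one simple means of checking how close each is to the best score found.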
Unifying Dependent Clustering and Disparate Clustering for Non-homogeneous Data
Modern data mining settings involve a combination of attribute-valued descriptors over entities as well as specified relationships between these entities. We present an approach to cluster such non-homogeneous datasets by using the relationships to impose either dependent clustering or disparate clustering constraints. Unlike prior work that views constraints as boolean criteria, we present a formulation that allows constraints to be satisfied or violated in a smooth manner. This enables us to achieve dependent clustering and disparate clustering using the same optimization framework by merely maximizing versus minimizing the objective function. We present results on synthetic data as well as several real-world datasets.
Improving Alternative Text Clustering Quality in the Avoiding Bias Task with Spectral and Flat Partition Algorithms
The problems of finding alternative clusterings and avoiding bias have gained popularity in recent years. In this paper we focus on the quality of these alternative clusterings, proposing two approaches based on the use of negative constraints in conjunction with spectral clustering techniques. The first approach tries to introduce these constraints into the core of constrained normalised cut clustering, while the second one combines spectral clustering and soft constrained k-means. The experiments performed on textual collections showed that the first method does not yield good results, whereas the second one attains large improvements in the quality of the clustering results while keeping low similarity with the avoided grouping.
Alternative Clusterings: Current Progress and Open Challenges
"... Cluster analysis: group “similar” objects into clusters ..."

