Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions
- Journal of Machine Learning Research
, 2002
Cited by 272 (16 self)
This paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. We first identify several application scenarios for the resultant 'knowledge reuse' framework that we call cluster ensembles. The cluster ensemble problem is then formalized as a combinatorial optimization problem in terms of shared mutual information. In addition to a direct maximization approach, we propose three effective and efficient techniques for obtaining high-quality combiners (consensus functions). The first combiner induces a similarity measure from the partitionings and then reclusters the objects. The second combiner is based on hypergraph partitioning. The third one collapses groups of clusters into meta-clusters which then compete for each object to determine the combined clustering. Due to the low computational costs of our techniques, it is quite feasible to use a supra-consensus function that evaluates all three approaches against the objective function and picks the best solution for a given situation. We evaluate the effectiveness of cluster ensembles in three qualitatively different application scenarios: (i) where the original clusters were formed based on non-identical sets of features, (ii) where the original clustering algorithms worked on non-identical sets of objects, and (iii) where a common data-set is used and the main purpose of combining multiple clusterings is to improve the quality and robustness of the solution. Promising results are obtained in all three situations for synthetic as well as real data-sets.
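As a rough illustration of the first combiner and the mutual-information objective described in this abstract, the following is a minimal Python sketch, not the authors' implementation. All function names are hypothetical. The threshold-based reclustering (connected components of the similarity graph) is a simple stand-in for whatever reclustering algorithm the paper uses, and normalizing mutual information by the geometric mean of the two entropies is an assumption made here for the sketch.

```python
from collections import Counter
from math import log, sqrt


def co_association(labelings):
    """Induced similarity: fraction of partitionings in which objects i and j share a cluster."""
    n = len(labelings[0])
    r = len(labelings)
    return [[sum(lab[i] == lab[j] for lab in labelings) / r for j in range(n)]
            for i in range(n)]


def recluster_by_threshold(sim, threshold=0.5):
    """Stand-in reclustering (an assumption): connected components of the graph sim > threshold."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def nmi(a, b):
    """Mutual information of two labelings, normalized by sqrt(H(a) * H(b)) (assumed normalization)."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(c / n * log(c * n / (ca[x] * cb[y])) for (x, y), c in cab.items())
    ha = -sum(c / n * log(c / n) for c in ca.values())
    hb = -sum(c / n * log(c / n) for c in cb.values())
    if ha == 0 or hb == 0:
        return 0.0
    return mi / sqrt(ha * hb)


def average_nmi(candidate, labelings):
    """Objective: average shared information between a candidate and the input partitionings."""
    return sum(nmi(candidate, lab) for lab in labelings) / len(labelings)


def supra_consensus(candidates, labelings):
    """Pick the candidate consensus clustering that scores best on the objective."""
    return max(candidates, key=lambda c: average_nmi(c, labelings))
```

Only the cluster labels are used, which matches the setting in the abstract: no features and no knowledge of the original algorithms are required. In practice the candidates passed to `supra_consensus` would come from several different combiners.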
Collective Data Mining: A New Perspective Toward Distributed Data Analysis
- Advances in Distributed and Parallel Knowledge Discovery
, 1999
Cited by 75 (12 self)
This paper introduces the collective data mining (CDM) framework, a new approach toward distributed data mining (DDM) from heterogeneous sites. It points out that naive approaches to distributed data analysis in a heterogeneous environment may result in ambiguous or incorrect global data models. It also notes that any function can be expressed in a distributed fashion using a set of appropriate basis functions and that orthogonal basis functions can be effectively used for developing a general DDM framework that guarantees correct local analysis and correct aggregation of local data models with minimal data communication. This paper develops the foundation of CDM, discusses decision tree learning and polynomial regression in CDM for discrete and continuous variables, and describes BODHI, a CDM-based experimental system for distributed knowledge discovery. 1 Introduction Distributed data mining (DDM) is a fast growing area that deals with the problem of finding data patterns in a...
Combining Multiple Weak Clusterings
, 2003
Cited by 44 (4 self)
A data set can be clustered in many ways depending on the clustering algorithm employed, parameter settings used and other factors. Can multiple clusterings be combined so that the final partitioning of data provides better clustering? The answer depends on the quality of clusterings to be combined as well as the properties of the fusion method. First, we introduce a unified representation for multiple clusterings and formulate the corresponding categorical clustering problem. As a result, we show that the consensus function is related to the classical intra-class variance criterion using the generalized mutual information definition. Second, we show the efficacy of combining partitions generated by weak clustering algorithms that use data projections and random data splits. A simple explanatory model is offered for the behavior of combinations of such weak clustering components. We analyze the combination accuracy as a function of parameters controlling the power and resolution of component partitions as well as the learning dynamics vs. the number of clusterings involved. Finally, some empirical studies compare the effectiveness of several consensus functions.
Distributed Data Mining: Algorithms, Systems, and Applications
, 2002
Cited by 43 (4 self)
This paper presents a brief overview of DDM algorithms, systems, applications, and emerging research directions. The paper is organized as follows. We first present the related research of DDM and illustrate data distribution scenarios. Then DDM algorithms are reviewed. Subsequently, the architectural issues in DDM systems and future directions are discussed.
A mixture model of clustering ensembles
- Proc. SIAM Intl. Conf. on Data Mining
, 2004
Cited by 39 (4 self)
Clustering ensembles have emerged as a powerful method for improving both the robustness and the stability of unsupervised classification solutions. However, finding a consensus clustering from multiple partitions is a difficult problem that can be approached from graph-based, combinatorial or statistical perspectives. We offer a probabilistic model of consensus using a finite mixture of multinomial distributions in a space of clusterings. A combined partition is found as a solution to the corresponding maximum likelihood problem using the EM algorithm. The excellent scalability of this algorithm and comprehensible underlying model are particularly important for clustering of large datasets. This study compares the performance of the EM consensus algorithm with other fusion approaches for clustering ensembles. We also analyze clustering ensembles with incomplete information and the effect of missing cluster labels on the quality of overall consensus. Experimental results demonstrate the effectiveness of the proposed method on large real-world datasets.
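The consensus model in this abstract can be sketched as EM on a finite mixture of multinomials, where each object is described by its vector of cluster labels across the input partitions. The Python below is a minimal sketch under stated assumptions, not the paper's algorithm as published: the function name `em_consensus`, the random soft initialization, the smoothing constant, the fixed iteration count, and the handling of missing labels (a `None` entry is simply skipped in the likelihood) are all choices made here for illustration.

```python
import math
import random


def em_consensus(label_matrix, k, n_iter=50, seed=0, smooth=1e-3):
    """label_matrix[i][j] is the label of object i in partition j, or None if missing.

    Returns (hard consensus labels, per-object responsibilities).
    """
    rng = random.Random(seed)
    n, h = len(label_matrix), len(label_matrix[0])
    # Observed label alphabet of each input partition.
    values = [sorted({row[j] for row in label_matrix if row[j] is not None})
              for j in range(h)]

    # Random soft initial assignment of objects to the k consensus clusters.
    resp = []
    for _ in range(n):
        w = [rng.random() + 1e-6 for _ in range(k)]
        s = sum(w)
        resp.append([x / s for x in w])

    for _ in range(n_iter):
        # M-step: mixing weights and one multinomial per (component, partition).
        alpha = [sum(resp[i][m] for i in range(n)) / n for m in range(k)]
        theta = []
        for m in range(k):
            per_partition = []
            for j in range(h):
                counts = {v: smooth for v in values[j]}  # smoothing avoids log(0)
                for i in range(n):
                    v = label_matrix[i][j]
                    if v is not None:
                        counts[v] += resp[i][m]
                total = sum(counts.values())
                per_partition.append({v: c / total for v, c in counts.items()})
            theta.append(per_partition)

        # E-step: posterior responsibility of each component for each object.
        for i in range(n):
            logp = []
            for m in range(k):
                lp = math.log(alpha[m] + 1e-12)
                for j in range(h):
                    v = label_matrix[i][j]
                    if v is not None:  # missing labels contribute no factor
                        lp += math.log(theta[m][j][v])
                logp.append(lp)
            mx = max(logp)
            w = [math.exp(x - mx) for x in logp]
            s = sum(w)
            resp[i] = [x / s for x in w]

    labels = [max(range(k), key=lambda m: resp[i][m]) for i in range(n)]
    return labels, resp
```

Each iteration costs time proportional to the number of objects times the number of partitions times `k`, which is consistent with the scalability claim in the abstract. Like any EM procedure, the result depends on initialization, so several restarts with different `seed` values would be a reasonable precaution.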
Distributed Clustering Using Collective Principal Component Analysis
- Knowledge and Information Systems
, 1999
Cited by 38 (8 self)
This paper considers distributed clustering of high dimensional heterogeneous data using a distributed Principal Component Analysis (PCA) technique called the Collective PCA. It presents the Collective PCA technique that can be used independent of the clustering application. It shows a way to integrate the Collective PCA with a given off-the-shelf clustering algorithm in order to develop a distributed clustering technique. It also presents experimental results using different test data sets including an application for web mining.
Privacy-preserving Distributed Clustering using Generative Models
, 2003
Cited by 37 (1 self)
We present a framework for clustering distributed data in unsupervised and semi-supervised scenarios, taking into account privacy requirements and communication costs. Rather than sharing parts of the original or perturbed data, we instead transmit the parameters of suitable generative models built at each local data site to a central location. We mathematically show that the best representative of all the data is a certain "mean" model, and empirically show that this model can be approximated quite well by generating artificial samples from the underlying distributions using Markov Chain Monte Carlo techniques, and then fitting a combined global model with a chosen parametric form to these samples. We also propose a new measure that quantifies privacy based on information theoretic concepts, and show that decreasing privacy leads to a higher quality of the combined model and vice versa. We provide empirical results on different data types to highlight the generality of our framework. The results show that high quality distributed clustering can be achieved with little privacy loss and low communication cost.
Clustering ensembles: Models of consensus and weak partitions
- IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2005
Cited by 24 (1 self)
Clustering ensembles have emerged as a powerful method for improving both the robustness and the stability of unsupervised classification solutions. However, finding a consensus clustering from multiple partitions is a difficult problem that can be approached from graph-based, combinatorial or statistical perspectives. This study extends previous research on clustering ensembles in several respects. First, we introduce a unified representation for multiple clusterings and formulate the corresponding categorical clustering problem. Second, we propose a probabilistic model of consensus using a finite mixture of multinomial distributions in a space of clusterings. A combined partition is found as a solution to the corresponding maximum likelihood problem using the EM algorithm. Third, we define a new consensus function that is related to the classical intra-class variance criterion using the generalized mutual information definition. Finally, we demonstrate the efficacy of combining partitions generated by weak clustering algorithms that use data projections and random data splits. A simple explanatory model is offered for the behavior of combinations of such weak clustering components. Combination accuracy is analyzed as a function of several parameters that control the power and resolution of component partitions as well as the number of partitions. We also analyze clustering ensembles with incomplete information and the effect of missing cluster labels on the quality of overall consensus. Experimental results demonstrate the effectiveness of the proposed methods on several real-world datasets.
A Consensus Framework for Integrating Distributed Clusterings Under Limited Knowledge Sharing
- In Proc. NSF Workshop on Next Generation Data Mining
, 2002
Cited by 12 (1 self)
This paper examines the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. This problem is an abstraction of scenarios where different organizations have grouped some or all elements of a common underlying population, possibly using different features, algorithms or clustering criteria. Moreover, due to real-life constraints such as proprietary techniques, legal restrictions, and different data ownerships, it is not feasible to pool all the data into a central location and then apply clustering techniques: the only information that can be shared is the symbolic cluster labels. The cluster ensemble problem is formalized as a combinatorial optimization problem that obtains a consensus function in terms of shared mutual information among individual solutions. Three effective and efficient techniques for obtaining high-quality consensus functions are described and studied empirically for the following qualitatively different application scenarios: (i) where the original clusters were formed based on non-identical sets of features, (ii) where the original clustering algorithms were applied to non-identical sets of objects and (iii) where the individual solutions provide varying numbers of clusters. Promising results are obtained in all three situations for synthetic as well as real data sets, even under severe restrictions on data and knowledge sharing.
Multiclassifier systems: Back to the future
- Multiple Classifier Systems (invited paper), pages 1–15. LNCS
, 2002
Cited by 12 (3 self)
While a variety of multiple classifier systems have been studied since at least the late 1950’s, this area came alive in the 90’s with significant theoretical advances as well as numerous successful practical applications. This article argues that our current understanding of ensemble-type multiclassifier systems is now quite mature and exhorts the reader to consider a broader set of models and situations for further progress. Some of these scenarios have already been considered in classical pattern recognition literature, but revisiting them often leads to new insights and progress. As an example, we consider how to integrate multiple clusterings, a problem central to several emerging distributed data mining applications. We also revisit output space decomposition to show how this can lead to extraction of valuable domain knowledge in addition to improved classification accuracy. 1 A Brief History of Multilearner Systems Multiple classifier systems are special cases of approaches that integrate several

