## COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity

### Download Links

- [www.cs.mu.oz.au]
- [www.csse.unimelb.edu.au]
- DBLP

### Other Repositories/Bibliography

Venue: ICDM, 2006

Citations: 31 (2 self)

### BibTeX

@INPROCEEDINGS{Bae_coala:a,

author = {Eric Bae and James Bailey},

title = {COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity},

booktitle = {ICDM},

year = {2006},

pages = {53--62}

}

### Abstract

Cluster analysis has long been a fundamental task in data mining and machine learning. However, traditional clustering methods concentrate on producing a single solution, even though multiple alternative clusterings may exist. It is thus difficult for the user to validate whether the given solution is in fact appropriate, particularly for large and complex datasets. In this paper we explore the critical requirements for systematically finding a new clustering, given that an already known clustering is available, and we also propose a novel algorithm, COALA, to discover this new clustering. Our approach is driven by two important factors: dissimilarity and quality. These are especially important for finding a new clustering which is highly informative about the underlying structure of the data, but is at the same time distinctively different from the provided clustering. We undertake an experimental analysis and show that our method is able to outperform existing techniques, for both synthetic and real datasets.

### Citations

534 | Distance metric learning with application to clustering with side-information
- Xing, Ng, et al.
- 2003
Citation Context: ...tion remains as a non-trivial task. One might consider more sophisticated methods which combine multiple functions [16], or applying techniques to learn about distance functions through various means [25]. Measuring Dissimilarity and Quality: In the experimental section, we have used the Jaccard and Dunn indices to evaluate the dissimilarity and quality of clusterings, but these measures also have th...

445 | The information bottleneck method
- Tishby, Pereira, et al.
- 1999
Citation Context: ...echnique uses the pre-defined class labels as additional information with which an alternate clustering is found. The underlying principle of this technique is based on the information bottleneck (IB) in [21]. The general idea of IB is that given two variables (i.e. X representing objects, Y representing the features), the shared information between these two variables is maximized while one variable is ...

350 | ROCK: A Robust Clustering Algorithm For Categorical Attributes
- Guha, Rastogi, et al.
- 1999
Citation Context: ...ibutes, whose values cannot be naturally ordered in a metric space. Therefore, cluster analysis of categorical values has been studied extensively and there are numerous methods to handle the problem [4, 14]. The COALA algorithm faces a similar problem with categorical attributes and in this section we show how to modify our algorithm to handle this type of data. The extended algorithm, called COALACat (...

344 | Constrained k-means clustering with background knowledge
- Wagstaff, Cardie, et al.
- 2001
Citation Context: ...ilar from one another as possible. Our algorithm addresses this requirement via the use of instance-based ‘cannot-link’ constraints. This type of constraint has been proposed in constraint clustering [24]. In essence, given an existing clustering, our algorithm derives ‘cannot-link’ constraints and uses them to guide the generation of a new, dissimilar clustering. While the dissimilarity requirement a...
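The constraint-derivation step described in this snippet can be sketched as follows. This is a minimal illustration, assuming a clustering is given as a flat label list; `derive_cannot_links` is a hypothetical helper name, not the paper's code:

```python
from itertools import combinations

def derive_cannot_links(labels):
    """Derive 'cannot-link' constraints from an existing clustering.

    Hypothetical sketch: every pair of objects grouped together by the
    known clustering becomes a cannot-link pair, discouraging the new
    clustering from regrouping them.
    """
    constraints = set()
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            constraints.add((i, j))
    return constraints

# Objects 0,1 share a cluster, as do objects 2,3.
print(derive_cannot_links([0, 0, 1, 1]))  # {(0, 1), (2, 3)}
```

In COALA these constraints then bias the agglomerative process toward merges that keep previously co-clustered objects apart.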

328 | A fuzzy relative of the isodata process and its use in detecting compact, well-separated clusters
- Dunn
- 1973
Citation Context: ...uality threshold’, denoted by ω, which defines a numerical minimum bound on the quality required. For our purposes, the quality of a clustering can be quantitatively measured by use of the Dunn Index [7]. It is important to note that the two requirements can exhibit an inverse relationship. Suppose C is the pre-defined clustering, then if the quality of the new clustering S is increased, the dissimil...

163 | Subspace clustering for high dimensional data: A review
- Parsons, Haque, et al.
- 2004
Citation Context: ...ul information to be omitted and it is difficult to select associated features in a very high dimensional space. Furthermore, our method suggested here is different to the idea of subspace clustering [18]. Although subspace clustering uncovers a number of clusters from varying projections of features, the key difference is that we are discovering a completely new clustering, rather than just individua...

153 | Fast and Effective Text Mining Using Linear-time Document Clustering
- Larsen, Aone
- 1999
Citation Context: ...all score on the clustering generated. As mentioned earlier, the quality and dissimilarity requirements may share an inverse relationship, which leads us to adapt a widely used metric called F-Measure [15]. This has been traditionally applied to information retrieval systems, to coalesce the precision and recall values and calculate the harmonic mean to act as an overall score. This measure has also be...

107 | Some new indexes of cluster validity
- Bezdek, Pal
- 1998
Citation Context: ...factors can lead to more accurate comparisons. The Dunn index has been effective for measuring quality, but it is known to be overly sensitive to outliers and to prefer compact and well-separated clusters [1]. In fact, we have seen some inconsistencies in Fig. 7, where the increase in the quality as the ω value increases is sometimes not continuous. For future work, we would like to investigate other mea...

88 | Implementing agglomerative hierarchic clustering algorithms for use in document retrieval
- Voorhees
- 1986
Citation Context: ...e one of many different types (i.e. Euclidean distance, density, entropy) and methods (i.e. average distance, mutual information). Although many of them are effective, we use the average-linkage (AL) [23] algorithm to calculate the distance, because of its accuracy and robustness. The AL technique determines the similarity between clusters by calculating the average distance of all pairwise objects be...
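The average-linkage distance described here is a standard construction; a minimal sketch (assuming clusters are lists of coordinate tuples, Euclidean distance, Python 3.8+ for `math.dist`):

```python
import math

def average_linkage(cluster_a, cluster_b):
    """Average-linkage (AL) distance: the mean Euclidean distance over
    all cross-cluster object pairs. A sketch of the standard measure,
    not the paper's implementation."""
    total = 0.0
    for a in cluster_a:
        for b in cluster_b:
            total += math.dist(a, b)
    return total / (len(cluster_a) * len(cluster_b))

# Two small 2-D clusters; the result averages the four pairwise distances.
print(average_linkage([(0, 0), (0, 2)], [(3, 0), (3, 2)]))
```

Averaging over all pairs makes AL less sensitive to single outlying objects than single- or complete-linkage.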

60 | Clustering with constraints: Feasibility issues and the k-means algorithm
- Davidson, Ravi
- 2005
Citation Context: ...However, it is actually infeasible to always merge the dissimilar pair, and at some point in the agglomerative process we reach a point where no clusters actually satisfy the cannot-link constraints [5] in line 4. From this point on, we proceed with merges of the qualitative pairs. Merge determination: It is not ideal, however, to always select the dissimilar pairs for merging, as this ignores the...
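The feasibility check implied by this snippet (can two clusters be merged without putting a cannot-link pair into the same cluster?) can be sketched as below. `merge_allowed` is a hypothetical helper, assuming clusters are sets of object ids and constraints are id pairs:

```python
def merge_allowed(cluster_a, cluster_b, cannot_links):
    """Return True if merging the two clusters would not place any
    cannot-link pair into the same cluster. Hypothetical helper for the
    agglomerative step described in the snippet above."""
    return not any((i in cluster_a and j in cluster_b) or
                   (j in cluster_a and i in cluster_b)
                   for i, j in cannot_links)

print(merge_allowed({0, 1}, {2}, {(0, 2)}))  # False: 0 and 2 cannot link
print(merge_allowed({0, 1}, {3}, {(0, 2)}))  # True
```

Once no candidate merge passes this check, the algorithm falls back to quality-driven ("qualitative") merges, per the snippet.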

59 | Fast hierarchical clustering and other applications of dynamic closest pairs
- Eppstein
- 1998
Citation Context: ...log n, we can simplify the overall process of COALA to O(n² log n). To overcome this high complexity, a number of extensions have been proposed, such as using a new data structure called a quad tree [9] or applying a parallel clustering technique [20]. Furthermore, employing other more efficient clustering models (i.e. partitioning algorithms such as k-means) may also enhance the performance and accuracy o...

49 | Extracting relevant structures with side information
- Chechik, Tishby
- 2003
Citation Context: ...knowledge to guide their clustering process. In constraint clustering [6, 24], knowledge is expressed as ‘must-link’ and ‘cannot-link’ constraints to produce more efficient and accurate clusters. In [3, 13], negative information about undesired structures or features is provided to ensure that the clustering process avoids this information and focuses on the clusterings in ‘positive’ data. However, unlike...

41 | Finding consistent clusters in data partitions
- Fred
- 2001
Citation Context: ...on is presumed to be provided by a manual process. Ensemble Clustering: Generating multiple clusterings and merging them to offer a final consensus clustering is the objective of ensemble clustering [10], which we briefly described in section 1.1. Ensemble clustering adopts several clustering generation methods, all of which can be considered naive methods. These clusterings are typically generate...

23 | Analysis of consensus partition in cluster ensemble
- Topchy, Law, et al.
- 2004
Citation Context: ...similarity function implemented by the particular algorithm used [16]. As a result, if one is trying to find multiple clusterings by just naively applying a number of different clustering algorithms [22], the following difficulties present themselves: • An inability to know which algorithms to apply and how many, hence a risk of clustering overload • A risk of collecting highly similar clusterings •...

21 | A new mallows distance based metric for comparing clusterings
- Zhou, Li, et al.
- 2005
Citation Context: ...The Jaccard index is limited in only considering the point-to-cluster assignments, while there are other factors that could differentiate clusterings, such as cluster centroids and density profiles [26]. Therefore, utilizing these various factors can lead to more accurate comparisons. The Dunn index has been effective for measuring quality, but it is known to be overly sensitive to outliers and prefers...

17 | Non-redundant clustering with conditional ensembles
- Gondek, Hofmann
- 2005
Citation Context: ...Feature Selection and Subspace Clustering: Finally, we note that feature-based methods, such as selecting certain features or applying dimension reduction methods, are not practical. As explained in [12], such an attempt may cause useful information to be omitted and it is difficult to select associated features in a very high dimensional space. Furthermore, our method suggested here is different to...

14 | Identifying and generating easy sets of constraints for clustering
- Davidson, Ravi
- 2006
Citation Context: ...lity through the quality threshold ω. Clustering with background knowledge: A number of techniques have also utilized background knowledge to guide their clustering process. In constraint clustering [6, 24], knowledge is expressed as ‘must-link’ and ‘cannot-link’ constraints to produce more efficient and accurate clusters. In [3, 13], negative information about undesired structures or features is provid...

14 | Conditional information bottleneck clustering
- Gondek, Hofmann
- 2003
Citation Context: ...ss cannot be parameterized in a meaningful way to control the outcome. Furthermore, we have found that it is not just the naive approach which has drawbacks. Even a current state-of-the-art technique [11] does not always produce convincing results for this problem. In this paper, we propose a systematic technique called COALA to retrieve a new clustering which is distinctively different with respe...

12 | Multiobjective data clustering
- Law, Topchy, et al.
- 2004
Citation Context: ...ere exists no easy definition of what exactly a cluster is. This naturally leads to clustering solutions being highly dependent on the similarity function implemented by the particular algorithm used [16]. As a result, if one is trying to find multiple clusterings by just naively applying a number of different clustering algorithms [22], the following difficulties present themselves: • An inability t...

9 | A Geometric Approach to Cluster Validity for Normal Mixtures
- Bezdek, Li, et al.
- 1997
Citation Context: ...es higher dissimilarity. Quality: To quantitatively measure the quality of a clustering, we employed a generalized Dunn index [7], which has proved to be an effective measure in Bezdek’s experiments [2] compared to others. It is defined as follows. Dunn index: Let C = {c1, ..., ck} be a clustering, δ : C × C → R⁺₀ be a cluster-to-cluster distance, and ∆ : C → R⁺₀ be a cluster diameter measure; then the Dunn i...
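One common instantiation of the Dunn index from this definition (minimum cluster separation over maximum cluster diameter) can be sketched as follows. The choices of single-linkage δ and max-pairwise diameter ∆ are assumptions here; the generalized index admits other choices:

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Dunn index sketch: minimum cluster-to-cluster distance δ divided
    by the maximum cluster diameter ∆. Uses single-linkage δ and
    max-pairwise ∆, one common choice among the generalized variants."""
    def delta(a, b):  # closest cross-cluster pair
        return min(math.dist(x, y) for x in a for y in b)
    def diameter(c):  # farthest within-cluster pair
        return max((math.dist(x, y) for x, y in combinations(c, 2)),
                   default=0.0)
    min_sep = min(delta(a, b) for a, b in combinations(clusters, 2))
    max_diam = max(diameter(c) for c in clusters)
    return min_sep / max_diam

# Two tight, well-separated clusters yield a high Dunn index.
print(dunn_index([[(0, 0), (0, 1)], [(5, 0), (5, 1)]]))  # 5.0
```

Higher values indicate compact, well-separated clusterings, which is also why the index is sensitive to outliers, as noted elsewhere on this page.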

8 | Analysis of clustering algorithms for web-based search
- Eissen, Stein
- 2002
Citation Context: ...been traditionally applied to information retrieval systems, to coalesce the precision and recall values and calculate the harmonic mean to act as an overall score. This measure has also been used in [8], for other clustering contexts. Following the definition of F-Measure [15], we define the overall DQ-Measure as below: DQ(C, S) = 2·J(C, S)·DI(C, S) / (J(C, S) + DI(C, S)) (5), where J corresponds to the Jaccard i...
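The DQ-Measure of equation (5) is a plain harmonic mean of the two scores, which can be sketched directly (the function name is illustrative, not from the paper):

```python
def dq_measure(j, di):
    """Harmonic-mean combination of dissimilarity (Jaccard, j) and
    quality (Dunn, di), mirroring equation (5) in the snippet above."""
    return 2 * j * di / (j + di)

print(dq_measure(0.5, 0.5))  # 0.5
print(dq_measure(0.2, 0.8))  # 0.32: a low score on either axis drags the mean down
```

As with the F-Measure it is modeled on, the harmonic mean penalizes clusterings that score well on only one of the two requirements.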

8 | Clustering with model-level constraints
- Gondek, Vaithyanathan, et al.
- 2005
Citation Context: ...knowledge to guide their clustering process. In constraint clustering [6, 24], knowledge is expressed as ‘must-link’ and ‘cannot-link’ constraints to produce more efficient and accurate clusters. In [3, 13], negative information about undesired structures or features is provided to ensure that the clustering process avoids this information and focuses on the clusterings in ‘positive’ data. However, unlike...

6 | Efficient parallel hierarchical clustering algorithms
- Rajasekaran
- 2005
Citation Context: ...COALA to O(n² log n). To overcome this high complexity, a number of extensions have been proposed, such as using a new data structure called a quad tree [9] or applying a parallel clustering technique [20]. Furthermore, employing other more efficient clustering models (i.e. partitioning algorithms such as k-means) may also enhance the performance and accuracy of COALA. Moreover, selecting an appropriate dista...

5 | Speeding-up hierarchical agglomerative clustering in presence of expensive metrics
- Nanni
Citation Context: ...produced more accurate results over other methods tested. Performance of COALA: Despite the flexible and intuitive clustering process, hierarchical algorithms are characterized by a high complexity [17]. In COALA, generating cannot-link constraints (algorithm 1) takes O(n²). The it... [figure residue removed: axis values for dissimilarity (Jaccard index) and quality (Dunn index) plots]

4 | The “best k” for entropy-based categorical data clustering
- Chen, Liu
- 2005
Citation Context: ...ibutes, whose values cannot be naturally ordered in a metric space. Therefore, cluster analysis of categorical values has been studied extensively and there are numerous methods to handle the problem [4, 14]. The COALA algorithm faces a similar problem with categorical attributes and in this section we show how to modify our algorithm to handle this type of data. The extended algorithm, called COALACat (...

1 | Methods for comparing subspace clusterings, master’s thesis, www.cis.hut.fi/annep/lisuri.pdf
- Patrikainen
- 2005
Citation Context: ...nner, and we provide these measures in this section. Dissimilarity: A number of measures exist for comparing similarity/dissimilarity between two clusterings. We have chosen to use the Jaccard index [19], which is a well known measure based on a ‘pair-counting’ technique that observes object-to-cluster assignments between two clusterings. It is defined by the function below: J(C, S) = N11 / (N11 + N... (3)
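Although the formula in this snippet is truncated, the standard pair-counting Jaccard index for clusterings is N11 / (N11 + N10 + N01), where N11 counts object pairs grouped together in both clusterings and N10/N01 pairs grouped together in only one. A minimal sketch, assuming clusterings are given as flat label lists:

```python
from itertools import combinations

def jaccard_clusterings(labels_c, labels_s):
    """Standard pair-counting Jaccard index between two clusterings:
    N11 / (N11 + N10 + N01). Label values themselves don't matter,
    only which objects share a cluster."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]
        same_s = labels_s[i] == labels_s[j]
        if same_c and same_s:
            n11 += 1
        elif same_c:
            n10 += 1
        elif same_s:
            n01 += 1
    return n11 / (n11 + n10 + n01)

# Identical groupings under renamed labels still score 1.0.
print(jaccard_clusterings([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

In COALA's setting a *low* Jaccard score between the known and the new clustering is desirable, since it indicates high dissimilarity.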