## A divide-and-merge methodology for clustering (2005)

### Download Links

- [www.cs.yale.edu]
- [cs-www.cs.yale.edu]
- [research.microsoft.com]
- [faculty.ksu.edu.sa]
- DBLP

### Other Repositories/Bibliography

Venue: ACM Transactions on Database Systems

Citations: 57 (7 self)

### BibTeX

@INPROCEEDINGS{Cheng05adivide-and-merge,
  author    = {David Cheng and Ravi Kannan and Santosh Vempala and Grant Wang},
  title     = {A divide-and-merge methodology for clustering},
  booktitle = {Proceedings of the 24th ACM Symposium on Principles of Database Systems (PODS)},
  year      = {2005},
  pages     = {196--205},
  publisher = {ACM Press}
}

### Abstract

We present a divide-and-merge methodology for clustering a set of objects that combines a top-down “divide” phase with a bottom-up “merge” phase. In contrast, previous algorithms use either top-down or bottom-up methods to construct a hierarchical clustering, or produce a flat clustering using local search (e.g., k-means). Our divide phase produces a tree whose leaves are the elements of the set; for this phase, we suggest an efficient spectral algorithm. The merge phase quickly finds the optimal partition that respects the tree for many natural objective functions, e.g., k-means, min-diameter, min-sum, and correlation clustering. We present a metasearch engine that clusters results from web searches. We also give empirical results on text-based data where the algorithm performs better than or competitively with existing clustering algorithms.
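To make the two phases concrete, here is a minimal, hypothetical sketch (not the authors' implementation): the `divide` step below simply halves the input, standing in for the paper's spectral cut, and the `merge` step finds the best tree-respecting k-clustering under the min-diameter objective. All names here are illustrative.

```python
import itertools
import math

def divide(items):
    # Divide phase: recursively split the set into a binary tree of leaves.
    # (A placeholder halving split; the paper suggests a spectral cut.)
    if len(items) == 1:
        return items[0]
    mid = len(items) // 2
    return (divide(items[:mid]), divide(items[mid:]))

def leaves(node):
    # Collect the leaf ids under a tree node.
    return [node] if not isinstance(node, tuple) else leaves(node[0]) + leaves(node[1])

def diameter(cluster, points):
    # Largest pairwise distance inside one cluster (0 for singletons).
    return max((math.dist(points[a], points[b])
                for a, b in itertools.combinations(cluster, 2)), default=0.0)

def merge(node, k, points):
    # Merge phase: best tree-respecting k-clustering minimizing the largest
    # cluster diameter; returns (cost, clusters) or (inf, None) if infeasible.
    if k == 1:
        cluster = leaves(node)
        return diameter(cluster, points), [cluster]
    if not isinstance(node, tuple):
        return math.inf, None          # a single leaf cannot be split further
    best_cost, best = math.inf, None
    for j in range(1, k):              # put j clusters left, k - j right
        lcost, lcl = merge(node[0], j, points)
        rcost, rcl = merge(node[1], k - j, points)
        if lcl and rcl and max(lcost, rcost) < best_cost:
            best_cost, best = max(lcost, rcost), lcl + rcl
    return best_cost, best

points = {0: (0, 0), 1: (0, 1), 2: (10, 0), 3: (10, 1)}
tree = divide(list(points))              # ((0, 1), (2, 3))
cost, clusters = merge(tree, 2, points)  # clusters: [[0, 1], [2, 3]]
```

The same merge skeleton works for any objective that decomposes over the two children, which is what lets the paper swap in k-means, min-sum, or correlation clustering.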

### Citations

2175 | Algorithms for Clustering Data
- Jain, Dubes
- 1988

Citation Context: ...ps so that each group consists of similar objects. The classification could either be flat (a partition of the data set usually found by a local search algorithm such as k-means [17]) or hierarchical [19]. Clustering has been proposed as a method to aid information retrieval in many contexts (e.g. [12, 31, 28, 23, 15]). Document clustering can help generate a hierarchical taxonomy efficiently (e.g. [9...

1520 | Clustering algorithms
- Hartigan
- 1975

Citation Context: ...data objects into groups so that each group consists of similar objects. The classification could either be flat (a partition of the data set usually found by a local search algorithm such as k-means [17]) or hierarchical [19]. Clustering has been proposed as a method to aid information retrieval in many contexts (e.g. [12, 31, 28, 23, 15]). Document clustering can help generate a hierarchical taxonom...

1319 | Data clustering: a review
- Jain, Murty, et al.
- 1999

Citation Context: ... models to data sets [18] and for image segmentation [30]. Most hierarchical clustering algorithms can be described as either divisive methods (i.e. top-down) or agglomerative methods (i.e. bottom-up) [5, 19, 20]. Both methods create trees, but do not provide a flat clustering. A divisive algorithm begins with the entire set and recursively partitions it into two pieces, forming a tree. An agglomerative algor...

724 | Cluster Analysis for Applications
- Anderberg
- 1973

Citation Context: ... models to data sets [18] and for image segmentation [30]. Most hierarchical clustering algorithms can be described as either divisive methods (i.e. top-down) or agglomerative methods (i.e. bottom-up) [5, 19, 20]. Both methods create trees, but do not provide a flat clustering. A divisive algorithm begins with the entire set and recursively partitions it into two pieces, forming a tree. An agglomerative algor...

623 | Scatter/Gather: a Cluster-based Approach to Browsing Large Document Collections
- Cutting, Karger, et al.
- 1992

Citation Context: ...tion of the data set usually found by a local search algorithm such as k-means [17]) or hierarchical [19]. Clustering has been proposed as a method to aid information retrieval in many contexts (e.g. [12, 31, 28, 23, 15]). Document clustering can help generate a hierarchical taxonomy efficiently (e.g. [9, 35]) as well as organize the results of a web search (e.g. [33, 32]). It has also been used to learn (or fit) mix...

430 | A comparison of document clustering techniques
- Steinbach, Karypis, et al.
- 2000

Citation Context: ...tion of the data set usually found by a local search algorithm such as k-means [17]) or hierarchical [19]. Clustering has been proposed as a method to aid information retrieval in many contexts (e.g. [12, 31, 28, 23, 15]). Document clustering can help generate a hierarchical taxonomy efficiently (e.g. [9, 35]) as well as organize the results of a web search (e.g. [33, 32]). It has also been used to learn (or fit) mix...

318 | Co-clustering documents and words using bipartite spectral graph partitioning
- Dhillon
- 2001

Citation Context: ...tion of the data set usually found by a local search algorithm such as k-means [17]) or hierarchical [19]. Clustering has been proposed as a method to aid information retrieval in many contexts (e.g. [12, 31, 28, 23, 15]). Document clustering can help generate a hierarchical taxonomy efficiently (e.g. [9, 35]) as well as organize the results of a web search (e.g. [33, 32]). It has also been used to learn (or fit) mix...

279 | P-complete approximation problems
- Sahni, Gonzalez
- 1976

Citation Context: ...resenting the objects. The class of functions for which the merge phase can find an optimal tree-respecting clustering include standard objectives such as k-means [17], min-diameter [11], and min-sum [25]. It also includes correlation clustering, a formulation of clustering that has seen recent interest [6, 10, 14, 16, 29]. Each of the corresponding optimization problems is NP-hard to solve for genera...

259 | On clusterings – good, bad and spectral
- Kannan, Vempala, et al.

Citation Context: ..., producing the best tree-respecting clustering. [Figure 1: The Divide-and-Merge methodology] For the divide phase we suggest using the theoretical spectral algorithm studied in [21]. There, the authors use a quantity called conductance to define a measure of a good clustering based on the graph of pairwise similarities. They prove that the tree constructed by recursive spectral ...
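A single bipartition step of the recursive spectral approach mentioned in this snippet might look as follows. This is a simplified, hypothetical reading (dense NumPy arithmetic, second eigenvector of the normalized similarity matrix, conductance-minimizing threshold cut), not the implementation from [21]; the function name and interface are invented for illustration.

```python
import numpy as np

def spectral_cut(A):
    """One spectral bipartition step on a symmetric similarity matrix A
    (zero diagonal, positive row sums assumed): sort vertices by the 2nd
    eigenvector of the normalized matrix, then take the threshold cut of
    minimum conductance."""
    d = A.sum(axis=1)                         # vertex degrees (volumes)
    Dinv = np.diag(1.0 / np.sqrt(d))
    N = Dinv @ A @ Dinv                       # normalized similarity matrix
    vals, vecs = np.linalg.eigh(N)            # eigenvalues in ascending order
    v = Dinv @ vecs[:, -2]                    # 2nd-largest eigenvector, rescaled
    order = np.argsort(v)
    best_phi, best_cut = np.inf, None
    for i in range(1, len(order)):            # every threshold cut along v
        S, T = order[:i], order[i:]
        cut = A[np.ix_(S, T)].sum()           # weight crossing the cut
        phi = cut / min(d[S].sum(), d[T].sum())   # conductance
        if phi < best_phi:
            best_phi, best_cut = phi, (sorted(S.tolist()), sorted(T.tolist()))
    return best_cut
```

Applying this recursively to each side until singletons remain yields the divide-phase tree the snippet describes.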

221 | Correlation clustering
- Bansal, Blum, et al.
- 2002

Citation Context: ...ng clustering include standard objectives such as k-means [17], min-diameter [11], and min-sum [25]. It also includes correlation clustering, a formulation of clustering that has seen recent interest [6, 10, 14, 16, 29]. Each of the corresponding optimization problems is NP-hard to solve for general graphs. Although approximation algorithms exist, many of them have impractical running times. Our methodology can be s...

177 | Evaluation of hierarchical clustering algorithms for document datasets
- Zhao, Karypis

Citation Context: ...9]. Clustering has been proposed as a method to aid information retrieval in many contexts (e.g. [12, 31, 28, 23, 15]). Document clustering can help generate a hierarchical taxonomy efficiently (e.g. [9, 35]) as well as organize the results of a web search (e.g. [33, 32]). It has also been used to learn (or fit) mixture models to data sets [18] and for image segmentation [30]. Most hierarchical clusterin...

157 | Spectral Relaxation for k-Means Clustering
- Zha, Ding, et al.
- 2002

Citation Context: ...a sets and results. (§4.0.1, 20 newsgroups) The 20 newsgroups resource [1] is a corpus of roughly 20,000 articles that come from 20 specific Usenet newsgroups. We performed a subset of the experiments in [34]. Each experiment involved choosing 50 random newsgroup articles each from two newsgroups. The results can be seen in Table 1. Note that we perform better than p-QR, the algorithm proposed in [34] o...

151 | Document clustering using word clusters via the information bottleneck method
- Slonim, Tishby
- 2000

115 | Information Retrieval (2nd edition)
- van Rijsbergen
- 1979

103 | Principal direction divisive partitioning
- Boley
- 1998

Citation Context: ...9]. Clustering has been proposed as a method to aid information retrieval in many contexts (e.g. [12, 31, 28, 23, 15]). Document clustering can help generate a hierarchical taxonomy efficiently (e.g. [9, 35]) as well as organize the results of a web search (e.g. [33, 32]). It has also been used to learn (or fit) mixture models to data sets [18] and for image segmentation [30]. Most hierarchical clusterin...

101 | Fast and Intuitive Clustering of Web Documents
- Zamir
- 1997

Citation Context: ... retrieval in many contexts (e.g. [12, 31, 28, 23, 15]). Document clustering can help generate a hierarchical taxonomy efficiently (e.g. [9, 35]) as well as organize the results of a web search (e.g. [33, 32]). It has also been used to learn (or fit) mixture models to data sets [18] and for image segmentation [30]. Most hierarchical clustering algorithms can be described as either divisive methods (i.e. t...

96 | Clustering with qualitative information
- Charikar, Guruswami, et al.
- 2003

Citation Context: ...ng clustering include standard objectives such as k-means [17], min-diameter [11], and min-sum [25]. It also includes correlation clustering, a formulation of clustering that has seen recent interest [6, 10, 14, 16, 29]. Each of the corresponding optimization problems is NP-hard to solve for general graphs. Although approximation algorithms exist, many of them have impractical running times. Our methodology can be s...

91 | Frequent Term-based Text Clustering
- Beil, Ester, et al.
- 2002

Citation Context: ...2 Reuters The Reuters data set [3] is a corpus of 8,654 news articles that have been classified into 135 distinct news topics. We performed the same two experiments on this data set as were conducted in [8, 23, 24]. The first experiment, performed by [8, 23], constructed a complete hierarchical tree for a document-term matrix that includes all 8,654 news articles. In the second experiment, a complete hierarchic...

69 | Coolcat: an entropy-based algorithm for categorical clustering
- Barbará, Li, et al.
- 2002

55 | Approximation schemes for clustering problems
- de la Vega, Karpinski, et al.
- 2003

Citation Context: ...e T by a similar dynamic program to the one above. Although approximation algorithms are known for this problem (as well as the one above), their running times seem too large to be useful in practice [13]. Correlation clustering: Suppose we are given a graph where each pair of vertices is either deemed similar (red) or not (blue). Let R and B be the set of red and blue edges, respectively. Correlation...
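The correlation-clustering objective described in this snippet, minimizing disagreements (red pairs split across clusters plus blue pairs kept together), is easy to state in code. A hypothetical helper, with `red` and `blue` as edge lists; the name and interface are invented for illustration:

```python
def disagreements(clusters, red, blue):
    """Correlation-clustering cost of a given clustering: the number of
    red (similar) pairs split across clusters plus the number of blue
    (dissimilar) pairs placed in the same cluster."""
    label = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    split_red = sum(1 for u, v in red if label[u] != label[v])
    joined_blue = sum(1 for u, v in blue if label[u] == label[v])
    return split_red + joined_blue
```

In the divide-and-merge setting, this cost also decomposes over the two children of a tree node (plus the cross edges), which is what makes a tree-respecting dynamic program possible.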

52 | The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data
- Hofmann
- 1999

Citation Context: ...n help generate a hierarchical taxonomy efficiently (e.g. [9, 35]) as well as organize the results of a web search (e.g. [33, 32]). It has also been used to learn (or fit) mixture models to data sets [18] and for image segmentation [30]. Most hierarchical clustering algorithms can be described as either divisive methods (i.e. top-down) or agglomerative methods (i.e. bottom-up) [5, 19, 20]. Both methods...

47 | Correlation clustering with partial information
- Demaine, Immorlica
- 2003

Citation Context: ...ng clustering include standard objectives such as k-means [17], min-diameter [11], and min-sum [25]. It also includes correlation clustering, a formulation of clustering that has seen recent interest [6, 10, 14, 16, 29]. Each of the corresponding optimization problems is NP-hard to solve for general graphs. Although approximation algorithms exist, many of them have impractical running times. Our methodology can be s...

38 | Fast and effective text mining using linear-time document clustering
- Larsen, Aone
- 1999

31 | A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions
- Kumar, Sabharwal, et al.
- 2004

Citation Context: ...p_i)^2. The centroid of a cluster is just the average of the points in the cluster. This problem is NP-hard; several heuristics (such as the k-means algorithm) and approximation algorithms exist (e.g. [17, 22]). Let OPT(C, i) be the optimal clustering for C using i clusters. Let Cl and Cr be the left and right children of C in T. Then we have the following recurrence: OPT(C, i) = C when i = 1, and otherwise OPT(C, i) = argmin_{1≤j<...
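The recurrence quoted in this snippet can be written out directly as a dynamic program over the divide-phase tree. A hypothetical sketch for the k-means objective (sum of squared distances to cluster centroids); the names are invented, and a real implementation would memoize `opt` over (node, i) as the paper's dynamic program does:

```python
import math

def leaves(node):
    # Collect the leaf ids under a tree node: a leaf id or a (left, right) pair.
    return [node] if not isinstance(node, tuple) else leaves(node[0]) + leaves(node[1])

def kmeans_cost(cluster, points):
    # Sum of squared distances from each point to the cluster centroid.
    dim = len(points[cluster[0]])
    c = [sum(points[p][d] for p in cluster) / len(cluster) for d in range(dim)]
    return sum(math.dist(points[p], c) ** 2 for p in cluster)

def opt(node, i, points):
    """The recurrence from the snippet: OPT(C, 1) = {C}, and otherwise
    OPT(C, i) is the cheapest OPT(Cl, j) + OPT(Cr, i - j) over 1 <= j < i."""
    if i == 1:
        cluster = leaves(node)
        return kmeans_cost(cluster, points), [cluster]
    if not isinstance(node, tuple):
        return math.inf, None          # a single leaf cannot form i > 1 clusters
    best_cost, best_clusters = math.inf, None
    for j in range(1, i):
        lcost, lcl = opt(node[0], j, points)
        rcost, rcl = opt(node[1], i - j, points)
        if lcl and rcl and lcost + rcost < best_cost:
            best_cost, best_clusters = lcost + rcost, lcl + rcl
    return best_cost, best_clusters
```

With memoization over (node, i), the work per tree node is O(k²), giving the near-linear merge phase the abstract claims for tree-decomposable objectives.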

31 | Correlation clustering: maximizing agreements via semidefinite programming
- Swamy
- 2004

29 | Incremental document clustering for web page classification
- Wong, Fu
- 2000

Citation Context: ... retrieval in many contexts (e.g. [12, 31, 28, 23, 15]). Document clustering can help generate a hierarchical taxonomy efficiently (e.g. [9, 35]) as well as organize the results of a web search (e.g. [33, 32]). It has also been used to learn (or fit) mixture models to data sets [18] and for image segmentation [30]. Most hierarchical clustering algorithms can be described as either divisive methods (i.e. t...

27 | A contiguity-enhanced k-means clustering algorithm for unsupervised multispectral image segmentation
- Theiler
- 1997

Citation Context: ...axonomy efficiently (e.g. [9, 35]) as well as organize the results of a web search (e.g. [33, 32]). It has also been used to learn (or fit) mixture models to data sets [18] and for image segmentation [30]. Most hierarchical clustering algorithms can be described as either divisive methods (i.e. top-down) or agglomerative methods (i.e. bottom-up) [5, 19, 20]. Both methods create trees, but do not provid...

10 | Correlation clustering – minimizing disagreements on arbitrary weighted graphs
- Emanuel, Fiat
- 2003

2 | Using unsupervised learning to guide re-sampling in imbalanced data sets
- Nickerson, Japkowicz, Milios
- 2001

Citation Context: ...2 Reuters The Reuters data set [3] is a corpus of 8,654 news articles that have been classified into 135 distinct news topics. We performed the same two experiments on this data set as were conducted in [8, 23, 24]. The first experiment, performed by [8, 23], constructed a complete hierarchical tree for a document-term matrix that includes all 8,654 news articles. In the second experiment, a complete hierarchic...

1 | SMART data set (Entropy)

Citation Context: ...imilarity function is the inner product. For a document-term matrix with M nonzeros, our implementation runs in O(Mn log n) in the worst case and seems to perform much better in practice (see Figure 2(a)). The data need not be text; all that is needed is for the similarity of two objects to be the inner product between the two vectors representing the objects. The class of functions for which the mer...