## Criterion Functions for Document Clustering: Experiments and Analysis (2002)

Citations: 148 (11 self)

### BibTeX

@TECHREPORT{Zhao02criterionfunctions,

author = {Ying Zhao and George Karypis},

title = {Criterion Functions for Document Clustering: Experiments and Analysis},

institution = {},

year = {2002}

}

### Abstract

In recent years, we have witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and organize this information with the ultimate goal of helping them to find what they are looking for. Fast and high-quality document clustering algorithms play an important role towards this goal as they have been shown to provide both an intuitive navigation/browsing mechanism by organizing large amounts of information into a small number of meaningful clusters as well as to greatly improve the retrieval performance either via cluster-driven dimensionality reduction, term-weighting, or query expansion. This ever-increasing importance of document clustering and the expanded range of its applications led to the development of a number of new and novel algorithms with different complexity-quality trade-offs. Among them, a class of clustering algorithms that have relatively low computational requirements are those that treat the clustering problem as an optimization process which seeks to maximize or minimize a particular clustering criterion function defined over the entire clustering solution.
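To make the idea of a criterion function "defined over the entire clustering solution" concrete, the following is a minimal sketch, not the authors' implementation. It evaluates the sum-of-squared-errors criterion quoted in the citation contexts of this record for a fixed partition, in both its centroid form and its pairwise form. The function names and the toy vectors are illustrative assumptions, and the pairwise form here sums over unordered pairs of documents, which is the convention under which the 1/n_r factor makes the two forms agree.

```python
from itertools import combinations


def centroid(docs):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse_centroid_form(clusters):
    """Sum over clusters of squared distances from each document to its centroid."""
    total = 0.0
    for docs in clusters:
        c = centroid(docs)
        total += sum(sq_dist(d, c) for d in docs)
    return total


def sse_pairwise_form(clusters):
    """Same criterion written with pairwise distances only.

    Assumption: the inner sum runs over unordered pairs {d_i, d_j};
    summing over ordered pairs would need a factor 1/(2 * n_r) instead.
    """
    total = 0.0
    for docs in clusters:
        pair_sum = sum(sq_dist(a, b) for a, b in combinations(docs, 2))
        total += pair_sum / len(docs)
    return total


# Two hypothetical 2-way clusterings of the same five toy "documents".
clusters_a = [[(1.0, 0.0), (0.9, 0.1)],
              [(0.0, 1.0), (0.1, 0.9), (0.2, 0.8)]]
clusters_b = [[(1.0, 0.0), (0.0, 1.0)],
              [(0.9, 0.1), (0.1, 0.9), (0.2, 0.8)]]

for name, solution in (("a", clusters_a), ("b", clusters_b)):
    print(name, sse_centroid_form(solution), sse_pairwise_form(solution))
```

A partitional algorithm of the kind the abstract describes would search over such partitions for one that minimizes the value; in this toy case the partition `clusters_a`, which groups similar vectors together, scores lower than `clusters_b`.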

### Citations

8167 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...ogram, with a single all inclusive cluster at the top and single-point clusters at the leaves. On the other hand, partitional algorithms, such as K-means [33, 22], K-medoids [22, 27, 35], Autoclass [8, 6], graph-partitioning-based [45, 22, 17, 40], or spectral-partitioning-based [5, 11], find the clusters by partitioning the entire dataset into either a predetermined or an automatically derived number...

2162 | Algorithms for clustering data
- Jain, Dubes
- 1988
Citation Context: ...asonably balanced clusters. 1 Introduction. The topic of clustering has been extensively studied in many scientific disciplines and over the years a variety of different algorithms have been developed [31, 22, 6, 27, 20, 35, 2, 48, 13, 43, 14, 15, 24]. Two recent surveys on ∗ This work was supported by NSF CCR-9972519, EIA-9986042, ACI-9982274, by Army Research Office contract DA/DAAG55-98-1-0441, by the DOE ASCI program, and by Army High Performa...

1936 | Pattern classification
- Duda, Hart, et al.
- 2001
Citation Context: ...ring solution by using the sum-of-squared-errors function. In particular, this criterion is defined as follows:

$$\text{minimize } \mathcal{I}_3 = \sum_{r=1}^{k} \sum_{d_i \in S_r} \|d_i - C_r\|^2. \qquad (9)$$

By some simple algebraic manipulations [12], the above equation can be rewritten as:

$$\mathcal{I}_3 = \sum_{r=1}^{k} \frac{1}{n_r} \sum_{d_i, d_j \in S_r} \|d_i - d_j\|^2, \qquad (10)$$

which shows that the $\mathcal{I}_3$ criterion function is similar in nature to $\mathcal{I}_1$ but instead of using similaritie...

1874 | Some methods for classification and analysis of multivariate observations
- MacQueen
- 1967
Citation Context: ...ithms produce a clustering that forms a dendrogram, with a single all inclusive cluster at the top and single-point clusters at the leaves. On the other hand, partitional algorithms, such as K-means [33, 22], K-medoids [22, 27, 35], Autoclass [8, 6], graph-partitioning-based [45, 22, 17, 40], or spectral-partitioning-based [5, 11], find the clusters by partitioning the entire dataset into either a prede...

1818 | An algorithm for suffix stripping
- Porter
- 1980
Citation Context: ...diversity in the datasets, we obtained them from different sources. For all data sets, we used a stop-list to remove common words, and the words were stemmed using Porter’s suffix-stripping algorithm [36]. Moreover, any term that occurs in fewer than two documents was eliminated. The classic dataset was obtained by combining the CACM, CISI, CRANFIELD, and MEDLINE abstracts that were used in the past t...

1347 | Finding Groups in Data: An Introduction to Cluster Analysis
- Kaufman, Rousseeuw
- 1990
Citation Context: ...asonably balanced clusters. 1 Introduction. The topic of clustering has been extensively studied in many scientific disciplines and over the years a variety of different algorithms have been developed [31, 22, 6, 27, 20, 35, 2, 48, 13, 43, 14, 15, 24]. Two recent surveys on ∗ This work was supported by NSF CCR-9972519, EIA-9986042, ACI-9982274, by Army Research Office contract DA/DAAG55-98-1-0441, by the DOE ASCI program, and by Army High Performa...

1309 | Data clustering: a review
- Jain, Murty, et al.
- 1999
Citation Context: ...the DOE ASCI program, and by Army High Performance Computing Research Center contract number DAAH04-95-C-0008. Related papers are available via WWW at URL: http://www.cs.umn.edu/~karypis ...the topics [21, 18] offer a comprehensive summary of the different applications and algorithms. These algorithms can be categorized along different dimensions based either on the underlying methodology of the algorithm,...

1213 | Automatic text processing: the transformation, analysis, and retrieval of information by computer
- Salton
- 1989
Citation Context: ...hat is available online at http://www.cs.umn.edu/~karypis/cluto. 2 Preliminaries. Document Representation. The various clustering algorithms that are described in this paper use the vector-space model [37] to represent each document. In this model, each document d is considered to be a vector in the term-space. In its simplest form, each document is represented by the term-frequency (TF) vector $d_{tf}$ = (...

1110 | A density-based algorithm for discovering clusters in large spatial databases with noise
- Ester, Kriegel, et al.
- 1996
Citation Context: ...asonably balanced clusters. 1 Introduction. The topic of clustering has been extensively studied in many scientific disciplines and over the years a variety of different algorithms have been developed [31, 22, 6, 27, 20, 35, 2, 48, 13, 43, 14, 15, 24]. Two recent surveys on ∗ This work was supported by NSF CCR-9972519, EIA-9986042, ACI-9982274, by Army Research Office contract DA/DAAG55-98-1-0441, by the DOE ASCI program, and by Army High Performa...

622 | Scatter/gather: A cluster-based approach to browsing large document collections
- Cutting, Karger, et al.
- 1992
Citation Context: ... recent years, various researchers have recognized that partitional clustering algorithms are well-suited for clustering large document datasets due to their relatively low computational requirements [7, 30, 1, 39]. A key characteristic of many partitional clustering algorithms is that they use a global criterion function whose optimization drives the entire clustering process¹. For some of these algorithms t...

595 | Efficient and Effective Clustering Methods for Spatial Data Mining
- Ng, Han
- 1994

563 | CURE: An efficient clustering algorithm for large databases
- Guha, Rastogi, et al.
- 1998

535 | Using Linear Algebra for Intelligent Information Retrieval
- Berry, Dumais, et al.
- 1995

482 | Bayesian Classification (AutoClass): Theory and Results
- Cheeseman, Stutz
- 1996

459 | Pattern Recognition
- Theodoridis, Koutroumbas
- 1998
Citation Context: ...avier in the overall clustering solution. This external criterion function was motivated by multiple discriminant analysis and is similar to minimizing the trace of the between-cluster scatter matrix [12, 41]. Equation 11 can be re-written as

$$\sum_{r=1}^{k} n_r \cos(C_r, C) = \sum_{r=1}^{k} n_r \frac{C_r^{t} C}{\|C_r\|\,\|C\|} = \sum_{r=1}^{k} n_r \frac{D_r^{t} D}{\|D_r\|\,\|D\|} = \frac{1}{\|D\|} \left( \sum_{r=1}^{k} n_r \frac{D_r^{t} D}{\|D_r\|} \right),$$

where D is the composite vector of the entire docu...

428 | A Comparison of Document Clustering Techniques
- Steinbach, Karypis, et al.
- 2000
Citation Context: ... recent years, various researchers have recognized that partitional clustering algorithms are well-suited for clustering large document datasets due to their relatively low computational requirements [7, 30, 1, 39]. A key characteristic of many partitional clustering algorithms is that they use a global criterion function whose optimization drives the entire clustering process¹. For some of these algorithms t...

336 | ROCK: A Robust Clustering Algorithm for Categorical Attributes
- Guha, Rastogi, et al.
- 1999

335 | Numerical taxonomy
- Sneath, Sokal
- 1973
Citation Context: ...ers until a certain stopping criterion is met. A number of different methods have been proposed for determining the next pair of clusters to be merged, such as group average (UPGMA) [22], single-link [38], complete link [28], CURE [14], ROCK [15], and CHAMELEON [24]. Hierarchical algorithms produce a clustering that forms a dendrogram, with a single all inclusive cluster at the top and single-point cl...

317 | Co-clustering documents and words using bipartite spectral graph partitioning
- Dhillon
- 2001
Citation Context: ... In our study we will focus on a particular edge-cut based criterion function called the normalized cut, which was recently used in the context of this bipartite graph model for document clustering [46, 9]. The normalized cut criterion function is defined as

$$\text{minimize } \mathcal{G}_2 = \sum_{r=1}^{k} \frac{\mathrm{cut}(V_r, V - V_r)}{W(V_r)}, \qquad (17)$$

where $V_r$ is the set of vertices assigned to the $r$th cluster, and $W(V_r)$ is the sum of the weigh...

304 | Concept Decompositions for Large Sparse Text Data Using Clustering
- Dhillon, Modha
- 2001
Citation Context: ...s the group-average heuristic to determine which pair of clusters to merge next. The second criterion function that we will study is used by the popular vector-space variant of the K-means algorithm [7, 30, 10, 39, 23]. In this algorithm each cluster is represented by its centroid vector and the goal is to find the clustering solution that maximizes the similarity between each document and the centroid of the clust...

301 | A user’s guide to principal components
- Jackson
- 1991

255 | OHSUMED: An interactive retrieval evaluation and new large test collection for research
- Hersh, Buckley, et al.
- 1994
Citation Context: ...2], and TREC-7 [42] collections. The classes of these datasets correspond to the documents that were judged relevant to particular queries. The ohscal dataset was obtained from the OHSUMED collection [19], which contains 233,445 documents indexed using 14,321 unique categories. Our dataset contained documents from the antibodies, carcinoma, DNA, in-vitro, molecular sequence data, pregnancy, prognosis,...

247 | Graph-theoretical methods for detecting and describing gestalt clusters
- Zahn
- 1971
Citation Context: ...e cluster at the top and single-point clusters at the leaves. On the other hand, partitional algorithms, such as K-means [33, 22], K-medoids [22, 27, 35], Autoclass [8, 6], graph-partitioning-based [45, 22, 17, 40], or spectral-partitioning-based [5, 11], find the clusters by partitioning the entire dataset into either a predetermined or an automatically derived number of clusters. Depending on the particular a...

209 | Chameleon: A hierarchical clustering algorithm using dynamic modeling
- Karypis, Han, et al.
- 1999

156 | Spectral relaxation for k-means clustering
- Zha, He, et al.
- 2001
Citation Context: ...unctions it has been shown that they converge to a local minima. An alternate way is to use more powerful optimizers such as those based on the spectral properties of the document’s similarity matrix [47] or document-term matrix [46, 9], or various multilevel optimization methods [26, 25]. However, such optimization methods have only been developed for a subset of the various criterion functions that ...

153 | Reuters-21578 text categorization test collection distribution 1.0. http://www.research.att.com/~lewis
- Lewis
- 1999
Citation Context: ... molecular sequence data, pregnancy, prognosis, receptors, risk factors, and tomography categories. The datasets re0 and re1 are from Reuters-21578 text categorization test collection Distribution 1.0 [32]. We divided the labels into two sets and constructed data sets accordingly. For each data set, we selected documents that have a single label. Finally, the datasets k1a, k1b, and wap are from the Web...

103 | Principal direction divisive partitioning
- Boley
- 1998
Citation Context: ... the leaves. On the other hand, partitional algorithms, such as K-means [33, 22], K-medoids [22, 27, 35], Autoclass [8, 6], graph-partitioning-based [45, 22, 17, 40], or spectral-partitioning-based [5, 11], find the clusters by partitioning the entire dataset into either a predetermined or an automatically derived number of clusters. Depending on the particular algorithm, a k-way clustering solution ca...

88 | Document categorization and query generation on the World Wide Web using WebACE
- Boley, Gini, et al.
- 1998
Citation Context: ... the labels into two sets and constructed data sets accordingly. For each data set, we selected documents that have a single label. Finally, the datasets k1a, k1b, and wap are from the WebACE project [34, 16, 3, 4]. Each document corresponds to a web page listed in the subject hierarchy of Yahoo! [44]. The datasets k1a and k1b contain exactly the same set of documents but they differ in how the documents were a...

87 | Bipartite graph partitioning and data clustering
- Zha, He, et al.
- 2001
Citation Context: ... In our study we will focus on a particular edge-cut based criterion function called the normalized cut, which was recently used in the context of this bipartite graph model for document clustering [46, 9]. The normalized cut criterion function is defined as

$$\text{minimize } \mathcal{G}_2 = \sum_{r=1}^{k} \frac{\mathrm{cut}(V_r, V - V_r)}{W(V_r)}, \qquad (17)$$

where $V_r$ is the set of vertices assigned to the $r$th cluster, and $W(V_r)$ is the sum of the weigh...

75 | Step-wise clustering procedures
- King
- 1967
Citation Context: ...stopping criterion is met. A number of different methods have been proposed for determining the next pair of clusters to be merged, such as group average (UPGMA) [22], single-link [38], complete link [28], CURE [14], ROCK [15], and CHAMELEON [24]. Hierarchical algorithms produce a clustering that forms a dendrogram, with a single all inclusive cluster at the top and single-point clusters at the leaves...

72 | Partitioning-based clustering for web document categorization. Decision Support Systems (accepted for publication)
- Boley, Gini, et al.
- 1999
Citation Context: ... the labels into two sets and constructed data sets accordingly. For each data set, we selected documents that have a single label. Finally, the datasets k1a, k1b, and wap are from the WebACE project [34, 16, 3, 4]. Each document corresponds to a web page listed in the subject hierarchy of Yahoo! [44]. The datasets k1a and k1b contain exactly the same set of documents but they differ in how the documents were a...

72 | WebACE: A web agent for document categorization and exploration
- Han, Boley, et al.
- 1998
Citation Context: ... the labels into two sets and constructed data sets accordingly. For each data set, we selected documents that have a single label. Finally, the datasets k1a, k1b, and wap are from the WebACE project [34, 16, 3, 4]. Each document corresponds to a web page listed in the subject hierarchy of Yahoo! [44]. The datasets k1a and k1b contain exactly the same set of documents but they differ in how the documents were a...

72 | Spatial clustering methods in data mining: A survey
- Han, Kamber, et al.
- 2001
Citation Context: ...the DOE ASCI program, and by Army High Performance Computing Research Center contract number DAAH04-95-C-0008. Related papers are available via WWW at URL: http://www.cs.umn.edu/~karypis ...the topics [21, 18] offer a comprehensive summary of the different applications and algorithms. These algorithms can be categorized along different dimensions based either on the underlying methodology of the algorithm,...

68 | Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval & categorization
- Karypis, Han
- 2000
Citation Context: ...s the group-average heuristic to determine which pair of clusters to merge next. The second criterion function that we will study is used by the popular vector-space variant of the K-means algorithm [7, 30, 10, 39, 23]. In this algorithm each cluster is represented by its centroid vector and the goal is to find the clustering solution that maximizes the similarity between each document and the centroid of the clust...

68 | A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs
- Karypis, Kumar
- 1999
Citation Context: ... is to use more powerful optimizers such as those based on the spectral properties of the document’s similarity matrix [47] or document-term matrix [46, 9], or various multilevel optimization methods [26, 25]. However, such optimization methods have only been developed for a subset of the various criterion functions that are used in our study. For this reason, in our study, the various criterion functions...

48 | Hypergraph based clustering in high-dimensional data sets: A summary of results
- Han, Karypis, et al.
- 1998
Citation Context: ...e cluster at the top and single-point clusters at the leaves. On the other hand, partitional algorithms, such as K-means [33, 22], K-medoids [22, 27, 35], Autoclass [8, 6], graph-partitioning-based [45, 22, 17, 40], or spectral-partitioning-based [5, 11], find the clusters by partitioning the entire dataset into either a predetermined or an automatically derived number of clusters. Depending on the particular a...

45 | On the merits of building categorization systems by supervised clustering
- Aggarwal, Gates, et al.
- 1999
Citation Context: ... recent years, various researchers have recognized that partitional clustering algorithms are well-suited for clustering large document datasets due to their relatively low computational requirements [7, 30, 1, 39]. A key characteristic of many partitional clustering algorithms is that they use a global criterion function whose optimization drives the entire clustering process¹. For some of these algorithms t...

38 | Fast and effective text mining using linear-time document clustering
- Larsen, Aone
- 1999

31 | Text REtrieval conference. http://trec.nist.gov
- TREC
- 1999
Citation Context: ... information retrieval systems⁴. In this data set, each individual set of abstracts formed one of the four classes. The fbis dataset is from the Foreign Broadcast Information Service data of TREC-5 [42], and the classes correspond to the categorization used in that collection. The hitech, reviews, and sports datasets were derived from the San Jose Mercury newspaper articles that are distributed as p...

29 | Scalable approach to balanced, high-dimensional clustering of market-baskets
- Strehl, Ghosh
- 2000
Citation Context: ...e cluster at the top and single-point clusters at the leaves. On the other hand, partitional algorithms, such as K-means [33, 22], K-medoids [22, 27, 35], Autoclass [8, 6], graph-partitioning-based [45, 22, 17, 40], or spectral-partitioning-based [5, 11], find the clusters by partitioning the entire dataset into either a predetermined or an automatically derived number of clusters. Depending on the particular a...

24 | Web page categorization and feature selection using association rule and principal component clustering
- Moore, Han, et al.
- 1997

23 | Spectral min-max cut for graph partitioning and data clustering
- Ding, He, et al.
- 2001
Citation Context: ... the leaves. On the other hand, partitional algorithms, such as K-means [33, 22], K-medoids [22, 27, 35], Autoclass [8, 6], graph-partitioning-based [45, 22, 17, 40], or spectral-partitioning-based [5, 11], find the clusters by partitioning the entire dataset into either a predetermined or an automatically derived number of clusters. Depending on the particular algorithm, a k-way clustering solution ca...

22 | Clustering analysis and its applications
- Lee
- 1981

10 | Multilevel refinement for hierarchical clustering
- Karypis, Han, et al.
- 1999
Citation Context: ... is to use more powerful optimizers such as those based on the spectral properties of the document’s similarity matrix [47] or document-term matrix [46, 9], or various multilevel optimization methods [26, 25]. However, such optimization methods have only been developed for a subset of the various criterion functions that are used in our study. For this reason, in our study, the various criterion functions...

4 | Partitioning sparse rectangular and structurally nonsymmetric matrices for parallel computation
- Kolda, Hendrickson
Citation Context: ...tegy that we used for the $\mathcal{G}_2$ criterion function is based on alternating the cluster refinement between document-vertices and term-vertices, that was used in the past for partitioning bipartite graphs [29]. Similarly to the other two refinement strategies, it consists of a number of iterations but each iteration consists of two steps. In the first step, the documents are visited in a random order. For ...

2 | Automated discovery of active motifs in three dimensional molecules
- Wang, Wang, et al.
- 1997

1 | BIRCH: An efficient data clustering method for large databases
- Zhang, Ramakrishnan, et al.
- 1996