## A review of clustering algorithms as applied to IR (1999)

Citations: 3 (0 self)

### BibTeX

@TECHREPORT{He99areview,

author = {Qin He},

title = {A review of clustering algorithms as applied to IR},

institution = {},

year = {1999}

}

### Citations

2334 | Algorithms for Clustering Data - Jain, Dubes - 1988 |

954 | A vector space model for automatic indexing - Salton, Wong, et al. - 1975 |
Citation Context: ...re index terms Ti and represented by an n-dimensional vector Di = (di1, di2, … , din), where dij is the weight of the jth term in document i and n is the total number of different index terms in the collection (Salton, Wong, & Yang, 1975). In this model, three main classes of similarity coefficients have been used to determine the degree of similarity between two documents: distance coefficients, association coefficients, prob...

277 | Numerical Taxonomy: The Principles and Practice of Numerical Classification - Sneath, Sokal - 1973 |

219 | Cluster analysis - Aldenderfer, Blashfield - 1984 |
Citation Context: ...icient. One of the major drawbacks of the use of the correlation coefficient as a similarity measure is its sensitivity to shape at the expense of the magnitude of differences between the variables (Aldenderfer & Blashfield, 1984, p. 23). 2) Distance measures. Distance measures have enjoyed widespread popularity because intuitively they appear to be dissimilarity measures. Distance measures normally have no upper bounds and are ...

143 | Clustering Algorithms - Rasmussen - 1992 |
Citation Context: ... common. Conversely, association coefficients have been used more widely. Three commonly used normalized association coefficients are the Dice coefficient, Jaccard coefficient and Cosine coefficient (Rasmussen, 1992). Probabilistic coefficients have been used in El-Hamdouchi's work, in which the main criterion for the formation of a cluster is that the documents in it have a maximal probability of being jointly...

133 | The use of hierarchical clustering in information retrieval. Information storage and retrieval - Jardine, Rijsbergen - 1971 |

132 | A theoretical basis for the use of co-occurrence data in information - Rijsbergen - 1977 |

111 | Slink: An optimally efficient algorithm for the single link cluster methods. The Computer Journal 16(1):30–34 - Sibson - 1973 |

94 | Implementing agglomerative hierarchic clustering algorithms for use in document retrieval - Voorhees - 1986 |

79 | An examination of the effect of six types of error perturbation on fifteen clustering algorithms - Milligan - 1980 |
Citation Context: ...ties in another cluster by fairly empty areas of space. Internal cohesion requires that entities within the same cluster should be similar to each other, at least within the local metric (as cited in Milligan, 1980, p. 326). Sneath and Sokal (1973) have described a number of properties of a cluster, the most important of which are density, variance, dimension, shape and separation (as cited in Aldenderfer & Bla...

68 | A review of classification - Cormack - 1971 |
Citation Context: ...is that it does not restrict the shape of clusters as rigidly as other proposed definitions do (p. 60). 1.2 Properties of Clusters. It is clear that clusters have certain properties. In early studies (Cormack, 1971), it is found that clusters should exhibit the properties of external isolation and internal cohesion. External isolation requires that entities in one cluster should be separated from entities in an...

59 | Mapping of Science by combined co-citation and word analysis. II: Dynamic aspects - Braam, Moed, et al. - 1991 |

44 | An efficient algorithm for a complete link method - Defays - 1977 |
Citation Context: ...t similarity to a document that is so linked (as cited in Willett, 1988, p. 583)." • Complete linkage. CLINK is a modification of Sibson's SLINK with the complexities for the complete linkage method (Defays, 1977). However, this algorithm has been shown to give very poor levels of retrieval effectiveness. The reason is that the CLINK algorithm does not seem to generate an exact complete linkage hierarchy. Voo...

36 | The selection of good search terms - Rijsbergen, Harper, et al. - 1981 |

35 | Validity studies in clustering methodologies - Dubes, Jain - 1976 |

30 | Clustering large files of documents using the single-link method - Croft - 1977 |

28 | A cluster-based approach to thesaurus construction - Crouch - 1988 |

26 | Hierarchic Document Clustering Using Ward’s Method - El-Hamdouchi, Willett |

24 | Spatial versus tree representations of proximity data - Pruzansky, Tversky, et al. - 1982 |

13 | Cluster Analysis 2nd ed - Everitt - 1980 |
Citation Context: ... measures. There are at least three concepts of similarity and distance, which need to be considered – between entities, between an entity and a group of entities, and between two groups of entities (Everitt, 1980, p. 12). Aldenderfer and Blashfield (1984) use “similarity coefficient” to describe any type of similarity measure and divide it into four groups (p. 17). Different similarity coefficients may have w...

12 | An algorithm for generating artificial test clusters - Milligan - 1985 |
Citation Context: ...data with known structure. The validation study is then reduced to the problem of determining how well the various clustering techniques or statistics detected the true (known) structure in the data (Milligan, 1985, p. 123). Milligan (1985) has proposed an algorithm for generating artificial test clusters. There are 9 steps in this algorithm with a three-way factorial design. The three factors are the number of...

11 | Similarity coefficients and weighting functions for automatic document classification: an empirical comparison - Willett - 1983 |

9 | Two partitioning type clustering algorithms - Can, Ozkarahan - 1984 |
Citation Context: ... tendency to produce large clusters early in the clustering pass, and because the clusters formed are not independent of the order in which the data set is processed. The cover coefficient algorithm (Can & Ozkarahan, 1984) is an example of a single pass algorithm developed for document clustering. In this algorithm, a set of documents is selected as cluster seeds, and then each document is assigned to the cluster seed...

7 | Structure in Document Browsing Spaces - Dubin - 1996 |

6 | Probability tables for cluster analysis based on a theory of random graphs - Ling, Killough - 1976 |

5 | On the theory and construction of k-clusters - Ling - 1972 |
Citation Context: ... between variables (Everitt, 1980). In addition, there are some other clustering methods, such as Ling’s generalization of single and complete link clustering to find what are termed (k, r) clusters (Ling, 1972). It is worth noting that the results obtained from different methods on the same data can be very different. Certain families of methods have been found to be particularly useful in specific scie...

4 | Dynamic cluster maintenance - Can, Ozkarahan - 1989 |

1 | An investigation of document partitions - Shaw - 1986 |
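
### Illustration: association coefficients on term-weight vectors

The citation contexts for Salton, Wong, & Yang (1975) and Rasmussen (1992) mention documents represented as term-weight vectors and three normalized association coefficients: Dice, Jaccard and Cosine. The following is a minimal sketch of how these are commonly computed for weighted vectors. The function names, the example weights and the choice to return 0.0 for all-zero vectors are assumptions made for illustration; the report's own formulas are not shown in the excerpts above.

```python
import math
from typing import Sequence


def _dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Inner product of two equal-length term-weight vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of index terms")
    return sum(a * b for a, b in zip(x, y))


def dice(x: Sequence[float], y: Sequence[float]) -> float:
    """Dice coefficient: 2 * x.y / (|x|^2 + |y|^2)."""
    denom = _dot(x, x) + _dot(y, y)
    return 2.0 * _dot(x, y) / denom if denom else 0.0


def jaccard(x: Sequence[float], y: Sequence[float]) -> float:
    """Jaccard coefficient: x.y / (|x|^2 + |y|^2 - x.y)."""
    shared = _dot(x, y)
    denom = _dot(x, x) + _dot(y, y) - shared
    return shared / denom if denom else 0.0


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine coefficient: x.y / (|x| * |y|)."""
    denom = math.sqrt(_dot(x, x) * _dot(y, y))
    return _dot(x, y) / denom if denom else 0.0


# Hypothetical documents over four index terms (binary weights for simplicity).
d1 = [1.0, 0.0, 1.0, 1.0]
d2 = [1.0, 1.0, 0.0, 1.0]

print(f"{dice(d1, d2):.3f} {jaccard(d1, d2):.3f} {cosine(d1, d2):.3f}")
# prints: 0.667 0.500 0.667
```

All three coefficients share the same numerator (the inner product of the two vectors) and differ only in how they normalize it, which is why they tend to rank document pairs similarly.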