## 2. The Choice of Variables and Similarity Measurements 3. A General Review of Clustering Methods and Their Applications in IR 4. Related Issues on Cluster Analysis

### BibTeX

@MISC{He_2.the,
    author = {Qin He},
    title = {2. The Choice of Variables and Similarity Measurements 3. A General Review of Clustering Methods and Their Applications in IR 4. Related Issues on Cluster Analysis},
    year = {}
}

### Abstract

### Citations

2148 | Algorithms for clustering data - Jain, Dubes - 1988 |

826 |
A vector space model for automatic indexing
- Salton, Wong, et al.
- 1975
Citation Context ...re index terms $T_i$ and represented by an n-dimensional vector $D_i = (d_{i1}, d_{i2}, \ldots, d_{in})$, where $d_{ij}$ is the weight of the jth term in document i and n is the total number of distinct index terms in the collection (Salton, Wong, & Yang, 1975). In this model, three main classes of similarity coefficients have been used to determine the degree of similarity between two documents: distance coefficients, association coefficients, prob... |

184 | Numerical taxonomy: the principles and practice of numerical classification - Sneath, Sokal - 1973 |

143 |
Cluster analysis
- Aldenderfer, Blashfield
- 1984
Citation Context ...icient. One of the major drawbacks of the use of the correlation coefficient as a similarity measure is its sensitivity to shape at the expense of the magnitude of differences between the variables (Aldenderfer & Blashfield, 1984, p. 23). 2) Distance measures. Distance measures have enjoyed widespread popularity because intuitively they appear to be dissimilarity measures. Distance measures normally have no upper bounds and are ... |
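The contrast drawn in this excerpt between correlation and distance can be illustrated with a minimal sketch. The two profiles below are made-up values, and the helper names `pearson` and `euclidean` are assumptions introduced only for this illustration; they do not come from the cited work.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical data: profile_b has the same shape as profile_a, shifted up by 10.
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [11.0, 12.0, 13.0, 14.0]

print(pearson(profile_a, profile_b))    # shape only: the shift is invisible
print(euclidean(profile_a, profile_b))  # magnitude: the shift dominates
```

Under these assumptions the correlation treats the two profiles as essentially identical, while the Euclidean distance has no upper bound and grows with the size of the shift.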

131 |
Clustering Algorithms
- Rasmussen
- 1992
Citation Context ... common. Conversely, association coefficients have been used more widely. Three commonly used normalized association coefficients are the Dice coefficient, Jaccard coefficient and Cosine coefficient (Rasmussen, 1992). Probabilistic coefficients have been used in El-Hamdouchi's work, in which the main criterion for the formation of a cluster is that the documents in it have a maximal probability of being jointly... |
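The three association coefficients named in this excerpt can be sketched as follows. The formulas are the commonly quoted weighted-vector forms, which is an assumption here since the excerpt itself does not spell them out, and the two document vectors are hypothetical.

```python
import math

def dot(x, y):
    """Inner product of two equal-length term-weight vectors."""
    return sum(a * b for a, b in zip(x, y))

def dice(x, y):
    """Dice coefficient: 2*x.y / (x.x + y.y)."""
    return 2 * dot(x, y) / (dot(x, x) + dot(y, y))

def jaccard(x, y):
    """Jaccard coefficient: x.y / (x.x + y.y - x.y)."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

def cosine(x, y):
    """Cosine coefficient: x.y / sqrt(x.x * y.y)."""
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

# Hypothetical binary term vectors: the documents share two of four index terms.
doc_a = [1, 1, 0, 1]
doc_b = [1, 0, 1, 1]

for name, coefficient in (("Dice", dice), ("Jaccard", jaccard), ("Cosine", cosine)):
    print(name, coefficient(doc_a, doc_b))
```

For these two vectors Dice and Cosine both come to 2/3 and Jaccard to 1/2. All three functions divide by zero when both vectors are all zeros, so a real implementation would need to guard against empty documents.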

124 | A theoretical basis for the use of cooccurrence data in information retrieval - Rijsbergen - 1977 |

121 | The use of hierarchic clustering in information retrieval - Jardine, Rijsbergen - 1971 |

104 | SLINK: an optimally efficient algorithm for the single-link cluster method - Sibson - 1973 |

79 | Implementing agglomerative hierarchic clustering algorithms for use in document retrieval - Voorhees - 1986 |

57 |
An examination of the effects of six types of error perturbations on fifteen clustering algorithms. Psychometrika
- Milligan
- 1981
Citation Context ...ties in another cluster by fairly empty areas of space. Internal cohesion requires that entities within the same cluster should be similar to each other, at least within the local metric (as cited in Milligan, 1980, p. 326). Sneath and Sokal (1973) have described a number of properties of a cluster, the most important of which are density, variance, dimension, shape and separation (as cited in Aldenderfer & Bla... |

50 |
A review of classification
- Cormack
- 1971
Citation Context ...is that it does not restrict the shape of clusters as rigidly as other proposed definitions do (p. 60). 1.2 Properties of Clusters. It is clear that clusters have certain properties. Early studies (Cormack, 1971) found that clusters should exhibit the properties of external isolation and internal cohesion. External isolation requires that entities in one cluster should be separated from entities in an... |

41 | Mapping science by combined cocitation and word analysis. 2. dynamical aspects - Braam, Moed, et al. - 1991 |

33 | The selection of good search terms - Rijsbergen, Harper, et al. - 1981 |

31 | Validity studies in clustering methodologies - Dubes, Jain - 1979 |

27 | Clustering large files of documents using the single link method - Croft - 1977 |

27 | A cluster-based approach to thesaurus construction - Crouch - 1988 |

21 | Hierarchic Document Clustering Using Ward's Method - El-Hamdouchi, Willett - 1986 |

20 | Spatial versus tree representations of proximity data - Pruzansky, Tversky, et al. |

12 |
An algorithm for generating artificial test clusters
- Milligan
- 1985
Citation Context ...data with known structure. The validation study is then reduced to the problem of determining how well the various clustering techniques or statistics detected the true (known) structure in the data (Milligan, 1985, p. 123). Milligan (1985) has proposed an algorithm for generating artificial test clusters. The algorithm has nine steps and uses a three-way factorial design. The three factors are the number of... |

10 |
Cluster analysis (2nd ed.)
- Everitt
- 1980
Citation Context ... measures. At least three concepts of similarity and distance need to be considered: between entities, between an entity and a group of entities, and between two groups of entities (Everitt, 1980, p. 12). Aldenderfer and Blashfield (1984) use “similarity coefficient” to describe any type of similarity measure and divide it into four groups (p. 17). Different similarity coefficients may have w... |

10 | Similarity coefficients and weighting functions for automatic document classification: an empirical comparison - Willett - 1983 |

8 |
Two Partitioning Type Clustering Algorithms
- Can, Ozkarahan
- 1984
Citation Context ... tendency to produce large clusters early in the clustering pass, and because the clusters formed are not independent of the order in which the data set is processed. The cover coefficient algorithm (Can & Ozkarahan, 1984) is an example of a single pass algorithm developed for document clustering. In this algorithm, a set of documents is selected as cluster seeds, and then each document is assigned to the cluster seed... |
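The order dependence mentioned in this excerpt can be seen in a minimal sketch of a generic single-pass (leader-style) procedure. This is not the cover coefficient algorithm of Can and Ozkarahan; the function `single_pass`, the 1-D points, and the distance threshold are all assumptions chosen to keep the illustration small.

```python
def single_pass(points, threshold):
    """Generic single-pass clustering over 1-D points (illustrative only).

    Each point joins the cluster whose leader (its first member) is nearest,
    provided that distance is within `threshold`; otherwise the point starts
    a new cluster and becomes its leader.
    """
    leaders = []
    clusters = []
    for point in points:
        best = None
        for index, leader in enumerate(leaders):
            distance = abs(point - leader)
            if distance <= threshold and (
                best is None or distance < abs(point - leaders[best])
            ):
                best = index
        if best is None:
            leaders.append(point)
            clusters.append([point])
        else:
            clusters[best].append(point)
    return clusters

# Same hypothetical data, two processing orders.
print(single_pass([0, 2, 4], threshold=2))
print(single_pass([2, 0, 4], threshold=2))
```

With the first ordering the point 4 is too far from the leader 0 and forms its own cluster; with the second ordering 2 becomes the leader and all three points fall into one cluster, so the partition depends on the order in which the data are processed.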

6 | Structure in Document Browsing Spaces - Dubin - 1996 |

5 | Probability tables for cluster analysis based on a theory of random graphs - Ling, Killough - 1976 |

4 | Dynamic cluster maintenance - Can, Ozkarahan - 1989 |

4 |
On the theory and construction of k-clusters
- Ling
- 1972
Citation Context ... between variables (Everitt, 1980). In addition, there are some other clustering methods, such as Ling’s generalization of single and complete link clustering to find what are termed (k, r) clusters (Ling, 1972). It is worth noting that the results obtained from different methods on the same data can be very different. Certain families of methods have been found to be particularly useful in specific scie... |

1 | An efficient algorithm for a complete link method - unknown authors - 1977 |

1 | An investigation of document partitions - M - 1986 |