## Incremental Clustering and Dynamic Information Retrieval (1997)

### Download Links

- [theory.stanford.edu]
- [www.ece.northwestern.edu]
- [www.cas.mcmaster.ca]
- DBLP

### Other Repositories/Bibliography

Citations: 155 (4 self)

### BibTeX

@INPROCEEDINGS{Charikar97incrementalclustering,
  author    = {Moses Charikar and Chandra Chekuri},
  title     = {Incremental Clustering and Dynamic Information Retrieval},
  booktitle = {},
  year      = {1997},
  pages     = {626--635}
}

### Abstract

Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. We propose a model called incremental clustering which is based on a careful analysis of the requirements of the information retrieval application, and which should also be useful in other applications. The goal is to efficiently maintain clusters of small diameter as new points are inserted. We analyze several natural greedy algorithms and demonstrate that they perform poorly. We propose new deterministic and randomized incremental clustering algorithms which have a provably good performance. We complement our positive results with lower bounds on the performance of incremental algorithms. Finally, we consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters.
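The incremental model can be illustrated with a toy sketch. The following is hypothetical and is not the paper's algorithm (the function names and the 1-D setting are illustrative assumptions): each arriving point starts as a singleton cluster, and whenever more than k clusters exist, a greedy rule merges the pair whose union has the smallest diameter, the kind of natural greedy strategy the abstract says can perform poorly.

```python
import itertools

def diameter(cluster):
    """Largest pairwise distance within a cluster of 1-D points."""
    return max(cluster) - min(cluster)

def incremental_cluster(points, k):
    """Toy greedy incremental clustering (illustrative, not the paper's
    algorithm): each arriving point becomes a singleton cluster; if more
    than k clusters exist, merge the pair whose union has the smallest
    diameter."""
    clusters = []
    for p in points:
        clusters.append([p])
        if len(clusters) > k:
            # pick the pair of clusters whose merged diameter is smallest
            i, j = min(itertools.combinations(range(len(clusters)), 2),
                       key=lambda ij: diameter(clusters[ij[0]] + clusters[ij[1]]))
            merged = clusters[i] + clusters[j]
            clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
            clusters.append(merged)
    return clusters
```

On a well-separated input such as `[0, 1, 10, 11]` with k = 2 this recovers the two obvious groups; the paper's lower bounds show that on adversarial insertion orders such greedy rules can be forced into clusters of large diameter.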

### Citations

11403 | Computers and Intractability: A Guide to the Theory of NP-Completeness
- Garey, Johnson
- 1979
Citation Context: ...k in Static Clustering. The closely-related problems of clustering to minimize diameter and radius are also called pairwise clustering and the k-center problem, respectively [2, 21]. Both are NP-hard [17, 28], and in fact hard to approximate to within factor 2 for arbitrary metric spaces [2, 21]. For Euclidean spaces, clustering on the line is easy [3], but in higher dimensions it is NP-hard to approximat...

4139 | Pattern classification and scene analysis
- Duda, Hart
- 1973
Citation Context: ...lly, clustering is used to accelerate query processing by considering only a small number of representatives of the clusters, rather than the entire corpus. In addition, it is used for classification [11] and has been suggested as a method for facilitating browsing [9, 10]. The current information explosion, fueled by the availability of hypermedia and the World-wide Web, has led to the generation of ...

3389 | Introduction to Modern Information Retrieval
- Salton, McGill
- 1983
Citation Context: ...ons [1, 20, 29, 35, 44]. It has proved to be a particularly important tool in information retrieval for constructing a taxonomy of a corpus of documents by forming groups of closely related documents [21, 24, 37, 44, 45, 47, 48]. For this purpose, a distance metric is imposed over documents, enabling us to view them as points in a metric space. The central role of clustering in this application is captured by the so-called c...

2302 | Algorithms for Clustering Data
- Jain, Dubes
- 1981
Citation Context: ...ber of clusters. Before describing our results in any greater detail, we motivate and formalize our new model. Clustering is used for data analysis and classification in a wide variety of applications [1, 12, 20, 27, 34]. It has proved to be a particularly important tool in information retrieval for constructing a taxonomy of a corpus of documents by forming groups of closely-related documents [13, 16, 34, 35, 37, 38...

1665 | Clustering Algorithms
- Hartigan
- 1975
Citation Context: ...ber of clusters. Before describing our results in any greater detail, we motivate and formalize our new model. Clustering is used for data analysis and classification in a wide variety of applications [1, 12, 20, 27, 34]. It has proved to be a particularly important tool in information retrieval for constructing a taxonomy of a corpus of documents by forming groups of closely-related documents [13, 16, 34, 35, 37, 38...

724 | Automatic Text Processing
- Salton
- 1989
Citation Context: ...ion [1, 12, 20, 27, 34]. It has proved to be a particularly important tool in information retrieval for constructing a taxonomy of a corpus of documents by forming groups of closely-related documents [13, 16, 34, 35, 37, 38]. For this purpose, a distance metric is imposed over documents, enabling us to view them as points in a metric space. The central role of clustering in this application is captured by the so-called c...

674 | Scatter/gather: A cluster-based approach to browsing large document collections
- Cutting, Pedersen, et al.
- 1992
Citation Context: ...ing only a small number of representatives of the clusters, rather than the entire corpus. In addition, it is used for classification [11] and has been suggested as a method for facilitating browsing [9, 10]. The current information explosion, fueled by the availability of hypermedia and the World-wide Web, has led to the generation of an ever-increasing volume of data, posing a growing challenge for inf...

534 | Cluster Analysis
- Everitt, Landau, et al.
- 2001
Citation Context: ...ber of clusters. Before describing our results in any greater detail, we motivate and formalize our new model. Clustering is used for data analysis and classification in a wide variety of applications [1, 12, 20, 27, 34]. It has proved to be a particularly important tool in information retrieval for constructing a taxonomy of a corpus of documents by forming groups of closely-related documents [13, 16, 34, 35, 37, 38...

322 | Primal-dual approximation algorithms for metric facility location and k-median problems
- Jain, Vazirani
- 1999
Citation Context: ... to the minimum diameter and radius measures described above, several other objective functions have also been considered in the literature. A lot of recent work has focused on the k-median objective [11, 36, 2]. Here the goal is to assign points to k centers such that the sum of distances of points to their centers is minimized. Other objectives that have been studied include the objective of minimizing the...

297 | Clustering to minimize the maximum intercluster distance
- Gonzalez
- 1985
Citation Context: ...rary metric spaces [2, 21]. For Euclidean spaces, clustering on the line is easy [3], but in higher dimensions it is NP-hard to approximate to within factors close to 2, regardless of the metric used [14, 15, 19, 29, 30]. The furthest point heuristic due to Gonzalez [19] (see also Hochbaum and Shmoys [23, 24]) gives a 2-approximation in all metric spaces. This algorithm requires O(kn) distance computations, and when ...
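The furthest-point heuristic mentioned in the context above is short enough to sketch. This is a generic rendering (the function name and the `dist` callback are assumptions, not from the source): start from an arbitrary point, then repeatedly open a new center at the point furthest from the centers chosen so far.

```python
def gonzalez(points, k, dist):
    """Furthest-point heuristic: repeatedly add as a new center the
    point whose distance to its nearest existing center is largest.
    In any metric space this is a 2-approximation for the k-center
    objective, using O(kn) distance computations."""
    centers = [points[0]]
    while len(centers) < k:
        # the point furthest from its nearest chosen center
        centers.append(max(points,
                           key=lambda p: min(dist(p, c) for c in centers)))
    return centers
```

For example, on the line with points `[0, 1, 8, 9, 20]` and k = 3, the heuristic opens centers at 0, then 20, then 9.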

267 | Clustering data streams
- Guha, Mishra, et al.
- 2000
Citation Context: ...nimize the maximum diameter or radius of the clusters produced. Subsequent to this work, other clustering objectives have also been studied in the streaming model, most notably the k-median objective [28, 13] and the sum of cluster diameters objective [14]. 2. Greedy algorithms. We begin by examining some natural greedy algorithms. A greedy incremental clustering algorithm always merges clusters to minimi...

241 | Local search heuristics for k-median and facility location problems
- Arya, Garg, et al.
- 2001
Citation Context: ... to the minimum diameter and radius measures described above, several other objective functions have also been considered in the literature. A lot of recent work has focused on the k-median objective [11, 36, 2]. Here the goal is to assign points to k centers such that the sum of distances of points to their centers is minimized. Other objectives that have been studied include the objective of minimizing the...

230 | Information Retrieval
- van Rijsbergen
- 1979
Citation Context: ...ons [1, 20, 29, 35, 44]. It has proved to be a particularly important tool in information retrieval for constructing a taxonomy of a corpus of documents by forming groups of closely related documents [21, 24, 37, 44, 45, 47, 48]. For this purpose, a distance metric is imposed over documents, enabling us to view them as points in a metric space. The central role of clustering in this application is captured by the so-called c...

219 | A constant factor approximation algorithm for the k-median problem
- Charikar, Guha, et al.
- 1999
Citation Context: ... to the minimum diameter and radius measures described above, several other objective functions have also been considered in the literature. A lot of recent work has focused on the k-median objective [11, 36, 2]. Here the goal is to assign points to k centers such that the sum of distances of points to their centers is minimized. Other objectives that have been studied include the objective of minimizing the...

218 | An algorithmic approach to network location problems II: The p-medians
- Kariv, Hakimi
- 1979
Citation Context: ...k in Static Clustering. The closely-related problems of clustering to minimize diameter and radius are also called pairwise clustering and the k-center problem, respectively [2, 21]. Both are NP-hard [17, 28], and in fact hard to approximate to within factor 2 for arbitrary metric spaces [2, 21]. For Euclidean spaces, clustering on the line is easy [3], but in higher dimensions it is NP-hard to approximat...

212 | Recent Trends in Hierarchic Document Clustering: A Critical Review
- Willett
- 1988
Citation Context: ...e propose the model described below. Hierarchical Agglomerative Clustering. The clustering strategy employed almost universally in information retrieval is Hierarchical Agglomerative Clustering (HAC) [12, 34, 35, 37, 38, 39]. This is also popular in other applications such as biology, medicine, image processing, and geographical information systems. The basic idea is: initially assign the n input points to n distinct clu...
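The HAC idea quoted in the context above (start with n singleton clusters and repeatedly merge the closest pair) can be sketched as follows. Single linkage and the names here are illustrative choices, not specified by the source:

```python
def hac(points, k, dist):
    """Hierarchical agglomerative clustering sketch: begin with n
    singleton clusters; repeatedly merge the two clusters at minimum
    single-linkage distance until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: closest pair of points across clusters
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters
```

The naive search for the closest pair makes this quadratic per merge; practical HAC implementations maintain a priority queue of inter-cluster distances instead.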

204 | Cluster Analysis
- Aldenderfer, Blashfield
- 1984
Citation Context: ...ndation, Shell Foundation, and Xerox Corporation. 1417–1418 M. CHARIKAR, C. CHEKURI, T. FEDER, AND R. MOTWANI. Clustering is used for data analysis and classification in a wide variety of applications [1, 20, 29, 35, 44]. It has proved to be a particularly important tool in information retrieval for constructing a taxonomy of a corpus of documents by forming groups of closely related documents [21, 24, 37, 44, 45, 47...

202 | Introduction to Modern Information Retrieval
- Salton, McGill
- 1983
Citation Context: ...ion [1, 12, 20, 27, 34]. It has proved to be a particularly important tool in information retrieval for constructing a taxonomy of a corpus of documents by forming groups of closely-related documents [13, 16, 34, 35, 37, 38]. For this purpose, a distance metric is imposed over documents, enabling us to view them as points in a metric space. The central role of clustering in this application is captured by the so-called c...

200 | Approximation schemes for covering and packing problems in image processing and VLSI
- Hochbaum, Maass
Citation Context: ..., cover each point with a unit ball in R^d as it arrives, so as to minimize the total number of balls used. In the static case this problem is NP-complete and there is a PTAS for any fixed dimension [22]. We note that in general metric spaces, it is not possible to achieve any bounded ratio. Our algorithm's analysis is based on a theorem from combinatorial geometry called Rogers' theorem [36] (see al...

196 | A best possible heuristic for the k-center problem
- Hochbaum, Shmoys
- 1985
Citation Context: ...mensions it is NP-hard to approximate to within factors close to 2, regardless of the metric used [14, 15, 19, 29, 30]. The furthest point heuristic due to Gonzalez [19] (see also Hochbaum and Shmoys [23, 24]) gives a 2-approximation in all metric spaces. This algorithm requires O(kn) distance computations, and when the metric space is induced by shortest-path distances in weighted graphs, the running tim...

173 | Optimal algorithms for approximate clustering
- Feder, Greene
- 1988
Citation Context: ...rary metric spaces [2, 21]. For Euclidean spaces, clustering on the line is easy [3], but in higher dimensions it is NP-hard to approximate to within factors close to 2, regardless of the metric used [14, 15, 19, 29, 30]. The furthest point heuristic due to Gonzalez [19] (see also Hochbaum and Shmoys [23, 24]) gives a 2-approximation in all metric spaces. This algorithm requires O(kn) distance computations, and when ...

166 | Combinatorial Geometry
- Pach, Agarwal
- 1995
Citation Context: ...n general metric spaces, it is not possible to achieve any bounded ratio. Our algorithm's analysis is based on a theorem from combinatorial geometry called Rogers' theorem [36] (see also Theorem 7.17 [33]), which says that R^d can be covered by any convex shape with covering density O(d log d). Since the volume of a radius-2 ball is 2^d times the volume of a unit-radius ball, the number of balls needed...

130 | The use of hierarchical clustering in information retrieval, Information Storage and Retrieval 7
- Jardine, van Rijsbergen
- 1971

127 | Constant interaction-time scatter/gather browsing of very large document collections
- Cutting, Karger, et al.
- 1993
Citation Context: ...ing only a small number of representatives of the clusters, rather than the entire corpus. In addition, it is used for classification [11] and has been suggested as a method for facilitating browsing [9, 10]. The current information explosion, fueled by the availability of hypermedia and the World-wide Web, has led to the generation of an ever-increasing volume of data, posing a growing challenge for inf...

121 | A note on coverings
- Rogers
- 1957
Citation Context: ...mension [22]. We note that in general metric spaces, it is not possible to achieve any bounded ratio. Our algorithm's analysis is based on a theorem from combinatorial geometry called Rogers' theorem [36] (see also Theorem 7.17 [33]), which says that R^d can be covered by any convex shape with covering density O(d log d). Since the volume of a radius-2 ball is 2^d times the volume of a unit-radius ball...

121 | On the complexity of some common geometric location problems
- Megiddo, Supowit
- 1984
Citation Context: ...tering on the line is easy [6], but in higher dimensions it is NP-hard to approximate to within factors close to 2, regardless of the metric used [22, 23, 27, 39, 40]. The furthest point heuristic due to Gonzalez [27] (see also Hochbaum and Shmoys [32, 33]) gives a 2-approximation in all metric spaces. This algorithm requires O(kn) distance computations, and when ...

118 | Optimal packing and covering in the plane are NP-complete
- Fowler, Paterson, et al.
- 1981
Citation Context: ...rary metric spaces [2, 21]. For Euclidean spaces, clustering on the line is easy [3], but in higher dimensions it is NP-hard to approximate to within factors close to 2, regardless of the metric used [14, 15, 19, 29, 30]. The furthest point heuristic due to Gonzalez [19] (see also Hochbaum and Shmoys [23, 24]) gives a 2-approximation in all metric spaces. This algorithm requires O(kn) distance computations, and when ...

90 | A Survey of Information Retrieval and Filtering Methods
- Faloutsos, Oard
- 1996
Citation Context: ...ion [1, 12, 20, 27, 34]. It has proved to be a particularly important tool in information retrieval for constructing a taxonomy of a corpus of documents by forming groups of closely-related documents [13, 16, 34, 35, 37, 38]. For this purpose, a distance metric is imposed over documents, enabling us to view them as points in a metric space. The central role of clustering in this application is captured by the so-called c...

90 | Theorie der konvexen Körper
- Bonnesen, Fenchel
- 1934

89 | Non-clairvoyant scheduling
- Motwani, Phillips, et al.
- 1994
Citation Context: ...r from [1/e, 1] according to the probability density function 1/r, set d1 to rx, and redefine β = e and α = e/(e − 1). Similar randomization of doubling algorithms was used earlier in scheduling [31], and later in other applications [7, 18]. Theorem 9. The Randomized Doubling Algorithm has expected performance ratio 2e ≈ 5.437 in any metric space. The same bound is also achieved for the radius measu...

87 | An improved approximation ratio for the minimum latency problem
- Goemans, Kleinberg
- 1996
Citation Context: ...bility density function 1/r, set d1 to rx, and redefine β = e and α = e/(e − 1). Similar randomization of doubling algorithms was used earlier in scheduling [31], and later in other applications [7, 18]. Theorem 9. The Randomized Doubling Algorithm has expected performance ratio 2e ≈ 5.437 in any metric space. The same bound is also achieved for the radius measure. Proof: Let σ be the sequence of upda...

87 | A unified approach to approximation algorithms for scheduling problems: Practical and theoretical results
- Hochbaum, Shmoys
- 1986
Citation Context: ...mensions it is NP-hard to approximate to within factors close to 2, regardless of the metric used [14, 15, 19, 29, 30]. The furthest point heuristic due to Gonzalez [19] (see also Hochbaum and Shmoys [23, 24]) gives a 2-approximation in all metric spaces. This algorithm requires O(kn) distance computations, and when the metric space is induced by shortest-path distances in weighted graphs, the running tim...

80 | Approximation algorithms for geometric problems
- Bern, Eppstein
Citation Context: ...eration here. Previous Work in Static Clustering. The closely-related problems of clustering to minimize diameter and radius are also called pairwise clustering and the k-center problem, respectively [2, 21]. Both are NP-hard [17, 28], and in fact hard to approximate to within factor 2 for arbitrary metric spaces [2, 21]. For Euclidean spaces, clustering on the line is easy [3], but in higher dimensions ...

75 | Better streaming algorithms for clustering problems
- Charikar, O’Callaghan, et al.
- 2003
Citation Context: ...nimize the maximum diameter or radius of the clusters produced. Subsequent to this work, other clustering objectives have also been studied in the streaming model, most notably the k-median objective [28, 13] and the sum of cluster diameters objective [14]. 2. Greedy algorithms. We begin by examining some natural greedy algorithms. A greedy incremental clustering algorithm always merges clusters to minimi...

69 | Algorithms for facility location problems with outliers
- Charikar, Khuller, et al.
- 2001
Citation Context: ... the sum of all distances within each cluster [3, 18] and that of minimizing the sum of cluster diameters [14]. In addition to this, outlier formulations of clustering problems have also been studied [12]. Here the algorithm is allowed to discard a fraction of the input as outliers and is required to obtain a clustering solution that minimizes a given objective function on the remaining input points. ...

65 | Improved scheduling algorithms for minsum criteria
- Chakrabarti, Phillips, et al.
- 1996
Citation Context: ...bility density function 1/r, set d1 to rx, and redefine β = e and α = e/(e − 1). Similar randomization of doubling algorithms was used earlier in scheduling [31], and later in other applications [7, 18]. Theorem 9. The Randomized Doubling Algorithm has expected performance ratio 2e ≈ 5.437 in any metric space. The same bound is also achieved for the radius measure. Proof: Let σ be the sequence of upda...

57 | Approximation schemes for clustering problems
- de la Vega, Karpinski, et al.
- 2003
Citation Context: ...enters such that the sum of distances of points to their centers is minimized. Other objectives that have been studied include the objective of minimizing the sum of all distances within each cluster [3, 18] and that of minimizing the sum of cluster diameters [14]. In addition to this, outlier formulations of clustering problems have also been studied [12]. Here the algorithm is allowed to discard a frac...

56 | Approximating min-sum k-clustering in metric spaces
- Bartal, Charikar, et al.
- 2001
Citation Context: ...enters such that the sum of distances of points to their centers is minimized. Other objectives that have been studied include the objective of minimizing the sum of all distances within each cluster [3, 18] and that of minimizing the sum of cluster diameters [14]. In addition to this, outlier formulations of clustering problems have also been studied [12]. Here the algorithm is allowed to discard a frac...

49 | Incremental clustering for dynamic information processing
- Can
- 1993
Citation Context: ...orithms are not suitable for maintaining clusters in such a dynamic environment, and they have been struggling with the problem of updating clusters without frequently performing complete reclustering [4, 5, 6, 8, 35]. From a theoretical perspective, many different formulations are possible for this dynamic clustering problem, and it is not clear a priori which of these best addresses the concerns of the practitio...

47 | On the Complexity of Clustering Problems
- Brucker
- 1978
Citation Context: ... problem, respectively [2, 21]. Both are NP-hard [17, 28], and in fact hard to approximate to within factor 2 for arbitrary metric spaces [2, 21]. For Euclidean spaces, clustering on the line is easy [3], but in higher dimensions it is NP-hard to approximate to within factors close to 2, regardless of the metric used [14, 15, 19, 29, 30]. The furthest point heuristic due to Gonzalez [19] (see also Ho...

38 | Clustering Algorithms, in Information Retrieval: Data Structures and Algorithms
- Rasmussen
- 1992
Citation Context: ...ndation, Shell Foundation, and Xerox Corporation. Clustering is used for data analysis and classification in a wide variety of applications [1, 20, 29, 35, 44]. It has proved to be a particularly important tool in information retrieval for constructing a taxonomy of a corpus of documents by forming groups of closely related documents [21, 24, 37, 44, 45, 47...

33 | Clustering to minimize the sum of cluster diameters
- Charikar, Panigrahy
Citation Context: ...ters is minimized. Other objectives that have been studied include the objective of minimizing the sum of all distances within each cluster [3, 18] and that of minimizing the sum of cluster diameters [14]. In addition to this, outlier formulations of clustering problems have also been studied [12]. Here the algorithm is allowed to discard a fraction of the input as outliers and is required to obtain a...

24 | Online computation
- Irani, Karlin
- 1995
Citation Context: ...age, while cluster representatives are preserved in main memory [32]. We have avoided labeling this model as the online clustering problem or referring to the performance ratio as a competitive ratio [25] for the following reasons. Recall that in an online setting, we would compare the performance of an algorithm to that of an adversary which knows the update sequence in advance but must process the p...

23 | Various notions of approximations
- Hochbaum
- 1995
Citation Context: ...eration here. Previous Work in Static Clustering. The closely-related problems of clustering to minimize diameter and radius are also called pairwise clustering and the k-center problem, respectively [2, 21]. Both are NP-hard [17, 28], and in fact hard to approximate to within factor 2 for arbitrary metric spaces [2, 21]. For Euclidean spaces, clustering on the line is easy [3], but in higher dimensions ...

20 | On Online Computation: Approximation Algorithms for NP-Hard Problems
- Irani, Karlin
- 1996
Citation Context: ...age, while cluster representatives are preserved in main memory [42]. We have avoided labeling this model as the online clustering problem or referring to the performance ratio as a competitive ratio [34] for the following reasons. Recall that in an online setting, we would compare the performance of an algorithm to that of an adversary which knows the update sequence in advance but must process the p...

8 | Clustering algorithms. Chapter 16
- Rasmussen
- 1992

6 | On the complexity of some common geometric problems
- Megiddo, Supowit
- 1984

6 | Non-clairvoyant scheduling, Theoret
- Motwani, Torng
- 1994
Citation Context: ...alue r from [1/e, 1] according to the probability density function 1/r, set d1 to rx, and redefine β = e and α = e/(e − 1). Similar randomization of doubling algorithms was used earlier in scheduling [41], and later in other applications [10, 26]. Theorem 3.8. The randomized doubling algorithm has expected performance ratio 2e ≈ 5.437 in any metric space. The same bound is also achieved for the radius...
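The sampling step quoted in the context above, drawing r from [1/e, 1] with density 1/r, can be implemented by inverse-transform sampling: the CDF on that interval is F(r) = 1 + ln r, so r = exp(u − 1) for u uniform on [0, 1]. A small sketch (the function name is an assumption):

```python
import math
import random

def sample_r(rng=random):
    """Draw r from [1/e, 1] with density 1/r via inverse-transform
    sampling. The CDF is F(r) = 1 + ln r, so inverting u = 1 + ln r
    gives r = exp(u - 1) with u uniform on [0, 1]."""
    return math.exp(rng.random() - 1.0)
```

Note the density integrates to 1 over [1/e, 1] with no normalizing constant, and the expected value of r under this density is 1 − 1/e ≈ 0.632.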

5 | A dynamic cluster maintenance system for information retrieval
- Can, Ozkarahan
- 1987
Citation Context: ...orithms are not suitable for maintaining clusters in such a dynamic environment, and they have been struggling with the problem of updating clusters without frequently performing complete reclustering [4, 5, 6, 8, 35]. From a theoretical perspective, many different formulations are possible for this dynamic clustering problem, and it is not clear a priori which of these best addresses the concerns of the practitio...

4 | A global approach to record clustering and file reorganization
- Omiecinski, P
- 1984
Citation Context: ...been observed that such incremental algorithms exhibit good paging performance when the clusters themselves are stored in secondary storage, while cluster representatives are preserved in main memory [32]. We have avoided labeling this model as the online clustering problem or referring to the performance ratio as a competitive ratio [25] for the following reasons. Recall that in an online setting, we...