## A Scalable Algorithm for High-Quality Clustering of Web Snippets

### BibTeX

@MISC{Pellegrini_abstracta,
    author = {Marco Pellegrini and Paolo Pisati and Fabrizio Sebastiani},
    title = {A Scalable Algorithm for High-Quality Clustering of Web Snippets},
    year = {}
}

### Abstract

We consider the problem of partitioning, in a highly accurate and highly efficient way, a set of n documents lying in a metric space into k non-overlapping clusters. We augment the well-known furthest-point-first algorithm for k-center clustering in metric spaces with a filtering scheme based on the triangular inequality. We apply this algorithm to Web snippet clustering, comparing it against strong baselines consisting of recent, fast variants of the classical k-means iterative algorithm. Our main conclusion is that our method attains solutions of better or comparable accuracy, and does so within a fraction of the time required by the baselines. Our algorithm is thus valuable when, as in Web snippet clustering, either the real-time nature of the task or the large amount of data makes the poorly scalable, traditional clustering methods unsuitable.
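
The combination of furthest-point-first (FPF) selection with a triangular-inequality filter can be illustrated with a short Python sketch. It is not the authors' implementation: the function name `fpf_filtered`, the choice of point 0 as first center, the Euclidean distance, and the specific filter rule (skip a point p when d(c_old, c_new) ≥ 2·d(p, c_old)) are assumptions chosen for illustration; the paper's exact filtering scheme may differ.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance, used here as an example of a metric."""
    return math.dist(a, b)


def fpf_filtered(
    points: Sequence[Point],
    k: int,
    dist: Callable[[Point, Point], float] = euclidean,
) -> Tuple[List[int], List[int], int]:
    """Furthest-point-first k-center heuristic with a triangle-inequality filter.

    Returns (indices of the chosen centers, for each point the position of
    its nearest center in that list, number of distance computations).
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(points)")

    centers = [0]  # assumption: the first center is simply point 0
    nearest = [0] * n
    d_near = [dist(p, points[0]) for p in points]
    computations = n

    while len(centers) < k:
        # The next center is the point furthest from its current center.
        new = max(range(n), key=d_near.__getitem__)
        # Distances from the new center to every existing center.
        d_cc = [dist(points[new], points[c]) for c in centers]
        computations += len(centers)
        centers.append(new)
        new_pos = len(centers) - 1

        for i in range(n):
            # Filter: d(p, c_new) >= d(c_old, c_new) - d(p, c_old), so when
            # d(c_old, c_new) >= 2 * d(p, c_old) the new center cannot be
            # closer to p and the distance need not be computed.
            if d_cc[nearest[i]] >= 2 * d_near[i]:
                continue
            d = dist(points[i], points[new])
            computations += 1
            if d < d_near[i]:
                d_near[i] = d
                nearest[i] = new_pos

    return centers, nearest, computations
```

On well-separated data the filter skips most points that already sit close to their center, which is where the savings in distance computations come from; the clustering returned is the same as that of unfiltered FPF.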

### Citations

2070 | Some methods for classification and analysis of multivariate observations
- MacQueen
- 1967
Citation Context ...th its mean µj and its standard deviation σj; the k initial centroids are obtained through k perturbations, driven by the µj’s and σj’s, of the centroid of all data points [10]. MQ: This is MacQueen’s [8] variant of k-means: the initial centroids are randomly chosen among the input points, and the remaining points are assigned one at a time to the nearest centroid, and each such assignment causes the ...
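
The one-pass scheme described in this excerpt can be sketched in Python as follows. The excerpt is cut off before it says what each assignment causes; the sketch assumes, as in the usual textbook description of MacQueen's method, that the receiving centroid is immediately updated to the mean of its members. The function name `macqueen_kmeans` and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def macqueen_kmeans(
    points: Sequence[Sequence[float]], k: int, seed: int = 0
) -> Tuple[List[List[float]], List[int]]:
    """Single pass of MacQueen-style k-means; returns (centroids, labels)."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(points)")

    order = list(range(n))
    random.Random(seed).shuffle(order)
    seeds, rest = order[:k], order[k:]

    # Initial centroids are k randomly chosen input points.
    centroids = [list(points[i]) for i in seeds]
    counts = [1] * k
    labels = [-1] * n
    for j, i in enumerate(seeds):
        labels[i] = j

    # Remaining points are assigned one at a time to the nearest centroid.
    for i in rest:
        p = points[i]
        j = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
        )
        counts[j] += 1
        # Assumed update: running mean of the cluster's members.
        centroids[j] = [m + (a - m) / counts[j] for m, a in zip(centroids[j], p)]
        labels[i] = j

    return centroids, labels
```

Because the centroid is kept as a running mean, each returned centroid equals the mean of the points labelled with it, whatever order the points were visited in.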

465 | A Comparison of Document Clustering Techniques [R
- Steinbach, Karypis, et al.
Citation Context ...with C = {C1, .., Ck} the outcome of the clustering algorithm. As measures of accuracy we use F1, entropy and purity, three well-known measures that have been widely used in text clustering (see e.g. [1, 13] and the references therein). As evident from Table 1, the three measures tend to rank our algorithms in a fairly consistent way. F1 is the harmonic mean of precision (π) and recall (ρ). Given a clust...
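
For readers unfamiliar with these measures, the sketch below computes purity and entropy of a clustering against ground-truth classes, plus F1 as the harmonic mean of precision and recall. The excerpt breaks off before giving the paper's definitions, so the formulas here are the common textbook ones (entropy in bits, not normalized) and may differ in detail from those used by the authors; all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def _contingency(
    clusters: Sequence[Hashable], classes: Sequence[Hashable]
) -> Dict[Hashable, Counter]:
    """Map each cluster label to a count of the true classes it contains."""
    if len(clusters) != len(classes) or not clusters:
        raise ValueError("label sequences must be non-empty and equally long")
    table: Dict[Hashable, Counter] = defaultdict(Counter)
    for c, t in zip(clusters, classes):
        table[c][t] += 1
    return table


def purity(clusters: Sequence[Hashable], classes: Sequence[Hashable]) -> float:
    """Fraction of points that belong to the majority class of their cluster."""
    table = _contingency(clusters, classes)
    return sum(max(row.values()) for row in table.values()) / len(clusters)


def entropy(clusters: Sequence[Hashable], classes: Sequence[Hashable]) -> float:
    """Size-weighted average of the class entropy (in bits) of each cluster."""
    table = _contingency(clusters, classes)
    n = len(clusters)
    total = 0.0
    for row in table.values():
        size = sum(row.values())
        h = -sum((c / size) * math.log2(c / size) for c in row.values())
        total += (size / n) * h
    return total
```

Higher purity and F1 indicate a better match with the ground truth, while lower entropy is better; a clustering that reproduces the classes exactly has purity 1 and entropy 0.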

420 | Reexamining the cluster hypothesis: scatter/gather on retrieval results
- Hearst, Pedersen
- 1996
Citation Context ...ance computations, thus significantly speeding up the computation with respect to the standard implementation. In the context of on-line applications to IR, the algorithm of the Scatter/Gather system [6] is often used as a baseline. Scatter/Gather employs a variant of Lloyd’s k-means algorithm with additional split/join operations on clusters intended to improve the accuracy of lower-accuracy cluster...

357 | Web document clustering: A feasibility demonstration
- Zamir, Etzioni
- 1996
Citation Context ...on of the search results into clusters must be generated on-the-fly. In this paper we propose an algorithm for real-time clustering, and discuss its accuracy and efficiency for Web snippet clustering [4, 15]. In the terminology of Web search engines, a snippet is the concise description of a Web page p that is returned to the user as a result of a query, and usually consists of at least (i) the title of ...

300 | Clustering to minimize the maximum intercluster distance
- Gonzalez
- 1985
Citation Context ...function that is a metric; (d) the partition should be highly accurate and should be found quickly. Our algorithm is a variation of the furthest-point-first (FPF) algorithm for k-center clustering [5], which attempts to minimize the radius of the widest cluster over all possible groupings into k clusters (k-clusterings) of a set of n documents in a metric space. We modify the FPF algorithm by using...

174 | Optimal algorithms for approximate clustering
- Feder, Greene
- 1988
Citation Context ...ting of a group of n documents preclassified under k predefined classes. The accuracy of the clustering engine will be defined in terms of its ability to replicate this ground truth. ... algorithm [3] is an attempt to solve the so-called k-center clustering problem, defined as follows: The k-centers problem: Given a set S of points in a metric space M endowed with a metric distance function D, and...

163 | Impact of Similarity Measures on Web-page Clustering
- Strehl, Ghosh, et al.
- 2002
Citation Context ... of using distance functions other than sine. We have obtained interesting results by using a slight modification of the standard Jaccard Coefficient, which we call Weighted Jaccard Coefficient (WJC) [14]. The WJC takes advantage of the intrinsic structure of the ODP snippets, by weighting different parts of the snippet (title, body, URL) differently; more precisely, we assign weight 3 to the title, w...
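
A field-weighted Jaccard distance of this kind might look like the Python sketch below. Only the title weight of 3 comes from the excerpt; the excerpt is truncated before the body and URL weights, so the values used for them in the usage note are placeholders. The aggregation rule (a term's weight is the sum of the weights of the fields it occurs in) and the min/max generalization of the Jaccard coefficient are likewise assumptions, not the paper's stated definition.

```python
from typing import Dict


def term_weights(
    snippet: Dict[str, str], field_weights: Dict[str, float]
) -> Dict[str, float]:
    """Weight each term by the fields it appears in (assumed: weights add up)."""
    weights: Dict[str, float] = {}
    for field, text in snippet.items():
        w = field_weights[field]
        for term in text.lower().split():
            weights[term] = weights.get(term, 0.0) + w
    return weights


def weighted_jaccard_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """1 - sum(min weights) / sum(max weights) over the union of terms."""
    terms = set(a) | set(b)
    numerator = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    denominator = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    if denominator == 0:
        return 0.0  # two empty snippets are treated as identical
    return 1.0 - numerator / denominator
```

For example, with `field_weights = {"title": 3.0, "body": 1.0, "url": 1.0}` (body and URL values being placeholders), two snippets sharing only a title word are closer than two sharing only a body word, which is the effect the field weighting is meant to achieve.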

146 | Similarity Search: The Metric Space Approach, volume 32 of Advances in Database Systems
- Zezula, Amato, et al.
- 2006
Citation Context ... ⇔ x = y (identity); (iv) d(x, y) ≤ d(x, z) + d(z, y) (triangular inequality). Metrics include, as special cases, Euclidean distance and other distance functions commonly used in IR applications. See [16] for a detailed treatment. Scatter/Gather inherits the high computational costs of the standard k-means algorithm. For this reason, Phillips’ recently proposed fast variants of k-means [11], which als...

106 | An empirical comparison of four initialization methods for the k-means algorithm
- Peña, Lozano, et al.
Citation Context ...confirm this, that the quality of the initialization (i.e. the choice of the initial k centroids) has a deep impact on the resulting accuracy. Several methods for initializing k-means are compared in [10]. As our baselines we have chosen the three such methods that are most amply cited in the literature while being at the same time relatively simple; we have ruled out more advanced and complex initial...

95 | A personalized search engine based on web-snippet hierarchical clustering
- Ferragina, Gulli
- 2005
Citation Context ...on of the search results into clusters must be generated on-the-fly. In this paper we propose an algorithm for real-time clustering, and discuss its accuracy and efficiency for Web snippet clustering [4, 15]. In the terminology of Web search engines, a snippet is the concise description of a Web page p that is returned to the user as a result of a query, and usually consists of at least (i) the title of ...

58 | K-means type algorithms: A generalized convergence theorem and characterization of local optimality
- Selim, Ismail
- 1984
Citation Context ...cluster quality booster. It takes as input a rough k-clustering (or, more precisely, k candidate centroids) and produces as output another k-clustering, hopefully of better quality. It has been shown [12] that by using the sum of squared Euclidean distances as objective function, the procedure converges to a local minimum for the objective function within a finite number of iterations. The main bui...
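
The refinement procedure described in this excerpt corresponds to the standard assign-and-update iteration of k-means, sketched below in Python. This is a generic Lloyd-style loop under stated assumptions, not code from the paper: the names `lloyd`, `assign` and `sse`, the `max_iter` cap, and the rule that an empty cluster keeps its previous centroid are all illustrative choices.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: Sequence[Point], centroids: Sequence[Point]) -> List[int]:
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        for p in points
    ]


def sse(points: Sequence[Point], centroids: Sequence[Point], labels: Sequence[int]) -> float:
    """Objective: sum of squared distances of points to their centroids."""
    return sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))


def lloyd(
    points: Sequence[Point], centroids: Sequence[Point], max_iter: int = 100
) -> Tuple[List[Point], List[int]]:
    """Refine k candidate centroids until the assignment stops changing."""
    current = [tuple(c) for c in centroids]
    labels = assign(points, current)
    for _ in range(max_iter):
        updated: List[Point] = []
        for j, old in enumerate(current):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(old)  # assumed: empty clusters keep their centroid
        new_labels = assign(points, updated)
        current = updated
        if new_labels == labels:
            break  # assignment is stable: a local minimum has been reached
        labels = new_labels
    return current, labels
```

Neither step can increase the sum of squared distances, and there are finitely many assignments, which is the intuition behind the finite-convergence result cited in the excerpt.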

23 | Ephemeral document clustering for web applications. IBM research report RJ
- Maarek, Fagin, et al.
- 1018
Citation Context ...lts are attained when full documents are used instead of their snippets, thus confirming the fact that Web search results can be effectively clustered by looking at their snippets only. Maarek et al. [7] characterize the challenges inherent in Web snippet clustering, and propose an algorithm based on complete-link hierarchical agglomerative clustering. The Lingo [9], Shoc [17] and Eigencluster [1] sy...

18 | Online Clustering of Web Search Results
- Semantic
- 2001
Citation Context ...pets only. Maarek et al. [7] characterize the challenges inherent in Web snippet clustering, and propose an algorithm based on complete-link hierarchical agglomerative clustering. The Lingo [9], Shoc [17] and Eigencluster [1] systems all tackle Web snippet clustering by performing a singular value decomposition of the term-document incidence matrix; the problem is that SVD is extremely time-consuming,...

12 | Acceleration of k-means and related clustering algorithms
- Phillips
- 2002
Citation Context ...tions. See [16] for a detailed treatment. Scatter/Gather inherits the high computational costs of the standard k-means algorithm. For this reason, Phillips’ recently proposed fast variants of k-means [11], which also exploit filters based on the triangular inequality, are stronger baselines, and are the ones that we will check our algorithm against. In this paper we show that our method attains soluti...

11 | On a recursive spectral algorithm for clustering from pairwise similarities
- Cheng, Kannan, et al.
Citation Context ...l. [7] characterize the challenges inherent in Web snippet clustering, and propose an algorithm based on complete-link hierarchical agglomerative clustering. The Lingo [9], Shoc [17] and Eigencluster [1] systems all tackle Web snippet clustering by performing a singular value decomposition of the term-document incidence matrix; the problem is that SVD is extremely time-consuming, hence problematic wh...

6 | Conceptual clustering using lingo algorithm: Evaluation on open directory project data. In Intelligent information processing and web mining
- Osiński, Weiss
Citation Context ...their snippets only. Maarek et al. [7] characterize the challenges inherent in Web snippet clustering, and propose an algorithm based on complete-link hierarchical agglomerative clustering. The Lingo [9], Shoc [17] and Eigencluster [1] systems all tackle Web snippet clustering by performing a singular value decomposition of the term-document incidence matrix; the problem is that SVD is extremely time...

5 | A Topology-Driven Approach to the Design of Web Meta-Search Clustering Engines
- Giacomo, Didimo, et al.
- 2005
Citation Context ...on that has run over the deadline. In Table 1 we report clustering time and output quality of several variants of k-center and k-means on a sample of 12 queries subdivided, according to the method of [2], into three broad families: ambiguous queries (armstrong, jaguar, mandrake, java), generic queries (health, language, machine, music, clusters), and specific queries (mickey mouse, olympic games, steve...