## Cluster generation and cluster labelling for web snippets: A fast and accurate hierarchical solution (2006)

Venue: Proceedings of the 13th Symposium on String Processing and Information Retrieval (SPIRE 2006)

Citations: 10 (2 self)

### BibTeX

@INPROCEEDINGS{Pellegrini06cluster,

author = {Marco Pellegrini and Marco Maggini and Fabrizio Sebastiani},

title = {Cluster generation and cluster labelling for web snippets: A fast and accurate hierarchical solution},

booktitle = {Proceedings of the 13th Symposium on String Processing and Information Retrieval (SPIRE 2006)},

year = {2006}

}

### Abstract

This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted “external” metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms.

### Citations

8550 | Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context ...with the standard Jaccard Distance. The candidate words selection algorithm. We select candidate terms for labelling the generated clusters through a modified version of the information gain function [2]. For term t and category c, information gain is defined as $IG(t, c) = \sum_{x \in \{t, \bar{t}\}} \sum_{y \in \{c, \bar{c}\}} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}$. Intuitively, IG measures the amount of information that each argument conta... |
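The information-gain computation in the excerpt above can be sketched in a few lines of Python. This is an illustrative helper, not code from the paper; the function name and the choice of a base-2 logarithm are assumptions, and the four joint-probability cells are derived from P(t, c) and the marginals:

```python
from math import log2

def information_gain(p_tc, p_t, p_c):
    """IG(t, c): sum over x in {t, not-t} and y in {c, not-c} of
    P(x, y) * log(P(x, y) / (P(x) * P(y))).

    p_tc is the joint P(t, c); p_t and p_c are the marginals.
    Base-2 logarithm assumed, so the result is in bits."""
    ig = 0.0
    for x_present, px in ((True, p_t), (False, 1 - p_t)):
        for y_present, py in ((True, p_c), (False, 1 - p_c)):
            # Derive each of the four joint cells from p_tc and the marginals.
            if x_present and y_present:
                pxy = p_tc
            elif x_present:
                pxy = p_t - p_tc
            elif y_present:
                pxy = p_c - p_tc
            else:
                pxy = 1 - p_t - p_c + p_tc
            if pxy > 0:  # 0 * log(0) is taken as 0
                ig += pxy * log2(pxy / (px * py))
    return ig
```

When term and category are independent (P(t, c) = P(t)P(c)) the gain is 0; when they coincide exactly (P(t) = P(c) = P(t, c)) it equals the entropy of the category.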

2355 | Modern Information Retrieval
- Baeza-Yates, Ribeiro-Neto
- 1999
Citation Context ...alized variants of the Entropy and the Mutual Information [4], Armil achieves better performance levels by 10%. Note that, since the normalization reduces the ranges of these measures to the interval [0, 1], an increase of 10% is noteworthy. We have tested the clustering effectiveness of Armil against variants of the classical k-means clustering algorithm. Our method outperforms the k-means clustering a... |

1848 | Some methods for classification and analysis of multivariate observations
- MacQueen
- 1967
Citation Context ...d Etzioni [12, 6] propose a Web snippet clustering mechanism (Suffix Tree Clustering – STC) based on suffix arrays, and experimentally compare STC with algorithms such as k-means, single-pass k-means [18], Backshot and Fractionation [19], and Group Average Hierarchical Agglomerative Clustering. They test the systems on a benchmark obtained by issuing 10 queries to the Metacrawler meta-search engine, r... |

950 | A comparative study on feature selection in text categorization
- Yang, Pedersen
- 1997
Citation Context ...ng M-FPF on the whole input set. 3.2 The candidate words selection algorithm. We select candidate terms for labelling the generated clusters through a modified version of the information gain function [4, 37]. Let x be a document taken uniformly at random in the set of documents D. P(t) is the probability that x contains term t, P(c) is the probability that x is in category c. The complementary events a... |

839 | Least squares quantization in PCM
- Lloyd
- 1982
Citation Context ... set of experiments in this Section compare our variant of the Furthest-Point-First algorithm for k-center clustering with three recently proposed, fast variants of k-means. The k-means algorithm (see [8, 23]) can be seen as an iterative cluster quality booster. It takes as input a rough k-clustering (or, more precisely, k candidate centroids) and produces as output another k-clustering, hopefully of bett... |
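The “iterative cluster quality booster” view of k-means described in the excerpt can be sketched as one refinement round: assign every point to its nearest centroid, then recompute each centroid as the mean of its cluster. A minimal illustration in plain Python with Euclidean distance (not one of the fast variants benchmarked in the paper):

```python
def lloyd_round(points, centroids):
    """One k-means refinement round: assign each point to its nearest
    centroid (squared Euclidean distance), then recompute each centroid
    as the mean of its cluster. Points and centroids are equal-length
    numeric tuples; an empty cluster keeps its old centroid."""
    k = len(centroids)
    clusters = [[] for _ in range(k)]
    for p in points:
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        clusters[j].append(p)
    new_centroids = [
        tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
        for i, c in enumerate(clusters)
    ]
    return new_centroids, clusters
```

Iterating `lloyd_round` until the centroids stop moving is the classical Lloyd/k-means procedure the excerpt refers to; its output quality depends heavily on the initial k candidate centroids.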

622 | Scatter/Gather: a cluster-based approach to browsing large document collections
- Cutting, Karger, et al.
- 1992
Citation Context ...nippet clustering mechanism (Suffix Tree Clustering – STC) based on suffix arrays, and experimentally compare STC with algorithms such as k-means, single-pass k-means [18], Backshot and Fractionation [19], and Group Average Hierarchical Agglomerative Clustering. They test the systems on a benchmark obtained by issuing 10 queries to the Metacrawler meta-search engine, retaining the top-ranked 200 snipp... |

382 | Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results
- Hearst, Pedersen
- 1996
Citation Context ...d by equating the effectiveness of the system with the average precision of the highest-precision clusters that collectively contain 10% of the input documents. This methodology had been advocated in [15], and is based on the assumption that the users will anyway be able to spot the clusters most relevant to their query. Average precision as computed with this method ranges from 0.2 to 0.4 for all the... |

330 | Web document clustering: a feasibility demonstration
- Zamir, Etzioni
- 1998
Citation Context ...l Web services such as Copernic, Dogpile, Groxis, iBoogie, Kartoo, Mooter, and Vivisimo seems to confirm the validity of the approach. Academic research prototypes are also available, such as Grouper [12, 6], EigenCluster [13], Shoc [14], and SnakeT [3]. Generally, details of the algorithms underlying the commercial Web services are not in the public domain. Maarek et al. [15] give a precise characteriza... |

285 | An Examination of Procedures for Determining the Number of Clusters in a Data Set
- Milligan, Cooper
- 1985
Citation Context ...also because the “optimal” number of clusters to be displayed is a function of the goals and preferences of the user. Also, while techniques for automatically determining such an optimal number do exist [9], their computational cost is incompatible with the real-time nature of our application. Therefore, besides providing a default value, we allow the user to increase or decrease the value of k to his/he... |

280 | Clustering to minimize the maximum intercluster distance, Theoretical Computer Science
- Gonzalez
- 1985
Citation Context ...ts a very important property of the FPF algorithm, i.e. it is within a factor 2 of the optimal solution for the k-center problem [7]. (The Armil system can be freely accessed at http://armil.iit.cnr.it/.) The second interesting property of M-FPF is that it does not compute centroids of clusters. Centroids tend to be dense vectors and, as such, their computation and/or update in high-dimensional space... |

229 | Similarity Estimation Techniques from Rounding Algorithms
- Charikar
- 2002
Citation Context ..., (ii) has performed at the same level of accuracy as (i), but has proven much faster to compute. In this paper we improve on the results of [1] by using the Generalized Jaccard Distance described in [22]. Given two “bag-of-words” snippet vectors $s_1 = (s_1^1, \ldots, s_1^h)$ and $s_2 = (s_2^1, \ldots, s_2^h)$, the Generalized Jaccard Distance is: $D(s_1, s_2) = 1 - \frac{\sum_i \min(s_1^i, s_2^i)}{\sum_i \max(s_1^i, s_2^i)}$. The term we... |
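The Generalized Jaccard Distance quoted above is straightforward to implement for two equal-length, non-negative weight vectors; a minimal sketch (the function name is ours):

```python
def generalized_jaccard_distance(s1, s2):
    """Generalized Jaccard Distance between two bag-of-words vectors:
    D(s1, s2) = 1 - sum_i min(s1[i], s2[i]) / sum_i max(s1[i], s2[i]).
    Assumes non-negative weights and equal length."""
    num = sum(min(a, b) for a, b in zip(s1, s2))
    den = sum(max(a, b) for a, b in zip(s1, s2))
    # Two all-zero vectors are conventionally at distance 0 here.
    return 1.0 - num / den if den else 0.0
```

On 0/1 vectors this reduces to the standard Jaccard distance between the underlying term sets.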

214 | BIRCH: An efficient data clustering method for very large databases
- Zhang, Ramakrishnan, et al.
- 1996
Citation Context ...y has a sparse representation. Unfortunately, the methods proposed in the literature for finding high-quality medoids are not compatible with the real-time nature of the envisioned online Web service [42]. In this paper we will give a fast and effective medoid searching technique. […] The possibility to extract good labels directly from the available snippets is strongly... |

183 | Cluster analysis of multivariate data: efficiency versus interpretability of classifications, Biometrics 21
- Forgy
- 1965
Citation Context ... set of experiments in this Section compare our variant of the Furthest-Point-First algorithm for k-center clustering with three recently proposed, fast variants of k-means. The k-means algorithm (see [8, 23]) can be seen as an iterative cluster quality booster. It takes as input a rough k-clustering (or, more precisely, k candidate centroids) and produces as output another k-clustering, hopefully of bett... |

123 | Learning to cluster web search results
- Zeng, He, et al.
- 2004
Citation Context ...erative clustering that is quadratic in the number n of snippets. They introduce a technique called “lexical affinity” whereby the co-occurrence of words influences the similarity metric. Zeng et al. [16] tackle the problem of detecting good cluster names as preliminary to the formation of the clusters, using a supervised learning approach. Note that the methods considered in our paper are instead all... |

120 | Building efficient and effective metasearch engines
- Meng, Yu, et al.
- 2002
Citation Context ...wever, without an accurate design, MSEs might in principle even worsen the quality of the information access experience, since the user is typically confronted with an even larger set of results (see [27] for a recent survey of challenges and techniques related to building meta-search engines). Thus, key issues to be faced by MSEs concern the exploitation of effective algorithms for merging the ranked... |

101 | Fast and Intuitive Clustering of Web Documents
- Zamir, Madani, et al.
- 1997
Citation Context ... different balance between the two aspects. Some systems (e.g. [3, 4]) view label extraction as the primary goal, and clustering is a by-product of the label extraction procedure. Other systems (e.g. [5, 6]) view instead the formation of clusters as the most important step, and the labelling phase is considered as strictly dependent on the clusters found. We have followed this latter approach. In order ... |

98 | An empirical comparison of four initialization methods for the K-Means algorithm, Pattern recognition letters 20
- Pena, Lozano, et al.
- 1999
Citation Context ...confirm this, that the quality of the initialization (i.e. the choice of the initial k centroids) has a deep impact on the resulting accuracy. Several methods for initializing k-means are compared in [29]. As our baselines we have chosen the three such methods that are most amply cited in the literature while being at the same time relatively simple; we have ruled out more advanced and complex initial... |

84 | A Personalized Search Engine Based on Web-snippet Hierarchical Clustering
- Ferragina, Gulli
- 2005
Citation Context ... and labelling are both essential operations for a Web snippet clustering system. However, each previously proposed such system strikes a different balance between the two aspects. Some systems (e.g. [3, 4]) view label extraction as the primary goal, and clustering is a by-product of the label extraction procedure. Other systems (e.g. [5, 6]) view instead the formation of clusters as the most important ... |

78 | Sublinear time algorithms for metric space problems
- Indyk
- 1999
Citation Context ...and maintaining these distances, which is anyhow dominated by the term O(nk). Using medoids. The M-FPF is applied to a random sample of size $\sqrt{nk}$ of the input points (this sample size is suggested in [21]). Afterwards the remaining points are associated to the closest (according to the Generalized Jaccard Distance) center. We obtain improvements in quality by making an iterative update of the “center”... |
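The sample-then-assign scheme in the excerpt (cluster a random sample of about √(nk) points, then attach every remaining point to its nearest center) can be sketched as follows. Here `cluster_sample` stands in for the M-FPF clustering step and `distance` for the Generalized Jaccard Distance; both are parameters, and the function is an illustrative reconstruction rather than the paper's code:

```python
import math
import random

def sample_and_assign(points, k, cluster_sample, distance):
    """Cluster a random sample of size ~sqrt(n*k), then attach every
    remaining point to its nearest sample center.

    cluster_sample: any function mapping (sample, k) to a list of
    hashable centers (e.g. an FPF implementation).
    distance: the metric used for the final assignment."""
    n = len(points)
    m = min(n, int(math.sqrt(n * k)))
    sample = random.sample(points, m)
    centers = cluster_sample(sample, k)
    clusters = {c: [] for c in centers}
    for p in points:
        nearest = min(centers, key=lambda c: distance(p, c))
        clusters[nearest].append(p)
    return clusters
```

Clustering only the sample keeps the expensive step sublinear in n; the final pass is a single nearest-center scan over all points.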

69 | Evaluating strategies for similarity search on the web
- Haveliwala, Gionis, et al.
- 2002
Citation Context ...n. Thus $NCE(C', C) = \sum_{l=1}^{|C'|} \frac{|c'_l|}{n'} NCE(c'_l, C)$. NCE values reported below are thus obtained on the full set of snippets returned by Vivisimo. Establishing the ground truth. Following [24], we have made a series of experiments using as input the snippets resulting from queries issued to the Open Directory Project (ODP – see Footnote 6). The ODP is a searchable Web-based directory consi... |

59 | A hierarchical monothetic document clustering algorithm for summarization and browsing search results
- Kummamuru, Lotlikar, et al.
- 2004
Citation Context ... different balance between the two aspects. Some systems (e.g. [3, 4]) view label extraction as the primary goal, and clustering is a by-product of the label extraction procedure. Other systems (e.g. [5, 6]) view instead the formation of clusters as the most important step, and the labelling phase is considered as strictly dependent on the clusters found. We have followed this latter approach. In order ... |

54 | K-means type algorithms: a generalized convergence theorem and characterization of local optimality
- Selim, Ismail
- 1984
Citation Context ...cluster quality booster. It takes as input a rough k-clustering (or, more precisely, k candidate centroids) and produces as output another k-clustering, hopefully of better quality. It has been shown [31] that by using the sum of squared Euclidean distances as objective function, the procedure converges to a local minimum for the objective function within a finite number of iterations. The main bu... |

51 | Comparing clusterings
- Meila
- 2003
Citation Context ... . The original labelling is thus viewed as the latent, hidden structure that the clustering system must discover. The measure we use is normalized mutual information (see e.g. [33, page 110], [4], and [26]), i.e. $NMI(C, C') = \frac{2}{\log(|C| |C'|)} \sum_{c \in C} \sum_{c' \in C'} P(c, c') \log \frac{P(c, c')}{P(c) P(c')}$, where P(c) represents the proba... |
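Given the full joint distribution P(c, c') as a table, the NMI measure quoted above can be evaluated directly. A minimal sketch (natural logarithm; since both the mutual information and the log(|C||C'|) normalizer scale the same way, the base cancels out):

```python
from math import log

def nmi(joint):
    """Normalized mutual information between labelings C and C', given
    joint[i][j] = P(c_i, c'_j) as a full joint probability table:
    NMI = (2 / log(|C| * |C'|)) * sum_ij P(i, j) log(P(i, j) / (P(i) P(j)))."""
    rows = [sum(r) for r in joint]            # marginals P(c_i)
    cols = [sum(c) for c in zip(*joint)]      # marginals P(c'_j)
    mi = sum(
        p * log(p / (rows[i] * cols[j]))
        for i, r in enumerate(joint)
        for j, p in enumerate(r)
        if p > 0                              # 0 * log(0) taken as 0
    )
    return 2.0 * mi / log(len(rows) * len(cols))
```

It evaluates to 1 for two identical labelings with equal-probability clusters and to 0 when the two labelings are independent.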

50 | Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining
- Strehl
- 2002

45 | The effectiveness of query-specific hierarchic clustering in information retrieval
- Tombros, Villa, et al.
- 2002
Citation Context ...re research are discussed. 2 Previous work. Tools for clustering Web snippets have received attention in the research community. In the past, this approach has had both critics [21, 20] and supporters [35], but the proliferation of commercial Web services such as Copernic, Dogpile, Groxis, iBoogie, Kartoo, Mooter, and Vivisimo seems to confirm the potential validity of the approach. Academic research p... |

41 | Scalable techniques for clustering the web
- Haveliwala, Gionis, et al.
- 2000
Citation Context ... of the fact that duplicates and near-duplicates are easy to detect via multiple hashing or shingling techniques, and these tasks can be carried out, at least in part, in an off-line setting (see e.g. [13]). Moreover, in the duplicate detection activity labelling is not an issue. This paper describes the Armil system, a meta-search engine that organizes the Web snippets retrieved from auxiliary searc... |

35 | A best possible approximation algorithm for the k-center problem
- Hochbaum, Shmoys
- 1985
Citation Context ...enters” $\mu_1, \ldots, \mu_k \in M$ so that the radius $\max_j \max_{x \in C_j} D(x, \mu_j)$ of the widest cluster is minimized. The k-center problem can be solved approximately using the furthest-point-first (FPF) algorithm [7, 20], which we now describe. Given a set S of n points, FPF builds a sequence $T_1 \subset \ldots \subset T_k = T$ of k sets of “centers” (with $T_i = \{\mu_1, \ldots, \mu_i\} \subset S$) in the following way. 1. At the end of iteration i... |
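The FPF construction described in the excerpt can be sketched directly: seed with an arbitrary point, repeatedly promote the point furthest from the current center set, and keep per-point nearest-center distances so each of the k rounds costs O(n) distance updates. An illustrative Python version (the basic FPF scheme, not the paper's M-FPF variant):

```python
def fpf(points, k, distance):
    """Furthest-point-first for k-center clustering: start from an
    arbitrary point, then repeatedly add as a new center the point
    furthest from the current centers; finally assign every point to
    its nearest center. Yields a 2-approximation of the optimal
    k-center radius (Gonzalez; Hochbaum-Shmoys)."""
    centers = [points[0]]
    # nearest[i] = distance from points[i] to its closest chosen center
    nearest = [distance(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[i])
        nearest = [min(d, distance(p, points[i]))
                   for p, d in zip(points, nearest)]
    clusters = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda c: distance(p, centers[c]))
        clusters[j].append(p)
    return centers, clusters
```

Maintaining the `nearest` array is what keeps the total cost at O(nk) distance evaluations, the property the surrounding contexts rely on.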

30 | Generating hierarchical summaries for web searches
- Lawrie, Croft
- 2003
Citation Context ...nd labelling are both essential operations for a Web snippet clustering system. However, each previously proposed such system strikes a different balance between these two aspects. Some systems (e.g. [7, 22]) view label extraction as the primary goal, and clustering as a by-product of the label extraction procedure. Other systems (e.g. [19, 39]) view instead the formation of clusters as the most importan... |

21 | Ephemeral document clustering for Web applications
- Maarek, Fagin, et al.
- 2000
Citation Context ...ailable, such as Grouper [12, 6], EigenCluster [13], Shoc [14], and SnakeT [3]. Generally, details of the algorithms underlying the commercial Web services are not in the public domain. Maarek et al. [15] give a precise characterization of the challenges inherent in Web snippet clustering, and propose an algorithm based on complete-link hierarchical agglomerative clustering that is quadratic in the nu... |

18 | Semantic, Hierarchical, Online Clustering of Web Search Results
- Zhang, Dong
- 2001
Citation Context ...roxis, iBoogie, Kartoo, Mooter, and Vivisimo seems to confirm the potential validity of the approach. Academic research prototypes are also available, such as Grouper [38, 39], EigenCluster [3], Shoc [41], and SnakeT [7]. Generally, details of the algorithms underlying commercial Web services are not in the public domain. Maarek et al. [24] give a precise characterization of the challenges inherent in... |

13 | Semantic, hierarchical, online clustering of Web search results
- Zhang, Y.
Citation Context ...Dogpile, Groxis, iBoogie, Kartoo, Mooter, and Vivisimo seems to confirm the validity of the approach. Academic research prototypes are also available, such as Grouper [12, 6], EigenCluster [13], Shoc [14], and SnakeT [3]. Generally, details of the algorithms underlying the commercial Web services are not in the public domain. Maarek et al. [15] give a precise characterization of the challenges inheren... |

12 | On a recursive spectral algorithm for clustering from pairwise similarities
- Cheng, Kannan, et al.
Citation Context ... Copernic, Dogpile, Groxis, iBoogie, Kartoo, Mooter, and Vivisimo seems to confirm the validity of the approach. Academic research prototypes are also available, such as Grouper [12, 6], EigenCluster [13], Shoc [14], and SnakeT [3]. Generally, details of the algorithms underlying the commercial Web services are not in the public domain. Maarek et al. [15] give a precise characterization of the challen... |

10 | Conceptual clustering using Lingo algorithm: Evaluation on Open Directory Project data
- Osiński, Weiss
- 2003
Citation Context ...rmation of the clusters, using a supervised learning approach. Note that the methods considered in our paper are instead all unsupervised, thus requiring no labelled data. The EigenCluster [3], Lingo [28], and Shoc [41] systems all tackle Web snippet clustering by performing a singular value decomposition of the term-document incidence matrix; the problem with this approach is that SVD is extremely ... |

9 | Deciphering cluster representations
- Kural, Robertson, et al.
Citation Context ...per with more details is in [8]. 2 Previous work. Tools for clustering Web snippets have recently become a focus of attention in the research community. In the past, this approach has had both critics [9, 10] and supporters [11], but the proliferation of commercial Web services such as Copernic, Dogpile, Groxis, iBoogie, Kartoo, Mooter, and Vivisimo seems to confirm the validity of the approach. Academic ... |

7 | A scalable algorithm for high-quality clustering of web snippets
- Geraci, Pellegrini, et al.
- 2006
Citation Context ...le is played by the clustering component and by the labelling component. Clustering is accomplished by means of an improved version of the furthest-point-first (FPF) algorithm for k-center clustering [1]. To the best of our knowledge this algorithm had never been used in the context of Web snippet clustering or text clustering. The generation of the cluster labels is instead accomplished by means of ... |

7 | Multitasking web search on vivisimo.com
- Spink, Koshman, et al.
Citation Context ... we restrict our source of snippets to the ODP directory. Vivisimo’s clustering search engine has been the focus of research in the field of Human Computer Interface and Information systems (see e.g. [32, 18]), with an emphasis on understanding user behavior by analyzing query logs. 6.2 Measuring clustering quality. Following a consolidated practice, in this paper we measure the effectiveness of a clusteri... |

5 | Clustering Information Retrieval Search Outputs
- Kural
- 1999
Citation Context ...per with more details is in [8]. 2 Previous work. Tools for clustering Web snippets have recently become a focus of attention in the research community. In the past, this approach has had both critics [9, 10] and supporters [11], but the proliferation of commercial Web services such as Copernic, Dogpile, Groxis, iBoogie, Kartoo, Mooter, and Vivisimo seems to confirm the validity of the approach. Academic ... |

5 | Conceptual clustering using Lingo algorithm: Evaluation on Open Directory Project data
- Osinski, Weiss
- 2004
Citation Context ...mation of the clusters, using a supervised learning approach. Note that the methods considered in our paper are instead all unsupervised, thus requiring no labelled data. The EigenCluster [13], Lingo [17], and Shoc [14] systems all tackle Web snippet clustering by performing a singular value decomposition of the term-document incidence matrix; the problem with this approach is that SVD is extremely... |

5 | A Topology-Driven Approach to the Design of Web Meta-Search Clustering Engines
- Giacomo, Didimo, et al.
- 2005
Citation Context ...ion that has run over the deadline. In Table 2 we report clustering time and output quality of several variants of k-center and k-means on a sample of 12 queries subdivided, according to the method of [6], in three broad families: ambiguous queries (armstrong, jaguar, mandrake, java), generic queries (health, language, machine, music, clusters), and specific queries (mickey mouse, olympic games, steve... |

4 | Using Clusters on the Vivisimo Web Search Engine
- Koshman, Spink, et al.
Citation Context ... we restrict our source of snippets to the ODP directory. Vivisimo’s clustering search engine has been the focus of research in the field of Human Computer Interface and Information systems (see e.g. [32, 18]), with an emphasis on understanding user behavior by analyzing query logs. 6.2 Measuring clustering quality. Following a consolidated practice, in this paper we measure the effectiveness of a clusteri... |

3 | Impact of Similarity Measures on Web Page Clustering
- Strehl, Ghosh, et al.
- 2000
Citation Context ...y). Interestingly, the authors show that very similar results are attained when full documents are used instead of their snippets, thus validating the snippet-based clustering approach. Strehl et al. [34] experiment with four similarity measures (Cosine similarity, Euclidean distance, Pearson Correlation, extended Jaccard) in combination with five algorithms (k-means, self-organizing maps, random clus... |

2 | The effectiveness of query-specific hierarchic clustering in information retrieval
- Tombros, Villa, et al.
- 2002
Citation Context ...s in [8]. 2 Previous work. Tools for clustering Web snippets have recently become a focus of attention in the research community. In the past, this approach has had both critics [9, 10] and supporters [11], but the proliferation of commercial Web services such as Copernic, Dogpile, Groxis, iBoogie, Kartoo, Mooter, and Vivisimo seems to confirm the validity of the approach. Academic research prototypes ... |

1 | Easy and hard bottleneck location problems
- Hsu
- 1979
Citation Context ... that minimizes such maximum diameter). The competitive property of M-FPF is even stronger: the approximation factor of 2 cannot be improved with any polynomial approximation algorithm, unless P = NP [36]. The strength of this formal property has been our main motivation for selecting M-FPF as the algorithmic backbone for Armil. The second interesting property of M-FPF is that it does not compute cent... |