Results 1 - 10
of
17
Clustering the tagged web
- In ACM WSDM ’09
, 2009
"... Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from largescale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from largescale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8 % F-score increase.
Mobile Information Retrieval with Search Results Clustering: Prototypes and Evaluations
- Journal of American Society for Information Science and Technology (JASIST
, 2009
"... Web searches from mobile devices such as PDAs and cell phones are becoming increasingly popular. However, the traditional list-based search interface paradigm does not scale well to mobile devices due to their inherent limitations. In this article, we investigate the application of search results cl ..."
Abstract
-
Cited by 6 (3 self)
- Add to MetaCart
Web searches from mobile devices such as PDAs and cell phones are becoming increasingly popular. However, the traditional list-based search interface paradigm does not scale well to mobile devices due to their inherent limitations. In this article, we investigate the application of search results clustering, used with some success for desktop computer searches, to the mobile scenario. Building on CREDO (Conceptual Reorganization of Documents), a Web clustering engine based on concept lattices, we present its mobile versions Credino and SmartCREDO, for PDAs and cell phones, respectively. Next, we evaluate the retrieval performance of the three prototype systems. We measure the effectiveness of their clustered results compared to a ranked list of results on a subtopic retrieval task, by means of the device-independent notion of subtopic reach time together with a reusable test collection built from Wikipedia ambiguous entries. Then, we make a crosscomparison of methods (i.e., clustering and ranked list) and devices (i.e., desktop, PDA, and cell phone), using an interactive information-finding task performed by external participants. The main finding is that clustering engines are a viable complementary approach to plain search engines both for desktop and mobile searches especially, but not only, for multitopic informational queries.
Inducing Word Senses to Improve Web Search Result Clustering
, 2010
"... In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graphbased clustering algorithm that explo ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graphbased clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.
Improving Quality of Search Results Clustering with Approximate Matrix Factorisations
"... Abstract. In this paper we show how approximate matrix factorisations can be used to organise document summaries returned by a search engine into meaningful thematic categories. We compare four different factorisations (SVD, NMF, LNMF and K-Means/Concept Decomposition) with respect to topic separati ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
Abstract. In this paper we show how approximate matrix factorisations can be used to organise document summaries returned by a search engine into meaningful thematic categories. We compare four different factorisations (SVD, NMF, LNMF and K-Means/Concept Decomposition) with respect to topic separation capability, outlier detection and label quality. We also compare our approach with two other clustering algorithms: Suffix Tree Clustering (STC) and Tolerance Rough Set Clustering (TRC). For our experiments we use the standard merge-thencluster approach based on the Open Directory Project web catalogue as a source of human-clustered document summaries. 1
Full-Subtopic Retrieval with Keyphrase-based Search Results Clustering
"... We consider the problem of retrieving multiple documents relevant to the single subtopics of a given web query, termed “full-subtopic retrieval”. To solve this problem we present a novel search results clustering algorithm that generates clusters labeled by keyphrases. The keyphrases are extracted f ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
We consider the problem of retrieving multiple documents relevant to the single subtopics of a given web query, termed “full-subtopic retrieval”. To solve this problem we present a novel search results clustering algorithm that generates clusters labeled by keyphrases. The keyphrases are extracted from the generalized suffix tree built from the search results and merged through an improved hierarchical agglomerative clustering procedure. We also introduce a novel measure for evaluating full-subtopic retrieval performance, namely “Subtopic Search Length under k document sufficiency”. Using a test collection specifically designed for evaluating subtopic retrieval, we found that our algorithm outperformed both other existing search results clustering algorithms and also a search results re-ranking method that emphasized diversity of results (at least for k>1; i.e., when we are interested in retrieving more than one relevant document per subtopic). Our approach has been implemented into KeySRC (Keyphrase-based Search Results Clustering), a full web clustering engine available online at
M.: CatS: A classification-powered meta-search engine
- In: Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence 23, Springer-Verlag
, 2006
"... Summary. CatS is a meta-search engine that utilizes text classification techniques to improve the presentation of search results. After posting a query, the user is offered an opportunity to refine the results by browsing through a category tree derived from the dmoz Open Directory topic hierarchy. ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Summary. CatS is a meta-search engine that utilizes text classification techniques to improve the presentation of search results. After posting a query, the user is offered an opportunity to refine the results by browsing through a category tree derived from the dmoz Open Directory topic hierarchy. This paper describes some key aspects of the system (including HTML parsing, classification and displaying of results), outlines the text categorization experiments performed in order to choose the right parameters for classification, and puts the system into the context of related work on (meta-)search engines. The approach of using a separate category tree represents an extension of the standard relevance list, and provides a way to refine the search on need, offering the user a non-imposing, but potentially powerful tool for locating needed information quickly and efficiently. The current implementation of CatS may be considered a baseline, on top of which many enhancements are possible. 1
Comprehensible and Accurate Cluster Labels in Text Clustering
"... The purpose of text clustering in information retrieval is to discover groups of semantically related documents. Accurate and comprehensible cluster descriptions (labels) let the user comprehend the collection’s content faster and are essential for various document browsing interfaces. The task of c ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
The purpose of text clustering in information retrieval is to discover groups of semantically related documents. Accurate and comprehensible cluster descriptions (labels) let the user comprehend the collection’s content faster and are essential for various document browsing interfaces. The task of creating descriptive, sensible cluster labels is difficult—typical text clustering algorithms focus on optimizing proximity between documents inside a cluster and rely on keyword representation for describing discovered clusters. In the approach called Description Comes First (DCF) cluster labels are as important as document groups—DCF promotes machine discovery of comprehensible candidate cluster labels later used to discover related document groups. In this paper we describe an application of DCF to the k-Means algorithm, including results of experiments performed on the 20-newsgroups document collection. Experimental evaluation showed that DCF does not decrease the metrics used to assess the quality of document assignment and offers good cluster labels in return. The algorithm utilizes search engine’s data structures directly to scale to large document collections.
Text Mining Services to Support E-Research
"... In recent years the developments and opportunities created for e-Science infrastructure have promised technological support for the ever growing area of text mining applications and services. The computationally expensive tools have previously only been usable on small scale systems but are now bein ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
In recent years the developments and opportunities created for e-Science infrastructure have promised technological support for the ever growing area of text mining applications and services. The computationally expensive tools have previously only been usable on small scale systems but are now being developed for much larger scale tasks thanks to alternative models of processing and storage. In this paper, we provide an overview of a selection of the tools available at the National Centre for Text Mining, ways they can enhance document collections, improve on traditional search techniques and how complex combinations of the tools can interact to provide advanced solutions for researchers in the UK academic community. We finish with an account of two case studies currently being investigated; examining how these processes can be extended and customised to meet the needs of domains outside the boundaries of e-Science. 1.
Clustering Web Search Results with Maximum Spanning Trees
"... Abstract. We present a novel method for clustering Web search results based on Word Sense Induction. First, we acquire the meanings of a query by means of a graph-based clustering algorithm that calculates the maximum spanning tree of the co-occurrence graph of the query. Then we cluster the search ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Abstract. We present a novel method for clustering Web search results based on Word Sense Induction. First, we acquire the meanings of a query by means of a graph-based clustering algorithm that calculates the maximum spanning tree of the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. We show that our approach improves classical search result clustering methods in terms of both clustering quality and degree of diversification. 1
TALP at WePS-3 2010
"... Abstract. In this paper we present our system and experiments at the Third Web People Search Workshop (WePS-3) task for clustering web people search documents in English. In our experiments we used a simple approach with three algorithms: Lingo, Hierachical Agglomerative Clustering (HAC), and a 2-st ..."
Abstract
- Add to MetaCart
Abstract. In this paper we present our system and experiments at the Third Web People Search Workshop (WePS-3) task for clustering web people search documents in English. In our experiments we used a simple approach with three algorithms: Lingo, Hierachical Agglomerative Clustering (HAC), and a 2-step HAC algorithm. We also present the results and initial conclusions in the context of the WePS-3 Task 1 for clustering. We obtained best results with HAC and 2-step HAC algorithms. 1

