## Improving Quality of Search Results Clustering with Approximate Matrix Factorisations

Citations: 7 (1 self)

### BibTeX

@MISC{Osinski_improvingquality,

author = {Stanislaw Osinski},

title = {Improving Quality of Search Results Clustering with Approximate Matrix Factorisations},

year = {}

}

### Abstract

In this paper we show how approximate matrix factorisations can be used to organise document summaries returned by a search engine into meaningful thematic categories. We compare four different factorisations (SVD, NMF, LNMF and K-Means/Concept Decomposition) with respect to topic separation capability, outlier detection and label quality. We also compare our approach with two other clustering algorithms: Suffix Tree Clustering (STC) and Tolerance Rough Set Clustering (TRC). For our experiments we use the standard merge-then-cluster approach based on the Open Directory Project web catalogue as a source of human-clustered document summaries.

### Citations

1213 | Automatic text processing: the transformation, analysis, and retrieval of information by computer
- Salton
- 1989
Citation Context ..., a matrix factorization is used to induce cluster labels. In phase four snippets are assigned to each of these labels to form proper clusters. The assignment is based on the Vector Space Model (VSM) [15] and the cosine similarity between vectors representing the label and the snippets. Finally, phase five is postprocessing, which includes cluster merging and pruning. In the context of this paper, pha...

989 | Learning the parts of objects by non-negative matrix factorization
- Lee, Seung
- 1999
Citation Context ...r topic identification capability, outlier detection and cluster labels quality. We evaluate four factorisation algorithms: Singular Value Decomposition (SVD), Non-negative Matrix factorisation (NMF) [9], Local Non-negative Matrix Factorisation (LNMF) [10] and Concept Decomposition (CD) [11]. To further verify the viability of the description-comes-first approach, we compare Lingo with two other algorith...

734 | Algorithms for non-negative matrix factorization
- Lee, Seung
- 2001
Citation Context ...tion called Non-negative Matrix Factorisation (NMF) applied to human face images was shown to be able to produce base vectors corresponding to different parts of a human face. It is further argued in [16] that low-dimensional base vectors can discover the latent structures present in the input data. Following this intuition, we believe that in the search results clustering setting each of the base vec...

383 | Reexamining the cluster hypothesis: scatter/gather on retrieval results
- Hearst, Pedersen
- 1996
Citation Context ...rawn from a large human-edited directory of web page summaries called Open Directory Project (footnote 1). 2 Related Work. The idea of search results clustering was first introduced in the Scatter/Gather system [12], which was based on a variant of the classic K-Means algorithm. Scatter/Gather was followed by Suffix Tree Clustering (STC) [13], in which snippets sharing the same sequence of words were grouped tog...

331 | Web document clustering: A feasibility demonstration
- Zamir, Etzioni, et al.
Citation Context ...eed a general overview of all related topics would get a concise summary of each of them. Search results clustering involves a class of algorithms called post-retrieval document clustering algorithms [1]. A successful search results clustering algorithm must first of all identify the major and outlier topics dealt with in the results based only on the short document snippets returned by the search en...

304 | Concept decomposition for large sparse text data using clustering
- Dhillon, Modha
- 2001
Citation Context ...uate four factorisation algorithms: Singular Value Decomposition (SVD), Non-negative Matrix factorisation (NMF) [9], Local Non-negative Matrix Factorisation (LNMF) [10] and Concept Decomposition (CD) [11]. To further verify the viability of the description-comes-first approach, we compare Lingo with two other algorithms designed specifically for clustering of search results: Suffix Tree Clustering (STC) a...

236 | Grouper: A Dynamic Clustering Interface to Web Search Results
- Zamir, Etzioni
- 1999
Citation Context ...h results clustering was first introduced in the Scatter/Gather system [12], which was based on a variant of the classic K-Means algorithm. Scatter/Gather was followed by Suffix Tree Clustering (STC) [13], in which snippets sharing the same sequence of words were grouped together. The Semantic On-Line Hierarchical Clustering (SHOC) [... (Footnote 1: http://dmoz.org)

178 | Document clustering based on non-negative matrix factorization
- Xu, Liu, et al.
- 2003
Citation Context ...ults in such a way as to maximise the coverage and distinctiveness of the clusters. Finally, there exist algorithms that use matrix factorisation techniques, such as Non-negative Matrix Factorisation [14], for clustering full text documents. 3 Background Information. 3.1 Lingo: Description-Comes-First Clustering. In this section we provide a brief description of the Lingo algorithm, placing emphasis on ...

137 | Random Projection in Dimensionality Reduction: Applications to Image and Text
- Bingham, Mannila
- 2001
Citation Context ...pets. However, it may prove less efficient in identifying topics represented by relatively small groups of documents. There also exists a class of decomposition techniques based on random projections [17]. Even though these decompositions fairly well preserve distances and similarities between vectors, they are of little use in our approach. The reason is that they rely on randomly gene...

119 | Learning spatially localized, parts-based representation
- Li, Hou, et al.
- 2001
Citation Context ... and cluster labels quality. We evaluate four factorisation algorithms: Singular Value Decomposition (SVD), Non-negative Matrix factorisation (NMF) [9], Local Non-negative Matrix Factorisation (LNMF) [10] and Concept Decomposition (CD) [11]. To further verify the viability of the description-comes-first approach, we compare Lingo with two other algorithms designed specifically for clustering of search res...

61 | An information-theoretic external cluster-validity measure
- Dom
- 2001
Citation Context ...cally generated clusters and the original reference groups is measured. The similarity between two sets of clusters can be expressed as a single numerical value using e.g. mutual-information measures [18]. One drawback of such measures is that a smallest difference between the automatically generated clusters and the reference groups will be treated as the algorithm’s mistake, even if the algorithm ma...

60 | A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results
- Kummamuru, Lotlikar, et al.
Citation Context ...o search results clustering have been proposed, including Suffix Tree Clustering (STC) [2], Semantic On-line Hierarchical Clustering (SHOC) [3], Tolerance Rough Set Clustering (TRC) [4], and DisCover [5]. With their respective advantages such as speed and scalability, all these algorithms share one important shortcoming: none of them explicitly addresses the problem of cluster description quality. Th...

37 | A concept-driven algorithm for clustering search results
- Osiński, Weiss
- 2005
Citation Context ...to form proper clusters. For this reason this algorithm could be considered as an example of a description-comes-first approach. Although SVD performed fairly well as part of Lingo in our experiments [8], it had certain limitations in the context of the description-comes-first approach. For this reason, we sought to verify how alternative matrix factorisations, known from e.g. image processing, would...

32 | Lingo: Search results clustering algorithm based on singular value decomposition
- Osiński, Stefanowski, et al.
- 2004
Citation Context ...me being unable to concisely explain to the user what the group’s documents have in common. Based on our previous experiences with search results clustering [6], we proposed an algorithm called Lingo [7] in which special emphasis was placed on the quality of cluster labels. The main idea behind the algorithm was to reverse the usual order of the clustering process: Lingo first identified meaningful c...

25 | Document clustering by concept factorization
- Xu, Gong
- 2004
Citation Context ...ed by Lingo significantly outperformed both STC and TRC in topic separation and outlier detection tests. We feel that future experiments should investigate more complex matrix factorisations, such as [20]. It is also very interesting how our algorithm would perform for the full-text test collections such as REUTERS-21578 or OHSUMED. Such experiments would require efficient implementations of the facto...

15 | Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results
- Zamir
- 1999
Citation Context ... results fully automatically and must not introduce a noticeable delay to the query processing. Many approaches to search results clustering have been proposed, including Suffix Tree Clustering (STC) [2], Semantic On-line Hierarchical Clustering (SHOC) [3], Tolerance Rough Set Clustering (TRC) [4], and DisCover [5]. With their respective advantages such as speed and scalability, all these algorithms ...

11 | Carrot2 and language properties in Web search results clustering
- Stefanowski
- 2003
Citation Context ...ents should form a group and at the same time being unable to concisely explain to the user what the group’s documents have in common. Based on our previous experiences with search results clustering [6], we proposed an algorithm called Lingo [7] in which special emphasis was placed on the quality of cluster labels. The main idea behind the algorithm was to reverse the usual order of the clustering p...

7 | Towards Web Information Clustering
- Dong
Citation Context ... noticeable delay to the query processing. Many approaches to search results clustering have been proposed, including Suffix Tree Clustering (STC) [2], Semantic On-line Hierarchical Clustering (SHOC) [3], Tolerance Rough Set Clustering (TRC) [4], and DisCover [5]. With their respective advantages such as speed and scalability, all these algorithms share one important shortcoming: none of them explici...

5 | Dimensionality reduction techniques for search results clustering
- Osinski
- 2004
Citation Context ...sures: Cluster Contamination, Topic Coverage and Snippet Coverage. Due to the limited length of this paper we can only afford an informal description of these measures; we refer the reader to [8] and [19] for formalised definitions. 4.2 Clustering Quality Measures. Let us define the Cluster Contamination (CC) measure to be the number of pairs of documents found in the same cluster K but originating fro...

3 | A tolerance rough set approach to clustering web search results
- Lang
- 2004
Citation Context ... Many approaches to search results clustering have been proposed, including Suffix Tree Clustering (STC) [2], Semantic On-line Hierarchical Clustering (SHOC) [3], Tolerance Rough Set Clustering (TRC) [4], and DisCover [5]. With their respective advantages such as speed and scalability, all these algorithms share one important shortcoming: none of them explicitly addresses the problem of cluster descr...
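
### Illustrative sketch: cosine-based snippet assignment

The citation context for Salton (1989) above describes a phase in which snippets are assigned to cluster labels using the Vector Space Model and cosine similarity. The following is a minimal Python sketch of that idea, not the paper's implementation. The function names `cosine` and `assign_snippets`, the toy term vectors, and the use of a similarity threshold are all assumptions introduced here for illustration; the source states only that the assignment is based on VSM and cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def assign_snippets(label_vectors, snippet_vectors, threshold):
    """Attach each snippet index to every label it resembles closely enough.

    The threshold is a hypothetical parameter: a snippet may end up under
    several labels, or under none if it matches no label well.
    """
    clusters = {label: [] for label in label_vectors}
    for idx, snippet in enumerate(snippet_vectors):
        for label, vec in label_vectors.items():
            if cosine(vec, snippet) >= threshold:
                clusters[label].append(idx)
    return clusters


# Toy term space (assumed): ["matrix", "factorisation", "search", "engine"]
labels = {
    "matrix factorisation": [1.0, 1.0, 0.0, 0.0],
    "search engine": [0.0, 0.0, 1.0, 1.0],
}
snippets = [
    [2.0, 1.0, 0.0, 0.0],  # mostly about matrices
    [0.0, 0.0, 1.0, 3.0],  # mostly about search engines
    [1.0, 0.0, 1.0, 0.0],  # shares one term with each label
]

clusters = assign_snippets(labels, snippets, threshold=0.6)
print(clusters)
```

With the threshold of 0.6 chosen here, the third snippet has a cosine similarity of 0.5 to each label and so stays unassigned, which loosely illustrates how weakly related snippets could be kept out of every cluster.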