Results 1 -
8 of
8
Learning to cluster web search results
- In Proc. of SIGIR ’04
, 2004
"... In web search, surfers are often faced with the problem of selecting their most wanted information from the potential huge amount of search results. The clustering of web search results is the possible solution, but the traditional content based clustering is not sufficient since it ignores many uni ..."
Abstract
-
Cited by 93 (6 self)
- Add to MetaCart
In web search, surfers are often faced with the problem of selecting their most wanted information from the potential huge amount of search results. The clustering of web search results is the possible solution, but the traditional content based clustering is not sufficient since it ignores many unique features of web pages. The link structure, authority, quality, or trustfulness of search results can play even the higher role than the actual contents of the web pages in clustering. These possible extents are reflected by Google's PageRank algorithm, HITS algorithm and etc. The main goal of this project is to integrate the authoritative information such as PageRank, link structure (e.g. in-links and out-links) into the K-Means clustering of web search results. The PageRank, inlinks and out-links can be used to extend the vector representation of web pages, and the PageRank can also be considered in the initial centroids selection, or the web page with higher PageRank influences the centroid computation to a higher degree. The relevance of this modified K-Means clustering algorithm needs to be compared to the ones obtained by the content based K-Means clustering, and the effects of different authoritative information also needs to be analyzed.
TopCat: Data Mining for Topic Identification in a Text Corpus
- In Proceedings of the 3rd European Conference of Principles and Practice of Knowledge Discovery in Databases
, 2002
"... TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a dat ..."
Abstract
-
Cited by 27 (4 self)
- Add to MetaCart
TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: Identifying related groups of items. This paper presents a novel method for identifying related items based on "traditional" data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually-categorized "ground truth" news corpus showing this technique is effective in identifying "topics" in collections of news articles.
Developing a Framework for Assessing Information Quality on
- the World Wide Web. Informing Science Journal
, 2005
"... The rapid growth of the Internet as an environment for information exchange and the lack of enforceable standards regarding the information it contains has lead to numerous information qual ity problems. A major issue is the inability of Search Engine technology to wade through the vast expanse of q ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
The rapid growth of the Internet as an environment for information exchange and the lack of enforceable standards regarding the information it contains has lead to numerous information qual ity problems. A major issue is the inability of Search Engine technology to wade through the vast expanse of questionable content and return "quality " results to a user's query. This paper attempts to address some of the issues involved in determining what quality is, as it pertains to information retrieval on the Internet. The IQIP model is presented as an approach to managing the choice and implementation of quality related algorithms of an Internet crawling Search Engine.
Clustering web documents using co-citation, coupling, incoming, and outgoing hyperlinks: a comparative performance analysis of algorithms
- IJWIS
, 2006
"... ”jaguars ” returns results as diverse as web sites about cars, computer games, attack planes, American football, and animals. More and more search engines offer options to organize query results by categories or, given a document, to return a list of links to topically related documents. While infor ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
”jaguars ” returns results as diverse as web sites about cars, computer games, attack planes, American football, and animals. More and more search engines offer options to organize query results by categories or, given a document, to return a list of links to topically related documents. While information retrieval traditionally defines similarity of documents in terms of contents, it seems natural to expect that the very structure of the Web carries important information about the topical similarity of documents. Here we study the role of a matrix constructed from weighted cocitations (documents referenced by the same document), weighted couplings (documents referencing the same document), incoming, and outgoing links for the clustering of documents on the Web. We present and discuss three methods of clustering based on this matrix construction using three clustering algorithms, K-means, Markov and Maximum Spanning Tree, respectively. Our main contribution is a clustering technique based on the Maximum Spanning Tree technique and an evaluation of its effectiveness comparatively to the two most robust alternatives: K-means and Markov clustering. Index Terms — Search engines, clustering, co-citation, coupling, hyperlinks
A topology-driven approach to the design of web meta-search clustering engines
- In Theory and Practice of Computer Science (SOFSEM ’05), volume 3381 of Lecture Notes in Computer Science
, 2005
"... The paradigm adopted by classical Web search engines to output the results of a query is often inadequate. It typically consists of a ranked list of URLs, which may be very long and difficult to browse for the interested user. Recently, a lot of attention has been devoted to the design of Web meta-s ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
The paradigm adopted by classical Web search engines to output the results of a query is often inadequate. It typically consists of a ranked list of URLs, which may be very long and difficult to browse for the interested user. Recently, a lot of attention has been devoted to the design of Web meta-search clustering engines. These systems support the user by grouping the URLs returned by a search engine into distinct semantic categories, which are organized in a hierarchy; each category is properly labeled with a sentence that reflects its topics. However, even the most effective Web meta-search engines usually end-up by presenting many “meaningful ” categories together with a few “inexpressive ” categories on some specific queries. In this paper we describe a novel topology-driven approach to the design of a Web meta-search clustering engine. By this approach the set of URLs is modeled as a suitable graph and the hierarchy of categories is obtained by variants of classical graph-clustering algorithms. The topologydriven approach turns out to be comparable with traditional text-based strategies for the definition of the cluster hierarchy. In addition, our approach makes it natural to use graph visualization techniques to support the user in handling inexpressive labels. Namely, categories with inexpressive labels can be visually related to more meaningful ones. 1
Inter-document similarity in web searches
, 2004
"... are stored in PDF, with the report number as filename. Alternatively, reports are available by post from the above address. Orientador: Júri: ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
are stored in PDF, with the report number as filename. Alternatively, reports are available by post from the above address. Orientador: Júri:
Discovering communities in linked data by multi-view clustering
- In From Data and Information Analysis to Knowledge Engineering: Proceedings of 29th Annual Conference of the German Classification Society, Studies in Classification, Data Analysis, and Knowledge Organization
, 2005
"... Abstract. We consider the problem of finding communities in large linked networks such as web structures or citation networks. We review similarity measures for linked objects and discuss the k-Means and EM algorithms, based on text similarity, bibliographic coupling, and co-citation strength. We st ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Abstract. We consider the problem of finding communities in large linked networks such as web structures or citation networks. We review similarity measures for linked objects and discuss the k-Means and EM algorithms, based on text similarity, bibliographic coupling, and co-citation strength. We study the utilization of the principle of multi-view learning to combine these similarity measures. We explore the clustering algorithms experimentally using web pages and the Cite-Seer repository of research papers and find that multi-view clustering effectively combines link-based and intrinsic similarity. 1
Spectral Clustering of Biological Sequence Data
- AAAI
, 2005
"... In this paper, we apply spectral techniques to clustering biological sequence data that has proved more difficult to cluster effectively. For this purpose, we have to (1) extend spectral clustering algorithms to deal with asymmetric affinities, like the alignment scores used in the comparison o ..."
Abstract
- Add to MetaCart
In this paper, we apply spectral techniques to clustering biological sequence data that has proved more difficult to cluster effectively. For this purpose, we have to (1) extend spectral clustering algorithms to deal with asymmetric affinities, like the alignment scores used in the comparison of biological sequences, and (2) devise a hierarchical algorithm that can handle many clusters with imbalanced sizes robustly. We present an algorithm for clustering asymmetric affinity data, and demonstrate the performance of this algorithm at recovering the higher levels of the Structural Classification of Proteins (SCOP) on a data base of highly conserved subsequences.

