Stable Algorithms for Link Analysis
, 2001
Cited by 95 (1 self)
The Kleinberg HITS and the Google PageRank algorithms are eigenvector methods for identifying "authoritative" or "influential" articles, given hyperlink or citation information. That such algorithms should give reliable or consistent answers is surely a desideratum, and in [10], we analyzed when they can be expected to give stable rankings under small perturbations to the linkage patterns. In this paper, we extend the analysis and show how it gives insight into ways of designing stable link analysis methods. This in turn motivates two new algorithms, whose performance we study empirically using citation data and web hyperlink data.
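As an illustration of what "eigenvector method" means here, the following is a minimal sketch of the basic HITS iteration (repeated hub/authority updates with normalization). It is not one of the stable variants proposed in the paper; the function name `hits`, the iteration count, and the toy graph are assumptions made for the example.

```python
import math


def hits(links, iterations=50):
    """Basic HITS power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns (hub, authority) score dicts, each normalized to unit length.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}

        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[d] for d in links.get(n, [])) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical toy citation graph: a and b both cite c, and c cites d.
toy_links = {"a": ["c"], "b": ["c"], "c": ["d"]}
hub_scores, auth_scores = hits(toy_links)
```

On this toy graph the authority vector concentrates on `c`, the page with the most in-links. Stability in the sense of the abstract asks how much such a ranking moves when a few links are added or removed.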
Evaluating Topic-Driven Web Crawlers
, 2001
Cited by 72 (19 self)
Due to limited bandwidth, storage, and computational resources, and to the dynamic nature of the Web, search engines cannot index every Web page, and even the covered portion of the Web cannot be monitored continuously for changes. Therefore it is essential to develop effective crawling strategies to prioritize the pages to be indexed. The issue is even more important for topic-specific search engines, where crawlers must make additional decisions based on the relevance of visited pages. However, it is difficult to evaluate alternative crawling strategies because relevant sets are unknown and the search space is changing. We propose three different methods to evaluate crawling strategies. We apply the proposed metrics to compare three topic-driven crawling algorithms based on similarity ranking, link analysis, and adaptive agents.
A study of relevance propagation for web search
- In SIGIR 28
, 2005
Cited by 34 (11 self)
Unlike traditional information retrieval, Web information retrieval depends critically on both content and structure. In recent years, many relevance propagation techniques have been proposed to propagate content information between web pages through web structure to improve the performance of web search. In this paper, we first propose a generic relevance propagation framework, and then provide a comparison study on the effectiveness and efficiency of various representative propagation models that can be derived from this generic framework. We come to many conclusions that are useful for selecting a propagation model in real-world search applications, including 1) sitemap-based propagation models outperform hyperlink-based models in terms of both effectiveness and efficiency, and 2) sitemap-based term propagation is easier to integrate into real-world search engines because of its parallel offline implementation and acceptable complexity. Some other more detailed study results are also reported in the paper.
Block-level Link Analysis
- In SIGIR
, 2004
Cited by 30 (4 self)
Link analysis has shown great potential in improving the performance of web search. PageRank and HITS are two of the most popular algorithms. Most of the existing link analysis algorithms treat a web page as a single node in the web graph. However, in most cases, a web page contains multiple semantics and hence the web page might not be considered as the atomic node. In this paper, the web page is partitioned into blocks using the vision-based page segmentation algorithm. By extracting the page-to-block and block-to-page relationships from link structure and page layout analysis, we can construct a semantic graph over the WWW such that each node exactly represents a single semantic topic. This graph can better describe the semantic structure of the web. Based on block-level link analysis, we propose two new algorithms, Block Level PageRank and Block Level HITS, whose performance we study extensively using web data.
Web Search Via Hub Synthesis
, 2001
Cited by 28 (0 self)
We present a model for web search that captures in a unified manner three critical components of the problem: how the link structure of the web is generated, how the content of a web document is generated, and how a human searcher generates a query. The key to this unification lies in capturing the correlations between these components in terms of proximity in a shared latent semantic space. Given such a combined model, the correct answer to a search query is well defined, and thus it becomes possible to evaluate web search algorithms rigorously. We present a new web search algorithm, based on spectral techniques, and prove that it is guaranteed to produce an approximately correct answer in our model. The algorithm assumes no knowledge of the model, and is well-defined regardless of the model's accuracy.
TopicShop: Enhanced Support for Evaluating and Organizing Collections of Web Sites
- IN PROCEEDINGS OF THE 13TH ANNUAL ACM SYMPOSIUM ON USER INTERFACE SOFTWARE AND TECHNOLOGY
, 2000
Cited by 14 (1 self)
TopicShop is an interface that helps users evaluate and organize collections of web sites. The main interface components are site profiles, which contain information that helps users select high-quality items, and a work area, which offers thumbnail images, annotation, and lightweight grouping techniques to help users organize selected sites. The two components are linked to allow task integration. Previous work [2] demonstrated that subjects who used TopicShop were able to select significantly more high-quality sites, in less time and with less effort. We report here on studies that confirm and extend these results. We also show that TopicShop subjects spent just half the time organizing sites, yet still created more groups and more annotations, and agreed more in how they grouped sites. Finally, TopicShop subjects tightly integrated the tasks of evaluating and organizing sites.
Topic-Oriented Collaborative Crawling
- in Proceedings of the 2002 ACM CIKM
, 2002
Cited by 5 (0 self)
A major concern in the implementation of a distributed Web crawler is the choice of a strategy for partitioning the Web among the nodes in the system. Our goal in selecting this strategy is to minimize the overlap between the activities of individual nodes. We propose a topic-oriented approach, in which the Web is partitioned into general subject areas with a crawler assigned to each. We examine design alternatives for a topic-oriented distributed crawler, including the creation of a Web page classifier for use in this context. The approach is compared experimentally with a hash-based partitioning, in which crawler assignments are determined by hash functions computed over URLs and page contents. The experimental evaluation demonstrates the feasibility of the approach, addressing issues of communication overhead, duplicate content detection, and page quality assessment.
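The hash-based baseline mentioned above can be illustrated with a short sketch. The abstract says only that assignments come from hash functions over URLs and page contents; hashing the host name with SHA-1, and the names `assign_crawler` and `num_crawlers`, are assumptions chosen for this example.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index in [0, num_crawlers).

    Assumption for this sketch: only the host name is hashed, so every
    page on one host is handled by the same crawler node.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce modulo the node count.
    return int.from_bytes(digest[:8], "big") % num_crawlers


# Hypothetical URLs, used only to show the call.
node_a = assign_crawler("http://example.org/index.html", 4)
node_b = assign_crawler("http://example.org/papers/crawl.html", 4)
```

Both calls return the same index because the two URLs share a host. A topic-oriented partitioning would replace the hash with the output of a page classifier.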
Hyperlink Analysis: Techniques and Applications
, 2002
Cited by 2 (2 self)
Exploiting pagerank at different block level
- in Proceedings of the 5th International Conference on Web Information Systems Engineering
Cited by 2 (0 self)
In recent years, information retrieval methods based on link analysis have been developed; PageRank and HITS are two typical ones. According to the hierarchical organization of Web pages, the Web graph can be partitioned into blocks at different levels, such as page level, directory level, host level and domain level. On the basis of these blocks, we can analyze the different hyperlinks among pages. Several approaches have proposed that intra-hyperlinks within a host may be less useful in computing PageRank. However, there are no reports on how concretely intra- or inter-hyperlinks affect PageRank. Furthermore, depending on the block level, inter-hyperlink and intra-hyperlink are relative concepts. Which level, then, is optimal for distinguishing intra- from inter-hyperlinks? And what weight ratio between intra-hyperlinks and inter-hyperlinks ultimately improves the performance of the PageRank algorithm? In this paper, we analyze the link distribution at different block levels and evaluate the importance of intra- and inter-hyperlinks to PageRank on the TREC Web Track data set. Experiments show that retrieval achieves the best performance when blocks are set at the host level and the weight ratio between intra-hyperlinks and inter-hyperlinks is 1:4.
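One way to picture the 1:4 weighting is a PageRank variant in which each page splits its score among its out-links in proportion to a per-link weight. The abstract does not say how the weights enter the computation, so the scheme below, the damping factor of 0.85, the handling of dangling pages, and the name `weighted_pagerank` are all assumptions for illustration.

```python
from urllib.parse import urlsplit


def weighted_pagerank(links, intra_weight=1.0, inter_weight=4.0,
                      damping=0.85, iterations=50):
    """PageRank where same-host links and cross-host links get different weights.

    links: dict mapping each page URL to the list of URLs it links to.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    host = {n: urlsplit(n).hostname for n in nodes}

    # Per-page outgoing weights: intra_weight within a host, inter_weight across hosts.
    weights = {}
    for src in nodes:
        out = {}
        for dst in links.get(src, []):
            w = intra_weight if host[src] == host[dst] else inter_weight
            out[dst] = out.get(dst, 0.0) + w
        weights[src] = out

    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Pages without out-links spread their score uniformly (an assumption).
        dangling = sum(rank[p] for p in nodes if not weights[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in nodes}
        for src, out in weights.items():
            total = sum(out.values())
            for dst, w in out.items():
                new_rank[dst] += damping * rank[src] * w / total
        rank = new_rank
    return rank


# Hypothetical pages: one link stays on host x.org, one leaves for y.org.
toy_web = {"http://x.org/a": ["http://x.org/b", "http://y.org/c"]}
ranks = weighted_pagerank(toy_web)
```

With the 1:4 ratio, the cross-host target `http://y.org/c` ends up with a higher score than the same-host target `http://x.org/b`, even though each receives exactly one link from the same page.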
1. An Empirical Approach to Information Retrieval (IR)
Many findings in IR were developed through empirical work, rather than as a result of theory development. For example, the Vector Space Model (VSM) describes a way that a collection of documents can be represented for information retrieval, but there was initially no theoretical underpinning for this model other than some intuitions. The domain of traditional IR is guided by striving to improve and optimize performance (specifically precision) over previous work. The mere demonstration of satisfactory performance results is sufficient to accept one’s work and apply it in a working system.
2. Vector Space Model (VSM) Fundamentals
Recall from the previous lecture that the VSM is used to represent documents from a corpus as vectors in the following manner: document $d$ is represented as the vector $\vec{d} = [d_1, \ldots, d_j, \ldots, d_m]^T$, where $d_j$ represents the weight of term $v^{(j)}$ in document $d$, and there are $m$ terms in the vocabulary. It is generally accepted that the following three concepts should be incorporated into the term weights of a document: $\mathrm{idf}_j$, $\mathrm{tf}_j(d)$, and $\mathrm{norm}_j(d)$.
a. $\mathrm{idf}_j$
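The vector representation described above can be sketched in a few lines. The excerpt is cut off before it defines the three quantities, so the choices here are common textbook ones and are assumptions, not the lecture's definitions: term frequency as a raw count, inverse document frequency as log(N / df), and normalization by Euclidean length. The name `tfidf_vectors` and the toy corpus are also made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Represent each tokenized document as a unit-length tf-idf vector.

    docs: list of token lists. Returns (vocab, vectors), where vectors[i][j]
    is the weight of term vocab[j] in document i.
    """
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / df[t]) for t in vocab}

    vectors = []
    for d in docs:
        tf = Counter(d)
        raw = [tf[t] * idf[t] for t in vocab]
        # Euclidean length normalization; an all-zero vector is left as is.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors


# Hypothetical three-document corpus.
toy_docs = [["link", "analysis", "web"], ["web", "crawler"], ["web", "search", "search"]]
vocab, vectors = tfidf_vectors(toy_docs)
```

Because "web" occurs in every toy document, its idf is log(3/3) = 0 and it carries no weight in any vector, which is the intuition behind including an idf factor.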

