Shuffling a stacked deck: the case for partially randomized ranking of search engine results
- In Proc. 31st International Conference on Very Large Data Bases (VLDB), 2005
Cited by 22 (1 self)
In-degree, PageRank, number of visits and other measures of Web page popularity significantly influence the ranking of search results by modern search engines. The assumption is that popularity is closely correlated with quality, a more elusive concept that is difficult to measure directly. Unfortunately, the correlation between popularity and quality is very weak for newly-created pages that have yet to receive many visits and/or in-links. Worse, since discovery of new content is largely done by querying search engines, and because users usually focus their attention on the top few results, newly-created but high-quality pages are effectively “shut out,” and it can take a very long time before they become popular. We propose a simple and elegant solution to this problem: the introduction of a controlled amount of randomness into search result ranking methods. Doing so offers new pages a chance to prove their worth, although clearly using too much randomness will degrade result quality and annul any benefits achieved. Hence there is a tradeoff between exploration to estimate the quality of new pages and exploitation of pages already known to be of high quality. We study this tradeoff both analytically and via simulation, in the context of an economic objective function based on aggregate result quality amortized over time. We show that a modest amount of randomness leads to improved search results.
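The exploration/exploitation tradeoff described in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the function name, the slot-by-slot mixing rule, and the parameter `r` (the probability that a result slot is given to a randomly chosen new page) are assumptions made for illustration only.

```python
import random


def partially_randomized_ranking(ranked, new_pages, r, rng=None):
    """Mix a popularity-ordered list with randomly ordered new pages.

    ranked    -- pages sorted by a popularity score, best first
    new_pages -- pages with too little history to be ranked reliably
    r         -- hypothetical knob: chance that a slot goes to a new page
    """
    rng = rng or random.Random()
    new_set = set(new_pages)
    # Established pages keep their deterministic popularity order.
    established = [p for p in ranked if p not in new_set]
    # New pages are explored in random order.
    promoted = list(new_pages)
    rng.shuffle(promoted)

    result = []
    i = j = 0
    while i < len(established) or j < len(promoted):
        # Take a new page if one remains and either the established list
        # is exhausted or the random draw falls below r.
        take_new = j < len(promoted) and (
            i >= len(established) or rng.random() < r
        )
        if take_new:
            result.append(promoted[j])
            j += 1
        else:
            result.append(established[i])
            i += 1
    return result
```

With `r = 0` the popularity order is untouched and new pages trail the list; raising `r` moves more of them toward the top at the cost of displacing established pages, which is the tradeoff the abstract refers to.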
Effective Web Crawling
- 2004
Cited by 17 (2 self)
The key factors for the success of the World Wide Web are its large size and the lack of a centralized control over its contents. Both issues are also the most important source of problems for locating information. The Web is a context in which traditional Information Retrieval methods are challenged, and given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small. Moreover, the distribution of quality is very skewed, and interesting pages are scarce in comparison with the rest of the content. Web crawling is the process used by search engines to collect pages from the Web. This thesis studies Web crawling at several different levels, ranging from the long-term goal of crawling important pages first, to the short-term goal of using the network connectivity efficiently, including implementation issues that are essential for crawling in practice. We start by designing a new model and architecture for a Web crawler that tightly integrates the crawler with the rest of the search engine, providing access to the metadata and links of the documents that can be used to guide the crawling process effectively. We implement this design in the WIRE project as an efficient Web crawler that provides an experimental framework for this research. In fact, we have used our crawler to
Refinement of TF-IDF Schemes for Web Pages using their Hyperlinked Neighboring Pages
- 2003
Cited by 16 (1 self)
In IR (information retrieval) systems based on the vector space model, the tf-idf scheme is widely used to characterize documents. However, in the case of documents with hyperlink structures such as Web pages, it is necessary to develop a technique for representing the contents of Web pages more accurately by exploiting the contents of their hyperlinked neighboring pages. In this paper, we first propose several approaches to refining the tf-idf scheme for a target Web page by using the contents of its hyperlinked neighboring pages, and then compare retrieval accuracy of our proposed approaches. Experimental results show that, generally, more accurate feature vectors of a target Web page can be generated in the case of utilizing the contents of its hyperlinked neighboring pages at levels up to second in the backward direction from the target page.
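The idea of enriching a page's tf-idf vector with the contents of pages that link to it can be sketched as follows. This is only an illustration under stated assumptions: the helper names, the mixing weight `alpha`, the averaging over neighbors, and the restriction to first-level backlinks are hypothetical choices, not the schemes proposed in the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps page id -> list of tokens; returns page id -> {term: weight}."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency counts each page once
    vectors = {}
    for page, tokens in docs.items():
        tf = Counter(tokens)
        vectors[page] = {
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return vectors


def refine_with_backlinks(vectors, backlinks, page, alpha=0.5):
    """Add the alpha-weighted average of the backlinking pages' vectors.

    backlinks maps page id -> list of pages that link to it (assumed input).
    """
    refined = dict(vectors[page])
    neighbors = backlinks.get(page, [])
    for neighbor in neighbors:
        for term, weight in vectors[neighbor].items():
            refined[term] = refined.get(term, 0.0) + alpha * weight / len(neighbors)
    return refined
```

Terms that appear only in the linking pages enter the refined vector with a reduced weight, so the target page can match queries its own text does not mention.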
T-rank: Time-aware authority ranking
- In WAW, 2004
Cited by 13 (3 self)
The link structure of the web is analyzed to measure the authority of pages, which can be taken into account for ranking query results. Due to the enormous dynamics of the web, with millions of pages created, updated, deleted, and linked to every day, temporal aspects of web pages and links are crucial factors for their evaluation. Users are interested in important pages (i.e., pages with high authority score) but are equally interested in the recency of information. Time, and thus the freshness of web content and link structure, emanates as a factor that should be taken into account in link analysis when computing the importance of a page. So far only minor effort has been spent on the integration of temporal aspects into link-analysis techniques. In this paper we introduce T-Rank Light and T-Rank, two link-analysis approaches that take into account the temporal aspects freshness (i.e., timestamps of most recent updates) and activity (i.e., update rates) of pages and links. Experimental results show that T-Rank Light and T-Rank can produce better rankings of web pages.
Ranking web sites with real user traffic
- International Conference on Web Search and Web Data Mining, 2008
Cited by 8 (1 self)
We analyze the traffic-weighted Web host graph obtained from a large sample of real Web users over about seven months. A number of interesting structural properties are revealed by this complex dynamic network, some in line with the well-studied boolean link host graph and others pointing to important differences. We find that while search is directly involved in a surprisingly small fraction of user clicks, it leads to a much larger fraction of all sites visited. The temporal traffic patterns display strong regularities, with a large portion of future requests being statistically predictable by past ones. Given the importance of topological measures such as PageRank in modeling user navigation, as well as their role in ranking sites for Web search, we use the traffic data to validate the PageRank random surfing model. The ranking obtained by the actual frequency with which a site is visited by users differs significantly from that approximated by the uniform surfing/teleportation behavior modeled by PageRank, especially for the most important sites. To interpret this finding, we consider each of the fundamental assumptions underlying PageRank and show how each is violated by actual user behavior.
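For reference, the random surfing model with uniform teleportation that this abstract compares against real traffic can be sketched as a simple power iteration. This is a generic textbook-style sketch, not code from the paper: the damping value 0.85, the fixed iteration count, and the uniform treatment of pages without out-links are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps node -> list of nodes it links to; returns node -> score."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly (assumption).
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        # Uniform teleportation: every node gets the same base share.
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            for v in outs:
                # The surfer follows each out-link of u with equal probability.
                new_rank[v] += damping * rank[u] / len(outs)
        rank = new_rank
    return rank
```

The two uniformity choices in this sketch (equal probability for every out-link and a uniform teleportation target) are among the modeling assumptions that the abstract reports as violated by actual user behavior.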
Web Dynamics, Structure, and Page Quality
- In Web Dynamics, 2004
Cited by 5 (3 self)
Introduction. The purpose of a Web search engine is to provide an infrastructure that supports relationships between publishers of content and readers. In this context, as the numbers involved are very big (550 million users [2] and more than 3 billion pages, a lower bound that comes from the coverage of popular search engines, in 35 million sites [4] in January 2003), it is critical to provide good measures of quality that allow the user to choose "good" pages. We think that this is the main element that explains Google's [3] success. However, the notion of what is a "good page" and how it is related to different Web characteristics is not well understood. Therefore, in this chapter we address the study of the relationships between the age of a page or a site, the quality of a page, and the structure of the Web. Age is defined as the time since the page was last updated (recency). For Web servers, we use the oldest page in the site as a lower bound on the age of the site. The spe
Time-aware and trend-based authority ranking
- 2004
Cited by 2 (0 self)
This thesis devises time-aware and trend-based ranking techniques. The time-aware techniques exploit temporal information, present in networks like the World Wide Web, to produce rankings reflecting authority with regard to a temporal interest. The trend-based techniques produce rankings based on the relative change of authority with regard to a temporal interest. We describe the mathematics behind the approaches and review efforts having related aims. On this basis, two time-aware and two trend-based methods are proposed. The time-aware methods extend PageRank and are defined incrementally. The trend-based methods are defined independently, one extending PageRank and the other based on a comparison of precomputed authority rankings. The methods were implemented in a prototype system using Java and extensively evaluated in a series of experiments on bibliographic and Web data. Results on the bibliographic data indicate that the methods produce meaningful rankings. Moreover, a user study, examining results obtained on Web data, gives strong evidence that the resulting rankings are preferred by the users and closer to their expectations.
Estimating the number of citations using author reputation
- 2007
Cited by 1 (0 self)
We study the problem of predicting the popularity of items in a dynamic environment in which authors continuously post new items and provide feedback on existing items. This problem can be applied to predict popularity of blog posts, rank photographs in a photo-sharing system, or predict the citations of a scientific article using author information and monitoring the items of interest for a short period of time after their creation. As a case study, we show how to estimate the number of citations for an academic paper using information about past articles written by the same author(s) of the paper. If we use only the citation information over a short period of time, we obtain a predicted value that has a correlation of r = 0.57 with the actual value. This is our baseline prediction. Our best-performing system can improve that prediction by adding features extracted from the past publishing history of its authors, increasing the correlation between the actual and the predicted values to r = 0.81.
Chapter 4 Scheduling Algorithms for Web Crawling
In the previous chapter, we described the general model of our Web crawler. In this chapter, we deal with the specific algorithms for scheduling the visits to the Web pages. We started with a large sample of the Chilean Web that was used to build a Web graph and run a crawler simulator. Several strategies were compared using the simulator to ensure identical conditions during the experiments. The rest of this chapter is organized as follows: Section 4.1 introduces our experimental framework and Section 4.2 the simulation parameters. Sections 4.3 and 4.4 compare different scheduling policies for long- and short-term scheduling. In Section 4.5 we test one of these policies using a real Web crawler, and the last section presents our conclusions. Portions of this chapter were presented in [CMRBY04].
4.1 Experimental setup
We tested several scheduling policies on two different datasets corresponding to Chilean and Greek Web pages using a crawler simulator. This section describes the datasets and how the simulator works.
4.1.1 Datasets: .cl and .gr
Dill et al. [DKM+02] studied several subsets of the Web, and found that the Web graph is self-similar in
Vetting the Links of the Web
Many web links mislead human surfers and automated crawlers because they point to changed content, out-of-date information, or invalid URLs. It is a particular problem for large, well-known directories such as the dmoz Open Directory Project, which maintains links to representative and authoritative external web pages within their various topics. Therefore, such sites involve many editors to manually revisit and revise links that have become out-of-date. To remedy this situation, we propose the novel web mining task of identifying outdated links on the web. We build a general classification model, primarily using local and global temporal features extracted from historical content, topic, link and time-focused changes over time. We evaluate our system via five-fold cross-validation on more than fifteen thousand ODP external links selected from thirteen top-level categories. Our system can predict the actions of ODP editors more than 75% of the time. Our models and predictions could be useful for various applications that depend on analysis of web links, including ranking and crawling.

