Beyond pagerank: Machine learning for static ranking
- In WWW ’06: Proceedings of the 15th international conference on World Wide Web
, 2006
Cited by 29 (2 self)
Since the publication of Brin and Page’s paper on PageRank, many in the Web community have depended on PageRank for the static (query-independent) ordering of Web pages. We show that we can significantly outperform PageRank using features that are independent of the link structure of the Web. We gain a further boost in accuracy by using data on the frequency at which users visit Web pages. We use RankNet, a ranking machine learning algorithm, to combine these and other static features based on anchor text and domain characteristics. The resulting model achieves a static ranking pairwise accuracy of 67.3% (vs. 56.7% for PageRank or 50% for random).
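As an illustration of the metric quoted in the abstract above, the following Python sketch computes pairwise accuracy under one common definition: among all pairs of pages whose reference labels differ, the fraction that the scores order the same way. The function name, the tie handling, and the toy data are assumptions made for illustration; the paper's exact evaluation procedure is not given in this listing.

```python
from itertools import combinations


def pairwise_accuracy(scores, labels):
    """Fraction of label-distinct pairs that `scores` orders like `labels`.

    Hypothetical helper: `scores` are static rank scores and `labels` are
    reference judgments (higher means better) for the same pages.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(scores)), 2):
        if labels[i] == labels[j]:
            continue  # no preference between these two pages; skip the pair
        total += 1
        # Correct when the score difference and label difference share a sign.
        # A tie in scores counts as incorrect (an assumption of this sketch).
        if (scores[i] - scores[j]) * (labels[i] - labels[j]) > 0:
            agree += 1
    return agree / total if total else 0.0


# Toy data: the last two pages are swapped relative to the labels.
print(pairwise_accuracy([0.9, 0.1, 0.5], [2, 1, 0]))  # 2 of 3 pairs agree
```

Under this definition a ranker that orders pairs at random is expected to score about 0.5, which matches the random baseline quoted in the abstract.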
Modeling and Managing Content Changes in Text Databases
Cited by 10 (3 self)
Large amounts of (often valuable) information are stored in web-accessible text databases. "Metasearchers" provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases for each query. So far, database selection research has largely assumed that databases are static, so the associated statistical summaries do not need to change over time. However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. In this paper, we first report the results of a study showing how the content summaries of 152 real web databases evolved over a period of 52 weeks. Then, we show how to use "survival analysis" techniques in general, and Cox's proportional hazards regression in particular, to model database changes over time and predict when we should update each content summary. Finally, we exploit our change model to devise update schedules that keep the summaries up to date by contacting databases only when needed, and then we evaluate the quality of our schedules experimentally over real web databases.
Efficient Monitoring Algorithm for Fast News Alert
, 2005
Cited by 9 (1 self)
use of XML data to deliver information over the Web. Personal weblogs, news Web sites, and discussion forums are now publishing RSS feeds for their subscribers to retrieve new postings. While the subscribers rely on news feeders to regularly pull articles from the Web sites, the aggregated effect by all news feeders puts an enormous load on many sites. In this paper, we propose a blog aggregator approach where a central aggregator monitors and retrieves new postings from different data sources and subsequently disseminates them to the subscribers to alleviate such a problem.
The Discoverability of the Web
- In Proc. WWW, 2007.
, 2007
Cited by 9 (1 self)
Previous studies have highlighted the high arrival rate of new content on the web. We study the extent to which this new content can be efficiently discovered by a crawler. Our study has two parts. First, we study the inherent difficulty of the discovery problem using a maximum cover formulation, under an assumption of perfect estimates of likely sources of links to new content. Second, we relax this assumption and study a more realistic setting in which algorithms must use historical statistics to estimate which pages are most likely to yield links to new content. We recommend a simple algorithm that performs comparably to all approaches we consider. We measure the overhead of discovering new content, defined as the average number of fetches required to discover one new page. We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 90% of all new content with under 3% overhead, and 100% of new content with 9% overhead. But actual algorithms, which do not have access to perfect foreknowledge, face a more difficult task: one quarter of new content is simply not amenable to efficient discovery. Of the remaining three quarters, 80% of new content during a given week may be discovered with 160% overhead if content is recrawled fully on a monthly basis.
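The overhead measure and the maximum cover view described in the abstract above can be sketched in Python as follows. The greedy selection is a standard heuristic for maximum cover and is used here only as an assumption; the listing does not say which algorithm the authors recommend. All names (`greedy_cover`, `overhead`, the toy link map) are hypothetical.

```python
def greedy_cover(links_to_new, budget):
    """Choose up to `budget` known pages to fetch.

    `links_to_new` maps each known page to the set of new pages it links to.
    Each step takes the page that reveals the most not-yet-discovered pages.
    """
    covered = set()
    chosen = []
    remaining = dict(links_to_new)
    for _ in range(budget):
        best = max(remaining, key=lambda p: len(remaining[p] - covered),
                   default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left that would reveal a new page
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


def overhead(fetches, new_pages_discovered):
    """Average number of fetches per newly discovered page."""
    if new_pages_discovered <= 0:
        raise ValueError("no new pages discovered")
    return fetches / new_pages_discovered


# Toy link map: fetching "a" and "b" reveals all four new pages.
links = {"a": {"n1", "n2", "n3"}, "b": {"n3", "n4"}, "c": {"n1"}}
chosen, covered = greedy_cover(links, budget=3)
print(chosen, overhead(len(chosen), len(covered)))
```

Read this way, the 160% figure in the abstract corresponds to 1.6 fetches per new page, and the 3% figure to 0.03 fetches per new page.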
RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee
, 2007
Cited by 5 (1 self)
Crawling algorithms have been the subject of extensive research and optimizations, but some important questions remain open. In particular, given the unbounded number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover “most” of the Web? How can I know I am not missing an important part when I stop? In this paper we provide an answer to these questions by developing, in the context of a system that is given a set of trusted pages, a family of crawling algorithms that (1) provide a theoretical guarantee on how much of the “important” part of the Web it will download after crawling a certain number of pages and (2) give a high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first. We prove the correctness of our algorithms by theoretical analysis and evaluate their performance experimentally based on 141 million URLs obtained from the Web. Our experiments demonstrate that even our simple algorithm is effective in downloading important pages early on and provides high “coverage” of the Web with a relatively small number of pages.
Looking at both the present and the past to efficiently update replicas of web content
- In Proc. of WIDM 2005
, 2005
Cited by 4 (0 self)
Since Web sites are autonomous and independently updated, applications that keep replicas of Web data, such as Web warehouses and search engines, must periodically poll the sites and check for changes. Since this is a resource-intensive task, in order to keep the copies up-to-date, it is important to devise efficient update schedules that adapt to the change rate of the pages and avoid visiting pages not modified since the last visit. In this paper, we propose a new approach that learns to predict the change behavior of Web pages based both on the static features and change history of pages, and refreshes the copies accordingly. Experiments using real-world data show that our technique leads to substantial performance improvements compared to previously proposed approaches.
Archiving the Web using Page Changes Pattern: A Case Study
- In ACM/IEEE Joint Conference on Digital Libraries (JCDL ’11)
, 2011
Cited by 3 (2 self)
A pattern is a model or a template used to summarize and describe the behavior (or the trend) of a data having generally some recurrent events. Patterns have received a considerable attention in recent years and were widely studied in the data mining field. Various pattern mining approaches have been proposed and used for different applications such as network monitoring, moving object tracking, financial or medical data analysis, scientific data processing, etc. In these different contexts, discovered patterns were useful to detect anomalies, to predict data behavior (or trend), or more generally, to simplify data processing or to improve system performance. However, to the best of our knowledge, patterns have never been used in the context of web archiving. Web archiving is the process of continuously collecting and preserving portions of the World Wide Web for future generations. In this paper, we show how patterns of page changes can be useful tools to efficiently archive web sites. We first define our pattern model that describes the changes of pages. Then, we present the strategy used to (i) extract the temporal evolution of page changes, to (ii) discover patterns and to (iii) exploit them to improve web archives. We choose the archive of French public TV channels France Télévisions as a case study in order to validate our approach. Our experimental evaluation based on real web pages shows the utility of patterns to improve archive quality and to optimize indexing or storing.
Exploring traversal strategy for Web forum crawling
- In Proc. of SIGIR
, 2008
Cited by 2 (1 self)
In this paper, we study the problem of Web forum crawling. Web forum has now become an important data source of many Web applications; while forum crawling is still a challenging task due to complex in-site link structures and login controls of most forum sites. Without carefully selecting the traversal path, a generic crawler usually downloads many duplicate and invalid pages from forums, and thus wastes both the precious bandwidth and the limited storage space. To crawl forum data more effectively and efficiently, in this paper, we propose an automatic approach to exploring an appropriate traversal strategy to direct the crawling of a given target forum. In detail, the traversal strategy consists of the identification of the skeleton links and the detection of the page-flipping links. The skeleton links instruct the crawler to only crawl valuable pages and meanwhile avoid duplicate and uninformative ones; and the page-flipping links tell the crawler how to completely download a long discussion thread which is usually shown in multiple pages in Web forums. The extensive experimental results on several forums show encouraging performance of our approach. Following the discovered traversal strategy, our forum crawler can archive more informative pages in comparison with previous related work and a commercial generic crawler.
Monitoring RSS Feeds Based on User Browsing Pattern
Cited by 2 (0 self)
RSS has been widely used to disseminate information on the Web over the years. With the help of RSS feed readers, a user may subscribe to the feeds that are published by her favorite blogs, news channels, or Websites, and access the most recent content from these information sources. However, when the size of the subscription list grows over time, it becomes less manageable for the user to catch up with the most up-to-date information. In this paper, we propose a Personal Information Manager that helps a user monitor the pool of information sources in her subscription list and recommends relevant articles based on her browsing history. In particular, in order for the manager to provide the most up-to-date content, we propose a retrieval scheduling algorithm that allocates limited system resources in an optimal way based on the user’s previous access pattern. Experiments show that our scheduling algorithm significantly improves the freshness of content when compared to other scheduling algorithms which do not take into account a user’s behavior.
Maintaining Dynamic Channel Profiles on the Web
, 2008
Cited by 2 (0 self)
This work addresses a novel problem of maintaining channel profiles on the Web. Such channel maintenance is essential for next generation of Web 2.0 applications that provide sophisticated search and discovery services over Web information channels. Maintaining a fresh channel profile is extremely difficult due to the dynamic nature of the channel, especially under the constraint of a limited monitoring budget. We propose a novel monitoring scheme that learns the channels’ monitoring rates. The monitoring scheme is further extended to consider the content that is published on the channels. We describe a novelty detection filter that refines the monitoring rate according to the expected rate of novel content published on the channels. We further show how inter-channel profile similarities can be utilized to refine the channel monitoring rates. Using real-world data of Web feeds we study the performance of the monitoring scheme. We experiment with several monitoring policies over a large set of Web feeds and show that a policy based on learning the monitoring rate of the channels, combined with novelty detection, outperforms alternative channel monitoring policies. Our results show that the suggested content-based policy is able to maintain high quality channel profiles under limited monitoring resources.

