Effective page refresh policies for web crawlers

by J Cho, H García-Molina
Venue: TODS
Results 1 - 10 of 37

Effective Web Crawling

by Carlos Castillo, Dr. Alistair Moffat, Dr. Gonzalo Navarro, 2004
Cited by 17 (2 self)
The key factors for the success of the World Wide Web are its large size and the lack of a centralized control over its contents. Both issues are also the most important source of problems for locating information. The Web is a context in which traditional Information Retrieval methods are challenged, and given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small. Moreover, the distribution of quality is very skewed, and interesting pages are scarce in comparison with the rest of the content. Web crawling is the process used by search engines to collect pages from the Web. This thesis studies Web crawling at several different levels, ranging from the long-term goal of crawling important pages first, to the short-term goal of using the network connectivity efficiently, including implementation issues that are essential for crawling in practice. We start by designing a new model and architecture for a Web crawler that tightly integrates the crawler with the rest of the search engine, providing access to the metadata and links of the documents that can be used to guide the crawling process effectively. We implement this design in the WIRE project as an efficient Web crawler that provides an experimental framework for this research. In fact, we have used our crawler to

Recrawl Scheduling Based on Information Longevity

by Christopher Olston - In Proc. of the 17th International World Wide Web Conference (WWW), 2008
Cited by 17 (0 self)
It is crucial for a web crawler to distinguish between ephemeral and persistent content. Ephemeral content (e.g., quote of the day) is usually not worth crawling, because by the time it reaches the index it is no longer representative of the web page from which it was acquired. On the other hand, content that persists across multiple page updates (e.g., recent blog postings) may be worth acquiring, because it matches the page’s true content for a sustained period of time. In this paper we characterize the longevity of information found on the web, via both empirical measurements and a generative model that coincides with these measurements. We then develop new recrawl scheduling policies that take longevity into account. As we show via experiments over real web data, our policies obtain better freshness at lower cost, compared with previous approaches.

Eigen-Trend: Trend Analysis in the Blogosphere Based on Singular Value Decompositions

by Yun Chi, Belle L. Tseng, Junichi Tatemura, 2006
Cited by 12 (2 self)
The blogosphere, the totality of blog-related Web sites, has become a great source of trend analysis in areas such as product survey, customer relationship, and marketing. Existing approaches are based on simple counts, such as the number of entries or the number of links. In this paper, we introduce a novel concept, coined eigen-trend, to represent the temporal trend in a group of blogs with common interests and propose two new techniques for extracting eigen-trends in blogs. First, we propose a trend analysis technique based on the singular value decomposition. Extracted eigen-trends provide new insights into multiple trends on the same keyword. Second, we propose another trend analysis technique based on a higher-order singular value decomposition. This analyzes the blogosphere as a dynamic graph structure and extracts eigen-trends that reflect the structural changes of the blogosphere over time. Experimental studies based on synthetic data sets and a real blog data set show that our new techniques can reveal a lot of interesting trend information and insights in the blogosphere that are not obtainable from traditional count-based methods.

Efficient Monitoring Algorithm for Fast News Alert

by Ka Cheung Sia, Junghoo Cho, 2005
Cited by 9 (1 self)
use of XML data to deliver information over the Web. Personal weblogs, news Web sites, and discussion forums are now publishing RSS feeds for their subscribers to retrieve new postings. While the subscribers rely on news feeders to regularly pull articles from the Web sites, the aggregated effect by all news feeders puts an enormous load on many sites. In this paper, we propose a blog aggregator approach where a central aggregator monitors and retrieves new postings from different data sources and subsequently disseminates them to the subscribers to alleviate such a problem.

Efficient information extraction over evolving text data

by Fei Chen, Anhai Doan, Jun Yang, Raghu Ramakrishnan - In ICDE, 2008
Cited by 9 (4 self)
Most current information extraction (IE) approaches have considered only static text corpora, over which we typically have to apply IE only once. Many real-world text corpora however are dynamic. They evolve over time, and to keep extracted information up to date, we often must apply IE repeatedly, to consecutive corpus snapshots. We describe Cyclex, an approach that efficiently executes such repeated IE, by recycling previous IE efforts. Specifically, given a current corpus snapshot U, Cyclex identifies text portions of U that also appear in the previous corpus snapshot V. Since Cyclex has already executed IE over V, it can now recycle the IE results of these parts, by combining these results with the results of executing IE over the remaining parts of U, to produce the complete IE results for U. Realizing Cyclex raises many challenges, including modeling information extractors, exploring the trade-off between runtime and completeness in identifying overlapping text, and making informed, cost-based decisions between redoing IE from scratch and recycling previous IE results. We describe initial solutions to these challenges, and experiments over two real-world data sets that demonstrate the utility of our approach.
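
The recycling idea in this abstract can be illustrated with a minimal sketch. It is not the Cyclex implementation: it assumes that the unit of reuse is a single line, that an extractor's output for a line depends only on that line, and it uses a toy year extractor. The names `extract_years` and `RecyclingExtractor` are hypothetical.

```python
import hashlib
import re

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_years(text):
    """Toy extractor (an assumption for this sketch): find four-digit years."""
    return YEAR.findall(text)


class RecyclingExtractor:
    """Reuse results for lines that also appeared in the previous snapshot."""

    def __init__(self, extractor):
        self.extractor = extractor
        self.cache = {}  # line digest -> results from the previous snapshot
        self.calls = 0   # how many times the extractor actually ran

    def run(self, snapshot):
        new_cache = {}
        results = []
        for line in snapshot.splitlines():
            key = hashlib.sha256(line.encode("utf-8")).hexdigest()
            if key in new_cache:
                mentions = new_cache[key]   # repeated line in this snapshot
            elif key in self.cache:
                mentions = self.cache[key]  # recycled from previous snapshot
            else:
                self.calls += 1
                mentions = self.extractor(line)  # extract only over new text
            new_cache[key] = mentions
            results.extend(mentions)
        self.cache = new_cache  # keep only the latest snapshot's results
        return results


snapshot_v = "Founded in 1998.\nRedesigned in 2004."
snapshot_u = "Founded in 1998.\nRedesigned in 2004.\nRelaunched in 2008."

rx = RecyclingExtractor(extract_years)
first = rx.run(snapshot_v)   # extractor runs on both lines
second = rx.run(snapshot_u)  # extractor runs only on the added line
```

In this sketch the second run invokes the extractor once rather than three times. Extractors whose output depends on surrounding context would need a wider reuse unit than a single line.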

Freshness-aware scheduling of continuous queries in the dynamic web

by Mohamed A. Sharaf, Alexandros Labrinidis, Panos K. Chrysanthis, Kirk Pruhs - In Proc. Int. Workshop on the Web and Databases (WebDB), 2005
Cited by 6 (3 self)
The dynamics of the Web and the demand for new, active services are imposing new requirements on Web servers. One such new service is the processing of continuous queries whose output data stream can be used to support the personalization of individual users' web pages. In this paper, we propose a new scheduling policy for continuous queries with the objective of maximizing the freshness of the output data stream and hence the QoD of such new services. The proposed Freshness-Aware Scheduling of Multiple Continuous Queries (FAS-MCQ) policy decides the execution order of continuous queries based on each query's properties (i.e., cost and selectivity) as well as the properties of the input update streams (i.e., variability of updates). Our experimental results have shown that FAS-MCQ can increase freshness by up to 50% compared to existing scheduling policies used in Web servers.

Updating Collection Representations For Federated Search

by Milad Shokouhi
Cited by 6 (1 self)
To facilitate the search for relevant information across a set of online distributed collections, a federated information retrieval system typically represents each collection, centrally, by a set of vocabularies or sampled documents. Accurate retrieval is therefore related to how precisely each representation reflects the underlying content stored in that collection. As collections evolve over time, collection representations should also be updated to reflect any change; however, a current solution has not yet been proposed. In this study we examine the implications of out-of-date representation sets on retrieval accuracy and propose three different policies for managing necessary updates. Each policy is evaluated on a testbed of forty-four dynamic collections over an eight-week period. Our findings show that out-of-date representations significantly degrade performance over time; however, adopting a suitable update policy can minimise this problem.

Efficient, automatic web resource harvesting

by Michael L. Nelson, Joan A. Smith, Ignacio Garcia Del Campo - In RECOMB, 2006
Cited by 6 (4 self)
There are two problems associated with conventional web crawling techniques: a crawler cannot know if all resources at a non-trivial web site have been discovered and crawled (“the counting problem”) and the human-readable format of the resources is not always suitable for machine processing (“the representation problem”). We introduce an approach that solves these two problems by implementing support for both the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and MPEG-21 Digital Item Declaration Language (DIDL) into the web server itself. We present the Apache module “mod_oai”, which can be used to address the counting problem by listing all valid URIs at a web server and efficiently discovering updates and additions on subsequent crawls. Our experiments indicated comparable performance for initial crawls, and dramatic increases in update speed. mod_oai can also be used to address the representation problem by providing “preservation ready” versions of web resources aggregated with their respective forensic metadata in MPEG-21 DIDL format.

mod_oai: An Apache Module for Metadata Harvesting

by Michael L. Nelson, Herbert Van De Sompel, Xiaoming Liu, Terry L. Harrison, Nathan McFarland - In Proceedings of the 2nd European Conference on Digital Libraries
Cited by 4 (3 self)
We describe mod_oai, an Apache 2.0 module that implements the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The OAI-PMH is the de facto standard for metadata exchange in digital libraries and allows repositories to expose their contents in a structured, application-neutral format with semantics optimized for accurate incremental harvesting. Current implementations of OAI-PMH are either separate applications that access an existing repository, or are built into repository software packages. mod_oai is different in that it optimizes harvesting web content by building OAI-PMH capability into the Apache server. We discuss the implications of adding harvesting capability to an Apache server and describe our initial experimental results accessing a departmental web site using both web crawling and OAI-PMH harvesting techniques.
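
As an illustration of the incremental harvesting that OAI-PMH supports, the sketch below builds a selective-harvest request URL. The `verb`, `metadataPrefix`, and `from` arguments and the `oai_dc` prefix come from the OAI-PMH protocol; the base URL, the function name `incremental_harvest_url`, and the use of seconds granularity (which a repository may not support) are assumptions for this example, not details taken from the mod_oai paper.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def incremental_harvest_url(base_url, last_harvest, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListIdentifiers request for items changed since last_harvest.

    last_harvest must be a timezone-aware datetime; it is converted to UTC
    because OAI-PMH datestamps are expressed in UTC.
    """
    stamp = last_harvest.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    params = {
        "verb": "ListIdentifiers",
        "metadataPrefix": metadata_prefix,
        "from": stamp,  # only items created or modified at or after this datestamp
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint, used only for illustration.
url = incremental_harvest_url(
    "http://www.example.org/modoai",
    datetime(2005, 6, 1, tzinfo=timezone.utc),
)
```

A harvester would record the time of each successful harvest and pass it as `last_harvest` on the next run, so that only new or changed resources are listed.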

Opal: In vivo based preservation framework for locating lost web pages

by Terry L. Harrison, Michael L. Nelson, 2005
Cited by 4 (0 self)
We present Opal, a framework for interactively locating missing web pages