Results 1 - 10 of 19
Improved File Synchronization Techniques for Maintaining Large Replicated Collections over Slow Networks
- IN PROC. OF THE INT. CONF. ON DATA ENGINEERING
, 2004
Cited by 14 (5 self)
We study the problem of maintaining large replicated collections of files or documents in a distributed environment with limited bandwidth. This problem arises in a number of important applications, such as synchronization of data between accounts or devices, content distribution and web caching networks, web site mirroring, storage networks, and large scale web search and mining. At the core of the problem lies the following challenge, called the file synchronization problem: given two versions of a file on different machines, say an outdated and a current one, how can we update the outdated version with minimum communication cost, by exploiting the significant similarity between the versions? While a popular open source tool for this problem called rsync is used in hundreds of thousands of installations, there have been only very few attempts to improve upon this tool in practice. In this paper,
Modeling and Managing Content Changes in Text Databases
Cited by 10 (3 self)
Large amounts of (often valuable) information are stored in web-accessible text databases. "Metasearchers" provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases for each query. So far, database selection research has largely assumed that databases are static, so the associated statistical summaries do not need to change over time. However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. In this paper, we first report the results of a study showing how the content summaries of 152 real web databases evolved over a period of 52 weeks. Then, we show how to use "survival analysis" techniques in general, and Cox's proportional hazards regression in particular, to model database changes over time and predict when we should update each content summary. Finally, we exploit our change model to devise update schedules that keep the summaries up to date by contacting databases only when needed, and then we evaluate the quality of our schedules experimentally over real web databases.
Efficient Monitoring Algorithm for Fast News Alert
, 2005
Cited by 9 (1 self)
use of XML data to deliver information over the Web. Personal weblogs, news Web sites, and discussion forums are now publishing RSS feeds for their subscribers to retrieve new postings. While the subscribers rely on news feeders to regularly pull articles from the Web sites, the aggregated effect by all news feeders puts an enormous load on many sites. In this paper, we propose a blog aggregator approach where a central aggregator monitors and retrieves new postings from different data sources and subsequently disseminates them to the subscribers to alleviate such a problem.
Web Dynamics, Structure, and Page Quality
- In Web Dynamics
, 2004
Cited by 5 (3 self)
The purpose of a Web search engine is to provide an infrastructure that supports relationships between publishers of content and readers. In this context, as the numbers involved are very big (550 million users [2] and more than 3 billion pages (a lower bound that comes from the coverage of popular search engines) in 35 million sites [4] in January 2003), it is critical to provide good measures of quality that allow the user to choose "good" pages. We think that this is the main element that explains Google's [3] success. However, the notion of what is a "good page" and how it is related to different Web characteristics is not well understood. Therefore, in this chapter we address the study of the relationships between the age of a page or a site, the quality of a page, and the structure of the Web. Age is defined as the time since the page was last updated (recency). For Web servers, we use the oldest page in the site as a lower bound on the age of the site. The spe
Scalable Application-Aware Data Freshening
, 2003
Cited by 4 (1 self)
Distributed databases and other networked information systems use copies or mirrors to reduce latency and to increase availability. Copies need to be refreshed. In a loosely coupled system, the copy sites are typically responsible for synchronizing their own copies. This involves polling and can be quite expensive if not done in a disciplined way. This paper explores the topic of how to determine a refresh schedule given knowledge of the update frequencies and limited bandwidth. The emphasis here is on how to use additional information about the aggregate interest of the user community in each of the copies in order to maximize the perceived freshness of the copies. This paper develops a model and an optimal solution for small cases, presents several heuristic algorithms that work for large cases, then explores the impact of object size on the refresh schedule. It also presents experimental evidence that our algorithms perform quite well.
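As a rough illustration of the scheduling problem this abstract describes, the sketch below splits a fixed refresh budget across copies so that interest-weighted freshness is as high as possible. It is not the paper's algorithm: the Poisson change model, the greedy allocation, and the names `freshness`, `greedy_schedule`, `rates`, `interest`, and `budget` are all assumptions made for this example.

```python
import math


def freshness(rate, refreshes):
    """Time-averaged freshness of one copy over a period.

    Assumes (hypothetically) that the source changes as a Poisson process
    with `rate` changes per period and that the copy is refreshed
    `refreshes` times per period at evenly spaced instants.
    """
    if rate == 0:
        return 1.0  # a source that never changes is always fresh
    if refreshes == 0:
        return 0.0  # never refreshed: stale in the long run
    x = rate / refreshes
    return (1.0 - math.exp(-x)) / x


def greedy_schedule(rates, interest, budget):
    """Hand out `budget` refreshes one at a time.

    Each refresh goes to the copy whose interest-weighted freshness
    improves the most. `interest` holds the assumed share of user
    attention for each copy.
    """
    counts = [0] * len(rates)
    for _ in range(budget):
        gains = [
            interest[i]
            * (freshness(rates[i], counts[i] + 1) - freshness(rates[i], counts[i]))
            for i in range(len(rates))
        ]
        best = max(range(len(rates)), key=gains.__getitem__)
        counts[best] += 1
    return counts


if __name__ == "__main__":
    # Two copies that change equally often, but the first draws most of the interest.
    print(greedy_schedule(rates=[1.0, 1.0], interest=[0.9, 0.1], budget=3))
```

Because the freshness function has diminishing returns in the number of refreshes, the popular copy does not absorb the whole budget; at some point an extra refresh of the less popular copy is worth more.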
The infocious web search engine: Improving web searching through linguistic analysis
- In Proceeding of the 14th International Conference on World Wide Web
, 2005
Cited by 4 (1 self)
In this paper we present the Infocious Web search engine [23]. Our goal in creating Infocious is to improve the way people find information on the Web by resolving ambiguities present in natural language text. This is achieved by performing linguistic analysis on the content of the Web pages we index, which is a departure from existing Web search engines that return results mainly based on keyword matching. This additional step of linguistic processing gives Infocious two main advantages. First, Infocious gains a deeper understanding of the content of Web pages so it can better match users' queries with indexed documents and therefore can improve relevancy of the returned results. Second, based on its linguistic processing, Infocious can organize and present the results to the user in more intuitive ways. In this paper we present the linguistic processing technologies that we incorporated in Infocious and how they are applied in helping users find information on the Web more efficiently. We discuss the various components in the architecture of Infocious and how each of them benefits from the added linguistic processing. Finally, we experimentally evaluate the performance of a component which leverages linguistic information in order to categorize Web pages.
Temporal Multi-Page Summarization
- WEB INTELLIGENCE AND AGENT SYSTEMS
, 2006
Cited by 4 (0 self)
With the increasing popularity of the Web, efficient approaches to the information overload are becoming more necessary. Summarization of web pages aims at detecting the most important contents from pages so that a user can obtain a compact version of a web document or a group of pages. Traditionally, summaries are constructed on static snapshots of web pages. However, web pages are dynamic objects that can change their contents anytime. In this paper, we discuss the research on temporal multi-document summarization in the Web. We analyze the temporal contents of topically related collections of web pages monitored for certain time intervals. The contents derived from the temporal versions of web documents are summarized to provide information on hot topics and popular events in the collection. We propose two summarization methods that use changing and static contents of web pages downloaded at defined time intervals. The first uses a sliding window mechanism and the second is based on analyzing the time series of the document frequencies of terms. Additionally, we introduce a novel sentence selection algorithm designed for time-dependent scenarios such as temporal summarization.
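The second method mentioned in this abstract works on time series of term document frequencies. The sketch below shows only what such a series could look like for a collection downloaded at several time points; the whitespace tokenization and the name `document_frequency_series` are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def document_frequency_series(snapshots):
    """Build one document-frequency time series per term.

    `snapshots` is a list with one entry per download time; each entry is
    a list of page texts. A term's document frequency in a snapshot is the
    number of pages in that snapshot that contain it at least once.
    """
    per_snapshot = []
    for pages in snapshots:
        df = Counter()
        for text in pages:
            # Count each term once per page (naive lowercase whitespace split).
            df.update(set(text.lower().split()))
        per_snapshot.append(df)

    vocabulary = set().union(*per_snapshot)
    return {term: [df[term] for df in per_snapshot] for term in vocabulary}


if __name__ == "__main__":
    snapshots = [
        ["web crawl", "web page"],       # pages at the first download
        ["web news", "news alert news"], # pages at the second download
    ]
    series = document_frequency_series(snapshots)
    print(series["web"], series["news"])
```

A term whose series rises sharply between consecutive snapshots is a natural candidate for a "hot topic", although how the paper scores such changes is not stated in the abstract.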
Looking at both the present and the past to efficiently update replicas of web content
- In Proc. of WIDM 2005
, 2005
Cited by 4 (0 self)
Since Web sites are autonomous and independently updated, applications that keep replicas of Web data, such as Web warehouses and search engines, must periodically poll the sites and check for changes. Since this is a resource-intensive task, in order to keep the copies up-to-date, it is important to devise efficient update schedules that adapt to the change rate of the pages and avoid visiting pages not modified since the last visit. In this paper, we propose a new approach that learns to predict the change behavior of Web pages based both on the static features and change history of pages, and refreshes the copies accordingly. Experiments using real-world data show that our technique leads to substantial performance improvements compared to previously proposed approaches.
SHARC: Framework for Quality-Conscious Web Archiving
Cited by 3 (0 self)
Web archives preserve the history of born-digital content and offer great potential for sociologists, business analysts, and legal experts on intellectual property and compliance issues. Data quality is crucial for these purposes. Ideally, crawlers should gather sharp captures of entire Web sites, but the politeness etiquette and completeness requirement mandate very slow, long-duration crawling while Web sites undergo changes. This paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies towards better quality with given resources. We define quality measures, characterize their properties, and derive a suite of quality-conscious scheduling strategies for archive crawling. It is assumed that change rates of Web pages can be statistically predicted based on page types, directory depths, and URL names. We develop a stochastically optimal crawl algorithm for the offline case where all change rates are known. We generalize the approach into an online algorithm that detects information on a Web site while it is crawled. For dating a site capture and for assessing its quality, we propose several strategies that revisit pages after their initial downloads in a judiciously chosen order. All strategies are fully implemented in a testbed, and shown to be effective by experiments with both synthetically generated sites and a daily crawl series for a medium-sized site.
Submitted: Capturing web dynamics by regular approximation
- In WISE04, International Conference on Web Information Systems Engineering
, 2004
Cited by 2 (2 self)
Software systems like Web crawlers, Web archives or Web caches depend on or may be improved with the knowledge of update times of remote sources. In the literature, based on the assumption of an exponential distribution of time intervals between updates, diverse statistical methods were presented to find optimal reload times of remote sources. In this article, we first present the observation that the time behavior of a fraction of Web data may be described more precisely by regular or quasi regular grammars. Second, we present an approach to estimate the parameters of such grammars automatically. By comparing a reload policy based on regular approximation to previous exponential-distribution based methods, we show that the quality of local copies of remote sources concerning 'freshness' and the amount of lost data may be improved significantly.
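The exponential-distribution baseline that this abstract compares against can be made concrete with a small sketch. If the intervals between updates are exponentially distributed with rate `rate`, a copy reloaded `elapsed` time units ago is still unchanged with probability exp(-rate * elapsed), and averaging that over a fixed reload interval gives the expected freshness. The function names and the fixed-interval policy are assumptions for illustration; the grammar-based method proposed in the article is not shown.

```python
import math


def prob_still_fresh(rate, elapsed):
    """Probability that the source has not changed `elapsed` time units
    after the last reload, assuming exponentially distributed update
    intervals with `rate` updates per time unit."""
    return math.exp(-rate * elapsed)


def average_freshness(rate, interval):
    """Expected fraction of time the local copy is fresh when it is
    reloaded every `interval` time units (interval > 0). This is the
    average of prob_still_fresh over one reload interval."""
    if rate == 0:
        return 1.0
    x = rate * interval
    return (1.0 - math.exp(-x)) / x


if __name__ == "__main__":
    # Compare two reload intervals for a source that changes about once per day.
    for interval in (0.5, 2.0):
        print(interval, average_freshness(rate=1.0, interval=interval))
```

Under this model freshness depends only on the product of update rate and reload interval, which is exactly the assumption that breaks down for sources that update on a regular timetable.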

