Results 1 - 10 of 13
How Dynamic is the Web?
, 2000
Cited by 97 (0 self)
Recent experiments and analysis suggest that there are about 800 million publicly-indexable web pages. However, unlike books in a traditional library, web pages continue to change even after they are initially published by their authors and indexed by search engines. This paper describes preliminary data on and statistical analysis of the frequency and nature of web page modifications. Using empirical models and a novel analytic metric of "up-to-dateness", we estimate the rate at which web search engines must re-index the web to remain current.
Keywords: web dynamics, monitoring, document management
1 Introduction
Since its inception scarcely a decade ago, the World Wide Web has become a popular vehicle for disseminating scientific, commercial and personal information. The web consists of individual pages linked to and from other pages through Hyper Text Markup Language (HTML) constructs. The web is patently decentralized. Web pages are created, maintained and modified at random t...
A Dynamic Object Replication and Migration Protocol for an Internet Hosting Service
- IN PROC. OF IEEE ICDCS
, 1998
Cited by 63 (8 self)
This paper proposes a protocol suite for dynamic replication and migration of Internet objects. It consists of an algorithm for deciding on the number and location of object replicas and an algorithm for distributing requests among currently available replicas. Our approach attempts to place replicas in the vicinity of a majority of requests while ensuring at the same time that no servers are overloaded. The request distribution algorithm uses the same simple mechanism to take into account both server proximity and load, without actually knowing the latter. The replica placement algorithm executes autonomously on each node, without the knowledge of other object replicas in the system. The proposed algorithms rely on the information available in databases maintained by Internet routers. A simulation study using synthetic workloads and the network backbone of UUNET, one of the largest Internet service providers, shows that the proposed protocol is effective in eliminating hot spots and ...
Keeping Up With The Changing Web
- IEEE Computer
, 2000
Cited by 47 (0 self)
Our access to information today is unprecedented in history. However, information depreciates in value as it gets older, and the problem of updating information to keep it current presents new design challenges for information providers and consumers. These issues lead to novel concepts and results in the context of the World Wide Web. We quantify what it means for search engines to be "up-to-date" and estimate how often search engines must re-index the web to keep current with its changing pages and structure.
Three weeks prior to the Soviet invasion of Czechoslovakia, Corona satellite imagery of the area showed no signs of imminent attack. By the time another round of imagery was available, it was too late to react; the invasion had already taken place. In a real sense, the information obtained by the satellite weeks earlier was no longer useful. The fact that information has a useful lifetime is well known in the intelligence community. On the other side of the Iron Curtain,...
Modeling and Managing Content Changes in Text Databases
Cited by 10 (3 self)
Large amounts of (often valuable) information are stored in web-accessible text databases. "Metasearchers" provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases for each query. So far, database selection research has largely assumed that databases are static, so the associated statistical summaries do not need to change over time. However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. In this paper, we first report the results of a study showing how the content summaries of 152 real web databases evolved over a period of 52 weeks. Then, we show how to use "survival analysis" techniques in general, and Cox's proportional hazards regression in particular, to model database changes over time and predict when we should update each content summary. Finally, we exploit our change model to devise update schedules that keep the summaries up to date by contacting databases only when needed, and then we evaluate the quality of our schedules experimentally over real web databases.
Scheduling Algorithms For Web Crawling
, 2004
Cited by 9 (1 self)
es, enforcing a politeness policy as described in Section ??: a Web crawler should not download more than one page from a single Web site at a time, and it should wait several seconds between requests.
[Figure 6.1: Two unrealistic scenarios for Web crawling: (a) full parallelization of all page downloads and (b) full serialization of all page downloads. The areas represent page sizes, as size = speed × time.]
Instead of downloading all pages in parallel, we could also serialize all the requests, downloading only one page at a time at the maximum speed, as depicted in Figure 6.1b. However, the bandwidth B_i available for Web sites is usually lower than the crawler bandwidth B, so this scenario is not realistic either. The presented observations suggest that actual download time lines are similar to the one shown in Figure 6.2. In the figure, the optimal time T is not achieved, because some bandwidth is wasted due to limitations in the speed of Web sites
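The relation size = speed × time in the excerpt above can be illustrated with a small calculation. The following is only a sketch under stated assumptions, not the thesis's own model: it assumes the optimal time is the total size divided by the crawler bandwidth B, and that a serialized download of a page from site i runs at min(B, B_i). The function names and the sample numbers are hypothetical.

```python
def optimal_time(pages, crawler_bandwidth):
    """Assumed ideal case: the crawler bandwidth B is fully used at all times,
    so time = total size / B (from size = speed * time)."""
    total_size = sum(size for size, _site_bandwidth in pages)
    return total_size / crawler_bandwidth


def serialized_time(pages, crawler_bandwidth):
    """Full serialization: one page at a time. Each download is assumed to be
    limited by the slower of the crawler bandwidth B and the site bandwidth B_i."""
    return sum(
        size / min(crawler_bandwidth, site_bandwidth)
        for size, site_bandwidth in pages
    )


# Hypothetical workload: (page size, site bandwidth B_i) pairs in arbitrary units.
pages = [(100, 10), (200, 50)]
B = 100  # hypothetical crawler bandwidth

print(optimal_time(pages, B))     # total size 300 over bandwidth 100
print(serialized_time(pages, B))  # each page limited by its site's bandwidth
```

With these sample values the serialized schedule takes longer than the assumed optimum because both sites are slower than the crawler, which is the wasted bandwidth the excerpt refers to.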
Observation of Changing Information Sources
, 2000
Cited by 8 (0 self)
Many modern information management tasks consist of an observer that must maintain current knowledge of a collection of changing information. The goal of this observer is to maintain acceptably accurate state estimates given limited observation resources, such as bandwidth, time, and storage. Good examples of such "observation problems" are found in any situation where bandwidth is limited and old observations become less useful over time. Two such examples are maintaining a search engine's index of the World Wide Web (WWW) and automated monitoring of multiple sensors. This thesis addresses the general observation problem by (1) devising a formal framework of what it means to be "up-to-date", (2) gathering empirical data about the web that allows us to apply this framework to an important setting, and (3) presenting algorithms for scheduling revisits to optimize formal performance measures. One year's worth of web page observations are analyzed to show how quickly and in what ways web ...
Performance of Dynamic Replication Schemes for an Internet Hosting Service
- [ONLINE]. AVAILABLE: CITESEER.IST.PSU.EDU/AGGARWAL98PERFORMANCE.HTML
, 1998
Cited by 6 (0 self)
The paper explores schemes for dynamic replication and migration of web objects in the context of an Internet hosting service. It describes a replica placement algorithm for deciding the location and number of replicas of an object as well as request distribution schemes for choosing among currently available replicas. We compare two classes of request distribution algorithms -- namely feedback and non-feedback based. Further, we compare dynamic replication to a static replication scheme. We have simulated the algorithms using synthetic workloads as well as a real trace from a hosting service. Measurement and analysis show that dynamic replication significantly reduces bandwidth consumption and latency, removes hot spots from the network and smooths out bursts in bandwidth demand while imposing only a low network traffic overhead. For example, on the trace, our algorithm reduces bandwidth consumption by as much as 52% while imposing a traffic overhead of only about ...
A.D.: Objective-greedy algorithms for long-term Web prefetching
- In: Proceedings of the IEEE Conference on Network Computing and Applications (NCA)
, 2004
Cited by 2 (0 self)
Web prefetching is based on web caching and attempts to reduce user-perceived latency. Unlike on-demand caching, web prefetching fetches objects and stores them in advance, hoping that the prefetched objects are likely to be accessed in the near future and such accesses would be satisfied from the cache rather than by retrieving the objects from the web server. This paper reviews the popular prefetching algorithms based on Popularity, Good Fetch, APL characteristic, and Lifetime, and then makes the following contributions. (1) The paper proposes a family of prefetching algorithms, Objective-Greedy prefetching, wherein each algorithm greedily prefetches those web objects that give the highest performance as per the metric that it aims to improve. (2) The paper shows the results of a performance analysis via simulations, comparing the objective-greedy algorithms with the existing algorithms in terms of the respective objectives – the hit rate, bandwidth, and the H/B metrics. The proposed prefetching algorithms are seen to provide the best objective-based performance. (3) The paper also proves that the algorithms based on Good Fetch and on the APL characteristic, although using different criteria, are equivalent in terms of their choice of objects selected for prefetching.
Web Prefetching: Costs, Benefits and Performance
Cited by 1 (0 self)
Due to the fast development of internet services and a huge amount of network traffic, it is becoming essential to reduce World Wide Web user-perceived latency. Although web performance is improved by caching, the benefit of caches is limited. To further reduce the retrieval latency, web prefetching becomes an attractive solution. Prefetching reduces user access time, but at the same time, it requires more bandwidth and increases traffic. Performance of prefetching techniques is measured primarily in terms of hit ratio and bandwidth usage. A significant factor in a prefetching algorithm's ability to reduce latency is deciding which objects to prefetch in advance. This paper presents a solution space of prefetching according to various object-selecting criteria and compares their performance.
How dynamic is the web? Estimating the information highway speed limit
, 1999
Recent experiments and analysis suggest that there are about 800 million publicly-indexable web pages. However, unlike books in a traditional library, web pages continue to change even after they are initially published by their authors and indexed by search engines. This paper describes preliminary data on and statistical analysis of the frequency and nature of web page modifications. Using empirical models and a novel analytic metric of "up-to-dateness", we estimate the rate at which web search engines must re-index the web to remain current.
1 Introduction
Since its inception scarcely a decade ago, the World Wide Web has become a popular vehicle for disseminating scientific, commercial and personal information. The web consists of individual pages linked to and from other pages through Hyper Text Markup Language (HTML) constructs. The web is patently decentralized. Web pages are created, maintained and modified at random times by thousands, perhaps millions, of users around the ...

