Effective criteria for Web page changes
In Proceedings of APWeb ’06, 2006
Cited by 3 (0 self)
Abstract. A number of similarity metrics have been used to measure the degree of web page changes in the literature. In this paper, we define criteria for web page changes to evaluate the effectiveness of the metrics. Using real web pages and synthesized pages, we analyze the five existing metrics (i.e., the byte-wise comparison, the TF·IDF cosine distance, the word distance, the edit distance, and the shingling) under the proposed criteria. The analysis result can help users select an appropriate metric for particular web applications.
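All five surveyed metrics are computable from the raw page text. As a minimal illustration, here is a sketch of two of them, the byte-wise comparison and the shingling (Jaccard similarity over word w-shingles); the function names and the choice of w = 3 are illustrative, not taken from the paper.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles of a text."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def shingling_similarity(old, new, w=3):
    """Jaccard similarity between the shingle sets of two page versions."""
    a, b = shingles(old, w), shingles(new, w)
    if not a and not b:
        return 1.0  # two empty pages are identical
    return len(a & b) / len(a | b)

def bytewise_changed(old, new):
    """Byte-wise comparison: any byte-level difference counts as a change."""
    return old.encode() != new.encode()
```

Note how the two metrics disagree on a one-word edit: the byte-wise comparison reports a change for any difference, while the shingle similarity degrades gracefully with the size of the edit.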
USING THE WEB INFRASTRUCTURE FOR REAL TIME RECOVERY OF MISSING WEB PAGES
, 2011
Cited by 2 (0 self)
Given the dynamic nature of the World Wide Web, missing web pages, or “404 Page not Found” responses, are part of our web browsing experience. It is our intuition that information on the web is rarely completely lost; it is just missing. In whole or in part, content often moves from one URI to another and hence just needs to be (re-)discovered. We evaluate several methods for a “just-in-time” approach to web page preservation. We investigate the suitability of lexical signatures and web page titles to rediscover missing content. It is understood that web pages change over time, which implies that the performance of these two methods depends on the age of the content. We therefore conduct a temporal study of the decay of lexical signatures and titles and estimate their half-life. We further propose the use of tags that users have created to annotate pages, as well as the most salient terms derived from a page’s link neighborhood. We utilize the Memento framework to discover previous versions of web pages and to execute the above methods. We provide a workflow, including a set of parameters, that is most promising for the (re-)discovery of missing web pages. We introduce Synchronicity, a web browser add-on that implements this workflow. It works while the user is browsing and detects the occurrence of 404 errors automatically. When activated by the user, Synchronicity offers a total of six methods to either rediscover the missing page at its new URI or discover an alternative page that satisfies the user’s information need. Synchronicity depends on user interaction, which enables it to provide results in real time.
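A lexical signature in this line of work is a small set of terms that best characterizes a page, commonly its top-k TF-IDF terms. A minimal sketch under that standard definition follows; the thesis's exact weighting scheme and choice of k may differ.

```python
import math
from collections import Counter

def lexical_signature(doc, corpus, k=5):
    """Top-k TF-IDF terms of `doc` relative to `corpus`.

    `doc` is a list of tokens; `corpus` is a list of token lists
    (the background collection used to compute document frequencies).
    """
    tf = Counter(doc)
    n = len(corpus)

    def idf(term):
        # Smoothed inverse document frequency.
        df = sum(1 for d in corpus if term in d)
        return math.log(n / (1 + df))

    ranked = sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)
    return ranked[:k]
```

Terms that appear in every document (e.g. "web" in a web-page corpus) get a low or negative weight and drop out of the signature, which is exactly the property that makes such signatures usable as search-engine queries for rediscovering a moved page.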
AN EXPERIMENT ON VISIBLE CHANGES OF WEB PAGES
Cited by 1 (0 self)
Since web pages are created, changed, and destroyed constantly, web databases (local collections of web pages) should be updated to keep web pages up-to-date. In order to effectively keep web databases fresh, a number of studies on the change detection of web pages have been carried out, and various web statistics have been reported in the literature. This paper considers the issues of web page changes in terms of user visibility. First, we consider the effect of a number of tags that make no difference in terms of user visibility. We learned that approximately 4.5% of web page changes under the byte-wise comparison were unnecessarily determined. Secondly, we investigated the relationship between ‘TITLE’ tags and ‘BODY’ tags in terms of web page changes. We found that an inspection of ‘TITLE’ tags could allow users to sufficiently determine the change of web pages, so that we can significantly reduce the comparison time of web pages.
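The finding suggests a cheap-first change check: compare the ‘TITLE’ tags before resorting to a full-page comparison. A naive sketch is below; the regex-based title extraction is an assumption for illustration only (a real crawler would use an HTML parser), and the fall-back to full comparison is a conservative variant of the paper's title-only test.

```python
import re

def title_of(html):
    """Extract the TITLE tag content (naive regex sketch)."""
    m = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else ""

def page_changed(old_html, new_html):
    """Cheap-first change detection: a differing TITLE immediately
    signals a change; only equal titles trigger the full comparison."""
    if title_of(old_html) != title_of(new_html):
        return True
    return old_html != new_html
```

When titles alone are a reliable proxy, as the paper's measurements suggest, the expensive second comparison is skipped for most changed pages.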
Time series analysis of the dynamics of news websites
The content of news websites changes frequently and rapidly, and its relevance tends to decay with time. To be of any value to users, tools such as search engines have to cope with these evolving websites and detect their changes in a timely manner. In this paper we apply time series analysis to study the properties and the temporal patterns of the change rates of the content of three news websites. Our investigation shows that changes are characterized by large fluctuations with periodic patterns and time-dependent behavior. The time series describing the change rate is decomposed into trend, seasonal, and irregular components, and models of each component are then identified. The trend and seasonal components describe the daily and weekly patterns of the change rates. Trigonometric polynomials best fit these deterministic components, whereas the class of ARMA models represents the irregular component. The resulting models can be used to describe the dynamics of the changes and predict future change rates.
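The deterministic (trend plus seasonal) part of such a series can be fitted with a trigonometric polynomial by ordinary least squares. A minimal NumPy sketch follows, assuming an hourly change-rate series with a daily period of 24 samples; the model orders and periods used in the paper are not reproduced here.

```python
import numpy as np

def trig_design(n, period, harmonics):
    """Design matrix with a constant column plus cos/sin pairs at
    harmonics of the base frequency 1/period."""
    t = np.arange(n)
    cols = [np.ones(n)]
    for k in range(1, harmonics + 1):
        cols.append(np.cos(2 * np.pi * k * t / period))
        cols.append(np.sin(2 * np.pi * k * t / period))
    return np.column_stack(cols)

def trig_fit(series, period, harmonics=2):
    """Least-squares trigonometric-polynomial fit; returns the fitted
    deterministic component evaluated at each time step."""
    X = trig_design(len(series), period, harmonics)
    coef, *_ = np.linalg.lstsq(X, series, rcond=None)
    return X @ coef
```

Subtracting the fitted component from the observed series leaves the irregular residual, which is the part the paper models with ARMA processes.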
unknown title
This article presents a comparative study of strategies for Web crawling. We show that a combination of breadth-first ordering with the largest sites first is a practical alternative, since it is fast, simple to implement, and able to retrieve the best-ranked pages at a rate that is closer to the optimal than other alternatives. Our study was performed on a large sample of the Chilean Web, which was crawled using simulators, so that all strategies were compared under the same conditions, and with actual crawls to validate our conclusions. We also explored the effects of large-scale parallelism in the page retrieval task and of multiple-page requests in a single connection for effective amortization of latency times.
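The ordering itself is easy to express: visit sites in decreasing size, and crawl pages within a site breadth-first. A toy sketch follows; the data model (a per-site link adjacency map plus seed URLs, with site size approximated by the number of known pages with outlinks) is hypothetical and far simpler than the paper's simulator.

```python
from collections import deque

def crawl_order(sites):
    """Largest-site-first, breadth-first-within-site crawl ordering.

    `sites` maps a site name to a pair (links, seeds): `links` is an
    adjacency dict of in-site hyperlinks and `seeds` the entry pages.
    """
    order = []
    # Largest sites first; len(links) is a crude size proxy.
    for site, (links, seeds) in sorted(
            sites.items(), key=lambda kv: len(kv[1][0]), reverse=True):
        seen, queue = set(seeds), deque(seeds)
        while queue:  # standard BFS over the site's link graph
            page = queue.popleft()
            order.append(page)
            for nxt in links.get(page, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return order
```

A real crawler would interleave sites for politeness and parallelism, which is exactly the large-scale-parallelism aspect the article goes on to explore.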
Perelgut, Jen Hawkins, “Internal Corporate Blogs: Empowering Social Networks within Large
1979
Knowledge Channels
"... In this paper, we present a new framework to extract knowledge from today's non-semantic web. It associates semantics with the information extracted, which improves agent interoperability; it can also deal with changes to the structure of a web page, which improves adaptability; furthermore ..."
In this paper, we present a new framework to extract knowledge from today's non-semantic web. It associates semantics with the information extracted, which improves agent interoperability; it can also deal with changes to the structure of a web page, which improves adaptability; furthermore, it delegates the knowledge extraction procedure to specialist agents, easing software development and promoting software reuse and maintainability.