Why Websites Are Lost (and How They’re Sometimes Found)
Cited by 4 (4 self)
We have surveyed 52 individuals who have “lost” their own personal website (through a hard drive crash, bankrupt ISP, etc.) or tried to recover a lost website that once belonged to someone else. Our survey investigates why websites are lost and how successful individuals have been at recovering them using a variety of methods, including the use of search engine caches and web archives. The findings suggest that personal and third-party loss of digital data is likely to continue as methods for backing up data are overlooked or performed incorrectly, and individual behavior is unlikely to change because of the perception that losing digital data is very uncommon and the responsibility of others.
Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure
Cited by 4 (3 self)
Missing web pages (pages that return the 404 “Page Not Found” error) are part of the browsing experience. The manual use of search engines to rediscover missing pages can be frustrating and unsuccessful. We compare four automated methods for rediscovering web pages. We extract the page’s title, generate the page’s lexical signature (LS), obtain the page’s tags from the bookmarking website delicious.com, and generate an LS from the page’s link neighborhood. We use the output of all methods to query Internet search engines and analyze their retrieval performance. Our results show that both LSs and titles perform fairly well, with over 60% of URIs returned top-ranked from Yahoo!. However, combining the methods improves retrieval performance. Considering the complexity of LS generation, querying the title first and, in case of insufficient results, querying the LSs second is the preferable setup. This combination returns more than 75% of URIs top-ranked.
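The title-first, LS-second setup described in this abstract can be sketched as follows. This is only an illustration: the abstract does not say how the lexical signature is computed, so the sketch assumes a plain TF-IDF ranking against a small background collection, and the names `tokenize`, `lexical_signature`, and `build_queries` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(page_text, background_docs, k=5):
    """Return the k terms of page_text with the highest TF-IDF score.

    Assumption: TF-IDF over a background collection; the paper's exact
    weighting scheme is not given in the abstract.
    """
    tf = Counter(tokenize(page_text))
    doc_sets = [set(tokenize(d)) for d in background_docs]
    n_docs = len(doc_sets) + 1  # background documents plus the page itself
    scores = {}
    for term, freq in tf.items():
        df = 1 + sum(term in s for s in doc_sets)  # the page counts once
        scores[term] = freq * math.log(n_docs / df)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def build_queries(title, page_text, background_docs, k=5):
    """Ordered queries: the title first, the lexical signature as fallback."""
    signature = " ".join(lexical_signature(page_text, background_docs, k))
    return [title, signature]
```

A caller would submit the first query to a search engine and move on to the second only when the first returns insufficient results; what counts as "insufficient" is left to the caller here.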
Usage analysis of a public website reconstruction tool
- In Proceedings of JCDL ’08, 2008
Cited by 2 (2 self)
The Web is increasingly the medium by which information is published today, but due to its ephemeral nature, web pages and sometimes entire websites are often “lost” due to server crashes, viruses, hackers, run-ins with the law, bankruptcy and loss of interest. When a website is lost and backups are unavailable, an individual or third party can use Warrick to recover the website from several search engine caches and web archives (the Web Infrastructure). In this short paper, we present Warrick usage data obtained from Brass, a queueing system for Warrick hosted at Old Dominion University and made available to the public for free. Over the last six months, 520 individuals have reconstructed more than 700 websites with 800K resources from the Web Infrastructure. Sixty-two percent of the static web pages were recovered, and 41% of all website resources were recovered. The Internet Archive was the largest contributor of recovered resources (78%).
Is This a Good Title?
Cited by 1 (1 self)
Missing web pages, URIs that return the 404 “Page Not Found” error or the HTTP response code 200 but dereference unexpected content, are ubiquitous in today’s browsing experience. We use Internet search engines to relocate such missing pages and provide means that help automate the rediscovery process. We propose querying web pages’ titles against search engines. We investigate the retrieval performance of titles and compare them to lexical signatures, which are derived from the pages’ content. Since titles naturally represent the content of a document, they intuitively change over time. We measure the edit distance between current titles and titles of copies of the same pages obtained from the Internet Archive and display their evolution. We further investigate the correlation between title changes and content modifications of a web page over time. Lastly, we provide a predictive model for the quality of any given web page title in terms of its discovery performance. Our results show that titles return more than 60% of URIs top-ranked, with further relevant content returned in the top 10 results. We show that titles decay slowly but are far more stable than the pages’ content. We further distill stop titles that can help identify insufficiently performing search engine queries.
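The title comparison mentioned in this abstract can be illustrated with a short sketch. The abstract says only "edit distance"; the character-level Levenshtein distance and the length normalization below are assumptions, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the length of the longer title."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

Comparing a page's current title against the title of each archived copy with `normalized_distance` gives one value per copy, which could then be plotted over time to show how the title evolves.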
Lost in translation: Understanding the possession of digital things in the cloud
- In CHI ’12: Proceedings of the International Conference on Human Factors in Computing Systems, 2012
"... {asellen, r.harper, ..."
A Framework for Describing Web Repositories
Cited by 1 (0 self)
In prior work we have demonstrated that search engine caches and archiving projects like the Internet Archive’s Wayback Machine can be used to “lazily preserve” websites and reconstruct them when they are lost. We use the term “web repositories” for collections of automatically refreshed and migrated content, and collectively we refer to these repositories as the “web infrastructure”. In this paper we present a framework for describing web repositories and the status of web resources in them. This includes an abstract API for web repository interaction, the concepts of deep vs. flat and light/dark/grey repositories, and terminology for describing the recoverability of a web resource. Our API may serve as a foundation for future web repository interfaces.
Synchronicity: Automatically Rediscover Missing Web Pages in Real Time
Cited by 1 (0 self)
Missing web pages (pages that return the 404 “Page Not Found” error) are part of the browsing experience. The manual use of search engines to rediscover such pages can be frustrating and unsuccessful. We introduce Synchronicity, a Mozilla Firefox add-on that supports the Internet user in (re-)discovering missing web pages in real time.
system issues, user issues
This paper reports the results of a qualitative field study of the scholarly writing, collaboration, information management, and long-term archiving practices of researchers in five related subdisciplines. The study focuses on the kinds of artifacts the researchers create in the process of writing a paper, how they exchange and store materials over the short term, how they handle references and bibliographic resources, and the strategies they use to guarantee the long term safety of their scholarly materials. The findings reveal: (1) the adoption of a new CIM infrastructure relies crucially on whether it compares favorably to email along six critical dimensions; (2) personal scholarly archives should be maintained as a side-effect of collaboration and the role of ancillary material such as datasets remains to be worked out; and (3) it is vital to consider agency when we talk about depositing new types of scholarly materials into disciplinary repositories.
Using the Web Infrastructure for Just-In-Time Recovery of Missing Web Pages [Extended Abstract]
The Internet provides access to a great number of web sites, but the structure of the web is constantly changing. Missing web pages remain a pervasive problem that users experience every day. This dissertation is about creating a method to overcome this problem by automatically mapping between Uniform Resource Identifiers (URIs) and textual content of web pages using lexical signatures (LSs) and tags. We introduce a “just-in-time” approach to support the preservation of web content relying on the “living” web. We propose a method to harness the collective behavior of the Web Infrastructure and investigate the suitability of lexical signatures and tags to give a “good enough” description of the “aboutness” of missing pages. Utilizing Internet search engines by querying these LSs will return the replacement page or a very similar page which can be provided to the user. We investigate the evolution of lexical signatures over time and propose a framework to aid in the creation of LSs. Analyzing snapshots of the web from recent years will enable us to investigate the decay of such lightweight descriptions and also the characteristics of missing pages.
Fast, Inexpensive Content-Addressed Storage in Foundation
Foundation is a preservation system for users’ personal, digital artifacts. Foundation preserves all of a user’s data and its dependencies (fonts, programs, plugins, kernel, and configuration state) by archiving nightly snapshots of the user’s entire hard disk. Users can browse through these images to view old data or recover accidentally deleted files. To access data that a user’s current environment can no longer interpret, Foundation boots the disk image in which that data resides under an emulator, allowing the user to view and modify the data with the same programs with which the user originally accessed it. This paper describes Foundation’s archival storage layer, which uses content-addressed storage (CAS) to retain nightly snapshots of users’ disks indefinitely. Current state-of-the-art CAS systems, such as Venti [34], require multiple high-speed disks or other expensive hardware to achieve high performance. Foundation’s archival storage layer, in contrast, matches the storage efficiency of Venti using only a single USB hard drive. Foundation archives disk snapshots at an average throughput of 21 MB/s and restores them at an average of 14 MB/s, more than an order of magnitude improvement over Venti running on the same hardware. Unlike Venti, Foundation does not rely on the assumption that SHA-1 is collision-free.
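For readers unfamiliar with content-addressed storage, the general idea can be sketched in a few lines. This is a generic in-memory illustration, not Foundation's or Venti's design: the `BlockStore` class, the fixed block size, and the use of SHA-256 are all assumptions made for the example, and the sketch does trust the hash to identify a block uniquely, which the abstract says Foundation avoids for SHA-1.

```python
import hashlib


class BlockStore:
    """Toy content-addressed store: blocks are keyed by their digest."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # hex digest -> block bytes

    def put(self, data):
        """Store data block by block; return the ordered list of digests."""
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # An identical block is stored only once, however often it recurs.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        """Reassemble the original bytes from a list of digests."""
        return b"".join(self.blocks[digest] for digest in recipe)
```

Because consecutive nightly snapshots of a disk share most of their blocks, storing each snapshot as a list of digests means unchanged blocks add no new data to the store, which is what makes retaining snapshots indefinitely plausible.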

