404 Not Found: The Stability and Persistence of URLs Published in MEDLINE
- Bioinformatics, 2004
Cited by 11 (0 self)
Motivation: The advent of the World Wide Web has enabled unprecedented supplementation of traditional journal publications, allowing access to resources such as video, sound, software, databases, datasets too large to publish, and even supplementary information and discussion. However, unlike traditional publications, continued availability of these online resources is not guaranteed. An automated survey was conducted to quantify the growth in Uniform Resource Locators (URLs) published to date in MEDLINE abstracts, their current availability, and their distribution by journal. Results: Of 1630 unique URLs identified, formatting and/or spelling errors were detected within 201 (12%) of them as published. After corrections were made, a survey revealed that ~63% of these URLs were consistently available, and another 19% were available intermittently. The rate of failure was far worse for anonymous login to FTP sites, with only 12 of 33 sites (36%) responding. This survey also shows that journals vary disproportionately in the number of web citations published, suggesting that policy implementation among a few could have a profound impact overall. Out of the 306 journals with a URL published in an abstract, Bioinformatics published the most (12% of total). Availability: URL database and program available by request. Contact:
Analysis of lexical signatures for improving information persistence on the World Wide Web
- ACM Transactions on Information Systems, 2004
Cited by 10 (0 self)
A lexical signature (LS) consisting of several key words from a Web document is often sufficient information for finding the document later, even if its URL has changed. We conduct a large-scale empirical study of nine methods for generating lexical signatures, including Phelps and Wilensky’s original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performance on the Web over a 10-month period, and on a TREC data set, evaluating their ability to both (1) uniquely identify the original (possibly modified) document, and (2) locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the Web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. The term-frequency inverse-document-frequency- (TFIDF-) based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates among static methods for generating effective lexical signatures. We propose a dynamic LS generator
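The three static scoring families named in this abstract (TF, DF, and TFIDF) can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the smoothed inverse-document-frequency formula, the alphabetical tie-breaking, and the function name `lexical_signature` are all assumptions made for illustration, and the "corpus" stands in for whatever collection supplies document frequencies.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(doc, corpus, k=5, method="tfidf"):
    """Return the k top-ranked terms of doc (hypothetical helper).

    method is one of:
      "tf"    - highest term frequency within doc
      "df"    - lowest document frequency across corpus
      "tfidf" - term frequency times a smoothed inverse document frequency
    """
    tf = Counter(tokenize(doc))

    # Document frequency: how many corpus documents contain each term.
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))
    n = len(corpus)

    def score(term):
        d = df.get(term, 0)
        if method == "tf":
            return tf[term]
        if method == "df":
            return -d  # rarer terms rank higher
        # Assumed smoothing: log((N + 1) / (df + 1)).
        return tf[term] * math.log((n + 1) / (d + 1))

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:k]
```

In this sketch a term that occurs in every corpus document gets a TFIDF score of zero, which is why very common words drop out of the TFIDF signature while still dominating the TF one.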
Preserving the Fabric of Our Lives: A Survey of Web Preservation Initiatives
- In Proc. 7th ECDL, 2003
Cited by 5 (0 self)
Abstract. This paper argues that the growing importance of the World Wide Web means that Web sites are key candidates for digital preservation. After a brief outline of some of the main reasons why the preservation of Web sites can be problematic, a review of selected Web archiving initiatives shows that most current initiatives are based on combinations of three main approaches: automatic harvesting, selection and deposit. The paper ends with a discussion of issues relating to collection and access policies, software, costs and preservation.
Opal: In vivo based preservation framework for locating lost web pages
- 2005
Cited by 4 (0 self)
We present Opal, a framework for interactively locating missing web pages
SALT: Weaving the claim web
Cited by 4 (4 self)
Abstract. In this paper we present a solution for “weaving the claim web”, i.e. the creation of knowledge networks via so-called claims stated in scientific publications created with the SALT (Semantically Annotated LaTeX) framework. To attain this objective, we provide support for claim identification, evolved the appropriate ontologies and defined a claim citation and reference mechanism. We also describe a prototypical claim search engine, which allows authors to reference existing claims and hence weave the web. Finally, we performed a small-scale evaluation of the authoring framework with a quite promising outcome.
DSNotify: Handling Broken Links in the Web of Data
- In 19th International WWW Conference (WWW2010)
Cited by 4 (2 self)
The Web of Data has emerged as a way of exposing structured linked data on the Web. It builds on the central building blocks of the Web (URIs, HTTP) and benefits from its simplicity and widespread adoption. It does, however, also inherit unresolved issues such as the broken link problem. Broken links constitute a major challenge for actors consuming Linked Data as they require them to deal with reduced accessibility of data. We believe that the broken link problem is a major threat to the whole Web of Data idea and that both Linked Data consumers and providers will require solutions that deal with this problem. Since no general solutions for fixing such links in the Web of Data have emerged, we make three contributions in this direction: first, we provide a concise definition of the broken link problem and a comprehensive analysis of existing approaches. Second, we present DSNotify, a generic framework able to assist human and machine actors in fixing broken links. It uses heuristic feature comparison and employs a time-interval-based blocking technique for the underlying instance matching problem. Third, we derived benchmark datasets from knowledge bases such as DBpedia and evaluated the effectiveness of our approach with respect to the broken link problem. Our results show the feasibility of a time-interval-based blocking approach for systems that aim at detecting and fixing broken links in the Web of Data.
Turning The Web Into An Effective Knowledge Repository
Cited by 1 (0 self)
Abstract: To fulfill Vannevar Bush’s Memex and Ted Nelson’s Hyper-Text vision of a world-size interconnected store of knowledge, there are still quite a few rough edges to smooth. There are no large-scale mechanisms to enforce referential integrity in the WWW. The weight of dynamically generated content w.r.t. static content has grown enormously. Preserving accessibility to this type of content raises new issues. We propose a system, comprising a distributed web-proxy and cache architecture, to access and automatically manage web content, both static and dynamically generated. It is combined with an implementation of a cyclic distributed garbage collection algorithm for wide-area memory. It correctly handles dynamic content, enforces referential integrity on the web, and is complete w.r.t. minimizing storage waste.
Information-Hiding URLs for Easier Website Evolution
Many common elements of URLs do not adhere to the principle of information hiding. For example, filename extensions and parameter names can reveal volatile implementation details. As a result, when website implementations change, links between pages break. Bookmarks and code that generates URLs often break as well. In this paper, we present two tools for information-hiding URLs. An information-hiding URL uses an alias to identify a web resource and appends parameter values into the hierarchical structure of the URL. The InformationHidingFilter uses a Java Servlet filter to facilitate the use of information-hiding URLs with JSP/Servlet web applications. Given a request, the filter identifies the JSP or Servlet being requested and identifies parameter values contained in the information-hiding URL. Required values not provided in the URL are automatically substituted with default values specified by the web developer. Thus, old links remain valid even when the website changes and new parameters have been added to the page. The InformationHidingChecker helps web developers adhere to information hiding by helping them identify JSPs or Servlets that lack URL information for the InformationHidingFilter or lack default values for parameters. We also discuss the performance cost of using information-hiding URLs.
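The resolution step this abstract describes (alias lookup, positional parameter values in the path, defaults for missing values) might look roughly like the sketch below. It is written in Python for brevity and is not the Java Servlet InformationHidingFilter itself; the `ROUTES` table, the `/catalog.jsp` target, the parameter names, and the `resolve` function are all hypothetical.

```python
# Hypothetical route table: alias -> (internal target, ordered (name, default) pairs).
ROUTES = {
    "catalog": ("/catalog.jsp", [("category", "all"), ("page", "1")]),
}


def resolve(path, routes=ROUTES):
    """Map an information-hiding URL path to (target, parameter dict).

    The first path segment is treated as the alias; the remaining segments
    are positional parameter values. Parameters without a value in the URL
    fall back to the defaults declared in the route table.
    """
    segments = [s for s in path.split("/") if s]
    if not segments or segments[0] not in routes:
        raise KeyError(f"unknown alias in path: {path!r}")

    target, params = routes[segments[0]]
    values = segments[1:]
    if len(values) > len(params):
        raise ValueError(f"too many parameter values in path: {path!r}")

    resolved = {}
    for i, (name, default) in enumerate(params):
        # Use the value from the URL when present, else the declared default.
        resolved[name] = values[i] if i < len(values) else default
    return target, resolved
```

Under these assumptions, an old link such as `/catalog/books` keeps working after a `page` parameter is added, because the missing value is filled from the default rather than causing the request to fail.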
Should Students Use Digital Textbooks: A Research Design
- 2003
sive or alternate readings largely because they are the basis of the discipline. In his discussion of scientific progress, Ludwig Fleck identified several of these objects, particularly handbooks and textbooks. For the purposes of our study, we are interested in textbooks and how aspiring scholars interpret them. As a medium of study, textbooks offer several advantages: the chapters include questions for students, many are available in both electronic and print form, and textbooks are generally written by a key member of a particular scientific community, i.e., a Saint (Traweek, 1996) whose word is Gospel. Our intention is to study differences in students' ability to correctly answer textbook questions when given either the print-based text or a digital text to study. 3.0 RESEARCH DESIGN The first step in designing our research was to determine which textbook to use. Using "textbook" as a keyword, the electronic resources of the UWO library were searched. It quickly became apparent that "te
Using the Web Infrastructure for Just-In-Time Recovery of Missing Web Pages [Extended Abstract]
The Internet provides access to a great number of web sites, but the structure of the web is constantly changing. Missing web pages remain a pervasive problem that users experience every day. This dissertation is about creating a method to overcome this problem by automatically mapping between Uniform Resource Identifiers (URIs) and textual content of web pages using lexical signatures (LSs) and tags. We introduce a “just-in-time” approach to support the preservation of web content relying on the “living” web. We propose a method to harness the collective behavior of the Web Infrastructure and investigate the suitability of lexical signatures and tags to give a “good enough” description of the “aboutness” of missing pages. Utilizing Internet search engines by querying these LSs will return the replacement page or a very similar page which can be provided to the user. We investigate the evolution of lexical signatures over time and propose a framework to aid in the creation of LSs. Analyzing snapshots of the web from recent years will enable us to investigate the decay of such lightweight descriptions and also the characteristics of missing pages

