Ozone: Integrating Structured and Semistructured Data, 2000
Cited by 38 (2 self)
Applications have an increasing need to manage semistructured data (such as data encoded in XML) along with conventional structured data. We extend the structured object database model ODMG and its query language OQL with the ability to handle semistructured data based on the OEM model and Lorel language, and we implement our extensions in a system called Ozone. In our approach, structured data may contain entry points to semistructured data, and vice-versa. The unified representation and querying of such "hybrid" data is the main contribution of our work. We retain strong typing and access to all properties of structured portions of the data while allowing flexible navigation of semistructured data without requiring full knowledge of structure. Ozone also enhances both ODMG/OQL and OEM/Lorel by virtue of their combination. For instance, Ozone allows OEM semantics to be applied to ODMG data, thus supporting semistructured-style navigation of structured data. Ozone als...
Persistence of Web references in scientific research - IEEE COMPUTER, 2001
Cited by 18 (1 self)
The web has greatly improved the accessibility of scientific information; however, the role of the web in formal scientific publishing has been debated. Some argue that the lack of persistence of web resources means that they should not be cited in scientific research. We analyze references to web resources in computer science publications, finding that the number of web references has increased dramatically in the last few years, and that many of these references are now invalid. We also find that most invalid web references can be relocated easily. We argue that, while formal references to published articles should always be used when possible, web references help to improve communication and progress in science. However, citation practices need to be improved to minimize future loss. We provide recommended practices for citing web resources, and discuss methods for relocating invalid references.
Analysis of lexical signatures for improving information persistence on the World Wide Web - ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2004
Cited by 10 (0 self)
A lexical signature (LS) consisting of several key words from a Web document is often sufficient information for finding the document later, even if its URL has changed. We conduct a large-scale empirical study of nine methods for generating lexical signatures, including Phelps and Wilensky’s original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performance on the Web over a 10-month period, and on a TREC data set, evaluating their ability to both (1) uniquely identify the original (possibly modified) document, and (2) locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the Web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. The term-frequency inverse-document-frequency- (TFIDF-) based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates among static methods for generating effective lexical signatures. We propose a dynamic LS generator
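The static selection rules named in this abstract (TF, DF, TFIDF) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the whitespace tokenizer, the tie-breaking rule, and the use of a small in-memory corpus to estimate document frequency are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, whitespace split, alphabetic words only.
    return [w for w in text.lower().split() if w.isalpha()]


def lexical_signature(doc, corpus, k=5, method="tfidf"):
    """Pick k terms from doc as a lexical signature.

    corpus is a list of document strings used to estimate document frequency.
    method is "tf", "df", or "tfidf" (assumed labels, mirroring the abstract).
    """
    tf = Counter(tokenize(doc))
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))  # count each term once per document
    n_docs = len(corpus)

    def score(term):
        d = max(df[term], 1)  # guard against terms absent from the corpus
        if method == "tf":
            return tf[term]  # most frequent terms in the document
        if method == "df":
            return -d  # rarest terms across the corpus
        return tf[term] * math.log(n_docs / d)  # TF weighted by rarity

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(tf, key=lambda t: (-score(t), t))[:k]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the ozone database stores ozone data",
]
signature = lexical_signature(corpus[2], corpus, k=2)
```

In a real setting the document frequencies would come from a search engine or a large index rather than a three-document list, which is the scale effect the abstract points to when comparing TREC with the Web.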
Web Interaction and the Navigation Problem in Hypertext written for Encyclopedia of Microcomputers, 2001
Cited by 9 (7 self)
The web has become a ubiquitous tool, used in day-to-day work, to find information and conduct business, and it is revolutionising the role and availability of information. One of the problems encountered in web interaction, which is still unsolved, is the navigation problem, whereby users can "get lost in hyperspace", meaning that when following a sequence of links, i.e. a trail of information, users tend to become disoriented in terms of the goal of their original query and the relevance to the query of the information they are currently browsing. Herein we build statistical foundations for tackling the navigation problem based on a formal model of the web in terms of a probabilistic automaton, which can also be viewed as a finite ergodic Markov chain. In our model of the web the probabilities attached to state transitions have two interpretations, namely, they can denote the proportion of times a user followed a link, and alternatively they can denote the expected utility of following a link. Using this approach we have developed two techniques for constructing a web view based on the two interpretations of the probabilities of links, where a web view is a collection of relevant trails. The first method we describe is concerned with finding frequent user behaviour patterns. A collection of trails is taken as input and an ergodic Markov chain is produced as output with the probabilities of transitions corresponding to the frequency the user traversed the associated links. The second method we describe is a reinforcement learning algorithm that attaches higher probabilities to links whose expected trail relevance is higher. The user's home page and a query are taken as input and an ergodic Markov chain is produced as output with the probabilities of...
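The first method described above, turning a collection of trails into a Markov chain whose transition probabilities reflect how often links were followed, can be sketched roughly as follows. The function name and the representation of a trail as a list of page identifiers are assumptions; the sketch only normalizes link counts and does not attempt to guarantee ergodicity or reproduce the authors' algorithm.

```python
from collections import defaultdict


def transition_probabilities(trails):
    """Estimate link-following probabilities from user trails.

    trails is a list of trails, each a list of page identifiers in visit order.
    Returns {source: {destination: probability}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for trail in trails:
        # Each consecutive pair of pages is one traversed link.
        for src, dst in zip(trail, trail[1:]):
            counts[src][dst] += 1

    probs = {}
    for src, outgoing in counts.items():
        total = sum(outgoing.values())
        # Normalize so the outgoing probabilities of each page sum to 1.
        probs[src] = {dst: n / total for dst, n in outgoing.items()}
    return probs


trails = [
    ["home", "a", "b"],
    ["home", "a", "c"],
    ["home", "b"],
]
probs = transition_probabilities(trails)
```

Here two of the three trails leave "home" via "a", so that link receives probability 2/3. The second interpretation in the abstract (expected utility of a link) would need a relevance score per page, which is not specified in the text shown here.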
Hypermedia Scenarios for Command Control - in Proceedings of the 13th Army Science Conference, 1998
Cited by 1 (0 self)
The purpose of this paper is to present research in defining a scenario scripting language. Scenarios have been used in many disciplines, such as software engineering, cognitive science, and HCI (Human-Computer Interaction) to aid in decision making, comprehension, design, and training. Scenarios are instances of system behavior. Building new scenarios or analyzing existing scenarios orients the discussion in collaborative activities and increases understanding in single-user tasks. In this research, we propose a general conceptual model of a scenario. We then describe a document structure for scenarios. This document structure will define a Standard Generalized Markup Language (SGML) approach to scenarios. The use of a markup language will allow scenario documents to take advantage of hypermedia representations of the components. Building appropriate tools to interpret the Scenario Markup Language (SCML) will support the creation of different representations of scenarios (such as storybo...
Persistence of information on the web: Analyzing citations contained in research articles - In CIKM ’00: Proceedings of the ninth international conference on Information and knowledge management, 2000
We analyze the persistence of information on the web, looking at the percentage of invalid URLs contained in academic articles within the CiteSeer database. The number of URLs contained in research papers has increased from an average of 0.04 in 1992 to 1.8 in 1999. We analyzed the validity of URLs based on the year of publication of articles. The percentage of invalid URLs in articles varied from 23% for articles published in 1999, to a maximum of 54% for 1993. We found that it was possible to find the new location of invalid URLs about 51% of the time, by either guessing (14%) or with the use of search engines (37%). For an additional 26% of URLs we were able to locate highly related information. 3% of URLs could not be easily located but were accompanied by a formal citation, while 14% of invalid links could not be easily located and were not accompanied by a formal citation (the remaining 6% were never valid). We also analyze the type of URLs cited in research articles, the correla...
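The breakdown of invalid URLs reported above can be checked with a short tally. The category labels are paraphrases introduced here for illustration; the percentages are the ones stated in the abstract.

```python
# Share of invalid URLs by outcome, in percent, as stated in the abstract.
relocated = {"guessing": 14, "search engine": 37}

outcomes = {
    "relocated": sum(relocated.values()),
    "related information found": 26,
    "not located, formal citation present": 3,
    "not located, no formal citation": 14,
    "never valid": 6,
}

# The five outcome categories should account for every invalid URL.
total = sum(outcomes.values())
```

The two relocation routes add up to the 51% figure, and the five outcome categories together cover all invalid URLs.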

