An Information Concierge for the Web
- Proceedings of the First International Workshop on Internet Bots: Systems and Applications (INBOSA 2001), in conjunction with the 12th International Conference on Database and Expert System Applications (DEXA 2001)
, 2001
Cited by 2 (2 self)
The WWW Information Collection, Collaging and Programming (WICCAP) system is a software system for generating logical views of web resources and extracting desired information in the form of a structured document. It is designed to let people obtain information of interest in a simple and effective manner, and to make information from the WWW accessible to applications so as to afford automation, inter-operation and Web-awareness among services. A key factor in making this system useful in practice is that it provides tools to automate and facilitate the process of constructing logical representations of websites, identifying and defining information of interest, and retrieving it. In this paper, we present the design of the WICCAP system and its two main components, namely the Mapping Wizard and the Network Extraction Agent.
Start and Beyond
, 2002
Cited by 1 (0 self)
To address the problem of information overload in today's world, we have developed Start, a natural language question answering system that provides users with multimedia information access through the use of natural language annotations. In order to harness the potential of knowledge sources on the World Wide Web, we have developed Omnibase, a virtual database that provides uniform access to Web resources. Our ultimate goal is to develop a computer system that acts like a "smart reference librarian," and to a large extent, we have accomplished our goal. However, expanding our system's domain of knowledge is a time-consuming task that requires trained individuals. This paper describes several research directions aimed at overcoming the limitations of our current technology.
A Methodical Approach to Extracting Interesting Objects from Dynamic Web Pages
This paper presents a fully automated object extraction system for web documents. Our methodology consists of a layered framework and a suite of algorithms. A distinct feature of our approach is the full automation of both the extraction of data object regions from dynamic Web pages and the identification of correct object boundary separators. We implemented the methodology in the XWRAPElite object extraction system and evaluated the system using more than 3,200 pages over 75 diverse web sites. Our experiments show three important and interesting results. First, our algorithms for identifying the minimal object-rich subtree achieve a 96% success rate over all web pages we have tested. Second, our algorithms for discovering and extracting object separator tags reach a success rate of 95%. Most significantly, the overall system achieves precision between 96% and 100% (returns only correct objects) and excellent recall (between 95% and 96%, with very few significant objects left out). The minimal subtree extraction algorithms and the object boundary identification algorithms are fast, taking about 87 milliseconds per page with an average page size of 30 KB.
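As a reading aid for the precision and recall figures quoted above, the following is a minimal sketch of how the two measures are conventionally computed for an object extractor. It is not code from XWRAPElite; the function name `precision_recall` and the sample object identifiers are hypothetical and chosen only for illustration.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for a set of extracted objects.

    precision = correct extracted objects / all extracted objects
    recall    = correct extracted objects / all relevant objects
    """
    extracted, relevant = set(extracted), set(relevant)
    correct = extracted & relevant
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical page: 20 true data objects; the extractor returns 19,
# all of them correct, and misses one.
relevant = {f"obj{i}" for i in range(20)}
extracted = {f"obj{i}" for i in range(19)}

p, r = precision_recall(extracted, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=1.00 recall=0.95
```

In this made-up case every returned object is correct (precision 1.00) while one object is left out (recall 0.95), which is the kind of trade-off the reported ranges describe.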
XML-Enabled Data Extraction for Web Sources
The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats is not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, the web users or applications need a smart way of extracting data from these web sources. One of the popular approaches is to write wrappers around the sources, either manually or with software assistance, to bring the web data within the reach of more sophisticated query tools and general mediator-based information integration systems. In this paper, we describe the methodology and the software development of an XML-enabled wrapper construction system, XWRAP, for semi-automatic generation of wrapper programs. By XML-enabled we mean that the metadata about information content that are implicit in the original web pages will be extracted and encoded explicitly as XML tags in the wrapped documents. In addition, the query-based content filtering process is performed against the XML documents. The XWRAP wrapper generation framework has three distinct features. First, it explicitly separates
Taking the OXPath down the Deep Web
Although deep web analysis has been studied extensively, there is no succinct formalism to describe user interactions with AJAX-enabled web applications. Toward this end, we introduce OXPath as a superset of XPath 1.0. Beyond XPath, OXPath is able (1) to fill web forms and trigger DOM events, (2) to access dynamically computed CSS attributes, (3) to navigate between visible form fields, and (4) to mark relevant information for extraction. This way, OXPath expressions can closely simulate the human interaction relevant for navigation rather than rely exclusively on the HTML structure. Thus, they are quite resilient against technical changes. We demonstrate the expressiveness and practical efficacy of OXPath to tackle a group flight planning problem. We use the OXPath implementation and visual interface to access the popular, highly-scripted travel site Kayak. We show, both visually and manually, how to formulate OXPath expressions to extract all booking information with just a few lines of code.
Automated Meta-Data Extraction for Confsearch Semester Thesis
, 2011
Extracting meta-data from websites is an open field, and to date there is no satisfactory solution for extracting important dates (e.g. the paper submission deadline) from conference websites. We present an automated way to extract the meta-data of an academic conference from its website. We aim to facilitate the manual update of such data on the conference directory Confsearch

