Wrapper Induction: Efficiency and Expressiveness
Artificial Intelligence, 2000
Cited by 191 (12 self)
The Internet presents numerous sources of useful information: telephone directories, product catalogs, stock quotes, event listings, etc. Recently, many systems have been built that automatically gather and manipulate such information on a user's behalf. However, these resources are usually formatted for use by people (e.g., the relevant content is embedded in HTML pages), so extracting their content is difficult. Most systems use customized wrapper procedures to perform this extraction task. Unfortunately, writing wrappers is tedious and error-prone. As an alternative, we advocate wrapper induction, a technique for automatically constructing wrappers. In this article, we describe six wrapper classes, and use a combination of empirical and analytical techniques to evaluate the computational tradeoffs among them. We first consider expressiveness: how well the classes can handle actual Internet resources, and the extent to which wrappers in one class can mimic those in another. We then...
Information Integration Using Contextual Knowledge and Ontology Merging
2003
Cited by 39 (5 self)
With the advances in telecommunications, and the introduction of the Internet, information systems achieved physical connectivity, but have yet to establish logical connectivity. Lack of logical connectivity often invites disaster, as in the case of the Mars Orbiter, which was lost because one team used metric units and the other English units while exchanging critical maneuver data. In this thesis, we focus on two intertwined subproblems of logical connectivity, namely data extraction and data interpretation in the domain of heterogeneous information systems. The first challenge, data extraction, is about making it possible to easily exchange data among semi-structured and structured information systems. We describe the design and implementation of a general-purpose, regular-expression-based Caméléon wrapper engine with an integrated capabilities-aware planner/optimizer/executioner. The second challenge, data interpretation, deals with the existence of heterogeneous contexts, whereby each source of information and potential receiver of that information may operate with a different context, leading to large-scale semantic heterogeneity. We extend the existing formalization of the COIN framework with new logical formalisms and features to handle larger
The Caméléon Web Wrapper Engine
Proceedings of the VLDB 2000 Workshop on Technologies for E-Services, 2000
Cited by 36 (25 self)
The web is rapidly becoming the universal repository of information. A major challenge is the ability to support the effective flow of information among the sources and services on the web and their interconnection with legacy systems that were designed to operate with traditional relational databases. This paper describes a technology and infrastructure to address these needs, based on the design of a web wrapper engine called Caméléon. Caméléon extracts data from web pages using declarative specification files that define extraction rules. Caméléon is based on the relational model and designed to work as a relational front-end to web sources. ODBC drivers can be used to send SQL queries to Caméléon. Caméléon presents query results in either XML or HTML table format. Users can also easily call Caméléon from other applications (e.g., from Microsoft Excel by using the Caméléon web query file, Caméléon.iqy). Unlike its predecessor, Grenouille, Caméléon lets users segment web pages and define independent extraction patterns for each attribute. The HTTPClient package used in Caméléon supports both GET and POST methods and is able to deal with authentication, redirection, and cookie issues when connecting to web pages.
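The idea of segmenting a page and then applying an independent extraction pattern to each attribute can be illustrated with a minimal Python sketch. This is not Caméléon's specification-file syntax or its implementation: the `SPEC` layout, the `extract` function, and the sample page are all hypothetical, chosen only to show the general technique with standard regular expressions.

```python
import re

# Hypothetical declarative spec (NOT Cameleon's real file format):
# "begin"/"end" delimit the page segment; each attribute has its own regex.
SPEC = {
    "begin": r'<table id="quotes">',
    "end": r"</table>",
    "attributes": {
        "ticker": r'<td class="sym">([A-Z]+)</td>',
        "price": r'<td class="px">([0-9.]+)</td>',
    },
}

# Illustrative page; the first "sym" cell lies outside the segment.
PAGE = (
    '<p><td class="sym">ZZZ</td></p>'
    '<table id="quotes">'
    '<tr><td class="sym">IBM</td><td class="px">101.5</td></tr>'
    '<tr><td class="sym">MSFT</td><td class="px">55.25</td></tr>'
    "</table>"
)


def extract(page, spec):
    """Return one dict per row, built from per-attribute regex matches."""
    start = re.search(spec["begin"], page)
    if start is None:
        return []
    rest = page[start.end():]
    stop = re.search(spec["end"], rest)
    segment = rest[: stop.start()] if stop else rest
    # Each attribute is matched independently within the segment.
    columns = {
        name: re.findall(pattern, segment)
        for name, pattern in spec["attributes"].items()
    }
    names = list(columns)
    # Align the i-th match of every attribute into the i-th row.
    return [dict(zip(names, row)) for row in zip(*(columns[n] for n in names))]


ROWS = extract(PAGE, SPEC)
```

Aligning matches by position is a simplifying assumption of this sketch; it yields relational-style rows only when every attribute matches once per record.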
Omnibase: Uniform Access to Heterogeneous Data for Question Answering
In Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002
Cited by 36 (13 self)
Although the World Wide Web contains a tremendous amount of information, the lack of uniform structure makes finding the right knowledge difficult. A solution is to turn the Web into a "virtual database" and to access it through natural language. We built Omnibase, a system that integrates heterogeneous data sources using an object-property-value model. With the help of Omnibase, our Start natural language system can now access numerous heterogeneous data sources on the Web in a uniform manner, and answers millions of user questions with high precision.
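An object-property-value model can be pictured as a single lookup interface placed in front of differently shaped sources. The Python sketch below is only an illustration of that general idea, not Omnibase's design or API: the `VirtualDatabase` class, the `register` and `get` methods, and both toy sources are hypothetical names and data introduced for the example.

```python
from typing import Callable, Dict

# A lookup maps (object, property) to a value; every source is wrapped as one.
Lookup = Callable[[str, str], str]


class VirtualDatabase:
    """Hypothetical uniform front end over heterogeneous sources."""

    def __init__(self) -> None:
        self._sources: Dict[str, Lookup] = {}

    def register(self, source: str, lookup: Lookup) -> None:
        self._sources[source] = lookup

    def get(self, source: str, obj: str, prop: str) -> str:
        # Same call shape regardless of how the source stores its data.
        return self._sources[source](obj, prop)


# Toy source 1: nested dictionary keyed by object, then property.
FILMS = {"Casablanca": {"director": "Michael Curtiz", "year": "1942"}}


def film_lookup(obj: str, prop: str) -> str:
    return FILMS[obj][prop]


# Toy source 2: flat list of (object, property, value) rows.
COUNTRIES = [("France", "capital", "Paris"), ("Japan", "capital", "Tokyo")]


def country_lookup(obj: str, prop: str) -> str:
    for row_obj, row_prop, value in COUNTRIES:
        if row_obj == obj and row_prop == prop:
            return value
    raise KeyError((obj, prop))


db = VirtualDatabase()
db.register("films", film_lookup)
db.register("countries", country_lookup)
```

A caller then asks `db.get("films", "Casablanca", "director")` and `db.get("countries", "Japan", "capital")` in the same way, even though the two sources are stored differently.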
WysiWyg Web Wrapper Factory (W4F)
Proceedings of WWW Conference, 1999
Cited by 20 (0 self)
In this paper, we present the W4F toolkit for the generation of wrappers for Web sources. W4F consists of a retrieval language to identify Web sources, a declarative extraction language (the HTML Extraction Language) to express robust extraction rules and a mapping interface to export the extracted information into some user-defined data-structures. To assist the user and make the creation of wrappers rapid and easy, the toolkit offers some wysiwyg support via some wizards. Together, they permit the fast and semi-automatic generation of ready-to-go wrappers provided as Java classes. W4F has been successfully used to generate wrappers for database systems and software agents, making the content of Web sources easily accessible to any kind of application. Keywords: Web wrapper, information extraction, HTML parsing, HTML to XML conversion.
Personalizing the Web using site descriptions
In Proceedings of the 10th International Workshop on Database and Expert Systems Applications (DEXA), 1999
Cited by 5 (1 self)
The information overload on the Web has created a great need for efficient filtering mechanisms. Many sites (e.g., CNN and Quicken) address this problem by allowing a user to create personalized pages that contain only information that is of interest to the user. We propose a new approach for personalization that improves on existing services in three significant ways: the user can create personalized pages with information from any site (without being restricted to sites that offer personalization); personalized pages may contain information from multiple Web sites (e.g., a user can create a personalized page that contains not only news categories from her favorite news sources, but also information about the prices of all stocks whose names appear in the headlines of selected news, and weather information for a particular city); and users have more privacy since they are not required to sign up for the service. In order to build a personalization service that is general and easy to maintain, we make use of site descriptions that facilitate access to the data stored in and generated by Web sites. Site descriptions encode information about the contents, structure, and services offered by a Web site, and they can be created semi-automatically.
Transclusions in the 21st Century
Journal of Universal Computer Science, 2001
Cited by 4 (2 self)
When quoting some part of a document authors usually cut and paste the relevant content into the new document. Thereby the connection between this selected part and the original document is lost. Transclusions – first mentioned in 1960 by Ted Nelson – address this problem of 'lost context'. With transclusions it is possible to store information about the original document and the exact position of the quote in the newly created document and provide the reader with additional navigational features. Document formats and information systems matured over the last 40 years. This paper gives an overview of some document formats available today in the WWW environment and points to some requirements for server systems providing transclusions. Thereafter we present some ideas on how to implement transclusions
Beyond XML Query Languages
In Proceedings of the Query Language Workshop (QL'98), 1998
Cited by 1 (0 self)
A query language is essential if XML is to serve effectively as an exchange medium for large data sets. The design of query languages for XML is in its infancy, and the choice of a standard may be governed more by user acceptance than by any understanding of underlying principles. One would hope that expressive power, performance, and compatibility with other languages will be considered in choosing among alternatives, but it is likely that several contenders will co-exist for some time. It is worth observing that, during the 20-year development of relational query languages, several competing languages were developed, and even today there are several relational query language standards. In spite of this, a great deal of technology was developed that was independent of the surface syntax of a query language. This included technology "below" the language, such as efficient execution models, and work "above" the level of language, such as techniques for view definition and maintenance, triggers, etc. At Penn we are working on some of these language-independent issues. We include a summary of them here. They include execution and data models to support XML and semistructured query languages; the use of schemas and constraints in optimizing XML query languages; and tools for extracting data from existing sources and presenting it as XML.
Open Information Pools
1999
Cited by 1 (0 self)
On the WWW it is not possible to supplement existing web pages of other people with new information or a link to that information, because the WWW does not have a standard method for write access. With write access, information can be added in the right context, which eases searching. We therefore define Open Information Pools: a collection of WWW based databases with public write access. By using databases we add structure to the information. Each database deals with a specific topic. We developed an architecture to support Open Information Pools. Important elements in the architecture are the rating and moderation tools. With these tools the user group is able to maintain and update the database and also to prevent errors and abuse. We conducted measurements on operational rating and moderation tools to show the validity of our idea. The study of Slashdot.org's rating and moderation tools shows that insightful information is recognised after only 37 minutes. We implemented a prototype...
Extraction of Web Information Using W4F Wrapper Factory and XML-QL Query Language
1999
Cited by 1 (0 self)
In many ways, the Web has become the largest knowledge base known to us. The problem facing the user now is not that the information he seeks is not available, but that it is not easy for him to extract exactly what he needs from what is available. It is also becoming clear that a top down approach of gathering all the information, and structuring it will not work, except in some special cases. Indeed, most of the information is present in HTML documents structured only for visual content. Instead, new tools are being developed that attack this problem from a different angle. XML is a language that allows the publisher of the data to structure it using markup tags. These mark-up tags clarify not only the visual structure of the document, but also the semantic structure. Additionally, one can make use of a query language XML-QL to query XML pages for information, and to merge information from disparate XML sources. However, most of the content of the web is published in HTML. The W4F sy...

