XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources
- In ICDE
, 2000
Cited by 130 (7 self)
The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats are not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, web users or applications need a smart way of extracting data from these web sources. One of the popular approaches is to write wrappers around the sources, either manually or with software assistance, to bring the web data within the reach of more sophisticated query tools and general mediator-based information integration systems. In this paper, we describe the methodology and the software development of an XML-enabled wrapper construction system, XWRAP, for semi-automatic generation of wrapper programs. By XML-enabled we mean that the metadata about information content that are implicit in the original web pages will be extracted and encoded explicitly as XML tags in the wrapped documents. In addition, the query-based content filtering process is performed against the XML documents. The XWRAP wrapper generation framework has three distinct features. First, it explicitly separates
Building light-weight wrappers for legacy web data-sources using W4F
- In Proc. of VLDB
, 1999
"... sahuguet@saul.cis.upenn.edu ..."
Building Intelligent Web Applications Using Lightweight Wrappers
, 2000
Cited by 57 (0 self)
The Web so far has been incredibly successful at delivering information...
Effective Web Data Extraction with Standard XML Technologies
- In Proceedings of the Tenth International World Wide Web Conference, Hong Kong
, 2001
Cited by 27 (2 self)
We discuss the problem of Web data extraction and describe an XML-based methodology whose goal extends far beyond simple "screen scraping." An ideal data extraction process is able to digest target Web databases that are visible only as HTML pages, and create a local, identical replica of those databases as a result. What is needed in this process is much more than a Web crawler and a set of Web site wrappers. A comprehensive data extraction process needs to deal with roadblocks such as session identifiers, HTML forms, and client-side JavaScript, and with data integration problems such as incompatible datasets and vocabularies, and missing and conflicting data. Proper data extraction also requires a solid data validation and error recovery service to handle data extraction failures, which are unavoidable. In this paper
Web Ecology: Recycling HTML pages as XML documents using W4F
- In ACM SIGMOD Workshop on the Web and Databases (WebDB)
, 1999
Cited by 24 (2 self)
In this paper we present the World-Wide Web Wrapper Factory (W4F), a Java toolkit to generate wrappers for Web data sources. Some key features of W4F are an expressive language to extract information from HTML pages in a structured way, a mapping to export it as XML documents, and some visual tools to assist the user during wrapper creation. Moreover, the entire description of wrappers is fully declarative. As an illustration, we demonstrate how to use W4F to create XML gateways that transparently serve HTML pages on the fly as XML documents with their DTDs.
1 Introduction
The Web has become a major conduit to information repositories of all kinds. Today, more than 80% of information published on the Web is generated by underlying databases, and this proportion keeps increasing. But Web data sources also consist of stand-alone HTML pages hand-coded by individuals, which provide very useful information such as reviews, digests, links, etc. As soon as we want to go beyond the basic m...
Looking at the Web through XML glasses
, 1999
Cited by 23 (1 self)
The Web so far has been incredibly successful at delivering information to human users. So successful actually, that there is now an urgent need to go beyond a browsing human and make information accessible to applications, in order to offer automation, inter-operation and Web-awareness among services. To do so, information from Web sources needs to be accessible in a structured way. XML and its various extensions (data-models, query languages) are a step in this direction. Unfortunately, the Web is not yet a well organized repository of nicely structured XML documents but rather a conglomerate of volatile HTML pages, for which structure has to be extracted. To address this problem, we present the World Wide Web Wrapper Factory (W4F), a Java toolkit for the generation of wrappers for Web sources. Our main contributions are: (1) an expressive language to specify the extraction of complex structures from HTML pages; (2) a declarative mapping to XML documents, with the automatic generat...
WysiWyg Web Wrapper Factory (W4F)
- Proceedings of WWW Conference
, 1999
Cited by 20 (0 self)
In this paper, we present the W4F toolkit for the generation of wrappers for Web sources. W4F consists of a retrieval language to identify Web sources, a declarative extraction language (the HTML Extraction Language) to express robust extraction rules, and a mapping interface to export the extracted information into some user-defined data structures. To assist the user and make the creation of wrappers rapid and easy, the toolkit offers some wysiwyg support via some wizards. Together, they permit the fast and semi-automatic generation of ready-to-go wrappers provided as Java classes. W4F has been successfully used to generate wrappers for database systems and software agents, making the content of Web sources easily accessible to any kind of application. Keywords: Web wrapper, information extraction, HTML parsing, HTML to XML conversion.
Robust Web Data Extraction with XML Path Expressions
- IBM Research Report
, 2002
Cited by 7 (0 self)
Automated extraction of structured Web data has attracted considerable interest in both academia and industry. A particularly promising approach is to employ XML technologies to translate semi-structured HTML documents to “pure” XML documents. In this approach, HTML documents are first normalized into XHTML and then mapped to the desired XML application format by using XML path expressions and regular expressions. In this paper we describe a methodology for creating XML path (XPath) expressions that are capable of extracting data from virtually any HTML page, while placing an emphasis on the persistent integrity of these expressions. This robustness is critical given the vulnerability of extraction technologies to the continually changing content, structure, and formatting of pages on the Web. We define categories of extraction rules in terms of their dependence on content, structural, or formatting features, and provide practical tips on how to create dependable data extraction patterns for the Web.
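The pipeline this abstract outlines (normalize to XHTML, then map to XML with path expressions and regular expressions) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's rules: the sample page, the `quotes` table id, the output element names, and the `extract_quotes` helper are all hypothetical, and the code relies on the limited XPath subset in Python's standard `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already normalized to well-formed XHTML.
XHTML = """<html><body>
<table id="quotes">
<tr><th>Symbol</th><th>Price</th></tr>
<tr><td>ABC</td><td>USD 12.50</td></tr>
<tr><td>XYZ</td><td>USD 7.25</td></tr>
</table>
</body></html>"""

# Regular expression that pulls the numeric part out of a price cell.
PRICE_RE = re.compile(r"(\d+\.\d+)")


def extract_quotes(xhtml: str) -> ET.Element:
    """Map rows of the (hypothetical) quotes table to a small XML document."""
    root = ET.fromstring(xhtml)
    out = ET.Element("quotes")
    # Select the table by its id attribute instead of by its position,
    # so the rule does not depend on where the table sits in the page.
    for row in root.findall(".//table[@id='quotes']/tr"):
        cells = row.findall("td")
        if len(cells) != 2:
            continue  # header rows have <th> cells and no <td> cells
        match = PRICE_RE.search(cells[1].text or "")
        if match is None:
            continue  # skip rows whose price cell does not parse
        quote = ET.SubElement(out, "quote", symbol=(cells[0].text or "").strip())
        quote.text = match.group(1)
    return out


result = extract_quotes(XHTML)
print(ET.tostring(result, encoding="unicode"))
```

Selecting by attribute is one example of a rule that depends on a formatting feature and not on document position; whether that is the more stable choice for a given site is a judgment call that the sketch does not settle.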
Beyond XML Query Languages
- In Proceedings of the Query Language Workshop (QL’98)
, 1998
Cited by 1 (0 self)
A query language is essential if XML is to serve effectively as an exchange medium for large data sets. The design of query languages for XML is in its infancy, and the choice of a standard may be governed more by user acceptance than by any understanding of underlying principles. One would hope that expressive power, performance, and compatibility with other languages will be considered in choosing among alternatives, but it is likely that several contenders will co-exist for some time. It is worth observing that, during the 20-year development of relational query languages, several competing languages were developed, and even today there are several relational query language standards. In spite of this, a great deal of technology was developed that was independent of the surface syntax of a query language. This included technology "below" the language, such as efficient execution models, and work "above" the level of language, such as techniques for view definition and maintenance, triggers, etc. At Penn we are working on some of these language-independent issues. We include a summary of them here. They include execution and data models to support XML and semistructured query languages; the use of schemas and constraints in optimizing XML query languages; and tools for extracting data from existing sources and presenting it as XML.
1 Challenges for Query Languages
IDB: Toward the Scalable Integration of Queryable Internet Data Sources
, 2000
Cited by 1 (0 self)
As the number of databases accessible on the Web grows, the ability to execute queries spanning multiple heterogeneous queryable sources is becoming increasingly important. To date, research in this area has focused on providing semantic completeness, and has generated solutions that work well when querying over a relatively small number of databases that have static and well-defined schemas. Unfortunately, these solutions do not extend to the scale of the present Internet, let alone the Internet of the future. In this paper, we present an approach that makes the opposite tradeoff: it provides a scalable, unified view over large numbers of queryable information sources by sacrificing some expressive power in the set of queries supported. We have developed a prototype system, IDB, which implements this approach. The IDB system provides scalability through three main techniques. First, it uses a collection of ontologies organized into hierarchical namespaces as a medium for ex...

