Results 1 - 3 of 3
Looking at the Web through XML glasses, 1999
Cited by 23 (1 self)
The Web so far has been incredibly successful at delivering information to human users. So successful actually, that there is now an urgent need to go beyond a browsing human and make information accessible to applications, in order to offer automation, inter-operation and Web-awareness among services. To do so, information from Web sources needs to be accessible in a structured way. XML and its various extensions (data-models, query languages) are a step in this direction. Unfortunately, the Web is not yet a well organized repository of nicely structured XML documents but rather a conglomerate of volatile HTML pages, for which structure has to be extracted. To address this problem, we present the World Wide Web Wrapper Factory (W4F), a Java toolkit for the generation of wrappers for Web sources. Our main contributions are: (1) an expressive language to specify the extraction of complex structures from HTML pages; (2) a declarative mapping to XML documents, with the automatic generat...
WysiWyg Web Wrapper Factory (W4F) - Proceedings of WWW Conference, 1999
Cited by 20 (0 self)
In this paper, we present the W4F toolkit for the generation of wrappers for Web sources. W4F consists of a retrieval language to identify Web sources, a declarative extraction language (the HTML Extraction Language) to express robust extraction rules and a mapping interface to export the extracted information into some user-defined data-structures. To assist the user and make the creation of wrappers rapid and easy, the toolkit offers some wysiwyg support via some wizards. Together, they permit the fast and semi-automatic generation of ready-to-go wrappers provided as Java classes. W4F has been successfully used to generate wrappers for database systems and software agents, making the content of Web sources easily accessible to any kind of application. Keywords: Web wrapper, information extraction, HTML parsing, HTML to XML conversion.
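The three layers named in this abstract (retrieval, extraction, mapping) can be pictured with a minimal Java sketch. It is an illustration only, not W4F code: the class `WrapperSketch`, the `Movie` structure, the canned HTML page, and the use of a regular expression in place of the HTML Extraction Language are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical three-layer wrapper; none of these names come from W4F.
public class WrapperSketch {

    // Hypothetical user-defined structure targeted by the mapping layer.
    public static final class Movie {
        public final String title;
        public final int year;

        Movie(String title, int year) {
            this.title = title;
            this.year = year;
        }
    }

    // Retrieval layer (stand-in): a real wrapper would fetch the page
    // over HTTP; a fixed string keeps the sketch self-contained.
    static String retrieve() {
        return "<ul><li><b>Casablanca</b> (1942)</li>"
             + "<li><b>Vertigo</b> (1958)</li></ul>";
    }

    // Extraction layer (stand-in): a regular expression replaces the
    // declarative extraction rules described in the abstract.
    static List<String[]> extract(String html) {
        Pattern rule = Pattern.compile("<li><b>([^<]+)</b> \\((\\d{4})\\)</li>");
        Matcher m = rule.matcher(html);
        List<String[]> rows = new ArrayList<>();
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2) });
        }
        return rows;
    }

    // Mapping layer: turn the raw extracted strings into typed objects.
    static List<Movie> map(List<String[]> rows) {
        List<Movie> movies = new ArrayList<>();
        for (String[] row : rows) {
            movies.add(new Movie(row[0], Integer.parseInt(row[1])));
        }
        return movies;
    }

    // The wrapper itself is the composition of the three layers.
    static List<Movie> run() {
        return map(extract(retrieve()));
    }

    public static void main(String[] args) {
        for (Movie m : run()) {
            System.out.println(m.title + " (" + m.year + ")");
        }
    }
}
```

Keeping each layer behind its own method means a change in page layout touches only `extract`, while a change in the consuming application touches only `map`.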
Taming Web sources with "minute-made" wrappers
The Web has become a major conduit to information repositories of all kinds. Today, more than 80% of information published on the Web is generated by underlying databases and this proportion keeps increasing. In some cases, database access is only granted through a Web gateway using forms as a query language and HTML as a display vehicle. In order to permit inter-operation (between Web sources and legacy databases or among Web sources themselves) there is a strong need for Web wrappers. Web wrappers share some of the characteristics of standard database wrappers but usually the underlying data sources offer very limited query capabilities and the structure of the result (due to HTML shortcomings) might be loose and unstable. To overcome these problems, we divide the architecture of our Web wrappers into three components: (1) fetching the document, (2) extracting the information from its HTML formatting, and (3) mapping the information into a structure that can be used by applications (such as mediators). W4F is a toolkit that allows the fast generation of Web wrappers. Given a Web source, some extraction rules and some structural mappings, the toolkit generates a Web wrapper (a Java class) that can be used as a stand-alone program or integrated into a more complex system. W4F provides a rich language (HEL: HTML Extraction Language) to express declaratively extraction rules and mappings, as well as a wysiwyg interface that allows the creator of the wrapper to pick relevant pieces of information just by clicking on them, as he sees them in his Web browser. As an illustration, we present the TV-Guide Agent that allows users to query TV movie listings by time scheduled (date, time, channel) and program content (movie genre, rating, year, cast, country, etc.). This example demonstrates real inter-operability between TV-listing information

