Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources
, 1997
Cited by 246 (2 self)
Garlic is a middleware system that provides an integrated view of a variety of legacy data sources, without changing how or where data is stored. In this paper, we describe our architecture for wrappers, key components of Garlic that encapsulate data sources and mediate between them and the middleware. Garlic wrappers model legacy data as objects, participate in query planning, and provide standard interfaces for method invocation and query execution. To date, we have built wrappers for 10 data sources. Our experience shows that Garlic wrappers can be written quickly and that our architecture is flexible enough to accommodate data sources with a variety of data models and a broad range of traditional and non-traditional query processing capabilities.
Building light-weight wrappers for legacy web datasources using W4F.
- In International Conference on Very Large Databases (VLDB)
, 1999
Building Intelligent Web Applications Using Lightweight Wrappers
, 2000
Cited by 80 (0 self)
The Web so far has been incredibly successful at delivering information...
Looking at the Web through XML glasses
, 1999
Cited by 25 (1 self)
The Web so far has been incredibly successful at delivering information to human users. So successful actually, that there is now an urgent need to go beyond a browsing human and make information accessible to applications, in order to offer automation, inter-operation and Web-awareness among services. To do so, information from Web sources needs to be accessible in a structured way. XML and its various extensions (data-models, query languages) are a step in this direction. Unfortunately, the Web is not yet a well organized repository of nicely structured XML documents but rather a conglomerate of volatile HTML pages, for which structure has to be extracted. To address this problem, we present the World Wide Web Wrapper Factory (W4F), a Java toolkit for the generation of wrappers for Web sources. Our main contributions are: (1) an expressive language to specify the extraction of complex structures from HTML pages; (2) a declarative mapping to XML documents, with the automatic generat...
Web Ecology: Recycling HTML pages as XML documents using W4F
"... sahuguet@saul.cis.upenn.edu ..."
A Multithreaded Java Framework for Information Extraction
In this paper, we present a new multithreaded framework for information extraction with Java in heterogeneous enterprise application environments, which frees the developer from having to deal with the error-prone task of low-level thread programming. The power of this framework is demonstrated by an example of extracting product prices from web sites, but the framework is useful for numerous other purposes, too. Strong points of the framework are its performance, continuous feedback, and adherence to maximum response times. The description of the framework uses UML modeling techniques for visualizing multithreading. Moreover, we tackle Java problems of stopping running threads.
Taming Web sources with "minute-made" wrappers
The Web has become a major conduit to information repositories of all kinds. Today, more than 80% of information published on the Web is generated by underlying databases and this proportion keeps increasing. In some cases, database access is only granted through a Web gateway using forms as a query language and HTML as a display vehicle. In order to permit inter-operation (between Web sources and legacy databases or among Web sources themselves) there is a strong need for Web wrappers. Web wrappers share some of the characteristics of standard database wrappers but usually the underlying data sources offer very limited query capabilities and the structure of the result (due to HTML shortcomings) might be loose and unstable. To overcome these problems, we divide the architecture of our Web wrappers into three components: (1) fetching the document, (2) extracting the information from its HTML formatting, and (3) mapping the information into a structure that can be used by applications (such as mediators). W4F is a toolkit that allows the fast generation of Web wrappers. Given a Web source, some extraction rules and some structural mappings, the toolkit generates a Web wrapper (a Java class) that can be used as a stand-alone program or integrated into a more complex system. W4F provides a rich language (HEL: HTML Extraction Language) to express declaratively extraction rules and mappings, as well as a WYSIWYG interface that allows the creator of the wrapper to pick relevant pieces of information just by clicking on them, as he sees them in his Web browser. As an illustration, we present the TV-Guide Agent that allows users to query TV movie listings by time scheduled (date, time, channel) and program content (movie genre, rating, year, cast, country, etc.). This example demonstrates real inter-operability between TV-listing information
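The three-component split described in this abstract (fetch, extract, map) can be illustrated with a minimal Python sketch. This is not W4F or HEL: the sample page, the function names (`fetch`, `extract`, `map_to_records`, `wrap`), and the rule "a new record starts at each `title` cell" are all assumptions made purely for illustration, and the fetch stage reads an in-memory string rather than a real Web source.

```python
from html.parser import HTMLParser

# Hypothetical sample page standing in for a fetched Web source.
SAMPLE_HTML = """
<html><body><table>
<tr><td class="title">Casablanca</td><td class="year">1942</td></tr>
<tr><td class="title">Vertigo</td><td class="year">1958</td></tr>
</table></body></html>
"""


def fetch(source):
    """Stage 1: retrieve the document.

    Assumption: the 'source' is already an HTML string; a real wrapper
    would issue an HTTP request (possibly filling in a form) here.
    """
    return source


class CellExtractor(HTMLParser):
    """Stage 2: collect (class, text) pairs for every <td> that has a class."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._current_class = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._current_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "td":
            self._current_class = None

    def handle_data(self, data):
        text = data.strip()
        if self._current_class and text:
            self.cells.append((self._current_class, text))


def extract(html):
    """Run the extractor over the page and return the flat list of cells."""
    parser = CellExtractor()
    parser.feed(html)
    return parser.cells


def map_to_records(cells):
    """Stage 3: map flat cells into records an application can consume.

    Assumption: each 'title' cell opens a new record.
    """
    records = []
    for name, value in cells:
        if name == "title":
            records.append({})
        if records:
            records[-1][name] = value
    return records


def wrap(source):
    """Compose the three stages into one wrapper call."""
    return map_to_records(extract(fetch(source)))
```

Keeping the stages as separate functions means a change in page layout only touches `extract`, while a change in the target structure only touches `map_to_records`.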
Optimizing Communications in Processing Data Integration Queries
Because query processing for data integration must access data from numerous widely distributed sources over a network, it is crucial to investigate how to deal with the expensive communication overhead. This paper introduces a staged data integration model for grid environments. It takes advantage of the abundant computer nodes to process integrated queries over a number of highly distributed, high-volume data sources. The content-based scheduling algorithm in the model groups queries over similar data sources together to enhance the opportunities for data sharing among concurrent queries on the same data source. Furthermore, an approach to multiple-query optimization is proposed to exploit data sharing and avoid redundant data transfer without sacrificing the autonomy of the data sources. Experimental results validate that our algorithms improve data integration performance in terms of both communication traffic and response time.
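The idea of grouping queries by the data sources they touch, so that concurrent queries can share one transfer per source, can be sketched in Python as follows. This is not the paper's algorithm: it simplifies "similar data sources" to "exactly the same set of sources", and the names `group_by_sources` and `transfers_needed`, along with the sample queries, are hypothetical.

```python
from collections import defaultdict


def group_by_sources(queries):
    """Group query ids by the exact set of sources they read.

    Assumption: two queries are 'similar' only when their source sets
    are identical; a real scheduler would likely use a looser measure.
    """
    groups = defaultdict(list)
    for query_id, sources in queries.items():
        groups[frozenset(sources)].append(query_id)
    return dict(groups)


def transfers_needed(queries, shared):
    """Count source transfers with or without sharing inside each group."""
    if shared:
        # One transfer per distinct source per group.
        return sum(len(source_set) for source_set in group_by_sources(queries))
    # One transfer per distinct source per query.
    return sum(len(set(sources)) for sources in queries.values())


# Hypothetical workload: q1 and q2 read the same two sources.
queries = {
    "q1": ["A", "B"],
    "q2": ["B", "A"],
    "q3": ["C"],
}
```

In this toy workload, sharing within groups lowers the transfer count from five to three, since `q1` and `q2` fall into the same group.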