FiVaTech: Page-Level Web Data Extraction from Template Pages
Cited by 3 (1 self)
In this paper, we propose a new approach, called FiVaTech, for the problem of Web data extraction. FiVaTech is a page-level data extraction system that deduces the data schema and templates for input pages generated by a CGI program. FiVaTech uses tree templates to model the generation of dynamic Web pages. It can deduce the schema and templates for each individual Deep Web site, whose pages contain either a single data record or multiple data records. FiVaTech applies tree matching, tree alignment, and mining techniques to achieve this challenging task. The experiments show encouraging results on the test pages used in many state-of-the-art Web data extraction works.
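The abstract does not say which tree matching algorithm FiVaTech uses. As an illustration only, the sketch below implements the classic Simple Tree Matching score, which counts the largest number of node pairs that can be matched between two ordered trees while preserving sibling order. The `Node` class and the function name are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A minimal ordered tree node (hypothetical stand-in for a DOM element)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the size of the maximum order-preserving matching of a and b.

    Roots with different tags contribute nothing. Otherwise the children are
    aligned with an LCS-style dynamic program whose cell weights are the
    recursive matching scores of the corresponding subtrees.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            best[i][j] = max(best[i - 1][j], best[i][j - 1], best[i - 1][j - 1] + pair)
    return best[m][n] + 1  # +1 for the matched roots
```

A higher score relative to tree size suggests that two subtrees were generated by the same template fragment, for example two table rows that differ only in the number of cells.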
Provenance-based refresh in data-oriented workflows
2010
Cited by 3 (2 self)
We consider a general workflow setting in which input data sets are processed by a graph of transformations to produce output results. Our goal is to perform efficient selective refresh of elements in the output data, i.e., compute the latest values of specific output elements when the input data may have changed. Our approach is based on capturing one-level data provenance at each transformation when the workflow is run initially. Then at refresh time provenance is used to determine (transitively) which input elements are responsible for given output elements, and the workflow is rerun only on that portion of the data needed for refresh. Our contributions are to formalize the problem setting and the problem itself, to specify properties of transformations and provenance that are required for efficient refresh, and to provide algorithms that apply to a wide class of transformations and workflows. We have built a preliminary prototype system supporting the features and algorithms presented in the paper.
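The idea of tracing provenance transitively and rerunning only the needed portion can be sketched on a toy workflow. Everything below is a hypothetical illustration, not the paper's algorithm: the two-step workflow (parse "name,amount" records, then sum amounts per name), the function names, and the simplifying assumption that a refresh only sees changed values for existing input records (no insertions, deletions, or renamed keys).

```python
from typing import Dict, Set, Tuple


def run_workflow(inputs: Dict[str, str]) -> Tuple[Dict[str, int], Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Initial run: compute totals and record one-level provenance per step."""
    parsed: Dict[str, Tuple[str, int]] = {}
    prov_parsed: Dict[str, Set[str]] = {}   # parsed element -> input record ids
    for rid, line in inputs.items():
        name, amount = line.split(",")
        parsed[rid] = (name, int(amount))
        prov_parsed[rid] = {rid}

    totals: Dict[str, int] = {}
    prov_totals: Dict[str, Set[str]] = {}   # output element -> parsed element ids
    for pid, (name, amount) in parsed.items():
        totals[name] = totals.get(name, 0) + amount
        prov_totals.setdefault(name, set()).add(pid)
    return totals, prov_parsed, prov_totals


def refresh(name: str, inputs: Dict[str, str],
            prov_parsed: Dict[str, Set[str]],
            prov_totals: Dict[str, Set[str]]) -> int:
    """Recompute one output element from only the inputs it depends on."""
    # Trace provenance transitively: output -> parsed elements -> input records.
    needed: Set[str] = set()
    for pid in prov_totals[name]:
        needed |= prov_parsed[pid]
    # Rerun both steps on just that slice of the (possibly changed) input.
    return sum(int(inputs[rid].split(",")[1]) for rid in needed)
```

With inputs `{"r1": "ann,3", "r2": "bob,5", "r3": "ann,4"}`, changing `r3` to `"ann,10"` and calling `refresh("ann", ...)` touches only `r1` and `r3`; the record for `bob` is never reread.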
Updating Probabilistic XML
Cited by 2 (2 self)
We investigate the complexity of performing updates on probabilistic XML data for various classes of probabilistic XML documents of different succinctness. We consider two elementary kinds of updates, insertions and deletions, that are defined with the help of a locator query that specifies the nodes where the update is to be performed. For insertions, two semantics are considered, depending on whether a node is to be inserted once or for every match of the query. We first discuss deterministic updates over probabilistic XML, and then extend the algorithms and complexity bounds to probabilistic updates. In addition to a number of intractability results, our main result is an efficient algorithm for insertions defined with branching-free queries over probabilistic models with local dependencies. Finally, we discuss the problem of updating probabilistic XML databases with continuous probability distributions.
Towards Bridging the Gap between Personalization and Information Extraction
Cited by 2 (2 self)
In this paper we propose to integrate Information
Automated Information Extraction from Web Sources: a Survey
Cited by 2 (0 self)
The Web contains an enormous quantity of information, which is usually formatted for human users. This makes it difficult to extract relevant content from various sources. In the last few years, some authors have addressed the problem of converting Web documents from an unstructured or semi-structured format into a structured, and therefore machine-understandable, format such as XML. In this paper we briefly survey some of the most promising recently developed extraction tools.
Efficient Lyrics Retrieval and Alignment
Cited by 2 (0 self)
We present an algorithm to efficiently retrieve from the Web multiple versions of the lyrics of a given song. First, multiple web pages that potentially contain the lyrics of the given song are collected by querying Google with the song title and artist name. Next, from each of these web pages, the part that probably contains the lyrics is efficiently extracted by making explicit use of the structural properties of lyrics. In addition, we present an efficient approximation algorithm to align the multiple lyrics versions. Multiple sequence alignment is a known NP-hard problem, and we propose an approximation algorithm that is much more efficient than an algorithm proposed in the literature for this application. We present results obtained for a set of 258 songs, illustrating that with our approach we are able to extract relevant lyrics for 97% of them. Key words: lyrics retrieval, Google, multiple sequence alignment.
Scalable Interoperability Through the Use of COIN Lightweight Ontology
In Proceedings of the 2nd VLDB Workshop on Ontologies-based techniques for DataBases and Information Systems (ODBIS 2006), Seoul, Korea
Cited by 2 (2 self)
There are many different kinds of ontologies used for different purposes in modern computing. A continuum exists from lightweight ontologies to formal ontologies. In this paper we compare and contrast the lightweight ontology and the formal ontology approaches to data interoperability. Both approaches have strengths and weaknesses, but they both lack scalability because of the n² problem. We present an approach that combines their strengths and avoids their weaknesses. In this approach, the ontology includes only high-level concepts; subtle differences in the interpretation of the concepts are captured as context descriptions outside the ontology. The resulting ontology is simple and thus easy to create. It also provides a structure for context descriptions. The structure can be exploited to facilitate automatic composition of context mappings. This mechanism leads to a scalable solution to semantic interoperability among disparate data sources and contexts.
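As a worked illustration of the quadratic growth mentioned in this abstract, assume (this reading is not spelled out in the abstract) that the problem refers to the number of directed pairwise mappings needed among n data sources or contexts. The symbol M_pairwise is introduced here only for the example.

```latex
\[
  M_{\text{pairwise}}(n) = n(n-1), \qquad
  M_{\text{pairwise}}(10) = 10 \cdot 9 = 90, \qquad
  M_{\text{pairwise}}(100) = 100 \cdot 99 = 9900 .
\]
```

Under that reading, a tenfold increase in sources raises the number of hand-written mappings by roughly a factor of one hundred, which is why automatic composition of context mappings matters for scalability.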
WebKnox: Web Knowledge Extraction
Cited by 1 (1 self)
The paper describes and evaluates a system for extracting knowledge from the web that uses a domain-independent fact extraction approach and a self-supervised learning algorithm. Using a trust algorithm, the precision of the system is improved to over 70%, compared with a baseline of 52%.
ObjectRunner: Lightweight, Targeted Extraction and Querying of Structured Web Data
Cited by 1 (0 self)
In this paper we present ObjectRunner, a system for extracting, integrating and querying structured data from the Web. Our system harvests real-world items from template-based HTML pages (the so-called structured Web). It illustrates a two-phase querying of the Web, in which an intentional description of the targeted data is first provided, in a flexible and widely applicable manner. ObjectRunner then follows a lightweight, best-effort approach, leveraging both the input description and the source structure. This process is domain-independent, in the sense that it applies to any relation, either flat or nested, describing real-world items. We advocate via our prototype that fully automatic extraction and integration of structured data can be done fast and effectively when the redundancy of the Web meets knowledge about the data to be extracted. We present the technical details and the overall platform through several application scenarios on real-life Web sources.
Enabling Global Price Comparison through Semantic Integration of Web Data
2008
Cited by 1 (1 self)
“Sell Globally” and “Shop Globally” have been seen as potential benefits of web-enabled electronic business. One important step toward realizing these benefits is to know how things are selling in various parts of the world. A global price comparison service would address this need, but there have not been many such services. In this paper, we use a case study of global price dispersion to illustrate the need for and the value of a global price comparison service. Then we identify and discuss several technology challenges, including semantic heterogeneity, in providing a global price comparison service. We propose a mediation architecture to address the semantic heterogeneity problem, and demonstrate the feasibility of the proposed architecture by implementing a prototype that enables global price comparison using data from

