Robust web extraction: an approach based on a probabilistic tree-edit model
- In SIGMOD
- Cited by 6 (2 self)
On script-generated web sites, many documents share a common HTML tree structure, allowing wrappers to effectively extract information of interest. Of course, the scripts and thus the tree structure evolve over time, causing wrappers to break repeatedly and resulting in a high cost of maintaining wrappers. In this paper, we explore a novel approach: we use temporal snapshots of web pages to develop a tree-edit model of HTML, and use this model to improve wrapper construction. We view the changes to the tree structure as the result of a series of edit operations: deleting nodes, inserting nodes, and substituting labels of nodes. The tree structures evolve by choosing these edit operations stochastically. Our model is attractive in that the probability that a source tree has evolved into a target tree can be estimated efficiently (in time quadratic in the size of the trees), making it a potentially useful tool for a variety of tree-evolution problems. We give an algorithm to learn the probabilistic model from training examples consisting of pairs of trees, and apply this algorithm to collections of web-page snapshots to derive HTML-specific tree-edit models. Finally, we describe a novel wrapper-construction framework that takes the tree-edit model into account, and compare the quality of the resulting wrappers to that of traditional wrappers on synthetic and real HTML document examples.
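The generative view in this abstract, where a tree evolves through stochastic deletions, insertions, and label substitutions, can be illustrated with a small forward simulation. The sketch below is not the paper's model or its learning algorithm. The `Node` class, the function names, the fixed per-node probabilities `p_del`, `p_sub`, `p_ins`, and the label set are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def evolve(node, rng, p_del=0.1, p_sub=0.1, p_ins=0.1,
           labels=("div", "span", "p")):
    """Return the list of nodes that replaces `node` after one edit pass.

    Hypothetical parameters: each node is independently deleted,
    relabelled, or wrapped in a newly inserted parent.
    """
    # Evolve the children first; each child may become zero or more nodes.
    new_children = []
    for child in node.children:
        new_children.extend(evolve(child, rng, p_del, p_sub, p_ins, labels))
    # Deletion: the node disappears and its children move up one level.
    if rng.random() < p_del:
        return new_children
    # Substitution: the node keeps its position but receives a new label.
    label = rng.choice(labels) if rng.random() < p_sub else node.label
    result = Node(label, new_children)
    # Insertion: a new node is placed between this node and its parent.
    if rng.random() < p_ins:
        result = Node(rng.choice(labels), [result])
    return [result]


def evolve_tree(root, rng, **probabilities):
    """Evolve everything below the root; the root itself is kept fixed."""
    children = []
    for child in root.children:
        children.extend(evolve(child, rng, **probabilities))
    return Node(root.label, children)


def size(node):
    """Count the nodes in a tree."""
    return 1 + sum(size(child) for child in node.children)


page = Node("html", [Node("body", [Node("div"), Node("p")])])
snapshot = evolve_tree(page, random.Random(0))
```

Running `evolve_tree` repeatedly on its own output gives a sequence of synthetic "snapshots". A wrapper that addresses a node by a fixed path can then be checked against each snapshot to see when it stops matching.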
Extensible framework of authoring tools for Web document annotation
- Proceedings of International Workshop on Semantic Web Foundations and Application Technologies (SWFAT), 2003
- Cited by 4 (1 self)
Abstract. Web metadata is crucial for providing machine-understandable descriptions of Web resources, and has a number of applications such as discovery, qualification, and adaptation of Web documents. While metadata is often embedded into a target document, metadata can also be associated externally by means of an addressing scheme such as the XPath language. However, creation and modification of external metadata solely with a conventional editor is not easy because metadata authoring involves the maintenance and elaboration of addressing expressions as well as editing individual documents. The objective of this study is to advance extensibility and variations in the configuration of annotation tools, taking account of different authoring methods as well as the different roles of annotations for assertion and transformation. In this paper, we explain a schema for external annotation as a basis for the design of annotation tools. The framework of annotation tools is then introduced, distinguishing tool configurations for annotation by selection and annotation by example. Finally, we present practical applications of external annotations for Web document clipping, and show how annotation tools are used for annotation authoring.
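To make the idea of external annotation concrete, here is a minimal sketch in which metadata lives outside the document and is attached through path expressions, then used for a simple form of clipping. It does not reproduce the paper's annotation schema. The sample document, the `target` and `role` fields, and the `keep`/`remove` values are hypothetical, and the addressing uses only the limited XPath subset supported by Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical target document (well-formed so that ElementTree can parse it).
DOC = """<html><body>
<div id="news"><h1>Headline</h1><p>Story text</p></div>
<div id="ads"><p>Buy now</p></div>
</body></html>"""

# External annotations: each pairs an addressing expression with metadata.
# Field names and role values are assumptions made for this example.
ANNOTATIONS = [
    {"target": "./body/div[@id='news']", "role": "keep"},
    {"target": "./body/div[@id='ads']", "role": "remove"},
]


def resolve(root, annotations):
    """Return (element, role) pairs for annotations whose target matches."""
    matched = []
    for annotation in annotations:
        for element in root.findall(annotation["target"]):
            matched.append((element, annotation["role"]))
    return matched


def clip(root, annotations):
    """Return the text of every element annotated with the role 'keep'."""
    clipped = []
    for element, role in resolve(root, annotations):
        if role == "keep":
            parts = (text.strip() for text in element.itertext())
            clipped.append(" ".join(part for part in parts if part))
    return clipped


root = ET.fromstring(DOC)
print(clip(root, ANNOTATIONS))
```

If the document is later restructured so that a `target` expression no longer matches, `resolve` silently returns nothing for that annotation. This is the maintenance burden on addressing expressions that the abstract describes.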
myPortal: Robust Extraction and Aggregation of Web Content
- Cited by 1 (0 self)
We demonstrate myPortal, an application for web content block extraction and aggregation. The research issues behind the tool are also explained, with an emphasis on robustness of web content extraction.

