XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources
- In ICDE
, 2000
Cited by 130 (7 self)
The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats is not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, web users or applications need a smart way of extracting data from these web sources. One of the popular approaches is to write wrappers around the sources, either manually or with software assistance, to bring the web data within the reach of more sophisticated query tools and general mediator-based information integration systems. In this paper, we describe the methodology and the software development of an XML-enabled wrapper construction system, XWRAP, for semi-automatic generation of wrapper programs. By XML-enabled we mean that the metadata about information content that are implicit in the original web pages will be extracted and encoded explicitly as XML tags in the wrapped documents. In addition, the query-based content filtering process is performed against the XML documents. The XWRAP wrapper generation framework has three distinct features. First, it explicitly separates...
A Fully Automated Object Extraction System for the World Wide Web
, 2001
Cited by 61 (12 self)
This paper presents a fully automated object extraction system, Omini. A distinct feature of Omini is the suite of algorithms and the automatically learned information extraction rules for discovering and extracting objects from dynamic Web pages or static Web pages that contain multiple object instances. We evaluated the system using more than 2,000 Web pages over 40 sites. It achieves 100% precision (returns only correct objects) and excellent recall (between 93% and 98%, with very few significant objects left out). The object boundary identification algorithms are fast, about 0.1 second per page with a simple optimization.
AI for the Web - Ontology-based Community Web Portals
, 2000
Cited by 21 (1 self)
Community web portals serve as portals for the information needs of particular communities on the web. We here discuss how a comprehensive, ontology-based approach for building and maintaining a high-value community web portal has been conceived and implemented. The ontology serves as a semantic backbone for accessing knowledge on the portal, for contributing information, as well as for developing and maintaining the portal. In particular, the ontology allows for flexible querying and inferencing of knowledge. Actual usage of our technology is facilitated through a set of tools that are about to turn our research system into a portal for wide-spread usage right now. The development of these tools has greatly benefited from some first experiences we had with actual users of the community web portal of the knowledge acquisition community. 1 Introduction One of the major strengths of the World Wide Web is that virtually everyone who owns a computer may contribute high-valu...
The Web as a Resource for Question Answering: Perspectives and Challenges
- In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC-2002)
, 2002
Cited by 19 (5 self)
The vast amounts of information readily available on the World Wide Web can be effectively used for question answering in two fundamentally different ways. In the federated approach, techniques for handling semistructured data are applied to access Web sources as if they were databases, allowing large classes of common questions to be answered uniformly. In the distributed approach, large-scale text-processing techniques are used to extract answers directly from unstructured Web documents. Because the Web is orders of magnitude larger than any human-collected corpus, question answering systems can capitalize on its unparalleled levels of data redundancy. Analysis of real-world user questions reveals that the federated and distributed approaches complement each other nicely, suggesting a hybrid approach in future question answering systems.
Extracting Web data using instance-based learning
- In WISE-05
, 2005
Cited by 12 (1 self)
This paper studies structured data extraction from Web pages, e.g., online product description pages. Existing approaches to data extraction include wrapper induction and automatic methods. In this paper, we propose an instance-based learning method, which performs extraction by comparing each new instance (or page) to be extracted with labeled instances (or pages). The key advantage of our method is that it does not need an initial set of labeled pages to learn extraction rules as in wrapper induction. Instead, the algorithm is able to start extraction from a single labeled instance (or page). Only when a new page cannot be extracted does the page need labeling. This avoids unnecessary page labeling, which solves a major problem with inductive learning (or wrapper induction), i.e., the set of labeled pages may not be representative of all other pages. The instance-based approach is very natural because structured data on the Web usually follow some fixed templates and pages of the same template usually can be extracted using a single page instance of the template. The key issue is the similarity or distance measure. Traditional measures based on the Euclidean distance or text similarity are not easily applicable in this context because items to be extracted from different pages can be entirely different. This paper proposes a novel similarity measure for the purpose, which is suitable for templated Web pages. Experimental results with product data extraction from 1200 pages in 24 diverse Web sites show that the approach is surprisingly effective. It outperforms the state-of-the-art existing systems significantly.
Wiccap Data Model: Mapping Physical Websites to Logical Views
- In Proceedings of the 21st International Conference on Conceptual Modelling (ER2002), Tampere
, 2002
Cited by 7 (2 self)
Information sources over the WWW contain a large amount of data organized according to different interests and values. Thus, it is important that facilities are there to enable users to extract information of interest in a simple and effective manner. To do this, information from the Web sources needs to be extracted automatically according to users' interests. However, the extraction of information requires in-depth knowledge of relevant technologies and the extraction process is slow, tedious and difficult for ordinary users. We propose the Wiccap Data Model, an XML data model that maps Web information sources into commonly perceived logical models. Based on this data model, ordinary users are able to extract information easily and efficiently. To accelerate the creation of data models, we also define a formal process for creating such data models and have implemented a software tool to facilitate and automate the process of producing Wiccap Data Models.
Generating Wrappers for Command Line Programs: The Cal-Aggie Wrap-O-Matic Project
- In Proceedings of the 23rd International Conference on Software Engineering
, 2001
Cited by 6 (0 self)
Software developers writing new software have strong incentives to make their products compliant with standards such as CORBA, COM, and JavaBeans. Standards-compliance facilitates inter-operability, component-based software assembly, and software reuse, thus leading to improved quality and productivity. Legacy software, on the other hand, is usually monolithic, and hard to maintain and adapt. Many organizations, saddled with entrenched legacy software, are confronted with the need to integrate legacy assets into more modern, distributed, componentized systems that provide critical business services. Thus wrapping legacy systems for inter-operability has been an area of considerable interest. Wrappers are usually constructed by hand, which can be costly and error-prone. In this paper, we specifically target command-line oriented legacy systems, and describe a tool framework that automates away some of the drudgery of constructing wrappers for these systems. We describe the Cal-Aggie Wrap-O-...
Towards more personalized Web: Extraction and Integration of Dynamic Content from the Web
- In Proceedings of the 8th Asia Pacific Web Conference, APWeb 2006
, 2006
Semi-Structured Data Extraction from Heterogeneous Sources
- 2nd International Workshop on Innovative Internet Information Systems (IIIS'99), in conjunction with the European Conference on Information Systems (ECIS'99)
, 1999
Cited by 4 (1 self)
This paper concerns the extraction of semi-structured data from Web pages generated from multiple on-line services. This task is addressed by representing the schemas for semi-structured data and crafting generic wrappers based on the schemas. We introduce a hybrid representation method for schemas of semi-structured data, consisting of a concept hierarchy and a set of knowledge unit frames. A content-based and structure-bounded information extraction algorithm is developed to build the generic wrapper, which utilizes the schemas and takes advantage of the semi-structured page layouts. The main advantages of the system are that a single wrapper can be applied to multiple Web sites, and the wrapper can handle resources with missing data and data presented in free text, which cannot be wrapped by existing techniques. The hybrid representation has been used for writing schemas for seven domains. Experiments in two domains, on-line real estate advertisements and car advertisement...
Question answering techniques for the world wide web
- In Tutorial presentation at EACL
, 2003
Cited by 4 (0 self)
Question answering systems have become increasingly popular because they deliver users short, succinct answers instead of overloading them with a large number of irrelevant documents. The vast amount of information readily available on the World Wide Web presents new opportunities and challenges for question answering. In order for question answering systems to benefit from this vast store of useful knowledge, they must cope with large volumes of useless data. Many characteristics of the World Wide Web distinguish Web-based question answering from question answering on closed corpora such as newspaper texts. The Web is vastly larger in size and boasts incredible “data redundancy,” which renders it amenable to statistical techniques for answer extraction. A data-driven approach can yield high levels of performance and nicely complements traditional question answering techniques driven by information extraction. In addition to enormous amounts of unstructured text, the Web also contains pockets of structured and semistructured knowledge that can serve as a valuable resource for question answering. By organizing these resources and annotating them with natural language, we can successfully incorporate Web knowledge into question answering systems. This tutorial surveys recent Web-based question answering technology, focusing on two separate paradigms: knowledge mining using statistical tools and knowledge annotation using database concepts. Both approaches can employ a wide spectrum of techniques ranging in linguistic sophistication from simple “bag-of-words” treatments to full syntactic parsing.

