Wrapper Induction: Efficiency and Expressiveness
- Artificial Intelligence, 2000
Cited by 191 (12 self)
The Internet presents numerous sources of useful information---telephone directories, product catalogs, stock quotes, event listings, etc. Recently, many systems have been built that automatically gather and manipulate such information on a user's behalf. However, these resources are usually formatted for use by people (e.g., the relevant content is embedded in HTML pages), so extracting their content is difficult. Most systems use customized wrapper procedures to perform this extraction task. Unfortunately, writing wrappers is tedious and error-prone. As an alternative, we advocate wrapper induction, a technique for automatically constructing wrappers. In this article, we describe six wrapper classes, and use a combination of empirical and analytical techniques to evaluate the computational tradeoffs among them. We first consider expressiveness: how well the classes can handle actual Internet resources, and the extent to which wrappers in one class can mimic those in another. We then...
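As a rough illustration of what "automatically constructing wrappers" can mean, the sketch below learns one pair of delimiters from labeled example pages and then uses it to extract values from a new page. The abstract does not spell out the six wrapper classes, so the left/right delimiter scheme and the names `induce_delimiters` and `apply_wrapper` are assumptions made for this example, not the article's definitions.

```python
from os.path import commonprefix  # works character-wise on any list of strings


def common_suffix(strings):
    """Longest string that every element of `strings` ends with."""
    return commonprefix([s[::-1] for s in strings])[::-1]


def induce_delimiters(examples):
    """Learn a (left, right) delimiter pair from labeled examples.

    Each example is (page, start, end), marking one field value as
    page[start:end]. The left delimiter is the longest common suffix of
    the text before each value; the right delimiter is the longest common
    prefix of the text after it. (Hypothetical scheme for illustration.)
    """
    lefts = [page[:start] for page, start, end in examples]
    rights = [page[end:] for page, start, end in examples]
    return common_suffix(lefts), commonprefix(rights)


def apply_wrapper(page, left, right):
    """Return every substring of `page` found between `left` and `right`."""
    if not left or not right:
        raise ValueError("both delimiters must be non-empty")
    values, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end])
        pos = end + len(right)
    return values


# Two labeled training pages, each marking the country name.
p1 = "<li><b>Egypt</b> <i>20</i></li>"
p2 = "<li><b>Spain</b> <i>34</i></li>"
examples = [
    (p1, p1.index("Egypt"), p1.index("Egypt") + len("Egypt")),
    (p2, p2.index("Spain"), p2.index("Spain") + len("Spain")),
]
left, right = induce_delimiters(examples)

# Apply the learned wrapper to an unseen page with two records.
new_page = "<li><b>Congo</b> <i>242</i></li><li><b>Belize</b> <i>501</i></li>"
extracted = apply_wrapper(new_page, left, right)
```

The learned delimiters are only as general as the training pages allow: if every example value happened to be followed by the same character, that character would be absorbed into the right delimiter and the wrapper would fail on pages where it differs.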
Constructing Biological Knowledge Bases by Extracting Information from Text Sources
- 1999
Cited by 151 (0 self)
Recently, there has been much effort in making databases for molecular biology more accessible and interoperable. However, information in text form, such as MEDLINE records, remains a greatly underutilized source of biological information. We have begun a research effort aimed at automatically mapping information from text sources into structured representations, such as knowledge bases. Our approach to this task is to use machine-learning methods to induce routines for extracting facts from text. We describe two learning methods that we have applied to this task --- a statistical text classification method, and a relational learning method --- and our initial experiments in learning such information-extraction routines. We also present an approach to decreasing the cost of learning information-extraction routines by learning from "weakly" labeled training data.
Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages
- Data & Knowledge Engineering, 1999
Cited by 101 (43 self)
Electronically available data on the Web is exploding at an ever increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document's content. For these kinds of data-rich, multiple-record documents (e.g. advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, obituaries, and many others) we can apply a conceptual-modeling approach to extract and structure data automatically. The approach is based on an ontology---a conceptual model instance---that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords, and then invoke routines to recognize and extract data from unstructured documents an...
Multistrategy Learning for Information Extraction
- In Proceedings of the Fifteenth International Conference on Machine Learning, 1998
Cited by 76 (2 self)
Information extraction (IE) is the problem of filling out pre-defined structured summaries from text documents. We are interested in performing IE in non-traditional domains, where much of the text is often ungrammatical, such as electronic bulletin board posts and Web pages. We suggest that the best approach is one that takes into account many different kinds of information, and argue for the suitability of a multistrategy approach. We describe learners for IE drawn from three separate machine learning paradigms: rote memorization, term-space text classification, and relational rule induction. By building regression models mapping from learner confidence to probability of correctness and combining probabilities appropriately, it is possible to improve extraction accuracy over that achieved by any individual learner. We describe three different multistrategy approaches. Experiments on two IE domains, a collection of electronic seminar announcements from a university computer science de...
Semantic Matching: Formal Ontological Distinctions for Information Organization, Extraction, and Integration
- INFORMATION TECHNOLOGY, INTERNATIONAL SUMMER SCHOOL, SCIE-97, 1997
Cited by 74 (2 self)
The task of information extraction can be seen as a problem of semantic matching between a user-defined template and a piece of information written in natural language. To this end, the ontological assumptions of the template need to be suitably specified and compared with the ontological implications of the text. So-called "ontologies", consisting of theories of various kinds expressing the meaning of shared vocabularies, are beginning to be used for this task. This paper addresses the theoretical issues related to the design and use of such ontologies for purposes of information retrieval and extraction. After a discussion of the nature of semantic matching within a model-theoretical framework, we introduce the subject of Formal Ontology, showing how the notions of parthood, integrity, identity, and dependence can help in understanding, organizing and formalizing fundamental ontological distinctions. We then present some basic principles for ontology design, and we illustrate a preliminary proposal for a top-level ontology developed according to such principles. As a concrete example of ontology-based information retrieval, we finally report ongoing experience with using a large linguistic ontology for the retrieval of object-oriented software components.
An Information Extraction Core System for Real World German Text Processing
- In 5th International Conference of Applied Natural Language, 1997
Cited by 73 (17 self)
This paper describes SMES, an information extraction core system for real world German text processing. The basic design criterion of the system is to provide a set of powerful, robust, and efficient basic natural language components and generic linguistic knowledge sources which can easily be customized for processing different tasks in a flexible manner.
Two applications of information extraction to biological science journal articles: Enzyme interactions and protein structures
- 2000
Cited by 73 (3 self)
Information extraction technology, as defined and developed through the U.S. DARPA Message Understanding Conferences (MUCs), has proved successful at extracting information primarily from newswire texts and primarily in domains concerned with human activity. In this paper we consider the application of this technology to the extraction of information from scientific journal papers in the area of molecular biology. In particular, we describe how an information extraction system designed to participate in the MUC exercises has been modified for two bioinformatics applications: EMPathIE, concerned with enzyme and metabolic pathways; and PASTA, concerned with protein structure. Progress to date provides convincing grounds for believing that IE techniques will deliver novel and effective ways for scientists to make use of the core literature which defines their disciplines.
Mining the Biomedical Literature in the Genomic Era: An Overview
- JOURNAL OF COMPUTATIONAL BIOLOGY, 2003
Cited by 72 (2 self)
The past decade has seen tremendous growth in the amount of experimental and computational biomedical data, specifically in the areas of Genomics and Proteomics. This growth is accompanied by an accelerated increase in the number of biomedical publications discussing the findings. In the last few years there has been a lot of interest within the scientific community in literature-mining tools to help sort through this abundance of literature and find the nuggets of information most relevant and useful for specific analysis tasks. This paper
A Logic-Based Theory of Deductive Arguments
- 2001
Cited by 69 (16 self)
We explore a framework for argumentation (based on classical logic) in which an argument is a pair where the first item in the pair is a minimal consistent set of formulae that proves the second item (which is a formula). We provide some basic definitions for arguments, and various kinds of counter-arguments (defeaters). This leads us to the definition of canonical undercuts which we argue are the only defeaters that we need to take into account. We then motivate and formalise the notion of argument trees and argument structures which provide a way of exhaustively collating arguments and counter-arguments. We use argument structures as the basis of our general proposal for argument aggregation.
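The definition of an argument given in this abstract (a minimal consistent set of formulae that proves a claim) can be checked mechanically in small cases. The sketch below is a brute-force illustration only: it assumes propositional logic, represents each formula as a Python function over a truth assignment, and uses hypothetical helper names (`consistent`, `entails`, `is_argument`) that do not come from the paper.

```python
from itertools import combinations, product


def models(atoms):
    """Yield every truth assignment over `atoms` as a dict."""
    for values in product([False, True], repeat=len(atoms)):
        yield dict(zip(atoms, values))


def consistent(formulae, atoms):
    """True if at least one assignment satisfies every formula."""
    return any(all(f(m) for f in formulae) for m in models(atoms))


def entails(formulae, claim, atoms):
    """True if the claim holds in every assignment satisfying the formulae."""
    return all(claim(m) for m in models(atoms) if all(f(m) for f in formulae))


def is_argument(support, claim, atoms):
    """Check the three conditions: consistent, proves the claim, minimal."""
    if not consistent(support, atoms):
        return False
    if not entails(support, claim, atoms):
        return False
    # Minimality: no proper subset of the support already proves the claim.
    for size in range(len(support)):
        for subset in combinations(support, size):
            if entails(list(subset), claim, atoms):
                return False
    return True


# Example formulae over the atoms p and q (illustrative only).
atoms = ["p", "q"]
p = lambda m: m["p"]
q = lambda m: m["q"]
not_p = lambda m: not m["p"]
p_implies_q = lambda m: (not m["p"]) or m["q"]

valid = is_argument([p, p_implies_q], q, atoms)          # modus ponens
not_minimal = is_argument([p, p_implies_q, q], q, atoms)  # q alone suffices
inconsistent = is_argument([p, not_p], q, atoms)          # support has no model
```

Enumerating all assignments is exponential in the number of atoms, so this is suitable only for checking small hand-written examples.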

