Results 1 - 10
of
56
Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction
, 2003
"... Information extraction is a form of shallow text processing that locates a specified set of relevant items in a natural-language document. Systems for this task require significant domain-specific knowledge and are time-consuming and difficult to build by hand, making them a good application for ..."
Abstract
-
Cited by 277 (16 self)
- Add to MetaCart
Information extraction is a form of shallow text processing that locates a specified set of relevant items in a natural-language document. Systems for this task require significant domain-specific knowledge and are time-consuming and difficult to build by hand, making them a good application for machine learning. We present an algorithm, RAPIER, that uses pairs of sample documents and filled templates to induce pattern-match rules that directly extract fillers for the slots in the template. RAPIER is a bottom-up learning algorithm that incorporates techniques from several inductive logic programming systems. We have implemented the algorithm in a system that allows patterns to have constraints on the words, part-of-speech tags, and semantic classes present in the filler and the surrounding text. We present encouraging experimental results on two domains.
Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping
- PROCEEDINGS OF THE SIXTEENTH NATIONAL CONFERENCE ON ARTICIAL INTELLIGENCE (AAAI-99)
, 1999
"... Information extraction systems usually require two dictionaries: a semantic lexicon containing domainspecific phrases and a dictionary of extraction patterns for the domain. We present a multi-level bootstrapping algorithm for building both the semantic lexicon and extraction patterns simultaneously ..."
Abstract
-
Cited by 236 (13 self)
- Add to MetaCart
Information extraction systems usually require two dictionaries: a semantic lexicon containing domainspecific phrases and a dictionary of extraction patterns for the domain. We present a multi-level bootstrapping algorithm for building both the semantic lexicon and extraction patterns simultaneously. As input, our technique requires only unannotated training texts and a handful of "seed words" for a category. We use a "mutual bootstrapping" technique to alternately select the best extraction pattern for the category and bootstrap its extractions into the semantic lexicon, which is the basis for selecting the next extraction pattern. To make this approach more robust, we add a second level of bootstrapping ("meta-bootstrapping") that retains only the most reliable lexicon entries produced by mutual bootstrapping and then restarts the process. We evaluated this multi-level bootstrapping technique on a collection of corporate web pages and a corpus of terrorism news articles. The algorithm produced high-quality dictionaries for several semantic categories.
Constructing Biological Knowledge Bases by Extracting Information from Text Sources
, 1999
"... Recently, there has been much effort in making databases for molecular biology more accessible and interoperable. However, information in text form, such as MEDLINE records, remains a greatly underutilized source of biological information. We have begun a research effort aimed at automatically mappi ..."
Abstract
-
Cited by 151 (0 self)
- Add to MetaCart
Recently, there has been much effort in making databases for molecular biology more accessible and interoperable. However, information in text form, such as MEDLINE records, remains a greatly underutilized source of biological information. We have begun a research effort aimed at automatically mapping information from text sources into structured representations, such as knowledge bases. Our approach to this task is to use machine-learning methods to induce routines for extracting facts from text. We describe two learning methods that we have applied to this task --- a statistical text classification method, and a relational learning method --- and our initial experiments in learning such information-extraction routines. We also present an approach to decreasing the cost of learning information-extraction routines by learning from "weakly" labeled training data.
Boosted Wrapper Induction
, 2000
"... Recent work in machine learning for information extraction has focused on two distinct sub-problems: the conventional problem of filling template slots from natural language text, and the problem of wrapper induction, learning simple extraction procedures ("wrappers") for highly structured text such ..."
Abstract
-
Cited by 105 (7 self)
- Add to MetaCart
Recent work in machine learning for information extraction has focused on two distinct sub-problems: the conventional problem of filling template slots from natural language text, and the problem of wrapper induction, learning simple extraction procedures ("wrappers") for highly structured text such as Web pages produced by CGI scripts. For suitably regular domains, existing wrapper induction algorithms can efficiently learn wrappers that are simple and highly accurate, but the regularity bias of these algorithms makes them unsuitable for most conventional information extraction tasks. Boosting is a technique for improving the performance of a simple machine learning algorithm by repeatedly applying it to the training set with different example weightings. We describe an algorithm that learns simple, low-coverage wrapper-like extraction patterns, which we then apply to conventional information extraction problems using boosting. The result is BWI, a trainable information extraction system with a strong precision bias and F1 performance better than state-of-the-art techniques in many domains.
Information Extraction with HMM Structures Learned by Stochastic Optimization
- In Proceedings of the Seventeenth National Conference on Artificial Intelligence
, 2000
"... Recent research has demonstrated the strong performance of hidden Markov models applied to information extraction -- the task of populating database slots with corresponding phrases from text documents. A remaining problem, however, is the selection of state-transition structure for the model. This ..."
Abstract
-
Cited by 89 (2 self)
- Add to MetaCart
Recent research has demonstrated the strong performance of hidden Markov models applied to information extraction -- the task of populating database slots with corresponding phrases from text documents. A remaining problem, however, is the selection of state-transition structure for the model. This paper demonstrates that extraction accuracy strongly depends on the selection of structure, and presents an algorithm for automatically finding good structures by stochastic optimization. Our algorithm begins with a simple model and then performs hill-climbing in the space of possible structures by splitting states and gauging performance on a validation set. Experimental results show that this technique finds HMM models that almost always out-perform a fixed model, and have superior average performance across tasks.
Adaptive Information Extraction: Core Technologies For Information Agents
, 2003
"... This paper gives a state of the art overview about machine learning approaches for information extraction from documents based on finite state techniques and relational learning methods related to inductive logic programming. ..."
Abstract
-
Cited by 50 (2 self)
- Add to MetaCart
This paper gives a state of the art overview about machine learning approaches for information extraction from documents based on finite state techniques and relational learning methods related to inductive logic programming.
User-system cooperation in document annotation based on information extraction
- In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management, EKAW02
, 2002
"... Abstract. The process of document annotation for the Semantic Web is complex and time consuming, as it requires a great deal of manual annotation. Information extraction from texts (IE) is a technology used by some very recent systems for reducing the burden of annotation. The integration of IE syst ..."
Abstract
-
Cited by 49 (13 self)
- Add to MetaCart
Abstract. The process of document annotation for the Semantic Web is complex and time consuming, as it requires a great deal of manual annotation. Information extraction from texts (IE) is a technology used by some very recent systems for reducing the burden of annotation. The integration of IE systems in annotation tools is quite a new development and there is still the necessity of thinking the impact of the IE system on the whole annotation process. In this paper we initially discuss a number of requirements for the use of IE as support for annotation. Then we present and discuss a model of interaction that addresses such issues and Melita, an annotation framework that implements a methodology for active annotation for the Semantic Web based on IE. Finally we present an experiment that quantifies the gain in using IE as support to human annotators. 1.
A Mutually Beneficial Integration of Data Mining and Information Extraction
- In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000
, 2000
"... Text mining concerns applying data mining techniques to unstructured text. Information extraction (IE) is a form of shallow text understanding that locates specific pieces of data in natural language documents, transforming unstructured text into a structured database. This paper describes a sys ..."
Abstract
-
Cited by 45 (6 self)
- Add to MetaCart
Text mining concerns applying data mining techniques to unstructured text. Information extraction (IE) is a form of shallow text understanding that locates specific pieces of data in natural language documents, transforming unstructured text into a structured database. This paper describes a system called DISCOTEX, that combines IE and data mining methodologies to perform text mining as well as improve the performance of the underlying extraction system. Rules mined from a database extracted from a corpus of texts are used to predict additional information to extract from future documents, thereby improving the recall of IE. Encouraging results are presented on applying these techniques to a corpus of computer job announcement postings from an Internet newsgroup.
A maximum entropy approach to information extraction from semi-structured and free text
- In Proceedings of the Eighteenth National Conference on Artificial Intelligence
, 2002
"... In this paper, we present a classification-based approach towards single-slot as well as multi-slot information extraction (IE). For single-slot IE, we worked on the domain of Seminar Announcements, where each document contains information on only one seminar. For multi-slot IE, we worked on the dom ..."
Abstract
-
Cited by 35 (0 self)
- Add to MetaCart
In this paper, we present a classification-based approach towards single-slot as well as multi-slot information extraction (IE). For single-slot IE, we worked on the domain of Seminar Announcements, where each document contains information on only one seminar. For multi-slot IE, we worked on the domain of Management Succession. For this domain, we restrict ourselves to extracting information sentence by sentence, in the same way as (Soderland 1999). Each sentence can contain information on several management succession events. By using a classification approach based on a maximum entropy framework, our system achieves higher accuracy than the best previously published results in both domains.
Bootstrapping for Text Learning Tasks
- IN IJCAI-99 WORKSHOP ON TEXT MINING: FOUNDATIONS, TECHNIQUES AND APPLICATIONS
, 1999
"... When applying text learning algorithms to complex tasks, it is tedious and expensive to hand-label the large amounts of training data necessary for good performance. This paper presents bootstrapping as an alternative approach to learning from large sets of labeled data. Instead of a large qua ..."
Abstract
-
Cited by 32 (1 self)
- Add to MetaCart
When applying text learning algorithms to complex tasks, it is tedious and expensive to hand-label the large amounts of training data necessary for good performance. This paper presents bootstrapping as an alternative approach to learning from large sets of labeled data. Instead of a large quantity of labeled data, this paper advocates using a small amount of seed information and a large collection of easily-obtained unlabeled data. Bootstrapping initializes a learner with the seed information; it then iterates, applying the learner to calculate labels for the unlabeled data, and incorporating some of these labels into the training input for the learner. Two case studies of this approach are presented. Bootstrapping for information extraction provides 76% precision for a 250-word dictionary for extracting locations from web pages, when starting with just a few seed locations. Bootstrapping a text classifier from a few keywords per class and a class hierarchy provides accuracy of 66%, a level close to human agreement, when placing computer science research papers into a topic hierarchy. The success of these two examples argues for the strength of the general bootstrapping approach for text learning tasks. 1

