Results 1 - 10 of 55
Learning Information Extraction Rules for Semi-structured and Free Text
- Machine Learning, 1999
- Cited by 296 (9 self)
A wealth of on-line text information can be made available to automatic processing by information extraction (IE) systems. Each IE application needs a separate set of rules tuned to the domain and writing style. WHISK helps to overcome this knowledge-engineering bottleneck by learning text extraction rules automatically. WHISK is designed to handle text styles ranging from highly structured to free text, including text that is neither rigidly formatted nor composed of grammatical sentences. Such semistructured text has largely been beyond the scope of previous systems. When used in conjunction with a syntactic analyzer and semantic tagging, WHISK can also handle extraction from free text such as news stories.
Keywords: natural language processing, information extraction, rule learning
1. Information extraction
As more and more text becomes available on-line, there is a growing need for systems that extract information automatically from text data. An information extraction (IE) sys...
Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction
- 2003
- Cited by 277 (16 self)
Information extraction is a form of shallow text processing that locates a specified set of relevant items in a natural-language document. Systems for this task require significant domain-specific knowledge and are time-consuming and difficult to build by hand, making them a good application for machine learning. We present an algorithm, RAPIER, that uses pairs of sample documents and filled templates to induce pattern-match rules that directly extract fillers for the slots in the template. RAPIER is a bottom-up learning algorithm that incorporates techniques from several inductive logic programming systems. We have implemented the algorithm in a system that allows patterns to have constraints on the words, part-of-speech tags, and semantic classes present in the filler and the surrounding text. We present encouraging experimental results on two domains.
Automatically Generating Extraction Patterns from Untagged Text
- In Proceedings of the Thirteenth National Conference on Artificial Intelligence, 1996
- Cited by 244 (22 self)
Many corpus-based natural language processing systems rely on text corpora that have been manually annotated with syntactic or semantic tags. In particular, all previous dictionary construction systems for information extraction have used an annotated training corpus or some form of annotated input. We have developed a system called AutoSlog-TS that creates dictionaries of extraction patterns using only untagged text. AutoSlog-TS is based on the AutoSlog system, which generated extraction patterns using annotated text and a set of heuristic rules. By adapting AutoSlog and combining it with statistical techniques, we eliminated its dependency on tagged text. In experiments with the MUC-4 terrorism domain, AutoSlog-TS created a dictionary of extraction patterns that performed comparably to a dictionary created by AutoSlog, using only preclassified texts as input.
Motivation
The vast amount of text becoming available on-line offers new possibilities for conquering the knowledge-engineerin...
Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping
- PROCEEDINGS OF THE SIXTEENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-99), 1999
- Cited by 236 (13 self)
Information extraction systems usually require two dictionaries: a semantic lexicon containing domain-specific phrases and a dictionary of extraction patterns for the domain. We present a multi-level bootstrapping algorithm for building both the semantic lexicon and extraction patterns simultaneously. As input, our technique requires only unannotated training texts and a handful of "seed words" for a category. We use a "mutual bootstrapping" technique to alternately select the best extraction pattern for the category and bootstrap its extractions into the semantic lexicon, which is the basis for selecting the next extraction pattern. To make this approach more robust, we add a second level of bootstrapping ("meta-bootstrapping") that retains only the most reliable lexicon entries produced by mutual bootstrapping and then restarts the process. We evaluated this multi-level bootstrapping technique on a collection of corporate web pages and a corpus of terrorism news articles. The algorithm produced high-quality dictionaries for several semantic categories.
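The mutual bootstrapping loop described in this abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the function name `mutual_bootstrap`, the example patterns, and in particular the scoring formula are assumptions made here for demonstration, and the meta-bootstrapping level is omitted.

```python
import math

def mutual_bootstrap(pattern_extractions, seed_words, iterations=3):
    """Toy mutual bootstrapping loop (hypothetical helper).

    pattern_extractions maps a pattern name to the set of phrases that
    pattern extracts from the unannotated training texts.
    """
    lexicon = set(seed_words)
    chosen = []
    for _ in range(iterations):
        best, best_score = None, 0.0
        for pattern in sorted(pattern_extractions):
            if pattern in chosen:
                continue
            extracted = pattern_extractions[pattern]
            hits = len(extracted & lexicon)
            if hits == 0:
                continue
            # Assumed score: fraction of extractions already in the lexicon,
            # weighted by the log of the number of lexicon hits.
            score = (hits / len(extracted)) * math.log2(hits + 1)
            if score > best_score:
                best, best_score = pattern, score
        if best is None:
            break  # no remaining pattern overlaps the lexicon
        chosen.append(best)
        # Bootstrap the best pattern's extractions into the lexicon.
        lexicon |= pattern_extractions[best]
    return chosen, lexicon

# Invented example data for a "location" category.
patterns = {
    "offices in <x>": {"london", "paris", "tokyo"},
    "based in <x>": {"paris", "berlin"},
    "sold <x>": {"shares", "assets"},
}
chosen, lexicon = mutual_bootstrap(patterns, {"london", "paris"})
```

In this toy run the pattern that never overlaps the growing lexicon ("sold <x>") is never selected, which is the behavior the alternating selection is meant to produce.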
Wrapper Induction: Efficiency and Expressiveness
- Artificial Intelligence, 2000
- Cited by 191 (12 self)
The Internet presents numerous sources of useful information---telephone directories, product catalogs, stock quotes, event listings, etc. Recently, many systems have been built that automatically gather and manipulate such information on a user's behalf. However, these resources are usually formatted for use by people (e.g., the relevant content is embedded in HTML pages), so extracting their content is difficult. Most systems use customized wrapper procedures to perform this extraction task. Unfortunately, writing wrappers is tedious and error-prone. As an alternative, we advocate wrapper induction, a technique for automatically constructing wrappers. In this article, we describe six wrapper classes, and use a combination of empirical and analytical techniques to evaluate the computational tradeoffs among them. We first consider expressiveness: how well the classes can handle actual Internet resources, and the extent to which wrappers in one class can mimic those in another. We then...
Empirical Methods in Information Extraction
- AI Magazine, 1997
- Cited by 92 (7 self)
This article surveys the use of empirical methods for a particular natural language understanding task that is inherently domain-specific. The task is information extraction. Very generally, an information extraction system takes as input an unrestricted text and "summarizes" the text with respect to a prespecified topic or domain of interest: it finds useful information about the domain and encodes that information in a structured form, suitable for populating databases. In contrast to in-depth natural language understanding tasks, information extraction systems effectively skim a text to find relevant sections and then focus only on these sections in subsequent processing. The information extraction system in Figure 1, for example, summarizes stories about natural disasters, extracting for each such event the type of disaster, the date and time that it occurred, and data on any property damage or human injury caused by the event. Infor
Extraction Patterns for Information Extraction Tasks: A Survey
- In AAAI-99 Workshop on Machine Learning for Information Extraction, 1999
- Cited by 91 (0 self)
Information Extraction systems rely on a set of extraction patterns that they use in order to retrieve from each document the relevant information. In this paper we survey the various types of extraction patterns that are generated by machine learning algorithms. We identify three main categories of patterns, which cover a variety of application domains, and we compare and contrast the patterns from each category.
Information Extraction Using Hidden Markov Models
- 1997
- Cited by 76 (0 self)
This thesis shows how to design and tune a hidden Markov model to extract factual information from a corpus of machine-readable English prose. In particular, the thesis presents an HMM that classifies and parses natural language assertions about genes being located at particular positions on chromosomes. The facts extracted by this HMM can be inserted into biological databases. The HMM is trained on a small set of sentence fragments chosen from the collected scientific abstracts in the OMIM (On-Line Mendelian Inheritance in Man) database and judged to contain the target binary relationship between gene names and gene locations. Given a novel sentence, all contiguous fragments are ranked by log-odds score, i.e. the log of the ratio of the probability of the fragment according to the target HMM to that according to a "null" HMM trained on all OMIM sentences. The most probable path through the HMM gives bindings for the annotations with precision as high as 80%. In contrast with traditional natural language processing methods, this stochastic approach makes no use either of part-of-speech taggers or dictionaries, instead employing non-emitting states to assemble modules roughly corresponding to noun, verb, and prepositional phrases. Algorithms for reestimating parameters for HMMs with non-emitting states are presented in detail. The ability to tolerate new words and recognize a wide variety of syntactic forms arises from the judicious use of "gap" states.
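The log-odds ranking of contiguous fragments mentioned in this abstract can be illustrated with a minimal sketch. To keep it short, simple unigram word models stand in for the target and "null" HMMs, which is a substitution made here and not the thesis's method; the function names, the probability floor, and all probabilities are invented for illustration.

```python
import math

def log_prob(tokens, model, floor=1e-6):
    """Log-probability of a token sequence under a unigram stand-in model.

    Unknown words receive a small floor probability (an assumption).
    """
    return sum(math.log(model.get(t, floor)) for t in tokens)

def rank_fragments(sentence, target, null):
    """Rank every contiguous fragment by log P(frag|target) - log P(frag|null)."""
    tokens = sentence.split()
    scored = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            frag = tokens[i:j]
            score = log_prob(frag, target) - log_prob(frag, null)
            scored.append((score, " ".join(frag)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

# Invented toy models: "target" favors gene-location wording.
target = {"gene": 0.2, "maps": 0.2, "to": 0.2, "chromosome": 0.2, "the": 0.05}
null = {"gene": 0.05, "maps": 0.02, "to": 0.1, "chromosome": 0.02,
        "the": 0.2, "we": 0.1, "report": 0.05, "that": 0.1}

ranked = rank_fragments("we report that the gene maps to chromosome", target, null)
best_score, best_fragment = ranked[0]
```

With these toy numbers the top-ranked fragment is the span whose words are all more likely under the target model than under the null model; words that are more typical of general text lower a fragment's score.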
Relational Learning Techniques for Natural Language Information Extraction
- 1998
- Cited by 73 (4 self)
The recent growth of online information available in the form of natural language documents creates a greater need for computing systems with the ability to process those documents to simplify access to the information. One type of processing appropriate for many tasks is information extraction, a type of text skimming that retrieves specific types of information from text. Although information extraction systems have existed for two decades, these systems have generally been built by hand and contain domain-specific information, making them difficult to port to other domains. A few researchers have begun to apply machine learning to information extraction tasks, but most of this work has involved applying learning to pieces of a much larger system. This paper presents a novel rule representation specific to natural language and a learning system, Rapier, which learns information extraction rules. Rapier takes pairs of documents and filled templates indicating the information to be ext...
A Survey of Web Information Extraction Systems
- IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2006
- Cited by 57 (2 self)
The Internet presents a huge amount of useful information which is usually formatted for its users, which makes it difficult to extract relevant data from various sources. Therefore, the availability of robust, flexible Information Extraction (IE) systems that transform the Web pages into program-friendly structures such as a relational database will become a great necessity. Although many approaches for data extraction from Web pages have been developed, there has been limited effort to compare such tools. Unfortunately, in only a few cases can the results generated by distinct tools be directly compared since the addressed extraction tasks are different. This paper surveys the major Web data extraction approaches and compares them in three dimensions: the task domain, the techniques used, and the automation degree. The criteria of the first dimension explain why an IE system fails to handle some Web sites of particular structures. The criteria of the second dimension classify IE systems based on the techniques used. The criteria of the third dimension measure the degree of automation for IE systems. We believe these criteria provide qualitative measures to evaluate various IE approaches.

