Results 1 - 10 of 139
Information extraction: techniques and challenges
- In Information Extraction (International Summer School SCIE-97)
, 1997
Cited by 119 (4 self)
This volume takes a broad view of information extraction as any method for filtering information from large volumes of text. This includes the retrieval of documents from collections and the tagging of particular terms in text. In this paper we shall use a narrower definition: the identification of instances of a particular class of events or relationships in a natural language text, and the extraction of the relevant arguments of the event or relationship. Information extraction therefore involves the creation of a structured representation (such as a data base) of selected information drawn from the text. The idea of reducing the information in a document to a tabular structure is not new. Its feasibility for sublanguage texts was suggested by Zellig Harris in the 1950s, and an early implementation for medical texts was done at New York University by Naomi Sager [20]. However, the specific notion of information extraction described here has received wide currency over the last decade through the series of Message Understanding Conferences [1, 2, 3, 4, 14]. We shall discuss these Conferences in more detail a bit later, and shall use simplified versions of
An Information Extraction Core System for Real World German Text Processing
- In 5th International Conference of Applied Natural Language
, 1997
Cited by 73 (17 self)
This paper describes SMES, an information extraction core system for real world German text processing. The basic design criterion of the system is to provide a set of basic, powerful, robust, and efficient natural language components and generic linguistic knowledge sources which can easily be customized for processing different tasks in a flexible manner.
Large-scale named entity disambiguation based on Wikipedia data
- In Proc. 2007 Joint Conference on EMNLP and CoNLL
, 2007
Cited by 60 (2 self)
This paper presents a large-scale system for the recognition and semantic disambiguation of named entities based on information extracted from a large encyclopedic collection and Web search results. It describes in detail the disambiguation paradigm employed and the information extraction process from Wikipedia. Through a process of maximizing the agreement between the contextual information extracted from Wikipedia and the context of a document, as well as the agreement among the category tags associated with the candidate entities, the implemented system shows high disambiguation accuracy on both news stories and Wikipedia articles.
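As a rough illustration of "maximizing the agreement" between an entity's encyclopedic context and the document context, the sketch below scores each candidate by plain word overlap. This is a simplified stand-in, not the system described in the abstract: the function name `disambiguate`, the candidate dictionary, and the overlap score are all assumptions made for the example, and the category-tag agreement is left out.

```python
def disambiguate(document_text, candidates):
    """Pick the candidate whose context words overlap most with the document.

    `candidates` maps a hypothetical entity name to a set of lowercase
    context words (imagined as harvested from an encyclopedia entry).
    """
    doc_words = set(document_text.lower().split())

    def agreement(entity):
        # Agreement here is just the number of shared words (an assumption).
        return len(doc_words & candidates[entity])

    # On ties, max() returns the first candidate in dictionary order.
    return max(candidates, key=agreement)


# Hypothetical toy data for the surface form "Columbia".
candidates = {
    "Columbia (space shuttle)": {"nasa", "shuttle", "orbit", "mission"},
    "Columbia University": {"university", "new", "york", "campus", "students"},
}
document = "NASA delayed the shuttle mission after Columbia returned from orbit"
best = disambiguate(document, candidates)
print(best)
```

A real system would weight context terms and handle the case where no candidate matches at all; the sketch only shows the shape of the argmax over candidates.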
Automatic Paraphrase Acquisition from News Articles
, 2002
Cited by 59 (3 self)
Paraphrases play an important role in the variety and complexity of natural language documents. However, they add to the difficulty of natural language processing. Here we describe a procedure for obtaining paraphrases from news articles. A set of paraphrases can be useful for various kinds of applications. Articles derived from different newspapers can contain paraphrases if they report the same event of the same day. We exploit this feature by using Named Entity recognition. Our basic approach is based on the assumption that Named Entities are preserved across paraphrases. We applied our method to articles of two domains and obtained notable examples. Although this is our initial attempt to automatically extract paraphrases from a corpus, the results are promising.
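The assumption that Named Entities are preserved across paraphrases can be sketched as a simple pairing step: sentences from two articles on the same event become paraphrase candidates when they share enough entities. The code below is a minimal sketch under stated assumptions, not the authors' procedure: `named_entities` is a toy capitalized-word matcher standing in for a real Named Entity recognizer, and the `min_shared` threshold is hypothetical.

```python
import re
from itertools import product

# Toy stand-in for a Named Entity recognizer (an assumption): any
# capitalized word counts, so sentence-initial words are false positives.
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")


def named_entities(sentence):
    return set(CAPITALIZED.findall(sentence))


def candidate_paraphrases(article_a, article_b, min_shared=2):
    """Pair sentences across two articles that share >= min_shared entities."""
    pairs = []
    for sent_a, sent_b in product(article_a, article_b):
        shared = named_entities(sent_a) & named_entities(sent_b)
        if len(shared) >= min_shared:
            pairs.append((sent_a, sent_b, shared))
    return pairs


# Hypothetical sentences from two newspapers reporting the same event.
article_a = [
    "Acme announced on Monday that it will acquire Globex.",
    "Shares rose sharply.",
]
article_b = [
    "Globex is being bought by Acme, the companies said.",
    "Analysts in Tokyo were surprised.",
]
pairs = candidate_paraphrases(article_a, article_b)
for sent_a, sent_b, shared in pairs:
    print(sorted(shared), "|", sent_a, "<->", sent_b)
```

The pairs found this way are only candidates; deciding which parts of the two sentences actually paraphrase each other would need a further alignment step that is not shown here.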
Mixed-Initiative Development of Language Processing Systems
- David Day, John Aberdeen, Lynette Hirschman,
- In Proceedings of the Fifth ACL Conference on Applied Natural Language Processing
, 1997
Cited by 56 (2 self)
Historically, tailoring language processing systems to specific domains and languages for which they were not originally built has required a great deal of effort. Recent advances in corpus-based manual and automatic training methods have shown promise in reducing the time and cost of this porting process. These developments have focused even greater attention on the bottleneck of acquiring reliable, manually tagged training data. This paper describes a new set of integrated tools, collectively called the Alembic Workbench, that uses a mixed-initiative approach to "bootstrapping" the manual tagging process, with the goal of reducing the overhead associated with corpus development. Initial empirical studies using the Alembic Workbench to annotate "named entities" demonstrate that this approach can approximately double the production rate. As an added benefit, the combined efforts of machine and user produce domain-specific annotation rules that can be used to annotate similar texts automatically through the Alembic NLP system. The ultimate goal of this project is to enable end users to generate a practical domain-specific information extraction system within a single session.
Information Extraction: Beyond Document Retrieval
- Computational Linguistics and Chinese Language Processing
, 1998
Cited by 48 (10 self)
In this paper we give a synoptic view of the growth of the text processing technology of information extraction (IE), whose function is to extract information about a pre-specified set of entities, relations or events from natural language texts and to record this information in structured representations called templates. Here we describe the nature of the IE task, review the history of the area from its origins in AI work in the 1960s and 70s till the present, discuss the techniques being used to carry out the task, describe application areas where IE systems are or are about to be at work, and conclude with a discussion of the challenges facing the area. What emerges is a picture of an exciting new text processing technology with a host of new applications, both on its own and in conjunction with other technologies, such as information retrieval, machine translation and data mining.
A Survey of Named Entity Recognition and Classification
, 2007
Cited by 33 (1 self)
The term “Named Entity”, now widely used in Natural Language Processing, was coined for the Sixth Message Understanding Conference (MUC-6) (R. Grishman & Sundheim 1996). At that time, MUC was focusing on Information Extraction (IE) tasks where structured information of company activities and defense related activities is extracted
Quantitative Evaluation of Coreference Algorithms in an Information Extraction System
- Colloquium. Lancaster University
, 1996
Cited by 28 (12 self)
Algorithms for performing coreference resolution can only be precisely evaluated given a benchmark corpus of coreference-annotated texts, together with techniques for evaluating the algorithms' output against the corpus. Such a corpus and such techniques have become available for the first time as part of the Message Understanding Conference 6 (MUC-6) evaluations of information extraction systems. In this paper we describe the MUC-6 coreference task and the approach taken to it by the Large Scale Information Extraction (LaSIE) system developed at the University of Sheffield. The basic coreference algorithm used by this system is described in detail, as well as a set of variants, which allow us to experiment with different constraints such as restrictions to certain classes of anaphor, distance restrictions between anaphor and antecedent, and weighting factors in assessing semantic similarity of potential coreferents. Quantitative evaluation results are presented for these variants, ...
An Intelligent Text Extraction and Navigation System
, 2000
Cited by 25 (7 self)
We present sppc, a high-performance system for intelligent text extraction and navigation from German free text documents. The main purpose of sppc is to extract as much linguistic structure as possible for performing domain-specific processing. sppc consists of a set of domain-independent shallow core components which are realized by means of cascaded weighted finite state machines and generic dynamic tries. All extracted information is represented uniformly in one data structure (called the text chart) in a highly compact and linked form in order to support indexing and navigation through the set of solutions.
Evaluating a focus-based approach to anaphora resolution
- In Proceedings of COLING-ACL'98
, 1998
Cited by 24 (3 self)
We present an approach to anaphora resolution based on a focusing algorithm, and implemented within an existing MUC (Message Understanding Conference) Information Extraction system, allowing quantitative evaluation against a substantial corpus of annotated real-world texts. Extensions to the basic focusing mechanism can be easily tested, resulting in refinements to the mechanism and resolution rules. Results show that the focusing algorithm is highly sensitive to the quality of syntactic-semantic analyses, when compared to a simpler heuristic-based approach.

