Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence (1999)

by Silviu Cucerzan, David Yarowsky

Results 1 - 10 of 57

A Bootstrapping Method for Learning Semantic Lexicons Using Extraction Pattern Contexts

by Michael Thelen, Ellen Riloff - In Proc. 2002 Conf. Empirical Methods in NLP (EMNLP), 2002
Abstract - Cited by 57 (5 self)
This paper describes a bootstrapping algorithm called Basilisk that learns high-quality semantic lexicons for multiple categories.

Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora

by David Yarowsky, Grace Ngai, Richard Wicentowski, 2000
Abstract - Cited by 34 (1 self)
This paper describes a system and set of algorithms for automatically inducing stand-alone monolingual part-of-speech taggers, base noun-phrase bracketers, named-entity taggers and morphological analyzers for an arbitrary foreign language. Case studies include French, Chinese, Czech and Spanish. Existing text analysis tools for English are applied to bilingual text corpora and their output projected onto the second language via statistically derived word alignments. Simple direct annotation projection is quite noisy, however, even with optimal alignments. Thus this paper presents noise-robust tagger, bracketer and lemmatizer training procedures capable of accurate system bootstrapping from noisy and incomplete initial projections. Performance of the induced stand-alone part-of-speech tagger applied to French achieves 96% core part-of-speech (POS) tag accuracy, and the corresponding induced noun-phrase bracketer exceeds 91% F-measure. The induced morphological analyzer achieves over 99% lemmatization accuracy on the complete French verbal system. This achievement is particularly noteworthy in that it required absolutely no hand-annotated training data in the given language, and virtually no language-specific knowledge or resources beyond raw text. Performance also significantly exceeds that obtained by direct annotation projection.
Keywords: multilingual, text analysis, part-of-speech tagging, noun phrase bracketing, named entity, morphology, lemmatization, parallel corpora
1. TASK OVERVIEW: A fundamental roadblock to developing statistical taggers, bracketers and other analyzers for many of the world's 200 major languages is the shortage or absence of annotated training data for the large majority of these languages. Ideally, one would like to lever- ...
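The "simple direct annotation projection" mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' system: the function name `project_tags`, the alignment format (pairs of source and target word indices), and the toy English-French sentence are all assumptions made for illustration. The sketch only shows the naive projection step, which the abstract itself describes as noisy.

```python
from typing import List, Optional, Tuple


def project_tags(
    source_tags: List[str],
    alignment: List[Tuple[int, int]],
    target_len: int,
) -> List[Optional[str]]:
    """Copy each source-side tag onto the target word it is aligned to.

    `alignment` holds (source_index, target_index) pairs (assumed format).
    Target words with no aligned source word keep the value None.
    """
    projected: List[Optional[str]] = [None] * target_len
    for src_idx, tgt_idx in alignment:
        # Keep the first tag that lands on a target position;
        # later competing tags for the same position are ignored.
        if projected[tgt_idx] is None:
            projected[tgt_idx] = source_tags[src_idx]
    return projected


# Hypothetical toy sentence pair: "the red house" / "la maison rouge".
english_tags = ["DT", "JJ", "NN"]
# the -> la, red -> rouge, house -> maison
toy_alignment = [(0, 0), (1, 2), (2, 1)]
french_tags = project_tags(english_tags, toy_alignment, target_len=3)
print(french_tags)
```

Unaligned target words stay untagged here, which is one source of the incomplete initial projections the abstract refers to.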

A Survey of Named Entity Recognition and Classification

by David Nadeau, Satoshi Sekine, 2007
Abstract - Cited by 33 (1 self)
The term “Named Entity”, now widely used in Natural Language Processing, was coined for the Sixth Message Understanding Conference (MUC-6) (R. Grishman & Sundheim 1996). At that time, MUC was focusing on Information Extraction (IE) tasks where structured information of company activities and defense related activities is extracted

Named entity transliteration and discovery from multilingual comparable corpora

by Alexandre Klementiev, Dan Roth - In: Proc. of NAACL (2006)
Abstract - Cited by 22 (1 self)
Named Entity recognition (NER) is an important part of many natural language processing tasks. Most current approaches employ machine learning techniques and require supervised data. However, many languages lack such resources. This paper presents an algorithm to automatically discover Named Entities (NEs) in a resource free language, given a bilingual corpora in which it is weakly temporally aligned with a resource rich language. We observe that NEs have similar time distributions across such corpora, and that they are often transliterated, and develop an algorithm that exploits both iteratively. The algorithm makes use of a new, frequency based, metric for time distributions and a resource free discriminative approach to transliteration. We evaluate the algorithm on an English-Russian corpus, and show high level of NEs discovery in Russian.

Unsupervised Learning of Generalized Names

by Roman Yangarber, Winston Lin, Ralph Grishman - Proceedings of the 19th International Conference on Computational Linguistics, 2002
Abstract - Cited by 17 (7 self)
We present an algorithm, NOMEN, for learning generalized names in text. Examples of these are names of diseases and infectious agents, such as bacteria and viruses. These names exhibit certain properties that make their identification more complex than that of regular proper names. NOMEN uses a novel form of bootstrapping to grow sets of textual instances and of their contextual patterns. The algorithm makes use of competing evidence to boost the learning of several categories of names simultaneously. We present results of the algorithm on a large corpus. We also investigate the relative merits of several evaluation strategies.

Weakly supervised named entity transliteration and discovery from multilingual comparable corpora

by Alexandre Klementiev, Dan Roth - In Association for Computational Linguistics, 2006
Abstract - Cited by 16 (6 self)
Named Entity recognition (NER) is an important part of many natural language processing tasks. Current approaches often employ machine learning techniques and require supervised data. However, many languages lack such resources. This paper presents an (almost) unsupervised learning algorithm for automatic discovery of Named Entities (NEs) in a resource free language, given a bilingual corpora in which it is weakly temporally aligned with a resource rich language. NEs have similar time distributions across such corpora, and often some of the tokens in a multi-word NE are transliterated. We develop an algorithm that exploits both observations iteratively. The algorithm makes use of a new, frequency based, metric for time distributions and a resource free discriminative approach to transliteration. Seeded with a small number of transliteration pairs, our algorithm discovers multi-word NEs, and takes advantage of a dictionary (if one exists) to account for translated or partially translated NEs. We evaluate the algorithm on an English-Russian corpus, and show high level of NEs discovery in Russian.

Information Extraction

by Ralph Grishman
Abstract - Cited by 14 (0 self)
Information extraction is the automatic identification of selected types of entities, relations, or events in free text. This chapter considers two types of extraction: extraction of names and extraction of events. In each case, an approach to writing extraction rules is presented. In addition, methods are described for learning extraction rules (or statistical models) automatically from text corpora which have been annotated with information about the names or the events they contain.
Information extraction (IE) is the automatic identification of selected types of entities, relations, or events in free text. It covers a wide range of tasks, from finding all the company names in a text, to finding all the murders, including who killed whom, when and where. Such capabilities are increasingly important for sifting through the enormous volumes of on-line text for the specific information which is required. This chapter will look at two of the more intensively studied IE tasks, that of name identification and classification, and that of event capture.

Integrating seed names and n-grams for a named entity list and classifier

by Sabine Buchholz, Antal Van Den Bosch - In Proceedings of the Second International Conference on Language Resources and Evaluation, 2000
Abstract - Cited by 9 (5 self)
We present a method for building a named-entity list and machine-learned named-entity classifier from a corpus of Dutch newspaper text, a rule-based named entity recognizer, and labeled seed name lists taken from the internet. The seed names, labeled either as PERSON, LOCATION, ORGANIZATION, or ADJECTIVAL name, are looked up in an 83-million-word corpus, and their immediate contexts are stored as instances of their label. The latter 8-grams are used by a decision-tree learning algorithm that, after training, (i) can produce high-precision labeling of instances to be added to the seed lists, and (ii) more generally labels new, unseen names. Unlabeled named-entity types are labeled with a precision of 61% and a recall of 56%; aiming at optimizing precision, an overall precision of 83% can be obtained (a top precision of 88% on PERSON). On free text, named-entity token labeling accuracy is 71%.
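The seed-lookup step described in the abstract above (find seed names in a corpus and store their immediate contexts as instances of the seed's label) can be sketched roughly as follows. This is an illustrative assumption, not the authors' implementation: the function name `collect_context_instances`, the single-token seeds, the window size of two, the padding symbol, and the toy Dutch sentence are all hypothetical.

```python
from typing import Dict, List, Tuple

Instance = Tuple[Tuple[str, ...], str]


def collect_context_instances(
    tokens: List[str],
    seeds: Dict[str, str],
    window: int = 2,
    pad: str = "_",
) -> List[Instance]:
    """Return (context, label) pairs for every seed name found in `tokens`.

    The context is the `window` tokens to the left and to the right of the
    seed, padded with `pad` at sentence boundaries. Seeds are assumed to be
    single tokens, which is a simplification.
    """
    instances: List[Instance] = []
    for i, tok in enumerate(tokens):
        label = seeds.get(tok)
        if label is None:
            continue  # not a seed name
        left = [tokens[j] if j >= 0 else pad for j in range(i - window, i)]
        right = [
            tokens[j] if j < len(tokens) else pad
            for j in range(i + 1, i + 1 + window)
        ]
        instances.append((tuple(left + right), label))
    return instances


# Hypothetical toy data.
toy_tokens = "gisteren sprak Wim Kok in Amsterdam".split()
toy_seeds = {"Kok": "PERSON", "Amsterdam": "LOCATION"}
toy_instances = collect_context_instances(toy_tokens, toy_seeds)
for context, label in toy_instances:
    print(label, context)
```

The resulting labeled context tuples are the kind of fixed-width instances a decision-tree learner could be trained on, though the exact feature layout used in the paper is not given in the abstract.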

Language Independent NER using a Unified Model of Internal and Contextual Evidence

by Silviu Cucerzan, David Yarowsky, 2002
Abstract - Cited by 8 (1 self)
This paper investigates the use of a language independent model for named entity recognition based on iterative learning in a co-training fashion, using word-internal and contextual information as independent evidence sources. Its bootstrapping process begins with only seed entities and seed contexts extracted from the provided annotated corpus. F-measure exceeds 77 in Spanish and 72 in Dutch.

Using N-best Lists for Named Entity Recognition from Chinese Speech

by Lufeng Zhai, Pascale Fung, Richard Schwartz, Marine Carpuat, Dekai Wu - Proc. of the NAACL 2004 (Short Papers), 2004
Abstract - Cited by 8 (0 self)
We present the first known result for named entity recognition (NER) in realistic large-vocabulary spoken Chinese. We establish this result by applying a maximum entropy model, currently the single best known approach for textual Chinese NER, to the recognition output of the BBN LVCSR system on Chinese Broadcast News utterances. Our results support the claim that transferring NER approaches from text to spoken language is a significantly more difficult task for Chinese than for English. We propose re-segmenting the ASR hypotheses as well as applying post-classification to improve the performance. Finally, we introduce a method of using n-best hypotheses that yields a small but nevertheless useful improvement in NER accuracy. We use acoustic, phonetic, language model, NER and other scores as confidence measures. Experimental results show an average of 6.7% relative improvement in precision and 1.7% relative improvement in F-measure.