Automatic Rule Induction for Unknown Word Guessing
- Computational Linguistics, 1997
Cited by 104 (6 self)
Words unknown to the lexicon present a substantial problem to NLP modules that rely on morphosyntactic information, such as part-of-speech taggers or syntactic parsers. In this paper we present a technique for fully automatic acquisition of rules that guess possible part-of-speech tags for unknown words using their starting and ending segments. The learning is performed from a general-purpose lexicon and word frequencies collected from a raw corpus. Three complementary sets of word-guessing rules are statistically induced: prefix morphological rules, suffix morphological rules and ending-guessing rules. Using the proposed technique, unknown-word-guessing rule sets were induced and integrated into a stochastic tagger and a rule-based tagger, which were then applied to texts with unknown words.
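The idea of ending-guessing rules can be illustrated with a minimal Python sketch. It is not the paper's procedure: the function names, the `min_count` and `min_share` thresholds, and the simple majority-share scoring are all assumptions made for illustration, and the corpus word frequencies mentioned in the abstract are left out.

```python
from collections import Counter, defaultdict


def induce_ending_rules(lexicon, max_len=3, min_count=2, min_share=0.8):
    """Induce ending -> tag-class rules from a lexicon (illustrative only).

    lexicon maps each word to the set of POS tags it can take.
    The thresholds are hypothetical stand-ins for a statistical score.
    """
    stats = defaultdict(Counter)
    for word, tags in lexicon.items():
        tag_class = frozenset(tags)
        # Count every ending of up to max_len letters, keeping a non-empty stem.
        for n in range(1, max_len + 1):
            if len(word) > n:
                stats[word[-n:]][tag_class] += 1

    rules = {}
    for ending, counts in stats.items():
        tag_class, hits = counts.most_common(1)[0]
        total = sum(counts.values())
        # Keep an ending only if one tag class clearly dominates it.
        if hits >= min_count and hits / total >= min_share:
            rules[ending] = tag_class
    return rules


def guess_tags(word, rules, max_len=3):
    """Guess tags for an unknown word, trying the longest ending first."""
    for n in range(min(max_len, len(word) - 1), 0, -1):
        tags = rules.get(word[-n:])
        if tags is not None:
            return tags
    return None
```

With a toy lexicon containing "walked" and "talked" tagged as past-tense forms, the sketch would learn a rule for the ending "ked" and apply it to an unseen word such as "parked".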
Periods, Capitalized Words, etc.
- Computational Linguistics, 1999
Cited by 2 (0 self)
In this paper we present an approach which tackles three problems: sentence boundary disambiguation, disambiguation of capitalized words when they are used in positions where capitalization is expected, and identification of abbreviations. All three are important tasks of text normalization, which is a necessary phase in almost all text processing activities. The main feature of our approach is that it uses a minimum of pre-built resources. To compensate for the lack of pre-acquired knowledge, the system tries to dynamically infer disambiguation clues from the entire document itself. This makes our approach domain independent, closely targeted to each document and portable to other languages. We thoroughly evaluated our approach on the Brown Corpus and on a corpus of newswire articles from The New York Times. The system performed very strongly, reaching about 99% accuracy on capitalized words and about 99.3-99.7% accuracy on sentence boundaries. This performance is the highest quoted in the literature for these tasks. We also present the results of applying our system to a corpus of news in Russian and training a part-of-speech tagger which uses a maximum entropy model that utilizes nonlocal features generated by our method.
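Inferring disambiguation clues from the document itself can be sketched roughly as follows. This is a heavily simplified illustration, not the system described in the abstract: the function names, the regular expressions and the three-way labels are assumptions. A capitalized word in an ambiguous position is treated as a proper name if the same document also shows it capitalized mid-sentence, and as a common word if the document shows it in lowercase.

```python
import re


def document_clues(text):
    """Collect two kinds of evidence from the whole document (illustrative)."""
    # Word forms that occur entirely in lowercase somewhere in the document.
    lower = set(re.findall(r"\b[a-z]+\b", text))
    # Capitalized words directly after a lowercase letter or comma plus a space,
    # i.e. in a position where capitalization is not forced by a sentence start.
    mid_caps = set(re.findall(r"(?<=[a-z,] )[A-Z][a-z]+", text))
    return lower, mid_caps


def classify(word, lower, mid_caps):
    """Label a capitalized word found where capitalization is expected."""
    if word in mid_caps:
        return "proper"
    if word.lower() in lower:
        return "common"
    return "unknown"
```

For example, in a text where "Brown" appears after "met" and "rain" appears in lowercase, a sentence-initial "Brown" would be labeled a proper name and a sentence-initial "Rain" a common word.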

