Results 1 - 8 of 8
Improving Chinese tokenization with linguistic filters on statistical lexical acquisition
In Proceedings of the Fourth Conference on Applied Natural Language Processing, 1994
Cited by 33 (13 self)
The first step in Chinese NLP is to tokenize or segment character sequences into words, since the text contains no word delimiters. Recent heavy activity in this area has shown the biggest stumbling block to be words that are absent from the lexicon, since successful tokenizers to date have been based on dictionary lookup (e.g., Chang & Chen 1993;
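To illustrate why words missing from the lexicon are the stumbling block for dictionary-lookup tokenizers, here is a minimal sketch of greedy forward maximum matching. This is a generic textbook technique chosen for illustration, not necessarily the method of the paper or of the tokenizers it cites; the function name `max_match`, the `max_len` parameter, and the toy lexicon are all assumptions.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching (illustrative sketch only).

    At each position, take the longest substring found in the lexicon.
    If nothing matches, fall back to a single-character token.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking toward one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


# Hypothetical toy lexicon; a real system would load a large dictionary.
lexicon = {"中国", "人民", "银行"}

print(max_match("中国人民银行", lexicon))
print(max_match("中国人民很好", lexicon))
```

In the second call, the characters that are not covered by the toy lexicon come out as single-character tokens, which is the failure mode that motivates acquiring new lexical entries from a corpus.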
Large-Scale Automatic Extraction of an English-Chinese Translation Lexicon
Machine Translation, 1995
Cited by 30 (1 self)
We report experimental results on automatic extraction of an English-Chinese translation lexicon, by statistical analysis of a large parallel corpus, using limited amounts of linguistic knowledge. To our knowledge, these are the first empirical results of the kind between an Indo-European and non-Indo-European language for any significant vocabulary and corpus size. The learned vocabulary size is about 6,500 English words, achieving translation precision in the 86-96% range, with alignment proceeding at paragraph, sentence, and word levels. Specifically, we report (1) progress on the HKUST English-Chinese Parallel Bilingual Corpus, (2) experiments supporting the usefulness of restricted lexical cues for statistical paragraph and sentence alignment, and (3) experiments that question the role of hand-derived monolingual lexicons for automatic word translation acquisition. Using a hand-derived monolingual lexicon, the learned translation lexicon averages 2.33 Chinese translations per English entry, with a manually-filtered precision of 95.1%, and an automatically-filtered weighted precision of 86.0%. We then introduce a fully automatic two-stage statistical methodology that is able to learn translations for collocations. A statistically-learned monolingual Chinese lexicon is first used to segment the Chinese text, before applying bilingual training to produce 6,429 English entries with 2.25 Chinese translations per entry. This method improves the manually-filtered precision to 96.0% and the automatically-filtered weighted precision to 91.0%, an error rate reduction of 35.7% from using a hand-derived monolingual lexicon.
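The 35.7% figure appears to follow from the automatically-filtered weighted precision rising from 86.0% to 91.0%, if error rate is taken as 100% minus precision. The short check below works through that arithmetic; the function name `error_rate_reduction` is an assumption introduced here, and the reading of which two numbers the abstract compares is an interpretation rather than something the abstract states outright.

```python
def error_rate_reduction(precision_before, precision_after):
    """Relative reduction in error rate, in percent.

    Assumes error rate = 100 - precision (both given in percent).
    """
    error_before = 100.0 - precision_before
    error_after = 100.0 - precision_after
    return 100.0 * (error_before - error_after) / error_before


# Weighted precision: 86.0% (hand-derived lexicon) vs. 91.0% (learned lexicon).
# Errors fall from 14.0 to 9.0 points, i.e. 5/14 of the original error.
print(round(error_rate_reduction(86.0, 91.0), 1))  # 35.7
```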
Learning an English-Chinese lexicon from a parallel corpus
In Proceedings of the First Conference of the Association for Machine Translation in the Americas, 1994
Cited by 27 (4 self)
We report experiments on automatic learning of an English-Chinese translation lexicon, through statistical training on a large parallel corpus. The learned vocabulary size is nontrivial at 6,517 English words averaging 2.33 Chinese translations per entry, with a manually-filtered precision of 95.1% and a single-most-probable precision of 91.2%. We then introduce a significance filtering method that is fully automatic, yet still yields a weighted precision of 86.0%. Learning of translations is adaptive to the domain. To our knowledge, these are the first empirical results of the kind between an Indo-European and non-Indo-European language for any significant corpus size with a non-toy vocabulary.
Extracting Key Terms from Chinese and Japanese texts
Computer Processing of Oriental Languages, 1998
Cited by 9 (1 self)
Key term extraction is very useful for information retrieval. Most term extraction methods use one of two approaches, namely lexical and grammatical. We argue that due to the differences in linguistic and character set characteristics of Chinese and Japanese, a lexical approach is more suitable for Chinese whereas a grammatical approach is more suitable for Japanese. In this paper, we present two simple yet powerful systems for Chinese and Japanese key term extraction: CXtract and JBrat. CXtract uses predominantly statistical lexical information to find term boundaries in large text. JBrat is based on morphosyntactic information of the Japanese character sets for terms. Evaluation results show that CXtract has an 80.24% average precision in term extraction, and JBrat has an 88.07% average precision. 1 Introduction Linguists have argued that the smallest semantic unit is often not a single word, as defined by a string of letters delimited by spaces, but a phrase (or a term) (Pinchuck 19...
Adaptive Sentence Alignment based on Length and Lexical Information
In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Companion Volume, 2002
Cited by 1 (1 self)
This prototype system demonstrates a novel sentence alignment method for bilingual texts based on adaptive learning and lexical information. The system aligns bilingual text at the paragraph level first and acquires length related statistics for the subsequent sentence alignment process. In addition to lengths, a probabilistic translation lexicon is utilized to further enhance the precision. The system is especially effective in the case of noisy translations produced in either translation direction that may involve different domains.
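The idea of acquiring length statistics from aligned paragraphs and reusing them for sentence alignment can be sketched roughly as follows. This is a simplified, Gale-Church-style length model written for illustration; the abstract does not specify the actual statistics or scoring used, so the function names `length_stats` and `length_cost`, the choice of character counts as the length unit, and the cost formula are all assumptions.

```python
import math


def length_stats(paragraph_pairs):
    """Estimate length statistics from already-aligned paragraph pairs.

    Returns (c, s2): c is the expected number of target characters per
    source character, s2 is the variance of the deviation per source
    character. Both are estimated from the data (the "adaptive" part).
    """
    lengths = [(len(src), len(tgt)) for src, tgt in paragraph_pairs]
    c = sum(t for _, t in lengths) / sum(s for s, _ in lengths)
    s2 = sum((t - c * s) ** 2 / s for s, t in lengths) / len(lengths)
    return c, max(s2, 1e-9)  # guard against zero variance


def length_cost(src_len, tgt_len, c, s2):
    """Squared normalized length deviation; lower means a better match."""
    delta = (tgt_len - c * src_len) / math.sqrt(src_len * s2)
    return delta * delta


# Hypothetical aligned paragraphs (source text, target text).
pairs = [("a" * 100, "b" * 55), ("a" * 200, "b" * 118), ("a" * 50, "b" * 27)]
c, s2 = length_stats(pairs)

# A candidate sentence pair close to the learned ratio costs less
# than one that is far from it.
print(length_cost(40, 23, c, s2) < length_cost(40, 60, c, s2))
```

A full aligner would combine a cost like this with lexicon-based evidence inside a dynamic-programming search over candidate sentence pairings; that part is omitted here.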
Alignment of bilingual named entities in parallel corpora using statistical model
Lecture Notes in Artificial Intelligence, 2005
Cited by 1 (0 self)
Named entity (NE) extraction is one of the fundamental tasks in natural language processing (NLP). Although many studies have focused on identifying NEs within monolingual documents, aligning NEs in bilingual documents has not been investigated extensively due to the complexity of the task. In this article, we introduce a new approach to aligning bilingual NEs in parallel corpora by incorporating statistical models with multiple knowledge sources. In our approach, we model the process of translating an English NE phrase into a Chinese equivalent using lexical translation/transliteration probabilities for word translation and alignment probabilities for word reordering. The method involves automatically learning phrase alignment and acquiring word translations from a bilingual phrase dictionary and parallel corpora, and automatically discovering transliteration transformations from a training set of name-transliteration pairs. The method also involves language-specific knowledge functions, including abbreviation handling, Chinese person name recognition, and acronym expansion. At run time, the proposed models are applied to each source NE in a pair of bilingual sentences to generate and evaluate the target NE candidates, and the source and target NEs are aligned based on the computed probabilities. Experi...
Statistical Augmentation of a Chinese Machine-Readable Dictionary
In Proceedings of the Second Annual Workshop on Very Large Corpora, 1994
We describe a method of using statistically-collected Chinese character groups from a corpus to augment a Chinese dictionary. The method is particularly useful for extracting domain-specific and regional words not readily available in machine-readable dictionaries. Output was evaluated both using human evaluators and against a previously available dictionary. We also evaluated performance improvement in automatic Chinese tokenization. Results show that our method outputs legitimate words, acronymic constructions, idioms, names and titles, as well as technical compounds, many of which were lacking from the original dictionary. 1 Introduction Finding new lexical entries for Chinese is hampered by a particularly obscure distinction between characters, morphemes, words, and compounds. Even in Indo-European text where words can be separated by spaces, no absolute criteria are known for deciding whether a collocation constitutes a compound word. Chinese defies such distinctions yet more str...
Parsing Chinese with an Almost-Context-Free Grammar
We describe a novel parsing strategy we are employing for Chinese. We believe progress in Chinese parsing technology has been slowed by the excessive ambiguity that typically arises in pure context-free grammars. This problem has inspired a modified formalism that enhances our ability to write and maintain robust large grammars, by constraining productions with left/right contexts and/or nonterminal functions. Parsing is somewhat more expensive than for pure context-free parsing, but is still efficient by both theoretical and empirical analyses. Encouraging experimental results with our current grammar are described.

