A Stochastic Finite-State Word-Segmentation Algorithm For Chinese
- Computational Linguistics
, 1996
Cited by 99 (9 self)
Chinese text into dictionary entries and productively derived words, and providing pronunciations for these words; the method incorporates a class-based model in its treatment of personal names. We also evaluate the system's performance, taking into account the fact that people often do not agree on a single segmentation.
Comparing Representations in Chinese Information Retrieval
- In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 1997
Cited by 39 (1 self)
Three representation methods are empirically investigated for Chinese information retrieval: 1-gram (single character), bigram (two contiguous overlapping characters), and short-word indexing based on a simple segmentation of the text. The retrieval collection is the approximately 170 MB TREC-5 Chinese corpus of news articles, and 28 queries that are long and rich in wordings. Evaluation shows that 1-gram indexing is good but not sufficiently competitive, while bigram indexing works surprisingly well. Bigram indexing leads to a large index term space, three times that of short-word indexing, but is as good as short-word indexing in precision, and about 5% better in relevants retrieved. The best average non-interpolated precision is about 0.45, 17% better than 1-gram indexing and quite high for a mainly statistical approach. 1. Introduction While information retrieval (IR) in English has over thirty years of history, IR in Chinese is relatively recent. It is well-known that written Chi...
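The three representations compared in the abstract above can be illustrated with a minimal Python sketch. The function names, the tiny lexicon, and the greedy longest-match rule are assumptions made for illustration only; the abstract says only that short-word indexing rests on "a simple segmentation", so this is not the paper's actual procedure.

```python
def unigrams(text):
    """1-gram index terms: every non-whitespace character on its own."""
    return [ch for ch in text if not ch.isspace()]


def bigrams(text):
    """Bigram index terms: each pair of contiguous, overlapping characters."""
    chars = unigrams(text)
    return [a + b for a, b in zip(chars, chars[1:])]


def short_words(text, lexicon, max_len=3):
    """Short-word index terms via greedy longest match against a lexicon.

    Hypothetical stand-in for the 'simple segmentation' in the abstract:
    at each position take the longest lexicon entry of at most max_len
    characters, falling back to a single character.
    """
    chars = unigrams(text)
    terms, i = [], 0
    while i < len(chars):
        for n in range(min(max_len, len(chars) - i), 0, -1):
            candidate = "".join(chars[i:i + n])
            if n == 1 or candidate in lexicon:
                terms.append(candidate)
                i += n
                break
    return terms


text = "信息检索系统"
lexicon = {"信息", "检索", "系统"}  # toy lexicon, assumed for the example
print(unigrams(text))
print(bigrams(text))
print(short_words(text, lexicon))
```

For a string of n characters the bigram scheme emits n - 1 overlapping terms, which is consistent with the abstract's remark that bigram indexing produces a much larger index term space than short-word indexing.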
Statistical models for word segmentation and unknown resolution
- In Proceedings of ROCLING-92
, 1992
Cited by 21 (3 self)
In a Chinese sentence, there are no word delimiters, like blanks, between the “words”. Therefore, it is important to identify the word boundaries before processing Chinese text. Traditional approaches tend to use dictionary lookup, morphological rules and heuristics to identify the word boundaries. Such approaches may not be applied to a large system due to the complicated linguistic phenomena involved in Chinese morphology and syntax. In this paper, the various available features in a sentence are used to construct a generalized word segmentation model; the various probabilistic models for word segmentation are then derived based on the generalized model. In general, the likelihood measure adopted in a probabilistic model does not provide a scoring mechanism that directly indicates the real ranks of the various candidate segmentation patterns. To enhance the baseline models, a robust adaptive learning algorithm is proposed to adjust the parameters of the baseline models so as to increase the discrimination power and robustness of the models. The simulation shows that cost-effective word segmentation could be achieved under various contexts with the proposed models. It is possible to achieve a word recognition rate of 99.39% and a sentence recognition rate of 97.65% on the testing corpus by incorporating word length information into a context-independent word model and applying a robust adaptive learning algorithm in the segmentation process. Since not all lexical items could be found in the system dictionary in real applications, the performance of most word segmentation methods in the literature may degrade significantly when unknown words are encountered. Such an “unknown word problem” is also examined in this paper. An error recovery mechanism based on the segmentation model is proposed. Preliminary experiments show that the error rates introduced by unknown words could be reduced significantly.
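As a rough illustration of what a context-independent word model for segmentation looks like, the sketch below picks the segmentation that maximizes the product of per-word probabilities using dynamic programming. Everything specific here is an assumption: the function name, the toy probabilities, and the fixed fallback probability for unknown single characters are invented for the example, and the sketch omits the word-length feature and the adaptive learning step described in the abstract.

```python
import math


def segment(sentence, word_prob, max_len=4, unk_prob=1e-8):
    """Most probable segmentation under a context-independent word model.

    word_prob maps known words to (toy) probabilities. A single character
    that is not in word_prob gets the assumed fallback unk_prob, so every
    sentence has at least one valid segmentation.
    """
    n = len(sentence)
    # best[i] = (best log-probability of sentence[:i], start of last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            if word in word_prob:
                p = word_prob[word]
            elif len(word) == 1:
                p = unk_prob
            else:
                continue  # unknown multi-character strings are not proposed
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the sentence.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(sentence[start:end])
        end = start
    return words[::-1]


# Toy probabilities, invented for the example.
toy_model = {"研究": 0.02, "研究生": 0.005, "生命": 0.01, "命": 0.001, "起源": 0.005}
print(segment("研究生命起源", toy_model))
```

With these toy numbers the reading 研究 / 生命 / 起源 scores higher than 研究生 / 命 / 起源, which shows how word probabilities alone can resolve an overlapping ambiguity. Unknown multi-character words are never proposed here, which is one face of the unknown word problem the abstract mentions.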
Retrieval Of Broadcast News Speech In Mandarin Chinese Collected In Taiwan Using Syllable-Level Statistical Characteristics
- Proceedings of the 2000 International Conference on Acoustics Speech and Signal Processing
, 2000
Cited by 13 (1 self)
Spoken document retrieval has been extensively studied in recent years because of its high potential in various applications in the near future. Considering the monosyllabic structure of the Chinese language, a whole class of indexing features for retrieval of spoken documents in Mandarin Chinese using syllable-level statistical characteristics has been studied, and very encouraging experimental results on retrieval of broadcast news speech collected in Taiwan were obtained. This paper reports some interesting initial results and findings obtained in this research. 1. INTRODUCTION The network technologies and the Internet activities have created a completely new information era. Intelligent and efficient information retrieval techniques providing Internet users with easy access to spoken documents, such as broadcast radio and television programs, have become highly desired and have been extensively studied in recent years [1-6]. At the same time, the DARPA Hub-4 contest that began in...
Critical Tokenization and its Properties
- Computational Linguistics
, 1997
Cited by 13 (1 self)
This paper sets out to study critical tokenization, a distinctive type of tokenization following the principle of maximum tokenization. The objective in this paper is to develop its mathematical description and understanding
On the Use of Words and N-grams for Chinese Information Retrieval
- In Fifth International Workshop on Information Retrieval with Asian Languages, IRAL2000, Hong Kong
, 2000
Cited by 13 (2 self)
In the processing of Chinese documents and queries in information retrieval (IR), one has to ...
Identification and Classification of Proper Nouns in Chinese Texts
- Proceedings of the 16th International Conference on Computational Linguistics
, 1996
Cited by 9 (3 self)
Various strategies are proposed to identify and classify three types of proper nouns in Chinese texts. Clues from character, sentence and paragraph levels are employed to resolve Chinese personal names. Character,
Statistically-Enhanced New Word Identification in a Rule-Based Chinese System
- In Proceedings of the 2nd Chinese Language Processing Workshop
, 2000
Cited by 9 (2 self)
This paper presents a mechanism of new word identification in Chinese text where probabilities are used to filter candidate character strings and to assign POS to the selected strings in a rule-based system. This mechanism avoids the sparse data problem of pure statistical approaches and the over-generation problem of rule-based approaches. It improves parser coverage and provides a tool for the lexical acquisition of new words.
Introduction to CKIP Chinese word segmentation system for the first international Chinese Word Segmentation Bakeoff
- Proceedings of the second SIGHAN workshop on Chinese language processing
, 2003

