A Stochastic Finite-State Word-Segmentation Algorithm For Chinese
- Computational Linguistics, 1996
Cited by 99 (9 self)
Chinese text into dictionary entries and productively derived words, and providing pronunciations for these words; the method incorporates a class-based model in its treatment of personal names. We also evaluate the system's performance, taking into account the fact that people often do not agree on a single segmentation.
Critical Tokenization and its Properties
- Computational Linguistics, 1997
Cited by 13 (1 self)
This paper sets out to study critical tokenization, a distinctive type of tokenization following the principle of maximum tokenization. The objective in this paper is to develop its mathematical description and understanding
Accessor Variety Criteria for Chinese Word Extraction
- Computational Linguistics, 2004
Cited by 12 (0 self)
We are interested in the problem of word extraction from Chinese text collections. We define a word to be a meaningful string composed of several Chinese characters. For example,, ‘percent’, and, ‘more and more’, are not recognized as traditional Chinese words from the viewpoint of some people. However, in our work, they are words because they are very widely used and have specific meanings. We start with the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters that are directly before a string (predecessors) and the characters that are directly after a string (successors) as important factors for determining the independence of the string. We call such characters accessors of the string, consider the number of distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents), and use them as the measurement of the context independency of a string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction and is comparable to, and for long words outperforms, other iterative methods.
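The accessor counting described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `accessor_counts`, the `max_len` limit, the single `BOUNDARY` marker for sentence edges, and the toy Latin-letter sentences are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical marker: treats "start of sentence" and "end of sentence"
# as one special accessor. The abstract does not say how edges are handled.
BOUNDARY = None

def accessor_counts(sentences, max_len=3):
    """Map each substring (up to max_len characters) to a pair
    (number of distinct predecessors, number of distinct successors)."""
    preds = defaultdict(set)
    succs = defaultdict(set)
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                # Character directly before the string, or the edge marker.
                preds[s].add(sent[i - 1] if i > 0 else BOUNDARY)
                # Character directly after the string, or the edge marker.
                succs[s].add(sent[j] if j < n else BOUNDARY)
    return {s: (len(preds[s]), len(succs[s])) for s in preds}

# Toy data: "xy" appears in four different contexts, "bx" in only one.
counts = accessor_counts(["abxy", "cdxy", "xyef", "gxyh"])
print(counts["xy"])  # (4, 3)
print(counts["bx"])  # (1, 1)
```

A string that occurs with many distinct neighbours on both sides, like "xy" here, looks context independent, while "bx" does not. How the two counts are combined into one score and where the cut-off lies is left open in this sketch; taking the smaller of the two counts would be one plausible choice.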
Unsupervised Segmentation of Chinese Corpus Using Accessor Variety (Extended Abstract)
Cited by 6 (1 self)
- Haodi Feng, City University of Hong Kong, fenghaodi@hotmail.com; Kang Chen and Technology TsingHua University, Beijing, PRC; Chunyu Kit, Department of Chinese, Translation and Linguistics; Xiaotie Deng, City University of Hong Kong
Chinese texts are different from English texts in that they have no spaces to mark the boundaries of words. This makes the segmentation a special issue in Chinese texts processing. Since the amount of Chinese texts grows rapidly, especially due to the fast increase of the Internet, the number of Chinese words is also increasing fast. Those segmentation methods that depend on an existing dictionary thus have an obvious defect when they are used to segment texts which may contain words unknown to the dictionary.
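The defect of dictionary-dependent segmentation mentioned in this abstract can be seen in a small Python sketch. The abstract does not name a specific dictionary method, so the sketch assumes greedy forward maximum matching, a common dictionary-based baseline. The function name `forward_max_match`, the `max_len` parameter, and the toy Latin-letter dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation (assumed baseline, not the
    paper's method). Unknown material falls back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for k in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + k]
            if k == 1 or cand in dictionary:
                tokens.append(cand)
                i += k
                break
    return tokens

# "xy" is a word in the text but is missing from the dictionary.
print(forward_max_match("abxycd", {"ab", "cd"}))  # ['ab', 'x', 'y', 'cd']
```

Because "xy" is unknown to the dictionary, it is split into single characters even though the surrounding known words are found correctly.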

