A Stochastic Finite-State Word-Segmentation Algorithm For Chinese
- Computational Linguistics
, 1996
Cited by 99 (9 self)
... Chinese text into dictionary entries and productively derived words, and providing pronunciations for these words; the method incorporates a class-based model in its treatment of personal names. We also evaluate the system's performance, taking into account the fact that people often do not agree on a single segmentation.
Chinese word segmentation and named entity recognition: a pragmatic approach
- Computational Linguistics
, 2005
Cited by 14 (1 self)
This paper presents a pragmatic approach to Chinese word segmentation. It differs from most previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Secondly, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e. morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal word segmentation standard which is application independent. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different NLP applications might require different granularities of Chinese words. These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg (access
Critical Tokenization and its Properties
- Computational Linguistics
, 1997
Cited by 13 (1 self)
This paper sets out to study critical tokenization, a distinctive type of tokenization following the principle of maximum tokenization. The objective in this paper is to develop its mathematical description and understanding
Combination and boundary detection approaches on Chinese indexing
- Journal of the American Society for Information Science
, 2000
Cited by 7 (4 self)
Digital libraries store materials in electronic format. Research and development in digital libraries includes content creation, conversion, indexing, organization, and dissemination. The key technological issues are how to search and display desired selections from and across large collections effectively [Schatz & Chen, 1996]. Digital library research projects (DLI-1) sponsored by NSF/DARPA/NASA have a common theme of bringing search to the net, which is the flagship research effort for the National Information Infrastructure (NII) in the United States. A repository is an indexed collection of objects. Indexing is an important task for searching: the better the indexing, the better the search results. Developing a universal digital library has been the dream of many researchers; however, there are still many problems to
Exploiting the Web as Parallel Corpora for Cross-Language Information Retrieval
- Web Intelligence
, 2002
Cited by 3 (0 self)
The expansion of the Web creates more requirements for Cross-Language Information Retrieval (CLIR). Query translation is the key problem. Previous studies have shown that query translation can be done by exploiting a large set of parallel texts. However, large parallel corpora are unavailable for many languages. In this paper, we describe a mining system that automatically discovers parallel pages on the Web. This system exploits the existing search engines and the common characteristics in the organization of Web pages. Several large text corpora have been constructed using this system. This paper describes the mining process as well as the experimental results for English-French and English-Chinese CLIR. Our experiments show that query translation using the mined corpora can be as good as that of high-quality machine translation systems. This study shows the feasibility of automatically building a query translation system for all the active languages on the Web.
Experiments on unsupervised Chinese word segmentation and classification
- First Students workshop on Computational Linguistics
, 2002
Cited by 1 (0 self)
Chinese language processing faces several problems because Chinese is written without word delimiters. The difficulty in defining a word makes it even harder. This paper explores the possibility of automatically segmenting Chinese character sequences into words and classifying these words through distributional analysis, in contrast to the usual approaches, which depend on dictionaries.
© 2005 Association for Computational Linguistics
Cited by 1 (1 self)
This paper presents a pragmatic approach to Chinese word segmentation. It differs from most previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Secondly, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e. morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal word segmentation standard which is application independent. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different NLP applications might require different granularities of Chinese words. These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg, which will be described in detail. It consists of two components: (1) a generic segmenter that is based on the framework of linear mixture models, and provides a unified approach to the five fundamental features of word-level Chinese language processing: lexicon word processing, morphological analysis, factoid detection, named entity recognition, and new word identification; and (2) a set of output adaptors for adapting the output of the former to different application-specific standards. Evaluation on five test sets with different standards shows that the adaptive system achieves state-of-the-art performance on all the test sets.
Chinese Word Segmentation Based on Contextual Entropy
Chinese is written without word delimiters. Word segmentation is generally considered the key step in processing Chinese texts. This paper presents a new statistical approach to segment Chinese sequences into words. This approach is based on contextual entropy on both sides of a bigram. It is used to capture the dependency with the left and right contexts in which a bigram occurs. Our approach tries to find the word boundaries instead of words for segmentation. Experimental results show that it is effective for Chinese word segmentation.
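
The abstract above does not give formulas, so the following is only an illustrative sketch of what "contextual entropy on both sides of a bigram" could mean, not the authors' method. It assumes Shannon entropy over the single character immediately to the left and to the right of each character bigram; the function name context_entropies is hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_entropies(text):
    """Map each character bigram in text to (left_entropy, right_entropy).

    Assumption: the context of a bigram is the single character
    immediately before it (left) or immediately after it (right).
    """
    left = defaultdict(Counter)   # bigram -> counts of the preceding character
    right = defaultdict(Counter)  # bigram -> counts of the following character
    for i in range(len(text) - 1):
        bigram = text[i:i + 2]
        if i > 0:
            left[bigram][text[i - 1]] += 1
        if i + 2 < len(text):
            right[bigram][text[i + 2]] += 1

    def entropy(counts):
        # Shannon entropy (in bits) of the empirical context distribution.
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    bigrams = set(left) | set(right)
    return {b: (entropy(left[b]), entropy(right[b])) for b in bigrams}


if __name__ == "__main__":
    # Toy Latin-alphabet input; a real experiment would use a Chinese corpus.
    for bigram, (h_left, h_right) in sorted(context_entropies("abcabdabe").items()):
        print(bigram, round(h_left, 3), round(h_right, 3))
```

Under this reading, a bigram whose neighbouring characters vary widely (high entropy on a side) is weakly tied to that context, which is one plausible signal for placing a word boundary there; how the scores are thresholded or combined is left open here.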

