Text segmentation and Chinese site search
Abstract
Automatic segmentation and overlapping bigrams are the most common methods for overcoming the lack of explicit word boundaries in Chinese text. Past studies have compared their effectiveness, but findings have been equivocal and site search has been little studied. We compare representatives of the two approaches using a 465,000 page crawl and test queries applicable to the university context. 503 pairs of result sets were judged by 56 Chinese students. Although there are differences on certain queries, we find no overall advantage to either method. To understand the merits of each approach, we analyze cases where they performed differently. Our analysis enumerates situations which favour segmentation, and those which favour bigrams. We observe that further improvements in segmentation accuracy will not improve retrieval effectiveness.
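The two indexing strategies contrasted in this abstract can be illustrated with a minimal sketch. The abstract does not say which segmenter the authors used, so the dictionary-based forward maximum matching below is only a stand-in for "automatic segmentation". The function names, the toy lexicon, and the `max_len` parameter are assumptions made for illustration.

```python
def overlapping_bigrams(text):
    """Return every adjacent character pair; a single character is kept as is."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def forward_max_match(text, lexicon, max_len=4):
    """Greedy segmentation: at each position take the longest lexicon entry.

    This is a simple stand-in for an automatic segmenter, not the method
    used in the paper. Characters not covered by the lexicon are emitted
    as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"北京", "大学", "北京大学", "学生", "大学生"}
```

With this toy lexicon, `forward_max_match("北京大学生", LEXICON)` commits to 北京大学 and leaves 生 stranded, so a query for 学生 would miss the text. `overlapping_bigrams("北京大学生")` makes no such commitment and still indexes 学生, at the cost of also indexing pairs such as 京大 that cross word boundaries.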
By
, 2006
Abstract
Word segmentation, part-of-speech (POS) tagging, and sense tagging are important steps in various Chinese natural language processing (CNLP) systems. Unknown words, i.e., words that are not in the dictionary or training data used in a CNLP system, constitute a major challenge for each of these steps. This dissertation is concerned with developing hybrid models that effectively combine statistical, knowledge-based, and machine learning approaches for Chinese unknown word resolution, including the identification, part-of-speech (POS) tagging, and sense tagging of Chinese unknown words. What makes Chinese unknown word resolution hard is the limited information available for predicting the properties of unknown words, and for this reason it is crucial to make optimal use of information that is available. To this end, this research explores two central ideas and aims to achieve two major goals. First, the morphological, syntactic, and semantic information of the component characters or morphemes of an unknown word provides useful insights into its structural and semantic properties. The first goal of this work is to develop novel algorithms that
The Sparse Data Problem in Statistical Language Modeling and Unsupervised Word Segmentation
, 2001
Abstract
The sparse data problem is one of the most important problems in natural language processing. This thesis focuses on the sparse data problem in statistical language modeling and unsupervised word segmentation. To handle the sparse data problem in language modeling, we propose factored closed-/open-class m/n-gram models that improve on the standard n-gram model. For unsupervised word segmentation, we propose a hierarchical EM approach for continuous speech segmentation, a mutual-information-based lexicon pruning scheme, and a variant of the EM algorithm for Chinese word segmentation, which we call self-supervised training. We also aim to evaluate the effect of unsupervised Chinese word segmentation on Chinese information retrieval, compared with the standard word-based and character-based methods.
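The abstract does not describe how mutual information enters the lexicon pruning scheme. As a rough illustration of the underlying quantity only, the sketch below computes pointwise mutual information (PMI) for adjacent character pairs in an unsegmented corpus. Pairs that co-occur more often than chance predicts score higher and are more plausible word candidates. The function name and the toy corpus in the usage note are assumptions, and this is not the thesis's pruning algorithm.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Return log2 PMI for each adjacent character pair in a list of strings.

    Probabilities are plain relative frequencies with no smoothing, so
    the scores are only meaningful for pairs that occur in the corpus.
    """
    unigrams = Counter()
    bigrams = Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(line[i:i + 2] for i in range(len(line) - 1))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    pmi = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        pmi[pair] = math.log2(p_xy / (p_x * p_y))
    return pmi
```

For example, on the toy corpus `["abab", "cd"]` the pair `cd` scores highest, because `c` and `d` never occur apart, while `ba` scores lowest among the observed pairs. A pruning scheme could, under these assumptions, drop lexicon entries whose internal pairs fall below a chosen threshold.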