Self-supervised Chinese Word Segmentation (2001)
Venue: In F. Hoffmann et al. (Eds.): Advances in Intelligent Data Analysis, Proceedings of the Fourth International Conference (IDA-01), LNCS 2189
Citations: 32 (7 self)
Citations
11956 | Maximum Likelihood from Incomplete Data via the EM Algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...orithm is widely adopted for unsupervised training is that it is guaranteed to converge to a good probability model that locally maximizes the likelihood or posterior probability of the training data [6]. For Chinese segmentation, EM is usually applied by first extracting a lexicon which contains the candidate multi-grams from a given training corpus, initializing a probability distribution over lexico...
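As a rough illustration of the initialization step described in this context, the following Python sketch (the names and the simple count-based normalization are assumptions for illustration, not the paper's exact procedure) extracts candidate multi-grams up to a maximum length from an unsegmented corpus and turns their counts into an initial lexicon distribution.

from collections import Counter

def init_lexicon(corpus, max_len=4):
    """Extract candidate multi-grams (all substrings up to max_len characters)
    from an unsegmented corpus and initialize word probabilities from raw counts.
    A sketch of a typical EM initialization, not the paper's exact method."""
    counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence)):
            for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
                counts[sentence[i:j]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Tiny usage example on a single unsegmented sentence.
lexicon = init_lexicon(["北京大学生前来应聘"], max_len=4)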
1138 | Foundations of statistical natural language processing
- Manning, Schuetze
- 1999
Citation Context: ...wo-chunk segmentation, say s1 = "abcd" and s2 = "efghijk". Let the probabilities of the original string and the two chunks be p(s), p(s1), and p(s2) respectively. The pointwise mutual information [10] between s1 and s2 is MI(s1; s2) = log( p(s) / (p(s1) p(s2)) ) (6). To apply this measure to pruning, we set two thresholds, a larger and a smaller one. If the mutual information is higher than the larger threshold, we ...
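Read literally, the criterion above compares the probability of the whole string against the product of the probabilities of its two chunks. A minimal Python sketch of the MI computation and a two-threshold pruning decision (the threshold names and the three-way outcome are illustrative assumptions):

import math

def pointwise_mi(p_s, p_s1, p_s2):
    """MI(s1; s2) = log( p(s) / (p(s1) * p(s2)) ), equation (6) in the context above."""
    return math.log(p_s / (p_s1 * p_s2))

def prune_decision(mi, upper, lower):
    """Two-threshold rule (sketch): high MI keeps the whole string as one
    lexicon entry, low MI discards it in favour of the two chunks."""
    if mi > upper:
        return "keep whole string"
    if mi < lower:
        return "split into chunks"
    return "undecided"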
458 | A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
- Rabiner
- 1989
79 | Structure Learning in Conditional Probability Models via an Entropic Prior and Parameter Extinction
- Brand
- 1999
Citation Context: ...rter primitives (once the EM optimization has stabilized). Not only does this have the advantage of producing a smaller core lexicon, it also has the side effect of driving EM out of poor local maxima [6, 2] and yielding better segmentation performance. The remainder of the paper describes the self-supervised training procedure in detail, followed by the mutual information lexicon pruning criterion, expe...
54 | Language modeling by variable length sequences: Theoretical formulation and evaluation of multigrams.
- Deligne, Bimbot
- 1995
Citation Context: ... (4) therefore are weighted frequency counts. Thus, the updates can be efficiently calculated using the forward and backward algorithm, or efficiently approximated using the Viterbi algorithm; see [13] and [5] for detailed algorithms. 2.2 Self-supervised training The main difficulty with applying EM to this problem is that the probability distributions are complex and typically cause EM to get trapped in poor...
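The Viterbi approximation mentioned here can be sketched concretely: segment each training sentence with the current lexicon probabilities and take word counts from the best segmentation as approximate expected counts. The following Python sketch assumes an independent-word (multigram) model; the exact forward-backward updates are given in [13] and [5].

import math
from collections import Counter

def viterbi_segment(sentence, lexicon, max_len=4):
    """Most probable segmentation of `sentence` under independent word
    probabilities given by `lexicon` (dynamic programming over end positions)."""
    n = len(sentence)
    best = [0.0] + [float("-inf")] * n   # best log-probability of a segmentation of sentence[:i]
    back = [0] * (n + 1)                 # start position of the last word in that segmentation
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            w = sentence[j:i]
            if w in lexicon and best[j] + math.log(lexicon[w]) > best[i]:
                best[i] = best[j] + math.log(lexicon[w])
                back[i] = j
    words, i = [], n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return list(reversed(words))

def viterbi_counts(corpus, lexicon, max_len=4):
    """Approximate E-step: word frequencies taken from the Viterbi segmentations."""
    counts = Counter()
    for sentence in corpus:
        counts.update(viterbi_segment(sentence, lexicon, max_len))
    return counts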
46 | Mostly-unsupervised statistical segmentation of Japanese: Application to kanji
- Ando, Lee
- 2000
Citation Context: ...63 are most commonly used, building a complete lexicon by hand is impractical. Therefore a number of unsupervised segmentation methods have been proposed recently to segment Chinese and Japanese text [1, 3, 8, 12, 9]. Most of these approaches use some form of EM to learn a probabilistic model of character sequences and then employ Viterbi-decoding-like procedures to segment new text into words. One reason that EM...
40 | An unsupervised iterative method for Chinese new lexicon extraction
- Chang, Su
- 1997
Citation Context: ...63 are most commonly used, building a complete lexicon by hand is impractical. Therefore a number of unsupervised segmentation methods have been proposed recently to segment Chinese and Japanese text [1, 3, 8, 12, 9]. Most of these approaches use some form of EM to learn a probabilistic model of character sequences and then employ Viterbi-decoding-like procedures to segment new text into words. One reason that EM...
34 | On the discovery of novel word-like units from utterances: An artificial-language study with implications for native-language acquisition.
- Dahan, Brent
- 1999
Citation Context: ...ease the posterior probability of the training data. One advantage of unsupervised lexicon construction is that it can automatically discover new words once other words have acquired high probability [4]. For example, if one knows the word "computer" then upon seeing "computerscience" it is natural to segment "science" as a new word. Based on this observation, we propose a new word discovery method t...
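The "computerscience" example suggests a simple residual-substring heuristic: once a high-probability word is recognized inside a longer unsegmented chunk, the leftover characters become a candidate new word. A toy Python sketch of that idea (the function, the probability threshold, and the prefix/suffix restriction are illustrative assumptions, not the paper's actual discovery method):

def propose_new_words(chunk, lexicon, min_prob=1e-4):
    """If a known high-probability word is a prefix or suffix of an unsegmented
    chunk, propose the remaining characters as a candidate new word
    (e.g. 'computer' inside 'computerscience' leaves 'science')."""
    candidates = []
    for word, prob in lexicon.items():
        if prob < min_prob or len(word) >= len(chunk):
            continue
        if chunk.startswith(word):
            candidates.append(chunk[len(word):])
        elif chunk.endswith(word):
            candidates.append(chunk[:-len(word)])
    return candidates

print(propose_new_words("computerscience", {"computer": 0.01}))  # ['science']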
34 | USeg: A retargetable word segmentation procedure for information retrieval.
- Ponte, Croft
- 1996
Citation Context: ...63 are most commonly used, building a complete lexicon by hand is impractical. Therefore a number of unsupervised segmentation methods have been proposed recently to segment Chinese and Japanese text [1, 3, 8, 12, 9]. Most of these approaches use some form of EM to learn a probabilistic model of character sequences and then employ Viterbi-decoding-like procedures to segment new text into words. One reason that EM...
15 | Chinese Segmentation and its Disambiguation
- Jin
- 1992
Citation Context: ...sts the context information is very important. However, because of a different test set (our test set is the 1M Chinese Treebank from LDC, whereas their test data is 61K pre-segmented by the NMSU segmenter [9] and corrected by hand), the comparison is not fully calibrated. In the perfect lexicon experiments, [12] achieves higher performance (94.7% F-measure), whereas only 91.9% is achieved in our experimen...
14 | Chinese word segmentation and information retrieval
- Palmer, Burger
- 1997
Citation Context: ... our technique to previous results, we follow [8, 12] and measure performance by precision, recall, and F-measure on detecting word boundaries. Here, a word is considered to be correctly recovered iff [11]: 1. a boundary is correctly placed in front of the first character of the word, 2. a boundary is correctly placed at the end of the last character of the word, 3. and there is no boundary between the fir...
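The three conditions above amount to requiring that a predicted word span exactly matches a gold word span. A small Python sketch of boundary-exact word precision, recall, and F-measure (assuming gold and predicted segmentations of the same character string):

def word_spans(words):
    """Turn a segmentation (list of words) into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def precision_recall_f(gold_words, pred_words):
    """A word counts as correct iff its start boundary, end boundary, and
    interior (no internal boundary) all match, i.e. its span is identical."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f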
6 | Extracting key terms from Chinese and Japanese text
- Fung
- 1998
Citation Context: ...tion 2 to recover C3' from C3''. The validation corpus, C2, consists of 2000 sentences randomly selected from the test corpus (http://www.ldc.upenn.edu/ctb/). According to the 1980 Frequency Dictionary of Modern Chinese (see [7]), the top 9000 most frequent words in Chinese consist of 26.7% unigrams, 69.8% bigrams, 2.7% trigrams, 0.007% 4-grams, and 0.002% 5-grams. So in our model, we limit th...
2 | Discovering Chinese Words from Unsegmented Text. SIGIR-99
- Ge, Pratt, et al.
- 1999
Citation Context: ...63 are most commonly used, building a complete lexicon by hand is impractical. Therefore a number of unsupervised segmentation methods have been proposed recently to segment Chinese and Japanese text [1, 3, 8, 12, 9]. Most of these approaches use some form of EM to learn a probabilistic model of character sequences and then employ Viterbi-decoding-like procedures to segment new text into words. One reason that EM...
2 | A stochastic word-segmentation algorithm for Chinese
- Sproat, Shih, et al.
- 1996
Citation Context: ...gmenting an input sentence into words is a nontrivial task in such cases. For Chinese, there has been a significant amount of research on techniques for discovering word segmentations; see for example [14]. The main idea behind most of these techniques is to start with a lexicon that contains the set of possible Chinese words and then segment a concatenated Chinese character string by optimizing a heur...
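A classic instance of the lexicon-driven, heuristic approach described here is greedy forward maximum matching: at each position take the longest lexicon entry that matches, falling back to a single character. A minimal Python sketch (a simplification for illustration, not the specific heuristic of [14]):

def max_match(sentence, lexicon, max_len=4):
    """Greedy forward maximum matching over a word lexicon."""
    words, i = [], 0
    while i < len(sentence):
        for L in range(min(max_len, len(sentence) - i), 0, -1):
            if sentence[i:i + L] in lexicon or L == 1:
                words.append(sentence[i:i + L])   # longest match, else the single character
                i += L
                break
    return words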