Algorithms For Bigram And Trigram Word Clustering
- Speech Communication, 1995
Cited by 46 (0 self)
This paper presents and analyzes improved algorithms for clustering bigram and trigram word equivalence classes, and their respective results: 1) We give a detailed time complexity analysis of bigram clustering algorithms. 2) We present an improved implementation of bigram clustering so that large corpora (38 million words and more) can be clustered within a small number of days or even hours. 3) We extend the clustering approach from bigrams to trigrams. 4) We present experimental results on a 38 million word training corpus.

1. INTRODUCTION

Word equivalence classes are a method for improving undertrained word M-gram language models [1], [2], [4]. Words are grouped into classes, and each word belongs to only one such class. Thus, if a word pair is not seen in training, it is quite likely that the corresponding class pair is seen. For bigram and trigram class models, we have the equations

$p(w_n \mid w_{n-1}) = p_0(w_n \mid G(w_n)) \cdot p_1(G(w_n) \mid G(w_{n-1}))$ (1)

p(w_n | w_{n-2}, w_{n-...
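The class bigram factorization in equation (1) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function name `train_class_bigram`, the toy corpus, the hand-written class mapping `G`, and the use of plain relative frequencies for p0 and p1 are all assumptions made for illustration. The paper's clustering algorithm, which finds the mapping automatically, is not shown.

```python
from collections import Counter


def train_class_bigram(tokens, word_to_class):
    """Build p(w | v) = p0(w | G(w)) * p1(G(w) | G(v)) from relative frequencies.

    Assumption: unsmoothed maximum-likelihood estimates, chosen only to keep
    the sketch short.
    """
    classes = [word_to_class[w] for w in tokens]
    word_counts = Counter(tokens)                      # N(w)
    class_counts = Counter(classes)                    # N(G(w))
    class_bigrams = Counter(zip(classes, classes[1:])) # N(G(v), G(w))
    history_counts = Counter(classes[:-1])             # class counts as predecessor

    def prob(word, prev_word):
        g, g_prev = word_to_class[word], word_to_class[prev_word]
        if history_counts[g_prev] == 0:
            return 0.0
        p0 = word_counts[word] / class_counts[g]               # word given its class
        p1 = class_bigrams[(g_prev, g)] / history_counts[g_prev]  # class transition
        return p0 * p1

    return prob


# Hypothetical toy data: each word belongs to exactly one class.
tokens = "the cat sat a dog sat".split()
G = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN", "sat": "VERB"}
prob = train_class_bigram(tokens, G)

# The word pair ("a", "cat") never occurs in the corpus, but the class pair
# (DET, NOUN) does, so the class model still assigns it nonzero probability.
print(prob("cat", "a"))
```

In this toy corpus the unseen pair "a cat" receives a nonzero probability because its class pair was observed, which is the motivation the introduction gives for word equivalence classes.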

