## Selection Criteria for Word Trigger Pairs in Language Modeling (1996)

### Download Links

- [www.informatik.rwth-aachen.de]
- [www-i6.informatik.rwth-aachen.de]
- DBLP

### Other Repositories/Bibliography

Venue: ICGI ’96

Citations: 9 (1 self)

### BibTeX

```
@INPROCEEDINGS{Tillmann96selectioncriteria,
  author    = {Christoph Tillmann and Hermann Ney},
  title     = {Selection Criteria for Word Trigger Pairs in Language Modeling},
  booktitle = {ICGI '96},
  year      = {1996},
  pages     = {95--106},
  publisher = {Springer}
}
```

### Abstract

In this paper, we study selection criteria for the use of word trigger pairs in statistical language modeling. A word trigger pair is defined as a long-distance word pair. To select the most significant trigger pairs, we need suitable criteria, which are the topic of this paper. We extend a baseline language model by a single word trigger pair and use the perplexity of this extended language model as the selection criterion. This extension is applied to all possible trigger pairs, the number of which is the square of the vocabulary size. When a unigram language model is used as the baseline model, this approach yields the mutual information criterion used in [7, 11]. The more interesting case is to apply this criterion to a more powerful model such as a bigram/trigram model with a cache. We study different variants for including word trigger pairs in such a language model. This approach produced better word trigger pairs than the usual mutual information criterion. When used on...
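As a concrete illustration of the baseline criterion, the sketch below scores a candidate trigger pair by the mutual information between the binary events "a occurred in the history" and "the next word is b", computed from a 2×2 contingency table of counts, and ranks candidates by that score. This is an illustrative reconstruction, not the authors' code; all function and variable names are our own.

```python
import math

def mi_score(n_ab, n_a, n_b, n):
    """Mutual information between the events "a is in the history" and
    "the next word is b", from the four cells of a 2x2 count table."""
    score = 0.0
    for nab, na, nb in ((n_ab, n_a, n_b),                      # a, b
                        (n_a - n_ab, n_a, n - n_b),            # a, not b
                        (n_b - n_ab, n - n_a, n_b),            # not a, b
                        (n - n_a - n_b + n_ab, n - n_a, n - n_b)):
        if nab > 0:
            score += (nab / n) * math.log(nab * n / (na * nb))
    return score

def best_trigger_pairs(counts, n_a, n_b, n, k=10):
    """Rank candidate trigger pairs (a, b) by the MI criterion.
    counts maps (a, b) -> co-occurrence count; n_a and n_b map words
    to their marginal counts; n is the total number of positions."""
    scored = [(mi_score(c, n_a[a], n_b[b], n), (a, b))
              for (a, b), c in counts.items()]
    return [pair for _, pair in sorted(scored, reverse=True)[:k]]
```

An independent pair (its joint count equals the product of its marginals over n) scores zero, so only genuinely associated pairs rise to the top of the ranking.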

### Citations

9054 citations | Maximum likelihood from incomplete data via the EM algorithm (with discussion)
- Dempster, Laird, et al.
- 1977

Citation context: ...sk. We used the computed word pairs together with a cache in an interpolated model. The s_i in Eq. (10) were adjusted by trial and error in informal experiments. They can be trained by the EM procedure [3, 5]. The baseline trigram model was a backing-off model presented in [9]. We choose a number of the best trigger pairs as judged by the different selection criteria. We suppose that the combination of th...
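The EM procedure mentioned here for training interpolation weights can be sketched as follows: on held-out data, the E-step computes the posterior probability that each component model produced each word, and the M-step re-estimates the weights as the averaged posteriors. This is a textbook sketch under our own naming, not the paper's implementation.

```python
def em_interpolation_weights(component_probs, n_iter=50):
    """Estimate the weights of a linear interpolation of fixed language
    models by EM on held-out data. component_probs[t][i] is the
    probability the i-th component assigns to the t-th held-out word."""
    k = len(component_probs[0])
    lam = [1.0 / k] * k  # start from uniform weights
    for _ in range(n_iter):
        post_sum = [0.0] * k
        for probs in component_probs:
            z = sum(l * p for l, p in zip(lam, probs))
            for i in range(k):
                # E-step: posterior that component i produced this word
                post_sum[i] += lam[i] * probs[i] / z
        # M-step: new weights are the averaged posteriors
        lam = [s / len(component_probs) for s in post_sum]
    return lam
```

Because each E-step posterior vector sums to one, the weights stay normalized at every iteration, and the held-out likelihood never decreases.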

1152 citations | A maximum entropy approach to natural language processing
- Berger, Della Pietra, et al.
- 1996

Citation context: ...ria are in terms of the direct perplexity improvement by a trigger pair on p(w|h). This approach to selecting a trigger pair to extend a given model can be compared to the so-called feature selection in [2]. We present two new selection criteria: high-level trigger and low-level trigger selection. 2.1 High Level Triggers In order to select a trigger pair, we fix a long-distance trigger pair (a, b) and d...

701 citations | Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer
- Katz
- 1987

Citation context: ...additional types of dependencies? For the selection criterion, we consider two variants. In the first variant, we directly combine trigger pairs with a given baseline model using a backing-off scheme [6]. When using a unigram language model as baseline model, this approach produces the mutual information criterion used by Rosenfeld in [11]. The second variant we examined is based on the idea that tri...
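The backing-off idea of the Katz reference — discount the counts of seen events and redistribute the freed mass to a lower-order distribution — can be sketched with the interpolated absolute-discounting variant below. Katz's original scheme uses Good-Turing discounts and a hard back-off to the lower-order model for unseen events; this simplified form and all names in it are our own.

```python
def bigram_prob(w, h, bigram, context_total, types, p_uni, d=0.5):
    """P(w | h) with absolute discounting: each seen bigram count is
    reduced by d, and the total freed mass d * types[h] is spread over
    the unigram distribution p_uni. The result sums to 1 over the
    vocabulary whenever p_uni does."""
    c_h = context_total[h]            # total count of history h
    c_hw = bigram.get((h, w), 0)      # count of the bigram (h, w)
    backoff_mass = d * types[h] / c_h # mass freed by discounting
    return max(c_hw - d, 0.0) / c_h + backoff_mass * p_uni[w]
```

For example, with bigram counts from the toy sequence "a b a c a b", every word gets nonzero probability after "a", yet the distribution over the vocabulary still sums to one.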

362 citations | Self-organized language modeling for speech recognition
- Jelinek
- 1990

Citation context: ...sk. We used the computed word pairs together with a cache in an interpolated model. The s_i in Eq. (10) were adjusted by trial and error in informal experiments. They can be trained by the EM procedure [3, 5]. The baseline trigram model was a backing-off model presented in [9]. We choose a number of the best trigger pairs as judged by the different selection criteria. We suppose that the combination of th...

189 citations | Adaptive statistical language modeling: A maximum entropy approach
- Rosenfeld
- 1994

Citation context: ...ossible trigger pairs, the number of which is the square of the vocabulary size. When using a unigram language model as baseline model, this approach produces the mutual information criterion used in [7, 11]. The more interesting case is to use this criterion for a more powerful model such as a bigram/trigram model with a cache. We study different variants for including word trigger pairs into such a lan...

178 citations | The design for the Wall Street Journal-based CSR corpus
- Paul, Baker
- 1992

Citation context: ...terion in Eq. (8), B: high-level selection criterion in Eq. (1), C: low-level selection criterion in Eq. (9). For the experiments we used training corpora from the Wall Street Journal task (WSJ task) [10]. There were three different corpora of 1, 5 and 38 million words. In the first part of this section we present samples of the selected pairs for the three criteria. They were computed on the 38 milli...

88 citations | Trigger-based language models: A maximum entropy approach
- Lau, Rosenfeld, et al.
- 1993

Citation context: ...ossible trigger pairs, the number of which is the square of the vocabulary size. When using a unigram language model as baseline model, this approach produces the mutual information criterion used in [7, 11]. The more interesting case is to use this criterion for a more powerful model such as a bigram/trigram model with a cache. We study different variants for including word trigger pairs into such a lan...

38 citations | Adaptive Language Modeling Using The Maximum Entropy Principle
- Lau, Rosenfeld, et al.
- 1993

Citation context: ...baseline model p(w|h) we get:

F_ab − F_0 = N(a,b) log [q(b|a)/p(b)] + N(a,b̄) log [(1−q(b|a))/(1−p(b))] + N(ā,b) log [q(b|ā)/p(b)] + N(ā,b̄) log [(1−q(b|ā))/(1−p(b))]  (8)

If we multiply Eq. (8) by 1/N and suppose p(a,b) = N(a,b)/N, we get exactly the mutual information criterion used in [7, 11]. Thus this criterion is simply the improvement in the log-perplexity of a...
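The equivalence stated in this snippet can be checked numerically: implementing Eq. (8) from the four joint counts and dividing by N reproduces the mutual information between the two binary events. This is a sketch under our own naming, assuming relative-frequency estimates for q(b|a), q(b|ā), and p(b).

```python
import math

def criterion_eq8(n_ab, n_a_nb, n_na_b, n_na_nb):
    """F_ab - F_0 of Eq. (8). Arguments are the four joint counts:
    N(a,b), N(a,not-b), N(not-a,b), N(not-a,not-b). The trigger
    probabilities are estimated as relative frequencies."""
    n_a, n_nota = n_ab + n_a_nb, n_na_b + n_na_nb
    n = n_a + n_nota
    p_b = (n_ab + n_na_b) / n                      # p(b)
    q_b_a, q_b_nota = n_ab / n_a, n_na_b / n_nota  # q(b|a), q(b|not-a)
    return (n_ab * math.log(q_b_a / p_b)
            + n_a_nb * math.log((1 - q_b_a) / (1 - p_b))
            + n_na_b * math.log(q_b_nota / p_b)
            + n_na_nb * math.log((1 - q_b_nota) / (1 - p_b)))
```

For independent counts the criterion is exactly zero, and for counts such as (40, 10, 10, 40) with N = 100, dividing by 100 gives 0.8·log 1.6 + 0.2·log 0.4, the mutual information of the corresponding 2×2 table.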

31 citations | Inference and estimation of a long-range trigram model
- Della Pietra, Della Pietra, et al.
- 1994

Citation context: ...onfine the history to the current sentence, we get trigger pairs showing more grammatical structure, e.g. "I → myself", "We → ourselves". These results can be compared to the link grammar results in [4], where the grammar consists simply of pairs of words. The choice of pairs being used to extend a full language model depends on the model to be extended. The unigram trigger might offer a greater aver...

13 citations | Extensions of Absolute Discounting for Language Modeling
- Ney, Generet, et al.
- 1995

Citation context: ...ated model. The s_i in Eq. (10) were adjusted by trial and error in informal experiments. They can be trained by the EM procedure [3, 5]. The baseline trigram model was a backing-off model presented in [9]. We choose a number of the best trigger pairs as judged by the different selection criteria. We suppose that the combination of these trigger pairs will yield the best perplexity improvement within t...

3 citations | Next Word Statistical Predictor
- Bahl, Jelinek, et al.
- 1984

Citation context: ...pairs" [7, 11]. In this work, we restrict ourselves to trigger pairs where both the triggered and the triggering events are single words (as opposed to word phrases). Unlike the approach presented in [1, 7], where the trigger pairs are selected on the basis of a mutual information criterion, the selection criterion presented in this paper is directly the perplexity improvement obtained by extending the...