## ON THE RELATION BETWEEN ADDITIVE SMOOTHING AND UNIVERSAL CODING

### BibTeX

@MISC{_onthe,
  author = {},
  title = {ON THE RELATION BETWEEN ADDITIVE SMOOTHING AND UNIVERSAL CODING},
  year = {}
}

### Abstract

We analyze the performance of smoothing methods for language modeling from the perspective of universal compression. We use existing asymptotic bounds on the performance of simple additive rules for compression of finite-alphabet memoryless sources to explain the empirical predictive abilities of additive smoothing techniques. We further suggest a smoothing method that overcomes some of the problems observed in previous approaches. The new method outperforms existing ones on the Wall Street Journal (WSJ) database for bigram and trigram models. We then suggest possible directions for future research.

### Citations

940 | An empirical study of smoothing techniques for language modeling. Computer Speech and Language
- Chen, Goodman
- 1999
Citation Context: ...dations of universal compression, capable of producing tight performance bounds for underlying models, and practical techniques used in language modeling. We also build on a paper by Chen and Goodman [1] that first compared many of the smoothing techniques, and our experiments use similar methodologies. A sequential language model is a probability distribution over sentences, $w_0^l = w_0 \dots w_l$. We ... |

570 |
Theory of Probability
- Jeffreys
- 1961
Citation Context: ...s equation is known as Laplace's law of succession, and presents the posterior probability of symbol w, if we assume a uniform prior probability over all possible sample count ensembles; see, e.g., [3, 4]. When $\delta \neq 1$, Equation (1) represents Lidstone's law, often used to estimate probabilities in sparse data. These probability estimation methods are also considered in universal compression, where the... |
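As a concrete illustration of the estimators named in this snippet, Laplace's and Lidstone's rules can be sketched in a few lines; the toy corpus and vocabulary size are invented for the example:

```python
from collections import Counter

def lidstone(counts, vocab_size, delta=1.0):
    """Add-delta estimate: delta=1 gives Laplace's law of succession,
    other delta values give Lidstone's law."""
    total = sum(counts.values())
    return lambda w: (counts[w] + delta) / (total + delta * vocab_size)

# Toy corpus over a 4-symbol vocabulary; unseen "d" still receives mass.
counts = Counter("a a a b b c".split())
p = lidstone(counts, vocab_size=4, delta=1.0)
print(p("a"), p("d"))  # 0.4 0.1
```

With `delta=1`, every symbol's count is incremented by one, so the unseen word "d" gets probability 1/10 instead of zero.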

357 |
Interpolated estimation of Markov source parameters from sparse data
- Jelinek, Mercer
- 1980
Citation Context: ...on that the Good-Turing method is a universal compression method for unknown vocabularies [9], in a problem setting different than the one considered here. 3.1. Jelinek-Mercer model Jelinek and Mercer [10] have described a general class of n-gram models that interpolate Markov sources of different memory: $p(w_i \mid w_{i-n+1}^{i-1}) = \sum_{j=0}^{n-1} \lambda_j \, p_{\mathrm{ML}}(w_i \mid w_{i-j}^{i-1})$, where $\sum_{j=0}^{n-1} \lambda_j = 1$ and $p_{\mathrm{ML}}(w_i \mid w_{i-j}^{i-1}) = c(w_{i-j}^{i}) / c(w_{i-j}^{i-1})$. T... |
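The Jelinek-Mercer interpolation described here can be sketched for the bigram case; the corpus and the λ value are illustrative only:

```python
from collections import Counter

def jelinek_mercer_bigram(tokens, lam=0.7):
    """p(w|prev) = lam * p_ML(w|prev) + (1 - lam) * p_ML(w); lam + (1-lam) = 1."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    def p(w, prev):
        p_bi = bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
        return lam * p_bi + (1 - lam) * uni[w] / n
    return p

p = jelinek_mercer_bigram("the cat sat on the mat".split())
print(p("cat", "the"))  # 0.7 * 1/2 + 0.3 * 1/6 = 0.4
```

The bigram maximum-likelihood estimate is backed off toward the unigram frequency, so contexts with sparse counts never produce zero probabilities for words seen at all.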

305 |
Improved backing-off for m-gram language modeling
- Kneser, Ney
- 1995
Citation Context: ... performs better than linear discounting. However, when context clustering is used for linear discounting, the performance is about the same. 3.3. Kneser-Ney model Kneser and Ney [15] took the absolute discounting model a step further. They put a constraint on the back-off distribution, forcing the marginals of a higher order distribution to match the marginals of the training dat... |
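A minimal bigram sketch of the Kneser-Ney idea, with an invented corpus and discount value: the back-off ("continuation") probability of a word is proportional to the number of distinct contexts it follows, rather than to its raw count, which is what makes the marginals match:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, D=0.75):
    """Absolute discounting with a Kneser-Ney continuation back-off (sketch)."""
    bi = Counter(zip(tokens, tokens[1:]))
    ctx = Counter(tokens[:-1])            # c(prev) as a context
    followers = defaultdict(set)          # distinct words following each prev
    left_contexts = defaultdict(set)      # distinct contexts preceding each w
    for prev, w in bi:
        followers[prev].add(w)
        left_contexts[w].add(prev)
    n_types = len(bi)                     # number of distinct bigram types
    def p(w, prev):
        p_cont = len(left_contexts[w]) / n_types
        if ctx[prev] == 0:
            return p_cont
        lam = D * len(followers[prev]) / ctx[prev]   # mass freed by discounting
        return max(bi[(prev, w)] - D, 0) / ctx[prev] + lam * p_cont
    return p

p = kneser_ney_bigram("the cat sat on the mat the cat ran".split())
```

Because the discounted mass exactly equals the back-off weight λ, the probabilities over the vocabulary sum to one for every observed context.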

184 |
On structuring probabilistic dependency in stochastic language modeling. Computer Speech & Language 8
- Ney
- 1994
Citation Context: ...as a criterion for clustering. This criterion is also much less sensitive to cluster-size changes, compared to the total count criterion. 3.2. Linear and Absolute Discounting Ney, Essen and Kneser [13, 14] argued that all the words in longer contexts are oversampled and that there could be two general ways of discounting (or reducing their probabilities) to share the probability with the unobserved wor... |
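The two discounting schemes mentioned here can be contrasted directly; the counts and constants below are illustrative:

```python
def linear_discount(c, total, alpha=0.1):
    """Linear discounting: every observed count is scaled by (1 - alpha),
    so high-count words give up the most probability mass."""
    return (1 - alpha) * c / total

def absolute_discount(c, total, D=0.5):
    """Absolute discounting: a fixed amount D is subtracted from each
    nonzero count, so the relative loss is largest for rare words."""
    return max(c - D, 0) / total

# With c=100 out of total=1000, linear removes 10 counts, absolute only 0.5:
print(linear_discount(100, 1000), absolute_discount(100, 1000))  # 0.09 0.0995
```

In both cases the freed mass is redistributed to unobserved words via a back-off distribution; the schemes differ only in whose probability is reduced most.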

181 |
A cache-based natural language model for speech recognition
- Kuhn, de Mori
- 1990
Citation Context: ... insight into the non-static nature of language. Similar evidence that language cannot be modeled properly by a static model was also offered, e.g., through the successful use of cache-based language models [18]. We have shown that only the existence of sublinear word sequences in a context would justify count discounting asymptotically. It is up to future research to identify words responsible for those... |

135 |
The performance of universal encoding
- Krichevsky, Trofimov
- 1981
Citation Context: ...he worst source, the leading term in the redundancy bound is at least twice that of the best universal predictor. 2.4. Add-half rule The add-half rule, also known as the Krichevsky-Trofimov estimator [8], is often suggested in the compression literature as the best universal sequential compression method for finite alphabets. The average redundancy of the add-half rule asymptotically converges to the... |
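Sequentially, the add-half rule assigns the next symbol probability (c(x) + 1/2) / (n + V/2); a minimal sketch over a binary alphabet:

```python
def kt_sequence_prob(seq, alphabet):
    """Probability assigned to a sequence by the add-half
    (Krichevsky-Trofimov) sequential estimator."""
    counts = {a: 0 for a in alphabet}
    prob, n = 1.0, 0
    for x in seq:
        prob *= (counts[x] + 0.5) / (n + 0.5 * len(alphabet))
        counts[x] += 1
        n += 1
    return prob

print(kt_sequence_prob("01", "01"))  # (1/2) * (1/4) = 0.125
```

The products over all sequences of a given length sum to one, so the rule defines a valid coding distribution.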

94 | A bit of progress in language modeling
- Goodman
- 2001
Citation Context: ...hs of unseen data, and it is not clear why this occurred. In general, a language model whose marginals do not match the correct probabilities of the source will diverge arbitrarily over time [16]. In the given analysis, however, n-gram frequencies are used instead of the correct marginals, which are unknown. We feel that a deeper understanding of the reasons for this effect is needed. 3.4. Variat... |

69 |
An estimate of an upper bound for the entropy of English
- Brown, Pietra, et al.
- 1992
Citation Context: ...n parameters were uniform over all contexts. However, separate estimation for every context was not possible due to a lack of data. A particularly useful variant defined the interpolation hierarchically [11]: $p_{\mathrm{interp}}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}} \, p_{\mathrm{ML}}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}}) \, p_{\mathrm{interp}}(w_i \mid w_{i-n+2}^{i-1})$. The hierarchical definition made it possible to cluster λ's for se... |
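The recursion can be sketched for a bigram model that backs off through the unigram level to the uniform distribution; the corpus and λ values are invented, and the context count is approximated by the unigram count for brevity:

```python
from collections import Counter

def hierarchical_interp(tokens, V, lam_bi=0.6, lam_uni=0.9):
    """p_interp(w|prev) = lam * p_ML(w|prev) + (1-lam) * p_interp(w),
    bottoming out in the uniform distribution 1/V (sketch)."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    def p(w, prev):
        p_word = lam_uni * uni[w] / n + (1 - lam_uni) / V  # unigram level
        p_ml = bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
        return lam_bi * p_ml + (1 - lam_bi) * p_word       # bigram level
    return p

p = hierarchical_interp("a b a b a".split(), V=2)
print(p("b", "a"))  # 0.6 * 2/3 + 0.4 * (0.9 * 0.4 + 0.1 * 0.5) = 0.564
```

Each level trusts its own maximum-likelihood estimate with weight λ and delegates the rest to the next shorter context, which is what allows the λ's to be estimated (or clustered) per context.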

67 | Building probabilistic models for natural language. Doctoral dissertation, The Division of Applied Sciences
- Chen
- 1996
Citation Context: ...ustered together, or the number of free interpolation parameters that should be estimated, and it has been observed that these values depend on the size of the training data. Chen shows in his thesis [12] that it is better to use the average word count as a criterion for clustering. This criterion is also much less sensitive to cluster-size changes, compared to the total count criterion. 3.2. Linear an... |

58 |
Asymptotic minimax regret for data compression, gambling and prediction. IEEE Trans.
- Xie, Barron
- 2000
Citation Context: ...ol redundancy diminishes with the block size. We limit P to be the set of all i.i.d. distributions over an alphabet of finite cardinality V. In this setting, the worst case redundancy is analyzed in [5], $\hat{R}_n = \frac{V-1}{2} \log \frac{n}{2\pi} + C_V + o(1)$, and the average case redundancy in [6], where it was shown that $\bar{R}_n = \frac{V-1}{2} \log \frac{n}{2\pi e} + C_V + o(1)$, where $C_V = \log \frac{\Gamma(1/2)^V}{\Gamma(V/2)}$. Note that the two redundancie... |
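These constants are easy to evaluate numerically; a sketch using natural logarithms (so redundancy is measured in nats):

```python
import math

def c_v(V):
    """C_V = log( Gamma(1/2)^V / Gamma(V/2) )."""
    return V * math.lgamma(0.5) - math.lgamma(V / 2)

def worst_case_redundancy(n, V):
    return (V - 1) / 2 * math.log(n / (2 * math.pi)) + c_v(V)

def average_redundancy(n, V):
    return (V - 1) / 2 * math.log(n / (2 * math.pi * math.e)) + c_v(V)

# The two bounds differ only by the constant (V - 1)/2 nats:
print(worst_case_redundancy(10**6, 4) - average_redundancy(10**6, 4))  # 1.5
```

The gap between the worst-case and average-case bounds is independent of n, which is why results obtained for one measure carry over to the other up to a constant.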

51 |
Statistical language modeling using leaving-one-out
- Ney, Martin, et al.
- 1997
Citation Context: ...frequencies are used instead of the correct marginals, which are unknown. We feel that a deeper understanding of the reasons for this effect is needed. 3.4. Variations of the Kneser-Ney model Ney et al. [17] have suggested a variation of absolute discounting that uses two discounts, D1 for symbols observed once and D2+ for those observed two or more times. They reported mixed results compared to the sing... |

45 |
Always Good Turing: asymptotically optimal probability estimation. Science 302 (5644)
- Orlitsky, Santhanam, et al.
- 2003
Citation Context: ...d-Turing based methods are not covered, but it includes the state-of-the-art models. It is interesting to mention that the Good-Turing method is a universal compression method for unknown vocabularies [9], in a problem setting different than the one considered here. 3.1. Jelinek-Mercer model Jelinek and Mercer [10] have described a general class of n-gram models that interpolate Markov sources of different memory... |

29 |
Minimax redundancy for the class of memoryless sources
- Xie, Barron
- 1997
Citation Context: ... i.i.d. distributions over an alphabet of finite cardinality V. In this setting, the worst case redundancy is analyzed in [5], $\hat{R}_n = \frac{V-1}{2} \log \frac{n}{2\pi} + C_V + o(1)$, and the average case redundancy in [6], where it was shown that $\bar{R}_n = \frac{V-1}{2} \log \frac{n}{2\pi e} + C_V + o(1)$, where $C_V = \log \frac{\Gamma(1/2)^V}{\Gamma(V/2)}$. Note that the two redundancies differ only by a constant, and the results obtained for the two measures ar... |

25 |
The zero frequency problem: Estimating the probabilities of novel events in adaptative text compression
- Witten, Bell
- 1991
Citation Context: ...s equation is known as Laplace's law of succession, and presents the posterior probability of symbol w, if we assume a uniform prior probability over all possible sample count ensembles; see, e.g., [3, 4]. When $\delta \neq 1$, Equation (1) represents Lidstone's law, often used to estimate probabilities in sparse data. These probability estimation methods are also considered in universal compression, where the... |

24 |
On Smoothing Techniques for Bigram-Based Natural Language Modelling
- Ney, Essen
- 1991
Citation Context: ...as a criterion for clustering. This criterion is also much less sensitive to cluster-size changes, compared to the total count criterion. 3.2. Linear and Absolute Discounting Ney, Essen and Kneser [13, 14] argued that all the words in longer contexts are oversampled and that there could be two general ways of discounting (or reducing their probabilities) to share the probability with the unobserved wor... |

5 |
What's wrong with adding one? In: Corpus-Based Research into Language. Rodopi
- Gale, Church
- 1994
Citation Context: ... Add-one rule The add-one rule uses Laplace's law of succession for estimating the probability of the next word. It was one of the first methods employed in language modeling, but Church and Gale [7] showed experimentally that it had poor performance. It is easy to show that the worst-case redundancy of the add-one rule satisfies $\hat{R}_n(m_1) = \max_{w^n} \log \frac{\hat{p}(w^n)}{m_1(w^n)} \geq (V - 1)\log n + O(1)$, hence,... |
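The lower bound can be checked on the worst input for the add-one rule, a constant sequence, to which the maximum-likelihood model assigns probability 1; the sketch below uses natural logarithms:

```python
import math

def add_one_prob_constant_seq(n, V):
    """Add-one probability of the constant sequence x^n over a size-V alphabet:
    the i-th symbol gets probability (i + 1) / (i + V)."""
    p = 1.0
    for i in range(n):
        p *= (i + 1) / (i + V)
    return p

def redundancy_constant_seq(n, V):
    # ML assigns probability 1, so the redundancy is just -log m1(x^n).
    return -math.log(add_one_prob_constant_seq(n, V))

# For V = 2 the product telescopes to 1/(n + 1), giving (V - 1) log n + O(1):
print(redundancy_constant_seq(1000, 2))  # log(1001)
```

The (V - 1) log n growth here is n times worse than the ((V - 1)/2) log n redundancy of the add-half rule in log scale, which is the coding-theoretic reason the add-one rule performs poorly.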

3 | Essai philosophique sur les probabilités. Courcier Imprimeur - Laplace - 1816 |