## Iterative language model estimation: Efficient data structure & algorithms (2008)

Venue: Proc. Interspeech

Citations: 9 (3 self)

### BibTeX

@INPROCEEDINGS{Hsu08iterativelanguage,

author = {Bo-June (Paul) Hsu and James Glass},

title = {Iterative language model estimation: Efficient data structure & algorithms},

booktitle = {Proc. Interspeech},

year = {2008}

}

### Abstract

Despite the availability of better performing techniques, most language models are trained using popular toolkits that do not support perplexity optimization. In this work, we present an efficient data structure and optimized algorithms specifically designed for iterative parameter tuning. With the resulting implementation, we demonstrate the feasibility and effectiveness of such iterative techniques in language model estimation.

Index Terms: language modeling, smoothing, interpolation

### Citations

8090 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...bulary words in the corpus, respectively. The interpolation weights are typically chosen to minimize the development set perplexity using an iterative algorithm, such as Expectation-Maximization (EM) [12] or numerical optimization techniques [13]. Since computing the development set perplexity typically only involves a small subset of n-gram probabilities and backoff weights, we can pre-compute Boolea...
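The EM update for interpolation weights described in this snippet is compact once the per-token component probabilities are pre-computed, as the context notes. A minimal NumPy sketch (hypothetical array layout; not the toolkit's implementation):

```python
import numpy as np

def em_interp_weights(probs, iters=100):
    """EM updates for linear-interpolation weights.

    probs[m, t] holds the probability component model m assigns to
    development-set token t; pre-computing these once makes each EM
    iteration pure vector arithmetic.
    """
    m, n = probs.shape
    w = np.full(m, 1.0 / m)              # start from uniform weights
    for _ in range(iters):
        mix = w[:, None] * probs         # weighted component probabilities
        post = mix / mix.sum(axis=0)     # E-step: per-token posteriors
        w = post.mean(axis=1)            # M-step: average responsibility
    return w

# Toy example: two component LMs scored on three dev-set tokens.
dev_probs = np.array([[0.5, 0.1, 0.4],
                      [0.2, 0.6, 0.3]])
weights = em_interp_weights(dev_probs)
```

Each iteration is guaranteed not to decrease the development-set likelihood, which is why EM is a safe default for this tuning problem.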

851 | An Empirical Study of Smoothing Techniques for Language Modeling
- Chen, Goodman
- 1998
Citation Context: ...g, interpolation 1. Introduction For domains with limited matched training data, many of the most effective techniques for n-gram language model (LM) estimation, such as modified Kneser-Ney smoothing [1] and generalized linear interpolation [2], involve iterative parameter tuning to optimize the development set perplexity. However, due to the lack of support for performing such iterative LM estimatio...

759 | SRILM – an extensible language modeling toolkit
- Stolcke
- 2002
Citation Context: ...he development set perplexity. However, due to the lack of support for performing such iterative LM estimation in popular language modeling toolkits, such as the SRI Language Modeling (SRILM) toolkit [3], most work in the field opts for simpler techniques with inferior results. Previous work on data structures for n-gram models has primarily focused on runtime performance and storage compression [4]....

336 | Interpolated estimation of Markov source parameters from sparse data
- Jelinek, Mercer
- 1980
Citation Context: ...the parameters. Thus, by caching the counts c̃(hw) in the n-gram vector structure, we significantly reduce the amount of computation within each iteration. 3.2. Linear Interpolation Linear interpolation [11] is the most popular algorithm for merging multiple n-gram LMs. To create a static backoff n-gram LM from component LMs, SRILM interpolates the component probabilities for the union of observed n-grams...

316 | Statistical language modeling using the CMU-Cambridge toolkit
- Clarkson, Rosenfeld
- 1997
Citation Context: ...ory node (Figure 1c). Although simple to build incrementally, tries demonstrate poor memory locality during node traversals. Instead, the CMU-Cambridge Statistical Language Modeling (CMU SLM) toolkit [9] utilizes a compact vector encoding of the n-gram trie [4] where pointers to the child nodes are encoded with array indices (Figure 1d). The resulting structure not only improves memory locality, but ...
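The compact vector encoding this snippet describes replaces child pointers with array indices over sorted parallel vectors. A minimal sketch (hypothetical layout, in the spirit of the Figure 1d/e structures, not the CMU SLM or MITLM code): rows of each order are sorted by (history row, word id), so a node's children are contiguous and lookup is a binary search rather than pointer chasing.

```python
import bisect

def build_vector_trie(sorted_ngrams, max_order):
    """Build per-order parallel vectors of (history_row, word_id).

    N-grams must arrive in lexicographic word-id order; appending in
    that order keeps every previously assigned history index stable,
    which is what makes the vector encoding buildable in one pass.
    """
    keys = [[] for _ in range(max_order)]
    for ngram in sorted_ngrams:
        hist = 0                                 # row 0 = root
        for o, w in enumerate(ngram):
            if keys[o] and keys[o][-1] == (hist, w):
                hist = len(keys[o]) - 1          # prefix already present
            else:
                keys[o].append((hist, w))        # append keeps vectors sorted
                hist = len(keys[o]) - 1
    return keys

def lookup(keys, ngram):
    """Return the n-gram's row at its order, or -1 if absent.

    Each step is a binary search over a contiguous sorted slice, the
    memory-locality advantage over linked trie nodes noted above.
    """
    hist = 0
    for o, w in enumerate(ngram):
        i = bisect.bisect_left(keys[o], (hist, w))
        if i == len(keys[o]) or keys[o][i] != (hist, w):
            return -1
        hist = i
    return hist

trie = build_vector_trie([(1, 2), (1, 3), (2, 1)], max_order=2)
```

The row index returned by `lookup` can then index parallel probability and backoff-weight vectors, exactly the role of the h and b index vectors in Figure 1.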

271 | Improved backing-off for m-gram language modeling
- Kneser, Ney
- 1995
Citation Context: ...estimation approaches. 3.1. Modified Kneser-Ney Smoothing Modified Kneser-Ney smoothing [1] is one of the best performing techniques for n-gram LM estimation. Unlike the original Kneser-Ney smoothing [10] where a constant is subtracted from each count, modified Kneser-Ney subtracts a different value depending on the actual count. Specifically, it assigns probability p(w|h) = (c̃(hw) − D(c̃(hw))) / Σ_w c̃(hw...
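The count-dependent discount D(c) this snippet refers to takes three values in modified Kneser-Ney, with standard closed-form estimates derived from the counts-of-counts (Chen and Goodman). A sketch of those estimates (the paper under discussion instead tunes the discounts against dev-set perplexity, which these formulas only initialize):

```python
def mkn_discounts(n1, n2, n3, n4):
    """Closed-form modified Kneser-Ney discount estimates.

    n_k is the number of n-grams occurring exactly k times. Returns
    (D(1), D(2), D(3+)): the amount subtracted from counts of 1, 2,
    and 3 or more, respectively.
    """
    y = n1 / (n1 + 2.0 * n2)
    d1 = 1 - 2 * y * n2 / n1
    d2 = 2 - 3 * y * n3 / n2
    d3plus = 3 - 4 * y * n4 / n3
    return d1, d2, d3plus

# Toy counts-of-counts: 100 singletons, 50 doubletons, etc.
d1, d2, d3 = mkn_discounts(100, 50, 20, 10)
```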

267 | A limited memory algorithm for bound constrained optimization
- Byrd, Lu, et al.
- 1995
Citation Context: ...ing [14], and generalized linear interpolation [2]. Model parameters are tuned to minimize development set perplexity using numerical optimization techniques such as Powell’s method [13] and L-BFGS-B [15]. For efficiency, core data structures and vector operations are implemented in C++. For ease of development and experimentation, high level algorithms are written in Python with NumPy and SciPy packa...
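The bound-constrained numerical tuning this snippet describes maps directly onto SciPy's `minimize` with the L-BFGS-B method. A sketch with hypothetical data (toy dev-set probabilities from two component LMs; the paper's actual objective and parameterization are not reproduced here):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical pre-computed dev-set token probabilities from two LMs.
dev_p = np.array([[0.5, 0.1, 0.4],
                  [0.2, 0.6, 0.3]])

def neg_ll(x):
    """Negative dev-set log-likelihood; minimizing it minimizes
    perplexity. The free parameter is the first interpolation weight;
    the second is constrained to 1 - x[0]."""
    w = np.array([x[0], 1.0 - x[0]])
    return -np.sum(np.log(w @ dev_p))

res = minimize(neg_ll, x0=[0.5], method="L-BFGS-B",
               bounds=[(1e-6, 1.0 - 1e-6)])
best_w = res.x[0]
```

The bounds keep the weight strictly inside (0, 1) so the logarithm stays finite, which is the practical reason a bound-constrained method like L-BFGS-B fits this problem.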

103 | The LIMSI Broadcast News transcription system
- Gauvain, Lamel, et al.
Citation Context: ...and reduces the memory usage by 40%. Furthermore, it enables the efficient optimization of parameters for advanced interpolation techniques, instead of relying on empirical trial and error approaches [14, 18]. For future work, we plan to explore additional iterative LM estimation techniques and conduct more detailed perplexity and recognition evaluations. We would also like to further optimize the memory ...

44 | Web 1T 5-gram Version 1, Linguistic Data Consortium
- Brants, Franz
- 2006
Citation Context: ...inferior results. Previous work on data structures for n-gram models has primarily focused on runtime performance and storage compression [4]. With the availability of the Google Web 1T 5-gram corpus [5], recent research has examined efficient representations for building large-scale LMs [6, 7]. However, these efforts only support simple subpar smoothing and interpolation techniques, few that involve...

20 | English Gigaword, Linguistic Data Consortium
- Graff
- 2003
Citation Context: ...elopment set (BNDev) and used the remaining five months as the test set (BNTest). For training data, we use the Broadcast News data from 1992 to 1995 (BNTrain), along with the New York Times articles [17] from 1996 (NYT96). Table 1 summarizes all the evaluation data. 5.1. LM Smoothing In [1], Chen and Goodman conclusively showed that modified Kneser-Ney smoothing achieves better performance with tuned...

19 | Efficient Handling of N-gram Language Models for Statistical Machine Translation
- Federico, Cettolo
- 2007
Citation Context: ...d on runtime performance and storage compression [4]. With the availability of the Google Web 1T 5-gram corpus [5], recent research has examined efficient representations for building large-scale LMs [6, 7]. However, these efforts only support simple subpar smoothing and interpolation techniques, few that involve iterative parameter optimization. In this work, we propose a data structure designed specif...

18 | Quantization-based language model compression
- Whittaker, Raj
Citation Context: ...[3], most work in the field opts for simpler techniques with inferior results. Previous work on data structures for n-gram models has primarily focused on runtime performance and storage compression [4]. With the availability of the Google Web 1T 5-gram corpus [5], recent research has examined efficient representations for building large-scale LMs [6, 7]. However, these efforts only support simple s...

15 | MAP adaptation of stochastic grammars
- Bacchiani, Riley, et al.
Citation Context: ...8,976 355,995 9,153,440 37,884,316 NYT96 156,879,556 319,279 10,939,278 38,566,120 Table 1: Summary of evaluation datasets. modified Kneser-Ney smoothing [1], linear interpolation [11], count merging [14], and generalized linear interpolation [2]. Model parameters are tuned to minimize development set perplexity using numerical optimization techniques such as Powell’s method [13] and L-BFGS-B [15]. Fo...

10 | MSRLM: a scalable language modeling toolkit
- Nguyen, Gao, et al.
- 2007
Citation Context: ...d on runtime performance and storage compression [4]. With the availability of the Google Web 1T 5-gram corpus [5], recent research has examined efficient representations for building large-scale LMs [6, 7]. However, these efforts only support simple subpar smoothing and interpolation techniques, few that involve iterative parameter optimization. In this work, we propose a data structure designed specif...

9 | Generalized linear interpolation of language models
- Hsu
- 2007
Citation Context: ...ng backoff probabilities if necessary, and computes the backoff weights to normalize the model. Figure 4 contains an efficient vector implementation of the linear interpolation algorithm. 1. for i in [1, 2]: pi, αi = loadlm(lmi, model); 2. for i in [1, 2]: z = (pi == 0); pi[z] = αi[h[z]] × p⁻i[b[z]]; 3. p = p1 × w1 + p2 × w2; 4. α = (1 − binweight(h, p)) / (1 − binweight(h, p⁻[b])) Figure 4: Linear interpolati...
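The core vectorized step in the Figure 4 snippet, filling in backed-off probabilities before mixing, can be rendered in NumPy. A sketch with hypothetical array layouts following the figure's notation (h: history index vector, b: backoff index vector); the final backoff-weight renormalization is omitted since `binweight` is not shown in full:

```python
import numpy as np

def interpolate(p1, p2, a1, a2, pl1, pl2, h, b, w1, w2):
    """Vectorized linear interpolation of two backoff n-gram LMs.

    For rows a component never observed (p == 0), substitute its
    backoff estimate alpha(h) * p_lower(b) by indexing the history
    and backoff vectors, then mix with weights w1 and w2.
    """
    p1, p2 = p1.copy(), p2.copy()
    for p, a, pl in ((p1, a1, pl1), (p2, a2, pl2)):
        z = (p == 0)                   # rows unseen by this component
        p[z] = a[h[z]] * pl[b[z]]      # backoff: alpha(h) * lower-order p
    return w1 * p1 + w2 * p2

# Two n-gram rows; each component observed only one of them.
p = interpolate(np.array([0.3, 0.0]), np.array([0.0, 0.4]),
                np.array([0.5]), np.array([0.25]),
                np.array([0.2]), np.array([0.4]),
                np.array([0, 0]), np.array([0, 0]),
                0.5, 0.5)
```

Because every step is a whole-array operation over the parallel vectors, no per-n-gram traversal is needed, which is the point of the vector encoding.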

3 | SRILM man pages: ngram-format, 2004. http://www.speech.sri.com/projects/srilm/manpages/ngram-format.html
- Stolcke
- 2008
Citation Context: ...(e) MITLM Vectors. Figure 1: Various n-gram data structures. w: word, p: probability, α: backoff weight, h: history index, b: backoff index. ...file format [8], which stores p(w|h) and α(h) of the observed n-grams and their histories, as shown in Figure 1a. 2.1. Existing Representations While the ARPA file format serves as a standard cross-implementation re...
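For reference, the ARPA file format the snippet describes lists, per order, one line per observed n-gram: the log10 probability, the n-gram itself, and (for n-grams that are histories of higher-order n-grams) the log10 backoff weight α. A minimal file with toy numbers:

```
\data\
ngram 1=2
ngram 2=1

\1-grams:
-0.8  A  -0.4
-1.3  C

\2-grams:
-0.5  A C

\end\
```

Here `-0.8 A -0.4` stores log p(A) = -0.8 and log α(A) = -0.4; `C` has no backoff weight because it is not the history of any stored bigram.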

1 | English Broadcast News transcripts (HUB4), Linguistic Data Consortium
- Graff, Alabiso
- 1996
Citation Context: ...lkit, we will evaluate the runtime performance of select LM smoothing and interpolation experiments on the broadcast news domain. Specifically, the target corpus consists of the Broadcast News corpus [16] from 1996, where we designated the first month as the development set (BNDev) and used the remaining five months as the test set (BNTest). For training data, we use the Broadcast News data from 1992 ...