## Generalized linear interpolation of language models (2007)


Venue: IEEE Workshop on ASRU

Citations: 9 (3 self)

### BibTeX

@INPROCEEDINGS{Hsu07generalizedlinear,
  author = {Bo-June (Paul) Hsu},
  title = {Generalized linear interpolation of language models},
  booktitle = {IEEE Workshop on ASRU},
  year = {2007},
  pages = {136--140}
}


### Abstract

Despite the prevalent use of model combination techniques to improve speech recognition performance on domains with limited data, little prior research has focused on the choice of the actual interpolation model. For merging language models, the most popular approach has been simple linear interpolation. In this work, we propose a generalization of linear interpolation that computes context-dependent mixture weights from arbitrary features. Results on a lecture transcription task yield up to a 1.0% absolute improvement in recognition word error rate (WER).

Index Terms: language modeling, interpolation, adaptation, mixture models
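The paper's citation contexts describe the proposed model as computing softmax-normalized, context-dependent mixture weights from history features (interpretable as a two-layer neural network with a softmax output). Below is a minimal sketch of that idea, not the paper's implementation: the feature function, parameter layout, and the interface of component models as callables returning p(w | h) are all assumptions.

```python
import math

def softmax(scores):
    """Map raw scores to mixture weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def gli_prob(word, history, models, feature_fn, params):
    """Generalized linear interpolation sketch:
    p(w | h) = sum_i lambda_i(h) * p_i(w | h),
    where lambda(h) = softmax(W phi(h)) depends on history features phi(h)
    instead of being a fixed constant per component model."""
    phi = feature_fn(history)                     # feature vector for history h
    scores = [sum(w * f for w, f in zip(row, phi)) for row in params]
    lambdas = softmax(scores)
    return sum(lam * m(word, history) for lam, m in zip(lambdas, models))
```

With all parameters set to 0 (the initialization the paper reports), every weight is 1/M and the model reduces to uniform linear interpolation of the M components.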

### Citations

4893 citations: Bishop (1996), Neural Networks for Pattern Recognition
Citation context: "...right branch count (defined below) to derive the model. We can also interpret this interpolation weight function as a two-layer neural network with a softmax activation function on the output layer [11]. ... We will leave other features such as part-of-speech tags, topic labels, and document counts to future work. Following the notation in [12], we define the left branch count c_l(h) = N_{1+}(•h)..."

863 citations: Chen and Goodman (1998), An Empirical Study of Smoothing Techniques for Language Modeling
Citation context: "...softmax activation function on the output layer [11]. ... Following the notation in [12], we define the left branch count c_l(h) = N_{1+}(•h) as the number of unique words that appear before h. This count is used in Kneser-Ney smoothing [13] to estimate the probability of lower-order models..."
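The branch counts quoted in these contexts can be computed directly from a tokenized corpus. A minimal sketch assuming bigram histories by default; the function name and interface are illustrative, not from the paper.

```python
from collections import defaultdict

def branch_counts(tokens, order=2):
    """For each (order-1)-gram history h, compute the left branch count
    c_l(h) = N_{1+}(.h), the number of unique words seen immediately before h,
    and the right branch count c_r(h) = N_{1+}(h.), the number of unique words
    seen immediately after h. Both serve as interpolation-weight features."""
    n = order - 1
    left, right = defaultdict(set), defaultdict(set)
    for i in range(len(tokens) - n + 1):
        h = tuple(tokens[i:i + n])
        if i > 0:
            left[h].add(tokens[i - 1])    # word preceding this occurrence of h
        if i + n < len(tokens):
            right[h].add(tokens[i + n])   # word following this occurrence of h
    return ({h: len(s) for h, s in left.items()},
            {h: len(s) for h, s in right.items()})
```

The left branch count is the quantity Kneser-Ney smoothing [13] uses for lower-order distributions; the right branch count mirrors the unique-successor count of Witten-Bell smoothing [9].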

766 citations: Stolcke, SRILM: An Extensible Language Modeling Toolkit
Citation context: "...speech recognition than a single backoff n-gram LM, as it requires M probability evaluations for each possible word expansion and the storage of each component LM. Thus, as an approximation, Stolcke [6] constructs a single n-gram backoff model where the probability for all observed n-grams is the weighted average of the component model probabilities. The remaining probabilities are computed via appr..."
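The merging step described in this context can be illustrated as follows. This is a hedged sketch under simplifying assumptions: component models are plain dicts of observed n-gram probabilities, and an n-gram missing from a component is scored 0.0 here for brevity, whereas a real merger such as SRILM's would use that component's backoff estimate (the "remaining probabilities" the snippet mentions are omitted).

```python
def merge_observed(component_probs, weights):
    """Approximate M interpolated LMs with a single model, following the idea
    attributed to Stolcke [6]: for every n-gram observed in any component,
    store the weighted average of the component probabilities.

    component_probs: list of dicts mapping n-gram tuples to probabilities.
    weights: fixed interpolation weights, one per component."""
    observed = set().union(*component_probs)      # union of observed n-grams
    return {ng: sum(w * m.get(ng, 0.0)            # 0.0 stands in for backoff
                    for w, m in zip(weights, component_probs))
            for ng in observed}
```

The payoff is that the merged model needs one probability lookup per word expansion during decoding, instead of M.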

714 citations: Press, Teukolsky, et al. (2007), Numerical Recipes
Citation context: "...the interpolation model parameters are initialized to 0 and tuned to minimize the development set perplexity using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) unconstrained optimization technique [16], a quasi-Newton method that uses the second-derivative Hessian matrix, iteratively estimated from the gradients evaluated along the search path, to improve the convergence towards the function minimum..."
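The tuning step described here can be shown on a toy two-model problem. The paper uses BFGS [16]; as a minimal stand-in, the sketch below uses plain gradient descent on the dev-set negative log-likelihood, with a sigmoid keeping the single interpolation weight in (0, 1). All names and the two-component restriction are illustrative assumptions.

```python
import math

def dev_perplexity(lam, p1, p2):
    """Dev-set perplexity of the interpolation lam*p1 + (1-lam)*p2, given the
    per-token probabilities each component assigns to the dev set."""
    ll = sum(math.log(lam * a + (1 - lam) * b) for a, b in zip(p1, p2))
    return math.exp(-ll / len(p1))

def tune_weight(p1, p2, steps=200, lr=0.5):
    """Minimize dev-set perplexity over the interpolation weight.
    Parameters start at 0 (so lam = 0.5), matching the paper's initialization;
    the optimizer here is simple gradient descent rather than BFGS."""
    theta = 0.0
    for _ in range(steps):
        lam = 1 / (1 + math.exp(-theta))
        # d(avg neg log-likelihood)/d(theta), using d(lam)/d(theta) = lam(1-lam)
        g = -sum((a - b) / (lam * a + (1 - lam) * b) for a, b in zip(p1, p2))
        g *= lam * (1 - lam) / len(p1)
        theta -= lr * g
    return 1 / (1 + math.exp(-theta))
```

Minimizing perplexity is equivalent to maximizing dev-set log-likelihood, which is why the gradient is taken on the log-likelihood directly.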

339 citations: Jelinek and Mercer (1980), Interpolated Estimation of Markov Source Parameters from Sparse Data
Citation context: "...recognizer hypotheses [4]. While many of these techniques involve the combination of multiple n-gram language models, most existing works only evaluate their performance using simple linear interpolation [5]. ... Given a set of M training texts, we can build a combined language model (LM) using multiple techniques. One of the simplest techniques, sometimes referred to as the brute-fo..."

275 citations: Kneser and Ney (1995), Improved Backing-Off for M-gram Language Modeling
Citation context: "...counts to future work. Following the notation in [12], we define the left branch count c_l(h) = N_{1+}(•h) as the number of unique words that appear before h. This count is used in Kneser-Ney smoothing [13] to estimate the probability of lower-order models. Symmetrically, we define the right branch count c_r(h) = N_{1+}(h•), motivated by Witten-Bell smoothing [9], as the number of unique words that appear..."

245 citations: Rosenfeld (1996), A Maximum Entropy Approach to Adaptive Statistical Language Modeling, Computer Speech and Language
Citation context: "...As observed in both [4] and section 3.2, count merging generally achieves lower perplexity than linear interpolation. Previous work has also investigated log-linear interpolation [7] and exponential models [8]. Unlike linear interpolation and count merging, the resulting models from these techniques cannot be efficiently represented as a backoff n-gram m..."

231 citations: Witten and Bell (1991), The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression
Citation context: "...to the number of occurrences of n-gram history h. The more data used to train the word distribution following h, the more we trust and weigh the resulting estimate. Motivated by Witten-Bell smoothing [9], Zhou et al. proposed a linear interpolation model where the interpolation weight is defined as a function of not just c(h), but also the number of unique words that follow h [10]. However, this heur..."

197 citations: Gillick and Cox (1989), Some Statistical Issues in the Comparison of Speech Recognition Algorithms
Citation context: "...significant reduction in WER. Validating the observations in [4], count merging (CM) outperforms linear interpolation, with p < 0.001 on the Matched Pairs Sentence Segment Word Error significance test [19]. With log c(h) as the only feature, generalized linear interpolation (GLI) extends count merging by allowing an arbitrary exponent on the count features. Intuitively, given sufficient n-gram historie..."

155 citations: Glass (2003), A Probabilistic Framework for Segment-Based Speech Recognition
Citation context: "...evaluated along the search path to improve the convergence towards the function minimum. To compute the word error rates associated with each language model, we used a speaker-independent speech recognizer [17]. The evaluation lectures were pre-segmented into utterances via forced alignment against the reference transcription [18]. Since the interpolated language models can be encoded as n-gram backoff mode..."

54 citations: Bulyko, Ostendorf, et al., Getting More Mileage from Web Text Sources for Conversational Speech Language Modeling Using Class-Dependent Mixtures
Citation context: "...these techniques include improving the estimation of the underlying probability distributions via class n-grams [1] and topic mixtures [2], and adapting to additional training text from the web [3] and the initial recognizer hypotheses [4]. While many of these techniques involve the combination of multiple n-gram language models, most existing works only evaluate their performance using simple..."

37 citations: Klakow (1998), Log-Linear Interpolation of Language Models
Citation context: "...As observed in both [4] and section 3.2, count merging generally achieves lower perplexity than linear interpolation. Previous work has also investigated log-linear interpolation [7] and exponential models [8]. Unlike linear interpolation and count merging, the resulting models from these techniques cannot be efficiently repres..."
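The log-linear interpolation [7] referenced in this context combines models multiplicatively, p(w | h) proportional to the product of p_i(w | h)^{w_i}, which makes concrete why it cannot be stored as a backoff n-gram model: the normalizer must be recomputed over the whole vocabulary for every history. A minimal sketch; the interface (models as callables, explicit vocabulary list) is an assumption for illustration.

```python
import math

def loglinear_prob(word, history, models, weights, vocab):
    """Log-linear interpolation [7]:
    p(w | h) = prod_i p_i(w | h)^{w_i} / Z(h).
    The normalizer Z(h) sums the unnormalized score over the entire
    vocabulary, so each history requires a full vocabulary pass, unlike
    linear interpolation, whose result stays a proper distribution."""
    def unnorm(w):
        return math.exp(sum(wi * math.log(m(w, history))
                            for wi, m in zip(weights, models)))
    z = sum(unnorm(v) for v in vocab)
    return unnorm(word) / z
```

With a single weight of 1 on one component, the model collapses back to that component, which makes a handy sanity check.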

26 citations: Glass (2007), Recent Progress in the MIT Spoken Lecture Processing Project
Citation context: "...compare the performance of the generalized linear interpolation model against a few existing interpolation techniques by evaluating the perplexity and recognizer WER in a lecture transcription domain [14]. The target domain consists of 20 lectures from an introductory computer science course, from which we withheld the first 10 lectures for the development set (CS Dev) and used the last 10 for the tes..."

25 citations: Hsu and Glass, Style & Topic Language Model Adaptation Using HMM-LDA
Citation context: "...approaches involving model combinations. For language modeling, these techniques include improving the estimation of the underlying probability distributions via class n-grams [1] and topic mixtures [2], and adapting to additional training text from the web [3] and the initial recognizer hypotheses [4]. While many of these techniques involve the combination of multiple n-gram language models, most e..."

15 citations: Bacchiani, Riley, et al. (2006), MAP Adaptation of Stochastic Grammars
Citation context: "...the estimation of the underlying probability distributions via class n-grams [1] and topic mixtures [2], and adapting to additional training text from the web [3] and the initial recognizer hypotheses [4]. While many of these techniques involve the combination of multiple n-gram language models, most existing works only evaluate their performance using simple linear interpolation [5]..."

11 citations: Hazen (2006), Automatic Alignment and Error Correction of Human Generated Transcripts for Long Speech Recordings
Citation context: "...associated with each language model, we used a speaker-independent speech recognizer [17]. The evaluation lectures were pre-segmented into utterances via forced alignment against the reference transcription [18]. Since the interpolated language models can be encoded as n-gram backoff models, they are applied directly in the first recognition pass instead of a separate n-best rescoring step. Table 2 summarize..."

8 citations: Maltese, Bravetti, et al. (2001), Combining Word- and Class-Based Language Models: A Comparative Study in Several Languages Using Automatic and Manual Word-Clustering Techniques
Citation context: "...performance often relies on approaches involving model combinations. For language modeling, these techniques include improving the estimation of the underlying probability distributions via class n-grams [1] and topic mixtures [2], and adapting to additional training text from the web [3] and the initial recognizer hypotheses [4]. While many of these techniques involve the combination of multiple n-gram..."

3 citations: Stolcke (2004), SRILM Man Pages: ngram-format, http://www.speech.sri.com/projects/srilm/manpages/ngram-format.html
Citation context: "...obtained with relatively little in-domain data. ... Similar to linear interpolation and count merging, we can represent a generalized linear interpolation model in the ARPA LM format [20] consisting of only the observed n-grams across the component models. By representing LMs as vectors of probabilities and backoff weights, and pre-computing their contributions towards the development..."

2 citations: Zhou, Gao, et al. (2002), Improving Language Modeling by Combining Heterogeneous Corpora
Citation context: "...Witten-Bell smoothing [9], Zhou et al. proposed a linear interpolation model where the interpolation weight is defined as a function of not just c(h), but also the number of unique words that follow h [10]. However, this heuristic-based scheme assumes the existence of an in-domain training set and does not perform parameter optimization to maximize data likelihood. Intuitively, we should be able to obta..."