## Performance Prediction for Exponential Language Models


Citations: 11 (3 self)

### BibTeX

```bibtex
@MISC{Chen_performanceprediction,
  author = {Stanley F. Chen},
  title  = {Performance Prediction for Exponential Language Models},
  year   = {}
}
```

### Abstract

We investigate the task of performance prediction for language models belonging to the exponential family. First, we attempt to empirically discover a formula for predicting test set cross-entropy for n-gram language models. We build models over varying domains, data set sizes, and n-gram orders, and perform linear regression to see whether we can model test set performance as a simple function of training set performance and various model statistics. Remarkably, we discover a very simple relationship that predicts test performance with a correlation of 0.9996. We provide analysis of why this relationship holds, and show how it can be used to motivate two heuristics for improving existing language models. We use the first heuristic to develop a novel class-based language model that outperforms a baseline word trigram model by up to 28% in perplexity and 2.1% absolute in speech recognition word-error rate on Wall Street Journal data. We use the second heuristic to provide a new motivation for minimum discrimination information (MDI) models (Della Pietra et al., 1992), and show how this method outperforms other methods for domain adaptation on a Wall Street Journal data set.
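The regression setup described in the abstract can be sketched in a few lines. This is a toy illustration with synthetic numbers, not the paper's data: we posit a hypothetical slope of 0.94 on a per-event parameter-magnitude statistic and check that ordinary least squares recovers it.

```python
import numpy as np

# Toy sketch of the paper's regression setup: model test cross-entropy as a
# linear function of training cross-entropy plus a per-event parameter
# statistic. All data below are synthetic; gamma_true = 0.94 is invented.
rng = np.random.default_rng(0)
n_models = 40
h_train = rng.uniform(5.0, 9.0, n_models)    # training cross-entropy (bits)
lam_stat = rng.uniform(0.0, 1.5, n_models)   # e.g. (1/D) * sum_i |lambda_i|
gamma_true = 0.94
h_test = h_train + gamma_true * lam_stat + rng.normal(0.0, 0.01, n_models)

# Ordinary least squares: h_test ~ a*h_train + b*lam_stat + c
X = np.column_stack([h_train, lam_stat, np.ones(n_models)])
(a, b, c), *_ = np.linalg.lstsq(X, h_test, rcond=None)
corr = np.corrcoef(X @ np.array([a, b, c]), h_test)[0, 1]
print(a, b, corr)
```

With low noise, the fit recovers a slope near 1 on training cross-entropy and near the planted 0.94 on the parameter statistic, with a correlation close to 1, mirroring the kind of fit the abstract reports.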

### Citations

9946 | Statistical Learning Theory
- Vapnik
- 1998
Citation context: ...target classifiers, the Vapnik-Chervonenkis (VC) dimension of the set can be used to bound the true error rate relative to the training error rate with some probability (Vapnik and Chervonenkis, 1971; Vapnik, 1998); this technique has been used to compute error bounds for many types of classifiers. Extensions of this method include methods that bound the true error rate based on the fat-shattering dimension of...

3389 | An Introduction to the Bootstrap
- Efron, Tibshirani
- 1993
Citation context: ...hold-out (or split-sample) method; leave-one-out and k-fold cross-validation; and bootstrapping (Allen, 1974; Stone, 1974; Geisser, 1975; Stone, 1977; Craven and Wahba, 1979; Stone, 1979; Efron, 1983; Efron, 1993; Kohavi, 1995; Shao, 1997). In the hold-out method, a single split of the training data is performed and performance on the held-out set is taken as an estimate of test performance. In the other methods...

2771 | Estimating the dimension of a model
- Schwarz
- 1978
Citation context: ...well as all of the statistics listed on the left in Table 2. The statistics F/D, F/(D−F−1), and (F log D)/D are motivated by AIC, AICc (Hurvich and Tsai, 1989), and the Bayesian Information Criterion (Schwarz, 1978), respectively. As features f_i with λ̃_i = 0 have no effect, instead of F we also consider using F_{≠0}, the number of features f_i with λ̃_i ≠ 0. The statistics (1/D) ∑_{i=1}^{F} |λ̃_i| and (1/D) ∑_{i=1}^{F} λ̃_i²...

2423 | A new look at the statistical model identification
- Akaike
- 1974
Citation context: ...popular non-data-splitting methods for predicting test set cross-entropy (or likelihood) are the Akaike Information Criterion (AIC) and variants such as AICc, quasi-AIC (QAIC), and QAICc (Akaike, 1973; Akaike, 1974; Hurvich and Tsai, 1989; Lebreton et al., 1992); other related methods include Mallows' Cp for least squares regression and the Takeuchi Information Criterion (TIC) (Mallows, 1973; Takeuchi, 1976)...

2235 | Building a Large Annotated Corpus for English: The Penn Treebank
- Marcus, Santorini, et al.
- 1993
Citation context: ...r settings found in Section 2.1 for each regularization method. domain B: Part-of-speech (POS) sequences corresponding to sentences from tagged Wall Street Journal (WSJ) text from the Penn Treebank 3 (Marcus et al., 1993). domains C–E: 1993 Wall Street Journal text with verbalized punctuation from the CSR-III Text corpus from the Linguistic Data Consortium. These three domains differ only in vocabulary. In domains C a...

2066 | Regression shrinkage and selection via the Lasso
- Tibshirani
- 1996
Citation context: ...regularization in our later experiments. Following the terminology of Dudík and Schapire (2006), the most widely-used and effective methods for regularizing exponential models are ℓ1 regularization (Tibshirani, 1994; Kazama and Tsujii, 2003; ...

1686 | Information theory and an extension of the maximum likelihood principle
- Akaike
- 1973
Citation context: ..., λ_F} where p_Λ(y|x) = exp(∑_{i=1}^{F} λ_i f_i(x, y)) / Z_Λ(x) (2) and where Z_Λ(x) is a normalization factor. One of the seminal methods for performance prediction is the Akaike Information Criterion (AIC) (Akaike, 1973). For a model, let Λ̂ be the maximum likelihood estimate of Λ on some training data. Akaike derived the following estimate for the expected value of the test set cross-entropy H(p*, p_Λ̂): H(p*...
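The AIC estimate mentioned in this context predicts test cross-entropy as training cross-entropy plus F/D, i.e., parameters per training event. A trivial sketch with made-up numbers:

```python
# AIC-style prediction discussed above: predicted test cross-entropy is the
# training cross-entropy plus F/D, for F model parameters and D training
# events. The numbers in the example call are invented.
def aic_predicted_test_entropy(h_train, num_params, num_events):
    return h_train + num_params / num_events

print(aic_predicted_test_entropy(6.5, 2000, 100_000))
```

For 2,000 parameters over 100,000 events the penalty is 0.02, so a training cross-entropy of 6.5 yields a predicted test cross-entropy of 6.52.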

976 |
On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications
- Vapnik, Chervonenkis
- 1971
(Show Context)
Citation Context ...ning an element from a set of target classifiers, the Vapnik-Chervonenkis (VC) dimension of the set can be used to bound the true error rate relative to the training error rate with some probability (=-=Vapnik and Chervonenkis, 1971-=-; Vapnik, 1998); this technique has been used to compute error bounds for many types of classifiers. Extensions of this method include methods that bound the true error rate based on the fat-shatterin... |

940 | An empirical study of smoothing techniques for language modeling. Computer Speech and Language
- Chen, Goodman
- 1999
Citation context: ...opposed to the training data, if the test data were normalized to be the same size as the training data. Discounts for n-grams have been studied extensively, e.g., (Good, 1953; Church and Gale, 1991; Chen and Goodman, 1998), and tend not to vary much across training set sizes. We can check how well eq. (12) holds for actual regularized n-gram models. We construct a total of ten n-gram models on domains A and E. We build...
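The n-gram discounts this context refers to can be illustrated with a minimal Good-Turing sketch; the corpus is an invented toy, and the discount here is simply the gap between a raw count and its Good-Turing discounted count.

```python
# Toy Good-Turing sketch related to the n-gram discounts discussed above:
# the discounted count is r* = (r + 1) * N_{r+1} / N_r, and the discount for
# count r is r - r*. The corpus below is invented.
from collections import Counter

counts = Counter("a b c d e f f g g".split())   # word -> count
n_r = Counter(counts.values())                  # N_r: number of types seen r times

def good_turing_discount(r):
    if n_r[r] == 0 or n_r[r + 1] == 0:
        return None  # undefined when either count-of-counts is empty
    r_star = (r + 1) * n_r[r + 1] / n_r[r]
    return r - r_star

print(good_turing_discount(1))
```

Here N_1 = 5 and N_2 = 2, so singletons are discounted from 1 to 0.8, a discount of 0.2.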

853 | A study of cross-validation and bootstrap for accuracy estimation and model selection
- Kohavi
- 1995
Citation context: ...split-sample) method; leave-one-out and k-fold cross-validation; and bootstrapping (Allen, 1974; Stone, 1974; Geisser, 1975; Stone, 1977; Craven and Wahba, 1979; Stone, 1979; Efron, 1983; Efron, 1993; Kohavi, 1995; Shao, 1997). In the hold-out method, a single split of the training data is performed and performance on the held-out set is taken as an estimate of test performance. In the other methods, performance is averaged...
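The data-splitting estimators contrasted in this context (hold-out vs. k-fold cross-validation) can be sketched for a tiny add-one-smoothed unigram model. The corpus, k, and vocabulary size below are arbitrary toy choices.

```python
# Minimal k-fold cross-validation sketch of the data-splitting estimators
# described above, applied to an add-one-smoothed unigram model.
import math
from collections import Counter

def unigram_cross_entropy(train, heldout, vocab_size):
    counts = Counter(train)
    total = len(train)
    return -sum(math.log2((counts[w] + 1) / (total + vocab_size))
                for w in heldout) / len(heldout)

def kfold_estimate(corpus, k, vocab_size):
    folds = [corpus[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        heldout = folds[i]
        train = [w for j, fold in enumerate(folds) if j != i for w in fold]
        scores.append(unigram_cross_entropy(train, heldout, vocab_size))
    return sum(scores) / k

corpus = ("the cat sat on the mat the dog sat on the log " * 20).split()
print(kfold_estimate(corpus, k=5, vocab_size=8))
```

Averaging over all k held-out folds is what distinguishes these methods from the single-split hold-out estimate described in the snippet.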

829 | Cross-validatory choice and assessment of statistical predictions (with discussion)
- Stone
- 1974
Citation context: ...e prediction in many contexts are data-splitting methods (Guyon et al., 2006). These techniques include the hold-out method; leave-one-out and k-fold cross-validation; and bootstrapping (Allen, 1974; Stone, 1974; Geisser, 1975; Craven and Wahba, 1979; Efron, 1983). However, unlike non-data-splitting methods, these methods do not lend themselves well to providing insight into model design as discussed in Section...

771 | Boosting the margin: a new explanation for the effectiveness of voting methods
- Schapire, Freund, et al.
- 1998
Citation context: ...the true error rate based on the fat-shattering dimension of a set of target classifiers, e.g., (Bartlett, 1998), and methods that bound error based on the training set margins of a classifier, e.g., (Schapire et al., 1998). Bartlett (1998) shows that for a neural network with small weights and small training set squared error, the true error depends on the size of its weights rather than the number of weights; this finding...

738 | Class-Based N-Gram Models of Natural Language
- Brown, Della Pietra, et al.
- 1992
Citation context: ... = p_ng(w_j|w_{j−2}w_{j−1}c_j); p_L(c_j|c_1···c_{j−1}, w_1···w_{j−1}) = p_ng(c_j|w_{j−2}c_{j−2}w_{j−1}c_{j−1}); p_L(w_j|c_1···c_j, w_1···w_{j−1}) = p_ng(w_j|w_{j−2}c_{j−2}w_{j−1}c_{j−1}c_j). Model S is an exponential version of the class-based n-gram model from (Brown et al., 1992); model M is a novel model introduced in (Chen, 2009); and model L is an exponential version of the model indexpredict from (Goodman, 2001). To evaluate whether eq. (4) can accurately predict test pe...

701 | Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer
- Katz
- 1987
Citation context: ...m model on our largest WSJ training set; this model is then pruned to contain a total of about ... (Footnote 18: The most widely-used smoothing method in the literature for baseline n-gram models is Katz smoothing (Katz, 1987), not modified Kneser-Ney smoothing as used here. We report results for Katz smoothing in Table 12.)

691 | Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (2nd edn.)
- Burnham, Anderson
- 2002
Citation context: ...s original paper, many refinements and variations of this criterion have been developed, e.g., (Takeuchi, 1976; Hurvich and Tsai, 1989; Lebreton et al., 1992), and these methods remain popular today (Burnham and Anderson, 2002). However, maximum likelihood estimates for language models typically yield infinite cross-entropy on test data, and thus AIC behaves poorly for these domains. In this work, instead of deriving a per...

582 | Ridge regression: biased estimation for nonorthogonal problems
- Hoerl, Kennard
- 1970
Citation context: ...d and effective methods for regularizing exponential models are ℓ1 regularization (Tibshirani, 1994; Khudanpur, 1995; Williams, 1995; Kazama and Tsujii, 2003; Goodman, 2004) and ℓ2² regularization (Hoerl and Kennard, 1970; Lau, 1994; Chen and Rosenfeld, 2000; Lebanon and Lafferty, 2001). In these methods, we have a prior distribution over parameters Λ and choose the maximum a posteriori (MAP) parameter estimates given...
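A minimal sketch of the MAP estimation this context describes, using an ℓ2² (Gaussian) prior on a one-parameter conditional exponential model. The data, feature design, prior strength, and step size are all invented for illustration.

```python
# Sketch of MAP estimation with an l2^2 (Gaussian) prior for a tiny
# one-parameter conditional exponential model, per the regularization
# discussion above. Data, feature, alpha, and the step size are invented.
import math

data = [(1.0, 1), (2.0, 1), (1.5, 0), (0.5, 0)]  # (x, y) pairs, y in {0, 1}
alpha = 0.1   # hypothetical l2^2 prior strength
lam = 0.0     # the single model parameter

def prob_y1(x, lam):
    # p(y=1|x) for feature f(x, y) = x if y == 1 else 0; Z(x) = e^{lam*x} + 1
    return math.exp(lam * x) / (math.exp(lam * x) + 1.0)

for _ in range(500):  # gradient ascent on the penalized log-likelihood
    grad = sum((y - prob_y1(x, lam)) * x for x, y in data) - alpha * lam
    lam += 0.1 * grad

print(round(lam, 3))
```

At convergence the expected-feature-match condition holds up to the prior term: the empirical feature expectation equals the model expectation minus alpha times lambda.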

579 | Inducing features of random fields
- Della Pietra, Della Pietra, et al.
- 1997
Citation context: ...didate hyperparameters further in the next section. We list the candidate hyperparameters for each regularization method in Table 2. Here, we describe how we adapt improved iterative scaling (Della Pietra et al., 1997) to ℓ1 + ℓ2² regularization. The original update is to take λ_i^{(t+1)} ← λ_i^{(t)} + δ_i^{(t)} (14), where δ_i^{(t)} satisfies ∑_{x,y} p̃(x, y) f_i(x, y) = ∑_{x,y} p̃(x) p_{Λ^{(t)}}(y|x) ... and where f^#(x, y) = ∑_i f_i(x, y)...

569 | SWITCHBOARD: Telephone speech corpus for research and development
- Godfrey, Holliman, et al.
- 1992
Citation context: ...us Speech Recognition (CSR) corpus (Doddington, 1992; Paul and Baker, 1992). domain F: 1997 Broadcast News (BN) text (Graff, 1997). An internal IBM vocabulary is used. domain G: Switchboard (SWB) text (Godfrey et al., 1992). The vocabulary consists of all words occurring at least twice in the entire data set. We provide summary statistics for each domain in Table 1. For each domain, we first take all of our data and ra...

501 | Smoothing noisy data with spline functions
- Craven, Wahba
- 1979
Citation context: ...s are data-splitting methods (Guyon et al., 2006). These techniques include the hold-out method; leave-one-out and k-fold cross-validation; and bootstrapping (Allen, 1974; Stone, 1974; Geisser, 1975; Craven and Wahba, 1979; Efron, 1983). However, unlike non-data-splitting methods, these methods do not lend themselves well to providing insight into model design as discussed in Section 6. We show that for several...

411 | The population frequencies of species and the estimation of population parameters
- Good
- 1953
Citation context: ...feature occurs in the test data as opposed to the training data, if the test data were normalized to be the same size as the training data. Discounts for n-grams have been studied extensively, e.g., (Good, 1953; Church and Gale, 1991; Chen and Goodman, 1998), and tend not to vary much across training set sizes. We can check how well eq. (12) holds for actual regularized n-gram models. We construct a total of...

361 | Self-organized language modeling for speech recognition
- Jelinek
- 1990
Citation context: ...ques for domain adaptation are linear interpolation and count merging. In linear interpolation, separate n-gram models are built on the in-domain and out-of-domain data and are interpolated together (Jelinek et al., 1991). In count merging, the in-domain and out-of-domain data are concatenated into a single training set, and a single n-gram model is built on the combined data set (Iyer et al., 1997; Bacchiani et al., ...
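The two adaptation strategies contrasted in this context can be sketched with toy unigram models; the corpora and interpolation weight below are invented.

```python
# Sketch contrasting the two adaptation strategies described above: linear
# interpolation of separately-built unigram models vs. count merging. The
# corpora and interpolation weight are invented toys.
from collections import Counter

in_domain = "a a b".split()
out_domain = "b b b c".split()

def mle(counts):
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

p_in, p_out = mle(Counter(in_domain)), mle(Counter(out_domain))
lam = 0.7  # arbitrary interpolation weight on the in-domain model

def interp(w):
    return lam * p_in.get(w, 0.0) + (1 - lam) * p_out.get(w, 0.0)

merged = mle(Counter(in_domain + out_domain))  # count merging

print(interp("a"), merged["a"])
```

Note the two estimates for "a" differ: interpolation weights the small in-domain corpus explicitly, while count merging implicitly weights each corpus by its size.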

357 | Interpolated estimation of Markov source parameters from sparse data
- Jelinek, Mercer
- 1980
Citation context: ...s are present or not. Since a backoff model has more parameters F, AIC predicts that backoff features hurt performance. In fact, it is well-known that backoff features help performance a great deal (Jelinek and Mercer, 1980), and we analyze this phenomenon using eq. (7). We present statistics in Table 6 for various trigram models built on the same data set. The last row corresponds to a normal trigram model; the seco...

284 | Regression and time series model selection in small samples
- Hurvich, Tsai
- 1989
Citation context: ...ach model, we compute training set cross-entropy H(p̃, p_Λ̃) as well as all of the statistics listed on the left in Table 2. The statistics F/D, F/(D−F−1), and (F log D)/D are motivated by AIC, AICc (Hurvich and Tsai, 1989), and the Bayesian Information Criterion (Schwarz, 1978), respectively. As features f_i with λ̃_i = 0 have no effect, instead of F we also consider using F_{≠0}, the number of features f_i with λ̃_i ≠ 0...

265 | Some comments on Cp
- Mallows
- 1973
Citation context: ...QAICc (Akaike, 1973; Akaike, 1974; Hurvich and Tsai, 1989; Lebreton et al., 1992); other related methods include Mallows' Cp for least squares regression and the Takeuchi Information Criterion (TIC) (Mallows, 1973; Takeuchi, 1976). As seen in eq. (6), AIC can be used to predict the test cross-entropy of a model from the number of model parameters and its training cross-entropy (using maximum likelihood parameter...

261 | Estimating the error rate of a prediction rule: improvement on cross-validation
- Efron
- 1983
Citation context: ...hods (Guyon et al., 2006). These techniques include the hold-out method; leave-one-out and k-fold cross-validation; and bootstrapping (Allen, 1974; Stone, 1974; Geisser, 1975; Craven and Wahba, 1979; Efron, 1983). However, unlike non-data-splitting methods, these methods do not lend themselves well to providing insight into model design as discussed in Section 6. We show that for several types of...

247 | The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression
- Witten, Bell
- 1991
Citation context: ...conventional n-gram model except over word/tag pairs instead of words. Using a 5M word WSJ training set, a reduction in perplexity of about 5% is achieved over a Witten-Bell-smoothed trigram baseline (Witten and Bell, 1991). Cui et al. (2007) also predict a word and its part-of-speech tag as a single unit, but instead of using a conventional n-gram model for prediction, they use an exponential model that includes featu...

239 | Specification Searches: Ad Hoc Inference with Nonexperimental Data
- Leamer
- 1978
Citation context: ...bution in Bayesian estimation. However, we note ... (Footnote 20: Another approach is Bayesian Model Averaging, in which a model is produced by interpolating candidate models weighted by their posterior probability (Leamer, 1978).)

188 | Modeling by shortest data description. Automatica
- Rissanen
- 1978
Citation context: ...mplexity of a model. Accordingly, there have been many methods for model selection that measure the size of a model in terms of the number of features or parameters in the model, e.g., (Akaike, 1973; Rissanen, 1978; Schwarz, 1978). Surprisingly, for exponential language models, the number of model parameters seems to matter not at all; all that matters are the magnitudes of the parameter values. Consequently, o...

184 | On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language
- Ney
- 1994
Citation context: ...need not perform the sum in eq. (35); we need only consider the class sequence c(w_1)···c(w_l). The most widely-used class model is the model we refer to as the IBM class model (Brown et al., 1992; Ney et al., 1994). For this model, we assume p(c_j|c_1···c_{j−1}, w_1···w_{j−1}) ≈ p(c_j|c_{j−2}c_{j−1}) and p(w_j|c_1···c_j, w_1···w_{j−1}) ≈ p(w_j|c_j) (46), where the distributions p(c_j|c_{j−2}c_{j−1}) and p(w_j|c_j) are parameterized ...
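The IBM class model factorization in this snippet, p(w_j|history) ≈ p(c_j|c_{j−2}c_{j−1})·p(w_j|c_j), can be made concrete with a toy numeric sketch; all probability tables, classes, and the two-word example are invented.

```python
# Toy numeric sketch of the IBM class model factorization above:
# p(w_j | history) ~= p(c_j | c_{j-2} c_{j-1}) * p(w_j | c_j).
# All probability tables, classes, and the example are invented.
p_class = {("<s>", "<s>", "DET"): 0.6, ("<s>", "DET", "NOUN"): 0.7}
p_word = {("the", "DET"): 0.5, ("cat", "NOUN"): 0.1}
cls = {"the": "DET", "cat": "NOUN"}

def class_model_prob(words):
    prob, c1, c2 = 1.0, "<s>", "<s>"
    for w in words:
        c = cls[w]
        prob *= p_class[(c1, c2, c)] * p_word[(w, c)]
        c1, c2 = c2, c
    return prob

print(class_model_prob(["the", "cat"]))  # mathematically 0.6*0.5 * 0.7*0.1 = 0.021
```

Because each word has exactly one class here, no sum over class sequences is needed, which is the simplification the snippet mentions for eq. (35).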

182 | The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network
- Bartlett
- 1998
Citation context: ...mpute error bounds for many types of classifiers. Extensions of this method include methods that bound the true error rate based on the fat-shattering dimension of a set of target classifiers, e.g., (Bartlett, 1998), and methods that bound error based on the training set margins of a classifier, e.g., (Schapire et al., 1998). Bartlett (1998) shows that for a neural network with small weights and small training ...

179 | Modeling survival and testing biological hypotheses using marked animals: case studies and recent advances
- Lebreton, Burnham, et al.
- 1992
Citation context: ...The most popular non-data-splitting methods for predicting test set cross-entropy (or likelihood) are AIC and variants such as AICc, quasi-AIC (QAIC), and QAICc (Akaike, 1973; Hurvich and Tsai, 1989; Lebreton et al., 1992). In Section 3, we considered performance prediction formulae with the same form as AIC and AICc (except using regularized parameter estimates), and neither performed as well as eq. (4); e.g., see Ta...

178 | The design for the Wall Street Journal-based CSR corpus
- Paul, Baker
- 1992
Citation context: ...the vocabulary consists of the union of the training vocabulary and the 20k-word "closed" test vocabulary from the first Wall Street Journal Continuous Speech Recognition (CSR) corpus (Doddington, 1992; Paul and Baker, 1992). domain F: 1997 Broadcast News (BN) text (Graff, 1997). An internal IBM vocabulary is used. domain G: Switchboard (SWB) text (Godfrey et al., 1992). The vocabulary consists of all words occurring at least twice...

157 | The Relationship Between Variable Selection and Data Augmentation and a Method for Prediction
- Allen
- 1974
Citation context: ...or performance prediction in many contexts are data-splitting methods (Guyon et al., 2006). These techniques include the hold-out method; leave-one-out and k-fold cross-validation; and bootstrapping (Allen, 1974; Stone, 1974; Geisser, 1975; Craven and Wahba, 1979; Efron, 1983). However, unlike non-data-splitting methods, these methods do not lend themselves well to providing insight into model design as discu...

131 | A Comparison of the Enhanced Good-Turing and Deleted Estimation Methods for Estimating
- Church, Gale
- 1991
Citation context: ...rs in the test data as opposed to the training data, if the test data were normalized to be the same size as the training data. Discounts for n-grams have been studied extensively, e.g., (Good, 1953; Church and Gale, 1991; Chen and Goodman, 1998), and tend not to vary much across training set sizes. We can check how well eq. (12) holds for actual regularized n-gram models. We construct a total of ten n-gram models on ...

128 | Entropy-based Pruning of Backoff Language Models - Stolcke - 1998

125 | A Tree-Based Statistical Language Model for Natural Language Speech Recognition - Bahl, Brown, et al. - 1989

121 | A predictive sample reuse method with applications
- Geisser
- 1975
Citation context: ...in many contexts are data-splitting methods (Guyon et al., 2006). These techniques include the hold-out method; leave-one-out and k-fold cross-validation; and bootstrapping (Allen, 1974; Stone, 1974; Geisser, 1975; Craven and Wahba, 1979; Efron, 1983). However, unlike non-data-splitting methods, these methods do not lend themselves well to providing insight into model design as discussed in Section 6...

94 | A bit of progress in language modeling
- Goodman
- 2001
Citation context: ...ponential version of the class-based n-gram model from (Brown et al., 1992); model M is a novel model introduced in (Chen, 2009); and model L is an exponential version of the model indexpredict from (Goodman, 2001). To evaluate whether eq. (4) can accurately predict test performance for these class-based models, we use the WSJ data and vocabulary from domain E and consider training set sizes of 1k, 10k, 100k, ...

89 | Bayesian regularization and pruning using a Laplace prior
- Williams
- 1995
Citation context: ...Following the terminology used by Dudík and Schapire (2006), the most widely-used and effective methods for regularizing exponential models are ℓ1 regularization (Tibshirani, 1994; Khudanpur, 1995; Williams, 1995; Kazama and Tsujii, 2003; Goodman, 2004) and ℓ2² regularization (Hoerl and Kennard, 1970; Lau, 1994; Chen and Rosenfeld, 2000; Lebanon and Lafferty, 2001). In these methods, we have a prior distrib...

86 | Tutorial on Practical Prediction Theory for Classification - Langford - 2005

85 | Boosting and maximum likelihood for exponential models
- Lebanon, Lafferty
- 2001
Citation context: ...Statistics of data sets. RH = Random House dictionary; WSJ = Wall Street Journal; BN = Broadcast News; SWB = Switchboard. ...Goodman, 2004) and ℓ2² regularization (Lau, 1994; Chen and Rosenfeld, 2000; Lebanon and Lafferty, 2001). While not as popular, another regularization scheme that has been shown to be effective is 2-norm inequality regularization (Kazama and Tsujii, 2003), which is an instance of...

82 | PAC-Bayesian model averaging
- McAllester
- 1999
Citation context: ...r estimates), and neither performed as well as eq. (4); e.g., see Table 2. There are many techniques for bounding test set classification error, including the Occam's Razor bound (Blumer et al., 1987; McAllester, 1999), the PAC-Bayes bound (McAllester, 1999), and the sample compression bound (Littlestone and Warmuth, 1986; Floyd and Warmuth, 1995). These methods derive theoretical guarantees that the true error rate o...

69 | An asymptotic theory for linear model selection (with discussion)
- Shao
- 1997
Citation context: ...ethod; leave-one-out and k-fold cross-validation; and bootstrapping (Allen, 1974; Stone, 1974; Geisser, 1975; Stone, 1977; Craven and Wahba, 1979; Stone, 1979; Efron, 1983; Efron, 1993; Kohavi, 1995; Shao, 1997). In the hold-out method, a single split of the training data is performed and performance on the held-out set is taken as an estimate of test performance. In the other methods, performance is averaged...

66 | Sample compression, learnability, and the Vapnik-Chervonenkis dimension
- Floyd, Warmuth
- 1995
Citation context: ...et classification error including the Occam's Razor bound (Blumer et al., 1987; McAllester, 1999), the PAC-Bayes bound (McAllester, 1999), and the sample compression bound (Littlestone and Warmuth, 1986; Floyd and Warmuth, 1995). These methods derive theoretical guarantees that the true error rate of a classifier will be below (or above) some value with a certain probability. Langford (2005) evaluates these techniques over ...

62 | Statistical Language Model Adaptation: Review and Perspectives - Bellegarda - 2004

57 | Algorithms for bigram and trigram word clustering
- Martin, Liermann, et al.
- 1995
Citation context: ...s training set, the IBM class models are much worse than the word n-gram models, but the interpolated class model is slightly better, as is consistent with previous results, e.g., (Brown et al., 1992; Martin et al., 1995). Next, we compare the other class models with the state-of-the-art interpolated class model. As expected, the interpolated IBM class model outperforms the IBM class model alone, both conventional an...

56 | Relating data compression and learnability
- Littlestone, Warmuth
- 1986
Citation context: ...techniques for bounding test set classification error including the Occam's Razor bound (Blumer et al., 1987; McAllester, 1999), the PAC-Bayes bound (McAllester, 1999), and the sample compression bound (Littlestone and Warmuth, 1986; Floyd and Warmuth, 1995). These methods derive theoretical guarantees that the true error rate of a classifier will be below (or above) some value with a certain probability. Langford (2005) evaluates these techniques over...

56 | Language model adaptation using dynamic marginals
- Kneser, Peters, et al.
- 1997
Citation context: ...ons is an exponential model containing one feature for each linear constraint, with q(y|x) as its prior as in eq. (44). While this method has been used many times for language model adaptation, e.g., (Kneser et al., 1997; Federico, 1999), MDI models have not performed as well as linear interpolation in perplexity or word-error rate in previous work (Rao et al., 1995; Rao et al., 1997). One of the issues present when ...

49 | Asymptotics for and against cross-validation
- Stone
- 1977
Citation context: ...ng methods (Guyon et al., 2006). These techniques include the hold-out (or split-sample) method; leave-one-out and k-fold cross-validation; and bootstrapping (Allen, 1974; Stone, 1974; Geisser, 1975; Stone, 1977; Craven and Wahba, 1979; Stone, 1979; Efron, 1983; Efron, 1993; Kohavi, 1995; Shao, 1997). In the hold-out method, a single split of the training data is performed and performance on the held-out set...

44 | Adaptive language modelling using minimum discriminant estimation
- Della Pietra, Della Pietra, et al.
- 1992
Citation context: ...and 2.1% absolute in speech recognition word-error rate on Wall Street Journal data. We use the second heuristic to provide a new motivation for minimum discrimination information (MDI) models (Della Pietra et al., 1992), and show how this method outperforms other methods for domain adaptation on a Wall Street Journal data set. (1 Introduction) In this paper, we investigate the following question for language models b...