## Language Modeling for Efficient Beam-Search (1995)

Venue: Computer Speech and Language

Citations: 5 (4 self)

### BibTeX

@ARTICLE{Federico95languagemodeling,
    author = {Marcello Federico and Mauro Cettolo and Fabio Brugnara and Giuliano Antoniol},
    title = {Language Modeling for Efficient Beam-Search},
    journal = {Computer Speech and Language},
    year = {1995},
    volume = {9}
}

### Abstract

This paper considers the problems of estimating bigram language models and of efficiently representing them as a finite state network that can be employed by a hidden Markov model-based, beam-search continuous speech recognizer.

### Citations

9359 |
Elements of information theory
- Cover, Thomas
- 1991
Citation Context ...ach word (regardless of t) to the last n − 1 words: $\Pr(W) = \Pr(w_1 \ldots w_{n-1}) \prod_{t=n}^{N} \Pr(w_t \mid w_{t-n+1} \ldots w_{t-1})$. This n-gram approximation, which formally assumes a time-invariant Markov process (Cover & Thomas, 1991), greatly reduces the statistics to be collected in order to compute Pr(W), clearly at the expense of precision. However, even a 3-gram (trigram) model may require a large amount of data (texts) for ... |
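
The n-gram approximation quoted in the citation context above can be illustrated with a minimal Python sketch. It assumes plain relative-frequency estimates with no smoothing, and it leaves out the initial factor Pr(w_1 ... w_{n-1}); the function names `train_ngram_counts` and `conditional_log_prob` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def train_ngram_counts(tokens, n):
    """Count n-grams and the (n-1)-word contexts that precede a word."""
    positions = range(len(tokens) - n + 1)
    ngrams = Counter(tuple(tokens[i:i + n]) for i in positions)
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in positions)
    return ngrams, contexts


def conditional_log_prob(tokens, ngrams, contexts, n):
    """Sum of log Pr(w_t | w_{t-n+1} .. w_{t-1}) for t = n .. N.

    Uses unsmoothed relative frequencies (an assumption of this sketch),
    so any unseen n-gram drives the result to minus infinity.
    """
    total = 0.0
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        count = ngrams[gram]
        if count == 0:
            return float("-inf")
        total += math.log(count / contexts[gram[:-1]])
    return total


ngrams, contexts = train_ngram_counts("a b a b a c".split(), 2)
print(conditional_log_prob(["a", "b", "a"], ngrams, contexts, 2))
```

The minus-infinity result for unseen n-grams is the data-sparseness problem that the discounting and interpolation methods cited further down are meant to address.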

2573 | The design and analysis of computer algorithms - Aho, Hopcroft, et al. - 1974 |

841 |
Estimation of Dependences Based on Empirical Data
- Vapnik
- 1982
Citation Context ...formula, absolute discounting, and linear discounting. In Table II a compendium of these approaches is given. Adding-1 (A1). This very simple estimator results from the Bayesian estimation criterion (Vapnik, 1982) discussed in Appendix A. This method simply adds a constant 1 to all the bigram counts and assigns a probability in proportion to their number to all the never seen events. This estimator tends to o... |
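
As a rough illustration of the Adding-1 description above, the sketch below uses the common textbook form Pr(z | y) = (c(yz) + 1) / (c(y) + |V|). That exact form is an assumption here, since the quoted context does not show the paper's formula, and `add_one_bigram` is a hypothetical name.

```python
from collections import Counter


def add_one_bigram(tokens):
    """Return an add-one bigram estimator and the vocabulary it was built on."""
    vocab = sorted(set(tokens))
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # c(y): how often y occurs as the first word of a bigram.
    context_counts = Counter(tokens[:-1])
    size = len(vocab)

    def prob(z, y):
        # Every bigram count is raised by 1, so each context gains |V| counts.
        return (bigram_counts[(y, z)] + 1) / (context_counts[y] + size)

    return prob, vocab


prob, vocab = add_one_bigram("a b a b a c".split())
# Probability mass given to never-seen successors of "b".
unseen_mass = sum(prob(z, "b") for z in vocab if z != "a")
print(unseen_mass)
```

With only two observations of the context "b", a large share of the probability goes to unseen events, which is the over-estimation under sparse data that the quoted passage mentions.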

706 | Estimation of probabilities from sparse data for the language model component of a speech recognizer
- Katz
- 1987
Citation Context ...stribution if trigrams are computed - or otherwise (e.g. for unigrams) uniformly. The discounting and the redistribution functions are generally combined according to two main schemes: backing-off (Katz, 1987) and interpolation (Jelinek & Mercer, 1980). In the backing-off scheme the bigram probability is computed by choosing the most significant approximation according to the frequency countings: ... |

552 |
An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes
- Baum
- 1972
Citation Context ...onal frequencies needs simple counting over a text sample $W_f$, while interpolation parameters are estimated on a second disjoint training sequence $W_\lambda$ by means of the following ML iterative estimator (Baum, 1972):

$$\lambda^{n+1}(y) = \frac{1}{|S_y|} \sum_{yz \in S_y} \frac{\lambda^{n}(y)\Pr(z)}{(1-\lambda^{n}(y))\,f(z \mid y) + \lambda^{n}(y)\Pr(z)} \quad \forall y \in V \qquad (4)$$

where $S_y$ is the set of all occurrences of bigrams of type y in $W_\lambda$. Moreover, to avoid overfitting parameter... |
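
The iterative estimator (4) quoted above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the conditional frequencies `f` and the unigram probabilities `unigram` are supplied as dictionaries, every held-out word has a strictly positive unigram probability, and the names `interpolated_prob` and `em_step` are hypothetical.

```python
from collections import defaultdict


def interpolated_prob(z, y, lam, f, unigram):
    """Pr(z | y) = (1 - lam(y)) * f(z | y) + lam(y) * Pr(z)."""
    weight = lam.get(y, 1.0)  # lam(y) = 1 for contexts without an estimate
    return (1.0 - weight) * f.get((y, z), 0.0) + weight * unigram[z]


def em_step(lam, heldout_bigrams, f, unigram):
    """Apply estimator (4) once to every context y found in the held-out bigrams."""
    ratio_sum = defaultdict(float)
    occurrences = defaultdict(int)
    for y, z in heldout_bigrams:
        weight = lam.get(y, 1.0)
        unigram_part = weight * unigram[z]
        denominator = (1.0 - weight) * f.get((y, z), 0.0) + unigram_part
        ratio_sum[y] += unigram_part / denominator
        occurrences[y] += 1  # |S_y|
    updated = dict(lam)
    for y, count in occurrences.items():
        updated[y] = ratio_sum[y] / count
    return updated


f = {("a", "b"): 1.0}            # conditional frequencies from the first sample
unigram = {"a": 0.5, "b": 0.5}   # lower-order distribution
lam = {"a": 0.5}                 # initial interpolation weight
lam = em_step(lam, [("a", "b"), ("a", "b")], f, unigram)
print(lam["a"])
```

Contexts whose bigrams all have zero conditional frequency end up with a weight of 1 after one step, so the model falls back entirely on the lower-order distribution for them.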

415 |
The population frequency of species and the estimation of population parameters
- Good
- 1953
Citation Context ...probability in proportion to their number to all the never seen events. This estimator tends to over-estimate the zero-frequency probability in presence of very sparse data. Good-Turing (GT) formula (Good, 1953). The GT formula can be derived (see Appendix B) by assuming the “symmetry” requirement, i.e. same frequencies correspond to equal probability estimates. It must be noted that the GT formula is indee... |

389 |
Techniques for automatically correcting words in text
- Kukich
- 1992
Citation Context ... 2. 2 Stochastic Language Models Stochastic LMs are used extensively in many fields: automatic speech recognition, machine translation, spelling correction, text compression, etc. (Brown et al., 1994; Kukich, 1992; Witten & Bell, 1991). The framework of stochastic LMs can be well represented by an information theoretical model. A sequence of words W, generated by a source with probability Pr(W), is transmitted... |

359 |
Interpolated estimation of Markov source parameters from sparse data
- Jelinek, Mercer
- 1980
Citation Context ...uted - or otherwise (e.g. for unigrams) uniformly. The discounting and the redistribution functions are generally combined according to two main schemes: backing-off (Katz, 1987) and interpolation (Jelinek & Mercer, 1980). In the backing-off scheme the bigram probability is computed by choosing the most significant approximation according to the frequency countings: $\Pr(z \mid y) = f^{*}(z \mid y)$ if $c(yz) > 0$; $K_y \lambda(y)$... |

349 |
Stacked regressions
- Breiman
- 1996
Citation Context ...rsens. An interesting way to reduce the disadvantage of deleting a cross-validation set from the training data is provided by the stacked version of the LG model (Federico, 1993). The stacked method (Breiman, 1992) basically combines different predictors estimated on the training data to improve prediction accuracy. Translated into LMs, given m bigram estimates: Pr 1 (z | y), Pr 2 (z | y), ..., Pr m (z |... |

249 |
The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression
- Witten, Bell
- 1991
Citation Context ...ic Language Models Stochastic LMs are used extensively in many fields: automatic speech recognition, machine translation, spelling correction, text compression, etc. (Brown et al., 1994; Kukich, 1992; Witten & Bell, 1991). The framework of stochastic LMs can be well represented by an information theoretical model. A sequence of words W, generated by a source with probability Pr(W), is transmitted through a channel an... |

185 |
On structuring probabilistic dependences in stochastic language modelling
- Ney, Essen, et al.
- 1994
Citation Context ...a simple discounting constant that can be estimated by assuming either a Poisson process for new words occurring after a given context (Witten & Bell, 1991), or by applying the LOO estimation method (Ney et al., 1994; Nadas, 1985). In both cases a good approximation of the so computed estimators yields the GT estimator for novel bigrams: c ∗ (yz) d0 c = d1 d0 d0 c = d1 c . Another discounting method, here called ... |

88 |
An Introduction to Formal Languages and Automata, third edition
- Linz
- 2001
Citation Context ...work by using one of the available algorithms for minimizing the number of states in a deterministic finite state automaton. These algorithms are based on the indistinguishability property of states (Linz, 1990). Let T1 and T2 be the automata derived from the initial one by considering as initial states respectively S1 and S2. S1 and S2 are said to be indistinguishable if the languages accepted by T1 and T2 are e... |
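
The indistinguishability property mentioned above can be made concrete with a partition-refinement sketch. The quoted context only says that "one of the available algorithms" is used, so the Moore-style refinement below is an assumed choice for illustration. It handles a plain complete DFA and ignores any probabilities attached to arcs; `indistinguishable_classes` and its parameters are hypothetical names.

```python
def indistinguishable_classes(states, alphabet, delta, accepting):
    """Group the states of a complete DFA into classes of indistinguishable states.

    delta maps (state, symbol) to the next state.
    """
    # Initial split: accepting versus non-accepting states.
    block = {s: int(s in accepting) for s in states}
    n_blocks = len(set(block.values()))
    while True:
        ids = {}
        refined = {}
        for s in states:
            # Two states stay together only if they share a block and
            # every symbol leads them into the same block.
            signature = (block[s],) + tuple(block[delta[(s, a)]] for a in alphabet)
            refined[s] = ids.setdefault(signature, len(ids))
        block = refined
        if len(ids) == n_blocks:  # no block was split: partition is stable
            break
        n_blocks = len(ids)
    classes = {}
    for s in states:
        classes.setdefault(block[s], []).append(s)
    return sorted(classes.values())


delta = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 3, (1, "b"): 3,
    (2, "a"): 3, (2, "b"): 3,
    (3, "a"): 3, (3, "b"): 3,
}
print(indistinguishable_classes([0, 1, 2, 3], "ab", delta, {3}))
```

States that land in the same class can be merged into one state without changing the accepted language, which is how minimization reduces the size of the network.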

65 | Principles of lexical language modeling for speech recognition - Jelinek, Mercer - 1995 |

58 |
Improvements in beam search for 10000-word continuous speech recognition
- Ney, Haeb-Umbach, et al.
- 1992
Citation Context ...ements. (FIGURE 2 ABOUT HERE) 4.3 Static Tree-based Representation To organize the search space in tree mode, it has been proposed either to dynamically build the portion of currently explored space (Ney et al., 1992; Odell, Valtchev, Woodland & Young, 1994), or to adopt a static linear-tree mixture approach (Murveit, Monaco, Digalakis & Butzberger, 1994). Nevertheless, a static representation of the whole search... |

46 |
Natural language modeling for phoneme-to-text transcription
- Derouault, Merialdo
- 1986
Citation Context ... order frequencies. The basic bigram probability is expressed as follows:

$$\Pr(z \mid y) = (1 - \lambda(y))\,f(z \mid y) + \lambda(y)\Pr(z) \qquad (3)$$

where $0 < \lambda(y) \le 1 \ \forall y$ and $\lambda(y) = 1$ if $c(y) = 0$. According to the literature (Derouault & Merialdo, 1986; Jelinek, Mercer & Roukos, 1992), estimation of conditional frequencies needs simple counting over a text sample $W_f$, while interpolation parameters are estimated on a second disjoint training sequen... |

43 | The estimation of powerful language models from small and large corpora - Placeway, Schwartz, et al. - 1993 |

40 | A baseline of a speaker independent continuous speech recognizer of Italian
- Angelini, Brugnara, et al.
- 1993
Citation Context ...roviding different data sparseness conditions, while speech recognition accuracy tests are presented for a 10,000-word speech recognition task relative to the A.Re.S. (Automatic REporting by Speech) (Angelini et al., 1994a) applicative domain. Recognition tests confirm a relevant difference between “naive” bigram estimation methods and more refined ones, (Footnote 1: A.Re.S. is a real-time CSR system for radiological reporting... |

31 |
Data driven search organization for continuous speech recognition in the spicos system
- Ney, Mergel, et al.
- 1992
Citation Context ...ements. (FIGURE 2 ABOUT HERE) 4.3 Static Tree-based Representation To organize the search space in tree mode, it has been proposed either to dynamically build the portion of currently explored space (Ney et al., 1992; Odell, Valtchev, Woodland & Young, 1994), or to adopt a static linear-tree mixture approach (Murveit, Monaco, Digalakis & Butzberger, 1994). Nevertheless, a static representation of the whole search... |

26 | An inequality with applications to statistical prediction for functions of Markov processes and to a model of ecology - Baum, Eagon - 1967 |

20 | Techniques to achieve an accurate real-time large-vocabulary speech recognition system - Murveit, Monaco, et al. - 1994 |

16 |
Search Strategies for Large-Vocabulary Continuous-Speech Recognition
- Ney
- 1993
Citation Context ...onds to a word. Researchers at Philips laboratories reported advantages obtained by integrating tree organization of the lexicon with the beam-search algorithm (Ney, Haeb-Umbach, Tran & Oerder, 1992; Ney, 1993). They showed that 95% of the state hypotheses were in the first two phonemes of words when the linear representation is used. This fact makes tree organization attractive since it may prevent repeti... |

8 | Radiological reporting by speech recognition: the A.Re.S system
- Angelini, Antoniol, et al.
- 1994
Citation Context ...roviding different data sparseness conditions, while speech recognition accuracy tests are presented for a 10,000-word speech recognition task relative to the A.Re.S. (Automatic REporting by Speech) (Angelini et al., 1994a) applicative domain. Recognition tests confirm a relevant difference between “naive” bigram estimation methods and more refined ones, (Footnote 1: A.Re.S. is a real-time CSR system for radiological reporting... |

4 |
Automatic speech recognition in machine-aided translation. Computer Speech and Language
- Brown, Chen, et al.
- 1994
Citation Context ...discussed in Section 2. 2 Stochastic Language Models Stochastic LMs are used extensively in many fields: automatic speech recognition, machine translation, spelling correction, text compression, etc. (Brown et al., 1994; Kukich, 1992; Witten & Bell, 1991). The framework of stochastic LMs can be well represented by an information theoretical model. A sequence of words W, generated by a source with probability Pr(W), ... |

4 | Design and acquisition of a task-oriented spontaneous-speech data base - Corazza, Federico, et al. - 1993 |

4 |
On Turing’s formula for word probabilities
- Nadas
- 1985
Citation Context ...ng constant that can be estimated by assuming either a Poisson process for new words occurring after a given context (Witten & Bell, 1991), or by applying the LOO estimation method (Ney et al., 1994; Nadas, 1985). In both cases a good approximation of the so computed estimators yields the GT estimator for novel bigrams: c ∗ (yz) d0 c = d1 d0 d0 c = d1 c . Another discounting method, here called Linear Empiri... |

3 | Tools for development, test and analysis of ASRs - Antoniol, Carli, et al. - 1992 |

2 |
Stacked estimation of interpolated ngram language models
- Federico
- 1993
Citation Context ...tribution to the likelihood of Wcv worsens. An interesting way to reduce the disadvantage of deleting a cross-validation set from the training data is provided by the stacked version of the LG model (Federico, 1993). The stacked method (Breiman, 1992) basically combines different predictors estimated on the training data to improve prediction accuracy. Translated into LMs, given m bigram estimates: P r 1 (z | y... |

1 | A Bayesian Estimation As both the Bayesian and the Good-Turing estimators are general techniques applicable to any discrete distribution, a more general estimation problem is considered here: given a population V - Theory |