## LANGUAGE MODEL ADAPTATION FOR AUTOMATIC SPEECH RECOGNITION AND STATISTICAL MACHINE TRANSLATION (2004)

Citations: 1 (0 self)

### BibTeX

```bibtex
@MISC{Kim04languagemodel,
  author = {Woosung Kim},
  title  = {Language Model Adaptation for Automatic Speech Recognition and Statistical Machine Translation},
  year   = {2004}
}
```


### Abstract

Language modeling is critical and indispensable for many natural language applications such as automatic speech recognition and machine translation. Due to the complexity of natural language grammars, it is almost impossible to construct language models from a set of linguistic rules; statistical techniques have therefore dominated language modeling over the last few decades. All statistical modeling techniques, in principle, work under two conditions: 1) a reasonable amount of training data is available, and 2) the training data comes from the same population as the test data to which we want to apply our model. Statistical models are built from observations of the training data, so the success of a statistical model depends crucially on that data. In other words, if we do not have enough data for training, or the training data is mismatched with the test data, we cannot build accurate statistical models. This thesis presents novel methods, collectively called language model adaptation, to cope with these problems in language modeling.

### Citations

8563 | Elements of Information Theory - Cover, Thomas - 1991
Citation Context ... Ki are the empirical counts of the constraint fi(h,w). Our goal is then to find the probability distribution p*(h,w) = arg max_{p(h,w) ∈ P} −∑_{h,w} p(h,w) log p(h,w), (3.5) which maximizes the entropy (Cover and Thomas, 1991) where P is a linear family of probability distributions. The basic goal of the maximum entropy model is to find the solution for equation (3.5). It can be interpreted as finding the distribution whi... |
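As a toy illustration of the objective in (3.5), the sketch below brute-forces the entropy-maximizing distribution over three outcomes under a single invented linear constraint, p(A) + p(B) = 0.7. The grid search and the outcome set are illustrative assumptions, not part of the thesis; real maximum-entropy training uses iterative scaling, not enumeration.

```python
import math

def entropy(p):
    """Shannon entropy -sum p log p, with the 0 log 0 = 0 convention."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Hypothetical linear constraint: p(A) + p(B) = 0.7, hence p(C) = 0.3.
# Brute-force the split of 0.7 between A and B on a fine grid.
best, best_h = None, -1.0
steps = 1000
for i in range(steps + 1):
    a = 0.7 * i / steps
    p = (a, 0.7 - a, 0.3)
    h = entropy(p)
    if h > best_h:
        best, best_h = p, h

print(best)  # the constrained mass is split evenly: (0.35, 0.35, 0.3)
```

Spreading the constrained mass as evenly as possible is exactly the "maximally noncommittal" behavior the excerpt attributes to the maximum entropy model.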

8089 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977
Citation Context ...ation techniques, such as the publicly available GIZA++ tools (Och and Ney, 2000) which are based on the IBM models (Brown et al., 1990, 1993). These tools use several iterations of the EM algorithm (Dempster et al., 1977) on increasingly complex word-alignment models to infer, among other translation model parameters, the conditional probabilities PT(c|e) and PT(e|c) of words c and e being mutual translations. Unlike... |
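The EM procedure this excerpt describes can be sketched in miniature with IBM Model 1, the simplest of the IBM models. The three-sentence parallel corpus below is invented for illustration and is far smaller than anything GIZA++ would actually be run on.

```python
from collections import defaultdict

# Tiny hypothetical sentence-aligned corpus (English, pseudo-French).
corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
    (["a", "book"], ["un", "livre"]),
]

# Uniform initialization of the translation table t(f|e).
e_vocab = {e for es, _ in corpus for e in es}
f_vocab = {f for _, fs in corpus for f in fs}
t = {(f, e): 1.0 / len(f_vocab) for e in e_vocab for f in f_vocab}

for _ in range(20):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    # E-step: expected alignment counts under the current t.
    for es, fs in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)
            for e in es:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    # M-step: renormalize expected counts into new probabilities.
    for (f, e) in t:
        t[(f, e)] = count[(f, e)] / total[e] if total[e] else 0.0

print(round(t[("livre", "book")], 2))  # "book" -> "livre" comes to dominate
```

Co-occurrence alone disambiguates: "livre" appears with "book" in two sentences, so EM steadily shifts probability mass onto that pair.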

3921 | Pattern Classification and Scene Analysis - Duda, Hart - 1973
Citation Context ...1997). The remaining question is how to classify training documents to build topic clusters. This can also be done by IR techniques and some automatic clustering technique such as K-means clustering (Duda and Hart, 1974). In short, an initial topic (class) out of previously determined K topics is assigned to each training document. Then, based on that initial topic assignment, we can build the topic centroid for eac... |
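A minimal sketch of the K-means step mentioned here, run on invented 2-D points standing in for document vectors; real topic clustering would operate on high-dimensional term vectors, and the deterministic initialization is purely for reproducibility.

```python
def kmeans(points, k, iters=10):
    """Plain K-means (Lloyd's algorithm) on 2-D points; for determinism
    the first k points serve as the initial centroids."""
    centroids = list(points[:k])
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                  + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        # Update step: recompute each centroid as its cluster mean.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids

# Two well-separated hypothetical "documents" in a 2-D feature space.
pts = [(0.1, 0.2), (5.1, 5.0), (0.2, 0.1), (4.9, 5.2), (0.0, 0.0), (5.0, 4.8)]
cents = sorted(kmeans(pts, 2))
print(cents)  # one centroid near (0.1, 0.1), one near (5.0, 5.0)
```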

2721 | Indexing by Latent Semantic Analysis - Deerwester, Dumais, et al. - 1990
Citation Context ...rd will be represented as a weighted average of all of its possible meanings. It is therefore possible that the weighted average corresponds to a completely different position to any of its meanings (Deerwester et al., 1990). Prior to describing further details of LSA, we give a brief introduction to the underlying technologies of LSA—QR factorization and SVD. 8.2 QR Factorization and Singular Value Decomposition For t... |

2362 | Modern Information Retrieval - Baeza-Yates, Ribeiro-Neto - 1999
Citation Context ...o compute P_CL-unigram(e|d_i^C) = ∑_{c∈C} PT(e|c) P̂(c|d_i^C), ∀e ∈ E, (4.1) an English bag-of-words representation of the Mandarin story d_i^C as used in standard vector-based information retrieval (Baeza-Yates et al., 1999; Salton and McGill, 1986). Once we have obtained this English bag-of-words representation from a Chinese query document, our next step is to measure the similarity between this query and documents... |
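The vector-space similarity used in this retrieval step can be sketched as a cosine between bag-of-words count vectors; the query and documents below are invented examples, and real systems would add tf-idf weighting.

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between two bag-of-words term-count vectors."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = "stock market prices"  # hypothetical English bag of words
doc1 = "stock prices fell on the market today"
doc2 = "the baseball game went into extra innings"
print(cosine(query, doc1) > cosine(query, doc2))  # True
```

The on-topic document shares terms with the query, so its cosine score is strictly higher; the off-topic one shares none and scores zero.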

1958 | Matrix computations - Golub, Loan - 1996 |

1472 | BLEU: a Method for Automatic Evaluation of Machine Translation - Papineni, Ward - 2002
Citation Context ...Therefore, the automatic evaluation of SMT outputs has been an important issue, and still there is no single standard measure (Akiba et al., 2001; Lin and Och, 2004; Melamed et al., 2003; NIST, 2002; Papineni et al., 2002). The main difficulty in the automatic evaluation of SMT lies in the fact that there is no single ground truth. In other words, there may be many correct translations for a source language input segm... |
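A simplified, single-segment sketch of BLEU's ingredients — clipped n-gram precision, a geometric mean, and a brevity penalty. Real BLEU uses n-grams up to 4 and aggregates counts over a whole test corpus, so this is illustrative only.

```python
import math
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def bleu(hyp, refs, max_n=2):
    """Toy BLEU: clipped n-gram precision up to max_n, geometric mean,
    brevity penalty, evaluated on a single hypothesis segment."""
    hyp = hyp.split()
    refs = [r.split() for r in refs]
    precisions = []
    for n in range(1, max_n + 1):
        h = Counter(ngrams(hyp, n))
        # Clip each hypothesis n-gram count by its max count in any reference.
        clipped = sum(min(c, max(Counter(ngrams(r, n))[g] for r in refs))
                      for g, c in h.items())
        precisions.append(clipped / max(1, sum(h.values())))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty against the closest reference length.
    r = min((abs(len(x) - len(hyp)), len(x)) for x in refs)[1]
    bp = 1.0 if len(hyp) > r else math.exp(1 - r / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

refs = ["the cat is on the mat", "there is a cat on the mat"]
print(round(bleu("the cat is on the mat", refs), 2))  # exact match -> 1.0
```

Clipping is what stops a degenerate hypothesis like "the the the the" from earning credit for every repeated "the"; multiple references address the no-single-ground-truth problem the excerpt raises.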

1173 | The mathematics of statistical machine translation: Parameter estimation - Brown, Pietra, et al. - 1994
Citation Context ...npur and Kim, 2004). In spite of significant improvements using our approach, there are some shortcomings in this method. Specifically, stochastic translation lexicons estimated using the IBM method (Brown et al., 1993) from a fairly large sentence-aligned Chinese-English parallel corpus are used—a considerable demand especially for a resource-deficient language. As suggested above, an easier-to-obtain documenta... |

851 | An Empirical Study of Smoothing Techniques for Language Modeling - Chen, Goodman - 1998 |

739 | Statistical Methods for Speech Recognition - Jelinek - 1998
Citation Context ...o given input word strings, we begin with the most popular LM, the N-gram LM. Although the N-gram LM can be applied to many applications, here we take an example of the N-gram LM applied to the ASR problem (Jelinek, 1997). The ASR problem is to find the most likely word string W from the given acoustic evidence (input data) A as Ŵ = arg max_W P(W|A). (2.1) By applying Bayes' formula of probability theory, equation... |
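The Bayes decomposition behind (2.1), Ŵ = arg max_W P(A|W) P(W), can be sketched with invented log scores for a few candidate transcriptions; the hypotheses and all numbers below are made up for illustration.

```python
# Hypothetical acoustic log-likelihoods log P(A|W) for three candidate
# transcriptions of the same audio (all numbers are invented).
acoustic = {
    "recognize speech": -10.0,
    "wreck a nice beach": -9.5,   # slightly better acoustic fit
    "recognise peach": -12.0,
}
# Hypothetical language-model log-probabilities log P(W).
lm = {
    "recognize speech": -4.0,
    "wreck a nice beach": -7.5,   # implausible word string
    "recognise peach": -9.0,
}

# Bayes: arg max_W P(W|A) = arg max_W P(A|W) P(W),
# i.e. maximize the sum of the two log scores.
best = max(acoustic, key=lambda w: acoustic[w] + lm[w])
print(best)  # "recognize speech"
```

Even though "wreck a nice beach" fits the audio marginally better, the language model's prior tips the combined score toward the plausible word string, which is precisely the role the LM plays in the decomposition.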

553 | Inducing features of random fields - Pietra, Pietra, et al. - 1997
Citation Context ...h starts with some arbitrary distribution and converges to the solution such as generalized iterative scaling (GIS) (Csiszár, 1989; Darroch and Ratcliff, 1972) or improved iterative scaling (IIS) (Pietra et al., 1997). Here we show the example of combining topic-based LMs using the maximum entropy model (Wu, 2002). They use the topic-dependent trigram as the sufficient statistic of history. p(wi|w1, · · · ,wi−1) ... |

533 | Using linear algebra for intelligent information retrieval - Berry, Dumais, et al. - 1995 |

525 | Switchboard: Telephone speech corpus for research and development - Godfrey, Holliman, et al. - 1992
Citation Context ...been seen in the training set. There are some cases, however, where N-gram LMs are not good enough to predict the next word. Suppose we have the following sentence (taken from the Switchboard corpus (Godfrey et al., 1992)). You know I want to throw some charcoal on the grill and and throw a steak on there and some baked potatoes and stuff like that. Suppose we are predicting the word baked. We note that the N-gram... |

450 | Improved statistical alignment models - Och, Ney - 2000
Citation Context ...way to obtain the translation dictionaries automatically from the sentence-level aligned parallel corpus 5 and statistical machine translation techniques, such as the publicly available GIZA++ tools (Och and Ney, 2000) which are based on the IBM models (Brown et al., 1990, 1993). These tools use several iterations of the EM algorithm (Dempster et al., 1977) on increasingly complex word-alignment models to infer, a... |

444 | Unsupervised learning by probabilistic latent semantic analysis - Hofmann |

353 | The population frequencies of the species and the estimation of population parameters - Good - 1953
Citation Context ...t has been shown that unreliable constraints such as singleton trigrams should be completely ignored from the model's constraints. Furthermore, some discounting methods such as Good-Turing discounts (Good, 1953) may be applied to the relative frequency counts of the marginal probabilities on the right-hand sides of equations (3.8)-(3.11). Finally, the ME solution has an exponential form p(wi|wi−1,wi−2,ti) =... |
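A sketch of the Good-Turing adjusted count r* = (r + 1) N_{r+1} / N_r on invented trigram counts, where N_r is the number of types occurring exactly r times. Real implementations also smooth the N_r values before applying the formula, which this sketch omits.

```python
from collections import Counter

def good_turing_counts(counts):
    """Good-Turing adjusted counts r* = (r + 1) * N_{r+1} / N_r."""
    n = Counter(counts.values())            # r -> N_r (count-of-counts)
    adjusted = {}
    for w, r in counts.items():
        if n.get(r + 1):
            adjusted[w] = (r + 1) * n[r + 1] / n[r]
        else:
            adjusted[w] = float(r)          # no N_{r+1}: leave undiscounted
    return adjusted

# Hypothetical trigram counts.
counts = {"a b c": 1, "b c d": 1, "c d e": 1, "d e f": 2, "e f g": 3}
adj = good_turing_counts(counts)
# Singletons (r = 1): r* = 2 * N_2 / N_1 = 2 * 1 / 3
print(round(adj["a b c"], 3))
```

Singleton counts get discounted below 1, freeing probability mass for unseen events, which is exactly why the excerpt recommends the method for unreliable marginals.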

303 | Finite-State Transducers in Language and Speech Processing - Mohri - 1997
Citation Context ...nerative source-channel formulation for phrase-based translation constructed so that each step in the generative process can be implemented within a weighted finite state transducer (WFST) framework (Mohri, 1997; Mohri et al., 2000). We summarize this technique here for completeness, and interested readers are referred to Kumar et al. (2004a) for details. The TTM defines a joint probability distribution over... |

271 | Improved backing-off for m-gram language modeling - Kneser, Ney - 1995 |

271 | A Comparison of Alignment Models for Statistical Machine Translation - Och - 2000 |

261 | Mathematical Statistics and Data Analysis - Rice - 1995
Citation Context ... new LM when compared to a baseline LM. In particular, we would like to ensure that the performance improvement is not caused by chance. For this purpose, we will use a statistical significance test (Rice, 1995). A statistical significance test provides a mechanism for making quantitative decisions about a process or processes. The intent is to determine whether there is enough evidence to reject a conjectu... |

242 | A maximum entropy approach to adaptive statistical learning modeling - Rosenfeld - 1996
Citation Context ...d Renals, 1999; Iyer and Ostendorf, 1999). In addition, minimum discrimination information based models (Federico and Bertoldi, 2001) or maximum entropy models have been used (Khudanpur and Wu, 1999; Rosenfeld, 1996). 3.2 Topic Based Language Model The topic based language model is a representative example of LM adaptation: the adaptation by topic (Chen et al., 1998; Florian and Yarowsky, 1999; Gotoh and Renals,... |

229 | A Gaussian prior for smoothing maximum entropy models - Chen, Rosenfeld - 1999
Citation Context ...i|wi−1,wi−2,ti)p(wi−1,wi−2,ti) = #[ti,wi] / #[training data]. (3.11) We have seen that smoothing is an important issue in N-gram language modeling; it is important in maximum entropy modeling as well (Chen and Rosenfeld, 1999; Martin et al., 2000). Constraining unreliable marginal probabilities—those observed only once (singletons) or twice (doubletons)—increases the computational overhead, and therefore, those constraint... |

177 | Adaptive statistical language modeling: A Maximum Entropy Approach - Rosenfeld - 1994
Citation Context ...the average mutual information between lexical pairs co-occurring anywhere within a long “window” of each other has been used to capture statistical dependencies not covered by N-gram LMs (Lau, 1994; Rosenfeld, 1994, 1996; Tillmann and Ney, 1997). We note that even though no distinction is made between content-bearing and function words in the process of selecting trigger pairs, a vast majority of trigger-pairs ... |

165 | A Cache-Based Natural Language Model for Speech Recognition - Kuhn, Mori - 1990
Citation Context ...inally adapt it. LM adaptation differs in two ways: how to build or derive adaptive LMs and how to combine adaptive LMs with the static LM. For building dynamic LMs, there are cache-based approaches (Kuhn and Mori, 1990), trigger-based approaches which we will discuss in Chapter 7, and topic-based approaches which are going to be described in Section 3.2. For combining dynamic LMs with the static LM, one obvious way ... |
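The cache-based idea referenced here can be sketched as a linear interpolation of a static unigram model with a unigram cache built from the recent history; the probabilities, vocabulary, and interpolation weight below are all invented for illustration.

```python
from collections import Counter

def cache_lm_prob(word, static_prob, history, lam=0.2):
    """Linear interpolation of a static unigram LM with a unigram cache
    over the recent history: p = (1 - lam) * p_static + lam * p_cache."""
    cache = Counter(history)
    p_cache = cache[word] / len(history) if history else 0.0
    return (1 - lam) * static_prob.get(word, 0.0) + lam * p_cache

# Hypothetical static unigram probabilities.
static = {"stock": 0.001, "the": 0.05}
history = "the stock market rose as stock prices gained".split()

# "stock" was just seen twice, so the cache boosts its probability.
p = cache_lm_prob("stock", static, history)
print(p > static["stock"])  # True
```

This captures the excerpt's point: a dynamic component derived from recent text is combined with the static LM, so recently used words become more probable than the background model alone would predict.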

110 | Matrix, vector space, and information retrieval - Berry, Drmac, et al. - 1999
Citation Context ... the way for the rank-k approximation; on the other hand, the distance between the original matrix A and the rank-k approximated matrix, in terms of the Frobenius norm, is minimized by using the SVD (Berry et al., 1999; Hofmann, 2001). In other words, A′′ = arg min_{Ā : rank(Ā)=k} ||A − Ā||_F, where the Frobenius norm (|| · ||_F) is defined as ||A||_F = √(∑_i ∑_j a_ij²), where aij refers to the i th row, j th column e... |

110 | The RWTH Phrase-based Statistical Machine Translation System - Zens, Bender, et al. - 2005 |

97 | The Design Principles of a Weighted Finite-State Transducer - Mohri, Pereira, et al. - 2000 |

82 | Language model adaptation using mixtures and an exponentially decaying cache - Clarkson, Robinson - 1997
Citation Context ...ic or background N-gram language model. 3.3 Adaptation Method: Maximum Entropy Model There has been considerable research for combining topic related information with N-gram models (Bellegarda, 1998; Clarkson and Robinson, 1997; Iyer and Ostendorf, 1999; Kneser et al., 1997). The basic idea of these approaches is to exploit the differences of word N-gram distributions across topics. That is, first the whole training data is... |

80 | Fully automatic crosslanguage document retrieval using latent semantic indexing - Landauer, Littman - 1990 |

75 | Exploiting latent semantic information in statistical language modeling - Bellegarda - 2000 |

74 | Automatic crosslanguage retrieval using latent semantic indexing - Dumais, Littman, et al. - 1997 |

58 | M.L.: Automatic cross-linguistic information retrieval using latent semantic indexing - Dumais, Landauer, et al. - 1996 |

54 | Language model adaptation using dynamic marginals - Kneser - 1997
Citation Context ...Method: Maximum Entropy Model There has been considerable research for combining topic related information with N-gram models (Bellegarda, 1998; Clarkson and Robinson, 1997; Iyer and Ostendorf, 1999; Kneser et al., 1997). The basic idea of these approaches is to exploit the differences of word N-gram distributions across topics. That is, first the whole training data is separated into several topic-specific clusters... |

47 | Bootstrap estimates for confidence intervals in ASR performance evaluation - Bisani, Ney
Citation Context ...ea to use bootstrap resampling for measuring confidence intervals for MT scores was originally proposed by Franz Och. This was in fact adapted from the method to measure confidence intervals for ASR (Bisani and Ney, 2004). 4 Notice that multiple references are typically available for one segment. Remark: This bootstrap estimate has been implemented by Och et al. (2003) and we use this implementation in our analys... |
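A sketch of the percentile-bootstrap confidence interval described in this excerpt, resampling per-segment scores with replacement; the scores, replicate count, and seed are illustrative assumptions, not values from the thesis.

```python
import random

def bootstrap_ci(per_segment_scores, b=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a corpus-level mean
    score, resampling test segments with replacement."""
    rng = random.Random(seed)
    n = len(per_segment_scores)
    means = []
    for _ in range(b):
        sample = [per_segment_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(b * alpha / 2)]
    hi = means[int(b * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical per-segment scores (e.g., 1 - WER per utterance).
scores = [0.8, 0.9, 0.7, 0.85, 0.95, 0.75, 0.9, 0.8, 0.7, 0.85]
lo, hi = bootstrap_ci(scores)
print(lo <= sum(scores) / len(scores) <= hi)  # True
```

The same machinery works for BLEU as for WER because it only needs per-segment scores and a way to recompute the corpus-level statistic on each resample.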

47 | Towards better integration of semantic predictors in statistical language modeling - Coccaro, Jurafsky - 1998 |

47 | Alternative approaches for cross-language text retrieval - Oard - 1997
Citation Context ...t documents for a given query where the documents and the query are written in different languages and it has been long and widely studied (Davis and Ogden, 1997; Grefenstette and Grefenstette, 1998; Oard, 1997). In our approach, we don’t necessarily use the state-of-the-art IR method; rather, we use a simple and crude IR method, vector space model, and we try to show that our cross-lingual language model a... |

45 | Inducing multilingual text analysis tools via robust projection across aligned corpora - YAROWSKY, NGAI, et al. |

44 | Precision and Recall of Machine Translation - Melamed, Green, Turian - 2003
Citation Context ...d they are difficult to quantify. Therefore, the automatic evaluation of SMT outputs has been an important issue, and still there is no single standard measure (Akiba et al., 2001; Lin and Och, 2004; Melamed et al., 2003; NIST, 2002; Papineni et al., 2002). The main difficulty in the automatic evaluation of SMT lies in the fact that there is no single ground truth. In other words, there may be many correct translatio... |

41 | QUILT: Implementing a large-scale cross-language text retrieval system - Davis, Ogden - 1997
Citation Context ...d a classical task of CLIR, which selects similar or relevant documents for a given query where the documents and the query are written in different languages and it has been long and widely studied (Davis and Ogden, 1997; Grefenstette and Grefenstette, 1998; Oard, 1997). In our approach, we don’t necessarily use the state-of-the-art IR method; rather, we use a simple and crude IR method, vector space model, and we tr... |

41 | Language independent and language adaptive large vocabulary speech recognition - Schultz, Waibel - 1998 |

40 | The TDT-2 text and speech corpus - Cieri, Graff, et al. - 1999
Citation Context ...NN Headline News CCTV ABC World News Tonight CTS NBC Nightly News CBS-Taiwan MSNBC News with Brian Williams 1.2 Related Work: Topic Detection and Tracking The topic detection and tracking (TDT) task (Christopher et al., 2000) is a concrete example of a large publicly funded technology demonstration program which motivates the research described in this dissertation. The original TDT corpus contains news broadcasts from 4... |

38 | Tools for the analysis of benchmark speech recognition tests - Pallett, Fisher, et al. - 1990
Citation Context ... an example of the significance test: a matched pairs sentence-segment word error (MAPSSWE) test, which can be performed by a NIST (National Institute of Standards and Technology) ASR evaluation tool (Pallett et al., 1990). Since we are interested in whether one system performs significantly better than the other, our hypotheses would be given by: H0 : the mean of error differences between two systems is zero, Ha : th... |

33 | A Weighted Finite State Transducer Translation Template Model for Statistical Machine Translation - Kumar, Deng, et al. - 2005 |

32 | Using Multiple Edit Distances to Automatically Rank Machine Translation Output - Akiba, Imamura, et al. - 2001
Citation Context ...s, but human judgments are expensive and they are difficult to quantify. Therefore, the automatic evaluation of SMT outputs has been an important issue, and still there is no single standard measure (Akiba et al., 2001; Lin and Och, 2004; Melamed et al., 2003; NIST, 2002; Papineni et al., 2002). The main difficulty in the automatic evaluation of SMT lies in the fact that there is no single ground truth. In other wo... |

31 | A multispan language modeling framework for large vocabulary speech recognition - Bellegarda - 1997
Citation Context ...odel with the static or background N-gram language model. 3.3 Adaptation Method: Maximum Entropy Model There has been considerable research for combining topic related information with N-gram models (Bellegarda, 1998; Clarkson and Robinson, 1997; Iyer and Ostendorf, 1999; Kneser et al., 1997). The basic idea of these approaches is to exploit the differences of word N-gram distributions across topics. That is, fir... |

31 | Large Vocabulary Speech Recognition with Multispan Statistical Language Models - Bellegarda - 2000 |

29 | Exploiting Syntactic Structure for Natural Language Modeling - Chelba
Citation Context ...gnized words (errors) in the ASR output (hypothesis) to the total number of words in the reference (correct answer). Here is an example showing a reference, a hypothesis, and each word’s error types (Chelba, 2000). REF: UP UPSTATE NEW YORK SOMEWHERE UH OVER OVER HUGE AREAS HYP: UPSTATE NEW YORK SOMEWHERE UH ALL ALL THE HUGE AREAS ERR: D I S S where D, I, and S stand for errors due to deletions, insertions and... |
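WER as defined in this excerpt can be computed with word-level Levenshtein distance; the sketch below reproduces the excerpt's own example, which has 1 deletion, 2 substitutions, and 1 insertion over a 10-word reference.

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein distance over word sequences:
    (substitutions + deletions + insertions) / len(ref)."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

ref = "UP UPSTATE NEW YORK SOMEWHERE UH OVER OVER HUGE AREAS"
hyp = "UPSTATE NEW YORK SOMEWHERE UH ALL ALL THE HUGE AREAS"
print(wer(ref, hyp))  # 1 deletion + 2 substitutions + 1 insertion = 4/10
```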

29 | On the use of singular value decomposition for text retrieval - Husbands, Simon, et al. - 2001
Citation Context ...lar value decomposition (SVD) which will be explained in Section 8.2, LSA finds the optimal projection of the original word document frequency matrix into a low-dimensional space (Berry et al., 1995; Husbands et al., 2000). As a consequence, all terms or documents semantically related will remain salient in the projected LSA space, which leads us to efficiently find similar words, similar documents, or similar documen... |

27 | Just-in-time language modeling - Berger, Miller - 1998
Citation Context ...language text from an unrelated domain (e.g., Arabic web pages) may sometimes be available, and its use to improve performance in the target language and domain has been investigated elsewhere (e.g., Berger and Miller, 1998; Scheytt et al., 1998). Abundant domain-specific text in other languages (e.g., English news broadcasts) is also often available. Furthermore, for several languages with a sub-par electronic presence... |

27 | Towards language independent acoustic modeling - Byrne, Beyerlein, et al. |