## Topic-Based Language Models Using EM (1999)

Venue: In Proceedings of Eurospeech

Citations: 54 (1 self)

### BibTeX

    @INPROCEEDINGS{Gildea99topic-basedlanguage,
      author = {Daniel Gildea and Thomas Hofmann},
      title = {Topic-Based Language Models Using EM},
      booktitle = {Proceedings of Eurospeech},
      year = {1999},
      pages = {2167--2170}
    }

### Abstract

In this paper, we propose a novel statistical language model to capture topic-related long-range dependencies. Topics are modeled in a latent variable framework in which we also derive an EM algorithm to perform a topic factor decomposition based on a segmented training corpus. The topic model is combined with a standard language model to be used for on-line word prediction. Perplexity results indicate an improvement over previously proposed topic models, which unfortunately has not translated into lower word error.

### Citations

8212 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...re, the number of topics, i.e., the number of values the latent variable $t$ can take, is predetermined, and the parameters $P(w \mid t)$ and $P(t \mid d)$ are fitted by the Expectation-Maximization (EM) algorithm [8]. Starting from randomly initialized values for the parameters this involves the standard procedure of alternating two computational steps: the E-step to calculate the posterior probability of the lat...
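The excerpt above breaks off before the update formulas, so the following is only a minimal sketch of EM for a latent-topic model of this kind, using the standard aspect-model updates rather than anything confirmed by the excerpt. The function names `fit_topics` and `log_likelihood`, the list-of-dicts data layout, and the assumption that every document is non-empty are all illustrative choices.

```python
import math
import random


def fit_topics(counts, num_topics, iterations=50, seed=0):
    """Fit P(w|t) and P(t|d) by EM.

    counts: list of non-empty dicts, one per document, mapping word -> n(w, d).
    Returns (p_w_t, p_t_d): a list of word distributions (one dict per topic)
    and a list of topic-weight lists (one per document).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in counts for w in doc})

    def normalize(weights):
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}

    # Random initialization; the small offset keeps every entry positive.
    p_w_t = [normalize({w: rng.random() + 1e-3 for w in vocab})
             for _ in range(num_topics)]
    p_t_d = []
    for _ in counts:
        raw = [rng.random() + 1e-3 for _ in range(num_topics)]
        p_t_d.append([r / sum(raw) for r in raw])

    for _ in range(iterations):
        new_w_t = [{w: 0.0 for w in vocab} for _ in range(num_topics)]
        new_t_d = [[0.0] * num_topics for _ in counts]
        for d, doc in enumerate(counts):
            for w, n in doc.items():
                # E-step: posterior P(t | w, d) is proportional to P(w|t) P(t|d).
                joint = [p_w_t[t][w] * p_t_d[d][t] for t in range(num_topics)]
                z = sum(joint)
                for t in range(num_topics):
                    expected = n * joint[t] / z
                    new_w_t[t][w] += expected
                    new_t_d[d][t] += expected
        # M-step: renormalize the expected counts.
        p_w_t = [normalize(row) for row in new_w_t]
        p_t_d = [[v / sum(row) for v in row] for row in new_t_d]
    return p_w_t, p_t_d


def log_likelihood(counts, p_w_t, p_t_d):
    """Training log-likelihood: sum over (w, d) of n(w, d) * log P(w|d)."""
    total = 0.0
    for d, doc in enumerate(counts):
        for w, n in doc.items():
            p = sum(p_w_t[t][w] * p_t_d[d][t] for t in range(len(p_w_t)))
            total += n * math.log(p)
    return total
```

Because each EM iteration cannot decrease the training likelihood, comparing `log_likelihood` after a few and after many iterations (with the same seed) is a cheap sanity check on an implementation like this one.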

2761 | Indexing by latent semantic analysis
- Deerwester, Dumais, et al.
- 1990
Citation Context ...oximation in terms of a linear combination of a small number of topic factors. A similar approach to language modeling based on a dimension reduction technique known as Latent Semantic Analysis (LSA) [7] has been proposed in [1] (a detailed implementation is provided in [4]). Yet, compared to the LSA approach that makes use of Singular Value Decomposition techniques, our method has the crucial advant...

807 | Probabilistic latent semantic indexing
- Hofmann
- 1999
Citation Context ...xity is significantly higher than the 12% improvement reported by [4] on the same data. This stresses the advantage of our probabilistic factor model that has also been verified in other applications [9, 10]. 4.3. Analysis: When Does the Model Help? It is interesting to consider which words the topic model helps in predicting. One might expect that because extremely common function words such as “and”, “...

778 | A view of the EM algorithm that justifies incremental, sparse and other variants
- Neal, Hinton
- 1998
Citation Context ...

$$\cdots = \frac{1}{i+1}\,\frac{P(w_i \mid t)\,P(t \mid h_{i-1})}{\sum_{t'} P(w_i \mid t')\,P(t' \mid h_{i-1})} + \frac{i}{i+1}\,P(t \mid h_{i-1}), \qquad (6)$$

$$P(t \mid h_1) = P(t) = \frac{\sum_{w,d} n(w,d)\,P(t \mid d)}{\sum_{w,d} n(w,d)}. \qquad (7)$$

This is essentially an online EM algorithm of the type discussed in [14], but here only a single iteration is performed, reducing the computational complexity in the test stage to a minimum. Experiments using full EM iterations showed negligible improvements with higher c...
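Equations (6) and (7) in the excerpt above can be read as a running-average update of the topic weights, started from a corpus-wide prior. The sketch below follows that reading. The left-hand side of (6) is cut off in the excerpt, so treating it as the updated weight after seeing word w_i is an assumption, as are the helper names `topic_prior` and `update_mixture` and the choice to leave the weights unchanged for a word with zero probability under every topic.

```python
def topic_prior(counts, p_t_d):
    """Eq. (7): P(t) as the count-weighted average of P(t|d) over documents.

    counts: list of dicts mapping word -> n(w, d); p_t_d: per-document topic weights.
    """
    num_topics = len(p_t_d[0])
    doc_lengths = [sum(doc.values()) for doc in counts]
    total = sum(doc_lengths)
    return [
        sum(length * p_t_d[d][t] for d, length in enumerate(doc_lengths)) / total
        for t in range(num_topics)
    ]


def update_mixture(p_t_h, p_w_t, word, i):
    """Eq. (6): one online step mixing the posterior for `word` into the old weights.

    p_t_h: current topic weights P(t | h_{i-1}); p_w_t: one word->prob dict per topic;
    i: position index used for the 1/(i+1) and i/(i+1) interpolation weights.
    """
    joint = [p_w_t[t].get(word, 0.0) * p_t_h[t] for t in range(len(p_t_h))]
    z = sum(joint)
    if z == 0.0:
        # Assumption: a word unseen under every topic leaves the weights as they are.
        return list(p_t_h)
    return [
        (joint[t] / z) / (i + 1) + (i / (i + 1)) * p_t_h[t]
        for t in range(len(p_t_h))
    ]
```

Since the two interpolation weights sum to one and both terms are distributions over topics, the updated weights remain normalized; only one pass over the topics is needed per word, which matches the excerpt's remark about keeping test-time cost low.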

539 | Probabilistic latent semantic analysis
- Hofmann
- 1999
Citation Context ...ned on documents of various topics and are then combined at runtime. Our approach is closely related to the latter class of topic mixtures in that the proposed model is based on a topic decomposition [9],

$$P(w \mid h) = \sum_t P(w \mid t)\,P(t \mid h). \qquad (1)$$

Here $t$ is a latent class variable that is supposed to refer to different topics, $P(w \mid t)$ are topic-specific word probabilities or topic factors and $P(t \mid h)$ are mix...
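As a small illustration of the decomposition in Eq. (1), the word probability is a weighted sum of per-topic word probabilities. The function name `word_probability` and the toy tables in the usage note are hypothetical, not taken from the paper.

```python
def word_probability(word, p_w_t, p_t_h):
    """Eq. (1): P(w|h) = sum over topics t of P(w|t) * P(t|h).

    p_w_t: one word->probability dict per topic; p_t_h: topic weights for the history.
    """
    return sum(p_w_t[t].get(word, 0.0) * p_t_h[t] for t in range(len(p_t_h)))
```

With two toy topics that give "bank" probabilities 0.7 and 0.1, and history weights 0.75 and 0.25, the mixture assigns "bank" a probability of 0.7 × 0.75 + 0.1 × 0.25 = 0.55.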

429 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972
Citation Context ...information. For simplicity, we focus on combining it with a n-gram model. The combination scheme we favor is based on an intuition from maximum entropy model fitting by Iterated Proportional Scaling [6]. We interpret the topic model probabilities as marginal word distributions that should be preserved in the combined model, while leaving the higher-order structure unaffected. Under the assumption th...

245 | A Maximum Entropy Approach to Adaptive Statistical Language Modeling. Computer Speech and Language
- Rosenfeld
- 1996
Citation Context ...dependent on the available training data. Cache models [13, 3] increase the probability for words observed in the history, e.g. by some factor which decays exponentially with distance. Trigger models [16] are more general in that they allow to incorporate arbitrary word trigger pairs which are combined in an exponential model. Grammar-based techniques [12, 2] exploit syntactical regularities to model ...

123 | Exploiting Syntactic Structure for Language Modeling
- Chelba, Jelinek
Citation Context ...s exponentially with distance. Trigger models [16] are more general in that they allow to incorporate arbitrary word trigger pairs which are combined in an exponential model. Grammar-based techniques [12, 2] exploit syntactical regularities to model long-range dependencies. Finally, in topic mixture models [11] a number of language models (e.g., n-grams) are trained on documents of various topics and are...

95 | Modeling long distance dependence in language: Topic mixtures vs. dynamic cache models
- Iyer, Ostendorf
- 1999
Citation Context ...ary word trigger pairs which are combined in an exponential model. Grammar-based techniques [12, 2] exploit syntactical regularities to model long-range dependencies. Finally, in topic mixture models [11] a number of language models (e.g., n-grams) are trained on documents of various topics and are then combined at runtime. Our approach is closely related to the latter class of topic mixtures in that ...

82 | Language Model Adaptation Using Mixtures and an Exponentially Decaying Cache
- Clarkson, Robinson
Citation Context ...al more recent approaches attempt to overcome this limitation: Variable order models [15] adjust the length of the utilized contexts dynamically dependent on the available training data. Cache models [13, 3] increase the probability for words observed in the history, e.g. by some factor which decays exponentially with distance. Trigger models [16] are more general in that they allow to incorporate arbitr...

47 | Towards better integration of semantic predictors in statistical language modeling
- Coccaro, Jurafsky
- 1998
Citation Context ...actors. A similar approach to language modeling based on a dimension reduction technique known as Latent Semantic Analysis (LSA) [7] has been proposed in [1] (a detailed implementation is provided in [4]). Yet, compared to the LSA approach that makes use of Singular Value Decomposition techniques, our method has the crucial advantage of a strict probabilistic interpretation (cf. [9]), a fact that wil...

33 | Using a Stochastic Context-Free Grammar as a Language Model for Speech Recognition
- Jurafsky, Wooters, et al.
- 1995
Citation Context ...s exponentially with distance. Trigger models [16] are more general in that they allow to incorporate arbitrary word trigger pairs which are combined in an exponential model. Grammar-based techniques [12, 2] exploit syntactical regularities to model long-range dependencies. Finally, in topic mixture models [11] a number of language models (e.g., n-grams) are trained on documents of various topics and are...

17 | An overview of the SPRACH system for the transcription of broadcast news
- Cook, Christie, et al.
- 1999
Citation Context ...uage model is in a real-world application, we put it to use in a large vocabulary continuous speech recognition system. We used the SPRACH recognition system for broadcast news described in detail in [5]. For this experiment, we combined the trigram language model with the topic-based language model. The topic model used for the Broadcast News experiments was trained on 1996 CSR Hub-4 Language Model ...

17 | Beyond word n-grams
- Pereira, Singer, et al.
- 1995
Citation Context ...proven hard to improve upon, they are unable to take advantage of long-range dependencies in natural language. Several more recent approaches attempt to overcome this limitation: Variable order models [15] adjust the length of the utilized contexts dynamically dependent on the available training data. Cache models [13, 3] increase the probability for words observed in the history, e.g. by some factor w...

16 | A cache based natural language model for speech recognition
- Kuhn, Mori
- 1992
Citation Context ...al more recent approaches attempt to overcome this limitation: Variable order models [15] adjust the length of the utilized contexts dynamically dependent on the available training data. Cache models [13, 3] increase the probability for words observed in the history, e.g. by some factor which decays exponentially with distance. Trigger models [16] are more general in that they allow to incorporate arbitr...

15 | A latent semantic analysis framework for large-span language modeling
- Bellegarda
- 1997
Citation Context ...inear combination of a small number of topic factors. A similar approach to language modeling based on a dimension reduction technique known as Latent Semantic Analysis (LSA) [7] has been proposed in [1] (a detailed implementation is provided in [4]). Yet, compared to the LSA approach that makes use of Singular Value Decomposition techniques, our method has the crucial advantage of a strict probabili...