## Cluster Expansions And Iterative Scaling For Maximum Entropy Language Models (1995)

Venue: Maximum Entropy and Bayesian Methods

Citations: 20 (1 self)

### BibTeX

@INPROCEEDINGS{Lafferty95clusterexpansions,

author = {John Lafferty and Bernhard Suhm},

title = {Cluster Expansions And Iterative Scaling For Maximum Entropy Language Models},

booktitle = {Maximum Entropy and Bayesian Methods},

year = {1995},

publisher = {Kluwer Academic Publishers}

}

### Abstract

The maximum entropy method has recently been successfully introduced to a variety of natural language applications. In each of these applications, however, the power of the maximum entropy method is achieved at the cost of a considerable increase in computational requirements. In this paper we present a technique, closely related to the classical cluster expansion from statistical mechanics, for reducing the computational demands necessary to calculate conditional maximum entropy language models.

1. Introduction. In this paper we present a computational technique that can enable faster calculation of maximum entropy models. The starting point for our method is an algorithm [1] for constructing maximum entropy distributions that is an extension of the generalized iterative scaling algorithm of Darroch and Ratcliff [2,3]. The extended algorithm relaxes the assumption of [2,3] that the constraint functions sum to a constant, and results in a set of decoupled polynomial equations, one fo...
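The per-feature polynomial equations described in the introduction can be illustrated with a small sketch. It assumes the standard improved iterative scaling update for a conditional model p(w | h) with nonnegative integer-valued features: for each feature k, the update factor x = exp(Δλ_k) solves Σ_m a_{k,m} x^m = p̃[f_k], where a_{k,m} collects the model mass of feature k over pairs (h, w) whose total feature count is m. The function names, the toy counts, the feature tensor `F`, and the use of Newton's method are all illustrative assumptions, not taken from the paper, and the sketch does not implement the cluster expansion itself.

```python
import numpy as np

def conditional_probs(lam, F):
    """p(w | h) proportional to exp(sum_k lam[k] * F[h, w, k])."""
    scores = F @ lam                                  # shape (H, V)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(scores)
    return p / p.sum(axis=1, keepdims=True)           # divide by Z(h)

def solve_polynomial(coeffs, target, iters=100, tol=1e-14):
    """Newton's method for sum_m coeffs[m] * x**m = target with x > 0.

    Assumes nonnegative coefficients and target > 0 (illustrative choice of solver).
    """
    powers = np.arange(len(coeffs))
    x = 1.0
    for _ in range(iters):
        value = np.sum(coeffs * x ** powers) - target
        slope = np.sum(powers * coeffs * x ** np.maximum(powers - 1, 0))
        step = value / slope
        x = max(x - step, 0.5 * x)                    # keep x strictly positive
        if abs(step) < tol:
            break
    return x

def iterative_scaling(counts, F, sweeps=500):
    """Fit conditional maxent weights; one decoupled polynomial per feature."""
    K = F.shape[2]
    joint = counts / counts.sum()                     # empirical p~(h, w)
    p_h = joint.sum(axis=1)                           # empirical p~(h)
    target = np.einsum("hw,hwk->k", joint, F)         # empirical expectations
    total = F.sum(axis=2).astype(int)                 # f#(h, w): total feature count
    lam = np.zeros(K)
    for _ in range(sweeps):
        weight = p_h[:, None] * conditional_probs(lam, F)  # p~(h) p(w | h)
        for k in range(K):
            # coeffs[m] = model mass of feature k where the total count equals m
            coeffs = np.zeros(total.max() + 1)
            np.add.at(coeffs, total, weight * F[:, :, k])
            lam[k] += np.log(solve_polynomial(coeffs, target[k]))
    return lam

# Hypothetical toy data: 3 histories, 4 words, 3 overlapping binary features.
counts = np.array([[5., 2., 1., 1.],
                   [3., 4., 1., 2.],
                   [2., 1., 6., 1.]])
F = np.zeros((3, 4, 3))
F[:, 0, 0] = 1.0                      # feature 0: word 0 in any history
F[[0, 1, 2], [0, 1, 2], 1] = 1.0      # feature 1: word index equals history index
F[0, 1, 2] = 1.0                      # feature 2: two specific (history, word) pairs
F[1, 0, 2] = 1.0

lam = iterative_scaling(counts, F)
print(lam)
```

Because the features overlap, the total count f#(h, w) is not constant, which is the case the extended algorithm is said to handle. Within one sweep all features are updated from the same model probabilities, so the equations are solved independently of one another.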

### Citations

1083 | A maximum entropy approach to natural language processing
- Berger, Della Pietra, et al.
- 1996
Citation Context ...artition function Z is obtained by summing over all possible word strings W. In contrast to this use of the joint distribution, recent applications of the maximum entropy method in language modeling [7,8] have employed conditional models. Such models employ features to represent various frequencies in the training text, such as the bigram features just mentioned, but they use this information to const...

584 | A Statistical Approach to Machine Translation
- Brown, Cocke, et al.
- 1990
Citation Context ...s to identify regularities in natural language and capture them in a statistical model. Language models are crucial ingredients in automatic speech recognition [4] and statistical machine translation [5] systems, where their use is naturally viewed in terms of the noisy channel model from information theory. In this framework an information source emits messages X from a distribution P(X) which then...

553 | Inducing features of random fields
- Della Pietra, Della Pietra, et al.
- 1997
Citation Context ...opy language models. 1. Introduction. In this paper we present a computational technique that can enable faster calculation of maximum entropy models. The starting point for our method is an algorithm [1] for constructing maximum entropy distributions that is an extension of the generalized iterative scaling algorithm of Darroch and Ratcliff [2,3]. The extended algorithm relaxes the assumption of [2,3...

525 | Switchboard: Telephone speech corpus for research and development
- Godfrey, Holliman, et al.
- 1992
Citation Context ...n system. 5. Example: A Topic-Dependent Language Model. In this section we describe the application of the cluster expansion to the training of a topic-dependent bigram model of the Switchboard corpus [12] for use in a speech recognition system. This corpus comprises approximately three million words of text, transcribed from more than 150 hours of speech collected from telephone conversations. An impo...

431 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972
Citation Context ...odels. The starting point for our method is an algorithm [1] for constructing maximum entropy distributions that is an extension of the generalized iterative scaling algorithm of Darroch and Ratcliff [2,3]. The extended algorithm relaxes the assumption of [2,3] that the constraint functions sum to a constant, and results in a set of decoupled polynomial equations, one for each feature, that must be sol...

391 | A Maximum Likelihood Approach to Continuous Speech Recognition
- Bahl, Jelinek, et al.
- 1983
Citation Context ...ESIAN DECODING. Language modeling attempts to identify regularities in natural language and capture them in a statistical model. Language models are crucial ingredients in automatic speech recognition [4] and statistical machine translation [5] systems, where their use is naturally viewed in terms of the noisy channel model from information theory. In this framework an information source emits message...

79 | Statistical Mechanics: A Set of Lectures
- Feynman
- 1972
Citation Context ...tition function, our use of the method will simply make use of the discrete analogues of the integrals b_l for conditional models. For more details on the statistical physics calculations we refer to [9]. 4.2. CLUSTER EXPANSIONS FOR CONDITIONAL MAXENT MODELS. The computation necessary to carry out the iterative scaling algorithm described in Section 3 is naturally divided into two parts. First, for a ...

38 | A method of computing generalized Bayesian probability values for expert systems
- Cheeseman
- 1983
Citation Context ...vidual clusters can be significantly more efficient than computing Z(h) directly. Furthermore, the computation of the clusters can be shared across different histories. The use of Cheeseman's method [10,11] of reordering summations within a cluster can provide further savings. The second computation that is necessary is the calculation of the coefficients of Δβ_α in the expectation p̃[f_α \Del...

37 | Adaptive language modeling using the maximum entropy principle
- Lau, Rosenfeld, et al.
- 1993
Citation Context ...artition function Z is obtained by summing over all possible word strings W. In contrast to this use of the joint distribution, recent applications of the maximum entropy method in language modeling [7,8] have employed conditional models. Such models employ features to represent various frequencies in the training text, such as the bigram features just mentioned, but they use this information to const...

21 | A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling
- Csiszár
- 1989
Citation Context ...odels. The starting point for our method is an algorithm [1] for constructing maximum entropy distributions that is an extension of the generalized iterative scaling algorithm of Darroch and Ratcliff [2,3]. The extended algorithm relaxes the assumption of [2,3] that the constraint functions sum to a constant, and results in a set of decoupled polynomial equations, one for each feature, that must be sol...

6 | Efficient methods for calculating maximum entropy distributions
- Goldman
- 1987
Citation Context ...vidual clusters can be significantly more efficient than computing Z(h) directly. Furthermore, the computation of the clusters can be shared across different histories. The use of Cheeseman's method [10,11] of reordering summations within a cluster can provide further savings. The second computation that is necessary is the calculation of the coefficients of Δβ_α in the expectation p̃[f_α \Del...