## Constrained Stochastic Language Models (1994)

Venue: Image Models (and Their Speech Model Cousins)

Citations: 6 (0 self)

### BibTeX

@INPROCEEDINGS{Mark94constrainedstochastic,

author = {Kevin Mark and Michael I. Miller and Ulf Grenander},

title = {Constrained Stochastic Language Models},

booktitle = {Image Models (and Their Speech Model Cousins)},

year = {1994},

pages = {131--140},

publisher = {Springer}

}

### Abstract

Stochastic language models incorporating both n-grams and context-free grammars are proposed. A constrained context-free model specified by a stochastic context-free prior distribution with superimposed n-gram frequency constraints is derived, and the resulting maximum-entropy distribution is shown to induce a Markov random field with neighborhood structure at the leaves determined by the relative n-gram frequencies. A computationally efficient version, the mixed tree/chain graph model, is derived with identical neighborhood structure. In this model, a word-tree derivation is given by a stochastic context-free prior on trees down to the preterminal (part-of-speech) level, and word attachment is made by a nonstationary Markov chain. Using the Penn TreeBank, a comparison of the mixed tree/chain graph model to both the n-gram and context-free models is performed using entropy measures. The model entropy of the mixed tree/chain graph model is shown to reduce the entropy of both the bigram a...
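The mixed tree/chain graph model described in the abstract factors a parsed sentence's probability into an SCFG prior on the tree down to the preterminals, times a Markov-chain word attachment. A minimal sketch of that factorization follows; all rule names, words, and probabilities are invented for illustration and are not from the paper.

```python
# Toy illustration of the mixed tree/chain graph model: a stochastic
# context-free prior generates the tree down to the preterminal
# (part-of-speech) level, and words are attached by a nonstationary
# Markov chain. All names and numbers below are invented.

# SCFG rule probabilities: P(lhs -> rhs)
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 1.0,
    ("VP", ("V",)): 1.0,
}

# Word attachment: P(w_i | previous word, preterminal_i)
chain_prob = {
    (None, "Det", "the"): 0.5,
    ("the", "N", "dog"): 0.4,
    ("dog", "V", "barks"): 0.3,
}

def tree_prior(rules):
    """pi(t): product of the SCFG rule probabilities used in the derivation."""
    p = 1.0
    for r in rules:
        p *= rule_prob[r]
    return p

def word_attachment(words, preterminals):
    """Word-string probability given the preterminals, via a Markov chain."""
    p, prev = 1.0, None
    for w, g in zip(words, preterminals):
        p *= chain_prob[(prev, g, w)]
        prev = w
    return p

rules = [("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V",))]
words = ["the", "dog", "barks"]
pos = ["Det", "N", "V"]

joint = tree_prior(rules) * word_attachment(words, pos)
print(round(joint, 3))  # 1.0 * 0.5 * 0.4 * 0.3 = 0.06
```

The tree and chain factors share only the preterminal sequence, which is what gives the model its mixed tree/chain neighborhood structure.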

### Citations

9359 | Elements of information theory
- Cover, Thomas
- 1991
Citation Context: ...e random branching process and can be computed analytically [3,9]. 2. $H(W \mid T) = -\sum_{t \in \Omega_P} \frac{1}{K}\pi(t)\, H(W \mid T = t)$, where $H(W \mid T = t)$ is the entropy of a Markov chain and can be computed analytically [10,2]. Proof. We need only to prove the expression for $H(T)$ above. The marginal probability $p(t)$ is derived by summing over all word sequences $w \in \mathcal{W}$: $p(t) = \sum_{w \in \mathcal{W}} p(w, t)$ (3.1) ...

7283 | A mathematical theory of communication
- Shannon
- 1948
Citation Context: ...e random branching process and can be computed analytically [3,9]. 2. $H(W \mid T) = -\sum_{t \in \Omega_P} \frac{1}{K}\pi(t)\, H(W \mid T = t)$, where $H(W \mid T = t)$ is the entropy of a Markov chain and can be computed analytically [10,2]. Proof. We need only to prove the expression for $H(T)$ above. The marginal probability $p(t)$ is derived by summing over all word sequences $w \in \mathcal{W}$: $p(t) = \sum_{w \in \mathcal{W}} p(w, t)$ (3.1) ...

2273 | Building a large annotated corpus of English: The Penn Treebank
- Marcus, Marcinkiewicz, et al.
- 1993
Citation Context: ...models: bigrams, trigrams, context-free, and the mixed tree/chain graph model. The parameters for these models were estimated from a subset of Dow Jones newswire articles from the Penn TreeBank corpus [7]. This data has 1,013,789 words in 42,254 sentences which have been machine-parsed and hand-corrected. A stochastic context-free grammar was estimated using the parse trees associated with each senten...
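The context above says the stochastic context-free grammar was estimated from TreeBank parse trees. The standard way to do this is relative-frequency (maximum-likelihood) estimation over rule counts; a hedged sketch follows, where the nested-tuple tree format and the two example trees are my own assumptions, not Penn TreeBank data.

```python
from collections import Counter, defaultdict

# Relative-frequency SCFG estimation from parse trees. Trees are nested
# (label, child, ...) tuples; a node with no children is a leaf.
# The example trees below are invented.

def count_rules(tree, counts):
    """Accumulate counts of each rule lhs -> rhs used in the tree."""
    label, children = tree[0], tree[1:]
    if children:
        rhs = tuple(child[0] for child in children)
        counts[(label, rhs)] += 1
        for child in children:
            count_rules(child, counts)

trees = [
    ("S", ("NP", ("Det",), ("N",)), ("VP", ("V",))),
    ("S", ("NP", ("N",)), ("VP", ("V",))),
]

counts = Counter()
for t in trees:
    count_rules(t, counts)

# Normalize per left-hand side: P(A -> beta) = count(A -> beta) / count(A)
lhs_totals = defaultdict(int)
for (lhs, _), c in counts.items():
    lhs_totals[lhs] += c
probs = {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

print(probs[("NP", ("Det", "N"))])  # NP -> Det N used in 1 of 2 NP expansions: 0.5
```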

429 | The theory of branching processes
- Harris
- 1963
Citation Context: ...he preterminal string $\gamma_1, \gamma_2, \ldots, \gamma_N$, where $\gamma_i \in V_P$, and the word string $W_N = w_1, w_2, \ldots, w_N$ are the terminal symbols. An important measure is the probability of a derivation tree [4]. For a given tree $T = (t, W_N)$, $\pi(t, W_N) = \pi(t) \prod_{i=1}^{N} \pi(w_i \mid \gamma_i)$ (1.2), where $\pi(t) = \prod_{i=1}^{N_T} P_{r_i}$. For the tree in figure 1.2, $\pi(t, W_N) = P_{S \to NP\,VP}\, P_{NP \to Art\,N}\, P_{VP \to V}\, \pi(\text{'The'} \mid A$...
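Equation (1.2) in the quoted context factors a derivation's probability into the rule-probability product π(t) times independent word emissions π(w_i | γ_i). A toy numeric check of that factorization, with all probabilities invented:

```python
import math

# Check of the factorization pi(t, W_N) = pi(t) * prod_i pi(w_i | gamma_i),
# with pi(t) = prod_i P_{r_i}. All probabilities are illustrative.

rule_probs = [0.9, 0.5, 0.8]   # P_{r_i} for the N_T rules used in tree t
emit_probs = [0.2, 0.1, 0.3]   # pi(w_i | gamma_i) for each of the N words

pi_t = math.prod(rule_probs)            # pi(t)
pi_tw = pi_t * math.prod(emit_probs)    # pi(t, W_N)

print(round(pi_tw, 6))  # 0.9*0.5*0.8 * 0.2*0.1*0.3 = 0.00216
```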

423 | A maximum likelihood approach to continuous speech recognition
- Bahl, Jelinek, et al.
- 1983
Citation Context: ...ed to help distinguish ambiguous phonemes by placing higher probability on more likely possibilities. These models are typically based on some Markov process in the word strings such as a Markov chain [1,5]. Alternatively, more sophisticated language models have been developed which provide syntactic information to perform higher-level tasks such as machine translation and message understanding. The und...

359 | Interpolated estimation of Markov source parameters from sparse data
- Jelinek, Mercer
- 1980
Citation Context: ...ed to help distinguish ambiguous phonemes by placing higher probability on more likely possibilities. These models are typically based on some Markov process in the word strings such as a Markov chain [1,5]. Alternatively, more sophisticated language models have been developed which provide syntactic information to perform higher-level tasks such as machine translation and message understanding. The und...

19 | A trellis-based algorithm for estimating the parameters of a hidden stochastic context-free grammar
- Kupiec
- 1991
Citation Context: ... according to the bigram constraints implicit in the transition probabilities of the Markov chain. For the mixed tree/chain graph model the SCF prior is estimated using the modified trellis algorithm [6,8] and the Markov chain parameters are estimated using relative frequencies. Note that the graph structure for this model is identical to that of the maximum entropy model. However, the distributions ar...

12 | Parameter estimation for constrained context-free language models
- Mark, Miller, et al.
- 1992
Citation Context: ...tablish an alternative structure which combines both a tree structure and a chain structure. The resulting mixed tree/chain graph model has the same neighborhood structure as the maximum-entropy model [8] and has the advantage of computational efficiency. Closed-form expressions for the entropy of this model are derived. Results using the Penn TreeBank are shown which demonstrate the power of these al...

7 | Entropies and combinatorics of random branching processes and context-free languages
- Miller, O'Sullivan
- 1992
Citation Context: ...en by $H(W, T) = H(T) + H(W \mid T)$, where 1. $H(T) = \frac{1}{K} H_\pi(T) + \frac{1}{K} \sum_{t \in \Omega_P^c} \pi(t) \log \pi(t) + \log K$, where $H_\pi(T)$ is the entropy of the random branching process and can be computed analytically [3,9]. 2. $H(W \mid T) = -\sum_{t \in \Omega_P} \frac{1}{K}\pi(t)\, H(W \mid T = t)$, where $H(W \mid T = t)$ is the entropy of a Markov chain and can be computed analytically [10,2]. Proof. We need only to prove the expression for $H(T)$ abo...
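The quoted context asserts that H(W|T = t), the entropy of a Markov chain, can be computed analytically: a finite chain's entropy is the entropy of the initial distribution plus the expected conditional entropy of each transition. A small sketch of that decomposition with an invented 2-state chain (the numbers are illustrative, not the paper's):

```python
import math

# H(X_1, ..., X_{n+1}) = H(X_1) + sum_k E[H(X_{k+1} | X_k)]
# for a finite-length Markov chain; toy 2-state example.

def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

init = [0.5, 0.5]              # initial state distribution
trans = [[0.9, 0.1],           # transition matrix: row i = P(next | state i)
         [0.2, 0.8]]

def chain_entropy(init, trans, n_steps):
    """Entropy of an n_steps-transition Markov chain, computed analytically."""
    h = entropy(init)
    dist = init[:]
    for _ in range(n_steps):
        # expected conditional entropy of the next transition
        h += sum(dist[i] * entropy(trans[i]) for i in range(len(dist)))
        # propagate the marginal distribution one step forward
        dist = [sum(dist[i] * trans[i][j] for i in range(len(dist)))
                for j in range(len(trans))]
    return h

print(round(chain_entropy(init, trans, 2), 3))
```

With zero transitions the chain entropy reduces to the entropy of the initial distribution, here exactly 1 bit.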

2 | Probability measures for context-free languages. Res. rep.
- Grenander
- 1967
Citation Context: ...en by $H(W, T) = H(T) + H(W \mid T)$, where 1. $H(T) = \frac{1}{K} H_\pi(T) + \frac{1}{K} \sum_{t \in \Omega_P^c} \pi(t) \log \pi(t) + \log K$, where $H_\pi(T)$ is the entropy of the random branching process and can be computed analytically [3,9]. 2. $H(W \mid T) = -\sum_{t \in \Omega_P} \frac{1}{K}\pi(t)\, H(W \mid T = t)$, where $H(W \mid T = t)$ is the entropy of a Markov chain and can be computed analytically [10,2]. Proof. We need only to prove the expression for $H(T)$ abo...