## Exploiting Syntactic Structure for Natural Language Modeling (2000)

Citations: 29 (0 self)

### BibTeX

```bibtex
@TECHREPORT{Chelba00exploitingsyntactic,
  author      = {Ciprian Chelba},
  title       = {Exploiting Syntactic Structure for Natural Language Modeling},
  institution = {},
  year        = {2000}
}
```

### Abstract

The thesis presents an attempt at using the syntactic structure in natural language for improved language models for speech recognition. The structured language model merges techniques in automatic parsing and language modeling using an original probabilistic parameterization of a shift-reduce parser. A maximum likelihood reestimation procedure belonging to the class of expectation-maximization algorithms is employed for training the model. Experiments on the Wall Street Journal, Switchboard and Broadcast News corpora show improvement in both perplexity and word error rate -- word lattice rescoring -- over the standard 3-gram language model. The significance of the thesis lies in presenting an original approach to language modeling that uses the hierarchical -- syntactic -- structure in natural language to improve on current 3-gram modeling techniques for large vocabulary speech recognition.
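The perplexity comparison mentioned in the abstract can be made concrete with a minimal sketch; the per-word probabilities below are purely illustrative, not results from the thesis:

```python
import math

def perplexity(word_probs):
    """Perplexity of a model on a test string, given the probability
    it assigns to each word: PPL = exp(-(1/N) * sum(log p_i))."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Hypothetical per-word probabilities for the same test sentence under
# a 3-gram baseline and a richer model; a model that assigns uniformly
# higher probability to the test words yields lower perplexity.
p_3gram = [0.10, 0.05, 0.20, 0.08]
p_model = [0.12, 0.07, 0.25, 0.10]
print(perplexity(p_3gram) > perplexity(p_model))  # True
```

A two-word string with probability 0.5 per word gives perplexity exactly 2, which is a quick sanity check on the formula.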

### Citations

9216 | Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context ...tance used. Csiszar and Tusnady have derived sufficient conditions under which the above alternating minimization procedure converges to the minimum distance between the two sets [13]. As outlined in [12], this algorithm is applicable to problems in information theory — channel capacity and rate distortion calculation — as well as in statistics — the EM algorithm. EM as alternating minimization Let Q(... |

9033 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...training data; 2. estimate interpolation coefficients to minimize the perplexity of cross-validation data — the remaining 10% of the training data — using the expectation-maximization (EM) algorithm [14]. Other approaches use different smoothing techniques — maximum entropy [5], back-off [20] — but they all share the same Markov assumption on the underlying source. An attempt to overcome this limitat... |

2234 | Building a Large Annotated Corpus of English: the Penn Treebank
- Marcus, Santorini, et al.
- 1993
Citation Context ...model (SLM) followed by Chapters 3.1 and 3 explaining the model parameters reestimation algorithm we used. Chapter 4 presents a series of experiments we have carried out on the UPenn Treebank corpus ([21]). Chapters 5 and 6 describe the setup and speech recognition experiments using the structured language model on different corpora: Wall Street Journal (WSJ, [24]), Switchboard (SWB, [15]) and Broadca... |

1153 | A maximum entropy approach to natural language processing
- Berger, Pietra, et al.
- 1996
Citation Context ...lexity of cross-validation data — the remaining 10% of the training data — using the expectation-maximization (EM) algorithm [14]. Other approaches use different smoothing techniques — maximum entropy [5], back-off [20] — but they all share the same Markov assumption on the underlying source. An attempt to overcome this limitation is developed in [27]. Words in the context outside the range of the 3-g... |

738 | Class-based n-gram models of natural language
- Brown, Pietra, et al.
- 1992
Citation Context ...e environment in the WS96 Dependency Modeling Group and the authors’ desire to write a PhD thesis on structured language modeling. The SLM shares many features with both class based language models [23] and skip n-gram language models [27]; an interesting approach combining class based language models and different order skip-bigram models is presented in [28]. It seems worthwhile to make two commen... |

701 | Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer
- Katz
- 1987
Citation Context ...s-validation data — the remaining 10% of the training data — using the expectation-maximization (EM) algorithm [14]. Other approaches use different smoothing techniques — maximum entropy [5], back-off [20] — but they all share the same Markov assumption on the underlying source. An attempt to overcome this limitation is developed in [27]. Words in the context outside the range of the 3-gram model are i... |

567 | Switchboard: telephone speech corpus for research and development
- Godfrey, Holliman, et al.
- 1992
Citation Context ...bank corpus ([21]). Chapters 5 and 6 describe the setup and speech recognition experiments using the structured language model on different corpora: Wall Street Journal (WSJ, [24]), Switchboard (SWB, [15]) and Broadcast News (BN). We conclude with Chapter 7, outlining the relationship between our approach to language modeling — and parsing — and others in the literature and pointing out what we believ... |

550 | An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes
- Baum
- 1972
Citation Context ...model space Q(Θ) is usually structured such that dynamic programming techniques can be used for carrying out the E-step — see for example the hidden Markov model (HMM) parameter reestimation procedure [3]. However this advantage does not come for free: in order to be able to structure the model space we need to make independence assumptions that weaken the modeling power of our parameterization. Fortu... |

452 | A new statistical parser based on bigram lexical dependencies
- Collins
- 1996
Citation Context ...n binarize the parses by again using a rule-based approach. Headword Percolation Inherently a heuristic process, we were satisfied with the output of an enhanced version of the procedure described in [11] — also known under the name “Magerman & Black Headword Percolation Rules”. The procedure first decomposes a parse tree from the treebank into its context-free constituents, identified solely by the no... |

421 | A maximum likelihood approach to continuous speech recognition
- Bahl, Jelinek
- 1983
Citation Context ...4 errors per 10 words in transcription; WER = 40%. The most successful approach to speech recognition so far is a statistical one pioneered by Jelinek and his colleagues [2]; speech recognition is viewed as a Bayes decision problem: given the observed string of acoustic features A, find the most likely word string Ŵ among those that could have generated A: Ŵ = argmax_W... |

354 | Interpolated Estimation of Markov Source Parameters from Sparse Data
- Jelinek, Mercer
- 1980
Citation Context ...aining data have been seen once, thus making a relative frequency estimate unusable because of its unreliability. One standard approach that also ensures smoothing is the deleted interpolation method [18]. It interpolates linearly among contexts of different order: P_θ(w_i | w_{i−n+1} … w_{i−1}) = Σ_{k=0}^{n} λ_k · f(w_i | h_k) (1.6), where h_k = w_{i−k+1} … w_{i−1} is the context of order k when predicting w_i; ... |
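The deleted-interpolation estimate in the excerpt above (eq. 1.6) can be sketched as follows; the function name, data structures, and toy numbers are hypothetical, chosen only to mirror the formula P(w|h) = Σ_k λ_k · f(w|h_k):

```python
def deleted_interpolation(word, histories, freqs, lambdas):
    """Linearly interpolate relative-frequency estimates of decreasing
    context order: P(w | h) = sum_k lambda_k * f(w | h_k), with the
    lambdas summing to 1.  `histories[k]` is the order-k context h_k and
    `freqs[k]` maps (h_k, word) -> f(word | h_k).  A sketch under these
    assumed representations, not the thesis's implementation."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(lam * freqs[k].get((histories[k], word), 0.0)
               for k, lam in enumerate(lambdas))

# Toy example: order-0 (empty) and order-1 contexts for predicting "loss".
freqs = [
    {((), "loss"): 0.01},      # f(w): unigram relative frequency
    {(("a",), "loss"): 0.20},  # f(w | w_{i-1})
]
p = deleted_interpolation("loss", [(), ("a",)], freqs, [0.3, 0.7])
# p = 0.3*0.01 + 0.7*0.20, i.e. about 0.143
```

The interpolation weights themselves would be fit on held-out data with EM, as the excerpt describes; here they are fixed by hand.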

188 | Adaptive Statistical Language Modeling: A Maximum Entropy Approach
- Rosenfeld
- 1994
Citation Context ... and, essentially, all attempts to improve on it in the last 20 years have failed. The one interesting enhancement, facilitated by maximum entropy estimation methodology, has been the use of triggers [27] or of singular value decomposition [4] (either of which dynamically identify the topic of discourse) in combination with n-gram models. Measures of Language Model Quality Word Error Rate One possibi... |

178 | The design for the Wall Street Journal-based CSR corpus
- Paul, Baker
- 1992
Citation Context ...ied out on the UPenn Treebank corpus ([21]). Chapters 5 and 6 describe the setup and speech recognition experiments using the structured language model on different corpora: Wall Street Journal (WSJ, [24]), Switchboard (SWB, [15]) and Broadcast News (BN). We conclude with Chapter 7, outlining the relationship between our approach to language modeling — and parsing — and others in the literature and po... |

176 | Introduction to Government and Binding Theory
- Haegeman
- 1991
Citation Context ... the usage of binary branching — in which one word modifies exactly one other word in the same sentence — versus trees with unconstrained branching. Learnability issues favor the former, as argued in [16]. It is not surprising that the binary structure also lends itself to a simpler algorithmic description and is the choice for our modeling approach. As an example, the output of the headword percolati... |

165 | A linear observed time statistical parser based on maximum entropy models
- Ratnaparkhi
- 1997
Citation Context ...rt of Switchboard which was manually parsed at UPenn — approx. 20,000 words. This allows the training of an automatic parser — we have used the Collins parser [11] for SWB and the Ratnaparkhi parser [26] for WSJ and BN — which is going to be used to generate an automatic treebank, possibly with a slightly different word-tokenization than that of the two manual treebanks. We evaluated the sensitivity ... |

142 | Information geometry and alternating minimization procedures
- Csiszár, Tusnády
- 1984
Citation Context ...s A and B and the distance used. Csiszar and Tusnady have derived sufficient conditions under which the above alternating minimization procedure converges to the minimum distance between the two sets [13]. As outlined in [12], this algorithm is applicable to problems in information theory — channel capacity and rate distortion calculation — as well as in statistics — the EM algorithm. EM as alternatin... |

128 | Exploiting Syntactic Structure for Language Modeling
- Chelba, Jelinek
- 1998
Citation Context ...ding to the ln(P(Wk, Tk)) score, highest on top. The amount of search is controlled by two parameters. [Footnote: Thanks to Bob Carpenter, Lucent Technologies Bell Labs, for pointing out this inaccuracy in our [9] paper.]... |

81 | Structured language modeling
- Chelba, Jelinek
- 2000
Citation Context ...ntext does not change if we remove the (of (7 cents)) constituent — the resulting sentence is still a valid one — whereas the 3-gram context becomes (a, loss). The preliminary experiments reported in [8] — although the perplexity results are conditioned on parse structure developed by human annotators by having the entire sentence at their disposal — showed the usefulness of headwords accompanied by ... |

81 | Aggregate and mixed-order Markov models for statistical language processing
- Saul, Pereira
- 1997
Citation Context ...s with both class based language models [23] and skip n-gram language models [27]; an interesting approach combining class based language models and different order skip-bigram models is presented in [28]. It seems worthwhile to make two comments relating the SLM to these approaches: • the smoothing involving NT/POS tags in the WORD-PREDICTOR is similar to a class based language model using NT/POS lab... |

73 | Problem Solving Methods - Nilsson - 1971

35 | Relating probabilistic grammars and automata
- Abney, McAllester, et al.
- 1999
Citation Context ...C(u, z_1, …, z_k) = Σ_{z_{k+1}∈Z_{k+1}} … Σ_{z_n∈Z_n} C(u, z_1, …, z_k, z_{k+1}, …, z_n), C(z_1, …, z_k) = Σ_{u∈U} C(u, z_1, …, z_k); the λ(z_1, …, z_k) are the interpolation coefficients satisfying λ(z_1, …, z_k) ∈ [0, 1], k = 0 … n. [Figure 2.11: Recursive Linear Interpolation] The λ(z_1, …,... |
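The recursive linear interpolation named in the figure caption above (Figure 2.11) can be sketched as follows; the representation of contexts and frequency tables is an assumption for illustration, but the recursion mirrors the figure: each P_n mixes the order-n relative frequency f_n with the lower-order estimate P_{n-1}, bottoming out at the uniform distribution P_{-1}(u) = 1/|U|:

```python
def p_interp(u, context, f, lam, vocab_size):
    """P_n(u | z_1..z_n) = lam_n * P_{n-1}(u | z_1..z_{n-1})
                         + (1 - lam_n) * f_n(u | z_1..z_n),
    with base case P_{-1}(u) = 1/|U|.  `f[n]` maps (tuple(context), u)
    to the order-n relative frequency; `lam[n]` is the order-n mixing
    weight in [0, 1].  A sketch of the recursion, not the thesis code."""
    n = len(context)
    lower = (1.0 / vocab_size if n == 0
             else p_interp(u, context[:-1], f, lam, vocab_size))
    return lam[n] * lower + (1 - lam[n]) * f[n].get((tuple(context), u), 0.0)

# Toy example with |U| = 4:
f = [{((), "a"): 0.5}, {(("b",), "a"): 0.9}]
# P_0(a) = 0.2*(1/4) + 0.8*0.5 = 0.45
# P_1(a|b) = 0.5*0.45 + 0.5*0.9 = 0.675
p = p_interp("a", ("b",), f, [0.2, 0.5], 4)
```

In practice the λ coefficients would be bucketed by context counts and estimated on held-out data; here they are fixed constants.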

33 | Structure and performance of a dependency language model
- Chelba, Engle, et al.
- 1997
Citation Context ...el in perplexity or word error rate, none of them exploiting syntactic structure for better modeling of the natural language source. The model we present is closely related to the one investigated in [7], however different in a few important aspects: • our model operates in a left-to-right manner, thus allowing its use directly in the hypothesis search for Ŵ in (1.1); • our model is a factored vers... |

31 | Inference and estimation of a long-range trigram model
- Pietra, Pietra, et al.
- 1994
Citation Context ...[Figure 7.2: Tag reduced WORD-PREDICTOR dependencies] A structured approach to language modeling has been taken in [25]: the underlying probability model P(W, T) is a simple lexical link grammar, which is automatically induced and reestimated using EM from a training corpus containing word sequences (sentences). The m... |

17 | A latent semantic analysis framework for large-span language modeling
- Bellegarda
- 1997
Citation Context ...ve on it in the last 20 years have failed. The one interesting enhancement, facilitated by maximum entropy estimation methodology, has been the use of triggers [27] or of singular value decomposition [4] (either of which dynamically identify the topic of discourse) in combination with n-gram models. Measures of Language Model Quality Word Error Rate One possibility to measure the quality of a langua... |

17 | Combining Non-local, Syntactic and N-gram dependencies in Language Modeling
- Wu, Khudanpur
- 1999
Citation Context ...s use of syntactic structure. The experiments we have carried out show improvement in both perplexity and word error rate over current state-of-the-art techniques. Preliminary experiments reported in [30] show complementarity between the SLM and a topic language model yielding almost additive results — word error rate improvement — on the Switchboard task. Among the directions which we consider worth ... |

14 | Problem-Solving Methods in Artificial Intelligence - Nilsson - 1971

11 | Error bounds for convolutional codes and an asymptotically optimum decoding algorithm
- Viterbi
- 1967
Citation Context ...ical considerations as explained in the next section. The calculation of gL(x) (5.1) is made very efficient after realizing that one can use the dynamic programming technique in the Viterbi algorithm [29]. Indeed, for a given lattice L, the value of hL(x) is completely determined by the identity of the ending node of x; a Viterbi backward pass over the lattice can store at each node the corresponding ... |
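The backward dynamic-programming pass described in the excerpt above can be sketched as follows; the lattice representation (topologically ordered node list, arcs carrying additive log-scores) is an assumption for illustration, not the thesis's data structures:

```python
def backward_best_scores(nodes, arcs, final):
    """For each lattice node n, compute h[n]: the best (maximum) additive
    score of any path from n to the final node, via a single backward
    pass in reverse topological order (the Viterbi recursion).
    `nodes` must be topologically ordered and `arcs[n]` is a list of
    (successor, arc_score) pairs for every non-final node n."""
    h = {final: 0.0}
    for n in reversed(nodes):          # reverse topological order
        if n != final:
            h[n] = max(score + h[succ] for succ, score in arcs[n])
    return h

# Tiny lattice: 0 -> 1 -> 3 and 0 -> 2 -> 3, with log-scores on arcs.
arcs = {0: [(1, -1.0), (2, -0.5)], 1: [(3, -0.25)], 2: [(3, -2.0)]}
print(backward_best_scores([0, 1, 2, 3], arcs, 3))
# {3: 0.0, 2: -2.0, 1: -0.25, 0: -1.25}
```

As the excerpt notes, once h is stored per node, the best completion score of any partial path x depends only on x's ending node, so it can be read off in constant time.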

6 | Constrained stochastic language models
- Mark, Miller, et al.
- 1996
Citation Context ...T_k]) = P(p_i^k | h_0.tag, h_{−1}.tag) (7.3). It can be seen that the probabilistic dependency structure is more complex than that in a CFG even in this simplified SLM. Along the same lines, the approach in [19] regards the word sequence W with the parse structure T as a Markov graph (W, T) modeled using the CFG dependencies superimposed on the regular word-level 2-gram dependencies, showing improvement in p... |

4 | Information geometry and EM variants
- Byrne, Gunawardana, et al.
- 1998
Citation Context ...n information geometry. Having gained this insight we can then easily justify the N-best training procedure. This is an interesting area of research to which we were introduced by the presentation in [6]. Information Geometry and EM The problem of maximum likelihood estimation from incomplete data can be viewed in an interesting geometric framework. Before proceeding, let us introduce some concepts a... |

4 | Information Extraction From Speech And Text
- Jelinek
- 1997
Citation Context ...In the accepted statistical formulation of the speech recognition problem [17] the recognizer seeks to find the word string W = argmax_W P(A|W) P(W), where A denotes the observable speech signal, P(A|W) is the probability that when the word string W is spoken, the signal A r... |