## Building Probabilistic Models for Natural Language (1996)

### Download Links

- arxiv.org
- www.cs.cmu.edu
- ftp.das.harvard.edu
- DBLP

### Other Repositories/Bibliography

Citations: 67 (1 self)

### BibTeX

```bibtex
@MISC{Chen96buildingprobabilistic,
  author = {Stanley F. Chen},
  title  = {Building Probabilistic Models for Natural Language},
  year   = {1996}
}
```

### Abstract

Building models of language is a central task in natural language processing. Traditionally, language has been modeled with manually-constructed grammars that describe which strings are grammatical and which are not; however, with the recent availability of massive amounts of on-line text, statistically-trained models are an attractive alternative. These models are generally probabilistic, yielding a score reflecting sentence frequency instead of a binary grammaticality judgement. Probabilistic models of language are a fundamental tool in speech recognition for resolving acoustically ambiguous utterances. For example, we prefer the transcription forbear to four bear as the former string is far more frequent in English text. Probabilistic models also have application in optical character recognition, handwriting recognition, spelling correction, part-of-speech tagging, and machine translation. In this thesis, we investigate three problems involving the probabilistic modeling of languag...

### Citations

9034 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...g results. 3.4.2 Other Approaches The most widely-used tool in probabilistic grammar induction is the Inside-Outside algorithm (Baker, 1979), a special case of the Expectation-Maximization algorithm (Dempster et al., 1977). The Inside-Outside algorithm takes a probabilistic context-free grammar and adjusts its probabilities iteratively to attempt to maximize the probability the grammar assigns to some training data. I...

2893 | Dynamic Programming
- Bellman
- 1957
Citation Context ...entence is accepted under the grammar, then the symbol S will occur in the cell corresponding to w_1 ··· w_m. The cells can be filled in an efficient manner with dynamic programming (Bellman, 1957). Performing probabilistic chart parsing just requires some extra bookkeeping; the algorithm is essentially the same. This is only true when trying to calculate the most probable parse of a sen...
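The probabilistic chart parsing described in this context can be sketched as a CKY-style dynamic program over a grammar in Chomsky normal form. This is a minimal illustration, not the thesis's parser; the toy grammar, its rule probabilities, and all identifiers are invented for the example.

```python
from collections import defaultdict

def cky_best_parse(words, lexical, binary):
    """Probabilistic CKY chart parsing for a CNF grammar.

    `lexical` maps (A, word) -> p for rules A -> word;
    `binary` maps (A, B, C) -> p for rules A -> B C.
    chart[(i, j)][A] holds the probability of the best parse of
    words[i:j] rooted at nonterminal A.
    """
    n = len(words)
    chart = defaultdict(dict)
    # Fill in the length-1 spans from the lexical rules.
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1)][A] = max(chart[(i, i + 1)].get(A, 0.0), p)
    # Combine adjacent spans bottom-up, keeping the best probability.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    pb = chart[(i, k)].get(B, 0.0)
                    pc = chart[(k, j)].get(C, 0.0)
                    if pb and pc and p * pb * pc > chart[(i, j)].get(A, 0.0):
                        chart[(i, j)][A] = p * pb * pc
    return chart[(0, n)]

# Hypothetical toy grammar: S -> NP VP, VP -> V NP, plus lexical rules.
lexical = {("NP", "dogs"): 0.5, ("NP", "cats"): 0.5, ("V", "chase"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
best = cky_best_parse(["dogs", "chase", "cats"], lexical, binary)
# best["S"] is the probability of the most probable parse rooted at S.
```

Calculating the *total* probability of a sentence, rather than the most probable parse, would replace the `max` over split points with a sum, which is the distinction the context's footnote alludes to.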

1287 | The mathematics of statistical machine translation: Parameter estimation
- Brown, Pietra, et al.
- 1993
Citation Context ...e a higher probability of aligning with the sentence Jean a mangé Fido than with the sentence Fido a mangé Jean. However, modeling how word order mutates under translation is notoriously difficult (Brown et al., 1993), and it is unclear how much improvement in accuracy an accurate model of word order would provide. Hence, we ignore this issue and take all word orderings to be equiprobable. Let O(E) denote the num...

1053 | A method for construction of minimum-redundancy codes
- Huffman
- 1952
Citation Context ...ties that are negative powers of two that are near to the actual probabilities of each outcome. In fact, there is an algorithm for performing this assignment in an optimal way, namely Huffman coding (Huffman, 1952). In this case, Huffman coding yields the codewords 0, 10, and 11. Clearly, the codeword lengths do not follow the relation that an outcome with probability p has ... We see this principle followed in ...
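The Huffman construction named in this context can be sketched in a few lines by repeatedly merging the two least probable subtrees. This is an illustrative implementation, not from the thesis; the three-outcome distribution is the example the context itself uses (codewords 0, 10, 11).

```python
import heapq

def huffman_lengths(probs):
    """Return {symbol: codeword length} for a Huffman code over `probs`.

    Each heap entry is (probability, tiebreak, {symbol: depth_so_far});
    merging two entries increments every contained symbol's depth.
    """
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# Probabilities 1/2, 1/4, 1/4 give codeword lengths 1, 2, 2,
# matching the codewords 0, 10, 11 from the context.
lengths = huffman_lengths({"a": 0.5, "b": 0.25, "c": 0.25})
```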

938 | An empirical study of smoothing techniques for language modeling
- Chen, Goodman
- 1996
Citation Context ... manual annotation is expensive and thus only a limited amount of such data is available. Chapter 2 Smoothing n-Gram Models In this chapter, we describe work on the task of smoothing n-gram models (Chen and Goodman, 1996). Of the three structural levels at which we model language in this thesis, this represents work at the word level. We introduce two novel smoothing techniques that significantly outperform all exi...

738 | Class-based n-gram models of natural language
- Brown, Pietra, et al.
- 1992
Citation Context ... likely to occur in language and match the acoustic signal well. The source-channel model can be extended to many other applications besides speech recognition by just varying the channel model used (Brown et al., 1992b). In optical character recognition and handwriting recognition (Hull, 1992; Srihari and Baltus, 1992), the channel can be interpreted as converting from text to image data instead of from text to sp...

715 | A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text
- Church
- 1988
Citation Context ...l outputs image data, text with spelling errors, or text in a foreign language. By varying the source model, we can extend the source-channel model to further applications. In part-of-speech tagging (Church, 1988), one attempts to label words in sentences with their part-of-speech. We can apply the source-channel model by taking the source to generate part-of-speech sequences T_pos corresponding to sentences,...

701 | Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer
- Katz
- 1987
Citation Context ...he maximum likelihood model. While smoothing is a central issue in language modeling, the literature lacks a definitive comparison between the many existing techniques. Previous studies (Nadas, 1984; Katz, 1987; Church and Gale, 1991; MacKay and Peto, 1995) only compare a small number of methods (typically two) on a single corpus and using a single training data size. As a result, it is currently difficult ...

644 | Text Compression
- Bell, Cleary, et al.
- 1990
Citation Context ...odeling. Some smoothing algorithms that we did not consider that would be interesting to compare against are those from the field of data compression, which includes the subfield of text compression (Bell et al., 1990). However, smoothing algorithms for data compression have different requirements from those used for language modeling. In data compression, it is essential that smoothed models can be built extremel...

640 | A statistical approach to machine translation
- Brown, Cocke, et al.
- 1990
Citation Context ...l., 1990), the channel can be interpreted as an imperfect typist that converts perfect text T to noisy text T_n with spelling mistakes, yielding T = argmax_T p(T) p(T_n|T). In machine translation (Brown et al., 1990), the channel can be interpreted as a translator that converts text T in one language into text T_f in a foreign language, yielding T = argmax_T p(T) p(T_f|T) (1.3). In each of these cases, we try...

570 | Theory of Probability
- Jeffreys
- 1961
Citation Context ...ward, and high probabilities are adjusted downward. To give an example, one simple smoothing technique is to pretend each bigram occurs once more than it actually does (Lidstone, 1920; Johnson, 1932; Jeffreys, 1948), yielding p_{+1}(w_i | w_{i-1}) = (c(w_{i-1} w_i) + 1) / Σ_w [c(w_{i-1} w) + 1] = (c(w_{i-1} w_i) + 1) / (Σ_w c(w_{i-1} w) + |V|) (2.4), where V is the vocabulary, the set of all words b...
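The add-one estimate in equation (2.4) translates directly to code: every bigram count is incremented by one, which inflates the denominator by the vocabulary size. A minimal sketch; the toy corpus and function names are invented for illustration.

```python
from collections import Counter

def add_one_bigram(corpus, vocab):
    """Return p(w | prev) = (c(prev, w) + 1) / (c(prev) + |V|),
    the add-one (Laplace) smoothed bigram estimate of equation (2.4)."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    history = Counter(corpus[:-1])  # c(prev): how often each history occurs
    def p(w, prev):
        return (bigrams[(prev, w)] + 1) / (history[prev] + len(vocab))
    return p

corpus = "the cat sat on the mat".split()
vocab = set(corpus)
p = add_one_bigram(corpus, vocab)
# Seen bigram ("the", "cat"): (1 + 1) / (2 + 5) = 2/7.
# Unseen bigram ("mat", "cat"): (0 + 1) / (0 + 5) = 1/5, no longer zero.
```

The point of the technique, as the context says, is exactly that unseen events receive nonzero probability while frequent events are discounted slightly.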

561 | Three approaches to the quantitative definition of information
- Kolmogorov
- 1965
Citation Context ...bitrarily long. However, it is not possible to make descriptions arbitrarily compact. There is a lower bound to the length of the description of any piece of data (Solomonoff, 1960; Solomonoff, 1964; Kolmogorov, 1965), and we can use this lower bound to define a meaningful description length for a piece of data. This is why we choose an optimal coding for calculating l(O|G). This dictum of optimal coding extends ...

550 | An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes
- Baum
- 1972
Citation Context ...distribution p_unif(w_i) = 1/|V|. Given fixed p_ML, it is possible to search efficiently for the λ_{w_{i-n+1}^{i-1}} that maximize the probability of some data using the Baum-Welch algorithm (Baum, 1972). To yield meaningful results, the data used to estimate the λ_{w_{i-n+1}^{i-1}} need to be disjoint from the data used to calculate the p_ML. In held-out interpolation, one reserves a section o...

513 | Syntactic Structures
- Chomsky
- 1957
Citation Context ...ese phrases in turn can be combined to create sentences, which in turn can be used to build paragraphs, and so on. Grammars can be used to describe such hierarchical structure in a succinct manner (Chomsky, 1964). A grammar consists of rules that describe allowable ways of combining structures at one level to form structures at the next higher level. For example, we may have a grammar rule of the form: Noun-...

454 | A Program for Aligning Sentences in Bilingual Corpora, unpublished ms., submitted to 29th Annual Meeting of the Association for Computational Linguistics - Gale, Church - 1990 |

421 | A maximum likelihood approach to continuous speech recognition
- Bahl, Jelinek, et al.
- 1983
Citation Context ...n, but they are also useful in applications as diverse as spelling correction, machine translation, and part-of-speech tagging. These and other applications can be placed in a single common framework (Bahl et al., 1983), the source-channel model used in information theory (Shannon, 1948). In this section, we explain how speech recognition can be placed in this framework, and then explain how other applications are ...

409 | The population frequencies of species and the estimation of population parameters
- Good
- 1953
Citation Context ... δ|V| (2.5). Lidstone and Jeffreys advocate taking δ = 1. Gale and Church (1990; 1994) have argued that this method generally performs poorly. 2.2.2 Good-Turing Estimate The Good-Turing estimate (Good, 1953) is central to many smoothing techniques. The Good-Turing estimate states that for any n-gram that occurs r times, we should pretend that it occurs r* times, where r* = (r + 1) n_{r+1} / n_r (2.6) and where...
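The Good-Turing adjustment in equation (2.6) is a one-liner once the counts-of-counts n_r are available. A minimal sketch under the assumption that every needed n_r is nonzero for observed r; the toy counts are invented for illustration.

```python
from collections import Counter

def good_turing_adjusted(counts):
    """Return adjusted counts r* = (r + 1) * n_{r+1} / n_r (equation 2.6),
    where n_r is the number of distinct n-grams occurring exactly r times."""
    n = Counter(counts.values())  # counts-of-counts: n[r]
    return {gram: (r + 1) * n[r + 1] / n[r] for gram, r in counts.items()}

# Toy bigram counts: three singletons and one doubleton.
counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2}
adjusted = good_turing_adjusted(counts)
# Singletons get r* = 2 * n_2 / n_1 = 2 * 1 / 3; the doubleton gets
# r* = 3 * n_3 / n_2 = 0 here, since no n-gram occurs three times.
```

In practice the n_r must be smoothed themselves before applying the formula, since raw counts-of-counts are zero for large r; the sketch above ignores that refinement.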

382 | The estimation of stochastic context-free grammars using the Inside-Outside algorithm, Computer Speech and Language
- Lari, Young
- 1990
Citation Context ...ocessing, and intuitively such models capture properties of language that n-gram models cannot. For example, it has been shown that grammatical language models can express long-distance dependencies (Lari and Young, 1990; Resnik, 1992; Schabes, 1992). Furthermore, grammatical models have the potential to be more compact while achieving equivalent performance as n-gram models (Brown et al., 1992b). To demonstrate thes...

360 | Algorithms for Minimization without Derivatives
- Brent
- 1973
Citation Context ...ion, we include a general multidimensional search engine for automatically searching for optimal parameter values for each smoothing technique. We use the implementation of Powell's search algorithm (Brent, 1973) given in Numerical Recipes in C (Press et al., 1988, pp. 309--317). Powell's algorithm does not require the calculation of the gradient. It involves successive searches along vectors in the multidim...

354 | Interpolated Estimation of Markov Source Parameters from Sparse Data
- Jelinek, Mercer
- 1980
Citation Context ...at of n-gram models and the Lari and Young algorithm. For n-gram models, we tried n = 1, ..., 10 for each domain. To smooth the n-gram models, we use a popular version of Jelinek-Mercer smoothing (Jelinek and Mercer, 1980; Bahl et al., 1983), namely the version that we refer to as interp-held-out described in Section 2.4.1. In the Lari and Young algorithm, the initial grammar is taken to be a probabilistic context-...
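Jelinek-Mercer smoothing interpolates a higher-order maximum-likelihood estimate with a lower-order one. The sketch below uses a single hand-set weight `lam` rather than weights estimated on held-out data as in the interp-held-out variant the context refers to; the corpus and identifiers are invented for illustration.

```python
from collections import Counter

def interpolated_bigram(corpus, lam):
    """Jelinek-Mercer style interpolation of bigram and unigram ML estimates:
    p(w | h) = lam * p_ML(w | h) + (1 - lam) * p_ML(w)."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    history = Counter(corpus[:-1])
    unigrams = Counter(corpus)
    total = len(corpus)
    def p(w, h):
        p_bi = bigrams[(h, w)] / history[h] if history[h] else 0.0
        p_uni = unigrams[w] / total
        return lam * p_bi + (1 - lam) * p_uni
    return p

corpus = "a b a b a c".split()
p = interpolated_bigram(corpus, lam=0.5)
# p("b" | "a") = 0.5 * (2/3) + 0.5 * (2/6) = 0.5
```

Estimating the weights properly, as in held-out interpolation, would maximize the probability of a reserved data section over `lam` (e.g. with Baum-Welch) instead of fixing it by hand.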

318 | Inductive Inference: Theory and Methods
- Angluin, Smith
- 1983
Citation Context ...data to be manually annotated in any way. 3.2 Grammar Induction as Search Grammar induction can be framed as a search problem, and has been framed as such almost without exception in past research (Angluin and Smith, 1983). The search space is taken to be some class of grammars; for example, in our work we search within the space of probabilistic context-free grammars. We search for a grammar that optimizes some quant...

278 | Trainable grammars for speech recognition
- Baker
- 1979
Citation Context ... we used, and like Cook et al., they do not present any language modeling results. 3.4.2 Other Approaches The most widely-used tool in probabilistic grammar induction is the Inside-Outside algorithm (Baker, 1979), a special case of the Expectation-Maximization algorithm (Dempster et al., 1977). The Inside-Outside algorithm takes a probabilistic context-free grammar and adjusts its probabilities iteratively t...

200 | Aligning Sentences in Parallel Corpora
- Brown, Lai, et al.
- 1991
Citation Context ...Canadian parliament proceedings in both English and French. Bilingual corpora have proven useful in many tasks, including machine translation (Brown et al., 1990; Sadler, 1989), sense disambiguation (Brown et al., 1991a; Dagan et al., 1991; Gale et al., 1992), and bilingual lexicography (Klavans and Tzoukermann, 1990; Warwick and Russell, 1990). For example, a bilingual corpus can be used to automatically construct...

198 | Word-sense disambiguation using statistical methods
- Brown, Pietra, et al.
- 1991
Citation Context ...Canadian parliament proceedings in both English and French. Bilingual corpora have proven useful in many tasks, including machine translation (Brown et al., 1990; Sadler, 1989), sense disambiguation (Brown et al., 1991a; Dagan et al., 1991; Gale et al., 1992), and bilingual lexicography (Klavans and Tzoukermann, 1990; Warwick and Russell, 1990). For example, a bilingual corpus can be used to automatically construct...

141 | Prepositional phrase attachment through a backed-off model
- Collins, Brooks
- 1995
Citation Context ... 2.6 Discussion Smoothing is a fundamental technique for statistical modeling, important not only for language modeling but for many other applications as well, e.g., prepositional phrase attachment (Collins and Brooks, 1995), part-of-speech tagging (Church, 1988), and stochastic parsing (Magerman, 1994). Whenever data sparsity is an issue (and it always is), smoothing has the potential to improve performance with modera...

141 | Text-translation alignment - Kay, Röscheisen - 1993 |

130 | A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language
- Church, Gale
- 1991
Citation Context ...ikelihood model. While smoothing is a central issue in language modeling, the literature lacks a definitive comparison between the many existing techniques. Previous studies (Nadas, 1984; Katz, 1987; Church and Gale, 1991; MacKay and Peto, 1995) only compare a small number of methods (typically two) on a single corpus and using a single training data size. As a result, it is currently difficult for a researcher to int...

125 | Aligning Sentences in Bilingual Corpora Using Lexical Information
- Chen
- 1993
Citation Context ...e just need to enhance the move set. Chapter 4 Aligning Sentences in Bilingual Text In this chapter, we describe an algorithm for aligning sentences with their translations in a bilingual corpus (Chen, 1993). In experiments with the Hansard Canadian parliament proceedings, our algorithm yields significantly better accuracy than previous algorithms. In addition, it is efficient, robust, language-independ...

119 | Two languages are more informative than one
- Dagan, Itai, et al.
- 1991
Citation Context ...roceedings in both English and French. Bilingual corpora have proven useful in many tasks, including machine translation (Brown et al., 1990; Sadler, 1989), sense disambiguation (Brown et al., 1991a; Dagan et al., 1991; Gale et al., 1992), and bilingual lexicography (Klavans and Tzoukermann, 1990; Warwick and Russell, 1990). For example, a bilingual corpus can be used to automatically construct a bilingual dictiona...

116 | Basic methods of probabilistic context-free grammars
- Jelinek, Lafferty, et al.
- 1992
Citation Context ...f repetitions using the universal MDL prior. 3.5.4 Parsing To calculate the most probable parse of a sentence given the current hypothesis grammar, we use a probabilistic chart parser (Younger, 1967; Jelinek et al., 1992). In chart parsing, one fills in a chart composed of cells, where each cell represents a span in the sentence to be parsed. If the sentence is composed of the words w_1 ··· w_m, then...

86 | A Spelling Correction Program Based on a Noisy Channel Model
- Kernighan, Church, et al.
- 1990
Citation Context ...i and Baltus, 1992), the channel can be interpreted as converting from text to image data instead of from text to speech, yielding the equation T = argmax_T p(T) p(image|T). In spelling correction (Kernighan et al., 1990), the channel can be interpreted as an imperfect typist that converts perfect text T to noisy text T_n with spelling mistakes, yielding T = argmax_T p(T) p(T_n|T). In machine translation (Brown et...
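The source-channel decision rule in this context, T = argmax_T p(T) p(T_n|T), is a one-line maximization once a source model and a channel model are given. A minimal spelling-correction sketch; every probability below is made up purely for illustration, and the function names are hypothetical.

```python
def source_channel_decode(observation, candidates, source_model, channel_model):
    """Pick the candidate T maximizing p(T) * p(observation | T)."""
    return max(
        candidates,
        key=lambda t: source_model[t] * channel_model.get((observation, t), 0.0),
    )

# Hypothetical instance: the observed string "teh" with three candidate
# corrections. source = p(T) from a language model; channel = p(T_n | T).
source = {"the": 0.07, "ten": 0.002, "teh": 1e-7}
channel = {("teh", "the"): 0.1, ("teh", "ten"): 0.05, ("teh", "teh"): 0.9}
best = source_channel_decode("teh", list(source), source, channel)
# "the" wins: its high source probability outweighs the channel's
# preference for leaving the observed string unchanged.
```

Swapping in an acoustic, image, or translation channel model gives the speech recognition, OCR, and machine translation instances listed in the surrounding contexts.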

79 | Using bilingual materials to develop word sense disambiguation methods
- Gale, Church, et al.
- 1992
Citation Context ...nglish and French. Bilingual corpora have proven useful in many tasks, including machine translation (Brown et al., 1990; Sadler, 1989), sense disambiguation (Brown et al., 1991a; Dagan et al., 1991; Gale et al., 1992), and bilingual lexicography (Klavans and Tzoukermann, 1990; Warwick and Russell, 1990). For example, a bilingual corpus can be used to automatically construct a bilingual dictionary. A bilingual dic...

68 | An estimate of an upper bound for the entropy of English
- Brown, Pietra
- 1992
Citation Context ... likely to occur in language and match the acoustic signal well. The source-channel model can be extended to many other applications besides speech recognition by just varying the channel model used (Brown et al., 1992b). In optical character recognition and handwriting recognition (Hull, 1992; Srihari and Baltus, 1992), the channel can be interpreted as converting from text to image data instead of from text to sp...

66 | Good-turing frequency estimation without tears - Gale, Sampson - 1995 |

56 | A convergent gambling estimate of the entropy of English
- Cover, King
- 1978
Citation Context ...l, if a particular outcome has probability p, in the limit of coding a large number of trials, each of those outcomes will take on average log_2(1/p) bits to code in the optimal coding (Shannon, 1948; Cover and King, 1978). In fact, this limit can be realized in practice with an efficient algorithm called arithmetic coding (Pasco, 1976; Rissanen, 1976). 3.2.2 Description Lengths Now, let us return to the minimum descr...
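The log_2(1/p) cost per outcome mentioned in this context gives the expected codeword length of the optimal coding, i.e. the entropy of the distribution. A minimal worked sketch; the distribution is the same three-outcome example used for Huffman coding above, where the optimum happens to be achievable exactly.

```python
import math

def expected_code_length(probs):
    """Expected bits per outcome under optimal coding: each outcome with
    probability p costs log2(1/p) bits on average (the Shannon entropy)."""
    return sum(p * math.log2(1 / p) for p in probs)

# For probabilities 1/2, 1/4, 1/4 the optimal coding averages
# 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits, matching the Huffman
# codewords 0, 10, 11 since all probabilities are powers of two.
bits = expected_code_length([0.5, 0.25, 0.25])
```

When probabilities are not negative powers of two, Huffman coding can overshoot this bound by up to one bit per symbol, which is the gap that arithmetic coding closes in the limit.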

54 | Bayesian grammar induction for language modeling
- Chen
- 1995
Citation Context ...ther way to improve performance. Chapter 3 Bayesian Grammar Induction for Language Modeling In this chapter, we describe a corpus-based induction algorithm for probabilistic context-free grammars (Chen, 1995) that significantly outperforms the grammar induction algorithm introduced by Lari and Young (1990), the most widely-used algorithm for probabilistic grammar induction. In addition, it outperforms n-...

50 | Deriving Translation Data from Bilingual Texts - Catizone, Russell, et al. - 1989 |

43 | Applications of stochastic context-free grammars using the Inside-Outside algorithm
- Lari, Young
- 1991
Citation Context ... though it assigns a high probability to the training data; this phenomenon is called overfitting the training data. In n-gram models and work with the Inside-Outside algorithm (Lari and Young, 1990; Lari and Young, 1991; Pereira and Schabes, 1992), this issue is evaded because all of the models considered are of a fixed size, so that the "optimal" grammar cannot be expressed. However, in our work we do not wish to...

32 | Probability: The Deductive and Inductive Problems
- Johnson
- 1932
Citation Context ...are adjusted upward, and high probabilities are adjusted downward. To give an example, one simple smoothing technique is to pretend each bigram occurs once more than it actually does (Lidstone, 1920; Johnson, 1932; Jeffreys, 1948), yielding p_{+1}(w_i | w_{i-1}) = (c(w_{i-1} w_i) + 1) / Σ_w [c(w_{i-1} w) + 1] = (c(w_{i-1} w_i) + 1) / (Σ_w c(w_{i-1} w) + |V|) (2.4), where V is the vocabulary, the se...

26 | What's wrong with adding one - Gale, Church - 1990 |

26 | Speech Recognition and the Frequency of Recently Used Words: A Modified Markov Model for Natural Language
- Kuhn
- 1988
Citation Context ...tion p(w_i | w_{i-n+1}^{i-1}) for a fixed w_{i-n+1}^{i-1}, as opposed to counts in the global n-gram distribution. This observation is taken advantage of in dynamic language modeling (Kuhn, 1988; Rosenfeld and Huang, 1992; Rosenfeld, 1994a). [figure: bigram model, λ vs. bucketing value, shape predicted by ...]

25 | Language modeling with sentence-level mixtures - Iyer, Ostendorf, et al. - 1994 |

23 | Grammatical inference by hill climbing - Cook, Rosenfeld, et al. - 1976 |

17 | Lexical heads, phrase structure and the induction of grammar. SIGDAT - de Marcken - 1995 |

9 | Combining syntactic knowledge and visual text recognition: A hidden Markov model for part of speech tagging in a word recognition algorithm
- Hull
- 1992
Citation Context ...model can be extended to many other applications besides speech recognition by just varying the channel model used (Brown et al., 1992b). In optical character recognition and handwriting recognition (Hull, 1992; Srihari and Baltus, 1992), the channel can be interpreted as converting from text to image data instead of from text to speech, yielding the equation T = argmax_T p(T) p(image|T). In spelling corr...

9 | The BICORD system
- Klavans, Tzoukermann
- 1990
Citation Context ... useful in many tasks, including machine translation (Brown et al., 1990; Sadler, 1989), sense disambiguation (Brown et al., 1991a; Dagan et al., 1991; Gale et al., 1992), and bilingual lexicography (Klavans and Tzoukermann, 1990; Warwick and Russell, 1990). For example, a bilingual corpus can be used to automatically construct a bilingual dictionary. A bilingual dictionary can be expressed as a probabilistic model p(f|e) of...

7 | Learning Probabilistic Grammars for Language Modeling - Carroll - 1995 |

7 | Estimation procedures for language context: Poor estimates are worse than none - Gale, Church - 1990 |

3 | Speech recognition using a stochastic language model integrating local and global constraints - Isotani, Matsunaga - 1994 |

1 | Experiments in stochastic grammar inference with simulated annealing and the inside-outside algorithm. Unpublished report
- Chen, Kehler, et al.
- 1993
Citation Context ...onstrate that by using a richer move set this constraint is much less serious. There have been several results demonstrating the severity of the local minima problem for the Inside-Outside algorithm (Chen et al., 1993; de Marcken, 1995). In terms of efficiency, the algorithms differ significantly because of the different ways rules are selected. In the Lari and Young algorithm, one starts with a large grammar and ...