## Building Probabilistic Models for Natural Language (1996)

### Download Links

- [arxiv.org]
- [www.cs.cmu.edu]
- [ftp.das.harvard.edu]
- DBLP

### Other Repositories/Bibliography

Citations: 67 (1 self)

### BibTeX

```bibtex
@TECHREPORT{Chen96buildingprobabilistic,
  author      = {Stanley F. Chen},
  title       = {Building Probabilistic Models for Natural Language},
  institution = {},
  year        = {1996}
}
```

### Abstract

Building models of language is a central task in natural language processing. Traditionally, language has been modeled with manually-constructed grammars that describe which strings are grammatical and which are not; however, with the recent availability of massive amounts of on-line text, statistically-trained models are an attractive alternative. These models are generally probabilistic, yielding a score reflecting sentence frequency instead of a binary grammaticality judgement. Probabilistic models of language are a fundamental tool in speech recognition for resolving acoustically ambiguous utterances. For example, we prefer the transcription *forbear* to *four bear* as the former string is far more frequent in English text. Probabilistic models also have application in optical character recognition, handwriting recognition, spelling correction, part-of-speech tagging, and machine translation. In this thesis, we investigate three problems involving the probabilistic modeling of languag...

### Citations

9054 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...g results. 3.4.2 Other Approaches: The most widely-used tool in probabilistic grammar induction is the Inside-Outside algorithm (Baker, 1979), a special case of the Expectation-Maximization algorithm (Dempster et al., 1977). The Inside-Outside algorithm takes a probabilistic context-free grammar and adjusts its probabilities iteratively to attempt to maximize the probability the grammar assigns to some training data. I...

7146 | A mathematical theory of communication
- Shannon
- 1948
Citation Context: ...tion, machine translation, and part-of-speech tagging. These and other applications can be placed in a single common framework (Bahl et al., 1983), the source-channel model used in information theory (Shannon, 1948). In this section, we explain how speech recognition can be placed in this framework, and then explain how other applications are just variations on this theme. The task of speech recognition can be ...
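The source-channel framework described here reduces to an argmax over candidate transcriptions. A minimal sketch, using the *forbear* vs. *four bear* example from the abstract; the probability values below are made-up illustrative numbers, not figures from the thesis.

```python
# Noisy-channel decoding sketch: choose the transcription T that
# maximizes p(T) * p(A|T), i.e. source model times channel model.
# All probability values here are toy assumptions for illustration.

def decode(candidates, source_prob, channel_prob):
    """Return the candidate T maximizing p(T) * p(A|T)."""
    return max(candidates, key=lambda t: source_prob[t] * channel_prob[t])

source_prob = {"forbear": 1e-6, "four bear": 1e-9}   # language model p(T)
channel_prob = {"forbear": 0.5, "four bear": 0.5}    # acoustic model p(A|T)

best = decode(["forbear", "four bear"], source_prob, channel_prob)
print(best)  # forbear
```

Swapping in a different channel model (image likelihoods, typist error model, translation model) extends the same argmax to OCR, spelling correction, and machine translation, as the surrounding entries note.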

2901 | Dynamic programming
- Bellman
- 1957
Citation Context: ...entence is accepted under the grammar, then the symbol S will occur in the cell corresponding to w_1 ... w_m. The cells can be filled in an efficient manner with dynamic programming (Bellman, 1957). Performing probabilistic chart parsing just requires some extra bookkeeping; the algorithm is essentially the same. This is only true when trying to calculate the most probable parse of a sen...
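The chart-filling procedure described here can be sketched as a CKY recognizer: a sentence is accepted if the start symbol appears in the cell spanning the whole input. The grammar below is a toy Chomsky-normal-form example of my own, not one from the thesis.

```python
# Minimal CKY recognizer: fills the chart bottom-up with dynamic
# programming; the sentence is accepted if the start symbol occurs
# in the cell covering words[0:n].

def cky_accepts(words, lexical, binary, start="S"):
    n = len(words)
    # chart[i][j] holds the nonterminals deriving words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = {A for A, word in lexical if word == w}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for A, B, C in binary:          # rule A -> B C
                    if B in chart[i][k] and C in chart[k][j]:
                        chart[i][j].add(A)
    return start in chart[0][n]

lexical = [("Det", "the"), ("N", "dog"), ("V", "barks")]
binary = [("S", "NP", "V"), ("NP", "Det", "N")]
print(cky_accepts(["the", "dog", "barks"], lexical, binary))  # True
```

The probabilistic version mentioned in the snippet keeps, per cell, the best probability for each nonterminal instead of a bare membership set.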

1782 | An Introduction to Kolmogorov Complexity and Its Applications, 2nd edn
- Li, Vitányi
- 1997
Citation Context: ...temming from an encoding perspective. Secondly, it has been observed that "MDL-style" priors of the form p(G) = 2^{-l(G)} can be good models of the real world (Solomonoff, 1964; Rissanen, 1978; Li and Vitányi, 1993). To demonstrate this, let us consider some examples of real-world data. Let us say you see one hundred flips of a coin, and each time it turns up heads. Clearly, you expect a head with very high ...

1287 | The mathematics of statistical machine translation: Parameter estimation
- Brown, Pietra, et al.
- 1993
Citation Context: ...e a higher probability of aligning with the sentence Jean a mangé Fido than with the sentence Fido a mangé Jean. However, modeling how word order mutates under translation is notoriously difficult (Brown et al., 1993), and it is unclear how much improvement in accuracy an accurate model of word order would provide. Hence, we ignore this issue and take all word orderings to be equiprobable. Let O(E) denote the num...

1246 | Modeling by shortest data description
- Rissanen
- 1978
Citation Context: ...els, grammatical models offer the best hope for significantly improving language modeling accuracy. We introduce a novel grammar induction algorithm based on the minimum description length principle (Rissanen, 1978) that surpasses the performance of existing algorithms. The third problem deals with the task of bilingual sentence alignment. There exist many corpora that contain equivalent text in multiple langua...

1053 | A method for the construction of minimum redundancy codes
- Huffman
- 1952
Citation Context: ...ties that are negative powers of two that are near to the actual probabilities of each outcome. In fact, there is an algorithm for performing this assignment in an optimal way, namely Huffman coding (Huffman, 1952). In this case, Huffman coding yields the codewords 0, 10, and 11. Clearly, the codeword lengths do not follow the relation that an outcome with probability p has... We see this principle followed in ...
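The codeword example in this snippet (0, 10, 11) corresponds to probabilities 1/2, 1/4, 1/4. A small sketch of Huffman's construction that recovers those codeword lengths; tie-breaking order among equal-weight subtrees is arbitrary here.

```python
# Huffman coding sketch via a min-heap: repeatedly merge the two
# least-probable subtrees, adding one bit to every leaf they contain.
# With probabilities 1/2, 1/4, 1/4 this gives lengths 1, 2, 2,
# matching the codewords 0, 10, 11 in the passage.
import heapq

def huffman_lengths(probs):
    """Return {symbol: codeword length} for a probability dict."""
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)  # unique tiebreaker so dicts are never compared
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_lengths({"a": 0.5, "b": 0.25, "c": 0.25}))
```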

940 | An empirical study of smoothing techniques for language modeling. Computer Speech and Language
- Chen, Goodman
- 1999
Citation Context: ...manual annotation is expensive and thus only a limited amount of such data is available. [Chapter 2, Smoothing n-Gram Models] In this chapter, we describe work on the task of smoothing n-gram models (Chen and Goodman, 1996). Of the three structural levels at which we model language in this thesis, this represents work at the word level. We introduce two novel smoothing techniques that significantly outperform all exi...

738 | Class-Based N-Gram Models of Natural Language
- Brown, Pietra
- 1992
Citation Context: ...likely to occur in language and match the acoustic signal well. The source-channel model can be extended to many other applications besides speech recognition by just varying the channel model used (Brown et al., 1992b). In optical character recognition and handwriting recognition (Hull, 1992; Srihari and Baltus, 1992), the channel can be interpreted as converting from text to image data instead of from text to sp...

716 | A stochastic parts program and noun phrase parser for unrestricted text
- Church
- 1988
Citation Context: ...l outputs image data, text with spelling errors, or text in a foreign language. By varying the source model, we can extend the source-channel model to further applications. In part-of-speech tagging (Church, 1988), one attempts to label words in sentences with their part-of-speech. We can apply the source-channel model by taking the source to generate part-of-speech sequences T_pos corresponding to sentences,...

701 | Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer
- Katz
- 1987
Citation Context: ...he maximum likelihood model. While smoothing is a central issue in language modeling, the literature lacks a definitive comparison between the many existing techniques. Previous studies (Nadas, 1984; Katz, 1987; Church and Gale, 1991; MacKay and Peto, 1995) only compare a small number of methods (typically two) on a single corpus and using a single training data size. As a result, it is currently difficult ...

644 | Modeling for text compression
- Bell, Witten, et al.
- 1989
Citation Context: ...odeling. Some smoothing algorithms that we did not consider that would be interesting to compare against are those from the field of data compression, which includes the subfield of text compression (Bell et al., 1990). However, smoothing algorithms for data compression have different requirements from those used for language modeling. In data compression, it is essential that smoothed models can be built extremel...

641 | A statistical approach to machine translation
- Brown, Cocke, et al.
- 1990
Citation Context: ...l., 1990), the channel can be interpreted as an imperfect typist that converts perfect text T to noisy text T_n with spelling mistakes, yielding T = argmax_T p(T) p(T_n | T). In machine translation (Brown et al., 1990), the channel can be interpreted as a translator that converts text T in one language into text T_f in a foreign language, yielding T = argmax_T p(T) p(T_f | T) (1.3). In each of these cases, we try...

570 | Theory of Probability
- Jeffreys
- 1961
Citation Context: ...ward, and high probabilities are adjusted downward. To give an example, one simple smoothing technique is to pretend each bigram occurs once more than it actually does (Lidstone, 1920; Johnson, 1932; Jeffreys, 1948), yielding p_{+1}(w_i | w_{i-1}) = [c(w_{i-1} w_i) + 1] / Σ_{w_i} [c(w_{i-1} w_i) + 1] = [c(w_{i-1} w_i) + 1] / [Σ_{w_i} c(w_{i-1} w_i) + |V|] (2.4), where V is the vocabulary, the set of all words b...
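The add-one estimate in equation (2.4) can be sketched directly; the toy counts and three-word vocabulary below are illustrative assumptions, not data from the thesis.

```python
# Add-one ("plus-one") smoothing for bigrams, as in equation (2.4):
# every bigram count is incremented by one, so the denominator grows
# by the vocabulary size |V|. Unseen bigrams get nonzero probability.
from collections import Counter

def p_plus_one(w_prev, w, counts, vocab):
    """p_{+1}(w | w_prev) = (c(w_prev w) + 1) / (sum_w c(w_prev w) + |V|)."""
    total = sum(counts[(w_prev, v)] for v in vocab)
    return (counts[(w_prev, w)] + 1) / (total + len(vocab))

vocab = {"the", "dog", "barks"}
counts = Counter({("the", "dog"): 3, ("the", "barks"): 1})
print(p_plus_one("the", "dog", counts, vocab))  # (3+1)/(4+3) = 4/7
```

Note that the unseen bigram ("the", "the") also receives probability 1/7 here, which is exactly the over-generosity toward unseen events that Gale and Church criticize in the Good entry below.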

561 | Three approaches to the quantitative definition of information, Probl.
- Kolmogorov
- 1965
Citation Context: ...bitrarily long. However, it is not possible to make descriptions arbitrarily compact. There is a lower bound to the length of the description of any piece of data (Solomonoff, 1960; Solomonoff, 1964; Kolmogorov, 1965), and we can use this lower bound to define a meaningful description length for a piece of data. This is why we choose an optimal coding for calculating l(O|G). This dictum of optimal coding extends ...

550 | An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities
- Baum
- 1972
Citation Context: ...distribution p_unif(w_i) = 1/|V|. Given fixed p_ML, it is possible to search efficiently for the λ_{w_{i-n+1}^{i-1}} that maximize the probability of some data using the Baum-Welch algorithm (Baum, 1972). To yield meaningful results, the data used to estimate the λ_{w_{i-n+1}^{i-1}} need to be disjoint from the data used to calculate the p_ML. In held-out interpolation, one reserves a section o...

514 | Syntactic Structures
- Chomsky
- 1957
Citation Context: ...ese phrases in turn can be combined to create sentences, which in turn can be used to build paragraphs, and so on. Grammars can be used to describe such hierarchical structure in a succinct manner (Chomsky, 1964). A grammar consists of rules that describe allowable ways of combining structures at one level to form structures at the next higher level. For example, we may have a grammar rule of the form: Noun-...

454 | A Program for Aligning Sentences in Bilingual Corpora (unpublished ms., submitted to 29th Annual Meeting of the Association for Computational Linguistics)
- Gale, Church
- 1990

427 | A formal theory of inductive inference
- Solomonoff
- 1964
Citation Context: ...ke descriptions arbitrarily long. However, it is not possible to make descriptions arbitrarily compact. There is a lower bound to the length of the description of any piece of data (Solomonoff, 1960; Solomonoff, 1964; Kolmogorov, 1965), and we can use this lower bound to define a meaningful description length for a piece of data. This is why we choose an optimal coding for calculating l(O|G). This dictum of optim...

422 | A maximum likelihood approach to continuous speech recognition
- Bahl, Jelinek, et al.
- 1983
Citation Context: ...n, but they are also useful in applications as diverse as spelling correction, machine translation, and part-of-speech tagging. These and other applications can be placed in a single common framework (Bahl et al., 1983), the source-channel model used in information theory (Shannon, 1948). In this section, we explain how speech recognition can be placed in this framework, and then explain how other applications are ...

411 | The population frequencies of species and the estimation of population parameters
- Good
- 1953
Citation Context: ...δ|V|] (2.5). Lidstone and Jeffreys advocate taking δ = 1. Gale and Church (1990; 1994) have argued that this method generally performs poorly. 2.2.2 Good-Turing Estimate: The Good-Turing estimate (Good, 1953) is central to many smoothing techniques. The Good-Turing estimate states that for any n-gram that occurs r times, we should pretend that it occurs r* times, where r* = (r + 1) n_{r+1} / n_r (2.6) and where...
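The adjusted count in equation (2.6) is a one-liner once the count-of-counts n_r are tabulated. A sketch with made-up n_r values for illustration.

```python
# Good-Turing adjusted count, equation (2.6): an n-gram seen r times
# is treated as occurring r* = (r + 1) * n_{r+1} / n_r times, where
# n_r is the number of distinct n-gram types seen exactly r times.

def good_turing(r, n):
    """n maps a count r to the number of types seen exactly r times."""
    return (r + 1) * n.get(r + 1, 0) / n[r]

n = {1: 100, 2: 40, 3: 20}      # illustrative count-of-counts
print(good_turing(1, n))        # 2 * 40 / 100 = 0.8
print(good_turing(2, n))        # 3 * 20 / 40  = 1.5
```

Since n_{r+1} < n_r for small r in typical corpora, the estimate discounts observed counts, freeing probability mass for unseen n-grams.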

382 | The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language
- Lari, Young
- 1990
Citation Context: ...ocessing, and intuitively such models capture properties of language that n-gram models cannot. For example, it has been shown that grammatical language models can express long-distance dependencies (Lari and Young, 1990; Resnik, 1992; Schabes, 1992). Furthermore, grammatical models have the potential to be more compact while achieving equivalent performance as n-gram models (Brown et al., 1992b). To demonstrate thes...

375 | Prediction and entropy of printed English
- Shannon
- 1951
Citation Context: ...lems that we have selected investigate the task of modeling language at three different levels: words, constituents, and sentences. First, we consider the problem of smoothing n-gram language models (Shannon, 1951). Such models are dominant in language modeling, yielding the best current performance. In such models, the probability of a sentence is expressed through the probability of each word in the sentence...

362 | Algorithms for Minimization without Derivatives
- Brent
- 1973
Citation Context: ...ion, we include a general multidimensional search engine for automatically searching for optimal parameter values for each smoothing technique. We use the implementation of Powell's search algorithm (Brent, 1973) given in Numerical Recipes in C (Press et al., 1988, pp. 309-317). Powell's algorithm does not require the calculation of the gradient. It involves successive searches along vectors in the multidim...

357 | Interpolated estimation of Markov source parameters from sparse data
- Jelinek, Mercer
- 1980
Citation Context: ...at of n-gram models and the Lari and Young algorithm. For n-gram models, we tried n = 1, ..., 10 for each domain. To smooth the n-gram models, we use a popular version of Jelinek-Mercer smoothing (Jelinek and Mercer, 1980; Bahl et al., 1983), namely the version that we refer to as interp-held-out described in Section 2.4.1. In the Lari and Young algorithm, the initial grammar is taken to be a probabilistic context-...
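Jelinek-Mercer smoothing mixes the maximum-likelihood n-gram estimate with the smoothed lower-order model. A simplified sketch: the real method buckets a separate λ per history (as in the Baum entry above), while this toy version uses a single λ; all probability values are illustrative assumptions.

```python
# Jelinek-Mercer interpolation sketch (simplified to one global lambda):
#   p(w | h) = lam * p_ML(w | h) + (1 - lam) * p(w | shorter h),
# bottoming out in a uniform distribution over the vocabulary.

def interp(w, history, lam, p_ml, vocab_size):
    """Recursively interpolated n-gram probability."""
    if not history:
        # unigram ML estimate mixed with the uniform distribution
        return lam * p_ml.get((w, ()), 0.0) + (1 - lam) / vocab_size
    shorter = history[1:]  # drop the most distant context word
    return (lam * p_ml.get((w, history), 0.0)
            + (1 - lam) * interp(w, shorter, lam, p_ml, vocab_size))

# toy ML estimates: p_ML(dog | the) = 0.75, p_ML(dog) = 0.2
p_ml = {("dog", ("the",)): 0.75, ("dog", ()): 0.2}
print(interp("dog", ("the",), 0.5, p_ml, vocab_size=1000))
# 0.5*0.75 + 0.5*(0.5*0.2 + 0.5*0.001) = 0.42525
```

In held-out interpolation the λ values would themselves be fit on reserved data with Baum-Welch, not fixed by hand as here.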

326 | Stochastic Complexity in Statistical Inquiry
- Rissanen
- 1989
Citation Context: ...particular, the probability we assign to the symbol A expanding to exactly n B's is p(A → Bⁿ) = p_MDL(n) = (6/π²) · 1/(n [log₂(n+1)]²), where p_MDL is the universal MDL prior over the natural numbers (Rissanen, 1989). We choose this parameterization because it prevents us from needing to estimate the probability of the A → AB expansion versus the A → B expansion, and because in some sense it is the most conse...
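The prior above, as reconstructed here from the garbled snippet (the π² factor is an assumption recovered from the extraction damage), decays slowly with n, so long expansions remain possible but are penalized.

```python
# Universal MDL-style prior over the naturals as reconstructed from
# the passage: p_MDL(n) = (6 / pi^2) / (n * log2(n+1)^2).
# The 6/pi^2 normalizer is an assumption of this sketch.
import math

def p_mdl(n):
    return (6 / math.pi ** 2) / (n * math.log2(n + 1) ** 2)

print(p_mdl(1))   # 6/pi^2, since log2(2) = 1
print(p_mdl(7))   # smaller, but far heavier-tailed than a geometric prior
```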

318 | Inductive Inference: Theory and Methods
- Angluin, Smith
- 1983
Citation Context: ...data to be manually annotated in any way. 3.2 Grammar Induction as Search: Grammar induction can be framed as a search problem, and has been framed as such almost without exception in past research (Angluin and Smith, 1983). The search space is taken to be some class of grammars; for example, in our work we search within the space of probabilistic context-free grammars. We search for a grammar that optimizes some quant...

278 |
Trainable grammars for speech recognition
- Baker
- 1979
Citation Context: ...we used, and like Cook et al., they do not present any language modeling results. 3.4.2 Other Approaches: The most widely-used tool in probabilistic grammar induction is the Inside-Outside algorithm (Baker, 1979), a special case of the Expectation-Maximization algorithm (Dempster et al., 1977). The Inside-Outside algorithm takes a probabilistic context-free grammar and adjusts its probabilities iteratively t...

278 | Inside-outside reestimation from partially bracketed corpora
- Pereira, Schabes
- 1992
Citation Context: ...given earlier in this section where we qualify nonterminal symbols with their head words. Some grammar induction algorithms require that the training data be annotated with parse tree information (Pereira and Schabes, 1992; Magerman, 1994). However, these algorithms tend to be geared toward parsing instead of language modeling. It is expensive to manually annotate data, and it is not practical to annotate the amount of...

202 | Aligning Sentences in Parallel Corpora
- Brown, Lai, et al.
- 1991
Citation Context: ...Canadian parliament proceedings in both English and French. Bilingual corpora have proven useful in many tasks, including machine translation (Brown et al., 1990; Sadler, 1989), sense disambiguation (Brown et al., 1991a; Dagan et al., 1991; Gale et al., 1992), and bilingual lexicography (Klavans and Tzoukermann, 1990; Warwick and Russell, 1990). For example, a bilingual corpus can be used to automatically construct...

198 | Word-Sense Disambiguation using Statistical Methods
- Brown, Pietra, et al.
- 1991
Citation Context: ...Canadian parliament proceedings in both English and French. Bilingual corpora have proven useful in many tasks, including machine translation (Brown et al., 1990; Sadler, 1989), sense disambiguation (Brown et al., 1991a; Dagan et al., 1991; Gale et al., 1992), and bilingual lexicography (Klavans and Tzoukermann, 1990; Warwick and Russell, 1990). For example, a bilingual corpus can be used to automatically construct...

189 | Adaptive statistical language modeling: A maximum entropy approach
- Rosenfeld
- 1994
Citation Context: ...a fixed w_{i-n+1}^{i-1}, as opposed to counts in the global n-gram distribution. This observation is taken advantage of in dynamic language modeling (Kuhn, 1988; Rosenfeld and Huang, 1992; Rosenfeld, 1994a). [Figure: bucketing value vs. bigram model; shape predicted by linear interpolation; church-gale]

188 | Modeling by the shortest data description. Automatica-J.IFAC
- Rissanen
- 1978

160 | Natural Language Parsing as Statistical Pattern Recognition
- Magerman
- 1994
Citation Context: ...ssible values of Σ_{w_i} c(w_{i-n+1}^i) are bucketed. If the last bucket has fewer than c_min counts, we merge it with the preceding bucket. Historically, this process is called the wall of bricks (Magerman, 1994). We use separate buckets for each n-gram model being interpolated. In performing this bucketing, we create an array containing how many n-grams occur for each value of Σ_{w_i} c(w_{i-n+1}^i) up to...

141 | Prepositional phrase attachment through a backed-off model
- Collins, Brooks
- 1995
Citation Context: ...2.6 Discussion: Smoothing is a fundamental technique for statistical modeling, important not only for language modeling but for many other applications as well, e.g., prepositional phrase attachment (Collins and Brooks, 1995), part-of-speech tagging (Church, 1988), and stochastic parsing (Magerman, 1994). Whenever data sparsity is an issue (and it always is), smoothing has the potential to improve performance with modera...

141 | Text-translation alignment
- Kay, Röscheisen
- 1993

131 | A Comparison of the Enhanced Good-Turing and Deleted Estimation Methods for Estimating
- Church, Gale
- 1991
Citation Context: ...ikelihood model. While smoothing is a central issue in language modeling, the literature lacks a definitive comparison between the many existing techniques. Previous studies (Nadas, 1984; Katz, 1987; Church and Gale, 1991; MacKay and Peto, 1995) only compare a small number of methods (typically two) on a single corpus and using a single training data size. As a result, it is currently difficult for a researcher to int...

125 | Aligning Sentences in Bilingual Corpora Using Lexical Information
- Chen
- 1993
Citation Context: ...e just need to enhance the move set. [Chapter 4, Aligning Sentences in Bilingual Text] In this chapter, we describe an algorithm for aligning sentences with their translations in a bilingual corpus (Chen, 1993). In experiments with the Hansard Canadian parliament proceedings, our algorithm yields significantly better accuracy than previous algorithms. In addition, it is efficient, robust, language-independ...

123 | Stochastic lexicalized tree-adjoining grammars
- Schabes
- 1992
Citation Context: ...s capture properties of language that n-gram models cannot. For example, it has been shown that grammatical language models can express long-distance dependencies (Lari and Young, 1990; Resnik, 1992; Schabes, 1992). Furthermore, grammatical models have the potential to be more compact while achieving equivalent performance as n-gram models (Brown et al., 1992b). To demonstrate these points, we introduce the gr...

119 | Two Languages are more Informative than One
- Dagan, Itai, et al.
- 1991
Citation Context: ...roceedings in both English and French. Bilingual corpora have proven useful in many tasks, including machine translation (Brown et al., 1990; Sadler, 1989), sense disambiguation (Brown et al., 1991a; Dagan et al., 1991; Gale et al., 1992), and bilingual lexicography (Klavans and Tzoukermann, 1990; Warwick and Russell, 1990). For example, a bilingual corpus can be used to automatically construct a bilingual dictiona...

116 | Basic methods of probabilistic context-free grammars
- Jelinek, Lafferty, et al.
- 1992
Citation Context: ...f repetitions using the universal MDL prior. 3.5.4 Parsing: To calculate the most probable parse of a sentence given the current hypothesis grammar, we use a probabilistic chart parser (Younger, 1967; Jelinek et al., 1992). In chart parsing, one fills in a chart composed of cells, where each cell represents a span in the sentence to be parsed. If the sentence is composed of the words w_1 ... w_m, then...

114 | Using cognates to align sentences in bilingual corpora
- Simard, Foster, et al.
- 1992
Citation Context: ...me spelling in two different languages. For example, punctuation, numbers, and proper names generally have the same spellings in English and French. Such words are members of a class called cognates (Simard et al., 1992). Because identically spelled words can be recognized automatically and are frequently translations of each other, it is sensible to use this a priori information in initializing word bead frequencie...
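In the spirit of the cognate idea described here, identically spelled tokens can be harvested automatically to seed correspondence frequencies. The exact-match test and example sentences below are assumptions of this sketch; Simard et al. also allow near-matches, which this toy version ignores.

```python
# Crude cognate heuristic: tokens spelled identically in an English
# sentence and its French counterpart (punctuation, numbers, proper
# names) are likely mutual translations and can seed word-bead counts.

def cognate_pairs(english_tokens, french_tokens):
    """Return tokens spelled identically in both sentences, sorted."""
    return sorted(set(english_tokens) & set(french_tokens))

eng = ["Mr.", "Dupont", "spent", "$50", "in", "Ottawa", "."]
fra = ["M.", "Dupont", "a", "dépensé", "$50", "à", "Ottawa", "."]
print(cognate_pairs(eng, fra))  # ['$50', '.', 'Dupont', 'Ottawa']
```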

113 | Generalized Kraft Inequality and Arithmetic Coding
- Rissanen
- 1976
Citation Context: ...log₂(1/p) bits to code in the optimal coding (Shannon, 1948; Cover and King, 1978). In fact, this limit can be realized in practice with an efficient algorithm called arithmetic coding (Pasco, 1976; Rissanen, 1976). 3.2.2 Description Lengths: Now, let us return to the minimum description length principle and the meaning of a description. Recall that MDL states that one should minimize the sum of l(G), the lengt...
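The optimal code length log₂(1/p) mentioned here is what the thesis sums to obtain description lengths like l(O|G). A minimal sketch; the example probabilities are illustrative.

```python
# Optimal code length: an outcome of probability p costs log2(1/p)
# bits under an optimal code, a limit that arithmetic coding
# approaches in practice. Summing over a sequence of outcome
# probabilities gives the sequence's description length in bits.
import math

def code_length_bits(probs):
    """Total bits to encode a sequence of outcome probabilities."""
    return sum(math.log2(1 / p) for p in probs)

print(code_length_bits([0.5, 0.25, 0.25]))  # 1 + 2 + 2 = 5.0
```

This is also why the Huffman example earlier assigns a 1-bit codeword to the probability-1/2 outcome and 2-bit codewords to the probability-1/4 outcomes.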

97 | Best-first model merging for Hidden Markov Model induction
- Stolcke, Omohundro
- 1993

88 | Probabilistic Tree-Adjoining Grammar as a Framework for Statistical Natural Language
- Resnik
- 1992
Citation Context: ...ely such models capture properties of language that n-gram models cannot. For example, it has been shown that grammatical language models can express long-distance dependencies (Lari and Young, 1990; Resnik, 1992; Schabes, 1992). Furthermore, grammatical models have the potential to be more compact while achieving equivalent performance as n-gram models (Brown et al., 1992b). To demonstrate these points, we i...

86 | A Spelling Correction Program Based on a Noisy Channel Model
- Kernighan, Church, et al.
- 1990
Citation Context: ...i and Baltus, 1992), the channel can be interpreted as converting from text to image data instead of from text to speech, yielding the equation T = argmax_T p(T) p(image | T). In spelling correction (Kernighan et al., 1990), the channel can be interpreted as an imperfect typist that converts perfect text T to noisy text T_n with spelling mistakes, yielding T = argmax_T p(T) p(T_n | T). In machine translation (Brown et...

83 | A Hierarchical Dirichlet Language Model
- MacKay, Peto
- 1994
Citation Context: ...smoothing is a central issue in language modeling, the literature lacks a definitive comparison between the many existing techniques. Previous studies (Nadas, 1984; Katz, 1987; Church and Gale, 1991; MacKay and Peto, 1995) only compare a small number of methods (typically two) on a single corpus and using a single training data size. As a result, it is currently difficult for a researcher to intelligently choose betwe...

82 | Parsing a natural language using mutual information statistics
- Magerman, Marcus
- 1990

79 | Using bilingual materials to develop word sense disambiguation methods
- Gale, Church, et al.
- 1992
Citation Context: ...nglish and French. Bilingual corpora have proven useful in many tasks, including machine translation (Brown et al., 1990; Sadler, 1989), sense disambiguation (Brown et al., 1991a; Dagan et al., 1991; Gale et al., 1992), and bilingual lexicography (Klavans and Tzoukermann, 1990; Warwick and Russell, 1990). For example, a bilingual corpus can be used to automatically construct a bilingual dictionary. A bilingual dic...

69 | An estimate of an upper bound for the entropy of English
- Brown, Pietra, et al.
- 1992
Citation Context: ...likely to occur in language and match the acoustic signal well. The source-channel model can be extended to many other applications besides speech recognition by just varying the channel model used (Brown et al., 1992b). In optical character recognition and handwriting recognition (Hull, 1992; Srihari and Baltus, 1992), the channel can be interpreted as converting from text to image data instead of from text to sp...