## Bayesian grammar induction for language modeling (1995)

### Other Repositories/Bibliography

Venue: In Proceedings of ACL

Citations: 52 (1 self)

### BibTeX

@INPROCEEDINGS{Chen95bayesiangrammar,
    author = {Stanley F. Chen},
    title = {Bayesian grammar induction for language modeling},
    booktitle = {In Proceedings of ACL},
    year = {1995}
}

### Abstract

We describe a corpus-based induction algorithm for probabilistic context-free grammars. The algorithm employs a greedy heuristic search within a Bayesian framework, and a post-pass using the Inside-Outside algorithm. We compare the performance of our algorithm to n-gram models and the Inside-Outside algorithm in three language modeling tasks. In two of the tasks, the training data is generated by a probabilistic context-free grammar and in both tasks our algorithm outperforms the other techniques. The third task involves naturally-occurring data, and in this task our algorithm does not perform as well as n-gram models but vastly outperforms the Inside-Outside algorithm.
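
One hedged reading of the "Bayesian framework" in the abstract, using the prior $p(G) = 2^{-l(G)}$ that is quoted in the citation contexts on this page (with $l(G)$ the description length of grammar $G$ in bits). The assumption here is that the search objective is the posterior of $G$ given the training data; the symbols $O$ for the training data and $G^{*}$ for the selected grammar are notation introduced for this sketch, not taken from the page. Under that assumption, Bayes' rule turns posterior maximization into a description-length trade-off:

```latex
\begin{align*}
G^{*} &= \arg\max_{G}\; p(G \mid O)
       = \arg\max_{G}\; p(G)\, p(O \mid G) \\
      &= \arg\max_{G}\; 2^{-l(G)}\, p(O \mid G)
       = \arg\min_{G}\; \bigl[\, l(G) - \log_2 p(O \mid G) \,\bigr]
\end{align*}
```

Read this way, a larger grammar is preferred only when the bits it adds to $l(G)$ are repaid by a shorter encoding of the data, $-\log_2 p(O \mid G)$.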

### Citations

8090 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ... successful in language modeling include the Inside-Outside algorithm (Lari and Young, 1990; Lari and Young, 1991; Pereira and Schabes, 1992), a special case of the Expectation-Maximization algorithm (Dempster et al., 1977), and work by McCandless and Glass (1993). In the latter work, McCandless uses a heuristic search procedure similar to ours, but a very different search criteria. To our knowledge, neither algorithm ... |

1682 | An Introduction to Kolmogorov Complexity and its Applications - Li, Vitányi - 1997 |

1160 | Modeling by shortest data description
- Rissanen
- 1978
Citation Context ... grammars. In particular, Solomonoff proposes the use of the universal a priori probability (Solomonoff, 1960), which is closely related to the minimum description length principle later proposed by (Rissanen, 1978). In the case of grammatical language modeling, this corresponds to taking $p(G) = 2^{-l(G)}$ where $l(G)$ is the length of the description of the grammar in bits. The universal a priori probability h... |

697 | Class-based n-gram models of natural language - Brown - 1992 |

404 | A formal theory of inductive inference - Solomonoff - 1964 |

391 | A Maximum Likelihood Approach to Continuous Speech Recognition - Bahl, Jelinek, et al. - 1983 |

373 | The estimation of stochastic context-free grammars using the Inside–Outside algorithm. Computer Speech and Language
- Lari, Young
- 1990
Citation Context ...e is grammar-based language models. Language models expressed as a probabilistic grammar tend to be more compact than n-gram language models, and have the ability to model long-distance dependencies (Lari and Young, 1990; Resnik, 1992; Schabes, 1992). However, to date there has been little success in constructing grammar-based language models competitive with n-gram models in problems of any magnitude. In this paper,... |

339 | Prediction and Entropy of Printed English
- Shannon
- 1951
Citation Context ...er, 1975; Kernighan et al., 1990; Srihari and Baltus, 1992). However, static language modeling performance has remained basically unchanged since the advent of n-gram language models forty years ago (Shannon, 1951). Yet, n-gram language models can only capture dependencies within an n-word window, where currently the largest practical n for natural language is three, and many dependencies in natural language o... |

336 | Interpolated estimation of Markov source parameters from sparse data
- Jelinek, Mercer
- 1980
Citation Context ...models, we tried $n = 1, \ldots, 10$ for each domain. For smoothing a particular n-gram model, we took a linear combination of all lower order n-gram models. In particular, we follow standard practice (Jelinek and Mercer, 1980; Bahl et al., 1983; Brown et al., 1992) and take the smoothed i-gram probability to be a linear combination of the i-gram frequency in the training data and the smoothed $(i-1)$-gram probability... |

306 | Inductive inference: theory and methods
- Angluin, Smith
- 1983
Citation Context ...a post-pass using the Inside-Outside algorithm. Grammar Induction as Search Grammar induction can be framed as a search problem, and has been framed as such almost without exception in past research (Angluin and Smith, 1983). The search space is taken to be some class of grammars; for example, in our work we search within the space of probabilistic context-free grammars. The objective function is taken to be some measure... |

272 | Inside-Outside Reestimation from Partially Bracketed Corpora
- Pereira, Schabes
- 1992
Citation Context ...ficient enough to be applied to small data sets. The grammar induction algorithms most successful in language modeling include the Inside-Outside algorithm (Lari and Young, 1990; Lari and Young, 1991; Pereira and Schabes, 1992), a special case of the Expectation-Maximization algorithm (Dempster et al., 1977), and work by McCandless and Glass (1993). In the latter work, McCandless uses a heuristic search procedure similar t... |

268 | Trainable grammars for speech recognition
- Baker
- 1979
Citation Context ...ls in problems of any magnitude. In this paper, we describe a corpus-based induction algorithm for probabilistic context-free grammars that outperforms n-gram models and the Inside-Outside algorithm (Baker, 1979) in medium-sized domains. This result marks the first time a grammar-based language model has surpassed n-gram modeling in a task of at least moderate size. The algorithm employs a greedy heuristic s... |

214 | An Inequality with Applications to Statistical Estimation for Probabilistic Functions of a Markov Process and to a Model for Ecology
- Baum, Eagon
- 1967
Citation Context ...$\mid w_{i-2} \cdots w_{-1})$ where $c(W)$ denotes the count of the word sequence $W$ in the training data. The smoothing parameters $\lambda_{i,c}$ are trained through the Forward-Backward algorithm (Baum and Eagon, 1967) on held-out data. Parameters $\lambda_{i,c}$ are tied together for similar $c$ to prevent data sparsity. For the Inside-Outside algorithm, we follow the methodology described by Lari and Young. For a given n, we ... |

170 | Modeling by the shortest data description. Automatica - Rissanen - 1978 |

120 | Stochastic Lexicalized Tree-Adjoining Grammars
- Schabes
- 1992
Citation Context ...Language models expressed as a probabilistic grammar tend to be more compact than n-gram language models, and have the ability to model long-distance dependencies (Lari and Young, 1990; Resnik, 1992; Schabes, 1992). However, to date there has been little success in constructing grammar-based language models competitive with n-gram models in problems of any magnitude. In this paper, we describe a corpus-based i... |

84 | Probabilistic tree-adjoining grammar as a framework for statistical natural language processing
- Resnik
- 1992
Citation Context ...guage models. Language models expressed as a probabilistic grammar tend to be more compact than n-gram language models, and have the ability to model long-distance dependencies (Lari and Young, 1990; Resnik, 1992; Schabes, 1992). However, to date there has been little success in constructing grammar-based language models competitive with n-gram models in problems of any magnitude. In this paper, we describe a... |

81 | A spelling correction program based on a noisy channel model - Kernighan, Church, et al. - 1990 |

77 | The Dragon System - an Overview
- Baker
- 1975
Citation Context ...ze. Introduction In applications such as speech recognition, handwriting recognition, and spelling correction, performance is limited by the quality of the language model utilized (Bahl et al., 1978; Baker, 1975; Kernighan et al., 1990; Srihari and Baltus, 1992). However, static language modeling performance has remained basically unchanged since the advent of n-gram language models forty years ago (Shannon,... |

49 | A preliminary report on a general theory of inductive inference
- Solomonoff
- 1960
Citation Context ...satisfy the goal of favoring smaller grammars by choosing a prior that assigns higher probabilities to such grammars. In particular, Solomonoff proposes the use of the universal a priori probability (Solomonoff, 1960), which is closely related to the minimum description length principle later proposed by (Rissanen, 1978). In the case of grammatical language modeling, this corresponds to taking $p(G) = 2^{-l(G)}$... |

43 | Applications of Stochastic Context-Free Grammars Using the Inside–Outside Algorithm. Computer Speech and Language
- Lari, Young
- 1991
Citation Context ...algorithms are only efficient enough to be applied to small data sets. The grammar induction algorithms most successful in language modeling include the Inside-Outside algorithm (Lari and Young, 1990; Lari and Young, 1991; Pereira and Schabes, 1992), a special case of the Expectation-Maximization algorithm (Dempster et al., 1977), and work by McCandless and Glass (1993). In the latter work, McCandless uses a heuristic... |

10 | Recognition of a continuously read natural corpus
- Bahl, Baker, et al.
- 1978
Citation Context ...t least moderate size. Introduction In applications such as speech recognition, handwriting recognition, and spelling correction, performance is limited by the quality of the language model utilized (Bahl et al., 1978; Baker, 1975; Kernighan et al., 1990; Srihari and Baltus, 1992). However, static language modeling performance has remained basically unchanged since the advent of n-gram language models forty years ... |

8 | A Maximum Likelihood Approach to Continuous Speech Recognition - Bahl, Jelinek, Mercer - 1983 |

7 | Combining statistical and syntactic methods in recognizing handwritten sentences. AAAI Symposium: Probabilistic Approaches to Natural Language
- Srihari, Baltus
- 1992
Citation Context ... as speech recognition, handwriting recognition, and spelling correction, performance is limited by the quality of the language model utilized (Bahl et al., 1978; Baker, 1975; Kernighan et al., 1990; Srihari and Baltus, 1992). However, static language modeling performance has remained basically unchanged since the advent of n-gram language models forty years ago (Shannon, 1951). Yet, n-gram language models can only captu... |

6 | Empirical acquisition of word and phrase classes in the ATIS domain - McCandless, Glass - 1993 |

1 | Class-based n-gram models of natural language - Brown, et al. - 1992 |

1 | Grammatical inference by hill climbing - Cook, Rosenfeld, Aronson - 1976 |

1 | A preliminary report on a general theory of inductive inference - Solomonoff - 1960 |

1 | Inductive inference: theory and methods. ACM Computing Surveys, 15:237–269 - Angluin, Smith - 1983 |
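
The Jelinek and Mercer citation context above describes the n-gram baseline's smoothing: the smoothed i-gram probability is a linear combination of the i-gram frequency in the training data and the smoothed (i-1)-gram probability. The following is a minimal sketch of that recursion, not the paper's implementation. Its assumptions: a single fixed weight `lam` stands in for the per-order, count-tied parameters that the paper trains on held-out data, the recursion bottoms out in a uniform distribution over the vocabulary, and the function names `count_ngrams` and `smoothed_prob` are hypothetical.

```python
from collections import Counter


def count_ngrams(tokens, max_n):
    """Count every n-gram up to max_n, plus how often each context is extended."""
    ngrams, contexts = Counter(), Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1  # context = the n-gram minus its last word
    return ngrams, contexts


def smoothed_prob(word, context, ngrams, contexts, vocab, lam=0.7):
    """Interpolated estimate of P(word | context); context is a tuple of words.

    ASSUMPTION: `lam` is one fixed weight for every order. The paper instead
    trains separate weights on held-out data.
    """
    if context:
        # Back off by dropping the oldest word of the context.
        lower = smoothed_prob(word, context[1:], ngrams, contexts, vocab, lam)
    else:
        # Below the unigram level, fall back to a uniform distribution.
        lower = 1.0 / len(vocab)
    seen = contexts[context]
    if seen == 0:
        # Context never observed: rely entirely on the lower-order estimate.
        return lower
    freq = ngrams[context + (word,)] / seen
    return lam * freq + (1.0 - lam) * lower


tokens = "a b a b a c".split()
vocab = sorted(set(tokens))
ngrams, contexts = count_ngrams(tokens, max_n=2)
for w in vocab:
    print(w, round(smoothed_prob(w, ("a",), ngrams, contexts, vocab), 4))
```

Because each context count is the number of times that context was actually extended, the relative frequencies at every order sum to one, so the interpolated estimates form a proper distribution over the vocabulary for any `lam` between 0 and 1.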