## A Goodness Measure for Phrase Learning via Compression with the MDL Principle (1998)

Citations: 4 (2 self)

### BibTeX

```bibtex
@MISC{Kit98agoodness,
    author = {Chunyu Kit},
    title  = {A Goodness Measure for Phrase Learning via Compression with the MDL Principle},
    year   = {1998}
}
```

### Abstract

This paper reports our ongoing research on unsupervised language learning via compression within the MDL paradigm. It formulates an empirical information-theoretic measure, description length gain, for evaluating the goodness of guessing a sequence of words (or characters) as a phrase (or a word); the measure can be calculated straightforwardly following classic information theory. The paper also presents a best-first learning algorithm based on this measure. Experiments on phrase learning from POS tag sequences and on lexical learning from character sequences show promising results.
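As a concrete illustration of the description length gain (DLG) idea, the sketch below computes an empirical description length and the gain from extracting a candidate phrase. It is a minimal reconstruction, not the paper's implementation: it assumes the empirical formulation DL(X) = -Σ_{x∈V} c(x) log2(c(x)/|X|) that appears in the citation contexts below, and it charges for the extracted phrase by appending one spelled-out copy of it to the modified corpus; the function names and the single-pass replacement are illustrative choices.

```python
from collections import Counter
from math import log2

def dl(tokens):
    """Empirical description length: DL(X) = -sum_x c(x) * log2(c(x) / |X|)."""
    n = len(tokens)
    return -sum(c * log2(c / n) for c in Counter(tokens).values())

def dl_gain(tokens, phrase):
    """Gain from treating `phrase` (a tuple of tokens) as one new symbol:
    DL(X) minus DL of the corpus with the phrase replaced by a marker,
    plus one appended copy of the phrase as its definition (illustrative)."""
    k, out, i = len(phrase), [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + k]) == phrase:
            out.append(phrase)          # the fresh symbol standing for the phrase
            i += k
        else:
            out.append(tokens[i])
            i += 1
    return dl(tokens) - dl(out + list(phrase))

# One best-first step: pick the bigram with the highest gain.
corpus = "the big dog saw the big cat near the big dog".split()
bigrams = {tuple(corpus[i:i + 2]) for i in range(len(corpus) - 1)}
print(max(bigrams, key=lambda b: dl_gain(corpus, b)))
```

A best-first learner in the abstract's sense would repeat this greedy selection, re-deriving counts after each extraction.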

### Citations

8593 | Elements of Information Theory - Cover, Thomas - 1991 |

6079 | A mathematical theory of communication - Shannon - 1948 |

2831 | Adaptation in Natural and Artificial Systems - Holland - 1992 |
Citation Context: ...ic parameters for regular grammars (e.g., hidden Markov models (HMMs)) and SCFGs, respectively. There are also many other sophisticated algorithms to facilitate the searching, e.g., the genetic algorithm [Hol75] and the simulated annealing algorithm [KGV83]. However, no matter how sophisticated the search method in use is, the goodness criterion to guide the searching remains a critical issue. It is this ...

2109 | Building a large annotated corpus of English: The Penn Treebank - Marcus, Santorini, et al. - 1994 |
Citation Context: ... perplexity or likelihood. However, more concrete evaluations are expected to be based on the performance of learning from a large-scale naturally-occurring corpus like the Brown [FK82] or PTB corpus [MSM93]. [§3 Language Learning via Compression] The idea of learning via compression has been practised by researchers for a long time. A notable early discussion on the dual relationship between learning (i.e. ...

1687 | An Introduction to Kolmogorov Complexity and its Applications - Li, Vitanyi - 1997 |

1165 | Modeling By Shortest Data Description - Rissanen - 1978 |

947 | A method for the construction of minimum redundancy codes - Huffman - 1952 |

645 | Suffix arrays: a new method for on-line string searches - Manber, Myers - 1993 |

497 | Stochastic Complexity - Rissanen - 1989 |

374 | The estimation of stochastic context-free grammars using the inside-outside algorithm, Computer Speech and Language - Lari, Young - 1990 |
Citation Context: ... research. For example, Cook et al. [CRA76] explore a hill-climbing search for a grammar of a smaller weighted sum of grammar complexity and the discrepancy between grammar and corpus; Lari and Young [LY90], Carroll and Charniak [CC92a, CC92b] induce various types of probabilistic context-free grammar (PCFG) with the inside-outside algorithm [Baker79] for probabilistic parameter re-estimation; Brill et al. ...

357 | Frequency analysis of English usage: Lexicon and grammar - Francis, Kucera - 1982 |
Citation Context: ...easures like entropy, perplexity or likelihood. However, more concrete evaluations are expected to be based on the performance of learning from a large-scale naturally-occurring corpus like the Brown [FK82] or PTB corpus [MSM93]. [§3 Language Learning via Compression] The idea of learning via compression has been practised by researchers for a long time. A notable early discussion on the dual relationship ...

339 | Self-organized Language Modeling for Speech Recognition - Jelinek - 1990 |
Citation Context: ... $\log[PP(x_1 \cdots x_n)] = -\frac{1}{n}\sum_{i=1}^{n}\log p(x_i) = -\frac{1}{n}\sum_{x \in V} c(x)\log p(x) = \frac{1}{n}\mathrm{DL}(X)$ (13.4). Perplexity is an indication of the quality of a language model: a lower perplexity indicates a better model [Jel85]. The description length in (13.3) and its average (i.e., the empirical entropy) in (13.4) play the same role. [§5 Learning Algorithm] Following the DL calculation above, the description length gain of ...
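The reconstructed identity above is easy to verify numerically. The snippet below is a sketch assuming a unigram maximum-likelihood model p(x) = c(x)/n (the identity holds for any model p): the per-token sum equals the per-type sum, and perplexity is two to the power of that empirical entropy.

```python
from collections import Counter
from math import log2

corpus = "a b a c a b a".split()
n = len(corpus)
counts = Counter(corpus)
p = {x: c / n for x, c in counts.items()}   # assumed unigram ML model

# -1/n * sum_i log p(x_i), summed token by token ...
per_token = -sum(log2(p[x]) for x in corpus) / n
# ... equals -1/n * sum_{x in V} c(x) log p(x), summed type by type.
per_type = -sum(c * log2(p[x]) for x, c in counts.items()) / n
assert abs(per_token - per_type) < 1e-12

perplexity = 2 ** per_token   # lower perplexity indicates a better model [Jel85]
print(per_token, perplexity)
```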

324 | Human Behaviour and the Principle of Least Effort: an Introduction to Human Ecology - Zipf - 1949 |

292 | Inferring decision trees using the minimum description length principle - Quinlan, Rivest - 1989 |

269 | Trainable grammars for speech recognition - Baker - 1979 |
Citation Context: ... the discrepancy between grammar and corpus; Lari and Young [LY90], Carroll and Charniak [CC92a, CC92b] induce various types of probabilistic context-free grammar (PCFG) with the inside-outside algorithm [Baker79] for probabilistic parameter re-estimation; Brill et al. [BMMS90] derive phrase structures from a tagged corpus with generalised mutual information; and Brill and Marcus [BM92] attempt to induce binar...

227 | On the length of programs for computing finite binary sequences - Chaitin - 1966 |

171 | Adaptation in Natural and Artificial Systems - Holland - 1975 |

158 | The logical structure of linguistic theory - Chomsky - 1955 |

125 | Bayesian Learning of Probabilistic Language Models - Stolcke - 1994 |
Citation Context: ...scussion on the kind of issues involved in pure distribution analysis and on the disadvantages of the inside-outside algorithm for grammar induction. Recent work in language modelling such as Stolcke [Sto94] and Chen [Chen95, Chen96] is in the theoretical framework of Bayesian modelling. Basically, Stolcke's work follows Cook et al.'s [CRA76] paradigm of searching by hill-climbing, but guided by maximum...

120 | The Psycho-Biology of Language - Zipf - 1965 |

98 | Two experiments on learning probabilistic dependency grammars from corpora - Carroll, Charniak - 1992 |

89 | Stochastic Complexity in Statistical Inquiry, World Scientific - Rissanen - 1989 |

64 | Building Probabilistic Models for Natural Language - Chen - 1996 |

61 | Deducing linguistic structure from the statistics of large corpora - Brill, Magerman, et al. - 1990 |
Citation Context: ..., Carroll and Charniak [CC92a, CC92b] induce various types of probabilistic context-free grammar (PCFG) with the inside-outside algorithm [Baker79] for probabilistic parameter re-estimation; Brill et al. [BMMS90] derive phrase structures from a tagged corpus with generalised mutual information; and Brill and Marcus [BM92] attempt to induce binary branching phrases with distribution analysis using the informat...

54 | Language Acquisition, Data Compression and Generalisation - Wolff - 1982 |

53 | Bayesian grammar induction for language modeling - Chen - 1995 |

49 | A new method of n-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese - Nagao, Mori - 1994 |

49 | A formal theory of inductive inference: Part 1 and 2 - Solomonoff - 1964 |
Citation Context: ...ractised by researchers for a long time. A notable early discussion on the dual relationship between learning (i.e., detecting regularities in data) and compression is in Solomonoff's pioneering work [Solo64]. Some earlier related thoughts may be traced back to, for instance, Chomsky's considerations of the simplicity of grammar [Chom55, Chom57] and Zipf's Principle of Least Effort [Zipf35, Zipf49]. Occam...

41 | Automatically acquiring phrase structure using distributional analysis - Brill, Marcus - 1992 |
Citation Context: ...de-outside algorithm [Baker79] for probabilistic parameter re-estimation; Brill et al. [BMMS90] derive phrase structures from a tagged corpus with generalised mutual information; and Brill and Marcus [BM92] attempt to induce binary branching phrases with distribution analysis using the information-theoretical measure divergence, derived from relative entropy; among many others. They have achieved signif...

39 | A method for the construction of minimum-redundancy codes - Huffman - 1979 |

26 | Learning probabilistic dependency grammars from labelled text - Carroll, Charniak - 1992 |

25 | On the length of programs for computing finite binary sequences, Journal of the Association for Computing Machinery 13 - Chaitin - 1966 |

24 | Language acquisition and the discovery of phrase structure - Wolff |

22 | Grammatical inference by hill climbing - Cook, Rosenfeld, et al. - 1976 |
Citation Context: ...ing phrases from n-grams. However, a more crucial issue in inducing phrases is a sound criterion to guide the guessing. Many criteria have been explored in previous research. For example, Cook et al. [CRA76] explore a hill-climbing search for a grammar of a smaller weighted sum of grammar complexity and the discrepancy between grammar and corpus; Lari and Young [LY90], Carroll and Charniak [CC92a, CC92b]...

21 | Three approaches for defining the concept of information quantity - Kolmogorov - 1965 |

20 | A New Method for Discovering the Grammars of Phrase Structure Languages - Solomonoff - 1959 |

13 | New techniques for context modeling - Ristad, Thomas - 1995 |
Citation Context: ...pus are reported to be 1.92 bits/character on training data and 2.04 bits/character on test data, omitting the grammar (i.e., the lexical items learned) and some other overheads. Ristad and Thomas' [RT95] non-monotonic context modelling with the MDL principle achieves a message entropy of 1.97 bits/character on the Brown corpus, leaving the model aside. We expect the ongoing optimal chunking algorithm to ach...

12 | Language acquisition in the MDL framework - Rissanen, Ristad - 1994 |

11 | Segmenting speech without a lexicon: Evidence for a bootstrapping model of lexical acquisition - Cartwright, Brent - 1994 |

11 | Segmenting speech without a lexicon: The roles of phonotactics and speech source - Cartwright, Brent - 1994 |

10 | The mechanization of linguistic learning - Solomonoff - 1958 |

10 | A formal theory of inductive inference - Solomonoff - 1964 |

6 | Three approaches for defining the concept of 'information quantity' - Kolmogorov - 1965 |

3 | The unsupervised acquisition of a lexicon from continuous speech - de Marcken - 1995 |
Citation Context: ...i.e., the lexical items learned). This shows that our algorithm compares favourably with other researchers' work. The best result reported before in this direction is de Marcken's MDL lexicon learner [deMa95b] with a Baum-Welch algorithm to estimate probabilities. The entropy rates it achieves in lexical learning from the Brown corpus are reported to be 1.92 bits/character on training data and 2.04 bits/chara...

3 | Linguistic structure as composition and perturbation - de Marcken - 1996 |

3 | Optimization by simulated annealing - Kirkpatrick, Gelatt, et al. - 1983 |
Citation Context: ...idden Markov models (HMMs)) and SCFGs, respectively. There are also many other sophisticated algorithms to facilitate the searching, e.g., the genetic algorithm [Hol75] and the simulated annealing algorithm [KGV83]. However, no matter how sophisticated the search method in use is, the goodness criterion to guide the searching remains a critical issue. It is this ...

3 | Speeding up the Virtual Corpus approach to deriving and retrieving n-grams for any n from large-scale corpora - Kit - 1995 |
Citation Context: ...nce is as a phrase. The guessing needs to consider all n-gram items in the corpus. Although n-grams of arbitrary lengths in a large-scale corpus are known to be huge in number, the Virtual Corpus (VC) [Kit95], based on the suffix array data structure [MM90, NM94] and a bucket-radix sort, can be employed as a fairly efficient approach to handling them, including counting and, more importantly, storing and retr...
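To make the suffix-array side of this concrete: once the corpus's suffixes are sorted, all occurrences of a given n-gram occupy one contiguous band of the suffix array, so counting reduces to two binary searches. The toy sketch below illustrates only that property; the naive construction and the materialised key list are simplifications (Kit's Virtual Corpus uses a bucket-radix sort and avoids such costs), and the function names are assumptions.

```python
from bisect import bisect_left, bisect_right

def suffix_array(tokens):
    """Sorted start indices of all suffixes (naive build; fine for a demo)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def ngram_count(tokens, sa, ngram):
    """Occurrences of `ngram`: suffixes sharing its prefix form one contiguous
    band of the suffix array, located with two binary searches."""
    k = len(ngram)
    keys = [tuple(tokens[i:i + k]) for i in sa]   # first k tokens of each suffix
    return bisect_right(keys, tuple(ngram)) - bisect_left(keys, tuple(ngram))

corpus = "the cat saw the cat and the dog".split()
sa = suffix_array(corpus)
print(ngram_count(corpus, sa, ("the", "cat")))   # -> 2
```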

3 | A new method for discovering the grammars of phrase structure languages - Solomonoff - 1959 |

3 | The mechanization of linguistic learning - Solomonoff - 1960 |

3 | The Psycho-Biology of Language, Houghton Mifflin - Zipf - 1935 |