## Unsupervised Lexical Learning as Inductive Inference (2000)

Citations: 11 (6 self)

### BibTeX

```bibtex
@MISC{Kit00unsupervisedlexical,
  author = {Chun Yu Kit},
  title  = {Unsupervised Lexical Learning as Inductive Inference},
  year   = {2000}
}
```


### Abstract

To learn a language, learners must first learn its words, the essential building blocks of utterances. The difficulty in learning words lies in the absence of explicit word boundaries in speech input. Learners must therefore infer lexical items with some innately endowed learning mechanism(s) for regularity detection, since regularities in speech normally indicate word patterns. Following Zipf's least-effort principle and Chomsky's thoughts on the minimality of grammar for human language, we hypothesise a cognitive mechanism underlying language learning that seeks the least-effort representation for input data. Accordingly, lexical learning amounts to inferring the minimal-cost representation for the input, under the constraint of permissible representations for lexical items. The main theme of this thesis is to examine how far this learning mechanism can go in unsupervised lexical learning from real language data without any pre-defined (e.g., prosodic and phonotactic) cues, resting entirely on statistical induction of structural patterns towards the most economical representation of the data. We first review ...

### Citations

9231 | Elements of Information Theory - Cover, Thomas - 1990 |
Citation Context: "...ation with a delimiter inserted in between, and DL(·) is the empirical description length that can be estimated by the Shannon-Fano code or Huffman code as below, following classic information theory [23, 7]: $DL(X) = |X|\hat{H}(X) = -|X|\sum_{x \in V}\hat{p}(x)\log_2\hat{p}(x) = -\sum_{x \in V} c(x)\log_2\frac{c(x)}{|X|}$, where $|\cdot|$ denotes the length of a corpus, $c(\cdot)$ the frequency of a token in the corpus, and $\hat{p}(\cdot)$ ... Straightforwardly, the avera..."
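The description-length estimate quoted in this context reduces to a sum over token frequencies, since $\hat{p}(x) = c(x)/|X|$. A minimal sketch in Python (the function name `description_length` is mine, not from the thesis):

```python
import math
from collections import Counter

def description_length(tokens):
    """Empirical description length DL(X) = |X| * H^(X)
    = -sum over x in V of c(x) * log2(c(x) / |X|),
    where c(x) is the corpus frequency of token x
    and |X| is the corpus length in tokens."""
    n = len(tokens)
    counts = Counter(tokens)
    return -sum(c * math.log2(c / n) for c in counts.values())

# A corpus of 8 tokens drawn uniformly from {a, b} costs 8 bits.
dl = description_length(list("abababab"))
```

This is the empirical lower bound that Shannon-Fano or Huffman coding approaches; an actual Huffman code rounds codeword lengths up to whole bits.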

9054 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

7146 | A mathematical theory of communication - Shannon - 1948 |

4597 | A tutorial on hidden Markov models and selected applications in speech processing - Rabiner - 1989 |

3933 | Optimization by simulated annealing - Kirkpatrick, Gelatt, et al. - 1983 |

3037 | Adaptation in Natural and Artificial Systems - Holland - 1975 |

2235 | Building a Large Annotated Corpus for English: The Penn Treebank - Marcus, Santorini, et al. - 1993 |

1939 | The minimalist program - Chomsky - 1995 |
Citation Context: "...the minimally necessary innate structure (and ability) for the language learning faculty, scattered in his linguistic theories from [5] to [6]. One may take this as a piece of evidence against his assumption of a powerful universal grammar, if unsupervised learning starting from this initial point could succeed. Our purpose here, however, is ..."

1782 | An Introduction to Kolmogorov Complexity and Its Applications, 2nd edn - Li, Vitányi - 1997 |
Citation Context: "...the one with less initial ability (and knowledge) indicates more effective learning. The theoretical inspiration for this research comes from algorithmic information (or Kolmogorov complexity) theory [24, 13, 4, 14], including (1) Solomonoff's inductive inference theory [24], the first of the three origins of algorithmic information theory, (2) the MDL and MML principles by Rissanen [20, 21, 22] and by Wall..."

1730 | Aspects of the theory of syntax - Chomsky - 1965 |

1246 | Modeling by shortest data description - Rissanen - 1978 |
Citation Context: "...theory [24, 13, 4, 14], including (1) Solomonoff's inductive inference theory [24], the first of the three origins of algorithmic information theory, (2) the MDL and MML principles by Rissanen [20, 21, 22] and by Wallace and colleagues [29, 30], respectively, (3) Vitányi and Li's formulation of the ideal MDL in terms of Kolmogorov complexity [27, 28] and, in particular, their idea (or "intuition", i..."

1221 | A universal algorithm for sequential data compression - Ziv, Lempel - 1977 |

1053 | A method for the construction of minimum redundancy codes - Huffman - 1952 |

1015 | Computational analysis of present-day American English - Kučera, Francis - 1967 |

936 | The Ant System: Optimization by a colony of cooperating agents - Dorigo, Maniezzo, et al. - 1996 |

872 | Accurate methods for the statistics of surprise and coincidence - Dunning - 1993 |

842 | A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The annals of mathematical statistics - Baum, Soules, et al. - 1970 |

738 | Class-Based N-Gram Models of Natural Language - Brown, Della Pietra, et al. - 1992 |

716 | A stochastic parts program and noun phrase parser for unrestricted text - Church - 1988 |

702 | An efficient context-free parsing algorithm - Earley - 1970 |

677 | An Introduction to Tree Adjoining Grammars - Joshi - 1987 |

673 | Suffix arrays: A new method for on-line string searches - Manber, Myers - 1993 |
Citation Context: "...the corpus. Although n-grams of arbitrary lengths in a large-scale corpus are known to be huge in number, we have developed the Virtual Corpus (VC) system [11], based on the suffix array data structure [17], as a fairly efficient approach to handling them, including counting, storing and retrieval. ... With the aid of the DLG formulated above, unsupervised lexical learning becomes an optimal..."
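The context above only names the suffix-array idea behind the Virtual Corpus system; the thesis does not spell out the implementation here. A toy sketch of the underlying counting trick, assuming nothing about the VC system itself: all suffixes sharing an n-gram as a prefix form one contiguous run in the sorted suffix array, so arbitrary-length n-grams can be counted by binary search.

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(text):
    # Naive O(n^2 log n) construction, for illustration only;
    # Manber and Myers' algorithm builds the same array in O(n log n).
    return sorted(range(len(text)), key=lambda i: text[i:])

def ngram_count(text, sa, pattern):
    # Suffixes starting with `pattern` are contiguous in sorted order,
    # so two binary searches bound the run of matches.
    suffixes = [text[i:] for i in sa]  # materialised for clarity only
    lo = bisect_left(suffixes, pattern)
    hi = bisect_right(suffixes, pattern + "\U0010FFFF")
    return hi - lo

text = "banana"
sa = build_suffix_array(text)
n_ana = ngram_count(text, sa, "ana")  # "ana" occurs twice in "banana"
```

A production version would binary-search over suffix indices directly instead of materialising the suffix strings, which is what makes the structure practical for large corpora.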

644 | Modeling for text compression - Bell, Witten, et al. - 1989 |

550 | Lexicalfunctional grammar: A formal system for grammatical representation - Bresnan, Kaplan - 1982 |

541 | Stochastic complexity - Rissanen - 1987 |

514 | Syntactic Structures - Chomsky - 1957 |

511 | Statistical learning by 8-month-old infants - Saffran, Aslin, et al. - 1996 |

472 | On learning the past tenses of English verbs - Rumelhart, McClelland - 1986 |

467 | Frequency Analysis of English Usage: Lexicon and Grammar - Francis, Kučera - 1982 |

458 | A technique for high performance data compression - Welch - 1984 |

453 | Language Learnability and Language Development - Pinker - 1984 |

435 | A universal prior for integers and estimation by minimum description length - Rissanen - 1983 |

411 | The population frequencies of species and the estimation of population parameters - Good - 1953 |

395 | Statistical inference for probabilistic functions of finite state markov chains - Baum, Petrie - 1966 |

391 | The TRACE model of speech perception - McClelland, Elman - 1986 |

382 | The estimation of stochastic context-free grammars using the Inside-Outside algorithm, Computer Speech and Language - Lari, Young - 1990 |

361 | Self-organized language modeling for speech recognition - Jelinek - 1990 |

360 | Human Behaviour and the Principle of Least Effort - Zipf - 1949 |
Citation Context: "...l procedure for compressing the input data. As in other more cognition-oriented studies on language learning, we also adopt the assumption that lexical learning follows the least-effort principle [31] to learn from natural language data generated by human language behaviours, which are observed to be governed by the least-effort principle in language production. However, instead of interpreting t..."

356 | Data compression using adaptive coding and partial string matching - Cleary, Witten - 1984 |

341 | Text Algorithms - Crochemore, Rytter - 1994 |

331 | The Mental Representation of Grammatical Relations - Bresnan - 1982 |

323 | An Information Measure for Classification - Wallace, Boulton - 1968 |
Citation Context: "...Solomonoff's inductive inference theory [24], the first of the three origins of algorithmic information theory, (2) the MDL and MML principles by Rissanen [20, 21, 22] and by Wallace and colleagues [29, 30], respectively, (3) Vitányi and Li's formulation of the ideal MDL in terms of Kolmogorov complexity [27, 28] and, in particular, their idea (or "intuition", in their own terms) on how to conduct in..."

307 | Inferring decision trees using the minimum description length principle. Information and Computation 80:227-248 - Quinlan, Rivest - 1989 |

288 | The discovery of spoken language - Jusczyk - 1997 |

280 | Natural language and natural selection - Pinker, Bloom - 1990 |

278 | Trainable grammars for speech recognition - Baker - 1979 |

278 | Inside-outside reestimation from partially bracketed corpora - Pereira, Schabes - 1992 |

272 | Cross-language speech perception: Evidence for perceptual reorganization during the first year of life," Infant Behaviour and Development - Werker, Tees - 1984 |

247 | The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression,” in - Witten, Bell - 1991 |

243 | On the length of programs for computing finite binary sequences: statistical considerations - Chaitin - 1969 |