Results 1 - 8 of 8
An efficient, probabilistically sound algorithm for segmentation and word discovery
- Machine Learning, 1999
- Cited by 103 (2 self)
This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word-order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on words; instead, it attempts to calculate the prior probabilities of various word sequences that could underlie the observed text. Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that our algorithm is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances.
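To make the boundary-recovery task concrete, here is a minimal Python sketch. It is not the paper's algorithm: the abstract describes a model that scores whole corpora under a prior and does not assume a known lexicon. The sketch instead assumes a hypothetical, already-known lexicon with unigram word probabilities (`toy_lexicon` is invented for illustration) and finds the most probable segmentation by dynamic programming.

```python
import math

def segment(text, word_probs, max_len=10):
    """Return the most probable split of `text` into words from `word_probs`.

    Assumes words are generated independently (a unigram model), which is a
    simplification and not the model described in the abstract.
    Returns None if no segmentation into known words exists.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_probs and best[start] > -math.inf:
                score = best[start] + math.log(word_probs[word])
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return None
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

# Hypothetical lexicon, invented for this example.
toy_lexicon = {"the": 0.3, "dog": 0.2, "do": 0.1, "g": 0.01,
               "gone": 0.1, "is": 0.2, "t": 0.01, "he": 0.08}

print(segment("thedogisgone", toy_lexicon))
```

The unsupervised setting in the abstract is harder than this sketch suggests, because the lexicon itself is unknown and must be inferred together with the boundaries.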
The Unsupervised Acquisition of a Lexicon from Continuous Speech
- MIT Artificial Intelligence Lab, 1995
- Cited by 31 (2 self)
We present an unsupervised learning algorithm that acquires a natural-language lexicon from raw speech. The algorithm is based on the optimal encoding of symbol sequences in an MDL framework, and uses a hierarchical representation of language that overcomes many of the problems that have stymied previous grammar-induction procedures. The forward mapping from symbol sequences to the speech stream is modeled using features based on articulatory gestures. We present results on the acquisition of lexicons and language models from raw speech, text, and phonetic transcripts, and demonstrate that our algorithm compares very favorably to other reported results with respect to segmentation performance and statistical efficiency.
Lexical Heads, Phrase Structure and the Induction of Grammar
- In Third Workshop on Very Large Corpora, 1995
- Cited by 16 (2 self)
This paper examines why some previous approaches have failed to acquire desired grammars without supervision, and proposes that, with a different conception of phrase structure, supervision might not be necessary. In particular, we examine some reasons why SCFGs are poor models to use for learning human language, especially when combined with the inside-outside algorithm. We argue that head-driven grammatical formalisms like dependency grammars (Mel'čuk, 1988) or link grammars (Sleator and Temperley, 1991) are better suited to the task
Unsupervised Lexical Learning as Inductive Inference
- 2000
- Cited by 8 (4 self)
To learn a language, learners must first learn its words, the essential building blocks of utterances. The difficulty in learning words lies in the unavailability of explicit word boundaries in speech input. Learners have to infer lexical items with some innately endowed learning mechanism(s) for regularity detection: regularities in the speech normally indicate word patterns. With respect to Zipf's least-effort principle and Chomsky's thoughts on the minimality of grammar for human language, we hypothesise a cognitive mechanism underlying language learning that seeks the least-effort representation for input data. Accordingly, lexical learning is to infer the minimal-cost representation for the input under the constraint of permissible representation for lexical items. The main theme of this thesis is to examine how far this learning mechanism can go in unsupervised lexical learning from real language data without any pre-defined (e.g., prosodic and phonotactic) cues, resting entirely on statistical induction of structural patterns for the most economical representation of the data. We first review
Linguistic Structure as Composition and Perturbation
- In Meeting of the Association for Computational Linguistics, 1996
- Cited by 7 (0 self)
This paper discusses the problem of learning language from unprocessed text and speech signals, concentrating on the problem of learning a lexicon. In particular, it argues for a representation of language in which linguistic parameters like words are built by perturbing a composition of existing parameters. The power of the representation is demonstrated by several examples in text segmentation and compression, acquisition of a lexicon from raw speech, and the acquisition of mappings between text and artificial representations of meaning.
A Goodness Measure for Phrase Learning via Compression with the MDL Principle
- In The ESSLLI-98 Student Session, Chapter 13, 1998
- Cited by 4 (2 self)
This paper reports our ongoing research on unsupervised language learning via compression within the MDL paradigm. It formulates an empirical information-theoretical measure, description length gain, for evaluating the goodness of guessing a sequence of words (or characters) as a phrase (or a word), which can be calculated easily following classic information theory. The paper also presents a best-first learning algorithm based on this measure. Experiments on phrase and lexical learning from POS tag and character sequence, respectively, show promising results.
1 Introduction
Grammar induction from a naturally-occurring text corpus is in the general domain of inferring a theory (or model) to account for observed data. Practical techniques for grammar induction have many important applications for a wide range of natural language (NL) and speech processing tasks. Researchers have devoted tremendous effort to the research in the past decades. Inferring a probabilistic grammar from NL d...
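The abstract of this entry names a description length gain measure but does not give its formula. The Python sketch below is one plausible reading, under assumptions that are mine rather than the listing's: description length is taken as the empirical-entropy code length of a token sequence, and the gain of a candidate is the drop in that length after every occurrence is replaced by one new symbol and a single copy of the candidate is appended after a delimiter. The names `description_length`, `replace_all`, `dl_gain`, `<NEW>` and `<SEP>` are hypothetical.

```python
import math
from collections import Counter

def description_length(tokens):
    """Empirical-entropy code length, in bits, of a token sequence (assumed definition)."""
    n = len(tokens)
    counts = Counter(tokens)
    return -sum(c * math.log2(c / n) for c in counts.values())

def replace_all(tokens, candidate, symbol):
    """Replace non-overlapping occurrences of `candidate`, scanning left to right."""
    out, i, k = [], 0, len(candidate)
    while i < len(tokens):
        if tokens[i:i + k] == candidate:
            out.append(symbol)
            i += k
        else:
            out.append(tokens[i])
            i += 1
    return out

def dl_gain(tokens, candidate, symbol="<NEW>", delimiter="<SEP>"):
    """Bits saved by treating `candidate` as one unit (positive means it compresses)."""
    rewritten = replace_all(tokens, candidate, symbol)
    rewritten += [delimiter] + candidate  # store one copy of the candidate itself
    return description_length(tokens) - description_length(rewritten)

corpus = list("abcabcabcabc")
print(dl_gain(corpus, list("abc")) > 0)           # a frequent chunk is worth extracting
print(dl_gain(list("abcdefgh"), list("ab")) < 0)  # a one-off chunk is not
```

A best-first learner in this spirit would repeatedly pick the candidate with the highest positive gain; how candidates are enumerated and when the search stops are left open here.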
Corpus-Based Lexical Acquisition For Semantic Parsing
- 1996
- Cited by 1 (0 self)
Building accurate and efficient natural language processing (NLP) systems is an important and difficult problem. There has been increasing interest in automating this process. The lexicon, or the mapping from words to meanings, is one component that is typically difficult to update and that changes from one domain to the next. Therefore, automating the acquisition of the lexicon is an important task in automating the acquisition of NLP systems. This proposal describes a system, WOLFIE (WOrd Learning From Interpreted Examples), that learns a lexicon from input consisting of sentences paired with representations of their meanings. Preliminary experimental results show that this system can learn correct and useful mappings. The correctness is evaluated by comparing a known lexicon to one learned from the training input. The usefulness is evaluated by examining the effect of using the lexicon learned by WOLFIE to assist a parser acquisition system, where previously this lexicon had to be hand-built. Future work in the form of extensions to the algorithm, further evaluation, and possible applications is discussed.
The Acquisition of a Lexicon from Paired Phoneme Sequences and Semantic Representations
- In Lecture Notes in Computer Science, 1994
We present an algorithm that acquires words (pairings of phonological forms and semantic representations) from larger utterances of unsegmented phoneme sequences and semantic representations. The algorithm maintains from utterance to utterance only a single coherent dictionary, and learns in the presence of homonymy, synonymy, and noise. Test results over a corpus of utterances generated from the CHILDES database of mother-child interactions are presented.
1 Introduction
This paper is concerned with the machine-learning of a lexicon from utterances that consist of an unsegmented phoneme sequence paired with a semantic representation of what those phonemes collectively mean. The problem is modeled after the environment that a child learns in, presented with a continuous speech signal and potentially hypothesizing a meaning for that signal based upon visual stimuli. We radically simplify the problem the child encounters for the computer by pre-digesting the speech stream into a sequ...

