## A Bayesian framework for word segmentation: Exploring the effects of context (2009)

Venue: | 46th Annual Meeting of the ACL |

Citations: | 50 - 11 self |

### BibTeX

@INPROCEEDINGS{Goldwater09abayesian,

author = {Sharon Goldwater and Thomas L. Griffiths and Mark Johnson},

title = {A {B}ayesian framework for word segmentation: Exploring the effects of context},

booktitle = {46th Annual Meeting of the ACL},

year = {2009},

pages = {398--406}

}

### Abstract

Since the experiments of Saffran et al. (1996a), there has been a great deal of interest in the question of how statistical regularities in the speech stream might be used by infants to begin to identify individual words. In this work, we use computational modeling to explore the effects of different assumptions the learner might make regarding the nature of words – in particular, how these assumptions affect the kinds of words that are segmented from a corpus of transcribed child-directed speech. We develop several models within a Bayesian ideal observer framework, and use them to examine the consequences of assuming either that words are independent units, or units that help to predict other units. We show through empirical and theoretical results that the assumption of independence causes the learner to undersegment the corpus, with many two- and three-word sequences (e.g., "what’s that", "do you", "in the house") misidentified as individual words. In contrast, when the learner assumes that words are predictive, the resulting segmentation is far more accurate. These results indicate that taking context into account is important for a statistical word segmentation strategy to be successful, and raise the possibility that even young infants may be able to exploit more subtle statistical patterns than have usually been considered.
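The abstract does not give the model equations, so the following is only a rough sketch of what "words are independent units" can mean in practice. It assumes a Dirichlet process unigram model (a Dirichlet process is mentioned in the citation context for Ferguson, 1973, in the list that follows), in which the probability of the next word depends only on how often that word has been seen so far. The names `base_prob`, `unigram_word_prob`, `alpha`, `n_symbols`, and `p_stop`, as well as the toy counts, are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def base_prob(word, n_symbols=50, p_stop=0.5):
    """Hypothetical base distribution P0 over novel word forms.

    Assumes each symbol is drawn uniformly from `n_symbols` options and
    the word length follows a geometric distribution with stop
    probability `p_stop`. Both values are illustrative.
    """
    n = len(word)
    return (1.0 / n_symbols) ** n * (1.0 - p_stop) ** (n - 1) * p_stop


def unigram_word_prob(word, counts, alpha=20.0):
    """Predictive probability of `word` given previously generated words.

    Standard Dirichlet process (Chinese restaurant process) form:
        (count(word) + alpha * P0(word)) / (total + alpha)
    No neighboring words enter the computation, which is the
    independence assumption.
    """
    total = sum(counts.values())
    return (counts[word] + alpha * base_prob(word)) / (total + alpha)


# Toy lexicon counts (made up for illustration).
counts = Counter(["what", "that", "what", "is", "that", "doggie"])

p_seen = unigram_word_prob("what", counts)   # word already in the lexicon
p_new = unigram_word_prob("kitty", counts)   # word not yet in the lexicon
print(p_seen, p_new)
```

Because each word is scored without reference to its neighbors, a frequent sequence stored as a single lexical item is rewarded through its own count. This is one intuition, under the assumptions above, for the undersegmentation the abstract reports.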

### Citations

3737 |
Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images
- Geman, Geman
- 1984
Citation Context ...on of the input corpus. We are left with the problem of inference, or actually identifying the highest probability segmentation from among all possibilities. We used a method known as Gibbs sampling (Geman and Geman, 1984), a type of Markov chain Monte Carlo algorithm (Gilks et al., 1996) in which variables are repeatedly sampled from their conditional posterior distribution given the current values of all other varia... |

2392 | Latent Dirichlet allocation - Blei, Ng, et al. - 2002 |

2257 | Equation of state calculations by fast computing machines - Metropolis, Rosenbluth, et al. - 1953 |

1544 | Finding Structure in Time
- Elman
- 1990
Citation Context ...g algorithms that are believed to incorporate cognitively plausible mechanisms of information processing. Algorithmic-level approaches to word segmentation include a variety of neural network models (Elman, 1990; Allen and Christiansen, 1996; Cairns and Shillcock, 1997; Christiansen et al., 1998) as well as several learning algorithms based on transitional probabilities, mutual information, and similar stati... |

1250 | Bayesian Data Analysis - Gelman, Carlin, et al. - 1995 |

1224 | Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57: 97–109 - Hastings - 1970 |

1042 | Bayesian Theory - Bernardo, Smith - 1994 |

900 | Sequential Monte Carlo Methods in Practice - Doucet, Freitas, et al. - 2000 |

857 | An empirical study of smoothing techniques for language modeling - Chen, Goodman - 1996 |

714 |
A Bayesian analysis of some nonparametric problems
- Ferguson
- 1973
Citation Context ...the kind of distribution that is found in natural language (Zipf, 1932). The model we have just described is an instance of a kind of model known in the statistical literature as a Dirichlet process (Ferguson, 1973). The Dirichlet process is commonly used in Bayesian statistics as a nonparametric prior for clustering models, and is closely related to Anderson’s (1991) rational model of categorization (Sanborn e... |

667 | On sequential Monte Carlo sampling methods for Bayesian filtering
- Doucet, Andrieu, et al.
Citation Context ...eses. Although Gibbs sampling is a batch learning algorithm, where the entire data set is available to the learner at once, we note that there are other sampling techniques known as particle filters (Doucet et al., 2000; Sanborn et al., 2006b) that can be used to produce approximations of the posterior distribution in an online fashion (examining each utterance in turn exactly once). We return in the General Discuss... |

563 | Probabilistic inference using Markov chain Monte Carlo methods - Neal - 1993 |

542 | Hierarchical Dirichlet processes
- Teh, Jordan, et al.
Citation Context ...nerated already, which leads to a preference for power-law distributions over the second item in each bigram. The bigram model we have just defined is known as a hierarchical Dirichlet process (HDP) (Teh et al., 2005). The HDP is an extension of the DP, and is typically used to model data in which there are multiple distributions over similar sets of outcomes, and the distributions are believed to be similar. For... |

400 | Bayesian density estimation and inference using mixtures - Escobar, West - 1995 |

397 |
Statistical learning by 8-month-old infants
- Saffran, Aslin, et al.
- 1996
Citation Context ...ical (stress) patterns (Morgan et al., 1995; Jusczyk et al., 1999b), effects of coarticulation (Johnson and Jusczyk, 2001), and statistical regularities in the sequences of syllables found in speech (Saffran et al., 1996a). This last source of information can be used in a language-independent way, and seems to be used by infants earlier than most other cues, by the age of 7 months (Thiessen and Saffran, 2003). These ... |

375 | Markov Chain Sampling Methods for Dirichlet Process Mixture Models - Neal - 2000 |

297 | Stochastic Complexity in Statistical Inquiry - Rissanen - 1989 |

276 |
Distributional structure
- Harris
- 1954
Citation Context .... The idea that word and morpheme boundaries may be discovered through the use of statistical information is not new, but originally these methods were seen primarily as analytic tools for linguists (Harris, 1954; Harris, 1955). More recently, evidence that infants are sensitive to statistical dependencies between syllables has lent weight to the idea that this kind of information may actually be used by huma... |

266 | Ferguson distributions via Pólya urn schemes - Blackwell, MacQueen - 1973 |

265 | Unsupervised learning of the morphology of a natural language - Goldsmith |

249 |
Simulated Annealing and Boltzmann Machines - a Stochastic Approach to Combinatorial Optimization and Neural Computers
- Aarts, Korst
- 1989
Citation Context ...n, there are several possible ways to evaluate its performance. For most of our simulations, we evaluated a single sample taken after 20,000 iterations. We used a method known as simulated annealing (Aarts and Korst, 1989) to speed convergence of the sampler, and in some cases (noted below) to obtain an approximation of the MAP solution by concentrating samples around the mode of the posterior. This allowed us to exam... |

216 | The Adaptive Nature of Human Categorization - Anderson - 1991 |

190 | Hierarchical topic models and the nested Chinese restaurant process - Blei, Griffiths, et al. |

178 |
Word segmentation: The role of distributional cues
- Saffran, Newport, et al.
- 1996
Citation Context ...ical (stress) patterns (Morgan et al., 1995; Jusczyk et al., 1999b), effects of coarticulation (Johnson and Jusczyk, 2001), and statistical regularities in the sequences of syllables found in speech (Saffran et al., 1996a). This last source of information can be used in a language-independent way, and seems to be used by infants earlier than most other cues, by the age of 7 months (Thiessen and Saffran, 2003). These ... |

157 | The infinite Gaussian mixture model - Rasmussen - 2000 |

142 | An efficient, probabilistically sound algorithm for segmentation and word discovery
- Brent
- 1999
Citation Context ...stically independent units tends to undersegment the corpus, identifying commonly co-occurring sequences of words as single words. These results seem to conflict with those of several earlier models (Brent, 1999; Venkataraman, 2001; Batchelder, 2002), where systematic undersegmentation was not found even when words were assumed to be independent. However, we argue here that these previous results are mislead... |

127 | Exchangeability and related topics - Aldous - 1985 |

125 | Exchangeability and related topics. École d’Été de Probabilités de Saint-Flour XIII 1983 - Aldous - 1985 |

122 | Distributional regularity and phonotactic constraints are useful for segmentation
- Brent, Cartwright
- 1996
Citation Context ...or all previous utterances. The online nature of this algorithm is intended to provide a more realistic simulation of human word segmentation than earlier batch learning algorithms (de Marcken, 1995; Brent and Cartwright, 1996), which assume that the entire corpus of data is available to the learner at once (i.e., the learner may iterate over the data many times). In the remainder of this paper, we will describe two new Ba... |

115 |
The Child Language Data Exchange System
- MacWhinney, Snow
- 1985
Citation Context ...segmentation, we report results on the same corpus used by Brent (1999) and Venkataraman (2001). The data is derived from the Bernstein-Ratner corpus (Bernstein-Ratner, 1987) of the CHILDES database (MacWhinney and Snow, 1985), which contains orthographic transcriptions of utterances directed at 13- to 23-month-olds. The data was post-processed by Brent, who removed disfluencies and non-words, discarded parental utterances... |

111 | Computation of conditional probability statistics by 8-month-old infants
- Aslin, Saffran, et al.
- 1998
Citation Context ...are a crucial first step in bootstrapping word segmentation (Thiessen and Saffran, 2003), and have provoked a great deal of interest in these strategies (Saffran et al., 1996b; Saffran et al., 1996a; Aslin et al., 1998; Toro et al., 2005). In this paper, we use computational modeling techniques to examine some of the assumptions underlying much of the research on statistical word segmentation. Most previous work on... |

109 | Word Learning as Bayesian Inference
- Xu, Tenenbaum
- 2007
Citation Context ...learners approximate ideal learners. Nevertheless, this suggestion is not completely unfounded, given the accumulating evidence in favor of humans as ideal learners in other domains or at other ages (Xu and Tenenbaum, 2007; Frank et al., 2007; Schulz et al., 2007). In order to further examine whether infants behave as ideal learners, or the ways in which they depart from the ideal, it is important to first understand w... |

107 | Learning at a distance I. Statistical learning of non-adjacent dependencies
- Newport, Aslin
- 2004
Citation Context ...rs are or are not sensitive to (e.g., transitional probabilities vs. frequencies (Aslin et al., 1998), syllables vs. phonemes (Newport et al., in preparation), adjacent vs. non-adjacent dependencies (Newport and Aslin, 2004), and the ways in which transitional probabilities interact with other kinds of cues (Johnson and Jusczyk, 2001; Thiessen and Saffran, 2003; Thiessen and Saffran, 2004). In addition, many researchers... |

97 | Learning to segment speech using multiple cues: a connectionist model - Christiansen, Allen, et al. - 1998 |

86 |
The beginnings of word segmentation in English-learning infants
- Jusczyk, Houston, et al.
- 1999
Citation Context ... to identify word boundaries. In fact, there is evidence that infants use a wide range of weak cues for word segmentation. These cues include phonotactics (Mattys et al., 1999), allophonic variation (Jusczyk et al., 1999a), metrical (stress) patterns (Morgan et al., 1995; Jusczyk et al., 1999b), effects of coarticulation (Johnson and Jusczyk, 2001), and statistical regularities in the sequences of syllables found in ... |

79 | A hierarchical Dirichlet language model - MacKay, Peto - 1995 |

76 | Interpolating between types and tokens by estimating power-law generators - Goldwater, Griffiths, et al. |

76 | From phoneme to morpheme - Harris - 1955 |

69 | The role of exposure to isolated words in early vocabulary development - Brent, Siskind - 2001 |

65 | Models of word segmentation in fluent maternal speech to infants - Aslin, Woodward, et al. - 1996 |

65 | The infinite PCFG using hierarchical Dirichlet processes - Liang, Petrov, et al. - 2007 |

64 | Unsupervised discovery of morphemes - Creutz, Lagus - 2002 |

63 |
Phonotactic and prosodic effects on word segmentation in infants
- Mattys, Jusczyk, et al.
- 1999
Citation Context ...een words, children must be using other cues to identify word boundaries. In fact, there is evidence that infants use a wide range of weak cues for word segmentation. These cues include phonotactics (Mattys et al., 1999), allophonic variation (Jusczyk et al., 1999a), metrical (stress) patterns (Morgan et al., 1995; Jusczyk et al., 1999b), effects of coarticulation (Johnson and Jusczyk, 2001), and statistical regular... |

61 | The Units of Language Acquisition - Peters - 1983 |

56 | Statistical learning of new visual feature combinations by infants - Fiser, Aslin - 2002 |

56 | Contextual dependencies in unsupervised word segmentation - Goldwater, Griffiths, et al. |

55 | Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models - Johnson, Griffiths, et al. - 2006 |

49 | Modeling word burstiness using the Dirichlet distribution - Madsen, Kauchak, et al. - 2005 |

45 |
Bootstrapping word boundaries: A bottom-up corpus based approach to speech segmentation
- Cairns, Shillcock, et al.
- 1997
Citation Context ...te cognitively plausible mechanisms of information processing. Algorithmic-level approaches to word segmentation include a variety of neural network models (Elman, 1990; Allen and Christiansen, 1996; Cairns and Shillcock, 1997; Christiansen et al., 1998) as well as several learning algorithms based on transitional probabilities, mutual information, and similar statistics (Swingley, 2005; Ando and Lee, 2000; Feng et al., 20... |

45 |
Infants' sensitivity to allophonic cues for word segmentation
- Jusczyk, Hohne, et al.
- 1999
Citation Context ... to identify word boundaries. In fact, there is evidence that infants use a wide range of weak cues for word segmentation. These cues include phonotactics (Mattys et al., 1999), allophonic variation (Jusczyk et al., 1999a), metrical (stress) patterns (Morgan et al., 1995; Jusczyk et al., 1999b), effects of coarticulation (Johnson and Jusczyk, 2001), and statistical regularities in the sequences of syllables found in ... |