## A maximum entropy model of phonotactics and phonotactic learning (2006)

Citations: 77 (13 self)

### BibTeX

```bibtex
@MISC{Hayes06amaximum,
  author = {Bruce Hayes and Colin Wilson},
  title  = {A maximum entropy model of phonotactics and phonotactic learning},
  year   = {2006}
}
```

### Abstract

The study of phonotactics (e.g., the ability of English speakers to distinguish possible words like blick from impossible words like *bnick) is a central topic in phonology. We propose a theory of phonotactic grammars and a learning algorithm that constructs such grammars from positive evidence. Our grammars consist of constraints that are assigned numerical weights according to the principle of maximum entropy. Possible words are assessed by these grammars based on the weighted sum of their constraint violations. The learning algorithm yields grammars that can capture both categorical and gradient phonotactic patterns. The algorithm is not provided with any constraints in advance, but uses its own resources to form constraints and weight them. A baseline model, in which Universal Grammar is reduced to a feature set and an SPE-style constraint format, suffices to learn many phonotactic phenomena. In order to learn nonlocal phenomena such as stress and vowel harmony, it is necessary to augment the model with autosegmental tiers and metrical grids. Our results thus offer novel, learning-theoretic support for such representations. We apply the model to English syllable onsets, Shona vowel harmony, quantity-insensitive stress typology, and the full phonotactics of Wargamay, showing that the learned grammars capture the distributional generalizations of these languages and accurately predict the findings of a phonotactic experiment.
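The weighted-sum scheme described in the abstract can be sketched in a few lines; the constraint and weight below are hypothetical illustrations, not part of the learned grammars reported in the paper:

```python
import math

# h(x): weighted sum of constraint violations for a candidate word.
def score(word, constraints, weights):
    return sum(w * c(word) for c, w in zip(constraints, weights))

# Unnormalized maxent value exp(-h(x)); a probability would further
# require dividing by the sum of this quantity over all possible forms.
def maxent_value(word, constraints, weights):
    return math.exp(-score(word, constraints, weights))

# Hypothetical constraint: penalize word-initial [bn].
no_bn = lambda w: 1 if w.startswith("bn") else 0
print(maxent_value("blick", [no_bn], [3.0]))  # 1.0: no violations
print(maxent_value("bnick", [no_bn], [3.0]))  # ≈ 0.0498, i.e. exp(-3)
```

Forms with more heavily weighted violations thus receive exponentially lower values, which is what lets the same grammar express both categorical bans and gradient dispreferences.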

### Citations

8548 |
Elements of Information Theory
- Cover, Thomas
- 1991
Citation context: “…honotactic analysis. The term “maximum entropy” relates to this goal. “Entropy” is an information-theoretic measure of the amount of randomness in the system, given by the formula −Σ P(x) log(P(x)) (Cover and Thomas 1991). According to a theorem proved by Della Pietra et al. (1997), if probability is defined as in §3.2, maximizing entropy is in fact equivalent to maximizing the probability of the observed forms given…”
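The entropy formula quoted in this context can be computed directly; the distributions below are arbitrary examples:

```python
import math

# Shannon entropy -Σ P(x) log P(x), here in bits (log base 2).
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0: maximal randomness for two outcomes
print(entropy([0.9, 0.1]))  # ≈ 0.469: a skewed, less random distribution
```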

3824 |
Introduction to Automata Theory, Languages, and Computation, Second Edition
- Hopcroft, Motwani, et al.
Citation context: “…of strings can be computed by representing the set as a finite state machine. We construct our machines by first representing each constraint as a weighted finite-state acceptor. Using intersection (Hopcroft and Ullman 1979), the constraints are then combined into a single machine that embodies the full grammar (Ellison 1994, Riggle 2004). Each path through this machine corresponds to a phonological representation toget…”
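The intersection step described in this context can be illustrated with a toy product construction; the acceptor encoding and the two constraints below are simplified stand-ins, not the paper's machines:

```python
# Toy product construction for weighted acceptors (the unweighted version
# is the classic one in Hopcroft and Ullman 1979). Each machine is a dict
# of arcs: (state, symbol) -> (next_state, weight). Intersection pairs up
# states and sums arc weights, so one machine assigns the total penalty.
def intersect(arcs1, arcs2):
    arcs = {}
    for (q1, a), (r1, w1) in arcs1.items():
        for (q2, b), (r2, w2) in arcs2.items():
            if a == b:
                arcs[((q1, q2), a)] = ((r1, r2), w1 + w2)
    return arcs

# Run a string through a deterministic weighted acceptor, summing weights.
def penalty(arcs, start, string):
    state, total = start, 0.0
    for sym in string:
        state, w = arcs[(state, sym)]
        total += w
    return total

# Two hypothetical one-state constraints over the alphabet {a, b}:
c1 = {(0, "a"): (0, 0.0), (0, "b"): (0, 1.0)}  # penalize every b
c2 = {(0, "a"): (0, 2.0), (0, "b"): (0, 0.0)}  # penalize every a
both = intersect(c1, c2)
print(penalty(both, (0, 0), "ab"))  # 3.0 = one a (2.0) + one b (1.0)
```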

1857 |
Numerical Recipes in C: The Art of Scientific Computing
- Press, Flannery, et al.
- 1992
Citation context: “…e becomes sufficiently close (by an arbitrarily chosen small value) to zero. There are many algorithms that can iteratively ascend a surface given the gradient. We used the Conjugate Gradient method (Press et al. 1992), which is known to converge quickly for this type of problem (Malouf 2002). The heart of the calculation is the determination of the gradients. Formally, the gradient consists of a vector of partial…”
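The gradient computation described in this context can be sketched at toy scale, substituting plain gradient ascent for Conjugate Gradient; the candidate set, constraint, and data are hypothetical:

```python
import math

# Toy maxent weight fitting over a fully enumerated candidate set. Each
# partial derivative of the log-likelihood is the model's expected violation
# count for a constraint minus its observed count in the learning data.
def fit_weights(candidates, viol, data, steps=500, lr=0.5):
    n = len(viol[candidates[0]])
    w = [0.0] * n
    obs = [sum(viol[x][i] for x in data) / len(data) for i in range(n)]
    for _ in range(steps):
        s = {x: math.exp(-sum(wi * vi for wi, vi in zip(w, viol[x])))
             for x in candidates}
        z = sum(s.values())
        exp_v = [sum(s[x] / z * viol[x][i] for x in candidates)
                 for i in range(n)]
        # Ascent step: raise weights violated less often than expected.
        w = [wi + lr * (ei - oi) for wi, ei, oi in zip(w, exp_v, obs)]
    return w

# One hypothetical constraint (say, *CODA), violated by "at" but not "ta";
# the data contain only coda-less forms, so its weight grows large.
cands = ["ta", "at"]
viol = {"ta": [0], "at": [1]}
w = fit_weights(cands, viol, data=["ta", "ta", "ta"])
print(w[0] > 2.0)  # True: the constraint ends up heavily weighted
```

Without regularization the weight of a never-violated constraint grows without bound, which is one reason the paper's actual objective adds a Gaussian prior over weights.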

1453 | Optimality Theory: Constraint Interaction in Generative Grammar
- Prince, Smolensky
- 2004
Citation context: “…ons of phonological forms, argued above (§2.1) to be crucial to an adequate phonotactic model. All of the constraints in our model are Markedness constraints, in the sense of Optimality Theory (“OT”; Prince and Smolensky 1993/2004). No role is played by inputs or by OT-style Faithfulness constraints. This decision is sensible in light of the task at hand: we seek to assess forms simply for their phonological legality, not…”

1152 |
Information Theory, Inference, and Learning Algorithms, volume 1
- MacKay
- 2003
Citation context: “…large random sample from the set Ω of all possible phonological representations. When the sample is sufficiently large and is drawn according to well-established techniques (Della Pietra et al. 1997, MacKay 2003) the average number of violations in the sample provides a fairly accurate estimate of the expected value for Ω as a whole. For details of sampling, see Appendix A. §4.2.2 Generality: Within the strata…”
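The sampling estimate described in this context can be sketched as follows; the sampler and constraint below are hypothetical stand-ins for the paper's machinery:

```python
import random

# Monte Carlo estimate of each constraint's expected violation count from a
# random sample, instead of summing over the whole (exponentially large) set.
def expected_violations(sample_one, constraints, n=20000, seed=0):
    rng = random.Random(seed)
    totals = [0.0] * len(constraints)
    for _ in range(n):
        x = sample_one(rng)
        for i, c in enumerate(constraints):
            totals[i] += c(x)
    return [t / n for t in totals]

# Toy sampler: uniform length-4 strings over {p, a}. The constraint counts
# adjacent pp pairs; the true expectation is 3 positions * 0.25 = 0.75.
sampler = lambda rng: "".join(rng.choice("pa") for _ in range(4))
no_pp = lambda s: sum(s[i:i + 2] == "pp" for i in range(len(s) - 1))
print(expected_violations(sampler, [no_pp]))  # close to the true value 0.75
```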

1082 | A maximum entropy approach to natural language processing
- Berger, Pietra, et al.
- 1996
Citation context: “…d for application to the learning and analysis of input-output mappings, see Goldwater and Johnson 2003, Jäger 2004. Maximum entropy grammars (“maxent”; Berger et al. 1996, Della Pietra et al. 1997, Eisner 2001, Klein and Manning 2003, Rosenfeld 1996) have special properties that recommend them as a basis for phonotactic learning. In particular, they have been subject…”

791 | The sound pattern of English - Chomsky, Halle - 1968 |

739 |
Statistical Methods for Speech Recognition
- Jelinek
- 1998
Citation context: “…its syllable in a metrical stress hierarchy. Previous accounts of phonotactic learning, however, have relied on just a single classification of environments. For instance, traditional n-gram models (Jelinek 1999, Jurafsky and Martin 2000) are quite efficient and have broad application in industry, but they define only an immediate segmental context and are thus insufficient as a basis for phonotactic analysi…”
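A minimal bigram model of the kind contrasted in this context might look like this (toy training data; smoothing omitted):

```python
from collections import Counter

# A traditional bigram model: each segment is conditioned only on the
# immediately preceding one, which is exactly the "immediate segmental
# context" limitation the quote points to.
def train_bigrams(words):
    pairs, contexts = Counter(), Counter()
    for w in words:
        padded = "#" + w + "#"  # word boundary symbol
        for a, b in zip(padded, padded[1:]):
            pairs[a, b] += 1
            contexts[a] += 1
    return {k: v / contexts[k[0]] for k, v in pairs.items()}

def bigram_prob(model, word):
    padded = "#" + word + "#"
    p = 1.0
    for a, b in zip(padded, padded[1:]):
        p *= model.get((a, b), 0.0)
    return p

model = train_bigrams(["blick", "brick", "black"])
print(bigram_prob(model, "blick") > 0)  # True: every transition attested
print(bigram_prob(model, "bnick"))      # 0.0: b-n is an unseen transition
```

Because the model sees only adjacent pairs, it cannot state nonlocal restrictions such as vowel harmony or stress, which is the gap the tier and grid representations are meant to fill.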

732 | Foundations of Statistical Natural Language Processing - Manning, Schütze - 1999 |

552 | Inducing features of random fields
- Della Pietra, S., et al.
- 1997
Citation context: “…t) grammars, see Jaynes 1983, Jelinek 1999: ch. 13, Manning and Schütze 1999, and Klein and Manning 2003. We will rely here on particular results developed in Berger et al. 1996, Rosenfeld 1996, Della Pietra et al. 1997, and Eisner 2001. For earlier applications of maxent grammars to phonology, in particular to the learning and analysis of input-output mappings, see Goldwater and Johnson 2003 and Jäger 2004. Maxent…”

479 |
Faithfulness and Reduplicative Identity
- McCarthy, Prince
- 1995
Citation context: “…od described here, then take on the many forms of string mapping that must be learned: mapping from paradigmatic base forms to the other paradigm members (Albright 2002b), from bases to reduplicants (McCarthy and Prince 1995), from one free variant to another (Kawahara 2002), and so on. These mappings are learned as a maxent grammar that incorporates Faithfulness constraints. The constraints used for string mappings woul…”

372 | Learnability in optimality theory - Tesar, Smolensky - 1998 |

321 |
Metrical Stress Theory: Principles and Case Studies
- Hayes
- 1995
Citation context: “…portant restrictions. §7.1 Unbounded stress: The locality of stress is seen clearly in so-called unbounded stress patterns. One such pattern, attributed to Eastern Cheremis and various other languages (Hayes 1995: §7.2), works as follows: (15) a. Every heavy syllable bears some degree of stress. b. Every initial syllable bears some degree of stress. c. Of the stressed syllables in a word, the rightmost bears m…”
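The three clauses of (15) can be stated as a checker over toy syllable sequences; the encoding of weight and stress below is an assumption for illustration:

```python
# Words are lists of (weight, stress) pairs: weight in {"L", "H"} and
# stress in {0, 1, 2} (0 = unstressed, 1 = secondary, 2 = main stress).
def well_formed(word):
    if any(wt == "H" and s == 0 for wt, s in word):  # (15a) heavy => stressed
        return False
    if word and word[0][1] == 0:                     # (15b) initial stressed
        return False
    stressed = [i for i, (_, s) in enumerate(word) if s > 0]
    mains = [i for i, (_, s) in enumerate(word) if s == 2]
    # (15c) exactly one main stress, on the rightmost stressed syllable.
    return len(mains) == 1 and bool(stressed) and mains[0] == stressed[-1]

print(well_formed([("L", 1), ("H", 2), ("L", 0)]))  # True
print(well_formed([("L", 2), ("H", 0), ("L", 0)]))  # False: heavy unstressed
```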

295 | Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition - Jurafsky, Martin - 2008 |

277 |
Speech and Language Processing: An Introduction to
- Jurafsky, Martin
- 2006
Citation context: “…llows that any approach to phonotactics in which the content of constraints is hobbled by computational considerations should be rejected. Thus, for instance, traditional n-gram models (Jelinek 1999, Jurafsky and Martin 2000) are quite efficient and have broad application in industry, but are insufficient as a basis for phonotactic analysis (see §5.3). Similarly, the stochastic context-free grammar of Coleman and Pierreh…”

269 | Empirical tests of the Gradual Learning Algorithm
- Boersma, Hayes
- 1999
Citation context: “…m of accounting for gradient intuitions. A large body of research in generative linguistics deals with this issue; for example Chomsky 1963, Ross 1972, Legendre et al. 1990, Schütze 1996, Hayes 2000, Boersma and Hayes 2001, Boersma 2004, Keller 2000, 2006, Sorace and Keller 2005, and Legendre et al. 2006. In the particular domain of phonotactics, gradient intuitions are pervasive: they have been found in every experime…”

253 |
On certain formal properties of grammars
- Chomsky
- 1959
Citation context: “…generative grammar that address well-formedness are faced with the problem of accounting for gradient intuitions. A large body of research in generative linguistics deals with this issue; for example Chomsky 1963, Ross 1972, Legendre et al. 1990, Schütze 1996, Hayes 2000, Boersma and Hayes 2001, Boersma 2004, Keller 2000, 2006, Sorace and Keller 2005, and Legendre et al. 2006. In the particular domain of phon…”

242 | A maximum entropy approach to adaptive statistical language modeling
- Rosenfeld
- 1996
Citation context: “…ropy (hereafter, maxent) grammars, see Jaynes 1983, Jelinek 1999: ch. 13, Manning and Schütze 1999, and Klein and Manning 2003. We will rely here on particular results developed in Berger et al. 1996, Rosenfeld 1996, Della Pietra et al. 1997, and Eisner 2001. For earlier applications of maxent grammars to phonology, in particular to the learning and analysis of input-output mappings, see Goldwater and Johnson 20…”

241 |
Information processing in dynamic systems: foundations of harmony theory
- Smolensky
- 1986
Citation context: “…ct ratings for all 62 onsets in the Scholes experiment. [footnote 13:] T is mnemonic for the computational “temperature,” a term reflecting the origin of maximum entropy theory in statistical mechanics; see e.g. Smolensky 1986: 270. Figure 3: Performance of the model in predicting the data of Scholes 1966. The correlation of 0.946 becomes more meaningful when compared with the c…”
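One simple reading of the temperature term quoted in this context treats predicted acceptability as proportional to exp(−h(x)/T); this mapping is an assumption for illustration, not necessarily the paper's exact linking function:

```python
import math

# Temperature scaling of a maxent score h(x): larger T compresses the
# differences between well-formed (low h) and ill-formed (high h) candidates.
def predicted_rating(score, T):
    return math.exp(-score / T)

# Hypothetical scores: higher T flattens the ratings toward each other.
for h in (0.0, 2.0, 4.0):
    print(h, round(predicted_rating(h, T=1.0), 3),
             round(predicted_rating(h, T=3.0), 3))
```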

229 | A comparison of algorithms for maximum entropy parameter estimation
- Malouf
- 2002
Citation context: “…e are many algorithms that can iteratively ascend a surface given the gradient. We used the Conjugate Gradient method (Press et al. 1992), which is known to converge quickly for this type of problem (Malouf 2002). The heart of the calculation is the determination of the gradients. Formally, the gradient consists of a vector of partial derivatives, one for each constraint in the grammar. Each partial…”

223 | Functional Phonology: Formalizing the Interactions between Articulatory and Perceptual Drives. Doctoral Dissertation
- BOERSMA
- 1998
Citation context: “…eir reliance on UG. For instance, language learners could make use of their own phonetic experience, accessing it to discover phonetically natural constraints grounded in articulation and perception (Boersma 1998, Hayes 1999a, Steriade 1999, 2001a, b, Gordon 2004, Hayes, Kirchner and Steriade 2004). Preference for such constraints would constitute a learning bias in favor of phonological systems that are easi…”

218 |
Autosegmental Phonology
- Goldsmith
- 1976
Citation context: “…erkamp et al. 2006. Our own inductive baseline is a purely linear, feature-bundle approach modeled on Chomsky and Halle (1968; henceforth SPE). To this we will add the concepts of autosegmental tier (Goldsmith 1979) and metrical grid (Liberman 1975, Prince 1983), showing that both make possible modes of phonotactic learning that are unreachable by the linear baseline model. §2.3 Accounting for gradience: All area…”

177 |
An Essay on Stress
- HALLE, VERGNAUD
- 1987
Citation context: “…H�, ˌH, ˈH }. L designates light syllables and H heavy. [footnote 23:] It has often been argued that the grid should be amplified with constituency information, such as foot structure (Liberman and Prince 1977, Halle and Vergnaud 1987, Hayes 1995). The present discussion makes no use of such constituency, taking an agnostic view on whether it exists. For discussion of “hidden structure” of this kind, see §9.5. …”

167 | Prior probabilities
- Jaynes
- 1968
Citation context: “…§3. Maximum entropy grammars: A maximum entropy grammar uses weighted constraints to assign probabilities to outputs. For general background on maximum entropy (hereafter, maxent) grammars, see Jaynes 1983, Jelinek 1999: ch. 13, Manning and Schütze 1999, and Klein and Manning 2003. We will rely here on particular results developed in Berger et al. 1996, Rosenfeld 1996, Della Pietra et al. 1997, and Eisn…”

165 | Phonology and Language Use - Bybee - 2001 |

158 |
A Thematic Guide to Optimality Theory
- McCarthy
- 2002
Citation context: “…t the only effective learning strategy is one with an extremely rich UG—a UG that incorporates the entire constraint set for phonology (Prince and Smolensky 1993/2004; Tesar and Smolensky 1998, 2000; McCarthy 2002). If so, the problem of typology will likely be solved, and the outcome of our efforts will be an inductive-baseline argument for the universal-constraint approach. However, there are other ways to e…”

149 | How we learn variation, optionality, and probability
- Boersma
- 1997
Citation context: “…l responds flexibly and sensitively to the range of frequencies encountered in the learning data. Other algorithms satisfy the gradience criterion, but fail elsewhere. The Gradual Learning Algorithm (Boersma 1997, Boersma and Hayes 2001) responds flexibly to gradient data, but in a well-defined class of cases it fails to find the target grammar (Pater, in press)…”

135 | Self-organisation in vowel systems - de Boer - 1999 |

134 |
Language
- Bloomfield
- 1933
Citation context: “…dient Well-formedness: The inventory of syllable onsets in English is an ideal empirical domain for the testing of phonotactic learning models. The basic generalizations have been extensively studied (Bloomfield 1933, Whorf 1940, O’Connor and Trim 1953, Fudge 1969, Selkirk 1982, Clements and Keyser 1983, Hammond 1999), and data are available from experimentation that permit rival models to be evaluated. In this s…”

125 | Automatic rule induction for unknown word guessing - Mikheev - 1997 |

125 |
Quantitative consequences of rhythmic organization
- Prince
- 1990
Citation context: “…e is no evident connection between nasality and [u] in Yidiɲ phonotactics (Hayes 1999b). English vowel length alternations (SPE) are phonotactically motivated insofar as they optimize foot structure (Prince 1990, Hayes 1995), but the accompanying quality alternations ([iː] ~ [ɛ], [eɪ] ~ [æ], [aɪ] ~ [ɪ], [oʊ] ~ [�]) have no evident phonotactic basis. We suggest that the proper link between alternations and ph…”

124 |
Formal Problems in Semitic Phonology and Morphology. Doctoral dissertation
- McCarthy
- 1979
Citation context: “…le that would not otherwise be. To create the effects of a vowel tier within the computational limits of our system, we use a slightly different conception, due originally to Vergnaud (1977; see also McCarthy 1979). We assume that every phonological representation automatically generates a vowel projection, which is a substring consisting of all and only its vowels, appearing in the same order as in the main r…”
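The vowel projection described in this context is easy to state concretely; the five-vowel inventory below is a hypothetical example. Constraints stated on the projection see nonadjacent vowels as adjacent, which is what makes harmony learnable as a local pattern:

```python
# The vowel projection: the substring of all and only a form's vowels,
# in their original order (toy five-vowel inventory).
VOWELS = set("aeiou")

def vowel_projection(word):
    return "".join(seg for seg in word if seg in VOWELS)

print(vowel_projection("kalimotu"))  # aiou
print(vowel_projection("blick"))     # i
```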

122 |
Relating to the grid
- Prince
- 1983
Citation context: “…ing richness of structures and phenomena in phonology, including long-distance dependencies (e.g., McCarthy 1988), phrasal hierarchies (Selkirk 1980a), metrical hierarchies (Liberman and Prince 1977, Prince 1983), elaborate interactions with morphology (Kiparsky 1982), and other areas, each the subject of extensive analysis and research. We anticipate that a successful model of phonotactics and phonotactic l…”

118 | Underspecification and markedness - Steriade - 1995 |

117 | On the internal structure of the constraint component Con of UG. Handout of talk presented at the - Smolensky - 1995 |

113 |
Lexical phonology and morphology
- Kiparsky
- 1982
Citation context: “…ncluding long-distance dependencies (e.g., McCarthy 1988), phrasal hierarchies (e.g. Selkirk 1980a), metrical hierarchies (e.g. Liberman and Prince 1977), elaborate interactions with morphology (e.g. Kiparsky 1982), and other areas, each the subject of extensive analysis and research. We anticipate that a successful model of phonotactics and phonotactic learning will incorporate theoretical work from all of th…”

106 |
Regular morphology and the lexicon
- Bybee
- 1995
Citation context: “…tal data discussed below (§5.3). In general, it appears that the use of type frequencies yields better results in modeling any sort of phonological intuitions based on the lexicon; for discussion see Bybee 1995, 2001, Pierrehumbert 2001a, Albright 2002a, Albright and Hayes 2003, Hayes and Londe 2006, and Goldwater 2007. Table 3: Feature set for English consonan…”

106 | Learning at a distance: I. Statistical learning of nonadjacent dependencies
- Newport, Aslin
- 2004
Citation context: “…limitations would apply for the human learner, whose computational capacity is unknown. Given that exponential growth soon defeats any finite system, there must be limitations of some sort (see also Newport and Aslin 2004). |C| will in general be small to the extent that the feature system makes use of principles of underspecification, as embodied in works such as Kiparsky 1982; Archangeli 1984; and Steriade… [footnote 7:] We will…”

104 |
On stress and linguistic rhythm. Linguistic Inquiry 8.249–336
- Liberman, Prince
- 1977
Citation context: “…onstrate a striking richness of structures and phenomena in phonology, including long-distance dependencies (e.g., McCarthy 1988), phrasal hierarchies (e.g. Selkirk 1980a), metrical hierarchies (e.g. Liberman and Prince 1977), elaborate interactions with morphology (e.g. Kiparsky 1982), and other areas, each the subject of extensive analysis and research. We anticipate that a successful model of phonotactics and phonotac…”

98 | Learning OT constraint rankings using a maximum entropy model
- Goldwater, Johnson
- 2003
Citation context: “…al. 1996, Rosenfeld 1996, Della Pietra et al. 1997, and Eisner 2001. For earlier applications of maxent grammars to phonology, in particular to the learning and analysis of input-output mappings, see Goldwater and Johnson 2003 and Jäger 2004. Maxent grammars have special properties that recommend them as a basis for phonotactic learning. They have been subject to thorough mathematical analysis that establishes their conver…”

95 | Phonological acquisition in Optimality Theory: The early stages
- Hayes
- 2004
Citation context: “…tions and selecting members of that space for inclusion in the grammar. Previous research on phonotactic learning has not addressed the selection problem in a general form. Work in Optimality Theory (Hayes 2004, Prince and Tesar 2004, Jarosz 2006, Pater and Coetzee 2006) generally assumes that the constraint set is provided by UG. No selection problem arises under this approach, as learning consists simply…”

95 |
CV Phonology: A Generative Theory of the Syllable
- CLEMENTS, KEYSER
- 1983
Citation context: “…words into onsets and rimes. As Coleman and Pierrehumbert point out, this makes it impossible in principle for the model to capture the many phonotactic restrictions that cross onset-rime boundaries (Clements and Keyser 1983: 20-21) or syllable boundaries (bans on geminates, heterorganic nasal-stop clusters, sibilant clusters). The crucial point is that phonotactics are cross-classifying, so that no one single categoriza…”

94 | Rules vs. analogy in English past tenses: a computational/experimental study
- Albright, Hayes
Citation context: “…that the use of type frequencies yields better results in modeling any sort of phonological intuitions based on the lexicon; for discussion see Bybee 1995, 2001, Pierrehumbert 2001a, Albright 2002a, Albright and Hayes 2003, Hayes and Londe 2006, and Goldwater 2007. Table 3: Feature set for English consonants…”

93 |
The Internal Organization of Speech Sounds
- Clements, Hume
- 1995
Citation context: “…This can be done, for instance, with an autosegmental tier for vowels (Clements 1976, Goldsmith 1979), perhaps incorporated into some conception of feature geometry (Archangeli and Pulleyblank 1987, Clements and Hume 1995). Without attempting to choose between these theories, we argue that a vocalic representation offers a solution to the problem of learning harmony systems. To create the effects of a vowel tier in ou…”

88 |
A Computational Learning Model of Metrical Phonology. Cognition 34
- Dresher, Kaye
- 1990
Citation context: “…onstraint set is provided by UG. No selection problem arises under this approach, as learning consists simply of assigning a ranking to the constraint set. The parameter setting approach set forth by Dresher and Kaye 1990 likewise confronts no selection problem, since the parameters and their cues are provided a priori. However, our interest in establishing an inductive baseline (§2.2) is incompatible with any rich UG…”

88 | Efficient generation in primitive Optimality Theory
- Eisner
- 1997
Citation context: “…er than the longest string in the learning data D. This is a finite—albeit exponentially large—subset of Ω, and to sum over it we employ methods borrowed from work in computational OT (Ellison 1994, Eisner 1997, Albro 1998, 2005, Riggle 2004). As this work has shown, the properties of a very large set of strings can be computed by representing the set as a finite state machine. We construct our machines by…”

87 |
The Harmonic Mind. From Neural Computation to Optimality-Theoretic Grammar, volume 1. Cognitive Architecture
- Smolensky, Legendre
- 2006
Citation context: “…eason we will illustrate here only the calculation of scores and maxent values. To this end, [footnote 2:] Our “scores” are closely related to the harmony values explored in Smolensky (1986) and subsequent work (Smolensky and Legendre 2006); hence the abbreviation h(x). The term “score” is also used in Prince (2002). The use of scores, but without their theoretical interpretation under maximum entropy as probability, is the basis of “l…”

79 |
Phonetically-driven phonology: the role of Optimality Theory and inductive grounding
- Hayes
- 1999
Citation context: “…ctic problems. In Yidiɲ phonology, [u] is chosen (productively) as the epenthetic vowel following a nasal consonant, yet there is no evident connection between nasality and [u] in Yidiɲ phonotactics (Hayes 1999b). English vowel length alternations (SPE) are phonotactically motivated insofar as they optimize foot structure (Prince 1990, Hayes 1995), but the accompanying quality alternations ([iː] ~ [ɛ], [eɪ]…”

78 |
Underspecification in Yawelmani phonology and morphology
- Archangeli
- 1984
Citation context: “…e sort (see also Newport and Aslin 2004). |C| will in general be small to the extent that the feature system makes use of principles of underspecification, as embodied in works such as Kiparsky 1982; Archangeli 1984; and Steriade… [footnote 7:] We will use the following abbreviations for feature names: ant = anterior; approx = approximant; cons = consonantal; cont = continuant; cor = coronal; dors = dorsal; lab = labial; lat…”

78 | Phonological Derivation in Optimality Theory
- Ellison
- 1994
Citation context: “…at are no longer than the longest string in the learning data D. This is a finite—albeit exponentially large—subset of Ω, and to sum over it we employ methods borrowed from work in computational OT (Ellison 1994, Eisner 1997, Albro 1998, 2005, Riggle 2004). As this work has shown, the properties of a very large set of strings can be computed by representing the set as a finite state machine. We construct our…”