## Learning bias and phonological rule induction (1996)

Venue: Computational Linguistics

Citations: 6 (0 self)

### BibTeX

```bibtex
@article{Gildea96learningbias,
  author  = {Daniel Gildea and Daniel Jurafsky},
  title   = {Learning bias and phonological rule induction},
  journal = {Computational Linguistics},
  year    = {1996},
  volume  = {22},
  pages   = {497--530}
}
```

### Abstract

A fundamental debate in the machine learning of language has been the role of prior knowledge in the learning process. Purely nativist approaches, such as the Principles and Parameters model, build parameterized linguistic generalizations directly into the learning system. Purely empirical approaches use a general, domain-independent learning rule (Error Back-Propagation, Instance-Based Generalization, Minimum Description Length) to learn linguistic generalizations directly from the data. In this paper we suggest that an alternative to the purely nativist or purely empiricist learning paradigms is to represent the prior knowledge of language as a set of abstract learning biases, which guide an empirical inductive learning algorithm. We test our idea by examining the machine learning of simple Sound Pattern of English (SPE)-style phonological rules. We represent phonological rules as finite state transducers which accept underlying forms as input and generate surface forms as output. We show that OSTIA, a general-purpose transducer induction algorithm, was incapable of learning simple phonological rules like flapping. We then augmented OSTIA with three kinds of learning biases which are specific to natural language phonology, and are assumed explicitly or implicitly by every theory of phonology: Faithfulness (underlying segments tend
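The transducer representation described in the abstract can be made concrete with a minimal sketch (not the paper's code; the symbols `V'` for a stressed vowel, `V` for an unstressed vowel, and `C` for any other consonant are simplified placeholders for the paper's phoneme alphabet). Flapping is modeled as a small subsequential transducer that delays output of /t/ until its right context is known:

```python
# Hedged sketch of a subsequential transducer for American English flapping:
# /t/ surfaces as the flap "dx" between a stressed and an unstressed vowel.
# Segment symbols ("V'", "V", "t", "C") are illustrative placeholders.

def flap(underlying):
    """Map an underlying segment sequence to its surface form."""
    surface = []
    state = 0  # 0: start, 1: just saw stressed vowel, 2: saw stressed vowel + t
    for seg in underlying:
        if state == 2:
            # Pending /t/: flap only if the next segment is an unstressed vowel.
            surface.append("dx" if seg == "V" else "t")
            state = 0
        if seg == "V'":
            surface.append(seg)
            state = 1
        elif seg == "t" and state == 1:
            state = 2  # delay output until the right context is seen
        else:
            surface.append(seg)
            state = 0
    if state == 2:  # word-final pending /t/ is not flapped
        surface.append("t")
    return surface
```

On an underlying form like /V' t V/ (cf. *butter*), the machine emits the flap `dx`; in any other context the /t/ surfaces unchanged. The delayed output illustrates why onward subsequential transducers of the kind OSTIA induces can represent such context-dependent rules.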

### Citations

3610 | Induction of Decision Trees - Quinlan - 1986 |

1737 | Optimality Theory: Constraint interaction in generative grammar - Prince, Smolensky - 1993 |
Citation Context: ...ed to empiricist induction models to build a cognitively and computationally plausible learning model for phonological rules. Ellison (1994), for example, has shown how to map optimality constraints (Prince and Smolensky, 1993) to finite-state automata; given this result, models of automaton induction enriched in the way we suggest may contribute to the current debate on optimality learning. Our model is not, however, nece... |

1132 | Instance-based learning algorithms - Aha, Kibler, et al. - 1991 |

955 | Lectures on Government and Binding - Chomsky - 1981 |
Citation Context: ...bring to bear in learning language is not linguistic at all, but derives from constraints imposed by our general cognitive architecture. Others, such as the influential Principles and Parameters model (Chomsky, 1981), assert that what is innate is linguistic knowledge itself, and that the learning process consists mainly of searching for the values of a relatively small number of parameters. Such nativist model... |

736 | Class-based n-gram models of natural language - Brown, Pietra, et al. - 1990 |

695 | The string-to-string correction problem - Wagner, Fischer - 1974 |

584 | Generalization as search - Mitchell - 1982 |

452 | Two-level morphology: a general computational model for word-form recognition and production - Koskenniemi - 1983 |
Citation Context: ...hat finite state transducers can be used to represent phonological rules, greatly simplifying the problem of parsing the output of phonological rules in order to obtain the underlying, lexical forms (Koskenniemi, 1983; Karttunen, 1993; Pulman and Hepple, 1993; Bird, 1995; Bird and Ellison, 1994). The fact that the weaker generative capacity of FSTs makes them easier to learn than arbitrary context-sensitive rules ... |

434 | Optimality Theory - Prince, Smolensky - 1993 |
Citation Context: ...arning others empirically may obviate the need to build in every phonological constraint, as for example nativist models of OT learning suggest (Prince and Smolensky, 1993; Tesar and Smolensky, 1993; Tesar, 1995). We hope in this way to begin to help assess the role of computational phonology in answering the general question of the necessity and nature of linguistic innateness in learning. 2. Transducer rep... |

346 | Regular models of phonological rule systems - Kaplan, Kay - 1994 |

259 | Autosegmental Phonology - Goldsmith - 1976 |

246 | Transductions and Context-Free Languages - Berstel - 1979 |

227 | Prosodic morphology - McCarthy, Prince - 1986 |
Citation Context: ...nded this assumption by restricting the domain of individual phonological rules to changes in an individual node in a feature-geometric representation. Recent two-level theories of Optimality Theory (McCarthy and Prince, 1995) make the assumption of faithfulness (one which is similar to Chomsky and Halle’s) more explicit by proposing a constraint FAITHFULNESS which requires that the phonological output string match its i... |

170 | Speech perception in infants - Eimas, Siqueland, et al. - 1971 |

140 | Hidden Markov model induction by Bayesian model merging - Stolcke, Omohundro - 1993 |
Citation Context: ...g algorithms including those for deterministic finite-state automata (Freund et al., 1993), deterministic transducers (Oncina, García, and Vidal, 1993), as well as non-deterministic (stochastic) FSAs (Stolcke and Omohundro, 1993; Stolcke and Omohundro, 1994; Ron, Singer, and Tishby, 1994). Like the empiricist models we discussed above, these algorithms are all general-purpose; none include any domain knowledge about phonolog... |

107 | Preliminaries to Speech Analysis - Jakobson, Fant, et al. - 1952 |
Citation Context: ...es which pick out certain equivalence classes of segments. Since the beginning of generative grammar, and based on Jakobson’s early insistence on the importance of binary oppositions (Jakobson, 1968; Jakobson, Fant, and Halle, 1952), phonological features, and not the segment, have generally formed the vocabulary over which linguistic rules are formed. Giving such knowledge to OSTIA would allow it to hypothesize that if every v... |

104 | Learning subsequential transducers for pattern recognition interpretation tasks - Oncina, García, et al. - 1993 |
Citation Context: ...bitrary context-sensitive rules has allowed the development of a number of learning algorithms including those for deterministic finite-state automata (Freund et al., 1993), deterministic transducers (Oncina, García, and Vidal, 1993), as well as non-deterministic (stochastic) FSAs (Stolcke and Omohundro, 1993; Stolcke and Omohundro, 1994; Ron, Singer, and Tishby, 1994). Like the empiricist models we discussed above, these algori... |

99 | A computational learning model for metrical phonology - Dresher, Kaye - 1990 |

97 | Best-first model merging for Hidden Markov Model induction - Stolcke, Omohundro - 1993 |
Citation Context: ...for deterministic finite-state automata (Freund et al., 1993), deterministic transducers (Oncina, García, and Vidal, 1993), as well as non-deterministic (stochastic) FSAs (Stolcke and Omohundro, 1993; Stolcke and Omohundro, 1994; Ron, Singer, and Tishby, 1994). Like the empiricist models we discussed above, these algorithms are all general-purpose; none include any domain knowledge about phonology, or indeed natural language... |

79 | Phonological derivation in Optimality Theory - Ellison - 1994 |
Citation Context: ...simplifying the problem of parsing the output of phonological rules in order to obtain the underlying, lexical forms (Koskenniemi, 1983; Karttunen, 1993; Pulman and Hepple, 1993; Bird, 1995; Bird and Ellison, 1994). The fact that the weaker generative capacity of FSTs makes them easier to learn than arbitrary context-sensitive rules has allowed the development of a number of learning algorithms including those... |

79 | The power of amnesia - Ron, Singer, et al. - 1994 |
Citation Context: ...e automata (Freund et al., 1993), deterministic transducers (Oncina, García, and Vidal, 1993), as well as non-deterministic (stochastic) FSAs (Stolcke and Omohundro, 1993; Stolcke and Omohundro, 1994; Ron, Singer, and Tishby, 1994). Like the empiricist models we discussed above, these algorithms are all general-purpose; none include any domain knowledge about phonology, or indeed natural language; at most they include a simple... |

77 | Child language, aphasia, and phonological universals - Jakobson - 1968 |
Citation Context: ...nological features which pick out certain equivalence classes of segments. Since the beginning of generative grammar, and based on Jakobson’s early insistence on the importance of binary oppositions (Jakobson, 1968; Jakobson, Fant, and Halle, 1952), phonological features, and not the segment, have generally formed the vocabulary over which linguistic rules are formed. Giving such knowledge to OSTIA would allow... |

73 | The learnability of Optimality Theory: An algorithm and some basic complexity results. Unpublished ms. - Tesar, Smolensky - 1993 |
Citation Context: ...previously learned) and learning others empirically may obviate the need to build in every phonological constraint, as for example nativist models of OT learning suggest (Prince and Smolensky, 1993; Tesar and Smolensky, 1993; Tesar, 1995). We hope in this way to begin to help assess the role of computational phonology in answering the general question of the necessity and nature of linguistic innateness in learning. 2. T... |

61 | Computational phonology: a constraint-based approach. Studies in Natural Language Processing - Bird - 1995 |
Citation Context: ...gical rules, greatly simplifying the problem of parsing the output of phonological rules in order to obtain the underlying, lexical forms (Koskenniemi, 1983; Karttunen, 1993; Pulman and Hepple, 1993; Bird, 1995; Bird and Ellison, 1994). The fact that the weaker generative capacity of FSTs makes them easier to learn than arbitrary context-sensitive rules has allowed the development of a number of learning al... |

61 | The Acquisition of Stress: A Data-oriented Approach - Daelemans, Gillis, et al. - 1994 |

60 | Formal Aspects of Phonological Description. The Hague: Mouton - Johnson - 1972 |

60 | Partial class behavior and nasal place assimilation - Padgett - 1995 |

59 | Feature geometry and dependency: a review - McCarthy - 1988 |

53 | Finite-state constraints - Karttunen - 1993 |
Citation Context: ...ansducers can be used to represent phonological rules, greatly simplifying the problem of parsing the output of phonological rules in order to obtain the underlying, lexical forms (Koskenniemi, 1983; Karttunen, 1993; Pulman and Hepple, 1993; Bird, 1995; Bird and Ellison, 1994). The fact that the weaker generative capacity of FSTs makes them easier to learn than arbitrary context-sensitive rules has allowed the d... |

46 | Efficient learning of typical finite automata from random walks - Freund, Kearns, et al. - 1997 |
Citation Context: ...pacity of FSTs makes them easier to learn than arbitrary context-sensitive rules has allowed the development of a number of learning algorithms including those for deterministic finite-state automata (Freund et al., 1993), deterministic transducers (Oncina, García, and Vidal, 1993), as well as non-deterministic (stochastic) FSAs (Stolcke and Omohundro, 1993; Stolcke and Omohundro, 1994; Ron, Singer, and Tishby, 1994)... |

46 | A statistical model for generating pronunciation networks - Riley - 1991 |

41 | One-Level Phonology: Autosegmental Representations and Rules as Finite-State Automata - Bird, Ellison - 1992 |
Citation Context: ...greatly simplifying the problem of parsing the output of phonological rules in order to obtain the underlying, lexical forms (Koskenniemi, 1983; Karttunen, 1993; Pulman and Hepple, 1993; Bird, 1995; Bird and Ellison, 1994). The fact that the weaker generative capacity of FSTs makes them easier to learn than arbitrary context-sensitive rules has allowed the development of a number of learning algorithms including those... |

38 | Correspondence and identity constraints in two-level Optimality Theory - Orgun - 1996 |

37 | Cognitive Phonology - Lakoff - 1993 |

35 | Learning the past tense of English verbs: The symbolic pattern associator vs. connectionist models - Ling - 1994 |

34 | Connectionist models and linguistic theory: Investigations of stress systems in language - Gupta, Touretzky - 1994 |

28 | The machine learning of phonological structure - Ellison - 1994 |

20 | A discovery procedure for certain phonological rules - Johnson - 1984 |

18 | A feature-based formalism for two-level phonology: a description and implementation. Computer Speech and Language - Pulman - 1993 |
Citation Context: ...used to represent phonological rules, greatly simplifying the problem of parsing the output of phonological rules in order to obtain the underlying, lexical forms (Koskenniemi, 1983; Karttunen, 1993; Pulman and Hepple, 1993; Bird, 1995; Bird and Ellison, 1994). The fact that the weaker generative capacity of FSTs makes them easier to learn than arbitrary context-sensitive rules has allowed the development of a number of... |

11 | Class-based n-gram models of natural language - Brown, Pietra, et al. - 1992 |

11 | Hidden Markov estimation for unrestricted stochastic context-free grammars - Kupiec - 1992 |
Citation Context: ...E-style rules, and to a non-probabilistic theory of purely deterministic transducers, these biases may also prove useful when applied to other, stochastic, linguistic rule induction algorithms (e.g. Kupiec, 1992; Lucke, 1993; Stolcke and Omohundro, 1993; Stolcke and Omohundro, 1994; Ron, Singer, and Tishby, 1994). We believe the idea of augmenting an empirical learning element with relatively abstract learni... |

10 | Learning words in time: Towards a modular connectionist account of the acquisition of receptive morphology - Gasser - 1993 |

9 | The CELEX lexical database - CELEX - 1993 |
Citation Context: ...applied the modified algorithm with variables in the output strings to the problem of the German rule that devoices word-final stops. Our dataset was constructed from the CELEX lexical database (CELEX, 1993), which contains pronunciations for 359,611 word forms – including various inflected forms of the same lexeme. For our experiments we used the CELEX pronunciations as the surface forms, and generated... |
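The data-generation step this context describes can be sketched briefly. The following Python illustration is hypothetical, not the paper's code (the helper names and the orthographic segment strings are placeholders for CELEX phonemic transcriptions): it builds (underlying, surface) training pairs by mechanically applying German final devoicing, under which a word-final voiced stop is devoiced (b → p, d → t, g → k).

```python
# Sketch (assumed helpers, not the paper's code): generate transducer
# training pairs by applying German final devoicing to underlying forms.

DEVOICE = {"b": "p", "d": "t", "g": "k"}  # word-final voiced stop -> voiceless

def devoice_final(underlying):
    """Apply final devoicing to a sequence of segments."""
    segs = list(underlying)
    if segs and segs[-1] in DEVOICE:
        segs[-1] = DEVOICE[segs[-1]]
    return segs

def make_pairs(lexicon):
    """Pair each underlying form with its mechanically derived surface form."""
    return [(list(w), devoice_final(w)) for w in lexicon]
```

For example, an underlying /hund/ is paired with surface [hunt], while forms ending in an already voiceless segment pass through unchanged; the resulting pairs are the kind of input/output sample a transducer induction algorithm consumes.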

9 | A computational basis for phonology - Touretzky, Wheeler - 1990 |

7 | A declarative theory of phonology-morphology interleaving. Presented at the conference on the derivational residue in phonology - Orgun - 1995 |

5 | Phonological rule induction: An architectural solution - Touretzky, Elvgren, et al. - 1990 |

4 | The Carnegie Mellon Pronouncing Dictionary v0.1 - CMU - 1993 |
Citation Context: ...lying pronunciation of an individual word of English and a machine-generated “surface pronunciation”. The underlying string of each pair was taken from the phoneme-based CMU pronunciation dictionary (CMU, 1993). The surface string was generated from each underlying form by mechanically applying the one or more rules we were attempting to induce in each experiment. In our first experiment, we applied the fl... |

3 | Word-Formation in Generative Grammar. Linguistic Inquiry Monograph no. 1 - Aronoff - 1976 |

3 | Inference of stochastic context-free grammar rules from example data using the theory of Bayesian belief propagation - Lucke - 1993 |
Citation Context: ...and to a non-probabilistic theory of purely deterministic transducers, these biases may also prove useful when applied to other, stochastic, linguistic rule induction algorithms (e.g. Kupiec, 1992; Lucke, 1993; Stolcke and Omohundro, 1993; Stolcke and Omohundro, 1994; Ron, Singer, and Tishby, 1994). We believe the idea of augmenting an empirical learning element with relatively abstract learning biases to... |

1 | The Sound Pattern of English - Chomsky - 1968 |