## Linguistic Structure as Composition and Perturbation (1996)

### Download Links

- [acl.ldc.upenn.edu]
- [www.aclweb.org]
- [ucrel.lancs.ac.uk]
- [aclweb.org]
- [wing.comp.nus.edu.sg]
- [arxiv.org]
- [www.ai.mit.edu]
- [www.demarcken.org]
- DBLP

### Other Repositories/Bibliography

Venue: Meeting of the Association for Computational Linguistics

Citations: 8 (0 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Marcken96linguisticstructure,
  author    = {Carl De Marcken},
  title     = {Linguistic Structure as Composition and Perturbation},
  booktitle = {Meeting of the Association for Computational Linguistics},
  year      = {1996},
  pages     = {335--341},
  publisher = {Morgan Kaufmann Publishers}
}
```

### Abstract

This paper discusses the problem of learning language from unprocessed text and speech signals, concentrating on the problem of learning a lexicon. In particular, it argues for a representation of language in which linguistic parameters like words are built by perturbing a composition of existing parameters. The power of the representation is demonstrated by several examples in text segmentation and compression, acquisition of a lexicon from raw speech, and the acquisition of mappings between text and artificial representations of meaning.
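The abstract's central objective, choosing the lexicon that minimizes the combined description length of the lexicon and the input, can be sketched in a few lines. The greedy longest-match parse, the pseudo-count for unknown characters, and the flat 8-bits-per-character lexicon cost below are illustrative assumptions, not the paper's actual coding scheme.

```python
import math

def description_length(lexicon, text):
    """Combined cost (in bits) of a lexicon plus the text encoded with it.

    `lexicon` maps words to occurrence counts. The greedy longest-match
    parse and the flat 8-bits-per-character lexicon cost are simplifying
    assumptions for illustration only.
    """
    total = sum(lexicon.values())
    text_cost = 0.0
    i = 0
    while i < len(text):
        # Greedy longest-match parse of the input.
        for j in range(len(text), i, -1):
            word = text[i:j]
            if word in lexicon or j == i + 1:
                # Each word token costs -log2 p(word); unknown single
                # characters fall back to a pseudo-count of 1.
                p = lexicon.get(word, 1) / max(total, 1)
                text_cost += -math.log2(p)
                i = j
                break
    # Spelling out the lexicon itself also costs bits.
    lexicon_cost = sum(8 * len(w) for w in lexicon)
    return lexicon_cost + text_cost
```

Under this cost, adding a word to the lexicon pays off only when the bits spent spelling it out are repaid by shorter codes for its occurrences in the input, which is the MDL trade-off the paper exploits.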

### Citations

- **1174** citations: *Modeling by shortest data description*, Rissanen, 1978.
  Context: "...lexicon. The lexicon that minimizes the combined description length of the lexicon and the input maximally compresses the input. In the sense of Rissanen's minimum description-length (MDL) principle (Rissanen, 1978; Rissanen, 1989) this lexicon is the theory that best explains the data, and one can hope that the patterns in the lexicon reflect the underlying mechanisms and parameters of the language that generated the input."

- **787** citations: *A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains*, Baum, Petrie, et al., 1970.
  Context: "...arrive at a (locally) optimal set of frequencies and codelengths for the words in the lexicon. For composition by concatenation, the algorithm reduces to the special case of the Baum-Welch procedure (Baum et al., 1970) discussed in (Deligne and Bimbot, 1995). In general, however, the parsing and reestimation involved in EM can be considerably more complicated. To update the structure of the lexicon, words can be a..."
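As a concrete, simplified stand-in for the soft Baum-Welch update this context refers to, the sketch below runs a hard (Viterbi) EM loop over a concatenation lexicon: parse each string into its most probable word sequence, count the words used, renormalize. The function names and the character-floor probability are assumptions for illustration, not the paper's procedure, which uses soft fractional counts.

```python
import math

def viterbi_segment(text, probs):
    """Most probable segmentation of `text` under unigram word probabilities.

    Unknown single characters get a small floor probability so that
    every string has at least one parse.
    """
    floor = 1e-6
    best = [0.0] + [math.inf] * len(text)   # best cost of a parse up to position i
    back = [0] * (len(text) + 1)            # start index of the last word in that parse
    for i in range(1, len(text) + 1):
        for j in range(i):
            w = text[j:i]
            p = probs.get(w, floor if len(w) == 1 else 0.0)
            if p > 0.0 and best[j] - math.log2(p) < best[i]:
                best[i] = best[j] - math.log2(p)
                back[i] = j
    words, i = [], len(text)
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

def reestimate(texts, probs, iterations=3):
    """Hard-EM: segment every string, count the words used, renormalize."""
    for _ in range(iterations):
        counts = {}
        for text in texts:
            for w in viterbi_segment(text, probs):
                counts[w] = counts.get(w, 0) + 1
        total = sum(counts.values())
        probs = {w: c / total for w, c in counts.items()}
    return probs
```

The soft version replaces the single best parse with expected counts over all parses (forward-backward), which is where the reduction to Baum-Welch arises for concatenation.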

- **360** citations: *Frequency analysis of English usage: Lexicon and grammar*, Francis and Kucera, 1982.
  Context: "...grammatical primitive: it is the product of a complex mixture of linguistic and extra-linguistic processes. Such patterns can be indistinguishable from desired ones. For example, in the Brown corpus (Francis and Kucera, 1982) scratching her nose occurs 5 times, a corpus-specific idiosyncrasy. This phrase has the same structure as the idiom kicking the bucket. It is difficult to imagine any induction algorithm learning ki..."

- **273** citations: *Inside-outside reestimation from partially bracketed corpora*, Pereira and Schabes, 1992.
  Context: "...ristics can be used to estimate the benefit of deleting words. 3.2 Search Properties A significant source of problems in traditional grammar induction techniques is local minima (de Marcken, 1995a; Pereira and Schabes, 1992; Carroll and Charniak, 1992). The search algorithm described above avoids many of these problems. The reason is that hidden structure is largely a "compile-time" phenomena. During parsing all that is..."

- **270** citations: *Trainable grammars for speech recognition*, Baker, 1979.
  Context: "...makes use of context, then this framework extends naturally to a variation of stochastic context-free grammars in which composition corresponds to tree substitution and the inside-outside algorithm (Baker, 1979) is used for re-estimation. In particular, if each word is associated with a parent class, and these classes are permissible terminals, then "words" act as production rules. For example, a possible w..."

- **165** citations: *Mathematical Structures of Language*, Harris, 1968.

- **161** citations: *The Logical Structure of Linguistic Theory*, Chomsky, 1955.

- **126** citations: *Bayesian Learning of Probabilistic Language Models*, Stolcke, 1994.
  Context: "...Solomonoff (1960) and Harris (1968), and compression has been used as the basis for a wide variety of computer programs that attack unsupervised learning in language; see (Olivier, 1968; Wolff, 1982; Ellison, 1992; Stolcke, 1994; Chen, 1995; Cartwright and Brent, 1994) among others. 1.1 Patterns and Language Unfortunately, while surface patterns often reflect interesting linguistic mechanisms and parameters, they do not alwa..."

- **89** citations: *Stochastic Complexity in Statistical Inquiry* (World Scientific), Rissanen, 1989.
  Context: "...icon that minimizes the combined description length of the lexicon and the input maximally compresses the input. In the sense of Rissanen's minimum description-length (MDL) principle (Rissanen, 1978; Rissanen, 1989) this lexicon is the theory that best explains the data, and one can hope that the patterns in the lexicon reflect the underlying mechanisms and parameters of the language that generated the input..."

- **65** citations: *An estimate of an upper bound for the entropy of English*, Brown, Pietra, et al., 1992.
  Context: "...r, significantly lower than popular algorithms like gzip (2.95 bits/char). This is the best text compression result on this corpus that we are aware of, and should not be confused with lower figures (Brown et al., 1992) that do not include the cost of parameters. Furthermore, because the compressed text is stored in terms of linguistic units like words, it can be searched, indexed, and parsed without decompression..."

- **54** citations: *Language Acquisition, Data Compression and Generalisation*, Wolff, 1982.
  Context: "...Chomsky (1955), Solomonoff (1960) and Harris (1968), and compression has been used as the basis for a wide variety of computer programs that attack unsupervised learning in language; see (Olivier, 1968; Wolff, 1982; Ellison, 1992; Stolcke, 1994; Chen, 1995; Cartwright and Brent, 1994) among others. 1.1 Patterns and Language Unfortunately, while surface patterns often reflect interesting linguistic mechanisms an..."

- **53** citations: *Bayesian grammar induction for language modeling*, Chen, 1995.
  Context: "...1968), and compression has been used as the basis for a wide variety of computer programs that attack unsupervised learning in language; see (Olivier, 1968; Wolff, 1982; Ellison, 1992; Stolcke, 1994; Chen, 1995; Cartwright and Brent, 1994) among others. 1.1 Patterns and Language Unfortunately, while surface patterns often reflect interesting linguistic mechanisms and parameters, they do not always do so. Th..."

- **45** citations: *Language modeling by variable length sequences: theoretical formulation and evaluation of multigrams*, Deligne and Bimbot, 1995.
  Context: "...of frequencies and codelengths for the words in the lexicon. For composition by concatenation, the algorithm reduces to the special case of the Baum-Welch procedure (Baum et al., 1970) discussed in (Deligne and Bimbot, 1995). In general, however, the parsing and reestimation involved in EM can be considerably more complicated. To update the structure of the lexicon, words can be added or deleted from it if this is predi..."

- **36** citations: *The unsupervised acquisition of a lexicon from continuous speech*, de Marcken, 1995.

- **32** citations: *Stochastic Grammars and Language Acquisition Mechanisms*, Olivier, 1968.
  Context: "...ts included Chomsky (1955), Solomonoff (1960) and Harris (1968), and compression has been used as the basis for a wide variety of computer programs that attack unsupervised learning in language; see (Olivier, 1968; Wolff, 1982; Ellison, 1992; Stolcke, 1994; Chen, 1995; Cartwright and Brent, 1994) among others. 1.1 Patterns and Language Unfortunately, while surface patterns often reflect interesting linguistic..."

- **28** citations: *Compression of individual sequences by variable rate coding*, Ziv and Lempel, 1978.

- **27** citations: *The Machine Learning of Phonological Structure*, Ellison, 1992.
  Context: "...Solomonoff (1960) and Harris (1968), and compression has been used as the basis for a wide variety of computer programs that attack unsupervised learning in language; see (Olivier, 1968; Wolff, 1982; Ellison, 1992; Stolcke, 1994; Chen, 1995; Cartwright and Brent, 1994) among others. 1.1 Patterns and Language Unfortunately, while surface patterns often reflect interesting linguistic mechanisms and parameters, t..."

- **26** citations: *Learning probabilistic dependency grammars from labelled text*, Carroll and Charniak, 1992.
  Context: "...are being used, c′(W) is not assumed to be exactly c(W). 3.2 Search Properties Local optima debilitate many traditional grammar induction techniques (de Marcken, 1995a; Pereira and Schabes, 1992; Carroll and Charniak, 1992). The search algorithm described above generally escapes this problem, in large part because of the underlying representation. The reason is that hidden structure is largely a "compile-time" phenomen..."

- **17** citations: *Lexical heads, phrase structure and the induction of grammar* (SIGDAT), de Marcken, 1995.

- **16** citations: *Maximum Likelihood from Incomplete Data via the EM Algorithm*, Dempster, Laird, et al., 1977.
  Context: "...Algorithm Since the class of possible lexicons is infinite, the minimization of description length is necessarily inexact and heuristic. Given a fixed lexicon, the expectation-maximization algorithm (Dempster et al., 1977) can be used to arrive at a (locally) optimal set of probabilities and codelengths for the words in the lexicon. For composition by concatenation, the algorithm reduces to the special case of the Bau..."

- **11** citations: *Segmenting speech without a lexicon: Evidence for a bootstrapping model of lexical acquisition*, Cartwright and Brent, 1994.
  Context: "...ompression has been used as the basis for a wide variety of computer programs that attack unsupervised learning in language; see (Olivier, 1968; Wolff, 1982; Ellison, 1992; Stolcke, 1994; Chen, 1995; Cartwright and Brent, 1994) among others. 1.1 Patterns and Language Unfortunately, while surface patterns often reflect interesting linguistic mechanisms and parameters, they do not always do so. Three classes of examples serv..."

- **10** citations: *The mechanization of linguistic learning*, Solomonoff, 1958.
