## Formal grammar and information theory: Together again? (2000)

Venue: | PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY |

Citations: | 28 - 0 self |

### BibTeX

@ARTICLE{Pereira00formalgrammar,

author = {Fernando Pereira},

title = {Formal grammar and information theory: Together again?},

journal = {PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY},

year = {2000},

volume = {358},

pages = {1239--1253}

}

### Abstract

In the last 40 years, research on models of spoken and written language has been split between two seemingly irreconcilable traditions: formal linguistics in the Chomsky tradition, and information theory in the Shannon tradition. Zellig Harris had advocated a close alliance between grammatical and information-theoretic principles in the analysis of natural language, and early formal-language theory provided another strong link between information theory and linguistics. Nevertheless, in most research on language and computation, grammatical and information-theoretic approaches had moved far apart. Today, after many years on the defensive, the information-theoretic approach has gained new strength and achieved practical successes in speech recognition, information retrieval, and, increasingly, in language analysis and machine translation. The exponential increase in the speed and storage capacity of computers is the proximate cause of these engineering successes, allowing the automatic estimation of the parameters of probabilistic models of language by counting occurrences of linguistic events in very large bodies of text and speech. However, I will argue that information-theoretic and computational ideas are also playing an increasing role in the scientific understanding of language, and will help bring together formal-linguistic and information-theoretic perspectives.

### Citations

9002 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ...al theoretical results in this area give probabilistic bounds on the generalization error of a model as a function of model error on training data, sample size, model complexity, and margin of error (Vapnik, 1995). In qualitative terms, the gap between test and training error — a measure of overfitting — grows with model complexity for a fixed training sample size, and decreases with sample size for a fixed m... |

8595 |
Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context ...onstraints appear to ignore the content-carrying function of utterances. Fortunately, information theory provides a ready tool for quantifying information about with the notion of mutual information (Cover & Thomas 1991), from which a suitable notion of compression relative to side variables of interest can be defined (Tishby et al. 1999). Given the enormous conceptual and technical difficulties of building a compreh... |
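The mutual information invoked here can be computed directly from a joint distribution; a minimal Python sketch, with an invented toy distribution over (topic, word) pairs:

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )."""
    px, py = {}, {}
    for (x, y), p in joint.items():          # accumulate the two marginals
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Toy joint distribution over (topic, word) pairs -- values are invented.
joint = {
    ("finance", "bank"): 0.30, ("finance", "river"): 0.05,
    ("nature",  "bank"): 0.10, ("nature",  "river"): 0.55,
}
mi = mutual_information(joint)
```

A positive value here means the word carries information about the topic, which is the sense of "information about" used in the passage.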

8134 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...p(w1 … wn) = p(w1) · ∏_{i=2}^{n} p(wi | wi−1). By using this estimate for the probability of a string and an aggregate model with C = 16 trained on newspaper text, and by using the expectation-maximization (EM) method (Dempster et al. 1977), we find that p(Colourless green ideas sleep furiously) / p(Furiously sleep ideas green colourless) ≈ 2 × 10^5. Thus, a suitably constrained statistical model, even a very simple one, can meet Choms... |
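The chain-rule approximation in this passage is easy to sketch; the toy probabilities below are invented for illustration. Note that without smoothing, a single unseen bigram zeroes out the whole string, which is exactly the failure mode the aggregate model avoids:

```python
def string_prob(words, p_first, p_bigram):
    """p(w1 ... wn) ~= p(w1) * prod over i=2..n of p(w_i | w_{i-1})."""
    prob = p_first.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        # Unsmoothed: any bigram absent from the table contributes 0.
        prob *= p_bigram.get((prev, cur), 0.0)
    return prob

# Toy probabilities -- invented, not trained on any corpus.
p_first = {"green": 0.01}
p_bigram = {("green", "ideas"): 0.001, ("ideas", "sleep"): 0.002}
prob = string_prob(["green", "ideas", "sleep"], p_first, p_bigram)
```

Here `prob` is 0.01 × 0.001 × 0.002, while any string containing an unlisted bigram gets exactly zero.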

2322 | A decisiontheoretic generalization of on-line learning and its application to boosting
- Freund, Schapire
- 1997
Citation Context ...iant 1984), or even one assumes an on-line setting, in which the goal is to do the best possible prediction on a fixed sequence incrementally generated by the environment (Littlestone & Warmuth 1994; Freund & Schapire 1997). A crucial idea from the distribution-free setting is that model complexity can be measured, even for an infinite model class, by combinatorial quantities such as the Vapnik-Chervonenkis (VC) dimens... |

2175 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context ...al theoretical results in this area give probabilistic bounds on the generalization error of a model as a function of model error on training data, sample size, model complexity, and margin of error (Vapnik, 1995). In qualitative terms, the gap between test and training error — a measure of overfitting — grows with model complexity for a fixed training sample size, and decreases with sample size for a fixed m... |

1696 | A theory of the learnable
- Valiant
- 1984
Citation Context ...nknown parameters (as is usually done in statistics), or one takes a distribution-free view in which the data is distributed according to an unknown (but fixed between training and test) distribution (Valiant, 1984), or even one assumes an on-line setting in which the goal is to do the best possible prediction on a fixed sequence incrementally generated by the environment (Littlestone & Warmuth, 1994; Freund & ... |

1364 |
Aspects of the theory of syntax
- Chomsky
- 1965
Citation Context ...material may be elided, although it would seem that additional constraints on reduction may be necessary. Furthermore, connections between reduction-based and transformational analyses (Harris, 1965; Chomsky, 1965) suggest the possibility of modeling string distributions as the overt projection of a hidden generative process involving operator-argument structures subject to the likelihood constraint subject to... |

1262 |
The Minimalist Program
- Chomsky
- 1995
Citation Context ...xicalized models less rather than more palatable to Chomskian linguists, for whom structural relationships are the prime subject of theory. But notice that Chomsky’s more recent “minimalist program” (Chomsky, 1995) is much more lexically based than any of his theories since “Aspects” (Chomsky, 1965), in ways reminiscent of other lexicalized multistratal theories, in particular lexical-functional grammar (... |

1245 | Combining labeled and unlabeled data with co-training
- Blum, Mitchell
- 1998
Citation Context ...context, allow a language processor to learn from its linguistic environment with very little or no supervision (Yarowsky 1995), and have suggested new machine-learning settings, such as co-training (Blum & Mitchell 1998). Both lexicalized grammars and bag-of-words models represent statistical associations between words in certain configurations. However, the kinds of associations I include under this description a... |

1206 |
Head-Driven Phrase Structure Grammar
- Pollard, Sag
- 1994
Citation Context ...han any of his theories since “Aspects” (Chomsky, 1965), in ways reminiscent of other lexicalized multistratal theories, in particular lexical-functional grammar (Bresnan, 1982), HPSG (Pollard & Sag, 1994), and certain varieties of categorial grammar (Morrill, 1994; Moortgat, 1995; Cornell, 1997). ...other reason than for the func... |

957 | Head-Driven Statistical Models for Natural Language Parsing
- Collins
- 1999
Citation Context ...ications perspectives, recent work has shown that lexicalized probabilistic context-free grammars can be learned automatically that perform with remarkable accuracy on novel material (Charniak, 1997; Collins, 1998). Besides lexicalization, these models factor the sentence generation process into a sequence of conditionally independent events that reflect such linguistic distinctions as those of head and depend... |

947 |
On the uniform convergence of relative frequencies of events to their probabilities, Theory Probab.
- Vapnik, Chervonenkis
- 1971
Citation Context ...crucial idea from the distribution-free setting is that model complexity can be measured, even for an infinite model class, by combinatorial quantities such as the Vapnik-Chervonenkis (VC) dimension (Vapnik & Chervonenkis 1971), which, roughly speaking, gives the order of a polynomial upper bound on the number of distinctions that can be made between samples by models in the class, as a function of sample size. Returning t... |

669 | The Weighted Majority Algorithm
- Littlestone, Warmuth
- 1994
Citation Context ... and test) distribution (Valiant 1984), or even one assumes an on-line setting, in which the goal is to do the best possible prediction on a fixed sequence incrementally generated by the environment (Littlestone & Warmuth 1994; Freund & Schapire 1997). A crucial idea from the distribution-free setting is that model complexity can be measured, even for an infinite model class, by combinatorial quantities such as the Vapnik-... |

666 | Estimation of probabilities from sparse data for the language model component of a speech recognizer
- Katz
- 1987
Citation Context ... of the earliest such methods, due to Turing and Good (Good, 1953), had been published before Chomsky’s attack on empiricism, and has since been used to good effect in statistical models of language (Katz, 1987). The use of smoothing and other forms of regularization to constrain the form of statistical models and ensure better generalization to unseen data is an instance of a central theme in statistical l... |

585 | A statistical approach to machine translation
- Brown, Cocke, et al.
- 1990
Citation Context ... of what kinds of learning tasks may involve “understanding” but do not force us to attack frontally the immense challenges of grounded language processing. Automatically trained machine translation (Brown et al., 1990; Alshawi & Douglas, 1999) may be such a task, since translation requires that many questions about a text be answered accurately to produce a correct output. Nevertheless, it is easy to find many ... |

550 | Inducing Features of Random Fields - Pietra, Pietra, et al. - 1997 |

490 | Unsupervised word sense disambiguation rivaling supervised methods
- Yarowsky
- 1995
Citation Context ...evice. These correlations, like the correlations between utterances and their physical context, allow a language processor to learn from its linguistic environment with very little or no supervision (Yarowsky 1995), and have suggested new machine-learning settings, such as co-training (Blum & Mitchell 1998). Both lexicalized grammars and bag-of-words models represent statistical associations between words in c... |

457 |
The mathematics of Sentence Structure
- Lambek
- 1958
Citation Context ...e argument class information for a word as suggested by Harris and its type in categorial grammar, or subcategorization frames in other linguistic formalisms. However, traditional categorial grammar (Lambek, 1958) conflates function-argument relationships and linear order, whereas Harris factors out linear order explicitly. It is only more recently that categorial grammar has acquired the technical means to i... |

403 |
A Formal Theory of Inductive Inference
- Solomonoff
- 1964
Citation Context ...lar, any infinite class of grammars can be given a universal prior based on the number of bits needed to encode members of the class, which favors the least complex grammars compatible with the data (Solomonoff, 1964; Horning, 1969). However, those results did not provide a way of quantifying the relationship between a prior over grammars, training sample size and generalization power, and in any case seems to ha... |

367 | Statistical parsing with a context-free grammar and word statistics
- Charniak
- 1997
Citation Context ...ional and applications perspectives, recent work has shown that lexicalized probabilistic context-free grammars can be automatically learned that perform, with remarkable accuracy, on novel material (Charniak 1997; Collins 1998). Besides lexicalization, these models factor the sentence-generation process into a sequence of conditionally independent events that reflect such linguistic distinctions as those of he... |

357 |
Knowledge of language: Its nature, origin, and use
- Chomsky
- 1986
Citation Context ...sky’s principles-and-parameters theory, according to which learnability requires that the set of possible natural languages be generated by the settings of a finite set of finitely valued parameters (Chomsky, 1986, p. 149). But this extreme constraint is neither necessary, since infinite model classes of finite VC dimension are learnable from an information-theoretic point of view, nor sufficient, because even ... |

356 | Syntactic Structures. The Hague: Mouton - Chomsky - 1957 |

354 |
The population frequencies of species and the estimation of population parameters
- Good
- 1953
Citation Context ...arly, smoothing methods can be used in probability models to assign some probability mass to unseen events (Jelinek & Mercer, 1980). In fact, one of the earliest such methods, due to Turing and Good (Good, 1953), had been published before Chomsky’s attack on empiricism, and has since been used to good effect in statistical models of language (Katz, 1987). The use of smoothing and other forms of regularizati... |
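The Good-Turing method mentioned here adjusts each observed count r using counts-of-counts: r* = (r + 1) · N_{r+1} / N_r, and reserves probability mass N_1 / N for unseen events. A minimal sketch on an invented toy corpus (a production version would also smooth the N_r themselves, which get noisy for large r):

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Return (r -> r*, unseen-mass estimate N_1 / N) for observed event counts."""
    freq_of_freq = Counter(counts.values())   # N_r: how many events occurred exactly r times
    total = sum(counts.values())              # N: total observations
    adjusted = {}
    for r, n_r in freq_of_freq.items():
        # r* = (r + 1) * N_{r+1} / N_r; unsmoothed N_r, so large r can hit N_{r+1} = 0.
        adjusted[r] = (r + 1) * freq_of_freq.get(r + 1, 0) / n_r
    p_unseen = freq_of_freq.get(1, 0) / total
    return adjusted, p_unseen

# Toy counts (invented): three hapaxes, two doubletons, one triple -- N = 10.
counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
adjusted, p_unseen = good_turing_adjusted_counts(counts)
```

With these counts, singletons are discounted from 1 to 4/3 × ... rather, to 2·N_2/N_1 = 4/3, and 3/10 of the probability mass is set aside for events never seen at all.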

337 |
Interpolated estimation of Markov source parameters from sparse data
- Jelinek, Mercer
- 1980
Citation Context ... data exactly but that will be less dependent on the vagaries of the sample. Similarly, smoothing methods can be used in probability models to assign some probability mass to unseen events (Jelinek & Mercer, 1980). In fact, one of the earliest such methods, due to Turing and Good (Good, 1953), had been published before Chomsky’s attack on empiricism, and has since been used to good effect in statistical model... |
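Interpolated (Jelinek-Mercer) smoothing mixes the maximum-likelihood bigram estimate with a lower-order unigram estimate; a minimal sketch, where the mixing weight lam = 0.7 is arbitrary rather than tuned on held-out data as it would be in practice:

```python
def interpolated_bigram(w_prev, w, bigram_counts, unigram_counts, lam=0.7):
    """p(w | w_prev) = lam * ML-bigram + (1 - lam) * ML-unigram.

    Unseen bigrams still get nonzero probability via the unigram term.
    """
    total = sum(unigram_counts.values())
    prev_count = unigram_counts.get(w_prev, 0)
    p_bi = bigram_counts.get((w_prev, w), 0) / prev_count if prev_count else 0.0
    p_uni = unigram_counts.get(w, 0) / total
    return lam * p_bi + (1 - lam) * p_uni

# Toy counts -- invented for illustration.
unigram_counts = {"green": 4, "ideas": 2, "sleep": 2, "the": 2}
bigram_counts = {("green", "ideas"): 1}
p_seen = interpolated_bigram("green", "ideas", bigram_counts, unigram_counts)
p_unseen = interpolated_bigram("sleep", "green", bigram_counts, unigram_counts)
```

The unseen bigram ("sleep", "green") receives 0.3 × p(green) of mass instead of zero, which is the point of the technique.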

305 | Cryptographic limitations on learning Boolean formulae and finite automata
- Kearns, Valiant
- 1994
Citation Context ...at is, the search for a model with good generalization may be computationally intractable even though the information is, in principle, available (Kearns & Valiant 1994). 4. Hidden variables Early empiricist theories of linguistic behaviour made themselves easy targets of critiques like that of Chomsky (1959) by denying a significant role for the internal, unobserva... |

281 |
The mental representation of grammatical relations
- Bresnan, ed
- 1982
Citation Context ...) is much more lexically based than any of his theories since “Aspects” (Chomsky, 1965), in ways reminiscent of other lexicalized multistratal theories, in particular lexical-functional grammar (Bresnan, 1982), HPSG (Pollard & Sag, 1994), and certain varieties of categorial grammar (Morrill, 1994; Moortgat, 1995; Cornell, 1997). ... |

269 |
Trainable grammars for speech recognition
- Baker
- 1979
Citation Context ...t precluded significant algorithmic and experimental progress with carefully designed model classes and learning methods, such as EM and variants, especially in speech processing (Baum & Petrie 1966; Baker 1979). In particular, the learning problem is easier in practice if interactions between hidden variables tend to factor via the observed variables. ... |

208 | One sense per discourse
- Gale, Church, et al.
- 1992
Citation Context ...disambiguation, bag-of-words techniques are successful because of the underlying coherence of purposeful language, at syntactic, semantic, and discourse levels. The one-sense-per-discourse principle (Gale et al. 1992) captures a particular form of this coherence. For example, the co-occurrence of the words ‘stocks’, ‘bonds’ and ‘bank’ in the same passage is potentially indicative of a financial subject matter, an... |

178 |
A Computational Study of Cross-Situational Techniques for Learning Word-to-Meaning Mappings
- Siskind
- 1996
Citation Context ...ion required would be to use instead additional observables correlated with the hidden variables, such as prosodic information or perceptual input associated with the content of the linguistic input (Siskind, 1996; Roy & Pentland, 1999). More generally, we may be able to replace direct supervision by indirect correlations, as I now discuss. 6. The Power of Correlations How poor is the stimulus that the languag... |

159 | A linear observed time statistical parser based on maximum entropy models
- Ratnaparkhi
- 1997
Citation Context ...from random fields and factor probabilities instead as products of exponentials of indicator functions for significant local or global features (events) (Della Pietra, Della Pietra, & Lafferty, 1997; Ratnaparkhi, 1997; Abney, 1997), which can be built incrementally with “greedy” algorithms that select at each step the most informative feature. ... |

148 |
Methods in structural linguistics
- Harris
- 1951
Citation Context ...s grammatical, while (2) is not. Before and after the split, Zellig Harris had advocated a close alliance between grammatical and information-theoretic principles in the analysis of natural language (Harris, 1951, 1991). Early formal-language theory provided another strong link between information theory and linguistics. Nevertheless, in most research ... |

143 | Stochastic attribute-value grammars
- Abney
- 1997
Citation Context ...and factor probabilities instead as products of exponentials of indicator functions for significant local or global features (events) (Della Pietra, Della Pietra, & Lafferty, 1997; Abney, 1997), which can be built incrementally with “greedy” algorithms that select at each step the most informative feature. ... |

120 |
Type Logical Grammar: Categorial Logic of Signs
- Morrill
- 1994
Citation Context ...lationships and linear order, whereas Harris factors out linear order explicitly. It is only more recently that categorial grammar has acquired the technical means to investigate such factorizations (Morrill, 1994; Moortgat, 1995). It then becomes clear that Harris’s partial order may be formalized as the partial order among set-theoretic function types. However, unlike modern categorial grammar, Harris’s ... |

99 | Derivational minimalism
- Stabler
- 1997
Citation Context ...or-argument structures subject to the likelihood constraint and transformations. Recent work linking transformational and categorial approaches to syntax makes this possibility especially intriguing (Stabler 1997; Cornell 1997). Linearization ‘Since the relation that makes sentences out of words is a partial order, while speech is linear, a linear projection is involved from the start.’ Harris’s theory left t... |

86 | On the computational complexity of approximating distributions by probabilistic automata
- Abe, Warmuth
- 1992
Citation Context ...ables is, however, not in general sufficient, since the problem of setting the corresponding conditional probabilities from observable linguistic material is in most cases computationally intractable (Abe & Warmuth 1992). Nevertheless, those intractability results have not precluded significant algorithmic and experimental progress with carefully designed model classes and learning methods, such as EM and variants, ... |

80 | Statistical Methods and Linguistics
- Abney
- 1996
Citation Context ...ideas” quote above is an early instance. Chomsky concluded that sentences (1) and (2) are equally unlikely from the observation that neither sentence nor ‘part’ thereof would have occurred previously (Abney, 1996). From this observation, he argued that any statistical model based on the frequencies of word sequences would have to assign equal, zero, probabilities to both sentences. But this relies on the unst... |

79 | Aggregate and Mixed-Order Markov Models for Statistical Language
- Saul, Pereira
- 1997
Citation Context ...than the O(N^2) parameters of the direct model for the joint distribution, and is thus less prone to overfitting if C ≪ V. In particular, when (x, y) = (v_i, v_{i+1}), we have an aggregate bigram model (Saul & Pereira 1997), which is useful for modelling word sequences that include unseen bigrams. With such a model, we can approximate the probability of a string p(w1 … wn) by p(w1 … wn) = p(w1) · ∏_{i=2}^{n} p(wi | wi−1). By using this... |
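The aggregate model in this passage routes the word-pair dependence through a small set of hidden classes, p(y | x) = Σ_c p(c | x) · p(y | c), so only O(CV) parameters are needed. A sketch with invented parameters (in practice EM would estimate them, as the Dempster et al. context above notes):

```python
def aggregate_bigram_prob(x, y, p_class_given_word, p_word_given_class):
    """p(y | x) = sum over classes c of p(c | x) * p(y | c)."""
    return sum(p_cx * p_word_given_class[c].get(y, 0.0)
               for c, p_cx in p_class_given_word[x].items())

# Invented class-conditional parameters with two hidden classes.
p_class_given_word = {"green": {"colour": 0.8, "abstract": 0.2}}
p_word_given_class = {
    "colour":   {"leaves": 0.30, "ideas": 0.05},
    "abstract": {"ideas": 0.40},
}
# "green ideas" may never occur as a bigram, yet it gets nonzero probability
# because both words are linked through the hidden classes.
p = aggregate_bigram_prob("green", "ideas", p_class_given_word, p_word_given_class)
```

This is how the model assigns a usable (small but nonzero) probability to the famous unseen sequences.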

79 | The context tree weighting method: Basic properties
- Willems, Shtarkov, et al.
- 1995
Citation Context ...ticular model in a class of models is a best compromise between fitting the experience so far and generalizing to new experience. When the best choice of model is uncertain, Bayesian model averaging (Willems et al. 1995) can be used to combine the predictions of different candidate models according to the language user’s degree of belief in them, as measured by their past success. Model averaging is, thus, a way for ... |
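Model averaging of this kind can be sketched in a few lines: predictions are mixed by degree of belief, and beliefs are reweighted by each model's past success. The two candidate models below are invented:

```python
def average_prediction(beliefs, models, x):
    """Combine the candidate models' predictions, weighted by degree of belief."""
    return sum(beliefs[m] * models[m](x) for m in beliefs)

def update_beliefs(beliefs, models, observed):
    """Bayes update: reweight each model by the probability it gave the observation."""
    new = {m: beliefs[m] * models[m](observed) for m in beliefs}
    z = sum(new.values())
    return {m: w / z for m, w in new.items()}

# Two hypothetical predictors of a binary event; parameters are invented.
models = {
    "biased": lambda x: 0.9 if x == "heads" else 0.1,
    "fair":   lambda x: 0.5,
}
beliefs = {"biased": 0.5, "fair": 0.5}
beliefs = update_beliefs(beliefs, models, "heads")   # success raises "biased"'s weight
p_next = average_prediction(beliefs, models, "heads")
```

After one observation the better-predicting model dominates the mixture, which is the "measured by their past success" mechanism in the passage.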

78 | Pac-bayesian model averaging
- McAllester
- 1999
Citation Context ...than polynomial time on a deterministic sequential computer, for instance the NP-hard problems. ...statistical learning theory (McAllester, 1999) may provide new theoretical impetus to that research direction, since they show that a prior over models can play a similar regularizing role to a combinatorial complexity measure. The other role fo... |

70 |
A Study of Grammatical Inference
- Horning
- 1969
Citation Context ... class of grammars can be given a universal prior based on the number of bits needed to encode members of the class, which favours the least complex grammars compatible with the data (Solomonoff 1964; Horning 1969). However, those results did not provide a way of quantifying the relationship between a prior over grammars, training sample size and generalization power, and in any case seems to have been ignored... |

40 | Multimodal linguistic inference
- MOORTGAT
- 1995
Citation Context ... linear order, whereas Harris factors out linear order explicitly. It is only more recently that categorial grammar has acquired the technical means to investigate such factorizations (Morrill, 1994; Moortgat, 1995). It then becomes clear that Harris’s partial order may be formalized as the partial order among set-theoretic function types. However, unlike modern categorial grammar, Harris’s partial order co... |

35 |
Language and information
- Harris
- 1988
Citation Context ...In that tradition, Zellig Harris developed what is probably the best-articulated proposal for a marriage of linguistics and information theory. This proposal involves four main so-called constraints (Harris 1988) as follows. Partial order ‘...for each word...there are zero or more classes of words, called its arguments, such that the given word will not occur in a sentence unless one word...of each ... |

33 | A Theory of Language and Information: A Mathematical Approach - Harris - 1991 |

31 | Review of Skinner's "Verbal behavior" - Chomsky - 1959 |

24 |
Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer
- Salton
- 1989
Citation Context ... information than it might seem to at first sight. For instance, all of the most successful information-retrieval systems ignore the order of words and just use the frequencies of words in documents (Salton 1989) in the so-called bag-of-words approach. Since similar situations are described in similar ways, simple statistical similarity measures between the word distributions in documents and queries are effe... |
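The bag-of-words comparison described here reduces to a similarity measure between word-frequency vectors; a minimal cosine-similarity sketch over raw counts (real retrieval systems add weighting such as tf-idf):

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between raw term-frequency vectors; word order is ignored."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Two toy "documents" sharing financial vocabulary -- invented examples.
sim = cosine_similarity("stocks bonds bank", "bank stocks market")
```

Documents describing similar situations share vocabulary and score high, regardless of the order in which the words appear.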

23 |
Learnability, hyperlearning, and the poverty of the stimulus. Paper presented at
- Pullum
- 1996
Citation Context ...ver the APS can become empirically grounded without taking into account such calculations, since the stimuli that APS supporters claimed to be missing are actually present with significant frequency (Pullum, 1996). The APS reached an extreme form with Chomsky’s principles-and-parameters theory, according to which learnability requires that the set of possible natural languages be generated by the settings of ... |

15 | Learning dependency transduction models from unannotated examples - Alshawi, Douglas - 2000 |

15 |
Inducing features of random fields
- Pietra, Pietra, et al.
- 1997
Citation Context ...The second approach is to adopt ideas from random fields and factor probabilities instead as products of exponentials of indicator functions for significant local or global features (events) (Della Pietra et al. 1997; Ratnaparkhi 1997; Abney 1997), which can be built incrementally with ‘greedy’ algorithms that select the most informative feature at each step. 8. From deciding to understanding Models based on info... |

12 | Learning words from natural audio-visual input
- Roy, Pentland
- 1998
Citation Context ...ould be to use additional observables correlated with the hidden variables instead, such as prosodic information or perceptual input associated with the content of the linguistic input (Siskind 1996; Roy & Pentland 1999). More generally, we may be able to replace direct supervision with indirect correlations, as I now discuss. 6. The power of correlations How poor is the stimulus that the language learner exploits t... |

11 |
A type-logical perspective on minimalist derivations
- Cornell
- 1997
Citation Context ... subject to the likelihood constraint subject to transformations. Recent work linking transformational and categorial approaches to syntax makes this possibility especially intriguing (Stabler, 1997; Cornell, 1997). Linearization “Since the relation that makes sentences out of words is a partial order, while speech is linear, a linear projection is involved from the start.” Harris’s theory left this step rathe... |