Zipf and Type-Token rules for the English and Irish languages
2004
Cited by 4 (1 self)
The Zipf curve of log of frequency against log of rank for a large English corpus of 500 million word tokens and 689,000 word types is shown to have the usual slope close to –1 for ranks below 5,000, but at higher ranks it turns to give a slope close to –2. This is apparently mainly due to foreign words and place names. The Zipf curve for a highly inflected language (the Indo-European Celtic language, Irish) is also given. Because of the larger number of word types per lemma, it remains flatter than the English curve, maintaining a slope of –1 until a turning point at about rank 30,000. A formula which calculates the number of tokens given the number of types is derived in terms of the rank at the turning point, 5,000 for English and 30,000 for Irish.
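To make the two-slope shape concrete, here is a minimal sketch in Python. It does not use the paper's corpus or its types-to-tokens formula. It builds an idealised curve whose log-log slope is –1 up to a turning rank and –2 beyond it, then recovers both slopes by least squares. The function names, the `scale` constant and the sampling step are assumptions made for illustration only.

```python
import math

def zipf_frequency(rank, turning_rank=5000, scale=1e7):
    """Idealised two-regime Zipf curve (hypothetical, not fitted to real data).

    Frequency falls as 1/rank up to turning_rank and as 1/rank**2 beyond it;
    the two pieces meet continuously at turning_rank.
    """
    if rank <= turning_rank:
        return scale / rank
    return scale * turning_rank / rank ** 2

def loglog_slope(ranks, freqs):
    """Ordinary least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Ranks below and above the English turning point quoted in the abstract.
low_ranks = range(1, 5001)
high_ranks = range(5001, 689001, 100)  # subsampled to keep the sketch fast

slope_low = loglog_slope(low_ranks, [zipf_frequency(r) for r in low_ranks])
slope_high = loglog_slope(high_ranks, [zipf_frequency(r) for r in high_ranks])
print(round(slope_low, 2), round(slope_high, 2))
```

Passing `turning_rank=30000` gives the corresponding idealised curve for the Irish turning point. With real corpus counts the fitted slopes would only be approximately –1 and –2.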
zipfR: Word Frequency Distributions in R
Cited by 2 (0 self)
We introduce the zipfR package, a powerful and user-friendly open-source tool for LNRE modeling of word frequency distributions in the R statistical environment. We give some background on LNRE models, discuss related software and the motivation for the toolkit, describe the implementation, and conclude with a complete sample session showing a typical LNRE analysis.
Words and Echoes: Assessing and Mitigating the Non-Randomness Problem in Word Frequency Distribution Modeling
Cited by 1 (0 self)
Frequency distribution models tuned to words and other linguistic events can predict the number of distinct types and their frequency distribution in samples of arbitrary sizes. We conduct, for the first time, a rigorous evaluation of these models based on cross-validation and separation of training and test data. Our experiments reveal that the prediction accuracy of the models is marred by serious overfitting problems, due to violations of the random sampling assumption in corpus data. We then propose a simple pre-processing method to alleviate such non-randomness problems. Further evaluation confirms the effectiveness of the method, which compares favourably to more complex correction techniques.
Quantifying Constructional Productivity with Unseen Slot Members
This paper is concerned with the possibility of quantifying and comparing the productivity of similar yet distinct syntactic constructions, predicting the likelihood of encountering unseen lexemes in their unfilled slots. Two examples are explored: variants of comparative correlative constructions (CCs, e.g. the faster the better), which are potentially very productive but in practice lexically restricted; and ambiguously attached prepositional phrases with the preposition with, which can host both large and restricted inventories of arguments under different conditions. It will be shown that different slots in different constructions are not equally likely to be occupied productively by unseen lexemes, and suggested that in some cases this can help disambiguate the underlying syntactic and semantic structure.
unknown title
The field of linguistics has recently undergone a methodological revolution. Whereas most linguists had earlier relied solely on introspection, recent years have seen the rise to prominence of corpora, i.e. large samples of texts, as the main source of linguistic data [5]. Because of this shift, statistical analysis plays an increasingly central role in the field. However, as has been known since the seminal work of George Kingsley Zipf (e.g. [6]), standard statistical models (in particular all those based on normality assumptions) are not suitable for analyzing the frequency distributions of words and other linguistic units. Even in the largest corpora currently available (containing above one billion running words of text), word frequency distributions are characterized by a high proportion of word types that occur only once or twice. When the sample size is increased further, a non-negligible number of new types will be encountered about which the original sample did not contain any information at all. Because of these properties, often referred to as the “Zipfianness” of language data, estimation of occurrence probabilities is unreliable (even when confidence interval estimates are used, cf. [3, Ch. 4]), the central limit theorem no longer guarantees the normality of sample averages for large samples, and the number of
m N
2000
Word frequency distributions & LNRE models
• type-token statistics for any type-rich population with Zipf-like probability distribution (LNRE = Large Number of Rare Events, Baayen 2001)
• extrapolation of vocabulary growth & frequency spectrum to larger samples (➟ morphological productivity, vocabulary richness, stylometry, data sparseness, etc.)
• estimation of vocabulary size from small samples (e.g. sentence patterns or word senses)
• prior distribution in Bayesian inference & population model for Good-Turing smoothing
V_m, E[V_m]
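The quantities named in these notes can be illustrated with a short Python sketch. It computes the sample size N, the vocabulary size V and the frequency spectrum V_m (the number of types occurring exactly m times) for a toy token list, then applies the standard Good-Turing estimate V_1 / N for the probability mass of unseen types. This is not an LNRE model and does not compute the expectation E[V_m]. The function names and the toy sentence are assumptions made for illustration.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Return (N, V, spectrum) for a list of tokens.

    N is the number of tokens, V the number of distinct types, and
    spectrum[m] is V_m, the number of types that occur exactly m times.
    """
    type_counts = Counter(tokens)
    spectrum = Counter(type_counts.values())
    return len(tokens), len(type_counts), spectrum

def good_turing_unseen_mass(tokens):
    """Good-Turing estimate of the total probability of unseen types: V_1 / N."""
    n_tokens, _, spectrum = frequency_spectrum(tokens)
    return spectrum[1] / n_tokens

# Toy sample (hypothetical data, far too small for real estimation).
sample = "the cat sat on the mat and the dog sat on the log".split()

n_tokens, n_types, spectrum = frequency_spectrum(sample)
unseen_mass = good_turing_unseen_mass(sample)

print("N =", n_tokens, "V =", n_types)
print("V_1 =", spectrum[1], "V_2 =", spectrum[2])
print("estimated unseen mass =", round(unseen_mass, 3))
```

On real corpora the same spectrum is the usual input for fitting an LNRE model, which then supplies expected values such as E[V_m] at other sample sizes.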

