Results 1 – 8 of 8
Zipf and Type-Token rules for the English and Irish languages
, 2004
Abstract

Cited by 4 (1 self)
The Zipf curve of log of frequency against log of rank for a large English corpus of 500 million word tokens and 689,000 word types is shown to have the usual slope close to –1 for ranks below 5,000, but at higher ranks it turns to a slope close to –2. This is apparently mainly due to foreign words and place names. The Zipf curve for a highly inflected language (the Indo-European Celtic language Irish) is also given. Because of the larger number of word types per lemma, it remains flatter than the English curve, maintaining a slope of –1 until a turning point at about rank 30,000. A formula that calculates the number of tokens given the number of types is derived in terms of the rank at the turning point: 5,000 for English and 30,000 for Irish.
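The two-slope behaviour described in this abstract can be checked on any rank-frequency list. A minimal Python sketch (with synthetic 1/r frequencies standing in for a real corpus) estimates the log–log slope over a chosen rank window by least squares:

```python
import math

def zipf_slope(freqs, lo, hi):
    """Least-squares slope of log(frequency) vs. log(rank) over ranks lo..hi."""
    ranked = sorted(freqs, reverse=True)[lo - 1:hi]
    xs = [math.log(r) for r in range(lo, lo + len(ranked))]
    ys = [math.log(f) for f in ranked]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic rank-frequency data following f(r) = C / r, so the slope is close to -1.
freqs = [round(100000 / r) for r in range(1, 5001)]
print(zipf_slope(freqs, 10, 1000))  # close to -1
```

On real corpus data one would fit the low-rank and high-rank windows separately to see the turn from –1 to –2 that the paper reports.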
zipfR: Word Frequency Distributions in R
Abstract

Cited by 4 (0 self)
We introduce the zipfR package, a powerful and user-friendly open-source tool for LNRE modeling of word frequency distributions in the R statistical environment. We give some background on LNRE models, discuss related software and the motivation for the toolkit, describe the implementation, and conclude with a complete sample session showing a typical LNRE analysis.
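zipfR itself is an R package; as a language-neutral illustration of the type-token statistics it models, the following Python sketch computes the empirical frequency spectrum, i.e. how many types occur exactly m times, which is the basic input to an LNRE analysis:

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Empirical frequency spectrum: maps m to V_m, the number of types
    occurring exactly m times in the sample."""
    type_freqs = Counter(tokens)             # frequency f(w) of each type w
    spectrum = Counter(type_freqs.values())  # m -> V_m
    return dict(sorted(spectrum.items()))

tokens = "a rose is a rose is a rose".split()
print(frequency_spectrum(tokens))  # {2: 1, 3: 2}
```

Here "is" occurs twice and "a"/"rose" three times each, so V_2 = 1 and V_3 = 2; an LNRE model fits a parametric population to such a spectrum and extrapolates it.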
Words and Echoes: Assessing and Mitigating the Non-Randomness Problem in Word Frequency Distribution Modeling
Abstract

Cited by 2 (0 self)
Frequency distribution models tuned to words and other linguistic events can predict the number of distinct types and their frequency distribution in samples of arbitrary size. We conduct, for the first time, a rigorous evaluation of these models based on cross-validation and separation of training and test data. Our experiments reveal that the prediction accuracy of the models is marred by serious overfitting problems, due to violations of the random sampling assumption in corpus data. We then propose a simple preprocessing method to alleviate such non-randomness problems. Further evaluation confirms the effectiveness of the method, which compares favourably to more complex correction techniques.
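The random-sampling violation at issue here shows up directly in vocabulary growth curves. The sketch below uses a hypothetical toy corpus with topical clumping (each "document" reuses its own vocabulary) and compares growth under document order against a fully shuffled token sequence; shuffling is only the crudest way to restore the random-sampling assumption, not necessarily the preprocessing the paper proposes:

```python
import random

def vocab_growth(tokens, step):
    """V(N): number of distinct types among the first N tokens, sampled every `step`."""
    seen, growth = set(), []
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            growth.append(len(seen))
    return growth

# Toy corpus: 20 documents, each repeating its own 50-word vocabulary 4 times.
docs = [[f"d{d}w{w}" for w in range(50) for _ in range(4)] for d in range(20)]
ordered = [t for doc in docs for t in doc]
shuffled = ordered[:]
random.seed(0)
random.shuffle(shuffled)

# Under document order, growth is steplike and slow; after shuffling it is
# much faster early on, so a model fit to one curve misfits the other.
print(vocab_growth(ordered, 1000), vocab_growth(shuffled, 1000))
```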
Quantifying Constructional Productivity with Unseen Slot Members
Abstract
This paper is concerned with the possibility of quantifying and comparing the productivity of similar yet distinct syntactic constructions, predicting the likelihood of encountering unseen lexemes in their unfilled slots. Two examples are explored: variants of comparative correlative constructions (CCs, e.g. the faster the better), which are potentially very productive but in practice lexically restricted; and ambiguously attached prepositional phrases with the preposition with, which can host both large and restricted inventories of arguments under different conditions. It is shown that different slots in different constructions are not equally likely to be occupied productively by unseen lexemes, and suggested that in some cases this can help disambiguate the underlying syntactic and semantic structure.
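One common way to operationalize "likelihood of unseen slot members" is a hapax-based measure such as Baayen's potential productivity P = V1/N, the proportion of once-occurring fillers among all slot tokens. The sketch below applies it to made-up slot fillers for a CC-like slot (fillers and counts are purely illustrative, not from the paper):

```python
from collections import Counter

def potential_productivity(slot_fillers):
    """Baayen-style potential productivity P = V1 / N: the share of
    hapax legomena among the N tokens observed in a construction's slot."""
    freqs = Counter(slot_fillers)
    v1 = sum(1 for f in freqs.values() if f == 1)
    return v1 / len(slot_fillers)

# Hypothetical fillers for "the X-er the better": a few recycled lexemes
# plus three one-off fillers.
fillers = ["fast"] * 6 + ["big"] * 3 + ["cheap", "soon", "close"]
print(potential_productivity(fillers))  # 3 hapaxes / 12 tokens = 0.25
```

A lexically restricted slot yields a low P, an open slot a high one, which is exactly the contrast the paper exploits across constructions.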
unknown title
Abstract
The field of linguistics has recently undergone a methodological revolution. Whereas earlier most linguists relied solely on introspection, recent years have seen the rise to prominence of corpora, i.e. large samples of texts, as the main source of linguistic data [5]. Because of this shift, statistical analysis plays an increasingly central role in the field. However, as has been known since the seminal work of George Kingsley Zipf (e.g. [6]), standard statistical models (in particular all those based on normality assumptions) are not suitable for analyzing the frequency distributions of words and other linguistic units. Even in the largest corpora currently available (containing above one billion running words of text), word frequency distributions are characterized by a high proportion of word types that occur only once or twice. When the sample size is increased further, a non-negligible number of new types will be encountered about which the original sample did not contain any information at all. Because of these properties, often referred to as the “Zipfianness” of language data, estimation of occurrence probabilities is unreliable (even when confidence interval estimates are used, cf. [3, Ch. 4]), the central limit theorem no longer guarantees the normality of sample averages for large samples, and the number of ...
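The persistently high share of once-occurring types is easy to reproduce by simulation. The Python sketch below samples from an idealized 1/r Zipfian population (the population size and sample sizes are arbitrary choices for illustration) and reports the hapax share among observed types:

```python
import random
from collections import Counter

random.seed(42)
# Idealized Zipfian population: P(rank r) proportional to 1/r over 50,000 types.
ranks = list(range(1, 50001))
weights = [1.0 / r for r in ranks]

for n in (10_000, 100_000):
    sample = random.choices(ranks, weights=weights, k=n)
    freqs = Counter(sample)
    v = len(freqs)                                 # observed vocabulary size V
    v1 = sum(1 for f in freqs.values() if f == 1)  # hapax legomena V1
    print(f"N={n:>7}: V={v}, hapax share {v1 / v:.0%}")
```

Even after a tenfold increase in sample size, a large fraction of the observed types are still hapaxes, which is why normality-based statistics break down on such data.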
unknown title
, 2000
Abstract
Word frequency distributions & LNRE models
• type-token statistics for any type-rich population with a Zipf-like probability distribution (LNRE = Large Number of Rare Events, Baayen 2001)
• extrapolation of vocabulary growth & frequency spectrum to larger samples (➟ morphological productivity, vocabulary richness, stylometry, data sparseness, etc.)
• estimation of vocabulary size from small samples (e.g. sentence patterns or word senses)
• prior distribution in Bayesian inference & population model for Good-Turing smoothing
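The Good-Turing connection in the last bullet rests on a simple identity: the total probability mass of unseen types is estimated by V1/N, the hapax count over the sample size. A minimal sketch (toy sentence, purely illustrative):

```python
from collections import Counter

def good_turing_unseen_mass(tokens):
    """Good-Turing estimate of the total probability of unseen types: V1 / N."""
    freqs = Counter(tokens)
    v1 = sum(1 for f in freqs.values() if f == 1)
    return v1 / len(tokens)

tokens = "to be or not to be that is the question".split()
print(good_turing_unseen_mass(tokens))  # 6 hapaxes / 10 tokens = 0.6
```

An LNRE population model refines this by supplying expected spectrum elements E[V_m] in place of the raw counts.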
© The Association for Computational Linguistics and Chinese Language Processing
Reduced N-Grams for Chinese Evaluation
Abstract
Theoretically, a language model improves as the n-gram order increases from 3 to 5 or higher. As the order increases, however, the number of parameters and calculations and the storage requirement grow very rapidly if we attempt to store all possible combinations of n-grams. To avoid these problems, the reduced n-grams approach previously developed by O'Boyle and Smith [1993] can be applied. A reduced n-gram language model, called a reduced model, can efficiently store an entire corpus's phrase-history length within feasible storage limits. Another advantage of reduced n-grams is that they are usually semantically complete. In our experiments, the reduced n-gram creation method, i.e. the O'Boyle–Smith reduced n-gram algorithm, was applied to a large Chinese corpus. The Chinese reduced n-gram Zipf curves are presented here and compared with previously obtained conventional Chinese n-grams. The Chinese reduced model cut perplexity by 8.74% and the language model size by a factor of 11.49. This paper is the first attempt to model Chinese reduced n-grams, and may provide important insights for Chinese linguistic research.
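The parameter growth that motivates reduced n-grams is visible even on a toy text: the number of distinct n-grams climbs toward the number of token positions as the order n grows. A short Python sketch (the text is arbitrary; the O'Boyle–Smith algorithm itself is not reproduced here):

```python
from collections import Counter

def distinct_ngrams(tokens, n):
    """Number of distinct n-grams of order n in a token sequence."""
    return len(Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)))

text = ("the quick brown fox jumps over the lazy dog "
        "the quick brown cat sleeps under the lazy tree").split()
for n in range(1, 4):
    print(n, distinct_ngrams(text, n))  # counts rise with n
```

On a large corpus this growth is what makes storing all n-grams of order 5 and above infeasible, and what a reduced model avoids by keeping only complete phrases.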