Results 1-10 of 27
A Brief History of Generative Models for Power Law and Lognormal Distributions
 INTERNET MATHEMATICS
Cited by 253 (7 self)
Recently, I became interested in a current debate over whether file size distributions are best modelled by a power law distribution or a lognormal distribution. In trying
Power laws, Pareto distributions and Zipf’s law
 Contemporary Physics
, 2005
Cited by 176 (0 self)
When the probability of measuring a particular value of some quantity varies inversely as a power of that value, the quantity is said to follow a power law, also known variously as Zipf’s law or the Pareto distribution. Power laws appear widely in physics, biology, earth and planetary sciences, economics and finance, computer science, demography and the social sciences. For instance, the distributions of the sizes of cities, earthquakes, solar flares, moon craters, wars and people’s personal fortunes all appear to follow power laws. The origin of power-law behaviour has been a topic of debate in the scientific community for more than a century. Here we review some of the empirical evidence for the existence of power-law forms and the theories proposed to explain them.
An efficient, probabilistically sound algorithm for segmentation and word discovery
 MACHINE LEARNING
, 1999
Cited by 142 (2 self)
This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word-order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on words; instead, it attempts to calculate the prior probabilities of various word sequences that could underlie the observed text. Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that our algorithm is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances.
A Stochastic Process for Word Frequency Distributions
, 1991
Cited by 14 (0 self)
A stochastic model based on insights of Mandelbrot (1953) and Simon (1955) is discussed against the background of new criteria of adequacy that have become available recently as a result of studies of the similarity relations between words as found in large computerized text corpora.
A simple LNRE model for random character sequences
 In Proceedings of the 7èmes Journées Internationales d'Analyse Statistique des Données Textuelles (Louvain-la-Neuve)
, 2004
Cited by 10 (4 self)
This paper describes a population model for word frequency distributions based on the Zipf-Mandelbrot law, corresponding to the word frequency distribution induced by a random character sequence. The model, which has convenient analytical and numerical properties, is shown to be adequate for the description of language data extracted by automatic means from large text corpora. It can thus be used to study the problems faced by the statistical analysis of such data in the field of natural-language processing.
Keywords: lexical statistics, LNRE models, Zipf-Mandelbrot law, random text, co-occurrence statistics
1 Introduction to lexical statistics and LNRE models
Most work in the area of lexical statistics is based on random sampling with replacement. This model assumes a population of types w_1, ..., w_S with occurrence probabilities π_1, ..., π_S. S is called the population size and may be infinite (S = ∞) in the case of a countably infinite population. The probabilities π_i are the parameters of this model and must satisfy
Extension of Zipf’s Law to Word and Character N-Grams for English and Chinese
 Journal of Computational Linguistics and Chinese Language Processing
, 2003
Cited by 10 (4 self)
It is shown that for a large corpus, Zipf's law for both words in English and characters in Chinese does not hold for all ranks. The frequency falls below the frequency predicted by Zipf's law for English words for rank greater than about 5,000 and for Chinese characters for rank greater than about 1,000. However, when single words or characters are combined together with n-gram words or characters in one list and put in order of frequency, the frequency of tokens in the combined list follows Zipf’s law approximately with the slope close to -1 on a log-log plot for all n-grams, down to the lowest frequencies in both languages. This behaviour is also found for English 2-byte and 3-byte word fragments. It only happens when all n-grams are used, including semantically incomplete n-grams. Previous theories do not predict this behaviour, possibly because conditional probabilities of tokens have not been properly represented.
RTG: A Recursive Realistic Graph Generator using Random Typing
 DATA MINING AND KNOWLEDGE DISCOVERY
, 2009
Cited by 6 (2 self)
We propose a new, recursive model to generate realistic graphs, evolving over time. Our model has the following properties: it is (a) flexible, capable of generating the cross product of weighted/unweighted, directed/undirected, uni/bipartite graphs; (b) realistic, giving graphs that obey eleven static and dynamic laws that real graphs follow (we formally prove that for several of the (power) laws and we estimate their exponents as a function of the model parameters); (c) parsimonious, requiring only four parameters; (d) fast, being linear on the number of edges; (e) simple, intuitively leading to the generation of macroscopic patterns. We empirically show that our model mimics two real-world graphs very well: Blognet (unipartite, undirected, unweighted) with 27K nodes and 125K edges; and Committee-to-Candidate campaign donations (bipartite, directed, weighted) with 23K nodes and 880K edges. We also show how to handle time so that edge/weight additions are bursty and self-similar.
Zipf and Type-Token rules for the English and Irish languages
, 2004
Cited by 4 (1 self)
The Zipf curve of log of frequency against log of rank for a large English corpus of 500 million word tokens and 689,000 word types is shown to have the usual slope close to –1 for rank less than 5,000, but then for a higher rank it turns to give a slope close to –2. This is apparently mainly due to foreign words and place names. The Zipf curve for a highly inflected language (the Indo-European Celtic language, Irish) is also given. Because of the larger number of word types per lemma, it remains flatter than the English curve, maintaining a slope of –1 until a turning point of about rank 30,000. A formula which calculates the number of tokens given the number of types is derived in terms of the rank at the turning point, 5,000 for English and 30,000 for Irish.
On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts
, 2008
Cited by 4 (3 self)
The article presents a new interpretation for Zipf’s law in natural language which relies on two areas of information theory. We reformulate the problem of grammar-based compression and investigate properties of strongly nonergodic stationary processes. The motivation for the joint discussion is to prove a proposition with a simple informal statement: If an n-letter long text describes n^β independent facts in a random but consistent way then the text contains at least n^β / log n different words. In the formal statement, two specific postulates are adopted. Firstly, the words are understood as the nonterminal symbols of the shortest grammar-based encoding of the text. Secondly, the texts are assumed to be emitted by a nonergodic source, with the described facts being binary IID variables that are asymptotically predictable in a shift-invariant way. The proof of the formal proposition applies several new tools. These
Power Laws for Monkeys Typing Randomly: The Case of Unequal Probabilities
, 2004
Cited by 3 (0 self)
An early result in the history of power laws, due to Miller, concerned the following experiment. A monkey types randomly on a keyboard with N letters (N > 1) and a space bar, where a space separates words. A space is hit with probability p; all other letters are hit with equal probability (1 − p)/N. Miller proved that in this experiment, the rank-frequency distribution of words follows a power law. The case where letters are hit with unequal probability has been the subject of recent confusion, with some suggesting that in this case the rank-frequency distribution follows a lognormal distribution. We prove that the rank-frequency distribution follows a power law for assignments of probabilities that have rational log-ratios for any pair of keys, and we present an argument of Montgomery that settles the remaining cases, also yielding a power law. The key to both arguments is the use of complex analysis. The method of proof produces simple explicit formulas for the coefficient in the power law in cases with rational log-ratios for the assigned probabilities of keys. Our formula in these cases suggests an exact asymptotic formula in the cases with an irrational log-ratio, and this formula is exactly what was proved by Montgomery.
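
The random-typing experiment in this last abstract is easy to try numerically. The following is a minimal sketch in Python, not code from the paper: the names `monkey_words`, `rank_frequency`, `equal_probability_exponent`, `p_space` and `letter_weights` are all assumptions made for illustration. The exponent helper covers only the equal-probability case and rests on the usual back-of-envelope argument that a word of length k has probability proportional to ((1 - p)/N)^k while its rank is of order N^k; it is not the exact asymptotic formula the abstract refers to.

```python
import math
import random
from collections import Counter


def monkey_words(n_keystrokes, letters="ab", p_space=0.2, letter_weights=None, seed=0):
    """Simulate random typing and count the non-empty words produced.

    Each keystroke is a space with probability p_space; otherwise a letter is
    drawn uniformly, or according to letter_weights for the unequal case.
    """
    rng = random.Random(seed)
    counts = Counter()
    current = []
    for _ in range(n_keystrokes):
        if rng.random() < p_space:
            # A space ends the current word; consecutive spaces are ignored.
            if current:
                counts["".join(current)] += 1
                current = []
        else:
            current.append(rng.choices(letters, weights=letter_weights)[0])
    # A word left unfinished at the end of the run is discarded.
    return counts


def rank_frequency(counts):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    return sorted(counts.values(), reverse=True)


def equal_probability_exponent(n_letters, p_space):
    """Heuristic power-law exponent when all letters are equally likely.

    Assumes frequency ~ rank ** (-alpha) with
    alpha = 1 - log(1 - p_space) / log(n_letters).
    """
    return 1.0 - math.log(1.0 - p_space) / math.log(n_letters)


if __name__ == "__main__":
    counts = monkey_words(200_000, letters="ab", p_space=0.2)
    print(rank_frequency(counts)[:10])
    print(equal_probability_exponent(2, 0.2))
```

Passing `letter_weights=[9, 1]`, for example, gives a simple unequal-probability variant whose rank-frequency list can be compared with the equal-probability run on a log-log plot.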