Results 1–10 of 53
A Brief History of Generative Models for Power Law and Lognormal Distributions
 INTERNET MATHEMATICS
Abstract
Cited by 417 (8 self)
Recently, I became interested in a current debate over whether file size distributions are best modelled by a power law distribution or a lognormal distribution. In trying ...
Power laws, Pareto distributions and Zipf’s law
Abstract
Cited by 391 (0 self)
Many of the things that scientists measure have a typical size or “scale”—a typical value around which individual measurements are centred. A simple example would be the heights of human beings. Most adult human beings are about 180cm tall. There is some variation around this figure, notably depending on sex, but we never see people who are 10cm tall, or 500cm. To make this observation more quantitative, one can plot a histogram of people’s heights, as I have done in Fig. 1a. The figure shows the heights in centimetres of adult men in the United States measured between 1959 and 1962, and indeed the distribution is relatively narrow and peaked around 180cm. Another telling observation is the ratio of the heights of the tallest and shortest people.
An efficient, probabilistically sound algorithm for segmentation and word discovery
 MACHINE LEARNING
, 1999
Abstract
Cited by 197 (2 self)
This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on words; instead, it attempts to calculate the prior probabilities of various word sequences that could underlie the observed text. Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that our algorithm is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances.
A Stochastic Process for Word Frequency Distributions
, 1991
Abstract
Cited by 16 (1 self)
A stochastic model based on insights of Mandelbrot (1953) and Simon (1955) is discussed against the background of new criteria of adequacy that have become available recently as a result of studies of the similarity relations between words as found in large computerized text corpora.
Extension of Zipf’s Law to Word and Character N-Grams for English and Chinese
 Journal of Computational Linguistics and Chinese Language Processing
, 2003
Abstract
Cited by 14 (4 self)
It is shown that for a large corpus, Zipf's law for both words in English and characters in Chinese does not hold for all ranks. The frequency falls below the frequency predicted by Zipf's law for English words for ranks greater than about 5,000 and for Chinese characters for ranks greater than about 1,000. However, when single words or characters are combined together with n-gram words or characters in one list and put in order of frequency, the frequency of tokens in the combined list follows Zipf's law approximately, with a slope close to −1 on a log-log plot for all n-grams, down to the lowest frequencies in both languages. This behaviour is also found for English 2-byte and 3-byte word fragments. It only happens when all n-grams are used, including semantically incomplete n-grams. Previous theories do not predict this behaviour, possibly because conditional probabilities of tokens have not been properly represented.
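As a toy illustration of the combined ranking the abstract describes (not the paper's corpus or exact procedure — the sample text below is invented), one can pool single words together with word 2-grams and 3-grams into a single frequency list and rank every token by frequency:

```python
from collections import Counter

# Invented toy corpus, used only to demonstrate the combined word/n-gram list.
text = ("the quick brown fox jumps over the lazy dog the fox runs "
        "over the lazy dog and the quick dog sleeps").split()

# Count unigrams, 2-grams, and 3-grams in one shared Counter.
counts = Counter()
for n in (1, 2, 3):
    for i in range(len(text) - n + 1):
        counts[" ".join(text[i:i + n])] += 1

# Rank words and n-grams together, most frequent first; on a real corpus,
# the paper reports this combined list approximately following Zipf's law.
ranked = counts.most_common()
for rank, (token, freq) in enumerate(ranked[:5], start=1):
    print(rank, token, freq)
```

On a real corpus one would plot log frequency against log rank of `ranked` and inspect the slope.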
RTG: A Recursive Realistic Graph Generator using Random Typing
 DATA MINING AND KNOWLEDGE DISCOVERY
, 2009
Abstract
Cited by 14 (4 self)
We propose a new, recursive model to generate realistic graphs, evolving over time. Our model has the following properties: it is (a) flexible, capable of generating the cross product of weighted/unweighted, directed/undirected, uni/bipartite graphs; (b) realistic, giving graphs that obey eleven static and dynamic laws that real graphs follow (we formally prove this for several of the (power) laws, and we estimate their exponents as a function of the model parameters); (c) parsimonious, requiring only four parameters; (d) fast, being linear in the number of edges; and (e) simple, intuitively leading to the generation of macroscopic patterns. We empirically show that our model mimics two real-world graphs very well: Blognet (unipartite, undirected, unweighted) with 27K nodes and 125K edges, and Committee-to-Candidate campaign donations (bipartite, directed, weighted) with 23K nodes and 880K edges. We also show how to handle time so that edge/weight additions are bursty and self-similar.
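The core random-typing idea behind such a generator can be sketched very roughly: each edge is a pair of "words" typed on a small keyboard, where a space ends a word. This is only a simplified sketch of the idea, not RTG itself — the full model adds an imbalance factor and the other parameters described in the abstract, and the alphabet and probabilities below are arbitrary choices:

```python
import random
from collections import Counter

random.seed(7)

def type_word(letters="abc", p_space=0.3):
    """Type letters uniformly at random until a space ends the word."""
    word = []
    while True:
        if word and random.random() < p_space:
            return "".join(word)
        word.append(random.choice(letters))

# Each edge is (source word, destination word); repeated pairs accumulate
# weight, so the result is a weighted directed multigraph.
edges = Counter((type_word(), type_word()) for _ in range(5000))
nodes = {u for u, v in edges} | {v for u, v in edges}
print(len(nodes), len(edges))
```

Because shorter words are typed far more often, a few node labels attract most of the edges, which is the mechanism that produces skewed degree distributions.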
On the Vocabulary of GrammarBased Codes and the Logical Consistency of Texts
, 2008
Abstract
Cited by 13 (10 self)
The article presents a new interpretation for Zipf's law in natural language which relies on two areas of information theory. We reformulate the problem of grammar-based compression and investigate properties of strongly non-ergodic stationary processes. The motivation for the joint discussion is to prove a proposition with a simple informal statement: if an n-letter-long text describes n^β independent facts in a random but consistent way, then the text contains at least n^β / log n different words. In the formal statement, two specific postulates are adopted. Firstly, the words are understood as the non-terminal symbols of the shortest grammar-based encoding of the text. Secondly, the texts are assumed to be emitted by a non-ergodic source, with the described facts being binary IID variables that are asymptotically predictable in a shift-invariant way. The proof of the formal proposition applies several new tools. These ...
Power Laws for Monkeys Typing Randomly: The Case of Unequal Probabilities
, 2004
Abstract
Cited by 12 (0 self)
An early result in the history of power laws, due to Miller, concerned the following experiment. A monkey types randomly on a keyboard with N letters and a space bar, where a space separates words. A space is hit with probability p; all other letters are hit with equal probability (1 − p)/N. Miller proved that in this experiment, the rank-frequency distribution of words follows a power law. The case where letters are hit with unequal probability has been the subject of recent confusion, with some suggesting that in this case the rank-frequency distribution follows a lognormal distribution. We prove that the rank-frequency distribution follows a power law for assignments of probabilities that have rational log-ratios for any pair of keys, and we present an argument of Montgomery that settles the remaining cases, also yielding a power law. The key to both arguments is the use of complex analysis. The method of proof produces simple explicit formulas for the coefficient in the power law in cases with rational log-ratios for the assigned probabilities of keys. Our formula in these cases suggests an exact asymptotic formula in the cases with an irrational log-ratio, and this formula is exactly what was proved by Montgomery.
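Miller's equal-probability experiment is easy to simulate. The sketch below uses an illustrative alphabet size and space probability (not values from the paper) and checks that successive log-log slopes of the rank-frequency list are negative and of similar magnitude, as a power law predicts:

```python
import math
import random
from collections import Counter

random.seed(42)

# Illustrative parameters: N letter keys plus a space bar hit with probability p.
N, p = 4, 0.2
keys = list("abcd") + [" "]
weights = [(1 - p) / N] * N + [p]

# The monkey types 200,000 keystrokes; spaces delimit words.
text = "".join(random.choices(keys, weights=weights, k=200_000))
words = [w for w in text.split(" ") if w]
ranked = [count for _, count in Counter(words).most_common()]

# Under a power law f(r) ~ C * r^(-alpha), slopes on a log-log plot between
# ranks 1-10 and ranks 10-100 should both be roughly -alpha.
slope_1_10 = (math.log(ranked[9]) - math.log(ranked[0])) / math.log(10)
slope_10_100 = (math.log(ranked[99]) - math.log(ranked[9])) / math.log(10)
print(round(slope_1_10, 2), round(slope_10_100, 2))
```

With equal letter probabilities the rank-frequency curve is really a staircase (all words of a given length are equally likely), so the slopes only agree approximately at small sample sizes.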
A simple LNRE model for random character sequences
 In Proceedings of the 7èmes Journées Internationales d'Analyse Statistique des Données Textuelles (Louvain-la-Neuve)
, 2004
Abstract
Cited by 10 (4 self)
This paper describes a population model for word frequency distributions based on the Zipf-Mandelbrot law, corresponding to the word frequency distribution induced by a random character sequence. The model, which has convenient analytical and numerical properties, is shown to be adequate for the description of language data extracted by automatic means from large text corpora. It can thus be used to study the problems faced by the statistical analysis of such data in the field of natural-language processing.
Keywords: lexical statistics, LNRE models, Zipf-Mandelbrot law, random text, co-occurrence statistics
1 Introduction to lexical statistics and LNRE models
Most work in the area of lexical statistics is based on random sampling with replacement. This model assumes a population of types w_1, …, w_S with occurrence probabilities π_1, …, π_S. S is called the population size and may be infinite (S = ∞) in the case of a countably infinite population. The probabilities π_i are the parameters of this model and must satisfy ...
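A minimal sketch of a finite Zipf-Mandelbrot population of the kind described above: type i receives probability proportional to 1/(b + i)^a, normalized over a finite population size S. The values of a, b, and S below are arbitrary examples, not parameters fitted in the paper:

```python
# Arbitrary example parameters for a finite Zipf-Mandelbrot population.
a, b, S = 1.2, 2.5, 1000

# Unnormalized Zipf-Mandelbrot weights for types w_1 .. w_S.
weights = [1.0 / (b + i) ** a for i in range(1, S + 1)]
Z = sum(weights)  # normalizing constant
pi = [w / Z for w in weights]  # occurrence probabilities pi_1 .. pi_S

print(round(sum(pi), 6), round(pi[0] / pi[99], 2))
```

The parameter b flattens the head of the distribution relative to a pure Zipf law, while a controls the decay of the tail; sampling types with these probabilities gives a population model one can compare against observed vocabulary growth.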
The use of Zipf’s law in animal communication analysis
, 2003
Abstract
Cited by 7 (0 self)
Information theory has been discussed as a technique to analyse communicative processes or sequential behaviour of non-human animals, as in MacKay (1972), Slater (1973) and Bradbury & Vehrencamp (1998, chapters 13–15), among others. Recently, McCowan et al. (1999) proposed the use of information theory for their study of bottlenose dolphin, Tursiops truncatus, whistles. They discussed several aspects of their analysis techniques. Although we agree about the effectiveness of information theory for analysing unknown sources, we would like to further the discussion of one analysis method used in McCowan et al. (1999). Specifically, we wish to illustrate that Zipf's law is of little use in the analysis of communication signals. The presence or absence in dolphins and other animals of some features of human language remains an intriguing and open question (Tyack 1999). However, we assert that a Zipf-based technique is methodologically inappropriate to address these questions. McCowan et al. (1999, page 410) noted that ‘Few investigators of animal behaviour have examined the use of first-order entropic analysis known as Zipf's law or statistic’. In fact, Zipf's law has been discarded as a linguistic tool, strongly criticized by Miller (1957), Miller & Chomsky (1963) and more thoroughly by Rapoport (1982). McCowan et al. (1999, page 411) also cite the application of Zipf's law to DNA sequences by Mantegna et al. (1994) ‘with varying interpretations and reliability ...