A Brief History of Generative Models for Power Law and Lognormal Distributions
- INTERNET MATHEMATICS
Cited by 192 (7 self)
Recently, I became interested in a current debate over whether file size distributions are best modelled by a power law distribution or a lognormal distribution. In trying
An efficient, probabilistically sound algorithm for segmentation and word discovery
- MACHINE LEARNING
, 1999
Cited by 103 (2 self)
This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word-order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on words; instead, it attempts to calculate the prior probabilities of various word sequences that could underlie the observed text. Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that our algorithm is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances.
A Stochastic Process for Word Frequency Distributions
, 1991
Cited by 14 (0 self)
A stochastic model based on insights of Mandelbrot (1953) and Simon (1955) is discussed against the background of new criteria of adequacy that have become available recently as a result of studies of the similarity relations between words as found in large computerized text corpora.
A simple LNRE model for random character sequences
- In Proceedings of the 7èmes Journées Internationales d'Analyse Statistique des Données Textuelles (Louvain-la-Neuve
, 2004
Cited by 9 (4 self)
This paper describes a population model for word frequency distributions based on the Zipf-Mandelbrot law, corresponding to the word frequency distribution induced by a random character sequence. The model, which has convenient analytical and numerical properties, is shown to be adequate for the description of language data extracted by automatic means from large text corpora. It can thus be used to study the problems faced by the statistical analysis of such data in the field of natural-language processing.
Keywords: lexical statistics, LNRE models, Zipf-Mandelbrot law, random text, cooccurrence statistics
1 Introduction to lexical statistics and LNRE models
Most work in the area of lexical statistics is based on random sampling with replacement. This model assumes a population of types w_1, ..., w_S with occurrence probabilities π_1, ..., π_S. S is called the population size and may be infinite (S = ∞) in the case of a countably infinite population. The probabilities π_i are the parameters of this model and must satisfy
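The population model described in this entry can be sketched roughly as follows. The sketch assumes the common form of the Zipf-Mandelbrot law, with the probability of the i-th type proportional to 1/(i + b)^a, restricted to a finite population so it can be normalized numerically. The function names, the values of a and b, and the population size are illustrative assumptions and are not taken from the paper, whose exact parameterization may differ.

```python
import random
from collections import Counter

def zipf_mandelbrot_probs(S, a=1.2, b=2.0):
    """Probabilities pi_1..pi_S with pi_i proportional to 1 / (i + b)^a.

    a and b are illustrative values, not parameters from the paper.
    """
    weights = [1.0 / (i + b) ** a for i in range(1, S + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_types(probs, n_tokens, seed=0):
    """Random sampling with replacement: draw n_tokens type indices."""
    rng = random.Random(seed)
    types = range(1, len(probs) + 1)
    return Counter(rng.choices(types, weights=probs, k=n_tokens))

probs = zipf_mandelbrot_probs(S=1000)
counts = sample_types(probs, n_tokens=5000)
```

Here `counts` maps each sampled type index to its observed frequency, which is the kind of word frequency distribution such a model is meant to describe.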
Extension of Zipf’s Law to Word and Character N-Grams for English and Chinese
- Journal of Computational Linguistics and Chinese Language Processing
, 2003
Cited by 6 (3 self)
It is shown that for a large corpus, Zipf's law for both words in English and characters in Chinese does not hold for all ranks. The frequency falls below the frequency predicted by Zipf's law for English words for rank greater than about 5,000 and for Chinese characters for rank greater than about 1,000. However, when single words or characters are combined together with n-gram words or characters in one list and put in order of frequency, the frequency of tokens in the combined list follows Zipf’s law approximately with the slope close to -1 on a log-log plot for all n-grams, down to the lowest frequencies in both languages. This behaviour is also found for English 2-byte and 3-byte word fragments. It only happens when all n-grams are used, including semantically incomplete n-grams. Previous theories do not predict this behaviour, possibly because conditional probabilities of tokens have not been properly represented.
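The idea of merging single tokens and n-grams into one frequency-ordered list can be illustrated with a small sketch. The function names, the toy sentence, and the choice of max_n are assumptions made for illustration; the paper's corpora and tokenization are not reproduced here.

```python
from collections import Counter

def combined_ngram_counts(tokens, max_n=3):
    """Count every n-gram for n = 1..max_n in a single shared table."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def rank_frequency(counts):
    """Return (rank, frequency) pairs, most frequent item first."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

# Toy input; a real study would use a large corpus.
tokens = "the cat sat on the mat and the cat ran".split()
counts = combined_ngram_counts(tokens, max_n=2)
ranked = rank_frequency(counts)
```

Plotting the logarithm of frequency against the logarithm of rank for `ranked` on a large corpus is the kind of check the abstract describes; the toy input is far too small to show any law.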
Zipf and Type-Token rules for the English and Irish languages
, 2004
Cited by 4 (1 self)
The Zipf curve of log of frequency against log of rank for a large English corpus of 500 million word tokens and 689,000 word types is shown to have the usual slope close to –1 for rank less than 5,000, but then for a higher rank it turns to give a slope close to –2. This is apparently mainly due to foreign words and place names. The Zipf curve for a highly-inflected language (the Indo-European Celtic language, Irish) is also given. Because of the larger number of word types per lemma, it remains flatter than the English curve maintaining a slope of –1 until a turning point of about rank 30,000. A formula which calculates the number of tokens given the number of types is derived in terms of the rank at the turning point, 5,000 for English and 30,000 for Irish.
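The two-slope behaviour described in this entry can be checked with a least-squares fit of log frequency against log rank on either side of a turning point. The sketch below uses synthetic frequencies with a hypothetical turning point at rank 50; the function name, the constant 1000.0, and the data are assumptions for illustration and do not come from the corpora in the paper.

```python
import math

def loglog_slope(freqs, lo_rank, hi_rank):
    """Least-squares slope of log(frequency) against log(rank)
    over the ranks lo_rank..hi_rank (1-based, inclusive)."""
    ranks = range(lo_rank, hi_rank + 1)
    xs = [math.log(r) for r in ranks]
    ys = [math.log(freqs[r - 1]) for r in ranks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic curve: slope -1 up to the turning point, slope -2 after it.
turn = 50
freqs = [1000.0 / r if r <= turn else 1000.0 * turn / r ** 2
         for r in range(1, 501)]

head = loglog_slope(freqs, 1, turn)    # close to -1
tail = loglog_slope(freqs, turn, 500)  # close to -2
```

On real corpus counts the fitted slopes would only be approximate, and the turning point would have to be estimated rather than fixed in advance.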
RTG: A Recursive Realistic Graph Generator using Random Typing
- DATA MINING AND KNOWLEDGE DISCOVERY
, 2009
Cited by 3 (1 self)
We propose a new, recursive model to generate realistic graphs, evolving over time. Our model has the following properties: it is (a) flexible, capable of generating the cross product of weighted/unweighted, directed/undirected, uni/bipartite graphs; (b) realistic, giving graphs that obey eleven static and dynamic laws that real graphs follow (we formally prove that for several of the (power) laws and we estimate their exponents as a function of the model parameters); (c) parsimonious, requiring only four parameters; (d) fast, being linear on the number of edges; (e) simple, intuitively leading to the generation of macroscopic patterns. We empirically show that our model mimics two real-world graphs very well: Blognet (unipartite, undirected, unweighted) with 27K nodes and 125K edges; and Committee-to-Candidate campaign donations (bipartite, directed, weighted) with 23K nodes and 880K edges. We also show how to handle time so that edge/weight additions are bursty and self-similar.
Power Laws for Monkeys Typing Randomly: The Case of Unequal Probabilities
, 2004
Cited by 2 (0 self)
An early result in the history of power laws, due to Miller, concerned the following experiment. A monkey types randomly on a keyboard with N letters (N > 1) and a space bar, where a space separates words. A space is hit with probability p; all other letters are hit with equal probability (1 - p)/N. Miller proved that in this experiment, the rank-frequency distribution of words follows a power law. The case where letters are hit with unequal probability has been the subject of recent confusion, with some suggesting that in this case the rank-frequency distribution follows a lognormal distribution. We prove that the rank-frequency distribution follows a power law for assignments of probabilities that have rational log-ratios for any pair of keys, and we present an argument of Montgomery that settles the remaining cases, also yielding a power law. The key to both arguments is the use of complex analysis. The method of proof produces simple explicit formulas for the coefficient in the power law in cases with rational log-ratios for the assigned probabilities of keys. Our formula in these cases suggests an exact asymptotic formula in the cases with an irrational log-ratio, and this formula is exactly what was proved by Montgomery.
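The equal-probability experiment attributed to Miller in this entry can be simulated in a few lines. This is a minimal sketch, assuming a four-letter keyboard and a space probability of 0.2; the function names and parameter values are illustrative and do not come from the paper. Empty words produced by consecutive spaces are ignored.

```python
import random
from collections import Counter

def monkey_text(n_chars, letters="abcd", p_space=0.2, seed=0):
    """Type n_chars keys: space with probability p_space, otherwise
    one of the letters, each with equal probability."""
    rng = random.Random(seed)
    q = (1.0 - p_space) / len(letters)
    weights = [p_space] + [q] * len(letters)
    return "".join(rng.choices(" " + letters, weights=weights, k=n_chars))

def rank_frequency(text):
    """Word frequencies in decreasing order (index 0 is rank 1)."""
    counts = Counter(text.split())
    return sorted(counts.values(), reverse=True)

def word_probability(word_len, n_letters, p_space):
    """Probability of typing one specific word of the given length
    followed by a space."""
    return ((1.0 - p_space) / n_letters) ** word_len * p_space

text = monkey_text(200_000)
freqs = rank_frequency(text)
```

Because all words of the same length are equally likely, the simulated rank-frequency curve is a staircase on a log-log plot; the power law describes its overall trend. Giving the letters unequal weights in `monkey_text` would correspond to the case the paper analyses.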
On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts
, 2008
Cited by 2 (2 self)
The article presents a new interpretation for Zipf’s law in natural language which relies on two areas of information theory. We reformulate the problem of grammar-based compression and investigate properties of strongly nonergodic stationary processes. The motivation for the joint discussion is to prove a proposition with a simple informal statement: If an n-letter long text describes n^β independent facts in a random but consistent way then the text contains at least n^β / log n different words. In the formal statement, two specific postulates are adopted. Firstly, the words are understood as the nonterminal symbols of the shortest grammar-based encoding of the text. Secondly, the texts are assumed to be emitted by a nonergodic source, with the described facts being binary IID variables that are asymptotically predictable in a shift-invariant way. The proof of the formal proposition applies several new tools. These
Towards a Model of Language Understanding
- 2nd Romanian-Hungarian Joint Symposium on Applied Computational Intelligence SACI 2005
Cited by 1 (1 self)
The paper is an attempt to outline a hierarchical model of language apprehension based on an extension of the Language of Thought Hypothesis (LOTH). Several arguments are presented which show that language, being incomplete, has limitations in representing both reality and mental states. Therefore, postulating a LOTH similar to a conventional language is fallacious. Nonetheless, if language is related to thought, language properties would have to have a causal root in the functioning of the mind. This controversial issue is discussed in relation to the possibility of using Zipf’s law for identifying a deeper causal law at the level of cognition. Zipf’s law may be related to the language redundancy necessary for the language understanding process. This process can be modeled based on information compression performed by a self-organizing neural computation structure at two levels. At the first level, a feature extraction is done in a parsing process of a natural conventional language, and the result is a linguistic map which acts as input for the second level of compression. There, a purely semantic map is formed which is independent of any conventional language, accounting in this way for the universality of the thinking and reasoning process. Keywords: Cognitive modeling, language of thought, Zipf’s law, statistical linguistics.

