A Brief History of Generative Models for Power Law and Lognormal Distributions
 INTERNET MATHEMATICS
Cited by 252 (7 self)
Recently, I became interested in a current debate over whether file size distributions are best modelled by a power law distribution or a lognormal distribution. In trying
An efficient, probabilistically sound algorithm for segmentation and word discovery
 MACHINE LEARNING
, 1999
Cited by 142 (2 self)
This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on words; instead, it attempts to calculate the prior probabilities of various word sequences that could underlie the observed text. Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that our algorithm is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances.
Random texts exhibit Zipf's-law-like word frequency distribution
 IEEE Transactions on Information Theory
, 1992
Cited by 80 (2 self)
It is shown that the distribution of word frequencies for randomly generated texts is very similar to Zipf’s law observed in natural languages such as English. The facts that the frequency of occurrence of a word is almost an inverse power-law function of its rank, and that the exponent of this inverse power law is very close to 1, are largely due to the transformation from the word’s length to its rank, which stretches an exponential function to a power-law function. Key words: statistical linguistics, Zipf’s law, power-law distribution, random texts. Zipf observed long ago [1] that the distribution of word frequencies in English, if the words are ordered according to their ranks, is an inverse power law with the exponent very close to 1. In other words, if the most frequently occurring word appears in the text with frequency P(1), the next most frequently occurring word has frequency P(2), and the rank-r word has frequency P(r), the frequency distribution is $P(r) = C / r^{\alpha}$, (1)
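The random-text claim in this abstract can be explored with a small simulation. The following is a minimal sketch, not the paper's own procedure: it assumes a uniform alphabet of a few letters plus a space, and the function names, the alphabet size, and the least-squares fit on the log-log rank-frequency curve are all illustrative choices.

```python
import math
import random
from collections import Counter


def random_text_word_counts(n_chars, n_letters=4, seed=0):
    """Type n_chars symbols uniformly at random from n_letters letters
    plus a space, split on spaces, and return the word counts sorted
    from most to least frequent (rank 1 first)."""
    rng = random.Random(seed)
    symbols = [chr(ord("a") + i) for i in range(n_letters)] + [" "]
    text = "".join(rng.choice(symbols) for _ in range(n_chars))
    counts = Counter(text.split())  # split() discards empty "words"
    return sorted(counts.values(), reverse=True)


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank).
    For P(r) = C / r^alpha the slope estimates -alpha."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    counts = random_text_word_counts(200_000, n_letters=4, seed=1)
    print("distinct words:", len(counts))
    print("fitted slope (estimate of -alpha):", loglog_slope(counts))
```

The fitted slope depends on the alphabet size, the text length, and how much of the low-frequency tail enters the fit, so it should be read as a rough indication rather than a measurement of the exponent.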
Lightweight natural language text compression
 Information Retrieval
, 2007
Cited by 27 (21 self)
Variants of Huffman codes where words are taken as the source symbols are currently the most attractive choices to compress natural language text databases. In particular, Tagged Huffman Code by Moura et al. offers fast direct searching on the compressed text and random access capabilities, in exchange for producing around 11% larger compressed files. This work describes End-Tagged Dense Code and (s, c)-Dense Code, two new semi-static statistical methods for compressing natural language texts. These techniques permit simpler and faster encoding and obtain better compression ratios than Tagged Huffman Code, while maintaining its fast direct search and random access capabilities. We show that Dense Codes improve Tagged Huffman Code compression ratio by about 10%, reaching only 0.6% overhead over the optimal Huffman compression ratio. Being simpler, Dense Codes are generated 45% to 60% faster than Huffman codes. This makes Dense Codes a very attractive alternative to Huffman code variants for various reasons: they are simpler to program, faster to build, of almost optimal size, and as fast and easy to search as the best Huffman variants, which are not so close to the optimal size.
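The abstract does not spell out how the codes are built. The sketch below shows End-Tagged Dense Code as it is commonly described: words are ranked by frequency, and each rank maps to a byte sequence whose final byte has its high bit set (values 128 to 255) while all earlier bytes stay below 128. That byte layout, and the function names `etdc_encode`, `etdc_decode`, and `build_codebook`, are assumptions made for illustration and are not taken from the text above.

```python
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """Map a 0-based frequency rank to its codeword.
    Assumed layout: last byte in 128..255, earlier bytes in 0..127."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    out = [128 + rank % 128]  # tagged final byte
    rank //= 128
    while rank > 0:
        rank -= 1  # makes the code dense: no byte sequence is wasted
        out.append(rank % 128)
        rank //= 128
    return bytes(reversed(out))


def etdc_decode(code: bytes) -> int:
    """Inverse of etdc_encode for a single complete codeword."""
    rank = 0
    for b in code[:-1]:  # untagged bytes
        rank = rank * 128 + b + 1
    return rank * 128 + (code[-1] - 128)


def build_codebook(words):
    """Semi-static pass: count the words, then give the shortest
    codewords to the most frequent ones."""
    ranked = [w for w, _ in Counter(words).most_common()]
    return {w: etdc_encode(i) for i, w in enumerate(ranked)}


if __name__ == "__main__":
    words = "to be or not to be that is the question".split()
    book = build_codebook(words)
    compressed = b"".join(book[w] for w in words)
    print(len(compressed), "bytes for", len(words), "words")
```

Because the tag sits on the last byte, codeword boundaries can be found from any position in the compressed stream, which is what allows direct searching and random access without decoding from the start.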
Affiliation Networks
Cited by 21 (3 self)
In the last decade, structural properties of several naturally arising networks (the Internet, social networks, the web graph, etc.) have been studied intensively with a view to understanding their evolution. In recent empirical work, Leskovec, Kleinberg, and Faloutsos identify two new and surprising properties of the evolution of many real-world networks: densification (the ratio of edges to vertices grows over time), and shrinking diameter (the diameter reduces over time to a constant). These properties run counter to conventional wisdom, and are certainly inconsistent with graph models prior to their work. In this paper, we present the first simple, realistic, and mathematically tractable generative model that intrinsically explains all the well-known properties of social networks, as well as densification and shrinking diameter. Our model is based on ideas studied empirically in the social sciences, primarily on the groundbreaking work of Breiger (1973) on bipartite models of social networks that capture the affiliation of agents to societies. We also present algorithms that harness the structural consequences of our model. Specifically, we show how to overcome the bottleneck of densification in computing shortest paths between vertices by producing sparse subgraphs that preserve or approximate shortest distances to all or a distinguished subset of vertices. This is a rare example of an algorithmic benefit derived from a realistic graph model. Finally, our work also presents a modular approach to connecting random graph paradigms (preferential attachment, edge-copying, etc.) to structural consequences (heavy-tailed degree distributions, shrinking diameter, etc.).
RNA Structures with Pseudoknots
, 1997
Cited by 18 (1 self)
Secondary structures of nucleic acids are a particularly interesting class of contact structures. Many important RNA molecules, however, contain pseudoknots, which are excluded explicitly by the definition of secondary structures. We propose here a generalization of secondary structures that incorporates "non-nested" pseudoknots. We also introduce a measure for the complexity of more general contact structures in terms of the chromatic number of their intersection graph. We show that RNA structures without nested pseudoknots form a special class of planar graphs, the so-called "bisecondary structures". Upper bounds on their number are derived, showing that there are fewer different structures than sequences. An energy function capable of dealing with bisecondary structures was implemented in a generalized kinetic folding algorithm. Steric hindrances involved in pseudoknot formation are taken into account with the help of two simplifications: stacked regions are viewed ...
Extension of Zipf's Law to Words and Phrases
 Proceedings of the 19th International Conference on Computational Linguistics (COLING)
, 2002
Cited by 16 (1 self)
Zipf's law states that the frequency of word tokens in a large corpus of natural language is inversely proportional to the rank. The law is investigated for two languages, English and Mandarin, and for n-gram word phrases as well as for single words. The law for single words is shown to be valid only for high-frequency words.
A Stochastic Process for Word Frequency Distributions
, 1991
Cited by 14 (0 self)
A stochastic model based on insights of Mandelbrot (1953) and Simon (1955) is discussed against the background of new criteria of adequacy that have become available recently as a result of studies of the similarity relations between words as found in large computerized text corpora.
Citations and the Zipf-Mandelbrot’s law
 Complex Systems
, 1997
Cited by 12 (0 self)
A curious observation was made that the rank statistics of scientific citation numbers follow Zipf-Mandelbrot’s law. The same power-law-like behavior is exhibited by some simple random citation models. The observed regularity indicates not so much a peculiar character of the underlying (complex) process as, more likely than is usually assumed, its stochastic nature.