Extension of Zipf's Law to Words and Phrases
 Proceedings of the 19th International Conference on Computational Linguistics (COLING)
, 2002
Cited by 20 (1 self)
Zipf's law states that the frequency of word tokens in a large corpus of natural language is inversely proportional to their rank. The law is investigated for two languages, English and Mandarin, and for n-gram word phrases as well as for single words. The law for single words is shown to be valid only for high-frequency words.
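The inverse relation between frequency and rank described above can be checked by fitting a straight line to log(frequency) against log(rank); a slope near -1 indicates Zipf-like behaviour. The following is a minimal sketch, not the method used in the paper. The function names and the synthetic corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs; rank 1 is the most frequent token."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic corpus (an assumption, not real data): the token "w<r>"
# appears about 1200 / r times, which is an idealized Zipf distribution.
tokens = []
for r in range(1, 51):
    tokens.extend([f"w{r}"] * round(1200 / r))

pairs = rank_frequency(tokens)
slope = loglog_slope(pairs)
print(f"fitted log-log slope: {slope:.3f}")
```

On a real corpus, the fit would typically be restricted to a rank range, since the abstract reports that the law holds only for high-frequency words.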
Power-law revisited: large scale measurement study of P2P content popularity
 In IPTPS’10: Proceedings of the 9th international conference on Peer-to-peer systems
, 2010
Cited by 17 (2 self)
The popularity of content on the Internet is often said to follow a Zipf-like distribution. Different measurement studies have shown, however, significantly different distributions depending on the measurement methodology they followed. We performed a large-scale measurement of the most popular peer-to-peer (P2P) content distribution system, BitTorrent, over eleven months. We collected data on a daily to weekly basis from 500 to 800 trackers, with information about 40 to 60 million peers that participated in the distribution of over 10 million torrents. Based on these measurements, we show how fundamental characteristics of the observed distribution of content popularity change depending on the measurement methodology and the length of the observation interval. We show that while short-term or small-scale measurements can conclude that the popularity of content exhibits a power-law tail, the tail is likely exponentially decreasing, especially over long time intervals.
Extension of Zipf’s Law to Word and Character N-Grams for English and Chinese
 Journal of Computational Linguistics and Chinese Language Processing
, 2003
Cited by 14 (4 self)
It is shown that for a large corpus, Zipf's law for both words in English and characters in Chinese does not hold for all ranks. The frequency falls below the frequency predicted by Zipf's law for English words for rank greater than about 5,000 and for Chinese characters for rank greater than about 1,000. However, when single words or characters are combined with n-gram words or characters in one list and put in order of frequency, the frequency of tokens in the combined list follows Zipf's law approximately, with a slope close to -1 on a log-log plot for all n-grams, down to the lowest frequencies in both languages. This behaviour is also found for English 2-byte and 3-byte word fragments. It only happens when all n-grams are used, including semantically incomplete n-grams. Previous theories do not predict this behaviour, possibly because conditional probabilities of tokens have not been properly represented.
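The combined list described above, in which single tokens and all n-grams are ranked together by frequency, can be sketched as follows. This is an illustrative sketch only: the helper names, the toy sentence, and the choice of a maximum n-gram length of 3 are assumptions, not details taken from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_rank_list(tokens, max_n):
    """Count every n-gram for n = 1..max_n in one shared table,
    then return (n-gram, count) pairs ordered by decreasing count."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts.most_common()


# Toy input (hypothetical); a real study would use a large corpus.
tokens = "the cat sat on the mat and the cat ran".split()
ranked = combined_rank_list(tokens, max_n=3)

for rank, (gram, count) in enumerate(ranked[:5], start=1):
    print(rank, " ".join(gram), count)
```

All n-grams are kept, including semantically incomplete ones such as "on the", because the abstract states that the Zipf-like behaviour appears only when none are filtered out.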
Do We Think and Communicate in Quantum Ways? On the Presence of Quantum Structures in Language
, 2004
Cited by 3 (0 self)
... this article is to show the presence of genuine quantum structures in human language. More specifically, we will point out the violation of Bell's inequalities in specific situations encountered in language. The first sections of this article explain why the violation of Bell's inequalities is proof of the presence of genuine quantum structures, and how over the past decades this insight has increasingly made itself felt in the foundations of quantum mechanics research. This article also contains an overview of our earlier work discussing the detection of quantum structures in domains other than the microworld.
Characterizing Web Syndication Behavior and Content
Cited by 3 (3 self)
We are witnessing the widespread adoption of web syndication technologies such as RSS or Atom for the timely delivery of frequently updated Web content. Almost every personal weblog, news portal, or discussion forum nowadays employs RSS/Atom feeds to enhance pull-oriented searching and browsing of web pages with push-oriented protocols of web content. Social media applications such as Twitter or Facebook also employ RSS for notifying users about newly available posts from their preferred friends. Unfortunately, previous work on the statistical characteristics of RSS/Atom does not provide a precise and up-to-date characterization of feeds' behavior and content, a characterization that can be used to benchmark the effectiveness and efficiency of various RSS processing/analysis techniques. In this paper, we present the first thorough analysis of three complementary features of real-scale RSS feeds, namely publication activity, item structure and length, and the vocabulary of their content, which we believe are crucial for Web 2.0 applications. Keywords: RSS/Atom feeds, publication activity, item structure and length, textual vocabulary composition and evolution
Zipf’s law revisited
, 2007
Cited by 2 (0 self)
Zipf’s law states that the frequency of occurrence of some event as a function of its rank is a power-law function. Using empirical examples from different domains, we demonstrate that at least in some cases, increasingly significant divergences from Zipf’s law are registered as the number of events observed increases. Importantly, one of these cases is word frequency in a corpus of natural language, which is undoubtedly the most prominent example of Zipf’s law. We analyze our findings mathematically and attempt a natural explanation of the regularities underlying them.
Discover Hidden Web Properties by Random Walk on Bipartite Graph
Cited by 1 (1 self)
This paper proposes to use random walk to discover the properties of deep web data sources that are hidden behind searchable interfaces. The properties, such as the average degree and population size of both documents and terms, are of interest to the general public and find applications in business intelligence, data integration, and deep web crawling. We show that simple random walk (RW) sampling can outperform uniform random (UR) sampling, even when the high cost of uniform random sampling is disregarded. We prove that in the idealized case when the degrees follow Zipf’s law, the sample size of UR sampling needs to grow in the order of $O(N/\ln^2 N)$ with the corpus size N, while the sample size of RW sampling grows logarithmically. The Reuters corpus is used to demonstrate that the term degrees resemble a power-law distribution, and thus RW is better than UR sampling. On the other hand, document degrees have a lognormal distribution and exhibit a smaller variance, and therefore UR sampling is slightly better.
Variance Reduction In Large Graph Sampling
Cited by 1 (0 self)
The norm of practice in estimating graph properties is to use uniform random node (RN) samples whenever possible. Many graphs are large and scale-free, inducing large degree variance and estimator variance. This paper shows that random edge (RE) sampling and the corresponding harmonic mean estimator for average degree can reduce the estimation variance significantly. First, we demonstrate that the degree variance, and consequently the variance of the RN estimator, can grow almost linearly with data size for typical scale-free graphs. Then we prove that the RE estimator has a variance bounded from above. Therefore, the variance ratio between RN and RE sampling can be very large for big data. The analytical result is supported by both simulation studies and 18 real networks. We observe that the variance reduction ratio can be more than a hundred for some real networks such as Twitter. Furthermore, we show that random walk (RW) sampling is always worse than RE sampling, and it can reduce the variance of the RN method only when its performance is close to that of RE sampling.
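The contrast between the two estimators can be sketched in a few lines. Picking a random endpoint of a uniformly random edge selects each node with probability proportional to its degree, so the harmonic mean of the sampled degrees estimates the average degree. The code below is a hedged illustration of that standard idea, not the paper's implementation; the function names and the synthetic degree list are assumptions, and all degrees are assumed to be positive.

```python
import random


def rn_estimate(degrees, k, rng):
    """Uniform random node (RN) sampling: arithmetic mean of sampled degrees."""
    sample = [rng.choice(degrees) for _ in range(k)]
    return sum(sample) / k


def re_estimate(degrees, k, rng):
    """Random edge (RE) sampling, simulated by drawing nodes with
    probability proportional to degree; the harmonic mean of the
    sampled degrees estimates the average degree."""
    sample = rng.choices(degrees, weights=degrees, k=k)
    return k / sum(1.0 / d for d in sample)


# Synthetic heavy-tailed degree list (hypothetical): true average is 11.8.
degrees = [1] * 900 + [10] * 90 + [1000] * 10
true_avg = sum(degrees) / len(degrees)

rng = random.Random(0)
print("true average degree:", true_avg)
print("RN estimate:", rn_estimate(degrees, 5000, rng))
print("RE estimate:", re_estimate(degrees, 5000, rng))
```

Repeating both estimators over many seeds and comparing the spread of the results is one simple way to observe the variance difference the abstract describes.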
Accelerating frequent item counting with FPGA
 In Proceedings of the 2014 ACM/SIGDA International Symposium on Field-programmable Gate Arrays, FPGA ’14
, 2014
Cited by 1 (0 self)
Frequent item counting is one of the most important operations in time series data mining algorithms, and the space saving algorithm is a widely used approach to solving this problem. With the rapid rise of data input speeds, the most challenging problem in frequent item counting is to meet the requirement of wire-speed processing. In this paper, we propose a streaming-oriented PE-ring framework on FPGA for counting frequent items. Compared with the best existing FPGA implementation, our basic PE-ring framework saves 50% of the lookup table resource cost and achieves the same throughput in a more scalable way. Furthermore, we adopt a SIMD-like cascaded filter for further performance improvements, which outperforms the previous work by up to 3.24 times for some data distributions.
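For orientation, the space saving algorithm mentioned above can be sketched in software as follows. This is a plain sequential version under simple assumptions (a dictionary of at most k counters and a linear scan for the minimum); it does not reflect the PE-ring FPGA design, and the function name and example stream are hypothetical.

```python
def space_saving(stream, k):
    """Approximate frequent-item counts using at most k counters.

    A monitored item has its counter incremented. An unmonitored item
    takes a free counter if one exists; otherwise it replaces the item
    with the smallest counter and inherits that count plus one.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters


# Example stream (hypothetical): "a" is the only truly frequent item.
stream = ["a", "a", "b", "a", "c", "a", "d", "a"]
result = space_saving(stream, k=2)
print(result)
```

Counts for evicted-and-replaced slots are overestimates, which is why the item "d" ends with a count larger than its true frequency, while the heavy hitter "a" is counted exactly in this example. The counters always sum to the stream length.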
Rank–frequency analysis for functional style corpora of Ukrainian
 Journal of Quantitative Linguistics
, 2004
Cited by 1 (1 self)
We use rank–frequency analysis to estimate the Kernel Vocabulary size within specific corpora of Ukrainian. The extrapolation of high-rank behaviour is used to estimate the total vocabulary size. Key words: corpus, Ukrainian, rank–frequency dependence, vocabulary size, entropy