Results 1–7 of 7
Estimating the Unseen: An n/log(n)-Sample Estimator for Entropy and Support Size, Shown Optimal via New CLTs
, 2011
Abstract

Cited by 10 (0 self)
We introduce a new approach to characterizing the unobserved portion of a distribution, which provides sublinear-sample estimators achieving arbitrarily small additive constant error for a class of properties that includes entropy and distribution support size. Additionally, we show new matching lower bounds. Together, this settles the longstanding question of the sample complexities of these estimation problems, up to constant factors. Our algorithm estimates these properties up to an arbitrarily small additive constant, using O(n/log n) samples, where n is a bound on the support size, or, in the case of estimating the support size, 1/n is a lower bound on the probability of any element of the domain. Previously, no explicit sublinear-sample algorithms for either of these problems were known. Our algorithm is also computationally extremely efficient, running in time linear in the number of samples used. In the second half of the paper, we provide a matching lower bound of Ω(n/log n) samples for estimating entropy or distribution support size to within an additive constant. The previous lower bounds on these sample complexities were n/2^{O(√log n)}. To show our lower bound, we prove two new and natural multivariate central limit theorems (CLTs); the first uses Stein’s method to relate the sum of independent distributions to the multivariate Gaussian of corresponding mean and covariance, under the earthmover distance metric (also known as the Wasserstein metric). We leverage this central limit theorem to prove a stronger but more specific central limit theorem for “generalized multinomial” distributions, a large class of discrete distributions parameterized by matrices …
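The abstract above notes that naive approaches need roughly linear sample sizes. The sketch below is not the paper's unseen-estimator; it is the naive plug-in (empirical) entropy estimate, included only to illustrate why sublinear sampling is hard: with k samples the plug-in estimate can never exceed log k, so for k well below the support size n it is biased low. All names and the uniform test distribution are ours.

```python
import math
import random
from collections import Counter

def plugin_entropy(samples):
    """Naive plug-in entropy estimate (in nats) from the empirical distribution."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Uniform distribution over 1000 symbols; true entropy = log(1000) ~ 6.91 nats.
rng = random.Random(0)
support = 1000
for k in (100, 1000, 10000):
    est = plugin_entropy([rng.randrange(support) for _ in range(k)])
    print(k, round(est, 3))  # biased low while k is small relative to support
```

At most k distinct symbols can appear in k samples, bounding the estimate by log k; the paper's contribution is closing this gap down to O(n/log n) samples.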
Probability Estimation in the Rare-Events Regime
Abstract

Cited by 8 (0 self)
We address the problem of estimating the probability of an observed string that is drawn i.i.d. from an unknown distribution. Motivated by models of natural language, we consider the regime in which the length of the observed string and the size of the underlying alphabet are comparably large. In this regime, the maximum-likelihood distribution tends to overestimate the probability of the observed letters, so the Good-Turing probability estimator is typically used instead. We show that when used to estimate the sequence probability, the Good-Turing estimator is not consistent in this regime. We then introduce a novel sequence probability estimator that is consistent. This estimator also yields consistent estimators for other quantities of interest and a consistent universal classifier.
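For reference, the classical Good-Turing estimate of the total probability of all unseen letters is simply the fraction of the sample occupied by letters seen exactly once. This is the textbook single-symbol estimator the abstract refers to, not the paper's new consistent sequence estimator; the function name is ours.

```python
from collections import Counter

def good_turing_missing_mass(sample):
    """Good-Turing estimate of the total probability of unseen symbols:
    N1/n, where N1 is the number of symbols appearing exactly once."""
    counts = Counter(sample)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(sample)

print(good_turing_missing_mass(list("abracadabra")))  # 'c','d' appear once -> 2/11
```

The paper's negative result concerns multiplying such per-symbol estimates into a whole-sequence probability, where the errors compound.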
Patterns and exchangeability
 In Proceedings of the IEEE Symposium on Information Theory
, 2010
Abstract

Cited by 2 (1 self)
In statistics and theoretical computer science, the notion of exchangeability provides a framework for the study of large alphabet scenarios. This idea has been developed in an important line of work starting with Kingman’s study of population genetics, and leading on to the paintbox processes of Kingman, the Chinese restaurant processes, and their generalizations. In information theory, the notion of the pattern of a sequence provides a framework for the study of large alphabet scenarios, as developed in work of Orlitsky and collaborators. The pattern is a statistic that captures all the information present in the data, and yet is universally compressible regardless of the alphabet size. In this note, connections are made between these two lines of work: specifically, patterns are examined in the context of exchangeability. After observing the relationship between patterns and Kingman’s paintbox processes, and discussing the redundancy of a class of mixture codes for patterns, alternate representations of patterns in terms of graph limits are discussed.
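The pattern statistic mentioned above has a simple concrete form: each symbol in the sequence is replaced by the order in which it first appeared, which discards symbol identities while preserving the repetition structure. A minimal sketch (function name is ours):

```python
def pattern(seq):
    """Pattern of a sequence: each symbol is replaced by the index of its
    first appearance (1 for the first new symbol, 2 for the next, ...)."""
    index = {}
    out = []
    for s in seq:
        if s not in index:
            index[s] = len(index) + 1  # first occurrence gets the next index
        out.append(index[s])
    return out

print(pattern("abracadabra"))  # [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
```

Because the pattern is invariant to relabeling the alphabet, it is exactly the statistic of an exchangeable partition, which is the connection the note develops.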
Information theory of Exchangeable ...
, 2012
Abstract
Exchangeable random partition processes are the basis for Bayesian approaches to statistical inference in large alphabet settings. On the other hand, the notion of the pattern of a sequence provides a framework for data compression in large alphabet scenarios. Because data compression and parameter estimation are intimately related, we study the redundancy of Bayes estimators coming from Poisson-Dirichlet priors (or “Chinese restaurant processes”) and the Pitman-Yor prior. This provides an understanding of these estimators in the setting of unknown discrete alphabets from the perspective of universal compression. In particular, we identify relations between alphabet sizes and sample sizes where the redundancy is small, and hence characterize useful regimes for these estimators.
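For readers unfamiliar with the Chinese restaurant process prior mentioned here, a minimal one-parameter sampler can be sketched as follows (the two-parameter Pitman-Yor variant adds a discount parameter, omitted in this sketch; function name and seed are ours):

```python
import random

def crp_partition(n, alpha, rng):
    """Sample table sizes for n customers from a Chinese restaurant process:
    customer i+1 joins an existing table with probability proportional to its
    size, or opens a new table with probability alpha/(i + alpha)."""
    tables = []  # sizes of occupied tables
    for i in range(n):
        r = rng.uniform(0, i + alpha)
        t = 0
        while t < len(tables) and r >= tables[t]:
            r -= tables[t]  # walk past existing tables by their weight
            t += 1
        if t < len(tables):
            tables[t] += 1
        else:
            tables.append(1)  # leftover mass (the alpha region) opens a new table
    return tables

print(crp_partition(20, alpha=1.0, rng=random.Random(0)))
```

Smaller alpha concentrates customers on fewer tables; this rich-get-richer behavior is what makes the induced partition exchangeable and useful as a large-alphabet prior.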
Sequence Probability Estimation for Large Alphabets
Abstract
We consider the problem of estimating the probability of an observed string drawn i.i.d. from an unknown distribution. The key feature of our study is that the length of the observed string is assumed to be of the same order as the size of the underlying alphabet. In this setting, many letters are unseen and the empirical distribution tends to overestimate the probability of the observed letters. To overcome this problem, the traditional approach to probability estimation is to use the classical Good-Turing estimator. We introduce a natural scaling model and use it to show that the Good-Turing sequence probability estimator is not consistent. We then introduce a novel sequence probability estimator that is indeed consistent under the natural scaling model.
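The overestimation described above is easy to see numerically: the empirical (maximum-likelihood) distribution assigns the observed sequence at least as much probability as any other distribution does, and when the alphabet size is comparable to the sequence length the gap over the true probability is large. A hedged sketch; the uniform true distribution is chosen only for illustration:

```python
import math
import random
from collections import Counter

def ml_sequence_log_prob(seq):
    """Log-probability of seq under the empirical (maximum-likelihood)
    distribution fit to seq itself."""
    counts = Counter(seq)
    n = len(seq)
    return sum(math.log(counts[s] / n) for s in seq)

rng = random.Random(0)
n = 1000
seq = [rng.randrange(n) for _ in range(n)]   # alphabet size ~ sequence length
true_log_prob = n * math.log(1.0 / n)        # i.i.d. uniform over n letters
print(ml_sequence_log_prob(seq) - true_log_prob)  # positive: ML overestimates
```

The gap arises because the empirical distribution puts zero mass on the many unseen letters and redistributes it onto the observed ones.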