Results 1 – 5 of 5
A better Good-Turing estimator for sequence probabilities
 in Proc. IEEE Int. Symp. Inf. Theory
Probability Estimation in the Rare-Events Regime
Abstract

Cited by 1 (0 self)
We address the problem of estimating the probability of an observed string that is drawn i.i.d. from an unknown distribution. Motivated by models of natural language, we consider the regime in which the length of the observed string and the size of the underlying alphabet are comparably large. In this regime, the maximum-likelihood distribution tends to overestimate the probability of the observed letters, so the Good-Turing probability estimator is typically used instead. We show that when used to estimate the sequence probability, the Good-Turing estimator is not consistent in this regime. We then introduce a novel sequence probability estimator that is consistent. This estimator also yields consistent estimators for other quantities of interest and a consistent universal classifier.
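As a quick illustration of the classical Good-Turing idea discussed in these abstracts (this sketch is not taken from any of the cited papers; the function name and interface are our own), the estimator assigns total probability mass (r+1)·N_{r+1}/n to the symbols seen exactly r times, where N_r is the number of distinct symbols with multiplicity r and n is the sample length. In particular, the mass assigned to unseen symbols is N_1/n.

```python
from collections import Counter

def good_turing_masses(sample):
    """Good-Turing estimate of the total probability mass M_r of all
    symbols seen exactly r times: M_r ~= (r + 1) * N_{r+1} / n,
    where N_r is the number of distinct symbols with multiplicity r
    and n is the sample length. M_0 is the mass of unseen symbols."""
    n = len(sample)
    counts = Counter(sample)                  # symbol -> multiplicity
    freq_of_freqs = Counter(counts.values())  # r -> N_r
    return {r: (r + 1) * freq_of_freqs.get(r + 1, 0) / n
            for r in range(0, max(freq_of_freqs) + 1)}
```

For the sample "aaabbc" (N_1 = N_2 = N_3 = 1, n = 6) this assigns mass 1/6 to unseen symbols, 2/6 to the singleton class, and 3/6 to the doubleton class. The papers above show that multiplying such per-letter estimates into a sequence probability is not consistent when the alphabet size scales with the sample length.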
Sequence Probability Estimation for Large Alphabets
, 704
Abstract
We consider the problem of estimating the probability of an observed string drawn i.i.d. from an unknown distribution. The key feature of our study is that the length of the observed string is assumed to be of the same order as the size of the underlying alphabet. In this setting, many letters are unseen and the empirical distribution tends to overestimate the probability of the observed letters. To overcome this problem, the traditional approach to probability estimation is to use the classical Good-Turing estimator. We introduce a natural scaling model and use it to show that the Good-Turing sequence probability estimator is not consistent. We then introduce a novel sequence probability estimator that is indeed consistent under the natural scaling model.
Estimating the Unseen: An n/log(n)-Sample Estimator for Entropy and Support Size, Shown Optimal via New CLTs
, 2011
Abstract
We introduce a new approach to characterizing the unobserved portion of a distribution, which provides sublinear-sample estimators achieving arbitrarily small additive constant error for a class of properties that includes entropy and distribution support size. Additionally, we show new matching lower bounds. Together, this settles the longstanding question of the sample complexities of these estimation problems, up to constant factors. Our algorithm estimates these properties up to an arbitrarily small additive constant, using O(n/log n) samples, where n is a bound on the support size, or in the case of estimating the support size, 1/n is a lower bound on the probability of any element of the domain. Previously, no explicit sublinear-sample algorithms for either of these problems were known. Our algorithm is also computationally extremely efficient, running in time linear in the number of samples used. In the second half of the paper, we provide a matching lower bound of Ω(n/log n) samples for estimating entropy or distribution support size to within an additive constant. The previous lower bounds on these sample complexities were n/2^(O(√log n)). To show our lower bound, we prove two new and natural multivariate central limit theorems (CLTs); the first uses Stein's method to relate the sum of independent distributions to the multivariate Gaussian of corresponding mean and covariance, under the earth-mover distance metric (also known as the Wasserstein metric). We leverage this central limit theorem to prove a stronger but more specific central limit theorem for "generalized multinomial" distributions, a large class of discrete distributions, parameterized by matrices ...
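For contrast with the sublinear-sample estimators described above, the naive plug-in (empirical) entropy estimator can be written in a few lines; this sketch is ours, not the paper's algorithm. The plug-in estimator is biased low when the sample size is small relative to the support, which is exactly the regime the cited work targets.

```python
import math
from collections import Counter

def plugin_entropy(sample):
    """Naive plug-in entropy estimate H(p_hat) in bits, computed from
    the empirical distribution of the sample. It systematically
    underestimates the true entropy when many symbols are unseen."""
    n = len(sample)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(sample).values())
```

For example, plugin_entropy("aabb") returns 1.0 bit, the entropy of the uniform distribution on two symbols; with few samples from a large alphabet, however, the estimate can fall far below the true entropy, motivating the unseen-mass corrections developed in the paper.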