Estimating the Prediction Function and the Number of Unseen Species in Sampling with Replacement
 Journal of the American Statistical Association
, 1998
Cited by 5 (0 self)
A sample of N units is taken from a population consisting of an unknown number of species. We are interested in estimating the number of species and the prediction function for future sampling. The prediction function is defined as the expected number of new species that will be found if an additional sample of size tN is taken, for any positive real number t. In this paper we point out that an estimator suggested by Efron & Thisted (1976) lacks some essential properties of the true prediction function, e.g., the property of alternating copositivity. As a result, it cannot be used for large values of t. We propose an alternative estimator that possesses the essential properties and is easily obtained. We illustrate our estimator with two numerical examples and a simulation study.
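For illustration, the classical alternating power-series form of the prediction-function estimator can be sketched as follows. This is a minimal sketch: it assumes the unsmoothed series sum over k of (-1)^(k+1) t^k n_k, where n_k is the number of species seen exactly k times, is the form the abstract has in mind. The function names are hypothetical, and the alternative estimator proposed in the paper is not reproduced here.

```python
from collections import Counter


def frequency_counts(sample):
    """Return {k: n_k}, where n_k is the number of species seen exactly k times."""
    species_counts = Counter(sample)
    return Counter(species_counts.values())


def alternating_series_prediction(sample, t):
    """Estimate the expected number of new species in a further sample of size t*N.

    Uses the unsmoothed alternating series sum_k (-1)^(k+1) * t^k * n_k
    (an assumption about which estimator is meant; see the lead-in).
    """
    n_k = frequency_counts(sample)
    return sum((-1) ** (k + 1) * t ** k * count for k, count in n_k.items())


if __name__ == "__main__":
    # Toy sample: species "a" seen twice, "b" once, "c" three times, "d" once.
    sample = ["a", "a", "b", "c", "c", "c", "d"]
    for t in (0.5, 1, 10):
        print(t, alternating_series_prediction(sample, t))
```

Because the terms grow like t^k, the series is dominated by its highest-order term once t exceeds 1, which is one way to see why such an estimator becomes unreliable for large t.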
Strong consistency of the Good-Turing estimator
 in IEEE Int. Symp. Inf. Theor. Proc
, 2006
Cited by 5 (3 self)
We consider the problem of estimating the total probability of all symbols that appear with a given frequency in a string of i.i.d. random variables with unknown distribution. We focus on the regime in which the block length is large yet no symbol appears frequently in the string. This is accomplished by allowing the distribution to change with the block length. Under a natural convergence assumption on the sequence of underlying distributions, we show that the total probabilities converge to a deterministic limit, which we characterize. We then show that the Good-Turing total probability estimator is strongly consistent.
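As a concrete illustration, the standard Good-Turing estimate of the total probability of all symbols that appear exactly k times in a string of length n is (k + 1) * n_{k+1} / n, where n_{k+1} is the number of distinct symbols appearing k + 1 times. The sketch below assumes this textbook form; the function name is hypothetical and the code makes no claim about the asymptotic regime studied in the paper.

```python
from collections import Counter


def good_turing_total_probability(sequence, k):
    """Good-Turing estimate of the total probability of symbols seen exactly k times.

    With k = 0 this is the familiar missing-mass estimate n_1 / n.
    """
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    symbol_counts = Counter(sequence)                   # symbol -> occurrences
    counts_of_counts = Counter(symbol_counts.values())  # k -> number of symbols seen k times
    return (k + 1) * counts_of_counts.get(k + 1, 0) / n


if __name__ == "__main__":
    text = "abracadabra"
    for k in range(5):
        print(k, good_turing_total_probability(text, k))
```

Summing the estimate over all k from 0 upward gives exactly 1, since the sum of k * n_k over k equals n.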
ASYMPTOTIC NORMALITY OF A NONPARAMETRIC ESTIMATOR OF SAMPLE COVERAGE
, 908
Cited by 1 (1 self)
This paper establishes a necessary and sufficient condition for the asymptotic normality of the nonparametric estimator of sample coverage proposed by Good [Biometrika 40 (1953) 237–264]. This new necessary and sufficient condition extends the validity of the asymptotic normality beyond the previously proven cases.
Coverage Adjusted Entropy Estimation
 TO APPEAR IN A SPECIAL ISSUE OF STATISTICS IN MEDICINE ON NEURONAL DATA ANALYSIS
, 2007
Cited by 1 (1 self)
Data on “neural coding” have frequently been analyzed using information-theoretic measures. These formulations involve the fundamental and generally difficult statistical problem of estimating entropy. We briefly review several methods that have been advanced to estimate entropy, and highlight a method, the coverage adjusted entropy estimator (CAE), due to Chao and Shen, that appeared recently in the environmental statistics literature. This method begins with the elementary Horvitz-Thompson estimator, developed for sampling from a finite population, and adjusts for the potential new species that have not yet been observed in the sample; these become the new patterns or “words” in a spike train that have not yet been observed. The adjustment is due to I.J. Good, and is called the Good-Turing coverage estimate. We provide a new empirical regularization derivation of the coverage-adjusted probability estimator, which shrinks the MLE. We prove that the CAE is consistent and first-order optimal, with rate $O_P(1/\log n)$, in the class of distributions with finite entropy variance, and that within the class of distributions with finite qth moment of the log-likelihood, the Good-Turing coverage estimate and the total probability of unobserved words converge at rate $O_P(1/(\log n)^q)$. We then provide a simulation study of the estimator with standard distributions and examples from neuronal data, where observations are dependent. The results show that, with a minor modification, the CAE performs much better than the MLE and is better than the Best Upper Bound estimator, due to Paninski, when the number of possible words m is unknown or infinite.
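The construction described above (shrink the MLE by the Good-Turing coverage, then apply a Horvitz-Thompson correction) can be sketched as follows. This is a minimal sketch assuming the commonly cited Chao-Shen form: coverage C = 1 - f_1/n, adjusted probabilities p_i = C * x_i/n, and entropy estimate -sum of p_i log(p_i) / (1 - (1 - p_i)^n). The function names are hypothetical, the "minor modification" mentioned in the abstract is not reproduced, and raising an error when every word is a singleton is a choice made here, not taken from the source.

```python
import math
from collections import Counter


def plug_in_entropy(words):
    """Maximum-likelihood (plug-in) entropy estimate in nats."""
    n = len(words)
    return -sum((c / n) * math.log(c / n) for c in Counter(words).values())


def coverage_adjusted_entropy(words):
    """Coverage-adjusted entropy estimate in nats (assumed Chao-Shen form)."""
    n = len(words)
    if n == 0:
        raise ValueError("need at least one observation")
    counts = Counter(words)
    singletons = sum(1 for c in counts.values() if c == 1)
    if singletons == n:
        # Estimated coverage would be zero; this sketch does not handle that case.
        raise ValueError("every word is a singleton; coverage estimate is zero")
    coverage = 1.0 - singletons / n            # Good-Turing coverage estimate
    entropy = 0.0
    for c in counts.values():
        p = coverage * c / n                   # shrunken probability estimate
        inclusion = 1.0 - (1.0 - p) ** n       # chance the word appears in n draws
        entropy -= p * math.log(p) / inclusion  # Horvitz-Thompson weighting
    return entropy


if __name__ == "__main__":
    words = ["a", "a", "a", "b", "b", "c"]
    print(plug_in_entropy(words), coverage_adjusted_entropy(words))
```

On this toy sample the coverage-adjusted value is larger than the plug-in value, which is the direction of correction one expects given that the plug-in estimator tends to underestimate entropy.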
unknown title
, 2004
doi:10.1093/bioinformatics/bth239 Estimating and comparing the rates of gene discovery and expressed sequence tag (EST) frequencies in EST surveys
A Berry-Esseen bound for the uniform multinomial occupancy model
The inductive size bias coupling technique and Stein’s method yield a Berry-Esseen theorem for the number of urns having occupancy d ≥ 2 when n balls are uniformly distributed over m urns. In particular, there exists a constant C depending only on d such that $\sup_{z \in \mathbb{R}} \left| P(W_{n,m} \le z) - P(Z \le z) \right| \le \frac{C}{\sigma_{n,m}} \left( 1 + \left( \frac{n}{m} \right)^3 \right)$ for all n ≥ d and m ≥ 2, where $W_{n,m}$ and $\sigma^2_{n,m}$ are the standardized count and variance, respectively, of the number of urns with d balls, and Z is a standard normal random variable. Asymptotically, the bound is optimal up to constants if n and m tend to infinity together in a way such that n/m stays bounded.
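To make the occupancy model concrete, the sketch below simulates throwing n balls uniformly into m urns and counts the urns that hold exactly d balls, alongside the exact expectation m * C(n, d) * (1/m)^d * (1 - 1/m)^(n - d). It only illustrates the random variable being studied; the function names are hypothetical and nothing here computes or verifies the Berry-Esseen bound itself.

```python
import math
import random
from collections import Counter


def expected_urns_with_d_balls(n, m, d):
    """Exact expected number of urns holding exactly d of the n balls."""
    p = 1.0 / m
    return m * math.comb(n, d) * p ** d * (1.0 - p) ** (n - d)


def simulate_urns_with_d_balls(n, m, d, rng):
    """Throw n balls uniformly into m urns once; count urns with exactly d balls."""
    occupancy = Counter(rng.randrange(m) for _ in range(n))
    if d == 0:
        return m - len(occupancy)  # urns that received no ball
    return sum(1 for balls in occupancy.values() if balls == d)


if __name__ == "__main__":
    rng = random.Random(0)
    n, m, d = 200, 100, 2
    draws = [simulate_urns_with_d_balls(n, m, d, rng) for _ in range(2000)]
    print("simulated mean:", sum(draws) / len(draws))
    print("exact mean:    ", expected_urns_with_d_balls(n, m, d))
```

Standardizing the simulated counts by their mean and standard deviation and comparing the result with a normal distribution gives an informal view of the approximation that the theorem quantifies.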
Probability Estimation in the Rare-Events Regime
We address the problem of estimating the probability of an observed string that is drawn i.i.d. from an unknown distribution. Motivated by models of natural language, we consider the regime in which the length of the observed string and the size of the underlying alphabet are comparably large. In this regime, the maximum likelihood distribution tends to overestimate the probability of the observed letters, so the Good-Turing probability estimator is typically used instead. We show that when used to estimate the sequence probability, the Good-Turing estimator is not consistent in this regime. We then introduce a novel sequence probability estimator that is consistent. This estimator also yields consistent estimators for other quantities of interest and a consistent universal classifier.