Probability Estimation in the Rare-Events Regime
Cited by 9 (0 self)
We address the problem of estimating the probability of an observed string that is drawn i.i.d. from an unknown distribution. Motivated by models of natural language, we consider the regime in which the length of the observed string and the size of the underlying alphabet are comparably large. In this regime, the maximum likelihood distribution tends to overestimate the probability of the observed letters, so the Good-Turing probability estimator is typically used instead. We show that when used to estimate the sequence probability, the Good-Turing estimator is not consistent in this regime. We then introduce a novel sequence probability estimator that is consistent. This estimator also yields consistent estimators for other quantities of interest and a consistent universal classifier.
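The contrast between the maximum likelihood and Good-Turing estimates mentioned in this abstract can be sketched as follows. This is a minimal illustration of the textbook, unsmoothed Good-Turing formula, not the consistent estimator the paper introduces; the function name `good_turing_estimates` and the toy sample are assumptions made for the example.

```python
from collections import Counter

def good_turing_estimates(sample):
    """Return (ml, gt, missing_mass) for a sequence of symbols.

    ml[s] is the maximum likelihood estimate r / n for a symbol seen r times.
    gt[s] is the unsmoothed Good-Turing estimate (r + 1) * N_{r+1} / (n * N_r),
    where N_r is the number of distinct symbols seen exactly r times.
    missing_mass is the Good-Turing estimate N_1 / n of the total
    probability of symbols that were never observed.
    """
    n = len(sample)
    counts = Counter(sample)                 # symbol -> r
    freq_of_freq = Counter(counts.values())  # r -> N_r
    ml = {s: r / n for s, r in counts.items()}
    gt = {}
    for s, r in counts.items():
        n_r = freq_of_freq[r]                # at least 1, since s itself has count r
        n_r1 = freq_of_freq.get(r + 1, 0)    # may be 0; the raw estimate is then 0
        gt[s] = (r + 1) * n_r1 / (n * n_r)
    missing_mass = freq_of_freq.get(1, 0) / n
    return ml, gt, missing_mass

# Hypothetical toy sample: a occurs 5 times, b and r twice, c and d once.
ml, gt, m0 = good_turing_estimates("abracadabra")
```

Because the raw formula returns zero whenever N_{r+1} is zero, practical implementations usually smooth the N_r values first; that step is omitted here.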
Strong consistency of the Good-Turing estimator
IEEE Int. Symp. Inf. Theor. Proc., 2006
Estimating the Prediction Function and the Number of Unseen Species in Sampling with Replacement
Journal of the American Statistical Association, 1998
Cited by 7 (0 self)
A sample of N units is taken from a population consisting of an unknown number of species. We are interested in estimating the number of species and the prediction function for future sampling. The prediction function is defined as the expected number of new species that will be found if an additional sample of size tN is taken, for any positive real number t. In this paper we point out that an estimator suggested by Efron & Thisted (1976) lacks some essential properties of the true prediction function, e.g., the property of alternating copositivity. As a result, it cannot be used for large values of t. We propose an alternative estimator which possesses the essential properties, and is easily obtained. We illustrate our estimator with two numerical examples and a simulation study.
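For orientation, the prediction function is often approximated by a classical alternating series in the frequency counts, sum over k of (-1)^(k+1) t^k N_k, where N_k is the number of species seen exactly k times. The sketch below implements only that unsmoothed series, as an assumption about the kind of estimator being criticized; it is not the alternative estimator proposed in the paper, and the function name is hypothetical.

```python
from collections import Counter

def alternating_series_prediction(sample, t):
    """Estimate the expected number of new species in an additional
    sample of size t * N using the unsmoothed alternating series
    sum_k (-1)^(k+1) * t^k * N_k.
    """
    counts = Counter(sample)                 # species -> times observed
    freq_of_freq = Counter(counts.values())  # k -> N_k
    return sum((-1) ** (k + 1) * t ** k * n_k for k, n_k in freq_of_freq.items())

# Hypothetical toy sample with N_1 = 2, N_2 = 2, N_5 = 1.
sample = "abracadabra"
small_t = alternating_series_prediction(sample, 0.5)
unit_t = alternating_series_prediction(sample, 1)
large_t = alternating_series_prediction(sample, 3)
```

For t above 1 the powers t^k dominate and the series swings widely, which is one way to see why an estimator of this form is unreliable for large t.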
Fundamental problem of forensic mathematics – The Evidential Value of a Rare Haplotype
2009
Cited by 3 (0 self)
Y-chromosomal and mitochondrial haplotyping offer special advantages for criminal (and other) identification. For different reasons, each of them is sometimes detectable in a crime stain for which autosomal typing fails. But they also present special problems, including a fundamental mathematical one: When a rare haplotype is shared between suspect and crime scene, how strong is the evidence linking the two? Assume a reference population sample is available which contains n − 1 haplotypes. The most interesting situation as well as the most common one is that the crime scene haplotype was never observed in the population sample. The traditional tools of product rule and sample frequency are not useful when there are no components to multiply and the sample frequency is zero. A useful statistic is the fraction κ of the population sample that consists of “singletons” – of once-observed types. A simple argument shows that the probability for a random innocent suspect to match a previously unobserved crime scene type is (1 − κ)/n – distinctly less than 1/n, likely ten times less. The robust validity
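The arithmetic in this abstract can be sketched in a few lines. The sketch assumes the garbled formula reads (1 − κ)/n, with κ the fraction of the reference sample made up of once-observed types and a reference sample of n − 1 haplotypes; the function names and the toy data are hypothetical.

```python
from collections import Counter

def singleton_fraction(reference_sample):
    """Fraction kappa of the reference sample consisting of types seen exactly once."""
    counts = Counter(reference_sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(reference_sample)

def match_probability_unseen(reference_sample):
    """Probability (1 - kappa) / n that a random innocent suspect matches a
    crime scene type absent from the reference sample (assumed reading)."""
    n = len(reference_sample) + 1   # the reference sample holds n - 1 haplotypes
    kappa = singleton_fraction(reference_sample)
    return (1 - kappa) / n

# Hypothetical reference sample of 9 haplotypes, 4 of them singletons.
reference = ["h1", "h2", "h2", "h3", "h4", "h4", "h4", "h5", "h6"]
kappa = singleton_fraction(reference)
p_match = match_probability_unseen(reference)
```

With κ close to 0.9, as can happen when most observed types are singletons, the result is roughly a tenth of 1/n, in line with the "ten times less" remark.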
ASYMPTOTIC NORMALITY OF A NONPARAMETRIC ESTIMATOR OF SAMPLE COVERAGE
908
Cited by 3 (2 self)
This paper establishes a necessary and sufficient condition for the asymptotic normality of the nonparametric estimator of sample coverage proposed by Good [Biometrika 40 (1953) 237–264]. This new necessary and sufficient condition extends the validity of the asymptotic normality beyond the previously proven cases.
Patterns and exchangeability
In Proceedings of the IEEE Symposium on Information Theory, 2010
Cited by 2 (1 self)
In statistics and theoretical computer science, the notion of exchangeability provides a framework for the study of large alphabet scenarios. This idea has been developed in an important line of work starting with Kingman’s study of population genetics, and leading on to the paintbox processes of Kingman, the Chinese restaurant processes and their generalizations. In information theory, the notion of the pattern of a sequence provides a framework for the study of large alphabet scenarios, as developed in work of Orlitsky and collaborators. The pattern is a statistic that captures all the information present in the data, and yet is universally compressible regardless of the alphabet size. In this note, connections are made between these two lines of work – specifically, patterns are examined in the context of exchangeability. After observing the relationship between patterns and Kingman’s paintbox processes, and discussing the redundancy of a class of mixture codes for patterns, alternate representations of patterns in terms of graph limits are discussed.
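The abstract does not define the pattern of a sequence. Under the usual definition, each symbol is replaced by the order in which it first appeared, which can be sketched as below; the function name `pattern` and the example strings are assumptions made for illustration.

```python
def pattern(sequence):
    """Replace each symbol by the 1-based index of its first appearance order."""
    first_seen = {}
    out = []
    for symbol in sequence:
        if symbol not in first_seen:
            first_seen[symbol] = len(first_seen) + 1  # next unused index
        out.append(first_seen[symbol])
    return out

p1 = pattern("abracadabra")
p2 = pattern("xyzxwxvxyzx")  # same sequence with every symbol relabeled
```

Relabeling the alphabet leaves the pattern unchanged, which is why the statistic does not depend on the alphabet size.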
A Berry-Esseen bound for the uniform multinomial occupancy model
2013
Cited by 2 (1 self)
The inductive size bias coupling technique and Stein’s method yield a Berry-Esseen theorem for the number of urns having occupancy d ≥ 2 when n balls are uniformly distributed over m urns. In particular, there exists a constant C depending only on d such that sup_{z ∈ R} |P(W_{n,m} ≤ z) − P(Z ≤ z)| ≤ (C/σ_{n,m}) (1 + (n/m)^3) for all n ≥ d and m ≥ 2, where W_{n,m} and σ_{n,m}^2 are the standardized count and variance, respectively, of the number of urns with d balls, and Z is a standard normal random variable. Asymptotically, the bound is optimal up to constants if n and m tend to infinity together in a way such that n/m stays bounded.
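The quantity studied here, the number of urns holding exactly d balls, is easy to simulate. The sketch below standardizes the simulated counts with their empirical mean and standard deviation, which is an assumption made for convenience; the abstract refers to the exact mean and variance. Function names and parameter values are hypothetical.

```python
import random
from collections import Counter

def count_urns_with_occupancy(n, m, d, rng):
    """Throw n balls uniformly into m urns; return how many urns hold exactly d balls.

    Only non-empty urns are tracked, so d must be at least 1.
    """
    loads = Counter(rng.randrange(m) for _ in range(n))
    return sum(1 for load in loads.values() if load == d)

def standardized_counts(n, m, d, trials, seed=0):
    """Simulate the count repeatedly and standardize it empirically."""
    rng = random.Random(seed)
    counts = [count_urns_with_occupancy(n, m, d, rng) for _ in range(trials)]
    mean = sum(counts) / trials
    variance = sum((c - mean) ** 2 for c in counts) / (trials - 1)
    sd = variance ** 0.5
    return [(c - mean) / sd for c in counts]

# Hypothetical setting with n / m = 2 and occupancy d = 2.
w = standardized_counts(n=200, m=100, d=2, trials=500)
```

A histogram of `w` can then be compared with the standard normal density to get a feel for the approximation the bound quantifies.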
Coverage Adjusted Entropy Estimation
TO APPEAR IN A SPECIAL ISSUE OF STATISTICS IN MEDICINE ON NEURONAL DATA ANALYSIS, 2007
Cited by 2 (2 self)
Data on “neural coding” have frequently been analyzed using information-theoretic measures. These formulations involve the fundamental, and generally difficult, statistical problem of estimating entropy. We review briefly several methods that have been advanced to estimate entropy, and highlight a method, the coverage adjusted entropy estimator (CAE), due to Chao and Shen that appeared recently in the environmental statistics literature. This method begins with the elementary Horvitz-Thompson estimator, developed for sampling from a finite population, and adjusts for the potential new species that have not yet been observed in the sample; these become the new patterns or “words” in a spike train that have not yet been observed. The adjustment is due to I.J. Good, and is called the Good-Turing coverage estimate. We provide a new empirical regularization derivation of the coverage-adjusted probability estimator, which shrinks the MLE. We prove that the CAE is consistent and first-order optimal, with rate O_P(1/log n), in the class of distributions with finite entropy variance and that within the class of distributions with finite qth moment of the log-likelihood, the Good-Turing coverage estimate and the total probability of unobserved words converge at rate O_P(1/(log n)^q). We then provide a simulation study of the estimator with standard distributions and examples from neuronal data, where observations are dependent. The results show that, with a minor modification, the CAE performs much better than the MLE and is better than the Best Upper Bound estimator, due to Paninski, when the number of possible words m is unknown or infinite.
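The construction described in this abstract can be sketched as follows, assuming the commonly cited Chao and Shen form: the MLE probabilities are shrunk by the Good-Turing coverage estimate C = 1 − N_1/n, and each entropy term is divided by the Horvitz-Thompson inclusion probability 1 − (1 − p)^n. The "minor modification" mentioned in the abstract is not reproduced, and the function names are hypothetical.

```python
import math
from collections import Counter

def mle_entropy(sample):
    """Plug-in (maximum likelihood) entropy estimate in nats."""
    n = len(sample)
    return -sum((c / n) * math.log(c / n) for c in Counter(sample).values())

def coverage_adjusted_entropy(sample):
    """Coverage adjusted entropy estimate in nats (assumed Chao-Shen form)."""
    n = len(sample)
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    if singletons == n:
        # Coverage estimate would be 0, so the shrunken probabilities vanish.
        raise ValueError("every observation is a singleton; estimate undefined")
    coverage = 1 - singletons / n          # Good-Turing coverage estimate
    entropy = 0.0
    for c in counts.values():
        p = coverage * c / n               # coverage-adjusted probability
        inclusion = 1 - (1 - p) ** n       # chance the word appears in n draws
        entropy -= p * math.log(p) / inclusion
    return entropy

# Hypothetical toy samples of observed "words".
cae = coverage_adjusted_entropy("aabb")
mle = mle_entropy("aabb")
cae_with_singleton = coverage_adjusted_entropy("aabbc")
```

On the first toy sample there are no singletons, so the coverage estimate is 1 and only the Horvitz-Thompson correction separates the two estimates.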