Random Sampling for Histogram Construction: How much is enough?
, 1998
Cited by 106 (11 self)
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for prespecified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...
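As a rough illustration of the idea in the abstract above, the following minimal sketch builds equi-height bucket boundaries from a uniform random sample and then measures how far the resulting bucket sizes are from the ideal. The function names and the error measure (largest relative deviation of any bucket from the ideal size) are assumptions chosen for illustration; they are not the paper's exact metric, bound, or page sampling algorithm.

```python
import bisect
import random


def equi_height_boundaries(values, num_buckets, sample_size, seed=0):
    """Approximate equi-height separators from a uniform random sample.

    Hypothetical helper: returns num_buckets - 1 separator values taken
    from the sorted sample at evenly spaced ranks.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(values, sample_size))
    return [sample[(i * sample_size) // num_buckets]
            for i in range(1, num_buckets)]


def bucket_counts(values, boundaries):
    """Count how many of the full data set's values fall in each bucket."""
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        # Number of separators <= v gives the bucket index.
        counts[bisect.bisect_right(boundaries, v)] += 1
    return counts


def max_relative_error(values, boundaries):
    """Largest relative deviation of any bucket from the ideal size n/k.

    Illustrative stand-in for a "small in all regions" error measure.
    """
    counts = bucket_counts(values, boundaries)
    ideal = len(values) / len(counts)
    return max(abs(c - ideal) for c in counts) / ideal


if __name__ == "__main__":
    data = list(range(10000))
    for size in (100, 1000, 10000):
        seps = equi_height_boundaries(data, 4, size)
        print(size, seps, round(max_relative_error(data, seps), 3))
```

Running the script with increasing sample sizes gives a feel for how the worst-bucket error shrinks as more of the data is sampled; when the sample is the whole data set, the buckets are exactly equal in this example.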
Counting the uncountable: statistical approaches to estimating microbial diversity
 Appl. Environ
, 2001
Cited by 36 (0 self)
The pervasive effects of an antibiotic on the human gut microbiota, as revealed by deep 16S rRNA sequencing
 PLoS Biol
, 2008
Cited by 17 (0 self)
The human intestinal microbiota is essential to the health of the host and plays a role in nutrition, development, metabolism, pathogen resistance, and regulation of immune responses. Antibiotics may disrupt these coevolved interactions, leading to acute or chronic disease in some individuals. Our understanding of antibiotic-associated disturbance of the microbiota has been limited by the poor sensitivity, inadequate resolution, and significant cost of current research methods. The use of pyrosequencing technology to generate large numbers of 16S rDNA sequence tags circumvents these limitations and has been shown to reveal previously unexplored aspects of the “rare biosphere.” We investigated the distal gut bacterial communities of three healthy humans before and after treatment with ciprofloxacin, obtaining more than 7,000 full-length rRNA sequences and over 900,000 pyrosequencing reads from two hypervariable regions of the rRNA gene. A companion paper in PLoS Genetics (see Huse et al., doi: 10.1371/journal.pgen.1000255) shows that the taxonomic information obtained with these methods is concordant. Pyrosequencing of the V6 and V3 variable regions identified 3,300–5,700 taxa that collectively accounted for over 99% of the variable region sequence tags that could be obtained from these samples. Ciprofloxacin treatment influenced the abundance of about a third of the bacterial taxa in the gut, decreasing the taxonomic richness, diversity, and evenness of the community. However, the magnitude of this effect varied among individuals, and some taxa showed interindividual variation in the response to ciprofloxacin. While differences of community composition
Cardinality Estimation Using Sample Views with Quality Assurance
Cited by 7 (1 self)
Accurate cardinality estimation is critically important to high-quality query optimization. It is well known that conventional cardinality estimation based on histograms or similar statistics may produce extremely poor estimates in a variety of situations, for example, queries with complex predicates, correlation among columns, or predicates containing user-defined functions. In this paper, we propose a new, general cardinality estimation technique that combines random sampling and materialized view technology to produce accurate estimates even in these situations. As a major innovation, we exploit feedback information from query execution and process control techniques to assure that estimates remain statistically valid when the underlying data changes. Experimental results based on a prototype implementation in Microsoft SQL Server demonstrate the practicality of the approach and illustrate the dramatic effects improved cardinality estimates may have.
Toward a census of bacteria in soil
 PLOS Comp Biol
, 2006
Cited by 7 (0 self)
For more than a century, microbiologists have sought to determine the species richness of bacteria in soil, but the extreme complexity and unknown structure of soil microbial communities have obscured the answer. We developed a statistical model that makes the problem of estimating richness statistically accessible by evaluating the characteristics of samples drawn from simulated communities with parametric community distributions. We identified simulated communities with rank-abundance distributions that followed a truncated lognormal distribution whose samples resembled the structure of 16S rRNA gene sequence collections made using Alaskan and Minnesotan soils. The simulated communities constructed based on the distribution of 16S rRNA gene sequences sampled from the Alaskan and Minnesotan soils had a richness of 5,000 and 2,000 operational taxonomic units (OTUs), respectively, where an OTU represents a collection of sequences not more than 3% distant from each other. To sample each of these OTUs in the Alaskan 16S rRNA gene library at least twice, 480,000 sequences would be required; however, to estimate the richness of the simulated communities using nonparametric richness estimators would require only 18,000 sequences. Quantifying the richness of complex environments such as soil is an important step in building an ecological framework. We have shown that generating sufficient sequence data to do so requires less sequencing effort than completely sequencing a bacterial genome. Citation: Schloss PD, Handelsman J (2006) Toward a census of bacteria in soil. PLoS Comp Biol 2(7): e92. DOI: 10.1371/journal.pcbi.0020092
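The abstract above refers to nonparametric richness estimators without naming one. As a hedged example of what such an estimator looks like, the sketch below computes the bias-corrected Chao1 estimate from a list of observed OTU labels; the choice of Chao1 and the function name are assumptions made for illustration, not a statement of which estimators the authors used.

```python
from collections import Counter


def chao1_bias_corrected(observations):
    """Bias-corrected Chao1 richness estimate (illustrative choice).

    observations: iterable of OTU labels, one per sequenced read.
    Estimate = S_obs + f1 * (f1 - 1) / (2 * (f2 + 1)), where S_obs is the
    number of distinct OTUs seen, f1 the number seen exactly once, and
    f2 the number seen exactly twice.
    """
    abundance = Counter(observations)
    s_obs = len(abundance)
    f1 = sum(1 for c in abundance.values() if c == 1)
    f2 = sum(1 for c in abundance.values() if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


if __name__ == "__main__":
    # Hypothetical toy sample: four OTUs, two singletons, two doubletons.
    reads = ["a", "a", "b", "c", "c", "d"]
    print(chao1_bias_corrected(reads))
```

The estimate exceeds the observed count whenever there are at least two singletons, reflecting the intuition that many rarely seen OTUs suggest more unseen ones.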
Estimation of sums of random variables: Examples and information
 Annals of Statistics
, 2005
Cited by 6 (2 self)
This paper concerns the estimation of sums of functions of observable and unobservable variables. Lower bounds for the asymptotic variance and a convolution theorem are derived in general finite- and infinite-dimensional models. An explicit relationship is established between efficient influence functions for the estimation of sums of variables and the estimation of their means. Certain “plug-in” estimators are proved to be asymptotically efficient in finite-dimensional models, while “u, v” estimators of Robbins are proved to be efficient in infinite-dimensional mixture models. Examples include certain species, network and data confidentiality problems.
Seasonal changes in an alpine soil bacterial community in the Colorado rocky
, 2004
Cited by 5 (0 self)
Crowdsourced Enumeration Queries
Cited by 3 (1 self)
Hybrid human/computer database systems promise to greatly expand the usefulness of query processing by incorporating the crowd for data gathering and other tasks. Such systems raise many implementation questions. Perhaps the most fundamental question is that the closed world assumption underlying relational query semantics does not hold in such systems. As a consequence the meaning of even simple queries can be called into question. Furthermore, query progress monitoring becomes difficult due to nonuniformities in the arrival of crowdsourced data and peculiarities of how people work in crowdsourcing systems. To address these issues, we develop statistical tools that enable users and systems developers to reason about query completeness. These tools can also help drive query execution and crowdsourcing strategies. We evaluate our techniques using experiments on a popular crowdsourcing platform.
Estimating the number of classes
 Annals of Statistics
, 2007
Cited by 2 (1 self)
Estimating the unknown number of classes in a population has numerous important applications. In a Poisson mixture model, the problem is reduced to estimating the odds that a class is undetected in a sample. The discontinuity of the odds prevents the existence of locally unbiased and informative estimators and restricts confidence intervals to be one-sided. Confidence intervals for the number of classes are also necessarily one-sided. A sequence of lower bounds to the odds is developed and used to define pseudo maximum likelihood estimators for the number of classes.
Culture-Independent Characterization of the Microbiota of the Ant Lion Myrmeleon mobilis (Neuroptera: Myrmeleontidae)
, 2005
Cited by 2 (0 self)
