Sampling-based estimation of the number of distinct values of an attribute
In Proc. 21st International Conf. on Very Large Data Bases, 1995
Cited by 97 (9 self)
We provide several new sampling-based estimators of the number of distinct values of an attribute in a relation. We compare these new estimators to estimators from the database and statistical literature empirically, using a large number of attribute-value distributions drawn from a variety of real-world databases. This appears to be the first extensive comparison of distinct-value estimators in either the database or statistical literature, and is certainly the first to use highly-skewed data of the sort frequently encountered in database applications. Our experiments indicate that a new “hybrid” estimator yields the highest precision on average for a given sampling fraction. This estimator explicitly takes into account the degree of skew in the data and combines a new “smoothed jackknife” estimator with an estimator due to Shlosser. We investigate how the hybrid estimator behaves as we scale up the size of the database.
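As a rough illustration of the kind of sample-based estimator this abstract refers to, the sketch below implements the commonly cited form of Shlosser's estimator, assuming each row enters the sample independently with probability q. The function name `shlosser_estimate` and the list-of-values input are assumptions made for illustration; this is not the paper's hybrid or smoothed jackknife estimator.

```python
from collections import Counter

def shlosser_estimate(sample, q):
    """Estimate the number of distinct values in the full relation.

    sample: list of attribute values drawn from the relation (assumed input format)
    q:      sampling fraction, 0 < q <= 1
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    value_counts = Counter(sample)
    d = len(value_counts)                 # distinct values seen in the sample
    f = Counter(value_counts.values())    # f[i] = number of values seen exactly i times
    if f[1] == 0:
        # No singletons: the correction term vanishes.
        return float(d)
    numerator = sum((1 - q) ** i * f_i for i, f_i in f.items())
    denominator = sum(i * q * (1 - q) ** (i - 1) * f_i for i, f_i in f.items())
    return d + f[1] * numerator / denominator
```

The estimate starts from the distinct count observed in the sample and adds a correction driven by the number of values seen exactly once, so it never falls below the observed count.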
Random Sampling for Histogram Construction: How much is enough?
1998
Cited by 91 (11 self)
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...
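To make the idea of a sample-based equi-height histogram concrete, here is a minimal sketch that picks bucket separators from a uniform row-level random sample. The function name, parameters, and the choice of uniform row sampling are assumptions for illustration; the sketch does not implement the adaptive page sampling algorithm or the sampling bound described in the abstract.

```python
import random

def equi_height_boundaries(values, num_buckets, sample_size, seed=0):
    """Approximate equi-height bucket separators from a random sample.

    values:      list of attribute values (assumed to fit in memory)
    num_buckets: number of buckets in the histogram
    sample_size: rows to sample; capped at len(values)
    """
    if not values or num_buckets < 1:
        raise ValueError("need a non-empty column and at least one bucket")
    rng = random.Random(seed)
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    n = len(sample)
    # Separator j is the sample value at rank j*n/num_buckets, so each
    # bucket holds roughly the same number of sampled rows.
    return [sample[(j * n) // num_buckets] for j in range(1, num_buckets)]
```

With a larger sample the separators track the true quantiles more closely; how large the sample must be for a given error bound is the question the paper addresses.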
Robust estimation of population size in closed animal populations from capture-recapture experiments
1983
Cited by 6 (0 self)
This paper considers the problem of finding robust estimators of population size in closed K-sample capture-recapture experiments. Particular attention is paid to models where heterogeneity of capture probabilities is allowed. First a general estimation procedure is given which does not depend on assuming anything about the form of the distribution of capture probabilities. This is followed by a detailed discussion of the usefulness of the generalized jackknife technique to reduce bias. Numerical comparisons of the bias and variance of various estimators are given. Finally a general discussion is given with several recommendations on estimators to be used in practice. Key words: Capture-recapture sampling; Population size estimation; Heterogeneity;
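For readers unfamiliar with jackknife estimators in this setting, the sketch below shows the first-order jackknife estimator of population size usually attributed to Burnham and Overton, N = S + ((K - 1) / K) * f1, where S is the number of distinct animals caught, K is the number of capture occasions, and f1 is the number of animals caught exactly once. The function name and input format are assumptions, and the abstract does not say whether this particular form is among the paper's recommendations.

```python
from collections import Counter

def jackknife1_population(occasions):
    """First-order jackknife estimate of closed-population size.

    occasions: list of K collections, each holding the IDs of animals
               captured on that occasion (assumed input format)
    """
    k = len(occasions)
    if k == 0:
        raise ValueError("need at least one capture occasion")
    # Count on how many occasions each animal was captured.
    captures = Counter(animal for occ in occasions for animal in set(occ))
    s = len(captures)                                   # distinct animals seen
    f1 = sum(1 for c in captures.values() if c == 1)    # seen exactly once
    return s + (k - 1) / k * f1
```

The correction term grows with the number of animals seen only once, which is the intuition behind using singletons to infer how many animals were never seen.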
SPECIES RICHNESS ESTIMATION
Abstract. Various models and estimation procedures for estimating the number of species in a community are reviewed under the following sampling schemes: sampling by continuous-type of efforts, sampling by individuals, and sampling by quadrats (or multiple occasions). Applications and relevant software are briefly reviewed.
Distinct Value Estimation on Peer-to-Peer Networks
Peer-to-Peer networks have become very popular on the Internet, with millions of peers all over the world sharing large volumes of data. In the assistive healthcare sector, it is likely that P2P networks will develop that interconnect and allow the controlled sharing of patient databases of various hospitals, clinics, and research laboratories. However, the sheer scale of these networks has made it difficult to gather statistics that could be used for building new features. In this paper, we present a technique to obtain estimations of the number of distinct values matching a query on the network. We evaluate the technique experimentally and provide a set of results that demonstrate its effectiveness, as well as its flexibility in supporting a variety of queries and applications.
Northern Flying Squirrel Mycophagy and Truffle Production in Fir Forests in Northeastern California
In this paper we summarize the results of four studies in which we either examined the feeding habits of the northern flying squirrel (Glaucomys sabrinus), a mycophagous (consuming fungi) small mammal, or compared the abundance of truffles (sporocarps of hypogeous mycorrhizal fungi) among different types of fir (Abies) forest. The studies were conducted within the Lassen National Forest in northeastern California between 1990 and 1994. In the first study, we found that abundance of northern flying squirrels was significantly less in old-growth fir stands that had been shelterwood-logged 6 to 7 years previously than in nearby, unlogged old-growth and mature fir stands. Truffles were common in the diet of flying squirrels, truffle frequency was low in the shelterwood-logged stands compared to the unlogged old-growth and mature stands, and abundance of flying squirrels was correlated with truffle frequency across the 12 stands in which we estimated both. In the second study, we found no significant effects on total truffle frequency and biomass of truffles from commercial thinning or broadcast burning that had occurred about 10 years previously, but there were significant effects of thinning on frequencies of individual truffle genera. In the
Population estimation with sparse data: the role of estimators versus indices revisited
Abstract: The use of indices to evaluate small-mammal populations has been heavily criticized, yet a review of small-mammal studies published from 1996 through 2000 indicated that indices are still the primary methods employed for measuring populations. The literature review also found that 98% of the samples collected in these studies were too small for reliable selection among population-estimation models. Researchers therefore generally have a choice between
On Estimators for Aggregate Relational Algebra Queries
1995
CASE-DB is a relational database management system that allows users to specify time constraints in queries. For an aggregate query AGG(E) where AGG is one of COUNT, SUM and AVERAGE, and E is a relational algebra expression, CASE-DB uses statistical estimators to approximate the query. This paper extends our earlier work on statistical estimators of CASE-DB with the following features: (a) New statistical estimators for COUNT queries with projection. (b) Extending the methodology for SUM and AVERAGE aggregate queries. (c) New sampling plans based on systematic sampling and stratified sampling. We also present performance evaluation experiments of the estimators with the above extensions using correlated and uncorrelated database instances.
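The general idea of approximating an aggregate from a sample can be sketched as follows for the simplest case, a COUNT over a selection predicate estimated by scaling up the hit rate in a simple random sample. The function name and parameters are assumptions for illustration; this is not CASE-DB's estimator, and it does not cover COUNT with projection, systematic sampling, or stratified sampling.

```python
import random

def estimate_count(rows, predicate, sample_size, seed=0):
    """Estimate how many rows satisfy `predicate` from a simple random sample.

    rows:        list of tuples or values (assumed to be addressable in memory)
    predicate:   function returning True for rows that should be counted
    sample_size: number of rows to sample without replacement
    """
    n_total = len(rows)
    if not 0 < sample_size <= n_total:
        raise ValueError("sample_size must be between 1 and len(rows)")
    rng = random.Random(seed)
    sample = rng.sample(rows, sample_size)
    hits = sum(1 for row in sample if predicate(row))
    # Scale the sample hit rate up to the full relation.
    return n_total * hits / sample_size
```

Sampling every row reproduces the exact count, while smaller samples trade accuracy for time, which fits the time-constrained query setting the abstract describes.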

