Sampling-based estimation of the number of distinct values of an attribute
In Proc. of VLDB, 1995
Random Sampling for Histogram Construction: How much is enough?
1998
Cited by 115 (11 self)
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for prespecified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...
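As a rough illustration of the idea in this abstract, the sketch below builds equi-height bucket boundaries from a random sample and then measures how uneven the resulting buckets are on the full data. The function names and the error measure (largest relative deviation of any bucket from the ideal size) are assumptions made for this example; they are not the paper's algorithm or its exact error metric.

```python
import bisect
import random

def equi_height_boundaries(sample, num_buckets):
    """Pick num_buckets - 1 separators so each bucket holds about the
    same number of sample values (hypothetical helper)."""
    if num_buckets < 1 or not sample:
        raise ValueError("need a non-empty sample and at least one bucket")
    ordered = sorted(sample)
    n = len(ordered)
    return [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]

def bucket_counts(values, boundaries):
    """Count how many values fall into each bucket defined by the separators."""
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        counts[bisect.bisect_right(boundaries, v)] += 1
    return counts

def max_relative_bucket_error(counts):
    """Largest deviation of any bucket from the ideal size, relative to the
    ideal size. An illustrative stand-in for a 'small in all regions' metric."""
    ideal = sum(counts) / len(counts)
    return max(abs(c - ideal) for c in counts) / ideal

if __name__ == "__main__":
    rng = random.Random(0)
    table = [rng.gauss(0, 1) for _ in range(100_000)]
    sample = rng.sample(table, 2_000)
    bounds = equi_height_boundaries(sample, 10)
    print(max_relative_bucket_error(bucket_counts(table, bounds)))
```

Increasing the sample size in the usage section should generally shrink the reported error, which is the trade-off the "how much is enough" question is about.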
Random Sampling from Databases - A Survey
Statistics and Computing, 1994
Cited by 25 (0 self)
This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g., acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+ trees, hash files, spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g., single relational operators such as selection, intersection, union, set difference, projection, and join are then described. We then describe sampling for estimation of aggregates (e.g., the size of query results). Here we discuss both clustered sampling and sequential sampling approaches. Decision-theoretic approaches to sampling for query optimization are reviewed. DRAFT of March 22, 1994.
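Reservoir sampling, one of the basic techniques named in this abstract, can be sketched in a few lines. The version below is the textbook single-pass scheme (often called Algorithm R) for drawing k items uniformly from a stream of unknown length; the function name and the optional `rng` parameter are assumptions for this example, and the survey itself may present other variants.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of
    unknown length, using one pass and O(k) memory."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5, random.Random(42)))
```

If the stream holds fewer than k items, the function simply returns all of them.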
Distinct Values Estimators for Power Law Distributions
2006
Cited by 5 (0 self)
The number of distinct values in a relation is an important statistic for database query optimization. As databases have grown in size, scalability of distinct values estimators has become extremely important, since a naïve linear scan through the data is no longer feasible. An approach that scales very well involves taking a sample of the data, and performing the estimate on the sample. Unfortunately, it has been shown that obtaining estimators with guaranteed small error bounds requires an extremely large sample size in the worst case. On the other hand, it is typically the case that the data is not worst-case, but follows some form of a Power Law or Zipfian distribution. We exploit data distribution assumptions to devise distinct-values estimators with analytic error guarantees for Zipfian distributions. Our estimators are the first to have the required number of samples depend only on the number of distinct values present, D, and not the database size, n. This allows the estimators to scale well with the size of the database, particularly if the growth is due to multiple copies of the data. In addition to theoretical analysis, we also provide experimental evidence of the effectiveness of our estimators by benchmarking their performance against previously best known heuristic and analytic estimators on both synthetic and real-world datasets.
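To make "performing the estimate on the sample" concrete, here is a minimal sketch of one generic sample-based estimator. It is not the Zipfian estimator proposed in this paper; it follows the form commonly attributed to the Guaranteed-Error Estimator (GEE), which scales the number of values seen exactly once in the sample by sqrt(n / r) and counts values seen more than once as-is. The function names are assumptions for this example.

```python
from collections import Counter
import math

def frequency_profile(sample):
    """Map j -> number of distinct values that occur exactly j times in the sample."""
    value_counts = Counter(sample)
    return Counter(value_counts.values())

def gee_estimate(sample, table_size):
    """Estimate the number of distinct values in a table of table_size rows
    from a uniform random sample (GEE-style formula, used here as an assumption)."""
    r = len(sample)
    if r == 0:
        raise ValueError("sample must be non-empty")
    profile = frequency_profile(sample)
    singletons = profile.get(1, 0)
    repeated = sum(count for j, count in profile.items() if j >= 2)
    # Values seen once are scaled up; values seen repeatedly are assumed
    # to be common enough that the sample has already found them.
    return math.sqrt(table_size / r) * singletons + repeated

if __name__ == "__main__":
    print(gee_estimate(["a", "a", "b", "c"], table_size=16))
```

Note that this estimate depends on the table size n through the scaling factor, which is exactly the dependence the abstract says its own estimators avoid.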
On topic identification and dialogue move recognition
Computer Speech and Language, 1997
Cited by 2 (0 self)
Dialogue move recognition is cited as being representative of a class of problem which may be of concern in data-driven natural language processing. The dialogue move recognition problem is formulated as a keyword-based topic identification problem, and is shown to be sensitive to the issue of unknown vocabulary. A model based on the multiple Poisson distribution is shown to alleviate the unknown vocabulary issue, subject to the assumption that the occurrence of keywords represents a small fraction of the data. A keyword selection strategy is derived to ensure this assumption is valid. It is shown that a modified version of Zipf’s law provides a suitable prior probability distribution for keywords, and that its inclusion increases classification performance. © Crown Copyright 1997
On Estimators for Aggregate Relational Algebra Queries
1995
CASEDB is a relational database management system that allows users to specify time constraints in queries. For an aggregate query AGG(E) where AGG is one of COUNT, SUM and AVERAGE, and E is a relational algebra expression, CASEDB uses statistical estimators to approximate the query. This paper extends our earlier work on statistical estimators of CASEDB with the following features: (a) New statistical estimators for COUNT queries with projection. (b) Extending the methodology for SUM and AVERAGE aggregate queries. (c) New sampling plans based on systematic sampling and stratified sampling. We also present performance evaluation experiments of the estimators with the above extensions using correlated and uncorrelated database instances.
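The stratified sampling plan mentioned in item (c) can be illustrated with a small SUM estimator. This is a generic textbook sketch under assumed names and a fixed per-stratum sample size; it is not CASEDB's estimator, and it ignores the time constraints the system works under.

```python
import random

def stratified_sum_estimate(strata, sample_size_per_stratum, rng=None):
    """Estimate the SUM over all rows by sampling each stratum separately.

    strata: dict mapping a stratum label to a list of numeric values.
    Each stratum contributes (stratum size) * (mean of its sample).
    """
    rng = rng or random.Random()
    total = 0.0
    for rows in strata.values():
        if not rows:
            continue
        m = min(sample_size_per_stratum, len(rows))
        sample = rng.sample(rows, m)  # sampling without replacement
        total += len(rows) * (sum(sample) / m)
    return total

if __name__ == "__main__":
    rng = random.Random(7)
    strata = {
        "small_orders": [rng.uniform(1, 10) for _ in range(9_000)],
        "large_orders": [rng.uniform(500, 1_000) for _ in range(1_000)],
    }
    exact = sum(sum(rows) for rows in strata.values())
    print(exact, stratified_sum_estimate(strata, 100, rng))
```

Stratifying helps when values differ a lot between groups but little within them, since each group's mean is then estimated from a more homogeneous sample.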
THE POPULATION FREQUENCIES OF SPECIES AND THE ESTIMATION OF POPULATION PARAMETERS
Distinct Value Estimation on Peer-to-Peer Networks
Peer-to-Peer networks have become very popular on the Internet, with millions of peers all over the world sharing large volumes of data. In the assistive healthcare sector, it is likely that P2P networks will develop that interconnect and allow the controlled sharing of patient databases of various hospitals, clinics, and research laboratories. However, the sheer scale of these networks has made it difficult to gather statistics that could be used for building new features. In this paper, we present a technique to obtain estimations of the number of distinct values matching a query on the network. We evaluate the technique experimentally and provide a set of results that demonstrate its effectiveness, as well as its flexibility in supporting a variety of queries and applications.