Results 1-10 of 82
Sampling-based estimation of the number of distinct values of an attribute
 In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB’95)
, 1995
Random Sampling for Histogram Construction: How much is enough?
, 1998
Abstract

Cited by 108 (11 self)
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...
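As a rough illustration of the idea in this abstract, the sketch below builds equi-height bucket boundaries from a uniform random sample of a column. It is not the paper's algorithm: the function name `equi_height_boundaries`, its parameters, and the simple quantile rule are assumptions made for this example, and it implements neither the paper's error metric nor its adaptive page sampling.

```python
import random


def equi_height_boundaries(values, num_buckets, sample_size, rng=None):
    """Approximate equi-height bucket boundaries from a random sample.

    Hypothetical helper for illustration only: draws a uniform sample
    without replacement, sorts it, and reads off the quantile positions
    that split the sample into num_buckets equally sized parts.
    """
    if not values or num_buckets < 1:
        raise ValueError("need a non-empty column and at least one bucket")
    rng = rng or random.Random()
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    n = len(sample)
    # Boundary i separates bucket i-1 from bucket i.
    return [sample[(i * n) // num_buckets] for i in range(1, num_buckets)]
```

With a larger `sample_size` the boundaries move closer to the true quantiles of the column; how large the sample must be for a given error bound is the question the paper addresses.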
Distinct sampling for highly-accurate answers to distinct values queries and event reports
 In Proceedings of the 27th International Conference on Very Large Data Bases
Abstract

Cited by 97 (5 self)
Estimating the number of distinct values is a well-studied problem, due to its frequent occurrence in queries and its importance in selecting good query plans. Previous work has shown powerful negative results on the quality of distinct-values estimates based on sampling (or other techniques that examine only part of the input data). We present an approach, called distinct sampling, that collects a specially tailored sample over the distinct values in the input, in a single scan of the data. In contrast to the previous negative results, our small Distinct Samples are guaranteed to accurately estimate the number of distinct values. The samples can be incrementally maintained up-to-date in the presence of data insertions and deletions, with minimal time and memory overheads, so that the full scan may be performed only once. Moreover, a stored Distinct Sample can be used to accurately estimate the number of distinct values within any range specified by the query, or within any other subset of the data satisfying a query predicate. We present an extensive experimental study of distinct sampling. Using synthetic and real-world data sets, we show that distinct sampling gives distinct-values estimates to within 0%–10% relative error, whereas previous methods typically incur 50%–250% relative error. Next, we show how distinct sampling can provide fast, highly-accurate approximate answers for “report” queries in high-volume, session-based event recording environments, such as IP networks, customer service call centers, etc. For a commercial call center environment, we show that a 1% Distinct Sample
Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs
 Bioinformatics
, 2004
Abstract

Cited by 71 (0 self)
Biological and engineered networks have recently been shown to display network motifs: a small set of characteristic patterns which occur much more frequently than in randomized networks with the same degree sequence. Network motifs were demonstrated to play key information processing roles in biological regulation networks. Existing algorithms for detecting network motifs act by exhaustively enumerating all subgraphs with a given number of nodes in the network. The runtime of such full enumeration algorithms increases strongly with network size. Here we present a novel algorithm that allows estimation of subgraph concentrations and detection of network motifs at a runtime that is asymptotically independent of the network size. This algorithm is based on random sampling of subgraphs. Network motifs are detected with a surprisingly small number of samples in a wide variety of networks. Our method can be applied to estimate the concentrations of larger subgraphs in larger networks than was previously possible with full enumeration algorithms. We present results for high-order motifs in several biological networks and discuss their possible functions. Availability: A software tool for estimating subgraph concentrations and detecting network motifs (mfinder 2.0) and further information is available at:
Comparing data streams using Hamming norms (how to zero in)
, 2003
Abstract

Cited by 71 (7 self)
Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large-scale diagnostic data streams. Such streams are rarely stored in traditional databases and instead must be processed “on the fly” as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams and, hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalizes ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the “l0 sketch” and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.
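The two definitions in this abstract can be made concrete with a small exact computation. The sketch below is only a reference for what the quantity means: it stores a full count per item, which is what a streaming approximation is meant to avoid, and it is not the paper's sketch-based technique. The function names `hamming_norm` and `hamming_distance` are chosen for this example.

```python
from collections import Counter


def hamming_norm(stream):
    """Exact Hamming norm of one stream: the number of items whose
    count is non-zero, i.e. the number of distinct items."""
    counts = Counter(stream)
    return sum(1 for c in counts.values() if c != 0)


def hamming_distance(stream_a, stream_b):
    """Exact Hamming norm of a pair of streams: the number of items
    whose counts differ between the two streams."""
    a, b = Counter(stream_a), Counter(stream_b)
    # Counter returns 0 for items it has not seen.
    return sum(1 for item in a.keys() | b.keys() if a[item] != b[item])
```

For example, `hamming_norm(["x", "y", "x"])` counts two distinct items, and `hamming_distance(["x", "y", "x"], ["x", "y", "z"])` counts two items (`x` and `z`) whose counts differ.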
Adaptive Parallel Aggregation Algorithms
 In ACM SIGMOD
, 1995
Abstract

Cited by 45 (2 self)
Aggregation and duplicate removal are common in SQL queries. However, in the parallel query processing literature, aggregate processing has received surprisingly little attention; furthermore, for each of the traditional parallel aggregation algorithms, there is a range of grouping selectivities where the algorithm performs poorly. In this work, we propose new algorithms that dynamically adapt, at query evaluation time, in response to observed grouping selectivities. Performance analysis via analytical modeling and an implementation on a workstation cluster shows that the proposed algorithms are able to perform well for all grouping selectivities. Finally, we study the effect of data skew and show that for certain data sets the proposed algorithms can even outperform the best of traditional approaches. 1 Introduction SQL queries are replete with aggregate and duplicate elimination operations. One measure of the perceived importance of aggregation is that in the proposed TPC-D benchmark...
Notes on the occupancy problem with infinitely many boxes: general asymptotics and power laws
, 2008
Random Sampling from Databases - A Survey
 Statistics and Computing
, 1994
Abstract

Cited by 22 (0 self)
This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g., acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+ trees, hash files, spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g., single relational operators such as selection, intersection, union, set difference, projection, and join are then described. We then describe sampling for estimation of aggregates (e.g., the size of query results). Here we discuss both clustered sampling and sequential sampling approaches. Decision theoretic approaches to sampling for query optimization are reviewed. DRAFT of March 22, 1994. 1 Introduction In this paper we sur...
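Reservoir sampling, one of the basic techniques named in this abstract, can be sketched in a few lines. The version below is the classic single-pass scheme (often called Algorithm R) written as a minimal illustration; the function name `reservoir_sample` and its parameters are assumptions for this example and are not taken from the survey.

```python
import random


def reservoir_sample(rows, k, rng=None):
    """Draw a uniform random sample of up to k rows in one pass,
    without knowing the total number of rows in advance."""
    rng = rng or random.Random()
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir
```

Because only the reservoir is held in memory, the same function works on a table scan or any other iterator whose length is unknown up front.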
Estimating animal abundance: review III
 Statistical Science
, 1999
Abstract

Cited by 11 (0 self)
The literature describing methods for estimating animal abundance and related parameters continues to grow. This paper reviews recent developments in the subject over the past seven years and updates two previous reviews.