Results 1–10 of 58
Sampling-based estimation of the number of distinct values of an attribute
 In Proc. 21st International Conf. on Very Large Data Bases
, 1995
Abstract

Cited by 114 (9 self)
We provide several new sampling-based estimators of the number of distinct values of an attribute in a relation. We compare these new estimators to estimators from the database and statistical literature empirically, using a large number of attribute-value distributions drawn from a variety of real-world databases. This appears to be the first extensive comparison of distinct-value estimators in either the database or statistical literature, and is certainly the first to use highly skewed data of the sort frequently encountered in database applications. Our experiments indicate that a new “hybrid” estimator yields the highest precision on average for a given sampling fraction. This estimator explicitly takes into account the degree of skew in the data and combines a new “smoothed jackknife” estimator with an estimator due to Shlosser. We investigate how the hybrid estimator behaves as we scale up the size of the database.
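To make the sampling idea concrete, here is a minimal sketch of one classical estimator of this family, the first-order jackknife, which scales up the observed distinct count using the number of singletons in the sample. This is an illustrative textbook estimator, not the paper's smoothed-jackknife or hybrid method.

```python
from collections import Counter

def jackknife1_distinct(sample):
    """First-order jackknife estimate of the number of distinct values,
    computed from a random sample of rows (illustrative sketch only;
    not the paper's smoothed-jackknife/hybrid estimator)."""
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)                                 # distinct values seen
    f1 = sum(1 for c in counts.values() if c == 1)  # singleton values
    return d + f1 * (n - 1) / n

# Example: 5 distinct values observed, 3 of them singletons, n = 8.
est = jackknife1_distinct([1, 1, 1, 2, 3, 4, 4, 5])  # 5 + 3*7/8 = 7.625
```

The singleton count f1 drives the correction: many values seen exactly once suggest many more values were missed entirely.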
Random Sampling for Histogram Construction: How much is enough?
, 1998
Abstract

Cited by 106 (11 self)
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining "How much sampling is enough?" We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose ...
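The basic construction the paper analyzes can be sketched as follows: an equi-height (equi-depth) histogram places bucket boundaries at the sample's quantiles, so each bucket covers roughly the same number of rows. This is a minimal sketch of the general technique, not Microsoft SQL Server's implementation, and the function name is ours.

```python
def equiheight_buckets(sample, k):
    """Return k-1 bucket boundaries of an equi-height histogram built
    from a random sample. Illustrative sketch: boundary i is the value
    at the i-th k-quantile of the sorted sample."""
    s = sorted(sample)
    n = len(s)
    return [s[min(n - 1, (i * n) // k)] for i in range((1), k)]

# 4 buckets over a uniform sample of 0..99 -> boundaries near the quartiles
bounds = equiheight_buckets(list(range(100)), 4)  # [25, 50, 75]
```

The paper's question is then how large `sample` must be before these sampled boundaries are close to the true ones in every region of the histogram.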
Distinct sampling for highly-accurate answers to distinct values queries and event reports
 In Proceedings of the 27th International Conference on Very Large Data Bases
Abstract

Cited by 94 (5 self)
Estimating the number of distinct values is a well-studied problem, due to its frequent occurrence in queries and its importance in selecting good query plans. Previous work has shown powerful negative results on the quality of distinct-values estimates based on sampling (or other techniques that examine only part of the input data). We present an approach, called distinct sampling, that collects a specially tailored sample over the distinct values in the input, in a single scan of the data. In contrast to the previous negative results, our small Distinct Samples are guaranteed to accurately estimate the number of distinct values. The samples can be incrementally maintained up to date in the presence of data insertions and deletions, with minimal time and memory overheads, so that the full scan may be performed only once. Moreover, a stored Distinct Sample can be used to accurately estimate the number of distinct values within any range specified by the query, or within any other subset of the data satisfying a query predicate. We present an extensive experimental study of distinct sampling. Using synthetic and real-world data sets, we show that distinct sampling gives distinct-values estimates to within 0%–10% relative error, whereas previous methods typically incur 50%–250% relative error. Next, we show how distinct sampling can provide fast, highly-accurate approximate answers for “report” queries in high-volume, session-based event recording environments, such as IP networks, customer service call centers, etc. For a commercial call center environment, we show that a 1% Distinct Sample ...
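The core one-pass idea can be sketched with a hash-level filter: keep only values whose hash has at least `level` trailing zero bits, raise the level when the sample overflows, and scale the sample size back up by 2^level. This is a simplified sketch in the spirit of distinct sampling; the capacity parameter and helper names are ours, not the paper's.

```python
import hashlib

def _level(v):
    """Trailing zero bits of a stable 64-bit hash of v."""
    h = int.from_bytes(hashlib.sha256(str(v).encode()).digest()[:8], "big")
    return (h & -h).bit_length() - 1 if h else 64

def distinct_sample_estimate(stream, capacity=64):
    """One-pass distinct-count sketch: a value enters the sample only if
    its hash level >= the current level; on overflow, raise the level
    and evict. Estimate = |sample| * 2**level. (Illustrative sketch.)"""
    level, sample = 0, set()
    for v in stream:
        if _level(v) >= level:
            sample.add(v)
            while len(sample) > capacity:
                level += 1
                sample = {u for u in sample if _level(u) >= level}
    return len(sample) * (2 ** level)
```

Because membership depends only on the hash, duplicates in the stream never inflate the sample, which is what lets a single scan beat row-level sampling here.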
Comparing data streams using hamming norms (how to zero in)
, 2003
Abstract

Cited by 71 (7 self)
Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large-scale diagnostic data streams. Such streams are rarely stored in traditional databases and instead must be processed “on the fly” as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams and, hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalizes ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the “l0 sketch” and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.
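The two quantities being approximated are easy to state exactly. The sketch below computes them with full counters, which is what the paper's one-pass, small-memory l0 sketch avoids; it is included only to pin down the definitions (function names are ours).

```python
from collections import Counter

def hamming_norm(stream):
    """|x|_H of one stream: the number of items with nonzero count,
    i.e. the number of distinct items present."""
    return sum(1 for c in Counter(stream).values() if c != 0)

def hamming_distance(stream_a, stream_b):
    """|x - y|_H of two streams: the number of items whose counts
    differ. Exact two-pass version, for definition only."""
    ca, cb = Counter(stream_a), Counter(stream_b)
    return sum(1 for k in set(ca) | set(cb) if ca[k] != cb[k])

# Items 1 and 2 have unequal counts in the two streams below.
d = hamming_distance([1, 1, 2], [1, 2, 2])  # 2
```

Note that `hamming_distance` counts *unequal counts*, not unequal totals, so a stream and a permutation of itself are at distance 0.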
Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs
 Bioinformatics
, 2004
Abstract

Cited by 70 (0 self)
Biological and engineered networks have recently been shown to display network motifs: a small set of characteristic patterns which occur much more frequently than in randomized networks with the same degree sequence. Network motifs were demonstrated to play key information-processing roles in biological regulation networks. Existing algorithms for detecting network motifs act by exhaustively enumerating all subgraphs with a given number of nodes in the network. The runtime of such full enumeration algorithms increases strongly with network size. Here we present a novel algorithm that allows estimation of subgraph concentrations and detection of network motifs at a runtime that is asymptotically independent of the network size. This algorithm is based on random sampling of subgraphs. Network motifs are detected with a surprisingly small number of samples in a wide variety of networks. Our method can be applied to estimate the concentrations of larger subgraphs in larger networks than was previously possible with full enumeration algorithms. We present results for high-order motifs in several biological networks and discuss their possible functions. Availability: A software tool for estimating subgraph concentrations and detecting network motifs (mfinder 2.0) and further information is available at:
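A heavily simplified version of edge-based subgraph sampling, for undirected 3-node subgraphs: pick a random edge, extend it with a random neighbour, and classify the resulting triad. The published algorithm additionally corrects for the non-uniform probability with which different subgraphs are reached; this sketch (with names of our choosing) omits that correction.

```python
import random

def sample_triad(adj):
    """Sample one connected 3-node subgraph: random edge (u, v),
    extended by a random neighbour w of u or v. Simplified sketch of
    edge-based subgraph sampling (no probability correction)."""
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    u, v = random.choice(edges)
    w = random.choice([x for x in (adj[u] | adj[v]) if x not in (u, v)])
    return u, v, w

def triangle_fraction(adj, samples=1000):
    """Fraction of sampled connected triads that close into a triangle."""
    hits = 0
    for _ in range(samples):
        u, v, w = sample_triad(adj)
        if w in adj[u] and w in adj[v]:
            hits += 1
    return hits / samples

# adj maps node -> set of neighbours, e.g. the complete graph K4,
# in which every sampled triad is a triangle.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
```

Because each sample touches only a handful of nodes, the cost per sample does not grow with the size of the network, which is the source of the size-independent runtime.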
Adaptive Parallel Aggregation Algorithms
 In ACM SIGMOD
, 1995
Abstract

Cited by 42 (2 self)
Aggregation and duplicate removal are common in SQL queries. However, in the parallel query processing literature, aggregate processing has received surprisingly little attention; furthermore, for each of the traditional parallel aggregation algorithms, there is a range of grouping selectivities where the algorithm performs poorly. In this work, we propose new algorithms that dynamically adapt, at query evaluation time, in response to observed grouping selectivities. Performance analysis via analytical modeling and an implementation on a workstation cluster shows that the proposed algorithms are able to perform well for all grouping selectivities. Finally, we study the effect of data skew and show that for certain data sets the proposed algorithms can even outperform the best of traditional approaches.
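The adaptive idea can be sketched on a single node: pre-aggregate locally while it is collapsing rows, and fall back to simply forwarding rows to the repartitioning phase once the observed grouping selectivity (distinct groups per row seen) shows pre-aggregation is not paying off. The threshold, warm-up count, and switching policy below are illustrative choices of ours, not the paper's exact algorithms.

```python
from collections import defaultdict

def adaptive_local_aggregate(rows, threshold=0.5):
    """Sketch of adaptive local pre-aggregation for SUM. `rows` is an
    iterable of (group_key, value). Returns (local_groups, forwarded):
    rows seen after the switch are forwarded un-aggregated."""
    groups, forwarded, seen = defaultdict(int), [], 0
    preagg = True
    for key, val in rows:
        seen += 1
        if preagg:
            groups[key] += val
            # High distinct-group ratio => local aggregation is not
            # collapsing rows; stop doing it (after a short warm-up).
            if seen >= 10 and len(groups) / seen > threshold:
                preagg = False
        else:
            forwarded.append((key, val))
    return dict(groups), forwarded

# Few groups: everything is collapsed locally, nothing is forwarded.
g, f = adaptive_local_aggregate([("a", 1)] * 20)  # ({"a": 20}, [])
```

With one row per group the same code switches modes after the warm-up and forwards the remaining rows, which is exactly the regime where local pre-aggregation wastes memory and CPU.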
Notes on the occupancy problem with infinitely many boxes: general asymptotics and power laws
, 2008
Random Sampling from Databases - A Survey
 Statistics and Computing
, 1994
Abstract

Cited by 22 (0 self)
This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g., acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+-trees, hash files, and spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g., single relational operators such as selection, intersection, union, set difference, projection, and join, are then described. We then describe sampling for estimation of aggregates (e.g., the size of query results). Here we discuss both clustered sampling and sequential sampling approaches. Decision-theoretic approaches to sampling for query optimization are reviewed. DRAFT of March 22, 1994.
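Of the basic techniques the survey reviews, reservoir sampling is the most self-contained to illustrate: one pass, O(k) memory, and a uniform sample of k items from a stream of unknown length. This is the classical Algorithm R.

```python
import random

def reservoir_sample(iterable, k, rng=random):
    """Algorithm R: uniform random sample of k items from a stream of
    unknown length, in one pass with O(k) memory."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)        # item i+1 kept with prob k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 10)
```

If the stream is shorter than k, the reservoir simply contains the whole stream, which is the correct uniform sample in that case too.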
Correction of sequence-based artifacts in serial analysis of gene expression
 Bioinformatics
, 2004
Abstract

Cited by 11 (0 self)
Motivation: Serial Analysis of Gene Expression (SAGE) is a powerful technology for measuring global gene expression, through rapid generation of large numbers of transcript tags. Beyond their intrinsic value in differential gene expression analysis, SAGE tag collections afford abundant information on the size and shape of the sample transcriptome and can accelerate novel gene discovery. These latter SAGE applications are facilitated by the enhanced method of Long SAGE. A characteristic of sequencing-based methods, such as SAGE and Long SAGE, is the unavoidable occurrence of artifact sequences resulting from sequencing errors. By virtue of their low random incidence, such tag errors have minimal impact on differential expression analysis. However, to fully exploit the value of large SAGE tag datasets, it is desirable to account for and correct tag artifacts. Results: We present estimates for occurrences of tag errors, and an efficient error correction algorithm. Error rate estimates are based on a stochastic model that includes the polymerase chain reaction and sequencing error contributions. The correction algorithm, SAGEScreen, is a multistep procedure that addresses ditag processing, estimation of empirical error rates from highly abundant tags, grouping of similar-sequence tags and statistical testing of observed counts. We apply SAGEScreen to Long SAGE libraries and compare error rates for several processing scenarios. Results with simulated tag collections indicate that SAGEScreen corrects 78% of recoverable tag errors and reduces the occurrences of singleton tags.
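The grouping step can be sketched as follows: a singleton tag within Hamming distance 1 of a much more abundant tag is treated as a likely sequencing error and its count folded into that neighbour. This toy sketch shows only that one step; SAGEScreen's actual procedure also models PCR errors and applies statistical tests, and the abundance cutoff here is a hypothetical choice of ours.

```python
from collections import Counter

def correct_singletons(tag_counts):
    """Fold each singleton tag into an abundant tag (count >= 10,
    hypothetical cutoff) that differs by exactly one base. Toy sketch
    of the similar-sequence grouping step only."""
    corrected = Counter(tag_counts)
    for tag, c in list(tag_counts.items()):
        if c != 1:
            continue
        for i in range(len(tag)):
            for base in "ACGT":
                nb = tag[:i] + base + tag[i + 1:]
                if nb != tag and tag_counts.get(nb, 0) >= 10:
                    corrected[nb] += corrected.pop(tag)
                    break
            else:
                continue            # no abundant neighbour at position i
            break                   # merged: stop scanning positions
    return dict(corrected)

# The singleton ACGA is one substitution away from the abundant ACGT.
out = correct_singletons({"ACGT": 20, "ACGA": 1})  # {"ACGT": 21}
```

Singletons with no abundant 1-mismatch neighbour are left untouched, since they may be genuine rare transcripts rather than errors.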