A linear-time probabilistic counting algorithm for database applications
- ACM Transactions on Database Systems, 1990
Cited by 74 (5 self)
We present a probabilistic algorithm for counting the number of unique values in the presence of duplicates. This algorithm has O(q) time complexity, where q is the number of values including duplicates, and produces an estimation with an arbitrary accuracy prespecified by the user using only a small amount of space. Traditionally, accurate counts of unique values were obtained by sorting, which has O(q log q) time complexity. Our technique, called linear counting, is based on hashing. We present a comprehensive theoretical and experimental analysis of linear counting. The analysis reveals an interesting result: A load factor (number of unique values/hash table size) much larger than 1.0 (e.g., 12) can be used for accurate estimation (e.g., 1% error). We present this technique with two important applications to database problems: namely, (1) obtaining the column cardinality (the number of unique values in a column of a relation) and (2) obtaining the join selectivity (the number of unique values in the join column resulting from an unconditional join divided by the number of unique join column values in the relation to be joined). These two parameters are important statistics that are used in relational query optimization and physical database design.
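The abstract states only that linear counting is based on hashing. The sketch below follows the commonly described form of the technique: hash every value into a bitmap of m bits in a single pass, then estimate the number of unique values from the fraction V of bits still unset, as n ≈ -m ln(V). The function name `linear_count`, the use of SHA-256 as the hash, and the default bitmap size are illustrative assumptions, not details taken from the paper.

```python
import hashlib
import math


def linear_count(values, m=4096):
    """Estimate the number of distinct items in `values` with an m-bit bitmap.

    Illustrative sketch only: the hash function and default m are assumptions.
    """
    bitmap = bytearray(m)  # one byte per bit position, for simplicity
    for v in values:
        # Hash the value to a position in [0, m); duplicates hit the same slot.
        digest = hashlib.sha256(repr(v).encode("utf-8")).digest()
        bitmap[int.from_bytes(digest[:8], "big") % m] = 1
    empty = m - sum(bitmap)
    if empty == 0:
        # The estimator is undefined when no slot is left unset.
        raise ValueError("bitmap is full; use a larger m")
    # Estimate from the fraction of unset slots: n ~ -m * ln(empty / m).
    return -m * math.log(empty / m)
```

The loop touches each input value once, which matches the O(q) bound quoted above; duplicates cost time but never change the bitmap, so they do not affect the estimate.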
Random Sampling from Databases - A Survey
- Statistics and Computing, 1994
Cited by 20 (0 self)
This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g., acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+ trees, hash files, spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g., single relational operators such as selection, intersection, union, set difference, projection, and join, are then described. We then describe sampling for estimation of aggregates (e.g., the size of query results). Here we discuss both clustered sampling and sequential sampling approaches. Decision theoretic approaches to sampling for query optimization are reviewed. DRAFT of March 22, 1994. 1 Introduction In this paper we sur...
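As an illustration of one of the basic techniques the abstract names, here is a minimal sketch of reservoir sampling in its textbook single-pass form (often called Algorithm R). It draws a uniform sample of k items from a stream whose length is not known in advance. The function name and the optional `rng` parameter are assumptions made for this example and are not drawn from the survey.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Textbook reservoir sampling; names and signature are illustrative.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

If the stream holds fewer than k items, the function simply returns all of them.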
Temporal Relations in Geographic Information Systems: A Workshop at the University of Maine, Orono, October 12-13, 1990
- ACM Sigmod Record, 1991
Cited by 8 (0 self)
A workshop on temporal relations in Geographic Information Systems (GIS) was held on October 12-13, 1990, at the University of Maine. The meeting was sponsored by the National Center for Geographic Information and Analysis (NCGIA). Seventeen specialists gathered from the fields of Geography, GIS, and Computer Science to discuss users' requirements of temporal GIS and to identify the research issues posed by these requirements. We found:
- Two paradigms of time as users understand it: one that sees time as a continuum, in which evolution is gradual and often described by differential equations, and another in which time is a sequence of intervals and changes are caused by events.
- The need for including exploratory mechanisms appropriate for searches in large datasets.
- The convenience of including views that reflect a user's need, e.g., temporal/spatial aggregation, and rules to construct these views and to enforce consistency.
- The need for the manipulation of inexact or approximate ...
Semantic Cardinality Estimation for Queries over Objects
- In Persistent Object Systems: Principles and Practice. The Seventh Int'l Workshop on Persistent Object Systems, Cape May, NJ, 1997
Cited by 1 (1 self)
In this paper, we address the problem of estimating cardinalities of queries over sets of objects. We base our estimates on a knowledge of subset relationships between sets (e.g., Students form a subset of People). Previous work on cardinality estimation has assumed that every subset is a representative sample of the set it is taken from. We maintain (offline) information for the cases where the subset is not a uniform representative, and use this information to improve the accuracy of our cardinality estimates. We present cardinality estimation techniques for image sets and select sets. Image sets are created by applying a function to an existing set. Select sets contain the elements that satisfy some predicate in an existing set. Empirically we show that we can obtain estimates within a factor of four for reasonable image sets on our test database while maintaining offline statistics requiring space linear in the number of named sets in the database and the number of attributes defin...
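To make the baseline assumption concrete, the sketch below estimates the size of a select set over a subset (e.g., Students) by assuming the subset is a representative sample of its superset (e.g., People), so the predicate's selectivity carries over unchanged. An optional stored selectivity overrides that assumption, loosely mirroring the idea of keeping offline information for non-representative subsets. The function, its parameters, and the override mechanism are hypothetical illustrations and are not the paper's actual estimator.

```python
def estimate_select_cardinality(subset_size, superset_size, superset_matches,
                                subset_selectivity=None):
    """Estimate how many members of a subset satisfy a predicate.

    Hypothetical sketch: by default the subset is assumed to be a
    representative sample of the superset, so the superset's selectivity
    is reused. A stored per-subset selectivity, if given, takes precedence.
    """
    if superset_size <= 0:
        raise ValueError("superset_size must be positive")
    if subset_selectivity is None:
        # Uniformity assumption: the same fraction matches in subset and superset.
        subset_selectivity = superset_matches / superset_size
    return subset_size * subset_selectivity
```

For example, with 1000 People of whom 100 match a predicate and 200 Students, the uniformity assumption yields an estimate of about 20 matching Students; supplying a stored selectivity of 0.5 for Students would instead yield 100.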
Semantic Cardinality Estimation for Mapping and Selecting Queries in Complex Object Bases
In this paper, we address the problem of estimating cardinalities of queries over complex objects. We base our estimates on a knowledge of subset relationships between sets (e.g., Students form a subset of People). Previous work on cardinality estimation has assumed that every subset is a representative sample of the set it is taken from. We maintain (offline) information for the cases where the subset is not a uniform representative, and use this information to improve the accuracy of our cardinality estimates.
Statistical Versus Symbolic Parsing
This paper reports on an important direction that we have explored recently: mixing traditional symbolic parsing with probabilistic ranking based on a restricted kind of statistical information.

