Random Sampling from Databases
by Frank Olken
Doctor of Philosophy in Computer Science
University of California at Berkeley
Professor Michael Stonebraker, Chair

In this thesis I describe efficient methods of answering random sampling queries of relational databases, i.e., retrieving random samples of the results of relational queries. I begin with a discussion of the motivation for including sampling operators in the database management system (DBMS). Uses include auditing, estimation (e.g., approximate answers to aggregate queries), and query optimization.

The second chapter contains a review of the basic file sampling methods used in the thesis: acceptance/rejection sampling, reservoir sampling, and partial sum (ranked) tree sampling. I describe their usage for sampling from variably blocked files, and sampling from results as they are generated. Related literature on sampling from databases is reviewed.

In Chapter Three I show how acceptance/rejection sampling of B+ trees can be...
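As a rough illustration of two of the file sampling methods named in the abstract, the following Python sketch shows reservoir sampling over a stream of unknown length and acceptance/rejection sampling from a variably blocked file. It is not taken from the thesis: the function names, the list-of-lists model of a blocked file, and the `max_block_size` parameter are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Draw a simple random sample of up to k items in one pass.

    Classic reservoir scheme: keep the first k items, then let the item
    at position i (0-based) replace a random slot with probability
    k / (i + 1).
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # uniform over 0..i, inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


def accept_reject_record(blocks: Sequence[Sequence[T]], max_block_size: int,
                         rng: Optional[random.Random] = None) -> T:
    """Pick one record uniformly from blocks of varying occupancy.

    Hypothetical model: `blocks` is a list of record lists, none longer
    than `max_block_size`. A block and a slot are chosen uniformly; the
    draw is accepted only if the slot is occupied, so every record has
    the same chance regardless of how full its block is.
    """
    rng = rng or random.Random()
    if not any(blocks):
        raise ValueError("file contains no records")
    while True:
        block = blocks[rng.randrange(len(blocks))]
        slot = rng.randrange(max_block_size)
        if slot < len(block):  # accept; otherwise reject and retry
            return block[slot]
```

The reservoir routine suits results that are sampled as they are generated, since it never needs the total count in advance. The acceptance/rejection routine trades some wasted draws (more when blocks are sparsely filled) for not having to know the record count of each block ahead of time.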
|
2573
|
Classification and Regression Trees
– Breiman, Friedman, et al.
- 1984
|
|
1014
|
The Design and Analysis of Spatial Data Structures
– Samet
- 1989
|
|
778
|
Image Analysis and Mathematical Morphology
– Serra
- 1988
|
|
413
|
An Introduction to Database Systems
– Date
- 2000
|
|
324
|
The quadtree and related hierarchical data structures
– Samet
- 1984
|
|
302
|
The Jackknife, the Bootstrap and Other Resampling Plans (Philadelphia: Society for Industrial and Applied Mathematics)
– Efron
- 1982
|
|
290
|
The Art of Computer Programming, Vol. 3: Sorting and Searching
– Knuth
- 1973
|
|
257
|
Simulation and the Monte Carlo method
– Rubinstein
- 1981
|
|
256
|
Applications of Spatial Data Structures
– Samet
- 1989
|
|
231
|
Sampling Techniques
– Cochran
- 1977
|
|
218
|
The Art of Computer Programming, Vol. 2: Seminumerical Algorithms
– Knuth
- 1969
|
|
213
|
Probabilistic counting algorithms for data base applications
– Flajolet, Martin
- 1985
|
|
211
|
Sequential Analysis
– Wald
- 1947
|
|
187
|
Deriving production rules for incremental view maintenance
– Ceri, Widom
- 1991
|
|
166
|
Efficiently Updating Materialized Views
– Blakeley, Larson, et al.
- 1986
|
|
162
|
Equi-depth histograms for estimating selectivity factors for multi-dimensional queries
– Muralikrishna, DeWitt
- 1988
|
|
157
|
Random sampling with a reservoir
– Vitter
- 1985
|
|
148
|
Introduction to Statistical Quality Control
– Montgomery
- 1991
|
|
143
|
Practical selectivity estimation through adaptive sampling
– Lipton, Naughton
- 1990
|
|
143
|
Classification and Regression Trees
– Breiman, Friedman, et al.
- 1984
|
|
139
|
Updating derived relations: Detecting irrelevant and autonomously computable updates
– Blakeley, Coburn, et al.
- 1989
|
|
138
|
Accurate Estimation of the Number of Tuples Satisfying a Condition
– Piatetsky-Shapiro, Connell
- 1984
|
|
126
|
Implementation of integrity constraints and views by query modification
– Stonebraker
- 1975
|
|
116
|
Introduction to the Theory of Coverage Processes
– Hall
- 1988
|
|
115
|
Object and File Management in the EXODUS Extensible Database System
– Carey, DeWitt, et al.
- 1986
|
|
110
|
Probability Approximations via the Poisson Clumping Heuristic
– Aldous
- 1989
|
|
97
|
Extendible hashing - a fast access method for dynamic files
– Fagin
- 1979
|
|
80
|
A Performance Analysis of View Materialization Strategies
– Hanson
- 1987
|
|
79
|
Sequential sampling procedures for query size estimation
– Haas, Swami
- 1992
|
|
79
|
Urn Models and Their Application
– Johnson, Kotz
- 1977
|
|
70
|
Parallel Sorting on a Shared-Nothing Architecture Using Probabilistic Splitting
– DeWitt, Naughton, et al.
- 1991
|
|
66
|
Practical Skew Handling in Parallel Joins
– DeWitt, Naughton, et al.
- 1992
|
|
59
|
A linear-time probabilistic counting algorithm for database applications
– Whang, Vander-Zanden, et al.
- 1990
|
|
58
|
Differential files: Their application to the maintenance of large databases
– Severance, Lehman
- 1976
|
|
56
|
Probabilistic counting
– Flajolet, Martin
- 1983
|
|
54
|
Implications of certain assumptions in database performance evaluation
– Christodoulakis
- 1984
|
|
54
|
Secure statistical databases with random sample queries
– Denning
- 1980
|
|
52
|
Statistical estimators for relational algebra expressions
– Hou, Ozsoyoglu, et al.
- 1988
|
|
52
|
Processing aggregate relational queries with hard time constraints
– Hou, Ozsoyoglu, et al.
- 1989
|
|
52
|
Updating distributed materialized views
– Segev, Park
- 1989
|
|
52
|
Efficiently Monitoring Relational Databases
– Buneman, Clemons
- 1979
|
|
48
|
Database Snapshots
– Adiba, Lindsay
- 1980
|
|
42
|
A Snapshot Differential Refresh Algorithm
– Lindsay, Haas, et al.
- 1986
|
|
39
|
Linear hashing: a New Tool for File and Table Addressing
– Litwin
- 1980
|
|
39
|
Simple random sampling from relational databases
– Olken, Rotem
- 1986
|
|
38
|
The tracker: A threat to statistical database security
– Denning, Denning, et al.
- 1979
|
|
36
|
Efficiently Updating Materialized Views
– Blakeley, Larson, et al.
- 1986
|
|
33
|
Dynamic query optimization in Rdb/VMS
– Antoshenkov
- 1993
|
|
32
|
Estimating the number of species: A review
– Bunge, Fitzpatrick
- 1993
|
|
31
|
Approximate counting: a detailed analysis
– Flajolet
- 1985
|