A linear-time probabilistic counting algorithm for database applications
 ACM Transactions on Database Systems
, 1990
Abstract

Cited by 95 (5 self)
We present a probabilistic algorithm for counting the number of unique values in the presence of duplicates. This algorithm has O(q) time complexity, where q is the number of values including duplicates, and produces an estimation with an arbitrary accuracy prespecified by the user using only a small amount of space. Traditionally, accurate counts of unique values were obtained by sorting, which has O(q log q) time complexity. Our technique, called linear counting, is based on hashing. We present a comprehensive theoretical and experimental analysis of linear counting. The analysis reveals an interesting result: A load factor (number of unique values/hash table size) much larger than 1.0 (e.g., 12) can be used for accurate estimation (e.g., 1% of error). We present this technique with two important applications to database problems: namely, (1) obtaining the column cardinality (the number of unique values in a column of a relation) and (2) obtaining the join selectivity (the number of unique values in the join column resulting from an unconditional join divided by the number of unique join column values in the relation to be joined). These two parameters are important statistics that are used in relational query optimization and physical database design.
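As a rough illustration of the hashing idea in this abstract, the sketch below hashes each value into a bitmap of m slots in one pass and estimates the number of unique values as -m ln(V), where V is the fraction of slots left empty. The function name `linear_count`, the default bitmap size, and the choice of BLAKE2b as the hash are assumptions made for this example, not details taken from the paper.

```python
import hashlib
import math


def linear_count(values, m=1024):
    """Estimate the number of distinct values with an m-slot bitmap.

    Hypothetical helper: the name, the default m, and the hash function
    are illustrative choices, not the paper's.
    """
    bitmap = bytearray(m)  # one byte per slot, 0 = empty, 1 = hit
    for v in values:
        # A deterministic 64-bit hash of the value's repr().
        digest = hashlib.blake2b(repr(v).encode("utf-8"), digest_size=8).digest()
        bitmap[int.from_bytes(digest, "big") % m] = 1
    empty = m - sum(bitmap)
    if empty == 0:
        # The estimator is undefined when no slot is empty.
        raise ValueError("bitmap is full; increase m")
    # Estimate: -m * ln(fraction of empty slots).
    return -m * math.log(empty / m)
```

Each input value is touched once, so the pass is linear in the number of values including duplicates, and duplicates leave the bitmap unchanged. For example, `linear_count([i % 500 for i in range(5000)], m=2048)` should come out close to 500.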
Random Sampling from Databases - A Survey
 Statistics and Computing
, 1994
Abstract

Cited by 25 (0 self)
This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g., acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+ trees, hash files, spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g., single relational operators such as selection, intersection, union, set difference, projection, and join are then described. We then describe sampling for estimation of aggregates (e.g., the size of query results). Here we discuss both clustered sampling and sequential sampling approaches. Decision theoretic approaches to sampling for query optimization are reviewed. DRAFT of March 22, 1994. 1 Introduction In this paper we sur...
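Reservoir sampling, one of the basic techniques named in this abstract, can be sketched as follows. This is the textbook single-pass variant (often called Algorithm R) and is not necessarily the formulation the survey presents. The function name `reservoir_sample` and its parameters are assumptions for this example.

```python
import random


def reservoir_sample(records, k, rng=None):
    """Draw a uniform random sample of up to k records in one pass.

    Hypothetical helper illustrating textbook reservoir sampling; the
    total number of records does not need to be known in advance.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(rec)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec
    return reservoir
```

A call such as `reservoir_sample(range(1000), 10, random.Random(0))` returns ten distinct records. If the input holds fewer than k records, all of them are returned.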
Temporal Relations in Geographic Information Systems: A Workshop at the University of Maine, Orono, October 12-13, 1990
 ACM Sigmod Record
, 1991
Abstract

Cited by 10 (0 self)
A workshop on temporal relations in Geographic Information Systems (GIS) was held on October 12-13, 1990, at the University of Maine. The meeting was sponsored by the National Center for Geographic Information and Analysis (NCGIA). Seventeen specialists gathered from the fields of Geography, GIS, and Computer Science to discuss users' requirements of temporal GIS and to identify the research issues posed by these requirements. We found:
• Two paradigms of time as users understand it: one that sees time as a continuum, in which evolution is gradual and often described by differential equations, and another in which time is a sequence of intervals and changes are caused by events.
• The need for including exploratory mechanisms appropriate for searches in large datasets.
• The convenience of including views that reflect a user's needs, e.g., temporal/spatial aggregation, and rules to construct these views and to enforce consistency.
• The need for the manipulation of inexact or approximate ...
Semantic Cardinality Estimation for Queries over Objects
 In Persistent Object Systems: Principles and Practice. The Seventh Int'l Workshop on Persistent Object Systems, Cape May, NJ
, 1996
Abstract

Cited by 1 (1 self)
In this paper, we address the problem of estimating cardinalities of queries over sets of objects. We base our estimates on a knowledge of subset relationships between sets (e.g., Students form a subset of People). Previous work on cardinality estimation has assumed that every subset is a representative sample of the set it is taken from. We maintain (offline) information for the cases where the subset is not a uniform representative, and use this information to improve the accuracy of our cardinality estimates. We present cardinality estimation techniques for image sets and select sets. Image sets are created by applying a function to an existing set. Select sets contain the elements that satisfy some predicate in an existing set. Empirically we show that we can obtain estimates within a factor of four for reasonable image sets on our test database while maintaining offline statistics requiring space linear in the number of named sets in the database and the number of attributes defin...
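To make the contrast in this abstract concrete, the sketch below estimates the size of a select set taken from a subset. By default it assumes the subset is a representative sample, so the predicate's selectivity on the superset carries over. When a separately stored selectivity is supplied, standing in for the offline information the abstract mentions, that value is used instead. The function and parameter names are hypothetical, and this is not the authors' estimation procedure.

```python
def estimate_select_cardinality(subset_size, superset_selectivity,
                                stored_selectivity=None):
    """Estimate how many elements of a subset satisfy a predicate.

    Hypothetical illustration only. `superset_selectivity` is the fraction
    of the superset satisfying the predicate; `stored_selectivity` is an
    assumed offline statistic for subsets known not to be representative.
    """
    if stored_selectivity is None:
        # Representative-sample assumption: reuse the superset's selectivity.
        selectivity = superset_selectivity
    else:
        # Offline statistic overrides the uniformity assumption.
        selectivity = stored_selectivity
    return subset_size * selectivity
```

For instance, with 200 Students and a predicate that holds for a quarter of People, the default estimate is 50. A stored selectivity of 0.9 for Students would raise it to about 180.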
Understanding of Navy Technical Language via Statistical Parsing
Abstract

Cited by 1 (1 self)
A key problem in indexing technical information is the interpretation of technical words and word senses, expressions not used in everyday language. This is important for captions on technical images, whose often pithy descriptions can be valuable to decipher. We describe the natural-language processing for MARIE-2, a natural-language information retrieval system for multimedia captions. Our approach is to provide general tools for lexicon enhancement with the specialized words and word senses, and to learn word usage information (both on word senses and word-sense pairs) from a training corpus with a statistical parser. Innovations of our approach are in statistical inheritance of binary co-occurrence probabilities and in weighting of sentence subsequences. MARIE-2 was trained and tested on 616 captions (with 1009 distinct sentences) from the photograph library of a Navy laboratory. The captions had extensive nominal compounds, code phrases, abbreviations, and acronyms, but few verbs, abstract nouns, conjunctions, and pronouns. Experimental results fit a processing time in seconds of
Statistical Versus Symbolic Parsing
Abstract
This paper reports on an important direction that we have explored recently: mixing traditional symbolic parsing with probabilistic ranking based on a restricted kind of statistical information.
Semantic Cardinality Estimation for Mapping and Selecting Queries in Complex Object Bases
Abstract
In this paper, we address the problem of estimating cardinalities of queries over complex objects. We base our estimates on a knowledge of subset relationships between sets (e.g., Students form a subset of People). Previous work on cardinality estimation has assumed that every subset is a representative sample of the set it is taken from. We maintain (offline) information for the cases where the subset is not a uniform representative, and use this information to improve the accuracy of our cardinality estimates.