The space complexity of approximating the frequency moments
Journal of Computer and System Sciences, 1996
Cited by 570 (13 self)
The frequency moments of a sequence containing $m_i$ elements of type $i$, for $1 \le i \le n$, are the numbers $F_k = \sum_{i=1}^{n} m_i^k$. We consider the space complexity of randomized algorithms that approximate the numbers $F_k$, when the elements of the sequence are given one by one and cannot be stored. Surprisingly, it turns out that the numbers $F_0$, $F_1$ and $F_2$ can be approximated in logarithmic space, whereas the approximation of $F_k$ for $k \ge 6$ requires $n^{\Omega(1)}$ space. Applications to data bases are mentioned as well.
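As a rough illustration of the definition above, the following Python sketch computes $F_k$ exactly by storing every count, and then estimates $F_2$ from a handful of signed counters. The estimator is a simplified sign-based sketch, not the paper's construction: the hash-derived signs, the function names, and the plain averaging over `copies` counters are all assumptions made for this example.

```python
import hashlib
from collections import Counter


def frequency_moment(stream, k):
    """Exact F_k = sum_i m_i^k. Stores every count, so it is not small-space."""
    counts = Counter(stream)
    if k == 0:
        return len(counts)  # F_0 is the number of distinct elements
    return sum(m ** k for m in counts.values())


def _sign(copy, item):
    """Pseudo-random +1/-1 for a (copy, item) pair, derived from a hash.

    Using a hash as the source of signs is an assumption of this sketch.
    """
    digest = hashlib.blake2b(f"{copy}:{item!r}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1


def estimate_f2(stream, copies=400):
    """Average Z^2 over `copies` counters, where Z = sum_i sign(i) * m_i."""
    z = [0] * copies
    for item in stream:  # single pass; only the counters are kept
        for j in range(copies):
            z[j] += _sign(j, item)
    return sum(v * v for v in z) / copies


stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
print("exact F0, F1, F2:", [frequency_moment(stream, k) for k in (0, 1, 2)])
print("estimated F2:", estimate_f2(stream))
```

Each counter has expectation $F_2$ for its squared value, so averaging more counters tightens the estimate at the cost of more space.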
Probabilistic Counting Algorithms for Data Base Applications
1985
Cited by 284 (6 self)
This paper introduces a class of probabilistic counting algorithms with which one can estimate the number of distinct elements in a large collection of data (typically a large file stored on disk) in a single pass using only a small additional storage (typically less than a hundred binary words) and only a few operations per element scanned. The algorithms are based on statistical observations made on bits of hashed values of records. They are by construction totally insensitive to the replicative structure of elements in the file; they can be used in the context of distributed systems without any degradation of performances and prove especially useful in the context of data bases query optimisation. © 1985 Academic Press, Inc.
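A minimal Python sketch of this style of counting is given below, assuming a 64-bit hash taken from `hashlib.blake2b` and a default of 64 bitmaps (both choices are assumptions of the example, not details from the abstract). Each bitmap records which low-bit positions have been seen among hashed values, and the estimate is read off the average position of the first unset bit, scaled by the correction constant 0.77351 from the published analysis.

```python
import hashlib

PHI = 0.77351  # correction constant from the probabilistic counting analysis


def _hash64(item):
    """64-bit hash of an item (the choice of blake2b is an assumption)."""
    raw = hashlib.blake2b(repr(item).encode(), digest_size=8).digest()
    return int.from_bytes(raw, "big")


def _rho(x):
    """0-based index of the lowest set bit of x; capped at 32 when x == 0."""
    return (x & -x).bit_length() - 1 if x else 32


def pcsa_estimate(stream, m=64):
    """Estimate the number of distinct items using m small bitmaps."""
    bitmaps = [0] * m
    for item in stream:
        h = _hash64(item)
        bucket, rest = h % m, h // m      # bucket picks a bitmap
        bitmaps[bucket] |= 1 << _rho(rest)
    total = 0
    for bitmap in bitmaps:
        r = 0
        while bitmap & (1 << r):          # index of the lowest zero bit
            r += 1
        total += r
    return m / PHI * 2 ** (total / m)


print(pcsa_estimate(range(10000)))
```

Because an item always hashes to the same bit, repeating items leaves the bitmaps unchanged, which is the insensitivity to replication described in the abstract.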
Frequency estimation of internet packet streams with limited space
In Proceedings of the 10th Annual European Symposium on Algorithms, 2002
Cited by 117 (1 self)
We consider a router on the Internet analyzing the statistical properties of a TCP/IP packet stream. A fundamental difficulty with measuring traffic behavior on the Internet is that there is simply too much data to be recorded for later analysis, on the order of gigabytes a second. As a result, network routers can collect only relatively few statistics about the data. The central problem addressed here is to use the limited memory of routers to determine essential features of the network traffic stream. A particularly difficult and representative subproblem is to determine the top k categories to which the most packets belong, for a desired value of k and for a given notion of categorization such as the destination IP address. We present an algorithm that deterministically finds (in particular) all categories having a frequency above $1/(m+1)$ using m counters, which we prove is best possible in the worst case. We also present a sampling-based algorithm for the case that packet categories follow an arbitrary distribution, but their order over time is permuted uniformly at random. Under this model, our algorithm identifies flows above a frequency threshold of roughly $1/\sqrt{nm}$ with high probability, where m is the number of counters and n is the number of packets observed. This guarantee is not far off from the ideal of identifying all flows (probability $1/n$), and we prove that it is best possible up to a logarithmic factor. We show that the algorithm ranks the identified flows according to frequency within any desired constant factor of accuracy.
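The deterministic guarantee with m counters can be illustrated with the well-known counter-based scheme sketched below in Python. This is a sketch under assumptions: the function name `frequent` and the dictionary representation are chosen for the example, and the abstract does not spell out the algorithm's steps. The returned set contains every category whose frequency exceeds $1/(m+1)$, but it may also contain less frequent ones, so a second pass would be needed to confirm counts.

```python
def frequent(stream, m):
    """Return candidate heavy categories using at most m counters.

    Every item occurring in more than a 1/(m+1) fraction of the stream
    is guaranteed to be in the result; other items may appear as well.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1            # already monitored
        elif len(counters) < m:
            counters[item] = 1             # claim a free counter
        else:
            # No free counter: decrement all, releasing those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)


packets = ["x", "a", "x", "b", "x", "c", "x", "d", "y", "x"]
print(frequent(packets, 2))
```

In the usage line, "x" makes up half of the packets, which is above the 1/3 threshold for two counters, so it must survive.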
Loglog Counting of Large Cardinalities
In ESA, 2003
Cited by 57 (2 self)
Using an auxiliary memory smaller than the size of this abstract, the LogLog algorithm makes it possible to estimate in a single pass and within a few percent the number of different words in the whole of Shakespeare's works. In general the LogLog algorithm makes use of m "small bytes" of auxiliary memory in order to estimate in a single pass the number of distinct elements (the "cardinality") in a file, and it does so with an accuracy that is of the order of $1/\sqrt{m}$. The "small bytes" to be used in order to count cardinalities up to $N_{\max}$ comprise about $\log \log N_{\max}$ bits, so that cardinalities well in the range of billions can be determined using one or two kilobytes of memory only. The basic version of the LogLog algorithm is validated by a complete analysis. An optimized version, super-LogLog, is also engineered and tested on real-life data. The algorithm parallelizes optimally.
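A minimal Python sketch of the basic LogLog idea follows, assuming a 64-bit hash from `hashlib.blake2b` and $m = 2^k$ registers with k = 8 (assumptions of the example). Each register keeps only the largest rank of the first 1-bit seen in its bucket, and the estimate uses the asymptotic constant 0.39701 rather than the exact value for a given m. The optimized super-LogLog variant is not reproduced here.

```python
import hashlib

ALPHA_INF = 0.39701  # asymptotic LogLog constant (exact value depends on m)


def loglog_estimate(stream, k=8):
    """Estimate the number of distinct items with m = 2**k small registers."""
    m = 1 << k
    tail_bits = 64 - k
    registers = [0] * m
    for item in stream:
        raw = hashlib.blake2b(repr(item).encode(), digest_size=8).digest()
        h = int.from_bytes(raw, "big")
        bucket = h >> tail_bits                  # first k bits pick a register
        rest = h & ((1 << tail_bits) - 1)        # remaining bits
        # 1-based position of the leftmost 1-bit among the remaining bits
        rank = tail_bits - rest.bit_length() + 1
        registers[bucket] = max(registers[bucket], rank)
    return ALPHA_INF * m * 2 ** (sum(registers) / m)


print(loglog_estimate(range(50000)))
```

Registers from separate passes over different files can be merged by taking the element-wise maximum, which is one way to read the abstract's remark that the algorithm parallelizes.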
Stability Of Binary Exponential Backoff
1988
Cited by 39 (0 self)
Binary exponential backoff is a randomized protocol for regulating transmissions on a multiple access broadcast channel. Ethernet, a local area network, is built upon this protocol. The fundamental theoretical issue is stability: does the backlog of packets awaiting transmission remain bounded in time, provided the rates of new packet arrivals are small enough? It is assumed $n \ge 2$ stations share the channel, each having an infinite buffer where packets accumulate while the station attempts to transmit the first from the buffer. Here, it is established that binary exponential backoff is stable if the sum of the arrival rates is sufficiently small. Detailed results are obtained on which rates lead to stability when n = 2 stations share the channel. In passing, several other results are derived bearing on the efficiency of the conflict resolution process. Simulation results are reported that, in particular, indicate alternative retransmission protocols can significantly improve performanc...
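The stability question can be explored with a small slotted simulation such as the Python sketch below. The model is an assumption made for illustration and may differ from the paper's: time is slotted, each station receives a new packet per slot with probability `arrival_rate`, and after its i-th collision a station waits a uniformly random number of slots in the range 0 to 2^i - 1 before retrying. Real Ethernet also caps the exponent, which this sketch does not.

```python
import random


def simulate_backoff(n_stations=2, arrival_rate=0.05, slots=20000, seed=0):
    """Simulate slotted binary exponential backoff.

    Returns (arrived, delivered, backlog) after `slots` time slots.
    """
    rng = random.Random(seed)
    queue = [0] * n_stations        # packets buffered at each station
    collisions = [0] * n_stations   # collisions suffered by the head packet
    wait = [0] * n_stations         # slots left before the next attempt
    arrived = delivered = 0
    for _ in range(slots):
        for s in range(n_stations):
            if rng.random() < arrival_rate:
                queue[s] += 1
                arrived += 1
        senders = [s for s in range(n_stations) if queue[s] > 0 and wait[s] == 0]
        for s in range(n_stations):
            if queue[s] > 0 and wait[s] > 0:
                wait[s] -= 1        # backing off this slot
        if len(senders) == 1:       # exactly one transmitter: success
            s = senders[0]
            queue[s] -= 1
            collisions[s] = 0
            delivered += 1
        elif len(senders) > 1:      # collision: everyone involved backs off
            for s in senders:
                collisions[s] += 1
                wait[s] = rng.randrange(2 ** collisions[s])
    return arrived, delivered, sum(queue)


print(simulate_backoff())
```

Running it with increasing `arrival_rate` and watching the final backlog gives an informal feel for where the backlog stops staying small.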
Practical Implementations of Arithmetic Coding
In Image and Text, 1992
Cited by 31 (6 self)
We provide a tutorial on arithmetic coding, showing how it provides nearly optimal data compression and how it can be matched with almost any probabilistic model. We indicate the main disadvantage of arithmetic coding, its slowness, and give the basis of a fast, space-efficient, approximate arithmetic coder with only minimal loss of compression efficiency. Our coder is based on the replacement of arithmetic by table lookups coupled with a new deterministic probability estimation scheme.
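For readers unfamiliar with the underlying idea, the Python sketch below shows textbook arithmetic coding as exact interval narrowing with rational numbers. It is not the fast table-lookup coder the abstract describes: the fixed probability model, the function names, and the use of `Fraction` in place of finite-precision arithmetic are all assumptions made to keep the example short and exact.

```python
from fractions import Fraction


def build_intervals(probs):
    """Assign each symbol a subinterval of [0, 1) of width equal to its probability."""
    low = Fraction(0)
    intervals = {}
    for symbol, p in probs.items():
        intervals[symbol] = (low, low + p)
        low += p
    return intervals


def encode(message, probs):
    """Narrow [0, 1) once per symbol and return a number inside the final interval."""
    intervals = build_intervals(probs)
    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        s_low, s_high = intervals[symbol]
        low, width = low + width * s_low, width * (s_high - s_low)
    return low + width / 2


def decode(code, length, probs):
    """Recover `length` symbols by locating `code` and rescaling repeatedly."""
    intervals = build_intervals(probs)
    out = []
    for _ in range(length):
        for symbol, (s_low, s_high) in intervals.items():
            if s_low <= code < s_high:
                out.append(symbol)
                code = (code - s_low) / (s_high - s_low)
                break
    return "".join(out)


model = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
code = encode("abacab", model)
print(code, decode(code, 6, model))
```

The final interval's width equals the product of the symbol probabilities, which is why the number of bits needed to name a point in it tracks the model's information content.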
Distinctness of compositions of an integer: A probabilistic analysis
Random Structures and Algorithms, 2001
Cited by 25 (11 self)
Compositions of integers are used as theoretical models for many applications. The degree of distinctness of a composition is a natural and important parameter. In this paper, we use as measure of distinctness the number of distinct parts (or components). We investigate, from a probabilistic point of view, the first empty part, the maximum part size and the distribution of the number of distinct part sizes. We obtain asymptotically, for the classical composition of an integer, the moments and an expression for a continuous distribution F, the (discrete) distribution of the number of distinct part sizes being computable from F. We next analyze another composition: the Carlitz one, where two successive parts are different. We use tools such as analytical depoissonization, Mellin transforms, Markov chain potential theory, limiting hitting times, singularity analysis and perturbation analysis.
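The parameters studied here are easy to observe empirically. The Python sketch below draws a uniformly random classical composition of n by cutting each of the n - 1 gaps between unit cells with probability 1/2, then reports the number of distinct part sizes, the maximum part, and the smallest part size that does not occur. Reading "first empty part" as that smallest missing size is an assumption of this example, as are the function names.

```python
import random


def random_composition(n, rng):
    """Uniform random composition of n: cut each of the n-1 gaps with prob. 1/2."""
    parts, current = [], 1
    for _ in range(n - 1):
        if rng.random() < 0.5:
            parts.append(current)   # close the current part
            current = 1
        else:
            current += 1            # extend the current part
    parts.append(current)
    return parts


def distinct_part_sizes(parts):
    """Number of different part sizes that occur."""
    return len(set(parts))


def first_missing_part(parts):
    """Smallest positive integer that is not the size of any part."""
    present = set(parts)
    size = 1
    while size in present:
        size += 1
    return size


rng = random.Random(2001)
parts = random_composition(50, rng)
print(parts, distinct_part_sizes(parts), max(parts), first_missing_part(parts))
```

Repeating the draw many times and tabulating the three statistics gives empirical distributions to set against the asymptotic results.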
Random Sampling from Databases - A Survey
Statistics and Computing, 1994
Cited by 20 (0 self)
This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g., acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+ trees, hash files, spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g., single relational operators such as selection, intersection, union, set difference, projection, and join are then described. We then describe sampling for estimation of aggregates (e.g., the size of query results). Here we discuss both clustered sampling and sequential sampling approaches. Decision theoretic approaches to sampling for query optimization are reviewed. DRAFT of March 22, 1994.
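Of the basic techniques named above, reservoir sampling is the simplest to show in code. The Python sketch below is the classic one-pass version that keeps a uniform sample of k records from a stream of unknown length; the function name and the optional `rng` parameter are choices made for this example, and the survey may well present the technique differently.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable in one pass.

    If the stream has fewer than k items, all of them are returned.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                reservoir[j] = item      # replace with probability k / (i + 1)
    return reservoir


print(reservoir_sample(range(1000), 5, random.Random(1994)))
```

The appeal for a DBMS is that the table size need not be known in advance, so the sample can be maintained during a single scan.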
Aqua project white paper
1997
Cited by 16 (10 self)
In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to the base data. This white paper describes the Approximate QUery Answering (AQUA) Project underway in the Information Sciences Research Center at Bell Labs. We present a framework for an approximate query engine that observes new data as it arrives and maintains small synopsis data structures on that data. These data structures are used to provide fast, approximate answers to a broad class of queries. We describe metrics for evaluating approximate query answers. We also present new synopsis data structures, and new techniques for approximate query answers. We report on the goals and status of the Aqua project, and plans for future work.

