Results 1 - 10
of
48
The space complexity of approximating the frequency moments
- JOURNAL OF COMPUTER AND SYSTEM SCIENCES
, 1996
"... The frequency moments of a sequence containing mi elements of type i, for 1 ≤ i ≤ n, are the numbers Fk = �n i=1 mki. We consider the space complexity of randomized algorithms that approximate the numbers Fk, when the elements of the sequence are given one by one and cannot be stored. Surprisingly, ..."
Abstract
-
Cited by 570 (13 self)
- Add to MetaCart
The frequency moments of a sequence containing mi elements of type i, for 1 ≤ i ≤ n, are the numbers Fk = �n i=1 mki. We consider the space complexity of randomized algorithms that approximate the numbers Fk, when the elements of the sequence are given one by one and cannot be stored. Surprisingly, it turns out that the numbers F0, F1 and F2 can be approximated in logarithmic space, whereas the approximation of Fk for k ≥ 6 requires nΩ(1) space. Applications to data bases are mentioned as well.
Automated worm fingerprinting
- In OSDI
, 2004
"... Network worms are a clear and growing threat to the security of today’s Internet-connected hosts and networks. The combination of the Internet’s unrestricted connectivity and widespread software homogeneity allows network pathogens to exploit tremendous parallelism in their propagation. In fact, mod ..."
Abstract
-
Cited by 239 (6 self)
- Add to MetaCart
Network worms are a clear and growing threat to the security of today’s Internet-connected hosts and networks. The combination of the Internet’s unrestricted connectivity and widespread software homogeneity allows network pathogens to exploit tremendous parallelism in their propagation. In fact, modern worms can spread so quickly, and so widely, that no human-mediated reaction can hope to contain an outbreak. In this paper, we propose an automated approach for quickly detecting previously unknown worms and viruses based on two key behavioral characteristics – a common exploit sequence together with a range of unique sources generating infections and destinations being targeted. More importantly, our approach – called “content sifting ” – automatically generates precise signatures that can then be used to filter or moderate the spread of the worm elsewhere in the network. Using a combination of existing and novel algorithms we have developed a scalable content sifting implementation with low memory and CPU requirements. Over months of active use at UCSD, our Earlybird prototype system has automatically detected and generated signatures for all pathogens known to be active on our network as well as for several new worms and viruses which were unknown at the time our system identified them. Our initial experience suggests that, for a wide range of network pathogens, it may be practical to construct fully automated defenses – even against so-called “zero-day” epidemics. 1
Computing iceberg queries efficiently
- In Proc. of the 24th VLDB Conf
, 1998
"... Many applications compute aggregate functions... ..."
Frequency estimation of internet packet streams with limited space
- In Proceedings of the 10th Annual European Symposium on Algorithms
, 2002
"... Abstract. We consider a router on the Internet analyzing the statistical properties of a TCP/IP packet stream. A fundamental difficulty with measuring traffic behavior on the Internet is that there is simply too much data to be recorded for later analysis, on the order of gigabytes a second. As a re ..."
Abstract
-
Cited by 117 (1 self)
- Add to MetaCart
Abstract. We consider a router on the Internet analyzing the statistical properties of a TCP/IP packet stream. A fundamental difficulty with measuring traffic behavior on the Internet is that there is simply too much data to be recorded for later analysis, on the order of gigabytes a second. As a result, network routers can collect only relatively few statistics about the data. The central problem addressed here is to use the limited memory of routers to determine essential features of the network traffic stream. A particularly difficult and representative subproblem is to determine the top k categories to which the most packets belong, for a desired value of k and for a given notion of categorization such as the destination IP address. We present an algorithm that deterministically finds (in particular) all categories having a frequency above 1/(m + 1) using m counters, which we prove is best possible in the worst case. We also present a sampling-based algorithm for the case that packet categories follow an arbitrary distribution, but their order over time is permuted uniformly at random. Under this model, our algorithm identifies flows above a frequency threshold of roughly 1 / √ nm with high probability, where m is the number of counters and n is the number of packets observed. This guarantee is not far off from the ideal of identifying all flows (probability 1/n), and we prove that it is best possible up to a logarithmic factor. We show that the algorithm ranks the identified flows according to frequency within any desired constant factor of accuracy. 1
Counting Distinct Elements in a Data Stream
, 2002
"... We present three algorithms to count the number of distinct elements in a data stream to within a factor of 1 ± epsilon. Our algorithms improve upon known algorithms for this problem, and offer a spectrum of time/space tradeoffs. ..."
Abstract
-
Cited by 111 (4 self)
- Add to MetaCart
We present three algorithms to count the number of distinct elements in a data stream to within a factor of 1 ± epsilon. Our algorithms improve upon known algorithms for this problem, and offer a spectrum of time/space tradeoffs.
Bitmap algorithms for counting active flows on high speed links
- In Internet Measurement Conference
, 2003
"... ..."
Loglog Counting of Large Cardinalities
- In ESA
, 2003
"... Using an auxiliary memory smaller than the size of this abstract, the LogLog algorithm makes it possible to estimate in a single pass and within a few percents the number of different words in the whole of Shakespeare's works. In general the LogLog algorithm makes use of m "small bytes" of auxiliary ..."
Abstract
-
Cited by 57 (2 self)
- Add to MetaCart
Using an auxiliary memory smaller than the size of this abstract, the LogLog algorithm makes it possible to estimate in a single pass and within a few percents the number of different words in the whole of Shakespeare's works. In general the LogLog algorithm makes use of m "small bytes" of auxiliary memory in order to estimate in a single pass the number of distinct elements (the "cardinality") in a file, and it does so with an accuracy that is of the order of 1= m. The "small bytes" to be used in order to count cardinalities till Nmax comprise about log log Nmax bits, so that cardinalities well in the range of billions can be determined using one or two kilobytes of memory only. The basic version of the LogLog algorithm is validated by a complete analysis. An optimized version, super-LogLog, is also engineered and tested on real-life data. The algorithm parallelizes optimally.
Data Streaming Algorithms for Efficient and Accurate Estimation of Flow Size Distribution
, 2004
"... Knowing the distribution of the sizes of traffic flows passing through a network link helps a network operator to characterize network resource usage, infer traffic demands, detect traffic anomalies, and accommodate new traffic demands through better traffic engineering. Previous work on estimating ..."
Abstract
-
Cited by 56 (5 self)
- Add to MetaCart
Knowing the distribution of the sizes of traffic flows passing through a network link helps a network operator to characterize network resource usage, infer traffic demands, detect traffic anomalies, and accommodate new traffic demands through better traffic engineering. Previous work on estimating the flow size distribution has been focused on making inferences from sampled network traffic. Its accuracy is limited by the (typically) low sampling rate required to make the sampling operation affordable. In this paper we present a novel data streaming algorithm to provide much more accurate estimates of flow distribution, using a "lossy data structure" which consists of an array of counters fitted well into SRAM. For each incoming packet, our algorithm only needs to increment one underlying counter, making the algorithm fast enough even for 40 Gbps (OC-768) links. The data structure is lossy in the sense that sizes of multiple flows may collide into the same counter. Our algorithm uses Bayesian statistical methods such as Expectation Maximization to infer the most likely flow size distribution that results in the observed counter values after collision. Evaluations of this algorithm on large Internet traces obtained from several sources (including a tier-1 ISP) demonstrate that it has very high measurement accuracy (within 2%). Our algorithm not only dramatically improves the accuracy of flow distribution measurement, but also contributes to the field of data streaming by formalizing an existing methodology and applying it to the context of estimating the flow-distribution.
Efficient Computation of Frequent and Top-k Elements in Data Streams
- IN ICDT
, 2005
"... We propose an approximate integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and top-k elements with tight guarantees on errors. For ..."
Abstract
-
Cited by 35 (6 self)
- Add to MetaCart
We propose an approximate integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and top-k elements with tight guarantees on errors. For general data distributions, our top-k algorithm returns k elements that have roughly the highest frequencies; and it uses limited space for calculating frequent elements. For realistic Zipfian data, the space requirement of the proposed algorithm for solving the exact frequent elements problem decreases dramatically with the parameter of the distribution; and for top-k queries, the analysis ensures that only the top-k elements, in the correct order, are reported. The experiments, using real and synthetic data sets, show space reductions with no loss in accuracy. Having proved the effectiveness of the proposed approach through both analysis and experiments, we extend it to be able to answer continuous queries about frequent and top-k elements. Although the problems of incremental reporting of frequent and top-k elements are useful in many applications, to the best of our knowledge, no solution has been proposed.

