Results 1 - 10 of 29
Data Streams: Algorithms and Applications
, 2005
"... In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerg ..."
Abstract
-
Cited by 533 (22 self)
- Add to MetaCart
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].
Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation
, 2000
"... In this paper we show several results obtained by combining the use of stable distributions with pseudorandom generators for bounded space. In particular: ffl we show how to maintain (using only O(log n=ffl 2 ) words of storage) a sketch C(p) of a point p 2 l n 1 under dynamic updates of its coo ..."
Abstract
-
Cited by 324 (13 self)
- Add to MetaCart
In this paper we show several results obtained by combining the use of stable distributions with pseudorandom generators for bounded space. In particular: (1) we show how to maintain (using only O(log n/ε²) words of storage) a sketch C(p) of a point p ∈ ℓ_1^n under dynamic updates of its coordinates, such that given sketches C(p) and C(q) one can estimate |p − q|_1 up to a factor of (1 + ε) with large probability; this solves the main open problem of [10]. (2) We obtain another sketch function C′ which maps ℓ_1^n into a normed space ℓ_1^m (as opposed to C), such that m = m(n) is much smaller than n; to our knowledge this is the first dimensionality reduction lemma for the ℓ_1 norm. (3) We give an explicit embedding of ℓ_2^n into ℓ_1^(n^O(log n)) with distortion (1 + 1/n^Θ(1)) and a non-constructive embedding of ℓ_2^n into ℓ_1^(O(n)) with distortion (1 + ε) such that the embedding can be represented using only O(n log² n) bits (as opposed to at least...
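The first result above lends itself to a short illustration. The following is a minimal Python sketch of the idea, assuming dense NumPy vectors and a shared random seed standing in for the paper's bounded-space pseudorandom generator; the function names are illustrative, not from the paper.

    import numpy as np

    def l1_sketch(p, k=200, seed=0):
        # Project p onto k Cauchy (1-stable) directions. For 1-stable c_i,
        # sum_i p_i * c_i is distributed as |p|_1 times a standard Cauchy,
        # so the median of the |projections| estimates |p|_1.
        rng = np.random.default_rng(seed)  # shared seed stands in for the PRG
        C = rng.standard_cauchy((k, len(p)))
        return C @ p

    def estimate_l1_distance(sk_p, sk_q):
        # The sketch is linear: sketch(p) - sketch(q) = sketch(p - q),
        # and the median of |Cauchy| is 1, so no rescaling is needed.
        return np.median(np.abs(sk_p - sk_q))

    p = np.array([3.0, -1.0, 0.0, 2.0])
    q = np.array([1.0, -1.0, 4.0, 2.0])
    print(estimate_l1_distance(l1_sketch(p), l1_sketch(q)))  # near |p-q|_1 = 6

Dynamic coordinate updates work for the same reason: adding v to coordinate i of p adds v times column i of C to the sketch.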
What's New: Finding Significant Differences in Network Data Streams
- in Proc. of IEEE Infocom
, 2004
"... Monitoring and analyzing network traffic usage patterns is vital for managing IP Networks. An important problem is to provide network managers with information about changes in traffic, informing them about "what's new". Specifically, we focus on the challenge of finding significantly ..."
Abstract
-
Cited by 86 (8 self)
- Add to MetaCart
(Show Context)
Monitoring and analyzing network traffic usage patterns is vital for managing IP Networks. An important problem is to provide network managers with information about changes in traffic, informing them about "what's new". Specifically, we focus on the challenge of finding significantly large differences in traffic: over time, between interfaces and between routers. We introduce the idea of a deltoid: an item that has a large difference, whether the difference is absolute, relative or variational. We present novel...
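As a hedged illustration of the three difference types (not the paper's small-space sketch, which finds deltoids without storing per-item counts), here is an exact Python version over two in-memory count tables; the variational case uses the two-period population variance as a stand-in:

    def deltoids(old, new, kind="absolute", threshold=10.0):
        # Exact, non-streaming illustration: old and new map items
        # (e.g. source IPs) to traffic counts in two time periods.
        out = {}
        for x in set(old) | set(new):
            a, b = old.get(x, 0.0), new.get(x, 0.0)
            if kind == "absolute":
                d = abs(a - b)
            elif kind == "relative":
                d = abs(a - b) / max(b, 1.0)   # guard empty denominator
            else:                              # "variational" stand-in:
                d = ((a - b) / 2.0) ** 2       # population variance of [a, b]
            if d >= threshold:
                out[x] = d
        return out

    old = {"10.0.0.1": 120, "10.0.0.2": 80}
    new = {"10.0.0.1": 20, "10.0.0.3": 95}
    # 10.0.0.1 dropped by 100, 10.0.0.2 vanished, 10.0.0.3 appeared:
    print(deltoids(old, new, kind="absolute", threshold=50))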
Comparing data streams using Hamming norms (how to zero in)
, 2003
"... Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases and instead must be processed “on the fly” as they are produced. Similarly, sensor networ ..."
Abstract
-
Cited by 81 (7 self)
- Add to MetaCart
(Show Context)
Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases and instead must be processed “on the fly” as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams and, hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalizes ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the “l0 sketch” and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.
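The construction can be conveyed in a few lines: a sum Σᵢ aᵢsᵢ of α-stable variables sᵢ has scale (Σᵢ|aᵢ|^α)^(1/α), and Σᵢ|aᵢ|^α approaches the number of non-zero aᵢ as α → 0. Below is a minimal, numerically naive Python sketch, assuming the Chambers-Mallows-Stuck sampler and an empirically calibrated median in place of the paper's analysis; names and constants are illustrative.

    import numpy as np

    def stable(alpha, size, rng):
        # Chambers-Mallows-Stuck sampler for symmetric alpha-stable values.
        t = rng.uniform(-np.pi / 2, np.pi / 2, size)
        w = rng.exponential(1.0, size)
        return (np.sin(alpha * t) / np.cos(t) ** (1 / alpha)
                * (np.cos((1 - alpha) * t) / w) ** ((1 - alpha) / alpha))

    def l0_sketch(a, alpha=0.05, k=400, seed=1):
        rng = np.random.default_rng(seed)  # shared seed in place of a PRG
        return stable(alpha, (k, len(a)), rng) @ np.asarray(a, float)

    def hamming_norm(sketch, alpha=0.05, seed=2):
        # |sum_i a_i s_i|^alpha is (sum_i |a_i|^alpha) * |S|^alpha, and
        # sum_i |a_i|^alpha ~ #nonzeros for small alpha; calibrate the
        # median of |S|^alpha by direct simulation.
        rng = np.random.default_rng(seed)
        calib = np.median(np.abs(stable(alpha, 200_000, rng)) ** alpha)
        return np.median(np.abs(sketch) ** alpha) / calib

    a = np.array([5., 0., 2., 7., 0.])
    b = np.array([5., 1., 2., 0., 0.])
    print(hamming_norm(l0_sketch(a - b)))  # ~2: two coordinates differ

Because the sketch is linear, l0_sketch(a) - l0_sketch(b) computed with a shared seed equals l0_sketch(a - b), which is how two separately observed streams are compared.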
Forward Decay: A Practical Time Decay Model for Streaming Systems
"... Abstract — Temporal data analysis in data warehouses and data streaming systems often uses time decay to reduce the importance of older tuples, without eliminating their influence, on the results of the analysis. While exponential time decay is commonly used in practice, other decay functions (e.g. ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
Temporal data analysis in data warehouses and data streaming systems often uses time decay to reduce the importance of older tuples, without eliminating their influence, on the results of the analysis. While exponential time decay is commonly used in practice, other decay functions (e.g. polynomial decay) are not, even though they have been identified as useful. We argue that this is because the usual definitions of time decay are “backwards”: the decayed weight of a tuple is based on its age, measured backward from the current time. Since this age is constantly changing, such decay is too complex and unwieldy for scalable implementation. In this paper, we propose a new class of “forward” decay functions based on measuring forward from a fixed point in time. We show that this model captures the more practical models already known, such as exponential decay and landmark windows, but also includes a wide class of other types of time decay. We provide efficient algorithms to compute a variety of aggregates and draw samples under forward decay, and show that these are easy to implement scalably. Further, we provide empirical evidence that these can be executed in a production data stream management system with little or no overhead compared to the undecayed computations. Our implementation required no extensions to the query language or the DSMS, demonstrating that forward decay represents a practical model of time decay for systems that deal with time-based data.
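The model is simple enough to state in code. A minimal Python sketch of a forward-decayed sum, assuming a landmark time L no later than any timestamp and a monotone non-decreasing function g (class and function names are illustrative, not the paper's API):

    import math

    class ForwardDecayedSum:
        # Under forward decay, an item with timestamp t_i has weight
        # g(t_i - L) / g(t - L) at query time t, for a fixed landmark L.
        # Only the un-normalized numerator is maintained, so new arrivals
        # never force rescaling of existing state.
        def __init__(self, g, landmark=0.0):
            self.g, self.L, self.total = g, landmark, 0.0

        def update(self, value, timestamp):
            self.total += value * self.g(timestamp - self.L)

        def query(self, now):
            return self.total / self.g(now - self.L)

    # g(a) = exp(lam * a) reproduces classic exponential decay
    # exp(-lam * (now - t_i)); g(a) = a ** 2 gives a polynomial decay
    # that is unwieldy under the backward, age-based definition.
    s = ForwardDecayedSum(g=lambda a: math.exp(0.1 * a))
    s.update(5.0, timestamp=1.0)
    s.update(3.0, timestamp=4.0)
    print(s.query(now=10.0))  # 5*exp(-0.9) + 3*exp(-0.6), about 3.68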
Reverse hashing for high-speed network monitoring: Algorithms, evaluation, and applications
- In IEEE INFOCOM
, 2004
"... A key function for network traffic monitoring and analysis is the ability to perform aggregate queries over multiple data streams. Change detection is an important primitive which can be extended to construct many aggregate queries. The recently proposed sketches [1] are among the very few that can ..."
Abstract
-
Cited by 12 (2 self)
- Add to MetaCart
(Show Context)
A key function for network traffic monitoring and analysis is the ability to perform aggregate queries over multiple data streams. Change detection is an important primitive which can be extended to construct many aggregate queries. The recently proposed sketches [1] are among the very few that can detect heavy changes online for high speed links, and thus support various aggregate queries in both temporal and spatial domains. However, they do not preserve the keys (e.g., source IP address) of flows, making it difficult to reconstruct the desired set of anomalous keys. In an earlier abstract we proposed a framework for a reversible sketch data structure that offers hope for efficient extraction of keys [2]. However, this scheme is only able to detect a single heavy change key and places restrictions on the statistical properties of the key space. To address these challenges, we propose an efficient reverse hashing scheme to infer the keys of culprit flows from reversible sketches. There are two phases. The first operates online, recording the packet stream in a compact representation with negligible extra memory and few extra memory accesses. Our prototype implementation on a single FPGA board can achieve a throughput of over 16 Gbps for 40-byte-packet streams (the worst case). The second phase identifies heavy changes and their keys from the representation in nearly real time. We evaluate our scheme using traces from large edge routers with OC-12 or higher links. Both the analytical and experimental results show that we are able to achieve online traffic monitoring and accurate change/intrusion detection over massive data streams on high speed links, all in a manner that scales to large key space size. To the best of our knowledge, our system is the first to achieve these properties simultaneously.
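To convey the flavor of reverse hashing, here is a hedged Python toy: a 32-bit key is hashed word-by-word (modular hashing), so a heavy bucket index can be "reversed" into candidate keys by enumerating per-word preimages and intersecting the candidates across independently salted tables. The hash function and parameters are invented for illustration; the paper's scheme adds IP mangling and a more careful analysis to shrink the candidate sets.

    import itertools

    WORD_BITS, WORDS, SUB_BITS = 8, 4, 4   # 32-bit keys, 2^16-bucket tables

    def word_hash(word, salt):
        # Hypothetical mixing hash on one 8-bit word; returns a 4-bit index.
        return ((word * 2654435761 + salt * 40503) >> 16) % (1 << SUB_BITS)

    def modular_hash(key, salt):
        # Bucket index = concatenation of the per-word hashes.
        idx = 0
        for j in range(WORDS):
            w = (key >> (WORD_BITS * j)) & 0xFF
            idx = (idx << SUB_BITS) | word_hash(w, salt)
        return idx

    def reverse_bucket(idx, salt):
        # All keys mapping to bucket idx: cross product of per-word preimages.
        subs = [(idx >> (SUB_BITS * (WORDS - 1 - j))) & 0xF
                for j in range(WORDS)]
        per_word = [[w for w in range(256) if word_hash(w, salt) == s]
                    for s in subs]
        for combo in itertools.product(*per_word):
            yield sum(w << (WORD_BITS * j) for j, w in enumerate(combo))

    key = 0x0A00020F                       # e.g. source IP 10.0.2.15
    cands = set(reverse_bucket(modular_hash(key, 1), 1))
    for salt in (2, 3):                    # intersect across salted tables
        cands &= set(reverse_bucket(modular_hash(key, salt), salt))
    print(hex(key), len(cands), key in cands)

The true key always survives the intersection, while unrelated candidates are filtered out at each additional table.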
Reversible Sketches: Enabling Monitoring and Analysis over High-speed Data Streams
"... Abstract — A key function for network traffic monitoring and analysis is the ability to perform aggregate queries over multiple data streams. Change detection is an important primitive which can be extended to construct many aggregate queries. The recently proposed sketches [1] are among the very fe ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
(Show Context)
A key function for network traffic monitoring and analysis is the ability to perform aggregate queries over multiple data streams. Change detection is an important primitive which can be extended to construct many aggregate queries. The recently proposed sketches [1] are among the very few that can detect heavy changes online for high speed links, and thus support various aggregate queries in both temporal and spatial domains. However, they do not preserve the keys (e.g., source IP address) of flows, making it difficult to reconstruct the desired set of anomalous keys. To address this challenge, we propose the reversible sketch data structure along with reverse hashing algorithms to infer the keys of culprit flows. There are two phases. The first operates online, recording the packet stream in a compact representation with negligible extra memory and few extra memory accesses. Our prototype implementation on a single FPGA board can achieve a throughput of over 16 Gbps for 40-byte-packet streams (the worst case). The second phase identifies heavy changes and their keys from the representation in nearly real time. We evaluate our scheme using traces from large edge routers with OC-12 or higher links. Both the analytical and experimental results show that we are able to achieve online traffic monitoring and accurate change/intrusion detection over massive data streams on high speed links, all in a manner that scales to large key space size. To the best of our knowledge, our system is the first to achieve these properties simultaneously.
Stable distributions for stream computations: It’s as easy as 0,1,2
- In Workshop on Management and Processing of Massive Data Streams, at FCRC
, 2003
"... -1. Introduction A surprising number of data stream problems are solved bymethods involving computations with stable distributions. This ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
(Show Context)
A surprising number of data stream problems are solved by methods involving computations with stable distributions. This ...
Improved range-summable random variable construction algorithms
- In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Society for Industrial and Applied Mathematics
, 2005
"... ..."
(Show Context)
Range-efficient computation of F0 over massive data streams
- in Proceedings of the 21st International Conference on Data Engineering (ICDE), 2005
"... Efficient one-pass computation of F0, the number of dis-tinct elements in a data stream, is a fundamental problem arising in various contexts in databases and networking. We consider the problem of efficiently estimating F0 of a data stream where each element of the stream is an interval of in-teger ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
(Show Context)
Efficient one-pass computation of F0, the number of distinct elements in a data stream, is a fundamental problem arising in various contexts in databases and networking. We consider the problem of efficiently estimating F0 of a data stream where each element of the stream is an interval of integers. We present a randomized algorithm which gives an (ε, δ)-approximation of F0, with the following time complexity (n is the size of the universe of the items): (1) the amortized processing time per interval is O(log(1/δ) · log(n/ε)); (2) the time to answer a query for F0 is O(log(1/δ)). The workspace used is O((1/ε²) · log(1/δ) · log n) bits. Our algorithm improves upon a previous algorithm by Bar-Yossef, Kumar and Sivakumar [5], which requires O((1/ε⁵) · log(1/δ) · log⁵ n) processing time per item. Our algorithm can be used to compute the max-dominance norm of a stream of multiple signals, and significantly improves upon the current best bounds due to Cormode and Muthukrishnan [11]. This also provides efficient and novel solutions for data aggregation problems in sensor networks studied by Nath and Gibbons [22] and Considine et al. [8].
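For contrast with those bounds, the quantity being approximated has a simple exact (but linear-space, offline) definition: F0 of an interval stream is the size of the union of the intervals. A small Python baseline follows, purely to pin down the semantics rather than to reproduce the algorithm:

    def exact_f0_of_intervals(stream):
        # Size of the union of integer intervals [a, b]; sorts and merges,
        # so it needs space linear in the number of intervals, whereas the
        # algorithm above achieves an (eps, delta)-approximation in
        # polylogarithmic space.
        total, cur_a, cur_b = 0, None, None
        for a, b in sorted(stream):
            if cur_b is None or a > cur_b + 1:
                if cur_b is not None:
                    total += cur_b - cur_a + 1
                cur_a, cur_b = a, b
            else:
                cur_b = max(cur_b, b)
        if cur_b is not None:
            total += cur_b - cur_a + 1
        return total

    print(exact_f0_of_intervals([(1, 5), (4, 9), (20, 22)]))  # 12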