Results 1 - 10
of
33
Sketch-based Change Detection: Methods, Evaluation, and Applications
- IN INTERNET MEASUREMENT CONFERENCE
, 2003
"... Traffic anomalies such as failures and attacks are commonplace in today's network, and identifying them rapidly and accurately is critical for large network operators. The detection typically treats the traffic as a collection of flows that need to be examined for significant changes in traffic patt ..."
Abstract
-
Cited by 95 (11 self)
- Add to MetaCart
Traffic anomalies such as failures and attacks are commonplace in today's network, and identifying them rapidly and accurately is critical for large network operators. The detection typically treats the traffic as a collection of flows that need to be examined for significant changes in traffic pattern (e.g., volume, number of connections) . However, as link speeds and the number of flows increase, keeping per-flow state is either too expensive or too slow. We propose building compact summaries of the traffic data using the notion of sketches. We have designed a variant of the sketch data structure, k-ary sketch, which uses a constant, small amount of memory, and has constant per-record update and reconstruction cost. Its linearity property enables us to summarize traffic at various levels. We then implement a variety of time series forecast models (ARIMA, Holt-Winters, etc.) on top of such summaries and detect significant changes by looking for flows with large forecast errors. We also present heuristics for automatically configuring the model parameters. Using a
Why simple hash functions work: Exploiting the entropy in a data stream
- In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms
, 2008
"... Hashing is fundamental to many algorithms and data structures widely used in practice. For theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealiz ..."
Abstract
-
Cited by 27 (6 self)
- Add to MetaCart
Hashing is fundamental to many algorithms and data structures widely used in practice. For theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealized model is unrealistic because a truly random hash function requires an exponential number of bits to describe. Alternatively, one can provide rigorous bounds on performance when explicit families of hash functions are used, such as 2-universal or O(1)-wise independent families. For such families, performance guarantees are often noticeably weaker than for ideal hashing. In practice, however, it is commonly observed that weak hash functions, including 2-universal hash functions, perform as predicted by the idealized analysis for truly random hash functions. In this paper, we try to explain this phenomenon. We demonstrate that the strong performance of universal hash functions in practice can arise naturally from a combination of the randomness of the hash function and the data. Specifically, following the large body of literature on random sources and randomness extraction, we model the data as coming from a “block source, ” whereby
Finding Frequent Items in Data Streams
- PVLDB
, 2008
"... The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, ..."
Abstract
-
Cited by 25 (5 self)
- Add to MetaCart
The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms, and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
Summarizing and mining skewed data streams
- In SIAM Conference on Data Mining
, 2005
"... Many applications generate massive data streams. Summarizing such massive data requires fast, small space algorithms to support post-hoc queries and mining. An important observation is that such streams are rarely uniform, and real data sources typically exhibit significant skewness. These are well ..."
Abstract
-
Cited by 20 (1 self)
- Add to MetaCart
Many applications generate massive data streams. Summarizing such massive data requires fast, small space algorithms to support post-hoc queries and mining. An important observation is that such streams are rarely uniform, and real data sources typically exhibit significant skewness. These are well modeled by Zipf distributions, which are characterized by a parameter, z, that captures the amount of skew. We present a data stream summary that can answer point queries with ε accuracy and show that the space needed is only O(ε − min{1,1/z}). This is the first o(1/ε) space algorithm for this problem, and we show it is essentially tight for skewed distributions. We show that the same data structure can also estimate the L2 norm of the stream in o(1/ε2) space for z> 1 2, another improvement over the existing Ω(1/ε2) methods. We support our theoretical results with an experimental study over a large variety of real and synthetic data. We show that significant skew is present in both textual and telecommunication data. Our methods give strong accuracy, significantly better than other methods, and behave exactly in line with their analytic bounds.
Path-quality monitoring in the presence of adversaries
- In ACM SIGMETRICS
, 2008
"... Edge networks connected to the Internet need effective monitoring techniques to drive routing decisions and detect violations of Service Level Agreements (SLAs). However, existing measurement tools, like ping, traceroute, and trajectory sampling, are vulnerable to attacks that make a path look bette ..."
Abstract
-
Cited by 19 (7 self)
- Add to MetaCart
Edge networks connected to the Internet need effective monitoring techniques to drive routing decisions and detect violations of Service Level Agreements (SLAs). However, existing measurement tools, like ping, traceroute, and trajectory sampling, are vulnerable to attacks that make a path look better than it really is. In this paper, we design and analyze path-quality monitoring protocols that robustly raise an alarm when packet-loss rate and delay exceeds a threshold, even when adversary tries to bias monitoring results by selectively delaying, dropping, modifying, injecting, or preferentially treating packets. Despite the strong threat model we consider in this paper, our protocols are efficient enough to run at line rate on high-speed routers. We present a secure sketching protocol for identifying when packet loss and delay degrade beyond a threshold. This protocol is extremely lightweight, requiring only 250–600 bytes of storage and periodic transmission of a comparably sized IP packet. We also present secure sampling protocols that provide faster feedback and more accurate round-trip delay estimates, at the expense of somewhat higher storage and communication costs. We prove that all our protocols satisfy a precise definition of secure pathquality monitoring and derive analytic expressions for the trade-off between statistical accuracy and system overhead. We also compare how our protocols perform in the clientserver setting, when paths are asymmetric, and when packet marking is not permitted. 1.
Linear probing with constant independence
- In STOC ’07: Proceedings of the thirty-ninth annual ACM symposium on Theory of computing
, 2007
"... Hashing with linear probing dates back to the 1950s, and is among the most studied algorithms. In recent years it has become one of the most important hash table organizations since it uses the cache of modern computers very well. Unfortunately, previous analyses rely either on complicated and space ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
Hashing with linear probing dates back to the 1950s, and is among the most studied algorithms. In recent years it has become one of the most important hash table organizations since it uses the cache of modern computers very well. Unfortunately, previous analyses rely either on complicated and space consuming hash functions, or on the unrealistic assumption of free access to a truly random hash function. Already Carter and Wegman, in their seminal paper on universal hashing, raised the question of extending their analysis to linear probing. However, we show in this paper that linear probing using a pairwise independent family may have expected logarithmic cost per operation. On the positive side, we show that 5-wise independence is enough to ensure constant expected time per operation. This resolves the question of finding a space and time efficient hash function that provably ensures good performance for linear probing.
On estimating frequency moments of data streams
- In International Workshop on Randomization and Approximation Techniques in Computer Science
, 2007
"... Abstract. Space-economical estimation of the pth frequency moments, defined as Fp = P n i=1 |fi|p, for p> 0, are of interest in estimating all-pairs distances in a large data matrix [14], machine learning, and in data stream computation. Random sketches formed by the inner product of the frequency v ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Abstract. Space-economical estimation of the pth frequency moments, defined as Fp = P n i=1 |fi|p, for p> 0, are of interest in estimating all-pairs distances in a large data matrix [14], machine learning, and in data stream computation. Random sketches formed by the inner product of the frequency vector f1,..., fn with a suitably chosen random vector were pioneered by Alon, Ma-tias and Szegedy [1], and have since played a central role in estimating Fp and for data stream computations in general. The concept of p-stable sketches formed by the inner product of the frequency vector with a random vector whose components are drawn from a p-stable distribution, was proposed by Indyk [11] for estimating Fp, for 0 < p < 2, and has been further studied in Li [13]. In this paper, we consider the problem of estimating Fp, for 0 < p < 2. A disadvantage of the sta-ble sketches technique and its variants is that they require O ( 1 ɛ 2) inner-products of the frequency vector with dense vectors of stable (or nearly stable [14, 13]) random variables to be maintained. This means that each stream update can be quite time-consuming. We present algorithms for esti-mating Fp, for 0 < p < 2, that does not require the use of stable sketches or its approximations. Our technique is elementary in nature, in that, it uses simple randomization in conjunction with well-known summary structures for data streams, such as the COUNT-MIN sketch [7] and the COUNTSKETCH structure [5]. Our algorithms require space 1 ± ɛ factors and requires expected time O(log F1 log 1 δ Õ ( 1 ɛ 2+p) 3 to estimate Fp to within) to process each update. Thus, our tech-nique trades an O ( 1 ɛ p) factor in space for much more efficient processing of stream updates. We also present a stand-alone iterative estimator for F1. 1
Extracting Hidden Anomalies using Sketch and Non Gaussian Multiresolution Statistical Detection Procedures
- LSAD'07
, 2007
"... A new profile-based anomaly detection and characterization procedure is proposed. It aims at performing prompt and accurate detection of both short-lived and long-lasting low-intensity anomalies, without the recourse of any prior knowledge of the targetted traffic. Key features of the algorithm lie ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
A new profile-based anomaly detection and characterization procedure is proposed. It aims at performing prompt and accurate detection of both short-lived and long-lasting low-intensity anomalies, without the recourse of any prior knowledge of the targetted traffic. Key features of the algorithm lie in the joint use of random projection techniques (sketches) and of a multiresolution non Gaussian marginal distribution modeling. The former enables both a reduction in the dimensionality of the data and the measurement of the reference (i.e., normal) traffic behavior, while the latter extracts anomalies at different aggregation levels. This procedure is used to blindly analyze a large-scale packet trace database collected on a trans-Pacific transit link from 2001 to 2006. It can detect and identify a large number of known and unknown anomalies and attacks, whose intensities are low (down to below one percent). Using sketches also makes possible a real-time identification of the source or destination IP addresses associated to the detected anomaly and hence their mitigation.
Seven Years and One Day: Sketching the Evolution of Internet Traffic
"... Abstract—This contribution aims at performing a longitudinal study of the evolution of the traffic collected every day for seven years on a trans-Pacific backbone link (the MAWI dataset). Long term characteristics are investigated both at TCP/IP layers (packet and flow attributes) and application us ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
Abstract—This contribution aims at performing a longitudinal study of the evolution of the traffic collected every day for seven years on a trans-Pacific backbone link (the MAWI dataset). Long term characteristics are investigated both at TCP/IP layers (packet and flow attributes) and application usages. The analysis of this unique dataset provides new insights into changes in traffic statistics, notably on the persistence of Long Range Dependence, induced by the on-going increase in link bandwidth. Traffic in the MAWI dataset is subject to bandwidth changes, to congestions, and to a variety of anomalies. This allows the comparison of their impacts on the traffic statistics but at the same time significantly impairs long term evolution characterizations. To account for this difficulty, we show and explain how and why random projection (sketch) based analysis procedures provide practitioners with an efficient and robust tool to disentangle actual long term evolutions from time localized events such as anomalies and link congestions. Our central results consist in showing a strong and persistent long range dependence controlling jointly byte and packet counts. An additional study of a 24-hour trace complements the long-term results with the analysis of intraday variabilities.
Annotations in Data Streams
, 2009
"... The central goal of data stream algorithms is to process massive streams of data using sublinear storage space. Motivated by work in the database community on outsourcing database and data stream processing, we ask whether the space usage of such algorithms be further reduced by enlisting a more pow ..."
Abstract
-
Cited by 5 (4 self)
- Add to MetaCart
The central goal of data stream algorithms is to process massive streams of data using sublinear storage space. Motivated by work in the database community on outsourcing database and data stream processing, we ask whether the space usage of such algorithms be further reduced by enlisting a more powerful “helper ” who can annotate the stream as it is read. We do not wish to blindly trust the helper, so we require that the algorithm be convinced of having computed a correct answer. We show upper bounds that achieve a non-trivial tradeoff between the amount of annotation used and the space required to verify it. We also prove lower bounds on such tradeoffs, often nearly matching the upper bounds, via notions related to Merlin-Arthur communication complexity. Our results cover the classic data stream problems of selection, frequency moments, and fundamental graph problems such as triangle-freeness and connectivity. Our work is also part of a growing trend — including recent studies of multi-pass streaming, read/write streams and randomly ordered streams — of asking more complexity-theoretic questions about data stream processing. It is a recognition that, in addition to practical relevance, the data stream model raises many interesting theoretical questions in its own right. 1

