Results 1–10 of 62
Sketch-based Change Detection: Methods, Evaluation, and Applications
In Internet Measurement Conference, 2003
Cited by 132 (17 self)
Traffic anomalies such as failures and attacks are commonplace in today's networks, and identifying them rapidly and accurately is critical for large network operators. Detection typically treats the traffic as a collection of flows that need to be examined for significant changes in traffic pattern (e.g., volume, number of connections). However, as link speeds and the number of flows increase, keeping per-flow state is either too expensive or too slow. We propose building compact summaries of the traffic data using the notion of sketches. We have designed a variant of the sketch data structure, the k-ary sketch, which uses a constant, small amount of memory and has constant per-record update and reconstruction cost. Its linearity property enables us to summarize traffic at various levels. We then implement a variety of time series forecast models (ARIMA, Holt-Winters, etc.) on top of such summaries and detect significant changes by looking for flows with large forecast errors. We also present heuristics for automatically configuring the model parameters. Using a …
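The sketch-and-forecast idea can be pictured with a small summary structure in Python. The class name, hash choices, and median estimator below are illustrative assumptions, not the paper's exact k-ary sketch construction:

```python
import hashlib

class KarySketch:
    """A small k-ary-sketch-style summary: `depth` rows of `width` counters.

    The hash and the median estimator are illustrative; the paper uses an
    unbiased per-row estimator tuned for change detection.
    """

    def __init__(self, depth=5, width=1024):
        self.depth, self.width = depth, width
        self.tables = [[0] * width for _ in range(depth)]

    def _bucket(self, row, key):
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, key, delta=1):
        # Constant per-record cost: one counter touched per row.
        for r in range(self.depth):
            self.tables[r][self._bucket(r, key)] += delta

    def estimate(self, key):
        vals = sorted(self.tables[r][self._bucket(r, key)]
                      for r in range(self.depth))
        return vals[len(vals) // 2]  # median across rows

    def subtract(self, other):
        # Linearity: sketch(a) - sketch(b) summarizes the stream a - b, so a
        # forecast sketch can be subtracted from an observed sketch to find
        # keys with large forecast errors, with no per-flow state.
        out = KarySketch(self.depth, self.width)
        out.tables = [[a - b for a, b in zip(ra, rb)]
                      for ra, rb in zip(self.tables, other.tables)]
        return out
```

The `subtract` method is the point of the linearity property: change detection operates on the difference of two sketches rather than on per-flow counters.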
Finding Frequent Items in Data Streams
In PVLDB, 2008
Cited by 39 (6 self)
The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large-scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms, and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
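As a concrete instance of the counter-based algorithms such surveys compare, here is the classic Misra-Gries "Frequent" summary (a minimal sketch; the function and parameter names are mine):

```python
def misra_gries(stream, k):
    """Misra-Gries summary with at most k - 1 counters.

    Guarantee: any item occurring more than n/k times in a stream of
    length n is present in the returned dictionary (its stored count may
    undercount the true frequency by at most n/k).
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement every counter; drop those that reach zero. This
            # "charges" the new item against k - 1 existing ones.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

A second pass over the stream can then compute exact counts for the (few) surviving candidates.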
Why simple hash functions work: Exploiting the entropy in a data stream
In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, 2008
Cited by 36 (6 self)
Hashing is fundamental to many algorithms and data structures widely used in practice. For theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealized model is unrealistic because a truly random hash function requires an exponential number of bits to describe. Alternatively, one can provide rigorous bounds on performance when explicit families of hash functions are used, such as 2-universal or O(1)-wise independent families. For such families, performance guarantees are often noticeably weaker than for ideal hashing. In practice, however, it is commonly observed that weak hash functions, including 2-universal hash functions, perform as predicted by the idealized analysis for truly random hash functions. In this paper, we try to explain this phenomenon. We demonstrate that the strong performance of universal hash functions in practice can arise naturally from a combination of the randomness of the hash function and the data. Specifically, following the large body of literature on random sources and randomness extraction, we model the data as coming from a "block source," whereby …
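For reference, a 2-universal family can be as simple as the Carter-Wegman construction sketched below. The choice of modulus and the helper name are illustrative:

```python
import random

def make_2universal(m, p=(1 << 61) - 1, seed=None):
    """Carter-Wegman 2-universal hash family over integer keys in [0, p).

    h(x) = ((a*x + b) mod p) mod m, with p prime, a drawn from [1, p) and
    b from [0, p). For any fixed pair x != y, the collision probability
    over the random draw of (a, b) is O(1/m). That is a weak guarantee on
    paper, yet such functions often behave like ideal hashing on real
    data, which is the phenomenon this paper explains via entropy in the
    data itself.
    """
    rng = random.Random(seed)
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m
```

Note that only two random field elements (a, b) describe the whole function, in contrast to the exponentially many bits a truly random function would need.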
Path-quality monitoring in the presence of adversaries
In ACM SIGMETRICS, 2008
Cited by 28 (8 self)
Edge networks connected to the Internet need effective monitoring techniques to drive routing decisions and detect violations of Service Level Agreements (SLAs). However, existing measurement tools, like ping, traceroute, and trajectory sampling, are vulnerable to attacks that make a path look better than it really is. In this paper, we design and analyze path-quality monitoring protocols that robustly raise an alarm when the packet-loss rate or delay exceeds a threshold, even when an adversary tries to bias monitoring results by selectively delaying, dropping, modifying, injecting, or preferentially treating packets. Despite the strong threat model we consider in this paper, our protocols are efficient enough to run at line rate on high-speed routers. We present a secure sketching protocol for identifying when packet loss and delay degrade beyond a threshold. This protocol is extremely lightweight, requiring only 250–600 bytes of storage and periodic transmission of a comparably sized IP packet. We also present secure sampling protocols that provide faster feedback and more accurate round-trip delay estimates, at the expense of somewhat higher storage and communication costs. We prove that all our protocols satisfy a precise definition of secure path-quality monitoring and derive analytic expressions for the tradeoff between statistical accuracy and system overhead. We also compare how our protocols perform in the client-server setting, when paths are asymmetric, and when packet marking is not permitted.
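The secure-sketching idea can be pictured with a toy model: both endpoints summarize the packets they see into a short vector of keyed, signed counters, and the squared distance between the two vectors tracks how much traffic was dropped or altered in transit. HMAC stands in here for a keyed PRF, and every parameter is an illustrative assumption rather than the protocol's actual choice:

```python
import hashlib
import hmac

def traffic_sketch(key, packets, width=64):
    """Summarize a packet stream into `width` signed counters.

    Without the secret key, an adversary cannot predict which counter a
    packet lands in or with which sign, which is what makes selective
    dropping or tampering show up in the sketch comparison.
    """
    counters = [0] * width
    for pkt in packets:
        digest = hmac.new(key, pkt, hashlib.sha256).digest()
        bucket = digest[0] % width
        sign = 1 if digest[1] & 1 else -1
        counters[bucket] += sign
    return counters

def estimated_discrepancy(sender_sketch, receiver_sketch):
    # Squared L2 distance between the two sketches; in expectation this
    # tracks the number of packets dropped, modified, or injected.
    return sum((a - b) ** 2 for a, b in zip(sender_sketch, receiver_sketch))
```

In a real deployment the sender periodically transmits its (comparably small) sketch to the receiver, which compares and raises an alarm when the discrepancy crosses a threshold.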
Summarizing and Mining Skewed Data Streams
Cited by 27 (2 self)
Many applications generate massive data streams. Summarizing such massive data requires fast, small-space algorithms to support post-hoc queries and mining. An important observation is that such streams are rarely uniform, and real data sources typically exhibit significant skewness. These are well modeled by Zipf distributions, which are characterized by a parameter, z, that captures the amount of skew. We present a data stream summary that can answer point queries with ε accuracy and show that the space needed is only O(ε^(-min(1, 1/z))). This is the first o(1/ε)-space algorithm for this problem, and we show it is essentially tight for skewed distributions. We show that the same data structure can also estimate the L2 norm of the stream in o(1/ε²) space for z > 1/2, another improvement over the existing Ω(1/ε²) methods. We support our theoretical results with an experimental study over a large variety of real and synthetic data. We show that significant skew is present in both textual and telecommunication data. Our methods give strong accuracy, significantly better than other methods, and behave exactly in line with their analytic bounds.
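The point-query summary in question belongs to the Count-Min family. A generic, non-skew-optimized version is sketched below; the width, depth, and hash are illustrative choices:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch for point queries.

    `depth` rows of `width` counters; each key hashes to one counter per
    row. Queries take the minimum across rows, so estimates never
    undercount. On skewed (Zipfian) streams the overestimate shrinks,
    which is what lets the paper's variant use o(1/eps) space.
    """

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, key):
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, key, count=1):
        for r in range(self.depth):
            self.table[r][self._bucket(r, key)] += count

    def query(self, key):
        return min(self.table[r][self._bucket(r, key)]
                   for r in range(self.depth))
```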
Seven Years and One Day: Sketching the Evolution of Internet Traffic
Cited by 23 (8 self)
This contribution performs a longitudinal study of the evolution of the traffic collected every day for seven years on a trans-Pacific backbone link (the MAWI dataset). Long-term characteristics are investigated both at the TCP/IP layers (packet and flow attributes) and in application usage. The analysis of this unique dataset provides new insights into changes in traffic statistics, notably on the persistence of long-range dependence, induced by the ongoing increase in link bandwidth. Traffic in the MAWI dataset is subject to bandwidth changes, congestion, and a variety of anomalies. This allows the comparison of their impacts on the traffic statistics, but at the same time significantly impairs long-term evolution characterizations. To account for this difficulty, we show and explain how and why random projection (sketch) based analysis procedures provide practitioners with an efficient and robust tool to disentangle actual long-term evolutions from time-localized events such as anomalies and link congestion. Our central result consists in showing a strong and persistent long-range dependence jointly controlling byte and packet counts. An additional study of a 24-hour trace complements the long-term results with an analysis of intra-day variability.
Fast moment estimation in data streams in optimal space
In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), 2011
Cited by 21 (6 self)
We give a space-optimal algorithm with update time O(log²(1/ε) log log(1/ε)) for (1 ± ε)-approximating the pth frequency moment, 0 < p < 2, of a length-n vector updated in a data stream. This provides a nearly exponential improvement in the update time complexity over the previous space-optimal algorithm of [Kane-Nelson-Woodruff, SODA 2010], which had update time Ω(1/ε²).
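For orientation, the classic AMS sketch for the p = 2 moment looks like the toy below; the paper's contribution is the much harder 0 < p < 2 regime in optimal space with fast updates, and the hash and repetition scheme here are illustrative:

```python
import hashlib
import statistics

def ams_f2(stream, reps=9):
    """Estimate F2 = sum_i f_i^2 of a stream of items.

    Each repetition keeps a single counter Z = sum_i f_i * sign(i), with
    a pseudorandom +-1 sign per distinct item. Then E[Z^2] = F2, and
    taking a median over independent repetitions tames the variance.
    """
    def sign(rep, key):
        d = hashlib.blake2b(f"{rep}:{key}".encode(), digest_size=1).digest()
        return 1 if d[0] & 1 else -1

    sums = [0] * reps
    for item in stream:
        for r in range(reps):
            sums[r] += sign(r, item)
    return statistics.median(z * z for z in sums)
```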
Extracting Hidden Anomalies using Sketch and Non-Gaussian Multiresolution Statistical Detection Procedures
In LSAD'07, 2007
Cited by 19 (8 self)
A new profile-based anomaly detection and characterization procedure is proposed. It aims at performing prompt and accurate detection of both short-lived and long-lasting low-intensity anomalies, without recourse to any prior knowledge of the targeted traffic. Key features of the algorithm lie in the joint use of random projection techniques (sketches) and of multiresolution non-Gaussian marginal distribution modeling. The former enables both a reduction in the dimensionality of the data and the measurement of the reference (i.e., normal) traffic behavior, while the latter extracts anomalies at different aggregation levels. This procedure is used to blindly analyze a large-scale packet trace database collected on a trans-Pacific transit link from 2001 to 2006. It can detect and identify a large number of known and unknown anomalies and attacks whose intensities are low (down to below one percent). Using sketches also makes possible real-time identification of the source or destination IP addresses associated with the detected anomaly, and hence their mitigation.
A Derandomized Sparse Johnson-Lindenstrauss Transform
Cited by 16 (4 self)
Recent work of [Dasgupta-Kumar-Sarlós, STOC 2010] gave a sparse Johnson-Lindenstrauss transform and left as a main open question whether their construction could be efficiently derandomized. We answer their question affirmatively by giving an alternative proof of their result requiring only bounded-independence hash functions. Furthermore, the sparsity bound obtained in our proof is improved. The main ingredient in our proof is a spectral moment bound for quadratic forms that was recently used in [Diakonikolas-Kane-Nelson, FOCS 2010].
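A sparse JL-style projection can be sketched as follows: each input coordinate touches only s of the m output coordinates, with signs scaled by 1/sqrt(s). The hash-based choices below are illustrative stand-ins for the bounded-independence families the paper actually analyzes:

```python
import hashlib
import math

def sparse_jl(x, m, s=4, seed=0):
    """Project vector x into R^m with per-coordinate sparsity s.

    Each nonzero input coordinate i contributes to s output rows, chosen
    and signed by a hash of (seed, i, j), and scaled by 1/sqrt(s). The
    map is linear in x by construction, since rows and signs depend only
    on the coordinate index, never on its value.
    """
    y = [0.0] * m
    for i, v in enumerate(x):
        if v == 0.0:
            continue
        for j in range(s):
            d = hashlib.blake2b(f"{seed}:{i}:{j}".encode(),
                                digest_size=9).digest()
            row = int.from_bytes(d[:8], "big") % m
            sgn = 1.0 if d[8] & 1 else -1.0
            y[row] += sgn * v / math.sqrt(s)
    return y
```

Sparsity is what makes such transforms fast: embedding time scales with s times the number of nonzeros rather than with the full dense matrix-vector product.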
Linear probing with constant independence
In STOC '07: Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, 2007
Cited by 14 (2 self)
Hashing with linear probing dates back to the 1950s, and is among the most studied algorithms. In recent years it has become one of the most important hash table organizations since it uses the cache of modern computers very well. Unfortunately, previous analyses rely either on complicated and space-consuming hash functions, or on the unrealistic assumption of free access to a truly random hash function. Carter and Wegman, in their seminal paper on universal hashing, already raised the question of extending their analysis to linear probing. However, we show in this paper that linear probing using a pairwise independent family may have expected logarithmic cost per operation. On the positive side, we show that 5-wise independence is enough to ensure constant expected time per operation. This resolves the question of finding a space- and time-efficient hash function that provably ensures good performance for linear probing.
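Linear probing itself is only a few lines; the minimal table below uses Python's built-in hash purely for illustration, whereas the paper's question is precisely which hash families keep the probe sequences short:

```python
class LinearProbingTable:
    """Minimal open-addressing table with linear probing.

    On a collision the probe sequence scans forward one slot at a time,
    which is what makes the scheme cache-friendly. The load factor must
    stay below 1 (this sketch does not resize, so `put` on a full table
    would loop forever).
    """

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.slots = [None] * capacity  # each slot: (key, value) or None

    def _probe(self, key):
        i = hash(key) % self.capacity
        while True:
            yield i
            i = (i + 1) % self.capacity  # linear scan, wrapping around

    def put(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return

    def get(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                return None  # an empty slot ends the probe sequence
            if self.slots[i][0] == key:
                return self.slots[i][1]
```

The cost of `get` and `put` is the length of the probe run, and it is exactly this run length whose expectation the paper bounds under pairwise versus 5-wise independent hashing.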