Results 1–10 of 77
Sketch-Based Change Detection: Methods, Evaluation, and Applications
In Internet Measurement Conference, 2003
Abstract

Cited by 161 (17 self)
Traffic anomalies such as failures and attacks are commonplace in today's networks, and identifying them rapidly and accurately is critical for large network operators. Detection typically treats the traffic as a collection of flows that need to be examined for significant changes in traffic pattern (e.g., volume, number of connections). However, as link speeds and the number of flows increase, keeping per-flow state is either too expensive or too slow. We propose building compact summaries of the traffic data using the notion of sketches. We have designed a variant of the sketch data structure, k-ary sketch, which uses a constant, small amount of memory, and has constant per-record update and reconstruction cost. Its linearity property enables us to summarize traffic at various levels. We then implement a variety of time series forecast models (ARIMA, Holt-Winters, etc.) on top of such summaries and detect significant changes by looking for flows with large forecast errors. We also present heuristics for automatically configuring the model parameters.
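As an illustration of the idea, a hash-based counter array with constant per-record update cost and point reconstruction might look like the following. This is a minimal sketch, not the paper's exact k-ary construction: the table sizes, the hash family, and the median estimator are simplifying assumptions.

```python
import random

class KArySketch:
    """Toy hash-based sketch: h tables of k counters each; every update
    touches exactly one counter per table (constant per-record cost)."""
    def __init__(self, h=4, k=1024, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.p = (1 << 61) - 1  # a Mersenne prime for 2-universal hashing
        self.tables = [[0] * k for _ in range(h)]
        self.ab = [(rng.randrange(1, self.p), rng.randrange(self.p))
                   for _ in range(h)]

    def _bucket(self, key, a, b):
        return ((a * key + b) % self.p) % self.k

    def update(self, key, value=1):
        for tbl, (a, b) in zip(self.tables, self.ab):
            tbl[self._bucket(key, a, b)] += value

    def estimate(self, key):
        # Reconstruct a flow's total as the median over the h tables.
        vals = sorted(tbl[self._bucket(key, a, b)]
                      for tbl, (a, b) in zip(self.tables, self.ab))
        return vals[len(vals) // 2]
```

Because the structure is linear in its updates, a forecast sketch (e.g., an EWMA over past epochs' sketches) can be maintained bucket-wise, and flows whose observed estimate deviates far from their forecast estimate are flagged as changes.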
Finding Frequent Items in Data Streams
PVLDB, 2008
Abstract

Cited by 54 (7 self)
The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large-scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms, and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
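Among the classic counter-based algorithms this line of work surveys is the Misra-Gries summary; a compact version (variable names are mine) is:

```python
def misra_gries(stream, k):
    """Maintain at most k-1 counters; any item occurring more than n/k
    times in a stream of length n is guaranteed to survive as a candidate."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all counters, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

A second pass over the stream (or a tolerance for false positives) turns the surviving candidates into the exact frequent items.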
Why simple hash functions work: Exploiting the entropy in a data stream
In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, 2008
Abstract

Cited by 50 (9 self)
Hashing is fundamental to many algorithms and data structures widely used in practice. For theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealized model is unrealistic because a truly random hash function requires an exponential number of bits to describe. Alternatively, one can provide rigorous bounds on performance when explicit families of hash functions are used, such as 2-universal or O(1)-wise independent families. For such families, performance guarantees are often noticeably weaker than for ideal hashing. In practice, however, it is commonly observed that weak hash functions, including 2-universal hash functions, perform as predicted by the idealized analysis for truly random hash functions. In this paper, we try to explain this phenomenon. We demonstrate that the strong performance of universal hash functions in practice can arise naturally from a combination of the randomness of the hash function and the data. Specifically, following the large body of literature on random sources and randomness extraction, we model the data as coming from a “block source.”
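A concrete member of the 2-universal class referred to here is the Carter-Wegman family h(x) = ((ax + b) mod p) mod m; a small sketch (the prime and parameter ranges are illustrative choices):

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; hashed keys are assumed smaller than P

def make_2universal(m, rng):
    """Draw one function from the family h(x) = ((a*x + b) mod P) mod m.
    For any two distinct keys, the collision probability over the random
    draw of (a, b) is roughly 1/m."""
    a = rng.randrange(1, P)  # a != 0
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m
```

Such a function is described by just two integers, in contrast to the exponential description length of a truly random function.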
Path-quality monitoring in the presence of adversaries
In ACM SIGMETRICS, 2008
Abstract

Cited by 36 (9 self)
Edge networks connected to the Internet need effective monitoring techniques to drive routing decisions and detect violations of Service Level Agreements (SLAs). However, existing measurement tools, like ping, traceroute, and trajectory sampling, are vulnerable to attacks that make a path look better than it really is. In this paper, we design and analyze path-quality monitoring protocols that robustly raise an alarm when the packet-loss rate or delay exceeds a threshold, even when an adversary tries to bias monitoring results by selectively delaying, dropping, modifying, injecting, or preferentially treating packets. Despite the strong threat model we consider in this paper, our protocols are efficient enough to run at line rate on high-speed routers. We present a secure sketching protocol for identifying when packet loss and delay degrade beyond a threshold. This protocol is extremely lightweight, requiring only 250–600 bytes of storage and periodic transmission of a comparably sized IP packet. We also present secure sampling protocols that provide faster feedback and more accurate round-trip delay estimates, at the expense of somewhat higher storage and communication costs. We prove that all our protocols satisfy a precise definition of secure path-quality monitoring and derive analytic expressions for the trade-off between statistical accuracy and system overhead. We also compare how our protocols perform in the client-server setting, when paths are asymmetric, and when packet marking is not permitted.
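The flavor of the sketching approach can be illustrated with a toy second-moment sketch: sender and receiver, sharing a key, each fold the packets they see into a short vector using a keyed ±1 hash, and the squared distance between the two vectors estimates how many packets were dropped or modified in transit. This is a simplified illustration; the SHA-256-based sign function, the dimension, and the estimator are assumptions, not the paper's protocol.

```python
import hashlib

def sign_of(packet: bytes, key: bytes, i: int) -> int:
    # Keyed pseudorandom +/-1 sign for sketch coordinate i (toy PRF).
    d = hashlib.sha256(key + i.to_bytes(4, "big") + packet).digest()
    return 1 if d[0] & 1 else -1

def sketch(packets, key, dim=64):
    s = [0] * dim
    for p in packets:
        for i in range(dim):
            s[i] += sign_of(p, key, i)
    return s

def estimated_discrepancy(s_sender, s_receiver):
    # E[||difference||^2 / dim] is about the number of packets seen by
    # only one of the two sides (assuming distinct packets).
    dim = len(s_sender)
    return sum((a - b) ** 2 for a, b in zip(s_sender, s_receiver)) / dim
```

Without the shared key, an adversary dropping packets cannot predict how its drops shift the receiver's vector, which is what makes this style of estimator hard to bias.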
Summarizing and Mining Skewed Data Streams
Abstract

Cited by 35 (3 self)
Many applications generate massive data streams. Summarizing such massive data requires fast, small-space algorithms to support post-hoc queries and mining. An important observation is that such streams are rarely uniform, and real data sources typically exhibit significant skewness. These are well modeled by Zipf distributions, which are characterized by a parameter, z, that captures the amount of skew. We present a data stream summary that can answer point queries with ε accuracy and show that the space needed is only O(ε^{-min(1,1/z)}). This is the first o(1/ε) space algorithm for this problem, and we show it is essentially tight for skewed distributions. We show that the same data structure can also estimate the L2 norm of the stream in o(1/ε^2) space for z > 1/2, another improvement over the existing Ω(1/ε^2) methods. We support our theoretical results with an experimental study over a large variety of real and synthetic data. We show that significant skew is present in both textual and telecommunication data. Our methods give strong accuracy, significantly better than other methods, and behave exactly in line with their analytic bounds.
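The point-query structure here is in the Count-Min family; a plain, non-skew-tuned version (the width and depth below are illustrative defaults, not the paper's skew-dependent space bounds) looks like:

```python
import random

class CountMin:
    """Plain Count-Min sketch: depth rows of width counters; a point query
    returns the minimum counter over the rows, which never underestimates
    when all updates are non-negative."""
    def __init__(self, width=256, depth=5, seed=7):
        rng = random.Random(seed)
        self.p = (1 << 31) - 1
        self.w = width
        self.rows = [[0] * width for _ in range(depth)]
        self.ab = [(rng.randrange(1, self.p), rng.randrange(self.p))
                   for _ in range(depth)]

    def _col(self, x, a, b):
        return ((a * x + b) % self.p) % self.w

    def update(self, x, c=1):
        for row, (a, b) in zip(self.rows, self.ab):
            row[self._col(x, a, b)] += c

    def query(self, x):
        return min(row[self._col(x, a, b)]
                   for row, (a, b) in zip(self.rows, self.ab))
```

Under Zipfian input a few heavy items dominate the total mass, so the collision error hitting any one bucket is smaller than the uniform-case analysis suggests — the observation the skew-aware space bound exploits.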
Seven Years and One Day: Sketching the Evolution of Internet Traffic
Abstract

Cited by 35 (8 self)
Abstract—This contribution performs a longitudinal study of the evolution of the traffic collected every day for seven years on a trans-Pacific backbone link (the MAWI dataset). Long-term characteristics are investigated both at the TCP/IP layers (packet and flow attributes) and in application usage. The analysis of this unique dataset provides new insights into changes in traffic statistics, notably on the persistence of Long Range Dependence, induced by the ongoing increase in link bandwidth. Traffic in the MAWI dataset is subject to bandwidth changes, to congestion, and to a variety of anomalies. This allows the comparison of their impacts on the traffic statistics, but at the same time significantly impairs long-term evolution characterizations. To account for this difficulty, we show and explain how and why random-projection (sketch) based analysis procedures provide practitioners with an efficient and robust tool to disentangle actual long-term evolutions from time-localized events such as anomalies and link congestion. Our central result is a strong and persistent long-range dependence jointly controlling byte and packet counts. An additional study of a 24-hour trace complements the long-term results with an analysis of intra-day variabilities.
Sparser Johnson-Lindenstrauss Transforms
Abstract

Cited by 30 (8 self)
We give two different constructions for dimensionality reduction in ℓ2 via linear mappings that are sparse: only an O(ε)-fraction of entries in each column of our embedding matrices are non-zero to achieve distortion 1+ε with high probability, while still achieving the asymptotically optimal number of rows. These are the first constructions to provide sub-constant sparsity for all values of the parameters. Both constructions are also very simple: a vector can be embedded in two for loops. Such distributions can be used to speed up applications where ℓ2 dimensionality reduction is used.
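The "two for loops" embedding can be sketched as follows. Full randomness is used here for brevity (the paper's constructions need only limited independence), and the parameter choices are illustrative.

```python
import math
import random

def sparse_jl_embed(x, k, s, seed=0):
    """Map x into R^k with a sparse sign matrix: each input coordinate is
    spread over s of the k output rows with random +/-1 signs, scaled by
    1/sqrt(s). With the same seed and input dimension the same matrix is
    reused across calls, since the randomness is consumed in a fixed order."""
    rng = random.Random(seed)
    y = [0.0] * k
    for j, xj in enumerate(x):          # loop over input coordinates
        rows = rng.sample(range(k), s)  # the s nonzeros of column j
        for r in rows:                  # loop over that column's entries
            y[r] += rng.choice((-1.0, 1.0)) * xj / math.sqrt(s)
    return y
```

Embedding time is O(s) per input coordinate rather than O(k), which is the point of sparsity.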
Optimal Bounds for Johnson-Lindenstrauss Transforms and Streaming Problems with Sub-Constant Error
Abstract

Cited by 30 (1 self)
The Johnson-Lindenstrauss transform is a dimensionality reduction technique with a wide range of applications in theoretical computer science. It is specified by a distribution over projection matrices mapping R^d to R^k, where k ≪ d, and states that k = O(ε^-2 log(1/δ)) dimensions suffice to approximate the norm of any fixed vector in R^d to within a factor of 1 ± ε with probability at least 1 − δ. In this paper we show that this bound on k is optimal up to a constant factor, improving upon a previous Ω((ε^-2 log(1/δ)) / log(1/ε)) dimension bound of Alon. Our techniques are based on lower bounding the information cost of a novel one-way communication game and yield the first space lower bounds in a data stream model that depend on the error probability δ. For many streaming problems, the most naïve way of achieving error probability δ is to first achieve constant probability, then take the median of O(log(1/δ)) independent repetitions. Our techniques show that for a wide range of problems this is in fact optimal! As an example, we show that estimating the ℓ_p distance for any p ∈ [0, 2] requires Ω(ε^-2 log n log(1/δ)) space, even for vectors in {0,1}^n. This is optimal in all parameters and closes a long line of work on this problem. We also show that estimating the number of distinct elements requires Ω(ε^-2 log(1/δ) + log n) space, which is optimal if ε^-2 = Ω(log n). We also improve previous lower bounds for entropy in the strict turnstile and general turnstile models by a multiplicative factor of Ω(log(1/δ)). Finally, we give an application to one-way communication complexity under product distributions, showing that, unlike in the case of constant δ, the VC dimension does not characterize the complexity when δ = o(1).
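The "median of independent repetitions" amplification the abstract analyzes is easy to state in code; the noisy estimator in the usage below is a made-up stand-in for any constant-probability streaming estimate.

```python
import random
import statistics

def median_of_reps(estimator, reps, rng):
    """Run `reps` independent copies of an estimator that is correct with
    constant probability (say 3/4) and return their median; the failure
    probability then decays exponentially in `reps`."""
    return statistics.median(estimator(rng) for _ in range(reps))
```

With reps = O(log(1/δ)) this drives the error probability down to δ; the paper's contribution is showing that for many problems no cleverer scheme can do asymptotically better.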
A Derandomized Sparse Johnson-Lindenstrauss Transform
Abstract

Cited by 28 (5 self)
Recent work of [Dasgupta-Kumar-Sarlós, STOC 2010] gave a sparse Johnson-Lindenstrauss transform and left as a main open question whether their construction could be efficiently derandomized. We answer their question affirmatively by giving an alternative proof of their result requiring only bounded-independence hash functions. Furthermore, the sparsity bound obtained in our proof is improved. The main ingredient in our proof is a spectral moment bound for quadratic forms that was recently used in [Diakonikolas-Kane-Nelson, FOCS 2010].
Extracting Hidden Anomalies using Sketch and Non-Gaussian Multiresolution Statistical Detection Procedures
In LSAD'07, 2007
Abstract

Cited by 28 (9 self)
A new profile-based anomaly detection and characterization procedure is proposed. It aims at performing prompt and accurate detection of both short-lived and long-lasting low-intensity anomalies, without recourse to any prior knowledge of the targeted traffic. Key features of the algorithm lie in the joint use of random-projection techniques (sketches) and of multiresolution non-Gaussian marginal distribution modeling. The former enables both a reduction in the dimensionality of the data and the measurement of the reference (i.e., normal) traffic behavior, while the latter extracts anomalies at different aggregation levels. This procedure is used to blindly analyze a large-scale packet trace database collected on a trans-Pacific transit link from 2001 to 2006. It can detect and identify a large number of known and unknown anomalies and attacks whose intensities are low (down to below one percent). Using sketches also makes possible a real-time identification of the source or destination IP addresses associated with the detected anomaly, and hence their mitigation.