Results 1 - 10
of
24
Sketch-based Change Detection: Methods, Evaluation, and Applications
- IN INTERNET MEASUREMENT CONFERENCE
, 2003
"... Traffic anomalies such as failures and attacks are commonplace in today's network, and identifying them rapidly and accurately is critical for large network operators. The detection typically treats the traffic as a collection of flows that need to be examined for significant changes in traffic patt ..."
Abstract
-
Cited by 95 (11 self)
- Add to MetaCart
Traffic anomalies such as failures and attacks are commonplace in today's network, and identifying them rapidly and accurately is critical for large network operators. The detection typically treats the traffic as a collection of flows that need to be examined for significant changes in traffic pattern (e.g., volume, number of connections) . However, as link speeds and the number of flows increase, keeping per-flow state is either too expensive or too slow. We propose building compact summaries of the traffic data using the notion of sketches. We have designed a variant of the sketch data structure, k-ary sketch, which uses a constant, small amount of memory, and has constant per-record update and reconstruction cost. Its linearity property enables us to summarize traffic at various levels. We then implement a variety of time series forecast models (ARIMA, Holt-Winters, etc.) on top of such summaries and detect significant changes by looking for flows with large forecast errors. We also present heuristics for automatically configuring the model parameters. Using a
What's New: Finding Significant Differences in Network Data Streams
- in Proc. of IEEE Infocom
, 2004
"... Monitoring and analyzing network traffic usage patterns is vital for managing IP Networks. An important problem is to provide network managers with information about changes in traffic, informing them about "what's new". Specifically, we focus on the challenge of finding significantly large differen ..."
Abstract
-
Cited by 51 (7 self)
- Add to MetaCart
Monitoring and analyzing network traffic usage patterns is vital for managing IP Networks. An important problem is to provide network managers with information about changes in traffic, informing them about "what's new". Specifically, we focus on the challenge of finding significantly large differences in traffic: over time, between interfaces and between routers. We introduce the idea of a deltoid: an item that has a large difference, whether the difference is absolute, relative or variational. We present novel...
On scalable attack detection in the network
, 2007
"... Current intrusion detection and prevention systems seek to detect a wide class of network intrusions (e.g., DoS attacks, worms, port scans) at network vantage points. Unfortunately, even today, many IDS systems we know of keep per-connection or per-flow state to detect malicious TCP flows. Thus, it ..."
Abstract
-
Cited by 30 (1 self)
- Add to MetaCart
Current intrusion detection and prevention systems seek to detect a wide class of network intrusions (e.g., DoS attacks, worms, port scans) at network vantage points. Unfortunately, even today, many IDS systems we know of keep per-connection or per-flow state to detect malicious TCP flows. Thus, it is hardly surprising that these IDS systems have not scaled to multi-gigabit speeds. By contrast, both router lookups and fair queuing have scaled to high speeds using aggregation via prefix lookups or DiffServ. Thus, in this paper, we initiate research into the question as to whether one can detect attacks without keeping per-flow state. We will show that such aggregation, while making fast implementations possible, immediately causes two problems. First, aggregation can cause behavioral aliasing where, for example, good behaviors can aggregate to look like bad behaviors. Second, aggregated schemes are susceptible to spoofing by which the intruder sends attacks that have appropriate aggregate behavior. We examine a wide variety of DoS and scanning attacks and show that several categories (bandwidth based, claim-and-hold, port-scanning) can be scalably detected. In addition to existing approaches for scalable attack detection, we propose a novel data structure called partial completion filters (PCFs) that can detect claim-and-hold attacks scalably in the network. We analyze PCFs both analytically and using experiments on real network traces to demonstrate how we can tune PCFs to achieve extremely low false positive and false negative probabilities.
Estimating dominance norms of multiple data streams
- in Proceedings of the 11th European Symposium on Algorithms (ESA
, 2003
"... Abstract. There is much focus in the algorithms and database communities on designing tools to manage and mine data streams. Typically, data streams consist of multiple signals. Formally, a stream of multiple signals is (i, ai,j) where i’s correspond to the domain, j’s index the different signals an ..."
Abstract
-
Cited by 21 (7 self)
- Add to MetaCart
Abstract. There is much focus in the algorithms and database communities on designing tools to manage and mine data streams. Typically, data streams consist of multiple signals. Formally, a stream of multiple signals is (i, ai,j) where i’s correspond to the domain, j’s index the different signals and ai,j ≥ 0 give the value of the jth signal at point i. We study the problem of finding norms that are cumulative of the multiple signals in the data stream. For example, consider the max-dominance norm, defined as i maxj{ai,j}. It may be thought as estimating the norm of the “upper envelope ” of the multiple signals, or alternatively, as estimating the norm of the “marginal ” distribution of tabular data streams. It is used in applications to estimate the “worst case influence” of multiple processes,for example in IP traffic analysis, electrical grid monitoring and financial domain. In addition, it is a natural measure, generalizing the union of data streams or counting distinct elements in data streams. We present the first known data stream algorithms for estimating max-dominance of multiple signals. In particular, we use workspace and time-per-item that are both sublinear (in fact, poly-logarithmic) in the input size. In contrast other notions of dominance on streams a, b — min-dominance ( i minj{ai,j}), countdominance (|{i|ai> bi}|) or relative-dominance ( i ai / max{1, bi} ) — are all impossible to estimate accurately with sublinear space. 1
Summarizing and mining inverse distributions on data streams via dynamic inverse sampling
- VLDB
, 2005
"... Database management systems face the challenge of dealing with massive data distributions which arrive at high speeds while there is only small storage available for managing and mining them. Emerging data stream management systems approach this problem by summarizing and mining the distributions us ..."
Abstract
-
Cited by 13 (0 self)
- Add to MetaCart
Database management systems face the challenge of dealing with massive data distributions which arrive at high speeds while there is only small storage available for managing and mining them. Emerging data stream management systems approach this problem by summarizing and mining the distributions using samples or sketches. However, data distributions can be “viewed” in different ways. For example, a data stream of integer values can be viewed either as the forward distribution f(x), ie., the number of occurrences of x in the stream, or as its inverse, f −1 (i), which is the number of items that appear i times. While both such “views ” are equivalent in stored data systems, over data streams that entail approximations, they may be significantly different. In other words, samples and sketches developed for the forward distribution may be ineffective for summarizing or mining the inverse distribution. Yet, many applications such as IP traffic monitoring naturally rely on mining inverse distributions. We formalize the problems of managing and mining inverse distributions and show provable differences between summarizing the forward distribution vs the inverse distribution. We present methods for summarizing and mining inverse distributions of data streams: they rely on a novel technique to maintain a dynamic sample over the stream with provable guarantees which can be used for variety of summarization tasks (building quantiles or equidepth histograms) and mining (anomaly detection: finding heavy hitters, and measuring the number of rare items), all with provable guarantees on quality of approximations and time/space used by our streaming methods. We also complement our analytical and algorithmic results by presenting an experimental study of the methods over network data streams.
On Distributing Symmetric Streaming Computations
"... A common approach for dealing with large data sets is to stream over the input in one pass, and perform computations using sublinear resources. For truly massive data sets, however, even making a single pass over the data is prohibitive. Therefore, streaming computations must be distributed over man ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
A common approach for dealing with large data sets is to stream over the input in one pass, and perform computations using sublinear resources. For truly massive data sets, however, even making a single pass over the data is prohibitive. Therefore, streaming computations must be distributed over many machines. In practice, obtaining significant speedups using distributed computation has numerous challenges including synchronization, load balancing, overcoming processor failures, and data distribution. Successful systems in practice such as Google’s MapReduce and Apache’s Hadoop address these problems by only allowing a certain class of highly distributable tasks defined by local computations that can be applied in any order to the input. The fundamental question that arises is: How does the class of computational tasks supported by these systems differ from the class for which streaming solutions exist? We introduce a simple algorithmic model for massive, unordered, distributed (mud) computation, as implemented by these systems. We show that in principle, mud algorithms are equivalent in power to symmetric streaming algorithms. More precisely, we show that any symmetric (orderinvariant) function that can be computed by a streaming algorithm can also be computed by a mud algorithm, with comparable space and communication complexity. Our simulation uses Savitch’s theorem and therefore has superpolynomial time complexity. We extend our simulation result to some natural classes of approximate and randomized streaming algorithms. We also give negative results, using communication complexity arguments to prove that extensions to private randomness, promise problems and indeterminate functions are impossible. We also introduce an extension of the mud model to multiple keys and multiple rounds. 1
Gamps: Compressing multi sensor data by grouping and amplitude scaling
- In: ACM SIGMOD. (2009
"... We consider the problem of collectively approximating a set of sensor signals using the least amount of space so that any individual signal can be efficiently reconstructed within a given maximum (L∞) error ε. The problem arises naturally in applications that need to collect large amounts of data fr ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
We consider the problem of collectively approximating a set of sensor signals using the least amount of space so that any individual signal can be efficiently reconstructed within a given maximum (L∞) error ε. The problem arises naturally in applications that need to collect large amounts of data from multiple concurrent sources, such as sensors, servers and network routers, and archive them over a long period of time for offline data mining. We present GAMPS, a general framework that addresses this problem by combining several novel techniques. First, it dynamically groups multiple signals together so that signals within each group are correlated and can be maximally compressed jointly. Second, it appropriately scales the amplitudes of different signals within a group and compresses them within the maximum allowed reconstruction error bound. Our schemes are polynomial time O(α, β) approximation schemes, meaning that the maximum (L∞) error is at most αε and it uses at most β times the optimal memory. Finally, GAMPS maintains an index so that various queries can be issued directly on compressed data. Our experiments on several real-world sensor datasets show that GAMPS significantly reduces space without compromising the quality of search and query. Categories and Subject Descriptors
Distributed Deviation Detection in Sensor Networks
- SIGMOD Rec
, 2003
"... Sensor networks have recently attracted much attention, because of their potential applications in a number of different settings. The sensors can be deployed in large numbers in wide geographical areas, and can be used to monitor physical phenomena, or to detect certain events. An interesting probl ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Sensor networks have recently attracted much attention, because of their potential applications in a number of different settings. The sensors can be deployed in large numbers in wide geographical areas, and can be used to monitor physical phenomena, or to detect certain events. An interesting problem which has not been adequately addressed so far is that of distributed online deviation detection in streaming data. The identification of deviating values provides an efficient way to focus on the interesting events in the sensor network. In this work, we propose a technique for online deviation detection in streaming data. We discuss how these techniques can operate efficiently in the distributed environment of a sensor network, and discuss the tradeoffs that arise in this setting. Our techniques process as much of the data as possible in a decentralized fashion, so as to avoid unnecessary communication and computational effort. 1
Pan-private streaming algorithms
- In Proceedings of ICS
, 2010
"... Abstract: Collectors of confidential data, such as governmental agencies, hospitals, or search engine providers, can be pressured to permit data to be used for purposes other than that for which they were collected. To support the data curators, we initiate a study of pan-private algorithms; roughly ..."
Abstract
-
Cited by 10 (2 self)
- Add to MetaCart
Abstract: Collectors of confidential data, such as governmental agencies, hospitals, or search engine providers, can be pressured to permit data to be used for purposes other than that for which they were collected. To support the data curators, we initiate a study of pan-private algorithms; roughly speaking, these algorithms retain their privacy properties even if their internal state becomes visible to an adversary. Our principal focus is on streaming algorithms, where each datum may be discarded immediately after processing.
Maintaining significant stream statistics over sliding windows
- Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm
, 2006
"... In this paper, we introduce the Significant One Counting problem. Let ε and θ be respectively some user-specified error bound and threshold. The input of the problem is a stream of bits. We need to maintain some data structure that allows us to estimate the number of 1-bits in a sliding window of si ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
In this paper, we introduce the Significant One Counting problem. Let ε and θ be respectively some user-specified error bound and threshold. The input of the problem is a stream of bits. We need to maintain some data structure that allows us to estimate the number of 1-bits in a sliding window of size n such that whenever there are at least θn 1-bits in the window, the relative error of the estimate is guaranteed to be at most ε. When θ = 1/n, our problem becomes the Basic Counting problem proposed by Dataretal. [ACM-SIAM Symposium on Discrete Algorithms (2002), pp. 635–644]. We prove that any data structure for the Significant One Counting problem must use at least + log εθn) bits of memory. We also design a data structure for the problem that matches this memory bound and supports constant query and update time. Note that for fixed θ and ε, our data structure uses O(log n) bits of memory, while any data structure for the Basic Counting problem needs Ω(log 2 n) bits in the worst case. Ω ( 1 1 log2 ε θ 1

