Results 1  10
of
129
Data Streams: Algorithms and Applications
, 2005
"... In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerg ..."
Abstract

Cited by 375 (21 self)
 Add to MetaCart
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudorandom computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].1
An improved data stream summary: The CountMin sketch and its applications
 J. Algorithms
, 2004
"... Abstract. We introduce a new sublinear space data structure—the CountMin Sketch — for summarizing data streams. Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applie ..."
Abstract

Cited by 293 (36 self)
 Add to MetaCart
Abstract. We introduce a new sublinear space data structure—the CountMin Sketch — for summarizing data streams. Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc. The time and space bounds we show for using the CM sketch to solve these problems significantly improve those previously known — typically from 1/ε 2 to 1/ε in factor. 1
Approximate counts and quantiles over sliding windows
 In PODS
, 2004
"... We consider the problem of maintaining ɛapproximate counts and quantiles over a stream sliding window using limited space. We consider two types of sliding windows depending on whether the number of elements N in the window is fixed (fixedsize sliding window) or variable (variablesize sliding win ..."
Abstract

Cited by 84 (1 self)
 Add to MetaCart
We consider the problem of maintaining ɛapproximate counts and quantiles over a stream sliding window using limited space. We consider two types of sliding windows depending on whether the number of elements N in the window is fixed (fixedsize sliding window) or variable (variablesize sliding window). In a fixedsize sliding window, both the ends of the window slide synchronously over the stream. In a variablesize sliding window, an adversary slides the window ends independently, and therefore has the ability to vary the number of elements N in the window. We present various deterministic and randomized algorithms for approximate counts and quantiles. All of our algorithms require O ( 1 1 polylog ( , N)) space. For quantiles, this space ɛ ɛ requirement is an improvement over the previous best bound of O ( 1 ɛ2 polylog ( 1, N)). We believe that no previous work ɛ on spaceefficient approximate counts over sliding windows exists. 1.
Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles
 In SIGMOD
, 2005
"... While traditional database systems optimize for performance on oneshot queries, emerging largescale monitoring applications require continuous tracking of complex aggregates and datadistribution summaries over collections of physicallydistributed streams. Thus, effective solutions have to be sim ..."
Abstract

Cited by 77 (19 self)
 Add to MetaCart
While traditional database systems optimize for performance on oneshot queries, emerging largescale monitoring applications require continuous tracking of complex aggregates and datadistribution summaries over collections of physicallydistributed streams. Thus, effective solutions have to be simultaneously space efficient (at each remote site), communication efficient (across the underlying communication network), and provide continuous, guaranteedquality estimates. In this paper, we propose novel algorithmic solutions for the problem of continuously tracking complex holistic aggregates in such a distributedstreams setting — our primary focus is on approximate quantile summaries, but our approach is more broadly applicable and can handle other holisticaggregate functions (e.g., “heavyhitters ” queries). We present the first known distributedtracking schemes for maintaining accurate quantile estimates with provable approximation guarantees, while simultaneously optimizing the storage space at each remote site as well as the communication cost across the network. In a nutshell, our algorithms employ a combination of local tracking at remote sites and simple prediction models for local site behavior in order to produce highly communication and spaceefficient solutions. We perform extensive experiments with real and synthetic data to explore the various tradeoffs and understand the role of prediction models in our schemes. The results clearly validate our approach, revealing significant savings over naive solutions as well as our analytical worstcase guarantees. 1.
What's New: Finding Significant Differences in Network Data Streams
 in Proc. of IEEE Infocom
, 2004
"... Monitoring and analyzing network traffic usage patterns is vital for managing IP Networks. An important problem is to provide network managers with information about changes in traffic, informing them about "what's new". Specifically, we focus on the challenge of finding significantly large differen ..."
Abstract

Cited by 67 (8 self)
 Add to MetaCart
Monitoring and analyzing network traffic usage patterns is vital for managing IP Networks. An important problem is to provide network managers with information about changes in traffic, informing them about "what's new". Specifically, we focus on the challenge of finding significantly large differences in traffic: over time, between interfaces and between routers. We introduce the idea of a deltoid: an item that has a large difference, whether the difference is absolute, relative or variational. We present novel...
Combinatorial Algorithms for Compressed Sensing
 In Proc. of SIROCCO
, 2006
"... Abstract — In sparse approximation theory, the fundamental problem is to reconstruct a signal A ∈ R n from linear measurements 〈A, ψi 〉 with respect to a dictionary of ψi’s. Recently, there is focus on the novel direction of Compressed Sensing [1] where the reconstruction can be done with very few—O ..."
Abstract

Cited by 64 (1 self)
 Add to MetaCart
Abstract — In sparse approximation theory, the fundamental problem is to reconstruct a signal A ∈ R n from linear measurements 〈A, ψi 〉 with respect to a dictionary of ψi’s. Recently, there is focus on the novel direction of Compressed Sensing [1] where the reconstruction can be done with very few—O(k log n)— linear measurements over a modified dictionary if the signal is compressible, that is, its information is concentrated in k coefficients with the original dictionary. In particular, these results [1], [2], [3] prove that there exists a single O(k log n) × n measurement matrix such that any such signal can be reconstructed from these measurements, with error at most O(1) times the worst case error for the class of such signals. Compressed sensing has generated tremendous excitement both because of the sophisticated underlying Mathematics and because of its potential applications. In this paper, we address outstanding open problems in Compressed Sensing. Our main result is an explicit construction of a nonadaptive measurement matrix and the corresponding reconstruction algorithm so that with a number of measurements polynomial in k, log n, 1/ε, we can reconstruct compressible signals. This is the first known polynomial time explicit construction of any such measurement matrix. In addition, our result improves the error guarantee from O(1) to 1 + ε and improves the reconstruction time from poly(n) to poly(k log n). Our second result is a randomized construction of O(k polylog(n)) measurements that work for each signal with high probability and gives perinstance approximation guarantees rather than over the class of all signals. Previous work on Compressed Sensing does not provide such perinstance approximation guarantees; our result improves the best known number of measurements known from prior work in other areas including Learning Theory [4], [5], Streaming algorithms [6], [7], [8] and Complexity Theory [9] for this case. Our approach is combinatorial. In particular, we use two parallel sets of group tests, one to filter and the other to certify and estimate; the resulting algorithms are quite simple to implement. I.
Online identification of hierarchical heavy hitters: Algorithms, evaluation, and applications
 In Proceedings of the 4th ACM SIGCOMM Internet Measurement Conference
, 2004
"... In traffic monitoring, accounting, and network anomaly detection, it is often important to be able to detect highvolume traffic clusters in near realtime. Such heavyhitter traffic clusters are often hierarchical (i.e., they may occur at different aggregation levels like ranges of IP addresses) an ..."
Abstract

Cited by 48 (10 self)
 Add to MetaCart
In traffic monitoring, accounting, and network anomaly detection, it is often important to be able to detect highvolume traffic clusters in near realtime. Such heavyhitter traffic clusters are often hierarchical (i.e., they may occur at different aggregation levels like ranges of IP addresses) and possibly multidimensional (i.e., they may involve the combination of different IP header fields like IP addresses, port numbers, and protocol). Without prior knowledge about the precise structures of such traffic clusters, a naive approach would require the monitoring system to examine all possible combinations of aggregates in order to detect the heavy hitters, which can be prohibitive in terms of computation resources. In this paper, we focus on online identification of 1dimensional and 2dimensional hierarchical heavy hitters (HHHs), arguably the two most important scenarios in traffic analysis. We show that the
Efficient Computation of Frequent and Topk Elements in Data Streams
 IN ICDT
, 2005
"... We propose an approximate integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and topk elements with tight guarantees on errors. For ..."
Abstract

Cited by 47 (6 self)
 Add to MetaCart
We propose an approximate integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and topk elements with tight guarantees on errors. For general data distributions, our topk algorithm returns k elements that have roughly the highest frequencies; and it uses limited space for calculating frequent elements. For realistic Zipfian data, the space requirement of the proposed algorithm for solving the exact frequent elements problem decreases dramatically with the parameter of the distribution; and for topk queries, the analysis ensures that only the topk elements, in the correct order, are reported. The experiments, using real and synthetic data sets, show space reductions with no loss in accuracy. Having proved the effectiveness of the proposed approach through both analysis and experiments, we extend it to be able to answer continuous queries about frequent and topk elements. Although the problems of incremental reporting of frequent and topk elements are useful in many applications, to the best of our knowledge, no solution has been proposed.
Finding hierarchical heavy hitters in data streams
 In Proc. of VLDB
, 2003
"... Aggregation along hierarchies is a critical summary technique in a large variety of online applications including decision support, and network management (e.g., IP clustering, denialofservice attack monitoring). Despite the amount of recent study that has been dedicated to online aggregation on s ..."
Abstract

Cited by 46 (5 self)
 Add to MetaCart
Aggregation along hierarchies is a critical summary technique in a large variety of online applications including decision support, and network management (e.g., IP clustering, denialofservice attack monitoring). Despite the amount of recent study that has been dedicated to online aggregation on sets (e.g., quantiles, hot items), surprisingly little attention has been paid to summarizing hierarchical structure in stream data. The problem we study in this paper is that of finding Hierarchical Heavy Hitters (HHH): given a hierarchy and a fraction φ, we want to find all HHH nodes that have a total number of descendants in the data stream larger than φ of the total number of elements in the data stream, after discounting the descendant nodes that are HHH nodes. The resulting summary gives a topological “cartogram ” of the hierarchical data. We present deterministic and randomized algorithms for finding HHHs, which builds upon existing techniques by incorporating the hierarchy into the algorithms. Our experiments demonstrate several factors of improvement in accuracy over
Identifying Frequent Items in Sliding Windows over
 IN PROCEEDINGS OF THE INTERNET MEASUREMENT CONFERENCE
, 2003
"... ..."