Data Streams: Algorithms and Applications
, 2005
Cited by 543 (23 self)
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudorandom computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].
An improved data stream summary: The Count-Min sketch and its applications
 J. Algorithms
, 2004
Cited by 413 (44 self)
We introduce a new sublinear space data structure, the Count-Min Sketch, for summarizing data streams. Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc. The time and space bounds we show for using the CM sketch to solve these problems significantly improve those previously known, typically from 1/ε² to 1/ε in factor.
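To make the point-query idea concrete, here is a minimal Python sketch of a Count-Min style summary. The abstract above does not give the construction, so everything here is an assumption for illustration: the class name `CountMinSketch`, the sizing `width = ceil(e/ε)` and `depth = ceil(ln(1/δ))` (the setting usually quoted for this structure), and the simple linear hash family built on Python's built-in `hash`. It is not the authors' code.

```python
import math
import random


class CountMinSketch:
    """Illustrative Count-Min style summary for non-negative updates."""

    def __init__(self, epsilon, delta, seed=0):
        # Assumed sizing: width ~ e/epsilon counters per row, depth ~ ln(1/delta) rows.
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))
        self.prime = (1 << 61) - 1  # a Mersenne prime used as the hash modulus
        rng = random.Random(seed)
        # One (a, b) pair per row defines h(x) = ((a*x + b) mod p) mod width.
        self.hashes = [
            (rng.randrange(1, self.prime), rng.randrange(self.prime))
            for _ in range(self.depth)
        ]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row, item):
        a, b = self.hashes[row]
        return ((a * hash(item) + b) % self.prime) % self.width

    def update(self, item, count=1):
        # Add the count to one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Point query: collisions only inflate counters, so take the minimum.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )
```

With non-negative updates the estimate never falls below the true count; how far it can exceed it depends on the width and on the quality of the hash functions, which this sketch does not analyze.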
Combinatorial Algorithms for Compressed Sensing
 In Proc. of SIROCCO
, 2006
Cited by 117 (1 self)
In sparse approximation theory, the fundamental problem is to reconstruct a signal A ∈ R^n from linear measurements ⟨A, ψ_i⟩ with respect to a dictionary of ψ_i's. Recently, there is focus on the novel direction of Compressed Sensing [1] where the reconstruction can be done with very few, O(k log n), linear measurements over a modified dictionary if the signal is compressible, that is, its information is concentrated in k coefficients with the original dictionary. In particular, these results [1], [2], [3] prove that there exists a single O(k log n) × n measurement matrix such that any such signal can be reconstructed from these measurements, with error at most O(1) times the worst case error for the class of such signals. Compressed sensing has generated tremendous excitement both because of the sophisticated underlying Mathematics and because of its potential applications. In this paper, we address outstanding open problems in Compressed Sensing. Our main result is an explicit construction of a nonadaptive measurement matrix and the corresponding reconstruction algorithm so that with a number of measurements polynomial in k, log n, 1/ε, we can reconstruct compressible signals. This is the first known polynomial time explicit construction of any such measurement matrix. In addition, our result improves the error guarantee from O(1) to 1 + ε and improves the reconstruction time from poly(n) to poly(k log n). Our second result is a randomized construction of O(k polylog(n)) measurements that work for each signal with high probability and gives per-instance approximation guarantees rather than over the class of all signals. Previous work on Compressed Sensing does not provide such per-instance approximation guarantees; our result improves the best known number of measurements known from prior work in other areas including Learning Theory [4], [5], Streaming algorithms [6], [7], [8] and Complexity Theory [9] for this case. Our approach is combinatorial. In particular, we use two parallel sets of group tests, one to filter and the other to certify and estimate; the resulting algorithms are quite simple to implement.
Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles
 In SIGMOD
, 2005
Cited by 102 (23 self)
While traditional database systems optimize for performance on one-shot queries, emerging large-scale monitoring applications require continuous tracking of complex aggregates and data-distribution summaries over collections of physically distributed streams. Thus, effective solutions have to be simultaneously space efficient (at each remote site), communication efficient (across the underlying communication network), and provide continuous, guaranteed-quality estimates. In this paper, we propose novel algorithmic solutions for the problem of continuously tracking complex holistic aggregates in such a distributed-streams setting; our primary focus is on approximate quantile summaries, but our approach is more broadly applicable and can handle other holistic-aggregate functions (e.g., “heavy-hitters” queries). We present the first known distributed-tracking schemes for maintaining accurate quantile estimates with provable approximation guarantees, while simultaneously optimizing the storage space at each remote site as well as the communication cost across the network. In a nutshell, our algorithms employ a combination of local tracking at remote sites and simple prediction models for local site behavior in order to produce highly communication- and space-efficient solutions. We perform extensive experiments with real and synthetic data to explore the various tradeoffs and understand the role of prediction models in our schemes. The results clearly validate our approach, revealing significant savings over naive solutions as well as our analytical worst-case guarantees.
Approximate Counts and Quantiles over Sliding Windows
 Proc. of ACM PODS Symp
, 2004
Cited by 99 (1 self)
We consider the problem of maintaining approximate counts and quantiles over fixed- and variable-size sliding windows in limited space. For quantiles, we present deterministic algorithms whose space requirements are O((1/ε) log(1/ε) log N) and O(
What's New: Finding Significant Differences in Network Data Streams
 in Proc. of IEEE Infocom
, 2004
Cited by 85 (8 self)
Monitoring and analyzing network traffic usage patterns is vital for managing IP Networks. An important problem is to provide network managers with information about changes in traffic, informing them about "what's new". Specifically, we focus on the challenge of finding significantly large differences in traffic: over time, between interfaces and between routers. We introduce the idea of a deltoid: an item that has a large difference, whether the difference is absolute, relative or variational. We present novel...
Efficient Computation of Frequent and Top-k Elements in Data Streams
 In ICDT
, 2005
Cited by 69 (7 self)
We propose an approximate integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and top-k elements with tight guarantees on errors. For general data distributions, our top-k algorithm returns k elements that have roughly the highest frequencies; and it uses limited space for calculating frequent elements. For realistic Zipfian data, the space requirement of the proposed algorithm for solving the exact frequent elements problem decreases dramatically with the parameter of the distribution; and for top-k queries, the analysis ensures that only the top-k elements, in the correct order, are reported. The experiments, using real and synthetic data sets, show space reductions with no loss in accuracy. Having proved the effectiveness of the proposed approach through both analysis and experiments, we extend it to be able to answer continuous queries about frequent and top-k elements. Although the problems of incremental reporting of frequent and top-k elements are useful in many applications, to the best of our knowledge, no solution has been proposed.
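The abstract above does not spell out the algorithm, so the following Python sketch is only an illustration of one counter-based way to track frequent and top-k candidates in bounded space, in the style usually called Space-Saving. The function name `space_saving`, the `capacity` parameter, and the `[count, error]` layout are assumptions made here, and whether this matches the paper's exact procedure is not confirmed by the text.

```python
def space_saving(stream, capacity):
    """Keep at most `capacity` counters; return {item: [count, error]}.

    For every monitored item, count - error <= true frequency <= count.
    """
    counters = {}
    for item in stream:
        if item in counters:
            # Already monitored: just increment its counter.
            counters[item][0] += 1
        elif len(counters) < capacity:
            # Free slot: start monitoring with an exact count.
            counters[item] = [1, 0]
        else:
            # No free slot: evict the item with the smallest count and let
            # the newcomer inherit that count as its possible overestimate.
            victim = min(counters, key=lambda k: counters[k][0])
            min_count = counters.pop(victim)[0]
            counters[item] = [min_count + 1, min_count]
    return counters
```

The linear scan for the minimum keeps the example short; a practical version would use a structure that finds the smallest counter in constant time.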
Online identification of hierarchical heavy hitters: Algorithms, evaluation, and applications
 In Proceedings of the 4th ACM SIGCOMM Internet Measurement Conference
, 2004
Cited by 66 (11 self)
In traffic monitoring, accounting, and network anomaly detection, it is often important to be able to detect high-volume traffic clusters in near real-time. Such heavy-hitter traffic clusters are often hierarchical (i.e., they may occur at different aggregation levels like ranges of IP addresses) and possibly multidimensional (i.e., they may involve the combination of different IP header fields like IP addresses, port numbers, and protocol). Without prior knowledge about the precise structures of such traffic clusters, a naive approach would require the monitoring system to examine all possible combinations of aggregates in order to detect the heavy hitters, which can be prohibitive in terms of computation resources. In this paper, we focus on online identification of 1-dimensional and 2-dimensional hierarchical heavy hitters (HHHs), arguably the two most important scenarios in traffic analysis. We show that the
Finding hierarchical heavy hitters in data streams
 In Proc. of VLDB
, 2003
Cited by 60 (8 self)
Aggregation along hierarchies is a critical summary technique in a large variety of online applications including decision support and network management (e.g., IP clustering, denial-of-service attack monitoring). Despite the amount of recent study that has been dedicated to online aggregation on sets (e.g., quantiles, hot items), surprisingly little attention has been paid to summarizing hierarchical structure in stream data. The problem we study in this paper is that of finding Hierarchical Heavy Hitters (HHH): given a hierarchy and a fraction φ, we want to find all HHH nodes that have a total number of descendants in the data stream larger than φ of the total number of elements in the data stream, after discounting the descendant nodes that are HHH nodes. The resulting summary gives a topological “cartogram” of the hierarchical data. We present deterministic and randomized algorithms for finding HHHs, which build upon existing techniques by incorporating the hierarchy into the algorithms. Our experiments demonstrate several factors of improvement in accuracy over
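The HHH definition quoted above can be illustrated with a small exact, offline computation in Python. This is not one of the paper's streaming algorithms: it stores every item and exists only to show how discounting works. The representation is an assumption made here: each item is a tuple (for example the octets of an IP address), the ancestors of an item are its proper prefixes, and the function name `hierarchical_heavy_hitters` is hypothetical.

```python
from collections import Counter


def hierarchical_heavy_hitters(stream, phi):
    """Exact offline HHHs of a list of tuples under the prefix hierarchy.

    A node is reported when the count of its descendants, excluding those
    already covered by a reported descendant, is larger than phi * len(stream).
    """
    threshold = phi * len(stream)
    residual = Counter(stream)  # counts not yet claimed by any HHH
    result = {}
    max_depth = max((len(x) for x in stream), default=0)
    # Walk the hierarchy bottom-up, from the longest prefixes to the root ().
    for depth in range(max_depth, -1, -1):
        level = {n: c for n, c in residual.items() if len(n) == depth}
        for node, count in level.items():
            del residual[node]
            if count > threshold:
                # Report the node; its count is discounted from all ancestors.
                result[node] = count
            elif depth > 0:
                # Not heavy on its own: pass the count up to the parent.
                residual[node[:-1]] += count
    return result
```

For example, with ten items and φ = 0.3, a leaf seen five times is reported by itself, and a prefix is reported only if the remaining traffic under it still exceeds three items.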
Fast and approximate stream mining of quantiles and frequencies using graphics processors
 In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM
, 2005
Cited by 55 (13 self)
We present algorithms for fast quantile and frequency estimation in large data streams using graphics processors (GPUs). We exploit the high computation power and memory bandwidth of graphics processors and present a new sorting algorithm that performs rasterization operations on the GPUs. We use sorting as the main computational component for histogram approximation and construction of approximate quantile and frequency summaries. Our algorithms for numerical statistics computation on data streams are deterministic, applicable to fixed or variable-sized sliding windows and use a limited memory footprint. We use the GPU as a coprocessor and minimize the data transmission between the CPU and GPU by taking into account the low bus bandwidth. We implemented our algorithms on a PC with an NVIDIA GeForce FX 6800 Ultra GPU and a 3.4 GHz Pentium IV CPU and applied them to large data streams consisting of more than 100 million values. We also compared the performance of our GPU-based algorithms with optimized implementations of prior CPU-based algorithms. Overall, our results demonstrate that the graphics processors available on a commodity computer system are efficient stream processors and useful coprocessors for mining data streams.