Results 1  10
of
44
An improved data stream summary: The CountMin sketch and its applications
 J. Algorithms
, 2004
"... Abstract. We introduce a new sublinear space data structure—the CountMin Sketch — for summarizing data streams. Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applie ..."
Abstract

Cited by 412 (44 self)
 Add to MetaCart
Abstract. We introduce a new sublinear space data structure—the CountMin Sketch — for summarizing data streams. Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc. The time and space bounds we show for using the CM sketch to solve these problems significantly improve those previously known — typically from 1/ε 2 to 1/ε in factor. 1
Finding frequent items in data streams
, 2002
"... Abstract. We present a 1pass algorithm for estimating the most frequent items in a data stream using very limited storage space. Our method relies on a novel data structure called a count sketch, which allows us to estimate the frequencies of all the items in the stream. Our algorithm achieves bett ..."
Abstract

Cited by 344 (0 self)
 Add to MetaCart
Abstract. We present a 1pass algorithm for estimating the most frequent items in a data stream using very limited storage space. Our method relies on a novel data structure called a count sketch, which allows us to estimate the frequencies of all the items in the stream. Our algorithm achieves better space bounds than the previous best known algorithms for this problem for many natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2pass algorithm for the problem of estimating the items with the largest (absolute) change in frequency between two data streams. To our knowledge, this problem has not been previously studied in the literature. 1
An Information Statistics Approach to Data Stream and Communication Complexity
, 2003
"... We present a new method for proving strong lower bounds in communication complexity. ..."
Abstract

Cited by 240 (8 self)
 Add to MetaCart
We present a new method for proving strong lower bounds in communication complexity.
What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically
, 2003
"... Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items” in the relation: those that appear many times (most frequently, or more than some threshold). For example, endbiased histograms keep the hot items as part of t ..."
Abstract

Cited by 201 (13 self)
 Add to MetaCart
Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items” in the relation: those that appear many times (most frequently, or more than some threshold). For example, endbiased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in networking applications. We present a new algorithm for dynamically determining the hot items at any time in the relation that is undergoing deletion operations as well as inserts. Our algorithm maintains a small space data structure that monitors the transactions on the relation, and when required, quickly outputs all hot items, without rescanning the relation in the database. With userspecified probability, it is able to report all hot items. Our algorithm relies on the idea of “group testing”, is simple to implement, and has provable quality, space and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees can not handle deletions, and those that handle deletions can not make similar guarantees without rescanning the database. Our experiments with real and synthetic data shows that our algorithm is remarkably accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
Clustering data streams: Theory and practice
 IEEE TKDE
, 2003
"... Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little ..."
Abstract

Cited by 154 (4 self)
 Add to MetaCart
(Show Context)
Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams. Index Terms—Clustering, data streams, approximation algorithms. 1
Combinatorial Algorithms for Compressed Sensing
 In Proc. of SIROCCO
, 2006
"... Abstract — In sparse approximation theory, the fundamental problem is to reconstruct a signal A ∈ R n from linear measurements 〈A, ψi 〉 with respect to a dictionary of ψi’s. Recently, there is focus on the novel direction of Compressed Sensing [1] where the reconstruction can be done with very few—O ..."
Abstract

Cited by 116 (1 self)
 Add to MetaCart
(Show Context)
Abstract — In sparse approximation theory, the fundamental problem is to reconstruct a signal A ∈ R n from linear measurements 〈A, ψi 〉 with respect to a dictionary of ψi’s. Recently, there is focus on the novel direction of Compressed Sensing [1] where the reconstruction can be done with very few—O(k log n)— linear measurements over a modified dictionary if the signal is compressible, that is, its information is concentrated in k coefficients with the original dictionary. In particular, these results [1], [2], [3] prove that there exists a single O(k log n) × n measurement matrix such that any such signal can be reconstructed from these measurements, with error at most O(1) times the worst case error for the class of such signals. Compressed sensing has generated tremendous excitement both because of the sophisticated underlying Mathematics and because of its potential applications. In this paper, we address outstanding open problems in Compressed Sensing. Our main result is an explicit construction of a nonadaptive measurement matrix and the corresponding reconstruction algorithm so that with a number of measurements polynomial in k, log n, 1/ε, we can reconstruct compressible signals. This is the first known polynomial time explicit construction of any such measurement matrix. In addition, our result improves the error guarantee from O(1) to 1 + ε and improves the reconstruction time from poly(n) to poly(k log n). Our second result is a randomized construction of O(k polylog(n)) measurements that work for each signal with high probability and gives perinstance approximation guarantees rather than over the class of all signals. Previous work on Compressed Sensing does not provide such perinstance approximation guarantees; our result improves the best known number of measurements known from prior work in other areas including Learning Theory [4], [5], Streaming algorithms [6], [7], [8] and Complexity Theory [9] for this case. Our approach is combinatorial. In particular, we use two parallel sets of group tests, one to filter and the other to certify and estimate; the resulting algorithms are quite simple to implement. I.
Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication
 SIAM Journal on Computing
, 2004
"... ..."
(Show Context)
Characterizing Memory Requirements for Queries over Continuous Data
 In PODS
, 2002
"... This paper deals with continuous conjunctive queries with arithmetic comparisons and optional aggregation over multiple data streams. An algorithm is presented for determining whether or not any given query can be evaluated using a bounded amount of memory for all possible instances of the data stre ..."
Abstract

Cited by 99 (11 self)
 Add to MetaCart
This paper deals with continuous conjunctive queries with arithmetic comparisons and optional aggregation over multiple data streams. An algorithm is presented for determining whether or not any given query can be evaluated using a bounded amount of memory for all possible instances of the data streams. For queries that can be evaluated using bounded memory, an execution strategy based on constantsized synopses of the data streams is proposed. For queries that cannot be evaluated using bounded memory, data stream scenarios are identified in which evaluating the queries requires memory linear in the size of the unbounded streams
What's New: Finding Significant Differences in Network Data Streams
 in Proc. of IEEE Infocom
, 2004
"... Monitoring and analyzing network traffic usage patterns is vital for managing IP Networks. An important problem is to provide network managers with information about changes in traffic, informing them about "what's new". Specifically, we focus on the challenge of finding significantly ..."
Abstract

Cited by 85 (8 self)
 Add to MetaCart
(Show Context)
Monitoring and analyzing network traffic usage patterns is vital for managing IP Networks. An important problem is to provide network managers with information about changes in traffic, informing them about "what's new". Specifically, we focus on the challenge of finding significantly large differences in traffic: over time, between interfaces and between routers. We introduce the idea of a deltoid: an item that has a large difference, whether the difference is absolute, relative or variational. We present novel...
Graph distances in the streaming model: the value of space
 In ACMSIAM Symposium on Discrete Algorithms
, 2005
"... We investigate the importance of space when solving problems based on graph distance in the streaming model. In this model, the input graph is presented as a stream of edges in an arbitrary order. The main computational restriction of the model is that we have limited space and therefore cannot stor ..."
Abstract

Cited by 69 (11 self)
 Add to MetaCart
We investigate the importance of space when solving problems based on graph distance in the streaming model. In this model, the input graph is presented as a stream of edges in an arbitrary order. The main computational restriction of the model is that we have limited space and therefore cannot store all the streamed data; we are forced to make spaceefficient summaries of the data as we go along. For a graph of n vertices and m edges, we show that testing many graph properties, including connectivity (ergo any reasonable decision problem about distances) and bipartiteness, requires Ω(n) bits of space. Given this, we then investigate how the power of the model increases as we relax our space restriction. Our main result is an efficient randomized algorithm that constructs a (2t + 1)spanner in one pass. With high probability, it uses O(t · n 1+1/t log 2 n) bits of space and processes each edge in the stream in O(t 2 · n 1/t log n) time. We find approximations to diameter and girth via the log n constructed spanner. For t = Ω (), the space log log n requirement of the algorithm is O(n·polylog n), and the peredge processing time is O(polylog n). We also show a corresponding lower bound of t for the approximation ratio achievable when the space restriction is O(t · n1+1/t log 2 n). We then consider the scenario in which we are allowed multiple passes over the input stream. Here, we investigate whether allowing these extra passes will compensate for a given space restriction. We show that ∗This work was supported by the DoD University Research Initiative (URI) administered by the Office of Naval Research