Models and issues in data stream systems
In PODS, 2002
Cited by 770 (19 self)
In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.
Similarity estimation techniques from rounding algorithms
In Proc. of 34th STOC, 2002
Cited by 436 (6 self)
A locality sensitive hashing scheme is a distribution on a family $F$ of hash functions operating on a collection of objects, such that for two objects $x, y$, $\Pr_{h \in F}[h(x) = h(y)] = \mathrm{sim}(x, y)$, where $\mathrm{sim}(x, y) \in [0, 1]$ is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Minwise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure $\mathrm{sim}(A, B) = |A \cap B| / |A \cup B|$. We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:
1. A collection of vectors with the distance between $\vec{u}$ and $\vec{v}$ measured by $\theta(\vec{u}, \vec{v})/\pi$, where $\theta(\vec{u}, \vec{v})$ is the angle between $\vec{u}$ and $\vec{v}$. This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity.
2. A collection of distributions on $n$ points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), a popular distance measure in graphics and vision. Our hash functions map distributions to points in the metric space such that, for distributions $P$ and $Q$,
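The vector scheme in item 1 can be illustrated with random hyperplanes. The sketch below is a minimal illustration, not the paper's code: it assumes the usual construction in which each hash bit is the sign of the dot product with a random Gaussian vector, so that two vectors disagree on a bit with probability $\theta/\pi$. The function names, the bit count, and the seed are all hypothetical choices made for this example.

```python
import math
import random


def random_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian vectors; each one defines one hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def sketch(vec, planes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side, else 0."""
    return [1 if sum(r * x for r, x in zip(plane, vec)) >= 0 else 0
            for plane in planes]


def estimate_angle(sketch_u, sketch_v):
    """The fraction of disagreeing bits estimates theta / pi."""
    disagree = sum(a != b for a, b in zip(sketch_u, sketch_v))
    return math.pi * disagree / len(sketch_u)


def estimate_cosine(sketch_u, sketch_v):
    """Cosine similarity recovered from the estimated angle."""
    return math.cos(estimate_angle(sketch_u, sketch_v))


if __name__ == "__main__":
    planes = random_hyperplanes(dim=3, num_bits=2000, seed=42)
    u = [1.0, 2.0, 0.5]
    v = [0.9, 2.1, 0.4]
    print("estimated cosine:", estimate_cosine(sketch(u, planes), sketch(v, planes)))
```

More bits reduce the variance of the estimate; the sketches themselves can be compared without access to the original vectors.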
An improved data stream summary: The Count-Min sketch and its applications
J. Algorithms, 2004
Cited by 412 (44 self)
We introduce a new sublinear space data structure, the Count-Min Sketch, for summarizing data streams. Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc. The time and space bounds we show for using the CM sketch to solve these problems significantly improve those previously known, typically from $1/\varepsilon^2$ to $1/\varepsilon$ in factor.
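A point query on such a sketch can be illustrated as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses a table of width $\lceil e/\varepsilon \rceil$ and depth $\lceil \ln(1/\delta) \rceil$, restricts items to non-negative integers, and uses simple hash functions of the form $((a x + b) \bmod p) \bmod w$. The class name, method names, and seed are hypothetical.

```python
import math
import random

PRIME = (1 << 61) - 1  # modulus for the hash functions; items must be smaller


class CountMinSketch:
    """Approximate counts for non-negative integer items (illustrative only)."""

    def __init__(self, eps, delta, seed=0):
        self.width = math.ceil(math.e / eps)
        self.depth = max(1, math.ceil(math.log(1.0 / delta)))
        rng = random.Random(seed)
        # One (a, b) pair per row defines that row's hash function.
        self.params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                       for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _bucket(self, row, item):
        a, b = self.params[row]
        return ((a * item + b) % PRIME) % self.width

    def update(self, item, count=1):
        """Add count to the item's bucket in every row."""
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def query(self, item):
        """Point query: the minimum over rows never underestimates
        the true count when all updates are non-negative."""
        return min(self.table[row][self._bucket(row, item)]
                   for row in range(self.depth))


if __name__ == "__main__":
    cms = CountMinSketch(eps=0.01, delta=0.01)
    for x in [3, 3, 3, 8, 8, 15]:
        cms.update(x)
    print("estimate for item 3:", cms.query(3))
```

Collisions can only inflate a bucket, which is why the minimum over rows is taken as the estimate.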
Finding frequent items in data streams
2002
Cited by 344 (0 self)
We present a 1-pass algorithm for estimating the most frequent items in a data stream using very limited storage space. Our method relies on a novel data structure called a count sketch, which allows us to estimate the frequencies of all the items in the stream. Our algorithm achieves better space bounds than the previous best known algorithms for this problem for many natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2-pass algorithm for the problem of estimating the items with the largest (absolute) change in frequency between two data streams. To our knowledge, this problem has not been previously studied in the literature.
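The frequency estimation step can be illustrated as below. This is a hedged sketch of the general count-sketch idea (each row adds a pseudo-random sign times the count to one bucket, and the estimate is the median over rows), not the paper's exact construction. The hash functions are a simple stand-in, items are assumed to be non-negative integers, and every name here is hypothetical.

```python
import random
import statistics

PRIME = (1 << 61) - 1  # modulus for the stand-in hash functions


class CountSketch:
    """Signed-counter frequency estimator (illustrative only)."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth  # an odd depth keeps the median an integer
        self.bucket_params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                              for _ in range(depth)]
        self.sign_params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                            for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        a, b = self.bucket_params[row]
        return ((a * item + b) % PRIME) % self.width

    def _sign(self, row, item):
        a, b = self.sign_params[row]
        return 1 if ((a * item + b) % PRIME) & 1 else -1

    def update(self, item, count=1):
        """Add sign * count to one bucket per row; count may be negative."""
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += self._sign(row, item) * count

    def estimate(self, item):
        """Median over rows of the sign-corrected bucket values."""
        return statistics.median(
            self._sign(row, item) * self.table[row][self._bucket(row, item)]
            for row in range(self.depth))


if __name__ == "__main__":
    cs = CountSketch(width=64, depth=5)
    for x in [4, 4, 4, 4, 9, 11]:
        cs.update(x)
    print("estimate for item 4:", cs.estimate(4))
```

Because updates are linear, feeding one stream with positive counts and a second with negative counts leaves a sketch of the frequency differences, which is one way to read the change-detection use mentioned in the abstract.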
An Information Statistics Approach to Data Stream and Communication Complexity
2003
Cited by 240 (8 self)
We present a new method for proving strong lower bounds in communication complexity.
What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically
2003
Cited by 201 (13 self)
Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items” in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in networking applications. We present a new algorithm for dynamically determining the hot items at any time in a relation that is undergoing deletion operations as well as inserts. Our algorithm maintains a small-space data structure that monitors the transactions on the relation, and when required, quickly outputs all hot items, without rescanning the relation in the database. With user-specified probability, it is able to report all hot items. Our algorithm relies on the idea of “group testing”, is simple to implement, and has provable quality, space and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithm is remarkably accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
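The group-testing idea can be illustrated in its simplest form: a single group and a single hot item that holds a strict majority. The sketch below is an assumption-laden illustration, not the paper's algorithm: it keeps one counter for the total and one counter per bit position, and reads the majority item off bit by bit. It assumes non-negative integer items of at most 32 bits; the class and method names are hypothetical.

```python
class MajorityTracker:
    """Recover a strict-majority item under inserts and deletes
    using one counter per bit position (illustrative only)."""

    def __init__(self, bits=32):
        self.bits = bits
        self.total = 0                 # current number of items
        self.bit_counts = [0] * bits   # items whose j-th bit is set

    def _update(self, item, delta):
        self.total += delta
        for j in range(self.bits):
            if (item >> j) & 1:
                self.bit_counts[j] += delta

    def insert(self, item):
        self._update(item, 1)

    def delete(self, item):
        self._update(item, -1)

    def candidate(self):
        """If some item holds a strict majority, return it.
        Otherwise the result is an arbitrary value, or None when empty."""
        if self.total <= 0:
            return None
        result = 0
        for j in range(self.bits):
            # The majority item has bit j set exactly when more than
            # half of all current items have bit j set.
            if 2 * self.bit_counts[j] > self.total:
                result |= 1 << j
        return result


if __name__ == "__main__":
    tracker = MajorityTracker()
    for x in [7] * 6 + [9] * 3 + [12] * 2:
        tracker.insert(x)
    print("majority candidate:", tracker.candidate())
```

Deletions are handled by the same counter arithmetic as insertions, so no rescan of the data is needed. Tracking several hot items at lower thresholds would require splitting items into multiple groups, which this example does not attempt.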
COMBINING GEOMETRY AND COMBINATORICS: A UNIFIED APPROACH TO SPARSE SIGNAL RECOVERY
Cited by 161 (15 self)
There are two main algorithmic approaches to sparse signal recovery: geometric and combinatorial. The geometric approach starts with a geometric constraint on the measurement matrix $\Phi$ and then uses linear programming to decode information about $x$ from $\Phi x$. The combinatorial approach constructs $\Phi$ and a combinatorial decoding algorithm to match. We present a unified approach to these two classes of sparse signal recovery algorithms. The unifying elements are the adjacency matrices of high-quality unbalanced expanders. We generalize the notion of Restricted Isometry Property (RIP), crucial to compressed sensing results for signal recovery, from the Euclidean norm to the $\ell_p$ norm for $p \approx 1$, and then show that unbalanced expanders are essentially equivalent to RIP-p matrices. From known deterministic constructions for such matrices, we obtain new deterministic measurement matrix constructions and algorithms for signal recovery which, compared to previous deterministic algorithms, are superior in either the number of measurements or in noise tolerance.
Clustering data streams: Theory and practice
IEEE TKDE, 2003
Cited by 154 (4 self)
The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams. Index Terms: Clustering, data streams, approximation algorithms.
Combinatorial Algorithms for Compressed Sensing
In Proc. of SIROCCO, 2006
Cited by 116 (1 self)
In sparse approximation theory, the fundamental problem is to reconstruct a signal $A \in \mathbb{R}^n$ from linear measurements $\langle A, \psi_i \rangle$ with respect to a dictionary of $\psi_i$'s. Recently, there is focus on the novel direction of Compressed Sensing [1], where the reconstruction can be done with very few, $O(k \log n)$, linear measurements over a modified dictionary if the signal is compressible, that is, its information is concentrated in $k$ coefficients with the original dictionary. In particular, these results [1], [2], [3] prove that there exists a single $O(k \log n) \times n$ measurement matrix such that any such signal can be reconstructed from these measurements, with error at most $O(1)$ times the worst case error for the class of such signals. Compressed sensing has generated tremendous excitement both because of the sophisticated underlying mathematics and because of its potential applications. In this paper, we address outstanding open problems in Compressed Sensing. Our main result is an explicit construction of a non-adaptive measurement matrix and the corresponding reconstruction algorithm so that with a number of measurements polynomial in $k$, $\log n$, $1/\varepsilon$, we can reconstruct compressible signals. This is the first known polynomial time explicit construction of any such measurement matrix. In addition, our result improves the error guarantee from $O(1)$ to $1 + \varepsilon$ and improves the reconstruction time from $\mathrm{poly}(n)$ to $\mathrm{poly}(k \log n)$. Our second result is a randomized construction of $O(k\,\mathrm{polylog}(n))$ measurements that work for each signal with high probability and gives per-instance approximation guarantees rather than over the class of all signals. Previous work on Compressed Sensing does not provide such per-instance approximation guarantees; our result improves on the best number of measurements known from prior work in other areas including Learning Theory [4], [5], Streaming algorithms [6], [7], [8] and Complexity Theory [9] for this case. Our approach is combinatorial. In particular, we use two parallel sets of group tests, one to filter and the other to certify and estimate; the resulting algorithms are quite simple to implement.
How to Summarize the Universe: Dynamic Maintenance of Quantiles
In VLDB, 2002
Cited by 111 (15 self)
Order statistics, i.e., quantiles, are frequently used in databases both at the database server as well as the application level. For example, they are useful in selectivity estimation during query optimization, in partitioning large relations, in estimating query result sizes when building user interfaces, and in characterizing the data distribution of evolving datasets in the process of data mining.