Results 1 - 10
of
61
Models and issues in data stream systems
- In PODS
, 2002
"... In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work releva ..."
Abstract
-
Cited by 519 (18 self)
- Add to MetaCart
In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues. 1
Counting Distinct Elements in a Data Stream
, 2002
"... We present three algorithms to count the number of distinct elements in a data stream to within a factor of 1 ± epsilon. Our algorithms improve upon known algorithms for this problem, and offer a spectrum of time/space tradeoffs. ..."
Abstract
-
Cited by 111 (4 self)
- Add to MetaCart
We present three algorithms to count the number of distinct elements in a data stream to within a factor of 1 ± epsilon. Our algorithms improve upon known algorithms for this problem, and offer a spectrum of time/space tradeoffs.
Better Streaming Algorithms for Clustering Problems
- In Proc. of 35th ACM Symposium on Theory of Computing (STOC
, 2003
"... We study cluster ng pr blems in the str aming model, wher e the goal is to cluster a set of points by making one pass (or a few passes) over the data using a small amount of storSD space.Our mainr esult is a r ndomized algor ithm for k--Median prE lem which p duces a constant factor a ..."
Abstract
-
Cited by 63 (1 self)
- Add to MetaCart
We study cluster ng pr blems in the str aming model, wher e the goal is to cluster a set of points by making one pass (or a few passes) over the data using a small amount of storSD space.Our mainr esult is a r ndomized algor ithm for k--Median prE lem which p duces a constant factor appr oximation in one pass using storR4 space O(kpolylog n). This is a significant imp r vement of the prS ious best algor5 hm which yielded a 2 appr ximation using O(n )space.
Comparing data streams using hamming norms (how to zero in)
, 2003
"... Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases and instead must be processed “on the fly” as they are produced. Similarly, sensor networ ..."
Abstract
-
Cited by 59 (6 self)
- Add to MetaCart
Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases and instead must be processed “on the fly” as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams and, hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalizes ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the “l0 sketch ” and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.
Graph mining: Laws, generators, and algorithms
- ACM COMPUTING SURVEYS
, 2006
"... How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M : N relation i ..."
Abstract
-
Cited by 49 (7 self)
- Add to MetaCart
How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M : N relation in database terminology can be represented as a graph. A lot of these questions boil down to the following: "How can we generate synthetic but realistic graphs?" To answer this, we must first understand what patterns are common in real-world graphs and can thus be considered a mark of normality/realism. This survey give an overview of the incredible variety of work that has been done on these problems. One of our main contributions is the integration of points of view from physics, mathematics, sociology, and computer science. Further, we briefly describe recent advances on some related and interesting graph problems.
Distributed streams algorithms for sliding windows
- In Proc. ACM Symp. on Parallel Algorithms and Architectures (SPAA
, 2002
"... Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate functions over a “sliding window ” of the N most recent data items ..."
Abstract
-
Cited by 48 (10 self)
- Add to MetaCart
Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate functions over a “sliding window ” of the N most recent data items in one or more streams. Our results include: 1. For a single stream, we present the first ɛ-approximation scheme for the number of 1’s in a sliding window that is optimal in both worst case time and space. We also present the first ɛ-approximation scheme for the sum of integers in [0..R] in a sliding window that is optimal in both worst case time and space (assuming R is at most polynomial in N). Both algorithms are deterministic and use only logarithmic memory words. 2. In contrast, we show that any deterministic algorithm that estimates, to within a small constant relative error, the number of 1’s (or the sum of integers) in a sliding window on the union of distributed streams requires Ω(N) space.
New Streaming Algorithms for Fast Detection of Superspreaders
- in Proceedings of Network and Distributed System Security Symposium (NDSS
, 2005
"... High-speed monitoring of Internet traffic is an important and challenging problem, with applications to realtime attack detection and mitigation, traffic engineering, etc. However, packet-level monitoring requires fast streaming algorithms that use very little memory and little communication among c ..."
Abstract
-
Cited by 45 (2 self)
- Add to MetaCart
High-speed monitoring of Internet traffic is an important and challenging problem, with applications to realtime attack detection and mitigation, traffic engineering, etc. However, packet-level monitoring requires fast streaming algorithms that use very little memory and little communication among collaborating network monitoring points. In this paper, we consider the problem of detecting superspreaders, which are sources that connect to a large number of distinct destinations. We propose new streaming algorithms for detecting superspreaders and prove guarantees on their accuracy and memory requirements. We also show experimental results on real network traces. Our algorithms are substantially more efficient (both theoretically and experimentally) than previous approaches. We also extend our algorithms to identify superspreaders in a distributed setting, with sliding windows, and when deletions are allowed in the stream (which lets us identify sources that make a large number of failed connections to distinct destinations). More generally, our algorithms are applicable to any problem that can be formulated as follows: given a stream of (x, y) pairs, find all the x’s that are paired with a large number of distinct y’s. We call this the heavy distinct-hitters problem. There are many network security applications of this general problem. This paper discusses these applications and, for concreteness, focuses on the superspreader problem. 1
On graph problems in a semi-streaming model
- In 31st International Colloquium on Automata, Languages and Programming
, 2004
"... Abstract. We formalize a potentially rich new streaming model, the semi-streaming model, that we believe is necessary for the fruitful study of efficient algorithms for solving problems on massive graphs whose edge sets cannot be stored in memory. In this model, the input graph, G = (V, E), is prese ..."
Abstract
-
Cited by 38 (4 self)
- Add to MetaCart
Abstract. We formalize a potentially rich new streaming model, the semi-streaming model, that we believe is necessary for the fruitful study of efficient algorithms for solving problems on massive graphs whose edge sets cannot be stored in memory. In this model, the input graph, G = (V, E), is presented as a stream of edges (in adversarial order), and the storage space of an algorithm is bounded by O(n · polylog n), where n = |V |. We are particularly interested in algorithms that use only one pass over the input, but, for problems where this is provably insufficient, we also look at algorithms using constant or, in some cases, logarithmically many passes. In the course of this general study, we give semi-streaming constant approximation algorithms for the unweighted and weighted matching problems, along with a further algorithm improvement for the bipartite case. We also exhibit log n / log log n semistreaming approximations to the diameter and the problem of computing the distance between specified vertices in a weighted graph. These are complemented by Ω(log (1−ɛ) n) lower bounds. 1
Graph distances in the streaming model: the value of space
- In ACM-SIAM Symposium on Discrete Algorithms
, 2005
"... We investigate the importance of space when solving problems based on graph distance in the streaming model. In this model, the input graph is presented as a stream of edges in an arbitrary order. The main computational restriction of the model is that we have limited space and therefore cannot stor ..."
Abstract
-
Cited by 38 (8 self)
- Add to MetaCart
We investigate the importance of space when solving problems based on graph distance in the streaming model. In this model, the input graph is presented as a stream of edges in an arbitrary order. The main computational restriction of the model is that we have limited space and therefore cannot store all the streamed data; we are forced to make space-efficient summaries of the data as we go along. For a graph of n vertices and m edges, we show that testing many graph properties, including connectivity (ergo any reasonable decision problem about distances) and bipartiteness, requires Ω(n) bits of space. Given this, we then investigate how the power of the model increases as we relax our space restriction. Our main result is an efficient randomized algorithm that constructs a (2t + 1)-spanner in one pass. With high probability, it uses O(t · n 1+1/t log 2 n) bits of space and processes each edge in the stream in O(t 2 · n 1/t log n) time. We find approximations to diameter and girth via the log n constructed spanner. For t = Ω (), the space log log n requirement of the algorithm is O(n·polylog n), and the per-edge processing time is O(polylog n). We also show a corresponding lower bound of t for the approximation ratio achievable when the space restriction is O(t · n1+1/t log 2 n). We then consider the scenario in which we are allowed multiple passes over the input stream. Here, we investigate whether allowing these extra passes will compensate for a given space restriction. We show that ∗This work was supported by the DoD University Research Initiative (URI) administered by the Office of Naval Research

