Results 1–10 of 48
Data Streams: Algorithms and Applications
, 2005
Abstract

Cited by 375 (21 self)
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size, or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudorandom computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].
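As a concrete illustration of the one-pass, sublinear-space regime this survey covers (the sketch below is not from the survey itself), reservoir sampling keeps a uniform random sample of a stream using memory independent of the stream's length:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown
    length, using one pass and O(k) memory (Vitter's Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10**6), 10, random.Random(0))
print(sample)  # 10 items, each equally likely to be any stream element
```

Each element is retained with probability exactly k/n at the end of the pass, which is the kind of guarantee streaming algorithms trade space for.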
k-means++: the advantages of careful seeding
 In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms
, 2007
Abstract

Cited by 209 (6 self)
The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
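The seeding technique described in the abstract is D²-sampling: each new center is drawn with probability proportional to the squared distance to the nearest center chosen so far. A minimal 1-D sketch (illustrative, not the authors' code):

```python
import random

def kmeanspp_seed(points, k, rng=None):
    """k-means++ seeding: pick the first center uniformly at random, then
    pick each subsequent center with probability proportional to D(x)^2,
    the squared distance from x to its nearest already-chosen center."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        d2 = [min((x - c) ** 2 for c in centers) for x in points]
        total = sum(d2)
        if total == 0:  # every point coincides with some center
            centers.append(rng.choice(points))
            continue
        r = rng.random() * total
        acc = 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(x)
                break
    return centers

pts = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
print(kmeanspp_seed(pts, 2, random.Random(1)))
```

With two well-separated groups, the second center lands in the other group with overwhelming probability, which is exactly why this seeding avoids the bad local optima plain k-means can start from.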
Incremental Clustering and Dynamic Information Retrieval
, 1997
Abstract

Cited by 153 (5 self)
Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. We propose a model called incremental clustering which is based on a careful analysis of the requirements of the information retrieval application, and which should also be useful in other applications. The goal is to efficiently maintain clusters of small diameter as new points are inserted. We analyze several natural greedy algorithms and demonstrate that they perform poorly. We propose new deterministic and randomized incremental clustering algorithms which have a provably good performance. We complement our positive results with lower bounds on the performance of incremental algorithms. Finally, we consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters.
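One of the "natural greedy algorithms" this line of work analyzes can be sketched in a few lines (a 1-D illustration of the rule, not the paper's improved algorithms): open a new cluster only when a point is far from every existing center.

```python
def incremental_centers(stream, radius):
    """Naive greedy incremental clustering: a new point within `radius`
    of some existing center joins that cluster; otherwise it becomes a
    new center.  Simple rules like this can perform poorly in the worst
    case, which motivates the provably better algorithms in the paper."""
    centers = []
    for x in stream:
        if all(abs(x - c) > radius for c in centers):
            centers.append(x)
    return centers

print(incremental_centers([0.0, 0.4, 5.0, 5.3, 9.9], radius=1.0))
# -> [0.0, 5.0, 9.9]
```

The weakness is that cluster diameters depend on the arrival order of points, which is precisely what the incremental model forces the algorithm to cope with.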
Issues in Data Stream Management
, 2003
Abstract

Cited by 125 (6 self)
Traditional databases store sets of relatively static records with no predefined notion of time, unless timestamp attributes are explicitly added. While this model adequately represents commercial catalogues or repositories of personal information, many current and emerging applications require support for online analysis of rapidly changing data streams. Limitations of traditional DBMSs in supporting streaming applications have been recognized, prompting research to augment existing technologies and build new systems to manage streaming data. The purpose of this paper is to review recent work in data stream management systems, with an emphasis on application requirements, data models, continuous query languages, and query evaluation.
Maintaining Variance and k-Medians over Data Stream Windows
 In PODS
, 2003
Abstract

Cited by 74 (1 self)
The sliding window model is useful for discounting stale data in data stream applications. In this model, data elements arrive continually and only the most recent N elements are used when answering queries. We present a novel technique for solving two important and related problems in the sliding window model: maintaining variance and maintaining a k-median clustering. Our solution to the problem of maintaining variance provides a continually updated estimate of the variance of the last N values in a data stream with relative error of at most ε using O((1/ε²) log N) memory. We present a constant-factor approximation algorithm which maintains an approximate k-median solution for the last N data points using O((k/τ⁴) N^{2τ} log² N) memory, where τ < 1/2 is a parameter which trades off the space bound with the approximation factor of O(2^{O(1/τ)}).
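Sliding-window algorithms of this kind keep per-bucket summaries rather than raw values, which requires that two summaries be mergeable without revisiting the data. A minimal sketch of the standard mergeable variance summary (n, mean, M2), where M2 is the sum of squared deviations from the mean (this illustrates the building block, not the paper's bucket structure):

```python
def merge_stats(a, b):
    """Combine two (n, mean, M2) summaries so the variance of the union
    is recoverable without touching the raw values."""
    n1, m1, s1 = a
    n2, m2, s2 = b
    n = n1 + n2
    if n == 0:
        return (0, 0.0, 0.0)
    delta = m2 - m1
    mean = m1 + delta * n2 / n
    m2sum = s1 + s2 + delta * delta * n1 * n2 / n
    return (n, mean, m2sum)

def summary(xs):
    n = len(xs)
    mean = sum(xs) / n
    return (n, mean, sum((x - mean) ** 2 for x in xs))

left, right = [1.0, 2.0, 3.0], [10.0, 11.0]
n, mean, m2 = merge_stats(summary(left), summary(right))
print(n, mean, m2 / n)  # count, mean, and variance of the combined window
```

Because buckets combine exactly, expiring old buckets and merging adjacent ones, as sliding-window schemes do, never loses accuracy beyond the bucketing granularity itself.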
Density-based clustering over an evolving data stream with noise
 In 2006 SIAM Conference on Data Mining
, 2006
Abstract

Cited by 48 (1 self)
Clustering is an important task in mining evolving data streams. Besides the limited memory and one-pass constraints, the nature of evolving data streams implies the following requirements for stream clustering: no assumption on the number of clusters, discovery of clusters with arbitrary shape, and the ability to handle outliers. While many clustering algorithms for data streams have been proposed, they offer no solution to the combination of these requirements. In this paper, we present DenStream, a new approach for discovering clusters in an evolving data stream. The "dense" micro-cluster (named core-micro-cluster) is introduced to summarize the clusters with arbitrary shape, while the potential core-micro-cluster and outlier micro-cluster structures are proposed to maintain and distinguish the potential clusters and outliers. A novel pruning strategy is designed based on these concepts, which guarantees the precision of the weights of the micro-clusters with limited memory. Our performance study over a number of real and synthetic data sets demonstrates the effectiveness and efficiency of our method.
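A damped micro-cluster of this flavor can be sketched as a small set of exponentially decayed sufficient statistics; the class below is a 1-D toy in the spirit of DenStream, with illustrative parameter names rather than the paper's exact definitions:

```python
class MicroCluster:
    """Toy damped micro-cluster: keeps a decayed weight w and decayed
    linear/squared sums (CF1, CF2) for 1-D points, enough to recover a
    center (and, from CF2, a radius).  Old points fade at rate 2^(-lam*t),
    so stale structure eventually drops below a pruning threshold."""
    def __init__(self, lam=0.25):
        self.lam = lam              # decay rate, lambda in the paper's spirit
        self.t = 0.0
        self.w = self.cf1 = self.cf2 = 0.0

    def _decay(self, t):
        f = 2 ** (-self.lam * (t - self.t))
        self.w *= f; self.cf1 *= f; self.cf2 *= f
        self.t = t

    def insert(self, x, t):
        self._decay(t)
        self.w += 1.0
        self.cf1 += x
        self.cf2 += x * x

    def center(self):
        return self.cf1 / self.w

mc = MicroCluster()
for t, x in enumerate([5.0, 5.2, 4.8, 5.1]):
    mc.insert(x, float(t))
print(round(mc.center(), 3))  # weighted center near 5.0
```

The decayed weight w is exactly the quantity whose precision the paper's pruning strategy guarantees under limited memory.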
Coresets for k-Means and k-Median Clustering and their Applications
 In Proc. 36th Annu. ACM Sympos. Theory Comput
, 2003
Abstract

Cited by 46 (13 self)
In this paper, we show the existence of small coresets for the problems of computing k-median and k-means clustering for points in low dimension. In other words, we show that given a point set P in ℝ^d, one can compute a weighted set S ⊆ P, of size O(k ε^{−d} log n), such that one can compute the k-median/means clustering on S instead of on P, and get a (1 + ε)-approximation.
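The coreset idea is that a small weighted set can stand in for the full data when evaluating clustering cost. The toy below makes this exact by collapsing duplicate points into weights (the paper's construction is far more general, achieving a (1 + ε)-approximation on arbitrary point sets):

```python
def kcost(points, weights, centers):
    """Weighted k-means cost in 1-D: sum_i w_i * min_c (x_i - c)^2."""
    return sum(w * min((x - c) ** 2 for c in centers)
               for x, w in zip(points, weights))

# Full data with many duplicated values; a weighted set collapses them.
full = [1.0] * 50 + [2.0] * 30 + [9.0] * 20
coreset_pts = [1.0, 2.0, 9.0]
coreset_wts = [50.0, 30.0, 20.0]

centers = [1.5, 9.0]
c_full = kcost(full, [1.0] * len(full), centers)
c_core = kcost(coreset_pts, coreset_wts, centers)
print(c_full, c_core)  # identical here: the weighted set preserves the cost
```

Here the equality is exact only because the duplicates collapse perfectly; the point of a true coreset is that the weighted cost stays within a (1 ± ε) factor of the full cost for *every* candidate set of centers.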
Sublinear time algorithms
 SIGACT News
, 2003
Abstract

Cited by 22 (2 self)
Sublinear time algorithms represent a new paradigm in computing, where an algorithm must give some sort of an answer after inspecting only a very small portion of the input. We discuss the sorts of answers that one might be able to achieve in this new setting.
1 Introduction. The goal of algorithmic research is to design efficient algorithms, where efficiency is typically measured as a function of the length of the input. For instance, the elementary school algorithm for multiplying two n-digit integers takes roughly n² steps, while more sophisticated algorithms have been devised which run in less than n log² n steps. It is still not known whether a linear time algorithm is achievable for integer multiplication. Obviously any algorithm for this task, as for any other nontrivial task, would need to take at least linear time in n, since this is what it would take to read the entire input and write the output. Thus, showing the existence of a linear time algorithm for a problem was traditionally considered to be the gold standard of achievement. Nevertheless, due to the recent tremendous increase in computational power that is inundating us with a multitude of data, we are now encountering a paradigm shift from traditional computational models. The scale of these data sets, coupled with the typical situation in which there is very little time to perform our computations, raises the issue of whether there is time to consider any more than a minuscule fraction of the data in our computations. Analogous to the reasoning that we used for multiplication, for most natural problems, an algorithm which runs in sublinear time must necessarily use randomization and must give an answer which is in some sense imprecise. Nevertheless, there are many situations in which a fast approximate solution is more useful than a slower exact solution.
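The simplest instance of this paradigm is estimating an average by random sampling, where the number of inspected entries depends only on the desired accuracy, not on the input length. A toy sketch (not from the survey):

```python
import random

def approx_mean(xs, samples, rng=None):
    """Estimate the mean of a 0/1 array by inspecting only `samples`
    random entries: a sublinear-time, randomized, approximate answer.
    By Hoeffding's inequality, the additive error exceeds eps with
    probability at most 2 * exp(-2 * samples * eps**2)."""
    rng = rng or random.Random()
    return sum(xs[rng.randrange(len(xs))] for _ in range(samples)) / samples

data = [1] * 700_000 + [0] * 300_000   # true mean is 0.7
est = approx_mean(data, 2000, random.Random(42))
print(est)
```

Note that the estimate is both randomized and imprecise, exactly the two concessions the survey argues most sublinear-time algorithms must make.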
Efficient algorithms for constructing (1 + ɛ, β)-spanners in the distributed and streaming models
 Distributed Computing
, 2004
Abstract

Cited by 21 (6 self)
For an unweighted undirected graph G = (V, E), and a pair of positive integers α ≥ 1, β ≥ 0, a subgraph G′ = (V, H), H ⊆ E, is called an (α, β)-spanner of G if for every pair of vertices u, v ∈ V, dist_{G′}(u, v) ≤ α · dist_G(u, v) + β. It was shown in [20] that for any ɛ > 0, κ = 1, 2, ..., there exists an integer β = β(ɛ, κ) such that for every n-vertex graph G there exists a (1 + ɛ, β)-spanner G′ with O(n^{1+1/κ}) edges. An efficient distributed protocol for constructing (1 + ɛ, β)-spanners was devised in [18]. The running time and the communication complexity of that protocol are O(n^{1+ρ}) and O(|E| · n^ρ), respectively, where ρ is an additional control parameter of the protocol that affects only the additive term β. In this paper we devise a protocol with a drastically improved running time (O(n^ρ) as opposed to O(n^{1+ρ})) for constructing (1 + ɛ, β)-spanners. Our protocol has the same communication complexity as the protocol of [18], and it constructs spanners with essentially the same properties as the spanners that are constructed by the protocol of [18]. We also show that our protocol for constructing (1 + ɛ, β)-spanners can be adapted to the streaming model, and devise a streaming algorithm that uses a constant number of passes and O(n^{1+1/κ} · log n) bits of space for computing all-pairs almost-shortest paths, of length at most a multiplicative factor of (1 + ɛ) and an additive term of β greater than the shortest paths. Our algorithm processes each edge in time O(n^ρ), for an arbitrarily small ρ > 0.
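For readers new to spanners, the classical greedy multiplicative spanner (Althöfer et al.) makes the definition concrete; it is NOT the distributed/streaming protocol of the paper above, just an illustration of the guarantee an (α, β)-spanner provides:

```python
from collections import deque

def bfs_dist(adj, src, dst, limit):
    """Hop distance from src to dst in the spanner so far, capped at limit+1."""
    dist = {src: 0}
    q = deque([src])
    while q:
        x = q.popleft()
        if x == dst:
            return dist[x]
        if dist[x] == limit:
            continue
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                q.append(y)
    return limit + 1

def greedy_spanner(n, edges, stretch):
    """Keep edge (u, v) only if u and v are not already within `stretch`
    hops in the spanner built so far; the result is a (stretch)-spanner."""
    adj = [[] for _ in range(n)]
    spanner = []
    for u, v in edges:
        if bfs_dist(adj, u, v, stretch) > stretch:
            adj[u].append(v); adj[v].append(u)
            spanner.append((u, v))
    return spanner

# On a 4-cycle with stretch 3, one edge is redundant.
print(greedy_spanner(4, [(0, 1), (1, 2), (2, 3), (3, 0)], stretch=3))
# -> [(0, 1), (1, 2), (2, 3)]
```

With stretch 2κ − 1 this greedy rule yields O(n^{1+1/κ}) edges, the same size bound as the (1 + ɛ, β)-spanners discussed above, but with a purely multiplicative stretch instead of the mixed multiplicative/additive guarantee.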
Declaring Independence via the Sketching of Sketches
Abstract

Cited by 20 (0 self)
We consider the problem of identifying correlations in data streams. Surprisingly, our work seems to be the first to consider this natural problem. In the centralized model, we consider a stream of pairs (i, j) ∈ [n]², whose frequencies define a joint distribution (X, Y). In the distributed model, each coordinate of the pair may appear separately in the stream. We present a range of algorithms for approximating to what extent X and Y are independent, i.e., how close the joint distribution is to the product of the marginals. We consider various measures of closeness including ℓ1, ℓ2, and the mutual information between X and Y. Our algorithms are based on "sketching sketches", i.e., composing small-space linear synopses of the distributions. Perhaps ironically, the biggest technical challenges that arise relate to ensuring that different components of our estimates are sufficiently independent.
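The ℓ2 version of this question can be illustrated with AMS-style random-sign sketches of the difference between the joint distribution and the product of the marginals; the toy below computes the distributions explicitly for clarity (names and structure are illustrative, not the paper's construction, which sketches the stream directly):

```python
import random

def l2_independence_gap(pairs, n, sketch_dim, rng=None):
    """Estimate || joint(X,Y) - marg(X)*marg(Y) ||_2 from a stream of
    pairs over [n] x [n], using sketch_dim random +/-1 linear sketches."""
    rng = rng or random.Random()
    # One random +/-1 sign per cell (i, j), per sketch coordinate.
    signs = [[[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(n)]
             for _ in range(sketch_dim)]
    m = len(pairs)
    joint = [[0.0] * n for _ in range(n)]
    px = [0.0] * n
    py = [0.0] * n
    for i, j in pairs:
        joint[i][j] += 1.0 / m
        px[i] += 1.0 / m
        py[j] += 1.0 / m
    # Sketch the difference vector D = joint - px*py; E[z^2] = ||D||_2^2.
    ests = []
    for s in signs:
        z = sum(s[i][j] * (joint[i][j] - px[i] * py[j])
                for i in range(n) for j in range(n))
        ests.append(z * z)
    return (sum(ests) / sketch_dim) ** 0.5

# Perfectly dependent stream (X == Y) versus an independent-looking one
# with the same marginals.
dep = [(i % 2, i % 2) for i in range(1000)]
ind = [(i % 2, (i // 2) % 2) for i in range(1000)]
rng = random.Random(7)
dep_est = l2_independence_gap(dep, 2, 64, rng)
ind_est = l2_independence_gap(ind, 2, 64, rng)
print(dep_est, ind_est)
```

For the dependent stream the true gap is 0.5 and the estimate concentrates around it; for the independent-looking stream the gap is essentially zero. The sign sketches are linear in the distribution, which is the property that lets the paper compose them ("sketch the sketches") in the distributed setting.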