Incremental Clustering and Dynamic Information Retrieval
, 1997
Cited by 129 (3 self)
Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. We propose a model called incremental clustering which is based on a careful analysis of the requirements of the information retrieval application, and which should also be useful in other applications. The goal is to efficiently maintain clusters of small diameter as new points are inserted. We analyze several natural greedy algorithms and demonstrate that they perform poorly. We propose new deterministic and randomized incremental clustering algorithms which have a provably good performance. We complement our positive results with lower bounds on the performance of incremental algorithms. Finally, we consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters.
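To make the dual problem in this abstract concrete (clusters of bounded size, as few clusters as possible, points arriving one at a time), here is a minimal Python sketch of a naive threshold rule. It is only an illustration of the incremental setting and is not one of the algorithms proposed in the paper; the function name `greedy_fixed_radius` and the `radius` parameter are assumptions made for this example, and the abstract itself warns that natural greedy rules can perform poorly.

```python
import math


def greedy_fixed_radius(points, radius):
    """Process points in arrival order (hypothetical helper, not from the paper).

    Each point joins the first existing center within `radius`; otherwise it
    opens a new cluster and becomes its center. Every cluster therefore has
    diameter at most 2 * radius, but the number of clusters may be far from
    the minimum possible.
    """
    centers = []      # one representative point per cluster
    assignment = []   # assignment[i] = cluster index of points[i]
    for p in points:
        for i, c in enumerate(centers):
            if math.dist(p, c) <= radius:
                assignment.append(i)
                break
        else:
            # No existing center is close enough: open a new cluster.
            centers.append(p)
            assignment.append(len(centers) - 1)
    return centers, assignment
```

For example, feeding the points (0, 0), (1, 0), (5, 0), (5.5, 0) with `radius=1` opens two clusters, centered at the first and third points.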
Issues in Data Stream Management
, 2003
Cited by 105 (5 self)
Traditional databases store sets of relatively static records with no pre-defined notion of time, unless timestamp attributes are explicitly added. While this model adequately represents commercial catalogues or repositories of personal information, many current and emerging applications require support for online analysis of rapidly changing data streams. Limitations of traditional DBMSs in supporting streaming applications have been recognized, prompting research to augment existing technologies and build new systems to manage streaming data. The purpose of this paper is to review recent work in data stream management systems, with an emphasis on application requirements, data models, continuous query languages, and query evaluation.
Maintaining Variance and k-Medians over Data Stream Windows
- In PODS
, 2003
Cited by 60 (0 self)
The sliding window model is useful for discounting stale data in data stream applications. In this model, data elements arrive continually and only the most recent N elements are used when answering queries. We present a novel technique for solving two important and related problems in the sliding window model: maintaining variance and maintaining a k-median clustering. Our solution to the problem of maintaining variance provides a continually updated estimate of the variance of the last N values in a data stream with relative error of at most ε using O((1/ε²) log N) memory. We present a constant-factor approximation algorithm which maintains an approximate k-median solution for the last N data points using O((k/τ⁴) N^(2τ) log² N) memory, where τ < 1/2 is a parameter which trades off the space bound with the approximation factor of O(2^(O(1/τ))).
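As a point of reference for what this paper approximates, the sketch below keeps the exact variance of the last N values by storing the whole window. It is a baseline that uses memory linear in N, not the small-memory technique the abstract describes; the class name `WindowVariance` and its methods are assumptions introduced for illustration.

```python
from collections import deque


class WindowVariance:
    """Exact population variance over the most recent n values (baseline only)."""

    def __init__(self, n):
        if n <= 0:
            raise ValueError("window size must be positive")
        self.n = n
        self.window = deque()
        self.total = 0.0      # running sum of values in the window
        self.total_sq = 0.0   # running sum of squared values in the window

    def add(self, x):
        """Insert a new value and evict the oldest one if the window is full."""
        self.window.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.window) > self.n:
            old = self.window.popleft()
            self.total -= old
            self.total_sq -= old * old

    def variance(self):
        """Return E[x^2] - E[x]^2 over the current window."""
        m = len(self.window)
        if m == 0:
            raise ValueError("no data in window")
        mean = self.total / m
        # Clamp at zero to absorb small negative values from rounding.
        return max(self.total_sq / m - mean * mean, 0.0)
```

The running-sum formula is simple but can lose precision on long streams of large values; it is used here only to keep the baseline short.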
Coresets for k-Means and k-Median Clustering and their Applications
- In Proc. 36th Annu. ACM Sympos. Theory Comput
, 2003
Cited by 41 (13 self)
In this paper, we show the existence of small coresets for the problems of computing k-median and k-means clustering for points in low dimension. In other words, we show that given a point set P in ℝ^d, one can compute a weighted set S ⊆ P, of size O(k ε^(-d) log n), such that one can compute the k-median/means clustering on S instead of on P, and get a (1 + ε)-approximation.
Density-based clustering over an evolving data stream with noise
- In 2006 SIAM Conference on Data Mining
, 2006
Cited by 22 (1 self)
Clustering is an important task in mining evolving data streams. Besides the limited memory and one-pass constraints, the nature of evolving data streams implies the following requirements for stream clustering: no assumption on the number of clusters, discovery of clusters with arbitrary shape, and the ability to handle outliers. While many clustering algorithms for data streams have been proposed, they offer no solution to the combination of these requirements. In this paper, we present DenStream, a new approach for discovering clusters in an evolving data stream. The “dense” micro-cluster (named core-micro-cluster) is introduced to summarize the clusters with arbitrary shape, while the potential core-micro-cluster and outlier micro-cluster structures are proposed to maintain and distinguish the potential clusters and outliers. A novel pruning strategy is designed based on these concepts, which guarantees the precision of the weights of the micro-clusters with limited memory. Our performance study over a number of real and synthetic data sets demonstrates the effectiveness and efficiency of our method.
Efficient algorithms for constructing (1 + ɛ, β)-spanners in the distributed and streaming models
- Distributed Computing
, 2004
Cited by 17 (6 self)
For an unweighted undirected graph G = (V, E), and a pair of positive integers α ≥ 1, β ≥ 0, a subgraph G′ = (V, H), H ⊆ E, is called an (α, β)-spanner of G if for every pair of vertices u, v ∈ V, dist_G′(u, v) ≤ α · dist_G(u, v) + β. It was shown in [20] that for any ɛ > 0, κ = 1, 2, ..., there exists an integer β = β(ɛ, κ) such that for every n-vertex graph G there exists a (1 + ɛ, β)-spanner G′ with O(n^(1+1/κ)) edges. An efficient distributed protocol for constructing (1 + ɛ, β)-spanners was devised in [18]. The running time and the communication complexity of that protocol are O(n^(1+ρ)) and O(|E| n^ρ), respectively, where ρ is an additional control parameter of the protocol that affects only the additive term β. In this paper we devise a protocol with a drastically improved running time (O(n^ρ) as opposed to O(n^(1+ρ))) for constructing (1 + ɛ, β)-spanners. Our protocol has the same communication complexity as the protocol of [18], and it constructs spanners with essentially the same properties as the spanners that are constructed by the protocol of [18]. We also show that our protocol for constructing (1 + ɛ, β)-spanners can be adapted to the streaming model, and devise a streaming algorithm that uses a constant number of passes and O(n^(1+1/κ) · log n) bits of space for computing all-pairs-almost-shortest-paths of length at most by a multiplicative factor (1 + ɛ) and an additive term of β greater than the shortest paths. Our algorithm processes each edge in time O(n^ρ), for an arbitrarily small ρ > 0.
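The (α, β)-spanner condition defined in this abstract can be checked directly on small graphs with breadth-first search. The sketch below is a brute-force verifier of the definition only; it does not implement the distributed or streaming constructions from the paper. The function names and the edge-list representation (vertices numbered 0 to n-1) are assumptions for this example, and the code assumes without checking that the candidate edges are a subset of the original edges.

```python
from collections import deque


def bfs_distances(n, edges, source):
    """Hop distances from `source` in an unweighted undirected graph (None = unreachable)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [None] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] is None:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def is_spanner(n, edges_g, edges_h, alpha, beta):
    """Check dist_H(u, v) <= alpha * dist_G(u, v) + beta for every pair connected in G."""
    for u in range(n):
        dist_g = bfs_distances(n, edges_g, u)
        dist_h = bfs_distances(n, edges_h, u)
        for v in range(u + 1, n):
            if dist_g[v] is None:
                continue  # u and v are disconnected in G, so there is no constraint
            if dist_h[v] is None or dist_h[v] > alpha * dist_g[v] + beta:
                return False
    return True
```

On a 4-cycle with one edge removed, the endpoints of the removed edge are at distance 3 instead of 1, so the path is a (1, 2)-spanner and a (3, 0)-spanner of the cycle but not a (1, 1)-spanner.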
Adaptive spatial partitioning for multidimensional data streams
- In ISAAC
, 2004
Cited by 17 (5 self)
We propose a space-efficient scheme for summarizing multidimensional data streams. Our sketch can be used to solve spatial versions of several classical data stream queries efficiently. For instance, we can track ε-hotspots, which are congruent boxes containing at least an ε fraction of the stream, and maintain hierarchical heavy hitters in d dimensions. Our sketch can also be viewed as a multidimensional generalization of the ε-approximate quantile summary. The space complexity of our scheme is O((1/ε) log R) if the points lie in the domain [0, R]^d, where d is assumed to be a constant. The scheme extends to the sliding window model with a log(εn) factor increase in space, where n is the size of the sliding window. Our sketch can also be used to answer ε-approximate rectangular range queries over a stream of d-dimensional points.
Sublinear time algorithms
- SIGACT News
, 2003
Cited by 14 (2 self)
Sublinear time algorithms represent a new paradigm in computing, where an algorithm must give some sort of an answer after inspecting only a very small portion of the input. We discuss the sorts of answers that one might be able to achieve in this new setting.
1 Introduction. The goal of algorithmic research is to design efficient algorithms, where efficiency is typically measured as a function of the length of the input. For instance, the elementary school algorithm for multiplying two n-digit integers takes roughly n^2 steps, while more sophisticated algorithms have been devised which run in less than n log^2 n steps. It is still not known whether a linear time algorithm is achievable for integer multiplication. Obviously any algorithm for this task, as for any other nontrivial task, would need to take at least linear time in n, since this is what it would take to read the entire input and write the output. Thus, showing the existence of a linear time algorithm for a problem was traditionally considered to be the gold standard of achievement. Nevertheless, due to the recent tremendous increase in computational power that is inundating us with a multitude of data, we are now encountering a paradigm shift from traditional computational models. The scale of these data sets, coupled with the typical situation in which there is very little time to perform our computations, raises the issue of whether there is time to consider any more than a minuscule fraction of the data in our computations. Analogous to the reasoning that we used for multiplication, for most natural problems, an algorithm which runs in sublinear time must necessarily use randomization and must give an answer which is in some sense imprecise. Nevertheless, there are many situations in which a fast approximate solution is more useful than a slower exact solution.
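A small example of the trade-off this survey describes, a randomized and approximate answer after reading only part of the input, is estimating what fraction of a large list satisfies some property by sampling. This is a generic sketch, not an algorithm taken from the survey; the name `estimate_fraction` and its parameters are assumptions for illustration.

```python
import random


def estimate_fraction(items, predicate, samples, rng=None):
    """Estimate the fraction of `items` satisfying `predicate`.

    Reads only `samples` positions chosen uniformly at random with
    replacement, so the work is independent of len(items). The answer is
    approximate and can differ between runs.
    """
    if len(items) == 0 or samples <= 0:
        raise ValueError("need a non-empty input and a positive sample count")
    rng = rng or random.Random()
    hits = 0
    for _ in range(samples):
        index = rng.randrange(len(items))
        if predicate(items[index]):
            hits += 1
    return hits / samples
```

Increasing `samples` tightens the estimate at the cost of reading more of the input; an exact answer would still require inspecting every item.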
Coresets for weighted facilities and their applications
- In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06)
, 2006
Cited by 13 (5 self)
We develop efficient (1 + ε)-approximation algorithms for generalized facility location problems. Such facilities are not restricted to being points in R^d, and can represent more complex structures such as linear facilities (lines in R^d, j-dimensional flats), etc. We introduce coresets for weighted (point) facilities. These prove to be useful for such generalized facility location problems, and provide efficient algorithms for their construction. Applications include: k-mean and k-median generalizations, i.e., find k lines that minimize the sum (or sum of squares) of the distances from each input point to its nearest line. Other applications are generalizations of linear regression problems to multiple regression lines, new SVD/PCA generalizations, and many more. The results significantly improve on previous work, which deals efficiently only with special cases. Open source code for the algorithms in this paper is also available.
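The k-line objective mentioned in this abstract (the sum, or sum of squares, of distances from each point to its nearest line) can be written down directly for the planar case. The sketch below only evaluates that cost for a given set of lines; it does not build coresets or search for the best lines. The function name `k_line_cost` and the representation of a line as coefficients (a, b, c) of ax + by + c = 0 are assumptions for this example.

```python
import math


def k_line_cost(points, lines, squared=False):
    """Cost of a set of lines for 2D points.

    Each line is a tuple (a, b, c) describing a*x + b*y + c = 0, with a and b
    not both zero. Every point contributes its distance (or squared distance)
    to the nearest line.
    """
    total = 0.0
    for x, y in points:
        nearest = min(abs(a * x + b * y + c) / math.hypot(a, b) for a, b, c in lines)
        total += nearest * nearest if squared else nearest
    return total
```

For the points (0, 1), (0, -2), (3, 0), the x-axis alone costs 3 (sum) or 5 (sum of squares); adding the line y = 1 lowers these to 2 and 4.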

