Results 1-10 of 39
Clustering data streams: Theory and practice
IEEE TKDE, 2003
Cited by 154 (4 self)
The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams.
Index Terms: clustering, data streams, approximation algorithms.
Approximate clustering via coresets
In Proc. 34th Annu. ACM Sympos. Theory Comput., 2002
Cited by 141 (17 self)
In this paper, we show that for several clustering problems one can extract a small set of points, so that using those coresets enables us to perform approximate clustering efficiently. The surprising property of those coresets is that their size is independent of the dimension. Using those, we present (1 + ε)-approximation algorithms for the k-center clustering and k-median clustering problems in Euclidean space. The running time of the new algorithms has linear or near-linear dependency on the number of points and the dimension, and exponential dependency on 1/ε and k. As such, our results are a substantial improvement over what was previously known. We also present some other clustering results, including (1 + ε)-approximate 1-cylinder clustering and k-center clustering with outliers.
Mean Shift Based Clustering in High Dimensions: A Texture Classification Example
2003
Cited by 137 (4 self)
Feature space analysis is the main module in many computer vision tasks. The most popular technique, k-means clustering, however, has two inherent limitations: the clusters are constrained to be spherically symmetric and their number has to be known a priori. In nonparametric clustering methods, like the one based on mean shift, these limitations are eliminated but the amount of computation becomes prohibitively large as the dimension of the space increases. We exploit a recently proposed approximation technique, locality-sensitive hashing (LSH), to reduce the computational complexity of adaptive mean shift. In our implementation of LSH the optimal parameters of the data structure are determined by a pilot learning procedure, and the partitions are data driven. As an application, the performance of mode-based and k-means-based textons is compared in a texture classification study.
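To make the mode-seeking idea concrete, here is a minimal sketch of plain mean shift with a fixed bandwidth and a flat (uniform) kernel. It is not the adaptive, LSH-accelerated method the abstract describes; the function name `mean_shift_mode`, the bandwidth value, and the toy data are assumptions made purely for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def squared_distance(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points: List[Point]) -> Point:
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def mean_shift_mode(start: Point, points: List[Point], bandwidth: float,
                    tol: float = 1e-9, max_iters: int = 1000) -> Point:
    """Follow flat-kernel mean shift steps from `start` until they stop moving.

    Hypothetical helper: fixed bandwidth, brute-force neighbor search.
    """
    x = tuple(start)
    for _ in range(max_iters):
        # Flat kernel: every point within `bandwidth` of x gets equal weight.
        window = [p for p in points if squared_distance(p, x) <= bandwidth ** 2]
        if not window:
            return x
        new_x = mean(window)
        if squared_distance(new_x, x) <= tol ** 2:
            return new_x
        x = new_x
    return x


# Toy 1-D data with two well-separated groups (illustrative only).
data = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
mode_left = mean_shift_mode((0.0,), data, bandwidth=3.0)
mode_right = mean_shift_mode((12.0,), data, bandwidth=3.0)
```

The brute-force window search costs time linear in the number of points at every step, which is the cost that grows painful in high dimensions and that approximate neighbor search is meant to reduce. The number of clusters is not supplied anywhere: it emerges as the number of distinct modes reached.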
Clustering Large Graphs via the Singular Value Decomposition
Machine Learning, 2004
Cited by 109 (2 self)
We consider the problem of partitioning a set of m points in the n-dimensional Euclidean space into k clusters (usually m and n are variable, while k is fixed), so as to minimize the sum of squared distances between each point and its cluster center. This formulation is usually the objective of the k-means clustering algorithm (Kanungo et al. (2000)). We prove that this problem is NP-hard even for k = 2, and we consider a continuous relaxation of this discrete problem: find the k-dimensional subspace V that minimizes the sum of squared distances to V of the m points. This relaxation can be solved by computing the Singular Value Decomposition (SVD) of the m × n matrix A that represents the m points; this solution can be used to get a 2-approximation algorithm for the original problem. We then argue that in fact the relaxation provides a generalized clustering which is useful in its own right. Finally, we ...
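The discrete objective stated above (sum of squared distances from each point to its cluster center) can be written down directly. The sketch below only evaluates that objective for a given partition, taking each cluster's center to be its centroid; it does not implement the SVD relaxation or the 2-approximation algorithm. The name `kmeans_cost` and the sample points are assumptions for illustration.

```python
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points: List[Point]) -> List[float]:
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans_cost(clusters: List[List[Point]]) -> float:
    """Sum of squared distances from every point to its cluster's centroid."""
    total = 0.0
    for cluster in clusters:
        center = centroid(cluster)
        total += sum(squared_distance(p, center) for p in cluster)
    return total


# Four points in the plane, split into k = 2 clusters in two different ways.
left = [(0.0, 0.0), (2.0, 0.0)]
right = [(10.0, 0.0), (10.0, 4.0)]
good_cost = kmeans_cost([left, right])
bad_cost = kmeans_cost([[left[0], right[0]], [left[1], right[1]]])
```

Comparing `good_cost` with `bad_cost` shows how the objective ranks partitions. The hardness result in the abstract concerns finding the minimizing partition, not evaluating a given one.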
Matrix Approximation and Projective Clustering via Volume Sampling
2006
Cited by 92 (3 self)
Frieze, Kannan, and Vempala (JACM 2004) proved that a small sample of rows of a given matrix A spans the rows of a low-rank approximation D that minimizes ‖A − D‖_F within a small additive error, and the sampling can be done efficiently using just two passes over the matrix. In this paper, we generalize this result in two ways. First, we prove that the additive error drops exponentially by iterating the sampling in an adaptive manner (adaptive sampling). Using this result, we give a pass-efficient algorithm for computing a low-rank approximation with reduced additive error. Our second result is that there exist k rows of A whose span contains the rows of a multiplicative (k + 1)-approximation to the best rank-k matrix; moreover, this subset can be found by sampling k-subsets of rows from a natural distribution (volume sampling). Combining volume sampling with adaptive sampling yields the existence of a set of k + k(k + 1)/ε rows whose span contains the rows of a multiplicative (1 + ε)-approximation. This leads to a PTAS for the following NP-hard ...
The effectiveness of Lloyd-type methods for the k-means problem
In FOCS, 2006
Cited by 82 (4 self)
We investigate variants of Lloyd’s heuristic for clustering high dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd’s heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd’s heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd’s method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.
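For readers unfamiliar with the baseline being analyzed, the following is a minimal sketch of the textbook Lloyd iteration (assign each point to its nearest center, then move each center to the mean of its points). It is not one of the paper's variants, and the starting centers are simply passed in by hand rather than produced by the probabilistic seeding process the abstract mentions. The function name `lloyd` and the sample data are assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def squared_distance(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points: List[Point]) -> Point:
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def lloyd(points: List[Point], centers: List[Point],
          max_iters: int = 100) -> List[Point]:
    """Textbook Lloyd iteration from caller-supplied starting centers."""
    centers = list(centers)
    for _ in range(max_iters):
        # Assignment step: group each point with its nearest current center.
        groups: List[List[Point]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to its group's mean.
        # A center with no assigned points is left where it is.
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Two well-separated groups; the starting centers are chosen by hand here.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
final_centers = lloyd(data, centers=[(0.0, 0.0), (10.0, 2.0)])
```

The result depends on the starting configuration, which is why a seeding process matters: a poor start can leave the iteration at a poor local optimum.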
On the Impossibility of Dimension Reduction in l_1
In Proc. 35th Annu. ACM Sympos. Theory Comput., 2003
Cited by 56 (1 self)
The Johnson-Lindenstrauss Lemma shows that any n points in Euclidean space (with distances measured by the L2 norm) may be mapped down to O((log n)/ε^2) dimensions such that no pairwise distance is distorted by more than a (1 + ε) factor. Determining whether such dimension reduction is possible in L1 has been an intriguing open question. Charikar and Sahai [7] recently showed lower bounds for dimension reduction in L1 that can be achieved by linear projections, and positive results for shortest path metrics of restricted graph families. However, the question of general dimension reduction in L1 was still open. For example, it was not known whether it is possible to reduce the number of dimensions to O(log n) with 1 + ε distortion. We show strong lower bounds for general dimension reduction in L1. We give an explicit family of n points in L1 such that any embedding with distortion d requires n^Ω(1/d^2) dimensions. This proves that there is no analog of the Johnson-Lindenstrauss Lemma for L1.
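The positive L2 result that the abstract contrasts against can be illustrated with a random Gaussian projection, one standard way of realizing a Johnson-Lindenstrauss-style map. This sketch covers only the Euclidean case and says nothing about the paper's L1 lower bound. The dimensions, seeds, and the helper name `random_projection` are assumptions chosen for illustration.

```python
import math
import random
from itertools import combinations
from typing import List


def random_projection(points: List[List[float]], target_dim: int,
                      seed: int = 0) -> List[List[float]]:
    """Map points to `target_dim` dimensions with a scaled Gaussian matrix."""
    rng = random.Random(seed)
    source_dim = len(points[0])
    # Scaling by 1/sqrt(target_dim) keeps squared lengths unbiased.
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
              for _ in range(target_dim)]
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix]
            for p in points]


def l2(p: List[float], q: List[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Eight arbitrary points in 500 dimensions, projected down to 200.
data_rng = random.Random(1)
points = [[data_rng.uniform(-1.0, 1.0) for _ in range(500)] for _ in range(8)]
projected = random_projection(points, target_dim=200)

# Ratio of projected distance to original distance for every pair.
ratios = [l2(projected[i], projected[j]) / l2(points[i], points[j])
          for i, j in combinations(range(len(points)), 2)]
```

With high probability every entry of `ratios` lies close to 1, and the spread shrinks as `target_dim` grows. The abstract's result is that no map of comparably low dimension can give such a guarantee when distances are measured in L1.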
Self-improving algorithms
In SODA ’06: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms
Cited by 34 (6 self)
We investigate ways in which an algorithm can improve its expected performance by fine-tuning itself automatically with respect to an arbitrary, unknown input distribution. We give such self-improving algorithms for sorting and computing Delaunay triangulations. The highlights of this work: (i) an algorithm to sort a list of numbers with optimal expected limiting complexity; and (ii) an algorithm to compute the Delaunay triangulation of a set of points with optimal expected limiting complexity. In both cases, the algorithm begins with a training phase during which it adjusts itself to the input distribution, followed by a stationary regime in which the algorithm settles to its optimized incarnation.
On the optimality of the dimensionality reduction method
In Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS)
Cited by 34 (5 self)
We investigate the optimality of (1 + ε)-approximation algorithms obtained via the dimensionality reduction method. We show that:
• Any data structure for the (1 + ε)-approximate nearest neighbor problem in Hamming space that uses a constant number of probes to answer each query must use n^Ω(1/ε^2) space.
• Any algorithm for the (1 + ε)-approximate closest substring problem must run in time exponential in 1/ε^(2−γ) for any γ > 0 (unless 3SAT can be solved in subexponential time).
Both lower bounds are (essentially) tight.
Clustering with the connectivity kernel
In NIPS, 2004
Cited by 32 (1 self)
Clustering aims at extracting hidden structure in a dataset. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two-step procedure: first, the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally NP-hard clustering problem; (iii) the new method does not contain free kernel parameters, in contrast to methods like spectral clustering or mean-shift clustering.