Results 1–10 of 26
Approximate clustering via coresets
 In Proc. 34th Annu. ACM Sympos. Theory Comput
, 2002
Cited by 111 (15 self)
In this paper, we show that for several clustering problems one can extract a small set of points — a coreset — such that clustering the coreset yields an approximate clustering of the whole input efficiently. The surprising property of these coresets is that their size is independent of the dimension. Using them, we present (1 + ε)-approximation algorithms for the k-center clustering and k-median clustering problems in Euclidean space. The running time of the new algorithms has linear or near-linear dependency on the number of points and the dimension, and exponential dependency on 1/ε and k. As such, our results are a substantial improvement over what was previously known. We also present some other clustering results, including (1 + ε)-approximate 1-cylinder clustering and k-center clustering with outliers.
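A minimal sketch of the coreset idea (illustrative only, not the paper's construction): snap points to a grid so that one representative survives per occupied cell, then run a standard clustering routine — here Gonzalez's classical greedy 2-approximation for k-center — on the much smaller representative set. All names and parameters are assumptions for illustration.

```python
import math
import random

def grid_coreset(points, cell):
    # Snap each point to its grid cell and keep one representative per
    # occupied cell.  Every input point lies within cell*sqrt(d) of some
    # representative, so clustering the representatives perturbs the
    # k-center cost by O(cell).
    reps = {}
    for p in points:
        key = tuple(int(math.floor(x / cell)) for x in p)
        reps.setdefault(key, p)
    return list(reps.values())

def greedy_k_center(points, k):
    # Gonzalez's farthest-point heuristic: a classical 2-approximation
    # for k-center, standing in for the "expensive" clustering step.
    centers = [points[0]]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(far)
    return centers

random.seed(0)
pts = [(random.gauss(cx, 0.1), random.gauss(cy, 0.1))
       for cx, cy in [(0, 0), (5, 0), (0, 5)] for _ in range(200)]
core = grid_coreset(pts, cell=0.2)
centers = greedy_k_center(core, k=3)
```

The coreset is far smaller than the input while the chosen centers still land near the three true cluster locations.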
Clustering data streams: Theory and practice
 IEEE TKDE
, 2003
Cited by 106 (2 self)
Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams. Index Terms—Clustering, data streams, approximation algorithms.
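A toy version of the divide-and-conquer pattern common in this streaming-clustering literature: cluster each chunk of the stream down to k centers, retain only the centers, and cluster the retained centers at the end. Function names are illustrative, and a faithful version would also weight each retained center by its cluster size.

```python
import math
import random

def lloyd(points, k, iters=10, seed=0):
    # Plain Lloyd iteration, standing in for any in-memory clustering routine.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            buckets[nearest].append(p)
        new_centers = []
        for i, b in enumerate(buckets):
            if b:
                new_centers.append(tuple(sum(xs) / len(b) for xs in zip(*b)))
            else:
                new_centers.append(centers[i])  # keep a center with no points
        centers = new_centers
    return centers

def stream_cluster(stream, k, chunk=500):
    # One-pass divide and conquer: memory is O(chunk + centers retained),
    # never the full stream.
    retained, buf = [], []
    for p in stream:
        buf.append(p)
        if len(buf) == chunk:
            retained.extend(lloyd(buf, k))
            buf = []
    if buf:
        retained.extend(lloyd(buf, k))
    return lloyd(retained, k)

random.seed(1)
stream = [(random.gauss(cx, 0.5), random.gauss(cy, 0.5))
          for _ in range(500) for cx, cy in [(0, 0), (10, 0), (0, 10)]]
final_centers = stream_cluster(stream, k=3)
```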
Mean Shift Based Clustering in High Dimensions: A Texture Classification Example
, 2003
Cited by 95 (2 self)
Feature space analysis is the main module in many computer vision tasks. The most popular technique, k-means clustering, however, has two inherent limitations: the clusters are constrained to be spherically symmetric and their number has to be known a priori. In nonparametric clustering methods, like the one based on mean shift, these limitations are eliminated, but the amount of computation becomes prohibitively large as the dimension of the space increases. We exploit a recently proposed approximation technique, locality-sensitive hashing (LSH), to reduce the computational complexity of adaptive mean shift. In our implementation of LSH the optimal parameters of the data structure are determined by a pilot learning procedure, and the partitions are data driven. As an application, the performance of mode- and k-means-based textons is compared in a texture classification study.
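The basic mean-shift iteration being accelerated can be sketched as follows (flat kernel, brute-force neighbor search; the paper's contribution is replacing the linear-scan neighbor query below with an LSH lookup):

```python
import math

def mean_shift_point(p, data, radius, iters=50, tol=1e-3):
    # Flat-kernel mean shift: repeatedly move p to the mean of the data
    # points within `radius`; the sequence converges to a local density mode.
    for _ in range(iters):
        nbrs = [q for q in data if math.dist(p, q) <= radius]  # the O(n) step LSH replaces
        m = tuple(sum(xs) / len(nbrs) for xs in zip(*nbrs))
        if math.dist(p, m) < tol:
            return m
        p = m
    return p

data = [(0, 0), (0.2, 0), (-0.2, 0), (0, 0.2), (0, -0.2),
        (10, 10), (10.2, 10), (9.8, 10)]
mode = mean_shift_point((1, 1), data, radius=3)
```

Starting from (1, 1), the point is pulled to the density mode at the origin rather than the distant cluster near (10, 10).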
Clustering Large Graphs via the Singular Value Decomposition
 MACHINE LEARNING
, 2004
Cited by 73 (1 self)
We consider the problem of partitioning a set of m points in n-dimensional Euclidean space into k clusters (usually m and n are variable, while k is fixed), so as to minimize the sum of squared distances between each point and its cluster center. This formulation is usually the objective of the k-means clustering algorithm (Kanungo et al. (2000)). We prove that this problem is NP-hard even for k = 2, and we consider a continuous relaxation of this discrete problem: find the k-dimensional subspace V that minimizes the sum of squared distances to V of the m points. This relaxation can be solved by computing the Singular Value Decomposition (SVD) of the m × n matrix A that represents the m points; this solution can be used to get a 2-approximation algorithm for the original problem. We then argue that in fact the relaxation provides a generalized clustering which is useful in its own right. Finally, we …
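The relaxation uses the top k right singular vectors of A; as an illustrative sketch (assumed details, not the paper's code), the k = 1 case — the best-fit line — can be found by power iteration on AᵀA:

```python
import math
import random

def top_right_singular_vector(rows, iters=200, seed=0):
    # Power iteration on A^T A yields the top right singular vector of A,
    # i.e. the direction of the best-fit one-dimensional subspace.  The
    # paper's relaxation uses the top k singular vectors; for brevity this
    # sketch computes only the first.
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        av = [sum(r[j] * v[j] for j in range(d)) for r in rows]        # A v
        w = [sum(av[i] * rows[i][j] for i in range(len(rows)))         # A^T (A v)
             for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Points lying almost exactly along the x-axis.
rows = [(3.0, 0.1), (-2.0, 0.05), (5.0, -0.1), (1.0, 0.02)]
v = top_right_singular_vector(rows)
```

For this data the recovered direction is essentially (±1, 0), the axis the points hug.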
Matrix approximation and projective clustering via volume sampling
 In SODA
, 2006
Cited by 63 (2 self)
We present two new results for the problem of approximating a given real m × n matrix A by a rank-k matrix D, where k < min{m, n}, so as to minimize ‖A − D‖²_F. It is known that by sampling O(k/ε) rows of the matrix, one can find a low-rank approximation with additive error ε‖A‖²_F. Our first result shows that with adaptive sampling in t rounds and O(k/ε) samples in each round, the additive error drops exponentially as ε^t; the computation time is nearly linear in the number of nonzero entries. This demonstrates that multiple passes can be highly beneficial for a natural (and widely studied) algorithmic problem. Our second result is that there exists a subset of O(k²/ε) rows such that their span contains a rank-k approximation with multiplicative (1 + ε) error (i.e., the sum-of-squares distance has a small “coreset” whose span determines a good approximation). This existence theorem leads to a PTAS for the following projective clustering problem: given a set of points P in R^d and integers k, j, find a set of j subspaces F_1, …, F_j, each of dimension at most k, that minimize ∑_{p∈P} min_i d(p, F_i)².
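The distribution behind the known additive-error result can be sketched in a few lines: rows are drawn with probability proportional to their squared norm (length-squared sampling). The paper's adaptive scheme then repeats such sampling against the residual left after projecting out the span of earlier rounds; this sketch shows only the first, non-adaptive round, with illustrative names.

```python
import random

def sample_rows_by_squared_norm(rows, s, seed=0):
    # Length-squared sampling: row i is drawn (with replacement) with
    # probability proportional to ||A_i||^2.
    rng = random.Random(seed)
    weights = [sum(x * x for x in r) for r in rows]
    return rng.choices(rows, weights=weights, k=s)

# One dominant row and 99 near-zero rows: the sample concentrates on
# the row carrying almost all of the Frobenius mass.
rows = [(100.0, 0.0)] + [(0.01, 0.0)] * 99
picked = sample_rows_by_squared_norm(rows, s=50)
```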
The effectiveness of Lloyd-type methods for the k-means problem
 In FOCS
, 2006
Cited by 50 (3 self)
We investigate variants of Lloyd’s heuristic for clustering high-dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd’s heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd’s heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd’s method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.
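Probabilistic seeding of this flavor can be sketched as distance-squared (D²) seeding: pick the first center uniformly, then each later center with probability proportional to its squared distance from the nearest chosen center. The paper's exact seeding process differs in its details; this is an assumption-laden illustration of the general idea.

```python
import math
import random

def d2_seed(points, k, seed=0):
    # Distance-squared seeding: far-away clusters are overwhelmingly
    # likely to receive a seed, giving Lloyd iteration a starting
    # configuration with one center per well-separated cluster.
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        w = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
        centers.append(rng.choices(points, weights=w, k=1)[0])
    return centers

# Two coincident clusters 100 apart: after the first pick, every point in
# the chosen cluster has weight exactly 0, so the second seed must come
# from the other cluster.
points = [(0.0, 0.0)] * 50 + [(100.0, 0.0)] * 50
centers = d2_seed(points, k=2)
```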
On the Impossibility of Dimension Reduction in l_1
 In Proc. 35th Annu. ACM Sympos. Theory Comput
, 2003
Cited by 43 (1 self)
The Johnson-Lindenstrauss Lemma shows that any n points in Euclidean space (with distances measured by the L2 norm) may be mapped down to O((log n)/ε²) dimensions such that no pairwise distance is distorted by more than a (1 + ε) factor. Determining whether such dimension reduction is possible in L1 has been an intriguing open question. Charikar and Sahai [7] recently showed lower bounds for dimension reduction in L1 that can be achieved by linear projections, and positive results for shortest path metrics of restricted graph families. However, the question of general dimension reduction in L1 was still open. For example, it was not known whether it is possible to reduce the number of dimensions to O(log n) with 1 + ε distortion. We show strong lower bounds for general dimension reduction in L1. We give an explicit family of n points in L1 such that any embedding with distortion δ requires n^Ω(1/δ²) dimensions. This proves that there is no analog of the Johnson-Lindenstrauss Lemma for L1.
On the impossibility of dimension reduction in ℓ1
 In Proceedings of the 44th Annual IEEE Conference on Foundations of Computer Science
, 2003
Cited by 28 (1 self)
The Johnson-Lindenstrauss Lemma shows that any n points in Euclidean space (with distances measured by the ℓ2 norm) may be mapped down to O((log n)/ε²) dimensions such that no pairwise distance is distorted by more than a (1 + ε) factor. Determining whether such dimension reduction is possible in ℓ1 has been an intriguing open question. We show strong lower bounds for general dimension reduction in ℓ1. We give an explicit family of n points in ℓ1 such that any embedding with distortion δ requires n^Ω(1/δ²) dimensions. This proves that there is no analog of the Johnson-Lindenstrauss Lemma for ℓ1; in fact, embedding with any constant distortion requires n^Ω(1) dimensions. Further, embedding the points into ℓ1 with 1 + ε distortion requires n …
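The ℓ2 construction whose ℓ1 analog these two entries rule out is a simple random linear map; a minimal sketch (illustrative names, pure-Python) of a Gaussian Johnson-Lindenstrauss projection:

```python
import math
import random

def jl_project(points, target_dim, seed=0):
    # Random Gaussian projection with entries N(0, 1/target_dim).  The
    # Johnson-Lindenstrauss lemma says target_dim = O((log n)/eps^2)
    # preserves all pairwise l2 distances within a (1 +/- eps) factor --
    # exactly the guarantee shown here to have no l1 analog.
    rng = random.Random(seed)
    d = len(points[0])
    R = [[rng.gauss(0, 1 / math.sqrt(target_dim)) for _ in range(d)]
         for _ in range(target_dim)]
    return [tuple(sum(R[i][j] * p[j] for j in range(d))
                  for i in range(target_dim)) for p in points]

random.seed(2)
high = [tuple(random.gauss(0, 1) for _ in range(50)) for _ in range(5)]
low = jl_project(high, target_dim=20)
```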
Self-improving algorithms
 In SODA ’06: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms
Cited by 26 (4 self)
We investigate ways in which an algorithm can improve its expected performance by fine-tuning itself automatically with respect to an arbitrary, unknown input distribution. We give such self-improving algorithms for sorting and computing Delaunay triangulations. The highlights of this work: (i) an algorithm to sort a list of numbers with optimal expected limiting complexity; and (ii) an algorithm to compute the Delaunay triangulation of a set of points with optimal expected limiting complexity. In both cases, the algorithm begins with a training phase during which it adjusts itself to the input distribution, followed by a stationary regime in which the algorithm settles to its optimized incarnation.
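The train-then-steady-state pattern can be illustrated with a toy sorter (the paper's actual sorter is far subtler; every name here is an assumption): the training phase learns quantile boundaries of the input distribution, and the stationary regime bucket-sorts new inputs against them, which is fast while the learned buckets stay balanced.

```python
import bisect
import random

class SelfImprovingSorter:
    # Toy illustration of the self-improving pattern: learn the
    # distribution during training, exploit it in the steady state.
    def __init__(self, n_buckets=16):
        self.n_buckets = n_buckets
        self.boundaries = []

    def train(self, sample_lists):
        # Learn approximate quantile boundaries from training inputs.
        pool = sorted(x for s in sample_lists for x in s)
        step = max(1, len(pool) // self.n_buckets)
        self.boundaries = pool[step::step][: self.n_buckets - 1]

    def sort(self, xs):
        # Steady state: scatter into learned buckets, sort each bucket.
        buckets = [[] for _ in range(len(self.boundaries) + 1)]
        for x in xs:
            buckets[bisect.bisect_left(self.boundaries, x)].append(x)
        out = []
        for b in buckets:
            out.extend(sorted(b))
        return out

rng = random.Random(0)
sorter = SelfImprovingSorter()
sorter.train([[rng.random() for _ in range(100)] for _ in range(20)])
xs = [rng.random() for _ in range(100)]
result = sorter.sort(xs)
```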
Clustering with the connectivity kernel
 In NIPS
, 2004
Cited by 24 (1 self)
Clustering aims at extracting hidden structure in datasets. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two-step procedure: first, the data are transformed in such a way that elongated structures become compact ones; in a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally NP-hard clustering problem; (iii) the new method does not contain free kernel parameters, in contrast to methods like spectral clustering or mean-shift clustering.
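One common formalisation of such a transform is the minimax (path-based) distance, where the cost of a path is its largest hop; a brute-force sketch (illustrative, and whether it matches the paper's exact transform is an assumption):

```python
import math

def minimax_distances(points):
    # Path-based "connectivity" distance: d[i][j] is the smallest possible
    # largest hop over all paths from i to j.  Points linked by a chain of
    # close neighbours become close, so elongated structures collapse into
    # compact ones before the compactness-based clustering step.
    n = len(points)
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    for k in range(n):                      # Floyd-Warshall-style update
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], max(d[i][k], d[k][j]))
    return d

# Four chained points one apart, plus an outlier at x = 10: the chain's
# endpoints end up at minimax distance 1 despite Euclidean distance 3.
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
dist = minimax_distances(chain)
```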