Results 1–10 of 16
Approximate Clustering without the Approximation
Abstract

Cited by 37 (18 self)
Approximation algorithms for clustering points in metric spaces is a flourishing area of research, with much research effort spent on getting a better understanding of the approximation guarantees possible for many objective functions such as k-median, k-means, and min-sum clustering. This quest for better approximation algorithms is further fueled by the implicit hope that these better approximations also give us more accurate clusterings. E.g., for many problems such as clustering proteins by function, or clustering images by subject, there is some unknown “correct” target clustering, and the implicit hope is that approximately optimizing these objective functions will in fact produce a clustering that is close (in symmetric difference) to the truth. In this paper, we show that if we make this implicit assumption explicit (that is, if we assume that any c-approximation to the given clustering objective F is ε-close to the target), then we can produce clusterings that are O(ε)-close to the target, even for values c for which obtaining a c-approximation is NP-hard. In particular, for the k-median and k-means objectives, we show that we can achieve this guarantee for any constant c > 1, and for the min-sum objective we can do this for any constant c > 2. Our results also highlight a somewhat surprising conceptual difference between assuming that the optimal solution to, say, the k-median objective is ε-close to the target, and assuming that any approximately optimal solution is ε-close to the target, even for an approximation factor of, say, c = 1.01. In the former case, the problem of finding a solution that is O(ε)-close to the target remains computationally hard, and yet in the latter case we have an efficient algorithm.
Spectral Clustering of Graphs with General Degrees in the Extended Planted Partition Model
Abstract

Cited by 10 (0 self)
In this paper, we examine a spectral clustering algorithm for similarity graphs drawn from a simple random graph model in which nodes are allowed to have varying degrees, and we provide theoretical bounds on its performance. The random graph model we study is the Extended Planted Partition (EPP) model, a variant of the classical planted partition model. The standard approach to spectral clustering of graphs is to compute the bottom k singular vectors or eigenvectors of a suitable graph Laplacian, project the nodes of the graph onto these vectors, and then use an iterative clustering algorithm on the projected nodes. However, a challenge with applying this approach to graphs generated from the EPP model is that unnormalized Laplacians do not work, and normalized Laplacians do not concentrate well when the graph has many low-degree nodes. We resolve this issue by introducing the notion of a degree-corrected graph Laplacian. For graphs with many low-degree nodes, degree correction has a regularizing effect on the Laplacian. Our spectral clustering algorithm projects the nodes in the graph onto the bottom k right singular vectors of the degree-corrected random-walk Laplacian, and clusters the nodes in this subspace. We show guarantees on the performance of this algorithm, demonstrating that it outputs the correct partition under a wide range of parameter values. Unlike some previous work, our algorithm does not require access to any generative parameters of the model.
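The pipeline described in this abstract (degree-correct the random-walk Laplacian, project onto its bottom k right singular vectors, then run an iterative clustering step on the projected nodes) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the specific correction used here, adding a constant tau to every degree before inverting, is an assumed form of degree correction, and the Lloyd-style clustering step with farthest-point initialization is a generic stand-in for whatever iterative clustering the authors use.

```python
import numpy as np

def degree_corrected_spectral_clustering(A, k, tau=None, n_iter=25):
    """Sketch of degree-corrected spectral clustering of adjacency matrix A.

    tau (the amount added to each degree) is an assumed regularizer; the
    paper's exact degree-corrected operator may differ.
    """
    n = A.shape[0]
    d = A.sum(axis=1)
    if tau is None:
        tau = d.mean()                        # assumed choice: average degree
    # degree-corrected random-walk Laplacian: I - (D + tau I)^{-1} A
    L = np.eye(n) - A / (d + tau)[:, None]
    _, _, Vt = np.linalg.svd(L)               # singular values in descending order
    X = Vt[-k:].T                             # bottom-k right singular vectors, n x k
    # deterministic farthest-point initialization for the clustering step
    centers = [X[0]]
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    # Lloyd-style iterations on the projected rows
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

On an easy instance (two disjoint cliques), the bottom two singular vectors are constant on each block, so the projected points collapse to two locations and the clustering step recovers the partition.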
Community landscapes: an integrative approach to determine overlapping network module hierarchy, identify key nodes and predict network dynamics
 PLoS One
, 2010
Abstract

Cited by 6 (1 self)
Background: Network communities aid the functional organization and evolution of complex networks. However, the development of a method that is both fast and accurate, and that provides modular overlaps and partitions of a heterogeneous network, has proven rather difficult. Methodology/Principal Findings: Here we introduce the novel concept of ModuLand, an integrative method family determining overlapping network modules as hills of an influence-function-based, centrality-type community landscape, and including several widely used modularization methods as special cases. As various adaptations of the method family, we developed several algorithms which provide an efficient analysis of weighted and directed networks, and (1) determine pervasively overlapping modules with high resolution; (2) uncover a detailed hierarchical network structure allowing an efficient, zoom-in analysis of large networks; (3) allow the determination of key network nodes; and (4) help to predict network dynamics. Conclusions/Significance: The concept opens a wide range of possibilities to develop new approaches and applications
Improved spectralnorm bounds for clustering
 In APPROX-RANDOM, 37–49
, 2012
Abstract

Cited by 3 (0 self)
Aiming to unify known results about clustering mixtures of distributions under separation conditions, Kumar and Kannan [KK10] introduced a deterministic condition for clustering datasets. They showed that this single deterministic condition encompasses many previously studied clustering assumptions. More specifically, their proximity condition requires that in the target k-clustering, the projection of a point x onto the line joining its cluster center µ and some other center µ′ is a large additive factor closer to µ than to µ′. This additive factor can be roughly described as k times the spectral norm of the matrix representing the differences between the given (known) dataset and the means of the (unknown) target clustering. Clearly, the proximity condition implies center separation: the distance between any two centers must be as large as the above-mentioned bound. In this paper we improve upon the work of Kumar and Kannan [KK10] along several axes. First, we weaken the center-separation bound by a factor of √k, and secondly we weaken the proximity condition by a factor of k (in other words, the revised separation condition is independent of k). Using these weaker bounds we still achieve the same guarantees when all
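The proximity condition the abstract describes is concrete enough to check numerically: project each point onto the line through its own center and every other center, and require the projection to be closer to its own center by an additive term proportional to the spectral norm of the deviation matrix. The sketch below is an illustration under assumptions: `factor` stands in for the k-dependent constant, which is exactly what differs between the original [KK10] bound and this paper's weakened one.

```python
import numpy as np

def satisfies_proximity(X, labels, centers, factor):
    """Check a Kumar--Kannan-style proximity condition (hedged sketch).

    For each point x in cluster i and every other cluster j, the projection
    of x onto the line through centers[i] and centers[j] must be closer to
    centers[i] by at least factor * ||X - C||_2, where C assigns each row
    its cluster mean. `factor` is a placeholder for the k-dependent constant.
    """
    C = centers[labels]
    spec = np.linalg.norm(X - C, ord=2)   # spectral norm of the deviation matrix
    for x, i in zip(X, labels):
        for j in range(len(centers)):
            if j == i:
                continue
            u = centers[j] - centers[i]
            u = u / np.linalg.norm(u)     # direction of the line mu_i -> mu_j
            proj = x @ u                  # coordinate of x along that line
            gap = (centers[j] @ u - proj) - (proj - centers[i] @ u)
            if gap < factor * spec:
                return False
    return True
```

With well-separated clusters the condition holds for a small `factor` and fails once `factor` is large, matching the intuition that a weaker (smaller) required factor is a milder assumption on the data.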
Clustering under Approximation Stability
, 2009
Abstract

Cited by 1 (1 self)
A common approach to clustering data is to view data objects as points in a metric space, and then to optimize a natural distance-based objective such as the k-median, k-means, or min-sum score. For applications such as clustering proteins by function or clustering images by subject, the implicit hope in taking this approach is that the optimal solution to the chosen objective will closely match the desired “target” clustering (e.g., a correct clustering of proteins by function or of images by who is in them). However, most distance-based objectives, including those above, are NP-hard to optimize. So, this assumption by itself is not sufficient, assuming P ≠ NP, to achieve low-error clusterings via polynomial-time algorithms. In this paper, we show that we can bypass this barrier if we slightly extend this assumption to ask that for some small constant c, not only the optimal solution, but also all c-approximations to the optimal solution, differ from the target on at most some ε fraction of points; we call this (c, ε)-approximation-stability. We show that under this condition, it is possible to efficiently obtain low-error clusterings even if the property holds only for values c for which the objective is known to be NP-hard to approximate. Specifically, for any constant c > 1, (c, ε)-approximation-stability of the k-median or k-means objectives can be used to efficiently produce a clustering of error O(ε), as
Uniqueness of Tensor Decompositions with Applications to Polynomial Identifiability. arXiv:1304.8087
, 2013
Abstract

Cited by 1 (0 self)
We give a robust version of the celebrated result of Kruskal on the uniqueness of tensor decompositions: we prove that given a tensor whose decomposition satisfies a robust form of Kruskal’s rank condition, it is possible to approximately recover the decomposition if the tensor is known up to a sufficiently small (inverse-polynomial) error. Kruskal’s theorem has found many applications in proving the identifiability of parameters for various latent variable models and mixture models such as hidden Markov models, topic models, etc. Our robust version immediately implies identifiability using only polynomially many samples in many of these settings. This polynomial identifiability is an essential first step towards efficient learning algorithms for these models. Recently, algorithms based on tensor decompositions have been used to estimate the parameters of various hidden variable models efficiently in special cases, as long as they satisfy certain “non-degeneracy” properties. Our methods give a way to go beyond this non-degeneracy barrier, and establish polynomial identifiability of the parameters under much milder conditions. Given the importance of Kruskal’s theorem in the tensor literature, we expect that this robust version will have several applications beyond the settings we explore in this work.
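For readers unfamiliar with the rank condition this abstract refers to: the Kruskal rank of a matrix is the largest r such that every set of r columns is linearly independent (as opposed to the ordinary rank, which only requires some such set). A brute-force computation makes the definition concrete; this is just the classical, non-robust quantity, not the robust variant the paper develops.

```python
import numpy as np
from itertools import combinations

def kruskal_rank(M, tol=1e-10):
    """Kruskal rank of M: the largest r such that EVERY choice of r columns
    is linearly independent. Brute force over column subsets; fine for the
    small matrices used here, exponential in general."""
    n = M.shape[1]
    for r in range(n, 0, -1):
        if all(np.linalg.matrix_rank(M[:, list(c)], tol=tol) == r
               for c in combinations(range(n), r)):
            return r
    return 0
```

A single repeated column drops the Kruskal rank to 1 even though the ordinary rank stays 2, which is why Kruskal-rank conditions are genuinely stronger than rank conditions in uniqueness theorems.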
Distributed PCA and k-Means Clustering
Abstract

Cited by 1 (0 self)
This paper proposes a distributed PCA algorithm, with the theoretical guarantee that any good approximate solution on the projected data for k-means clustering is also a good approximation on the original data, while the projected dimension required is independent of the original dimension. When combined with the distributed coreset-based clustering approach in [3], this leads to an algorithm in which the number of vectors communicated is independent of the size and the dimension of the original data. Our experimental results demonstrate the effectiveness of the algorithm.
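The communication pattern behind distributed-PCA schemes of this kind can be sketched as follows. This is an illustration under assumptions, not the paper's protocol: here each machine sends a rank-t summary (its top-t right singular vectors scaled by the singular values), and a coordinator stacks the summaries and computes one global t-dimensional projection, so the number of vectors communicated depends on t, not on the number of local points.

```python
import numpy as np

def distributed_pca_projection(parts, t):
    """Hedged sketch of a distributed-PCA round.

    parts: list of local data matrices (rows = points), one per machine.
    Each machine communicates only a t x d summary; the coordinator
    returns the globally projected data and the d x t projection P.
    """
    summaries = []
    for X in parts:
        U, S, Vt = np.linalg.svd(X, full_matrices=False)
        summaries.append(np.diag(S[:t]) @ Vt[:t])   # rank-t local summary
    stacked = np.vstack(summaries)                  # coordinator's view
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    P = Vt[:t].T                                    # global projection, d x t
    return [X @ P for X in parts], P
```

When the data actually lie in a t-dimensional subspace, this projection preserves norms (and hence k-means costs) exactly; the paper's contribution is the guarantee for the general, noisy case.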
Thesis Proposal: Approximation Algorithms and New Models for Clustering and Learning
Abstract
This thesis concerns two fundamental problems in clustering and learning: (a) the k-median and k-means clustering problems, and (b) the problem of learning under adversarial noise, also known as agnostic learning. For k-median and k-means clustering we design efficient algorithms which provide arbitrarily good approximation guarantees on a wide class of datasets. These are datasets which satisfy a natural notion of stability called weak-deletion stability. In addition to giving good approximation algorithms, the notion of stability studied in this thesis seems quite promising in approaching the task of transfer clustering. We also make progress on the problem of agnostically learning the class of Boolean disjunctions and improve on the best known approximation guarantee. In addition we study two new interactive models for clustering and learning which are well
Acknowledgements
, 2006
Abstract
teaching and invaluable instruction. He has not only taught us how to do research and coursework, but also shown us how to face life with the right attitude. I also thank him for enduring my poor English during the last two years; with his help I have made remarkable progress in English. He also encouraged my interest in literature and classical music, so that I can enjoy the beautiful things in the world. At the same time, he impressed upon me the importance of foundations: a great achievement always comes from a strong foundation. I will remember his meaningful words forever. Great thanks to my parents for their support and encouragement. Their support helped me to concentrate on my studies, and their encouragement gave me the power to keep advancing in the future. I could not have accomplished anything without them. I also thank Professor G. S. Huang. When I did not understand a concept in a paper or a course, he always gave me the correct direction. I am also quite grateful for his selfless
Center Based Clustering: A Foundational Perspective
, 2013
Abstract
In the first part of this chapter we present existing work on center-based clustering methods. In particular, we focus on k-means and k-median clustering, which are two of the most widely used clustering objectives. We describe popular heuristics for these methods and the theoretical guarantees associated with them. We also describe how to design worst-case approximately optimal algorithms for these problems. In the second part of the chapter we describe recent work on how to improve on these worst-case algorithms even further by using insights from the nature of real-world clustering problems and data sets. Finally, we also summarize theoretical work on clustering data generated from mixture models such as a mixture of Gaussians.
1 Approximation algorithms for k-means and k-median
One of the most popular approaches to clustering is to define an objective function over the data points and find a partitioning which achieves the optimal solution, or an approximately optimal solution, to the given objective function. Common objective functions include center-based objectives such as k-median and k-means, where one selects k center points and the clustering is obtained by assigning each data point to its closest center point. In
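The two objectives this chapter excerpt names differ only in whether distances to the nearest chosen center are squared (k-means) or not (k-median), which is what makes k-median more robust to outliers. A minimal sketch of the two cost functions, for a given set of centers:

```python
import numpy as np

def kmeans_cost(X, centers):
    """k-means objective: sum of SQUARED distances to the nearest center."""
    d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)   # all pairwise squared dists
    return d2.min(axis=1).sum()

def kmedian_cost(X, centers):
    """k-median objective: sum of distances to the nearest center."""
    d = np.sqrt(((X[:, None] - centers[None]) ** 2).sum(-1))
    return d.min(axis=1).sum()
```

A point at distance 2 from its center contributes 4 to the k-means score but only 2 to the k-median score, so a single far-away outlier can dominate the k-means objective while barely moving the k-median one.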