Results 1–10 of 39
Power Iteration Clustering
Abstract
Cited by 16 (5 self)
We show that the power iteration, typically used to approximate the dominant eigenvector of a matrix, can be applied to a normalized affinity matrix to create a one-dimensional embedding of the underlying data. This embedding is then used, as in spectral clustering, to cluster the data via k-means. We demonstrate this method’s effectiveness and scalability on several synthetic and real datasets, and conclude that to find a meaningful low-dimensional embedding for clustering, it is not necessary to find any eigenvectors; we just need a linear combination of the top eigenvectors.
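The procedure this abstract describes can be sketched in a few lines. Note that the sketch below is illustrative, not the paper's exact algorithm: the iteration count, the L1 normalization, and the degree-based starting vector are assumptions, and the key idea is stopping the power iteration early, before it converges to the trivial constant eigenvector of the row-stochastic matrix.

```python
import numpy as np

def pic_embedding(A, n_iter=10):
    """Sketch of a PIC-style one-dimensional embedding: power iteration on
    the row-normalized affinity W = D^-1 A, stopped early so the vector is
    still a linear combination of the top eigenvectors rather than the
    constant dominant one. Iteration count and start vector are illustrative."""
    W = A / A.sum(axis=1, keepdims=True)   # row-stochastic normalization
    d = A.sum(axis=1)
    v = d / d.sum()                        # non-uniform start (the uniform vector is a fixed point)
    for _ in range(n_iter):
        v = W @ v
        v /= np.abs(v).sum()               # L1-normalize to keep values stable
    return v

def kmeans_1d(v, k=2, n_iter=20):
    """Tiny Lloyd's k-means on the one-dimensional embedding."""
    centers = np.linspace(v.min(), v.max(), k)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = v[labels == j].mean()
    return labels
```

On a toy affinity matrix with two weakly connected blocks, the early-stopped vector takes distinct values on the two blocks, and the 1-D k-means step recovers them.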
A Tour of Modern Image Filtering: New Insights and Methods, Both Practical and Theoretical
 IEEE Signal Processing Magazine
, 2013
Abstract
Cited by 6 (0 self)
Recent developments in computational imaging and restoration have heralded the arrival and convergence of several powerful methods for adaptive processing of multidimensional data. Examples include moving least squares (from graphics), the bilateral filter (BF) and anisotropic diffusion (from computer vision), boosting, kernel, and spectral methods (from machine learning), nonlocal means (NLM) and its variants (from signal processing), Bregman iterations (from applied math), and kernel regression and iterative scaling (from statistics). While these approaches found their inspirations in diverse fields of nascence, they are deeply connected. In this article, I present a practical and accessible framework to understand some of the basic underpinnings of these methods, with the intention of leading the reader to a broad understanding of how they interrelate. I also illustrate connections between these techniques and more classical (empirical) Bayesian approaches. The proposed framework is used to arrive at new insights and methods, both practical and theoretical. In particular, several novel optimality properties of algorithms in wide use, such as block-matching and three-dimensional (3D) filtering (BM3D), and methods for their iterative improvement (or nonexistence thereof) are discussed. A general approach is laid out to enable the performance analysis and subsequent improvement of many existing filtering algorithms. While much of the material discussed is applicable to the wider class of linear degradation models beyond noise (e.g., blur), to keep matters focused, we consider the problem of denoising here.
Digital Object Identifier: 10.1109/MSP.2011.2179329. Date of publication: 5 December 2012.
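As an illustration of the edge-preserving adaptive filters the article surveys, here is a minimal one-dimensional bilateral filter. The parameter values, window radius, and border handling below are illustrative choices, not taken from the article: each output sample is a weighted average of its neighbors, with weights falling off both with spatial distance and with intensity difference, so sharp edges survive while flat regions are smoothed.

```python
import numpy as np

def bilateral_filter_1d(x, sigma_s=2.0, sigma_r=0.1, radius=5):
    """Minimal 1-D bilateral filter sketch: spatial Gaussian (sigma_s) times
    range Gaussian on intensity differences (sigma_r). Borders are handled
    by replicating the edge samples."""
    n = len(x)
    y = np.empty(n)
    offsets = np.arange(-radius, radius + 1)
    spatial = np.exp(-0.5 * (offsets / sigma_s) ** 2)
    for i in range(n):
        idx = np.clip(i + offsets, 0, n - 1)        # replicate-border indexing
        rng = np.exp(-0.5 * ((x[idx] - x[i]) / sigma_r) ** 2)
        w = spatial * rng
        y[i] = np.sum(w * x[idx]) / np.sum(w)
    return y
```

On a clean step signal, a plain Gaussian average would blur the transition, whereas the range weight here makes cross-edge contributions negligible, so the step is preserved.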
Spectral Clustering on a Budget
Abstract
Cited by 6 (0 self)
Spectral clustering is a modern and well-known method for performing data clustering. However, it depends on the availability of a similarity matrix, which in many applications can be nontrivial to obtain. In this paper, we focus on the problem of performing spectral clustering under a budget constraint, where there is a limit on the number of entries which can be queried from the similarity matrix. We propose two algorithms for this problem, and study them theoretically and experimentally. These algorithms allow a tradeoff between computational efficiency and actual performance, and are also relevant for the problem of speeding up standard spectral clustering.
A Very Fast Method for Clustering Big Text Datasets
Abstract
Cited by 5 (4 self)
Large-scale text datasets have long eluded a family of particularly elegant and effective clustering methods that exploit the power of pairwise similarities between data points, due to the prohibitive cost, time- and space-wise, of operating on a similarity matrix, where the state of the art is at best quadratic in time and space. We present an extremely fast and simple method that also uses the power of all pairwise similarities between data points, and show through experiments that it matches previous methods in clustering accuracy, and it does so in linear time and space, without sampling data points or sparsifying the similarity matrix.
Fast Overlapping Clustering of Networks Using Sampled Spectral Distance Embedding and GMMs
Abstract
Cited by 5 (2 self)
Clustering social networks is vital to understanding online interactions and influence. This task becomes more difficult when communities overlap, and when the social networks become extremely large. We present an efficient algorithm for constructing overlapping clusters, roughly linear in the size of the network. The algorithm first embeds the graph and then performs a metric clustering using a Gaussian Mixture Model (GMM). We evaluate the algorithm on the DBLP paper-paper network, which consists of about 1 million nodes and over 30 million edges; we can cluster this network in under 20 minutes on a modest single-CPU machine.
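The embed-then-GMM pipeline can be sketched as follows. This is a simplification under stated assumptions: a dense spectral embedding stands in for the paper's sampled spectral distance embedding (which the abstract does not detail), and a tiny two-component EM replaces a full GMM library. The soft responsibilities returned by EM are what permit overlapping memberships.

```python
import numpy as np

def spectral_embed(A, dim=1):
    """Embed graph nodes with top non-trivial eigenvectors of the
    symmetrically normalized adjacency (illustrative stand-in for the
    paper's sampled embedding; dense, so only suitable for small graphs)."""
    d = A.sum(axis=1)
    Dn = np.diag(1.0 / np.sqrt(d))
    vals, vecs = np.linalg.eigh(Dn @ A @ Dn)   # eigenvalues ascending
    return vecs[:, -(dim + 1):-1]              # skip the trivial top eigenvector

def gmm_soft_labels(X, k=2, n_iter=50):
    """Tiny spherical-GMM EM sketch (k = 2 only, deterministic spread-out
    init). Returns soft responsibilities; thresholding them, rather than
    taking the argmax, would give overlapping clusters."""
    assert k == 2, "sketch handles k = 2"
    n, dim = X.shape
    i0, i1 = int(X[:, 0].argmin()), int(X[:, 0].argmax())
    mu = np.stack([X[i0], X[i1]])
    var = np.full(k, X.var() + 1e-6)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # (n, k)
        logp = np.log(pi) - 0.5 * d2 / var - 0.5 * dim * np.log(var)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)                       # E-step
        nk = r.sum(axis=0) + 1e-12                              # M-step
        mu = (r.T @ X) / nk[:, None]
        var = (r * d2).sum(0) / (nk * dim) + 1e-6
        pi = nk / n
    return r
```

On a toy two-community graph, the argmax of the responsibilities recovers the communities.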
A Tour of Modern Image Filtering
, 2011
Abstract
Cited by 4 (2 self)
Recent developments in computational imaging and restoration have heralded the arrival and convergence of several powerful methods for adaptive processing of multidimensional data. Examples include ...
Efficient spectral neighborhood blocking for entity resolution
 In ICDE
, 2011
Abstract
Cited by 3 (1 self)
Abstract—In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real world. This problem arises for subscribers in multiple services, customers in supply chain management, and users in social networks when a unique identifier representing a real-world entity is lacking across multiple data sources. Entity resolution is the task of identifying and discovering objects in the data sets that refer to the same entity in the real world. We investigate the entity resolution problem for large data sets, where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, namely SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering, such that real entities can be identified accurately by neighborhood records in the tree. There are two major novel aspects in our approach: 1) we develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves the scalability of the standard spectral clustering algorithm; 2) we utilize a stopping criterion specified by Newman-Girvan modularity in the bipartition process. Our experimental results with both synthetic and real-world data demonstrate that SPAN is robust and outperforms other blocking algorithms in terms of accuracy, while remaining efficient and scalable for large data sets.
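The first novel aspect, spectral clustering without explicitly computing pairwise similarities, can be sketched for one level of the bipartition tree. In this sketch, which is illustrative rather than the paper's exact algorithm, records are rows of a feature matrix F, the similarity matrix S = F Fᵀ is never formed, and every product S·v is computed as F(Fᵀv) in time proportional to the number of nonzeros in F.

```python
import numpy as np

def bipartition_implicit(F, n_iter=200):
    """One level of a spectral bipartition over records (rows of F), using
    only matrix-vector products with F, never the n-by-n similarity S = F F^T.
    We split on the sign of an approximate second eigenvector of the
    degree-normalized similarity; the deflation-based power iteration and
    the starting vector are illustrative choices."""
    n = F.shape[0]
    d = F @ (F.T @ np.ones(n))           # row sums of S without forming S
    dinv_sqrt = 1.0 / np.sqrt(d)

    def normalized_matvec(v):            # (D^-1/2 S D^-1/2) v, implicitly
        return dinv_sqrt * (F @ (F.T @ (dinv_sqrt * v)))

    top = np.sqrt(d)                     # known top eigenvector of the
    top /= np.linalg.norm(top)           # normalized similarity (eigenvalue 1)
    v = np.arange(n, dtype=float)
    v -= v.mean()
    for _ in range(n_iter):
        v = normalized_matvec(v)
        v -= top * (top @ v)             # deflate: stay orthogonal to `top`
        v /= np.linalg.norm(v)
    return v >= 0                        # boolean split of the records
```

On records with two disjoint feature groups, the sign split separates the two groups, which would then be recursed on to build the tree.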
A Tour of Modern Image Processing
, 2011
Abstract
Cited by 2 (0 self)
Recent developments in computational imaging and restoration have heralded the arrival and convergence of several powerful methods for adaptive processing of multidimensional data. Examples include Moving Least Squares (from graphics), the Bilateral Filter and Anisotropic Diffusion (from machine vision), Boosting and Spectral Methods (from machine learning), Nonlocal Means (from signal processing), Bregman Iterations (from applied math), and Kernel Regression and Iterative Scaling (from statistics). While these approaches found their inspirations in diverse fields of nascence, they are deeply connected. In this paper I present a practical and unified framework to understand some of the basic underpinnings of these methods, with the intention of leading the reader to a broad understanding of how they interrelate. I also illustrate connections between these techniques and Bayesian approaches. The proposed framework is used to arrive at new insights and methods, both practical and theoretical. In particular, several novel optimality properties of algorithms in wide use, such as BM3D, and methods for their iterative improvement (or nonexistence thereof) are discussed. Several theoretical results are discussed that will enable the performance analysis and subsequent improvement of any existing restoration algorithm. While much of the material discussed is applicable to a wider class of linear degradation models (e.g., noise, blur), in order to keep matters focused, we consider the problem of denoising here.
Finding representative nodes in probabilistic graphs
Abstract
Cited by 2 (2 self)
We introduce the problem of identifying representative nodes in probabilistic graphs, motivated by the need to produce different simple views of large networks. We define a probabilistic similarity measure for nodes, and then apply clustering methods to find groups of nodes. Finally, a representative is output from each cluster. We report on experiments with real biomedical data, using both the k-medoids and hierarchical clustering methods in the clustering step. The results suggest that the clustering-based approaches are capable of finding a representative set of nodes.
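The cluster-then-output-a-representative step can be sketched with the k-medoids variant of the clustering stage, assuming a precomputed distance matrix; the medoid of each cluster serves naturally as its representative. The initialization and iteration counts below are illustrative, not from the paper.

```python
import numpy as np

def k_medoids_representatives(D, k=2, n_iter=20):
    """Minimal k-medoids sketch on a precomputed distance matrix D:
    alternate between assigning points to the nearest medoid and moving
    each medoid to the cluster member minimizing total within-cluster
    distance. The medoids are the representative nodes."""
    medoids = np.arange(k)                          # simple deterministic init
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)   # assignment step
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                # member with smallest summed distance to its own cluster
                sub = D[np.ix_(members, members)]
                medoids[j] = members[np.argmin(sub.sum(axis=1))]
    labels = np.argmin(D[:, medoids], axis=1)       # final assignment
    return medoids, labels
```

On five points forming two groups on a line, the algorithm returns one representative from each group.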
Clustering under Approximation Stability
, 2009
Abstract
Cited by 1 (1 self)
A common approach to clustering data is to view data objects as points in a metric space, and then to optimize a natural distance-based objective such as the k-median, k-means, or min-sum score. For applications such as clustering proteins by function or clustering images by subject, the implicit hope in taking this approach is that the optimal solution to the chosen objective will closely match the desired “target” clustering (e.g., a correct clustering of proteins by function or of images by who is in them). However, most distance-based objectives, including those above, are NP-hard to optimize. So, this assumption by itself is not sufficient, assuming P ≠ NP, to achieve low-error clusterings via polynomial-time algorithms. In this paper, we show that we can bypass this barrier if we slightly extend this assumption to ask that for some small constant c, not only the optimal solution, but also all c-approximations to the optimal solution, differ from the target on at most some ε fraction of points; we call this (c, ε)-approximation-stability. We show that under this condition, it is possible to efficiently obtain low-error clusterings even if the property holds only for values c for which the objective is known to be NP-hard to approximate. Specifically, for any constant c > 1, (c, ε)-approximation-stability of the k-median or k-means objectives can be used to efficiently produce a clustering of error O(ε), as ...
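The stability notion this abstract introduces can be stated compactly; the notation dist(·,·), for the fraction of points on which two clusterings disagree under the best matching of cluster labels, is assumed from context rather than quoted from the paper:

```latex
\textbf{Definition ($(c,\varepsilon)$-approximation-stability).}
An instance satisfies $(c,\varepsilon)$-approximation-stability for an
objective $\Phi$ (e.g., $k$-median) and target clustering $\mathcal{C}_T$
if every clustering $\mathcal{C}$ with
$\Phi(\mathcal{C}) \le c \cdot \mathrm{OPT}_\Phi$
also satisfies $\mathrm{dist}(\mathcal{C}, \mathcal{C}_T) \le \varepsilon$.
```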