Results 1–7 of 7
Creating efficient codebooks for visual recognition
In Proceedings of the IEEE International Conference on Computer Vision, 2005
"... Visual codebook based quantization of robust appearance descriptors extracted from local image patches is an effective means of capturing image statistics for texture analysis and scene classification. Codebooks are usually constructed by using a method such as kmeans to cluster the descriptor vect ..."
Abstract

Cited by 193 (22 self)
 Add to MetaCart
Visual codebook-based quantization of robust appearance descriptors extracted from local image patches is an effective means of capturing image statistics for texture analysis and scene classification. Codebooks are usually constructed by using a method such as k-means to cluster the descriptor vectors of patches sampled either densely (‘textons’) or sparsely (‘bags of features’ based on keypoints or salience measures) from a set of training images. This works well for texture analysis in homogeneous images, but the images that arise in natural object recognition tasks have far less uniform statistics. We show that for dense sampling, k-means over-adapts to this, clustering centres almost exclusively around the densest few regions in descriptor space and thus failing to code other informative regions. This gives suboptimal codes that are no better than using randomly selected centres. We describe a scalable acceptance-radius-based clusterer that generates better codebooks and study its performance on several image classification tasks. We also show that dense representations outperform equivalent keypoint-based ones on these tasks, and that SVM- or Mutual-Information-based feature selection starting from a dense codebook further improves the performance.
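The acceptance-radius idea in this abstract can be illustrated with a minimal greedy ("leader") clusterer: a new centre is created only when a descriptor falls outside the radius of every existing centre, so the densest region of descriptor space can absorb at most a bounded number of centres. This is a hedged sketch of the general technique, not the authors' exact algorithm; `radius_codebook` and its parameters are hypothetical names.

```python
import numpy as np

def radius_codebook(descriptors, radius, max_centres=None):
    """Greedy fixed-acceptance-radius ('leader') clustering sketch.

    A descriptor within `radius` of an existing centre is considered
    covered; otherwise it becomes a new centre. Unlike k-means, dense
    regions cannot monopolise centres: one centre covers a whole ball.
    """
    centres = []
    for d in descriptors:
        if centres:
            dists = np.linalg.norm(np.asarray(centres) - d, axis=1)
            if dists.min() <= radius:
                continue  # already covered by an existing centre
        if max_centres is None or len(centres) < max_centres:
            centres.append(d)
    return np.asarray(centres)
```

For example, five 2-D descriptors forming three well-separated groups yield three centres with `radius=1.0`, regardless of how many points pile up near each group.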
A Framework for Statistical Clustering with Constant-Time Approximation Algorithms for K-Median Clustering
In COLT, 2004
"... We consider a framework in which the clustering algorithm gets as input a sample generated i.i.d by some unknown arbitrary distribution, and has to output a clustering of the full domain set, that is evaluated with respect to the underlying distribution. We provide general conditions on clusteri ..."
Abstract

Cited by 25 (4 self)
 Add to MetaCart
We consider a framework in which the clustering algorithm gets as input a sample generated i.i.d. by some unknown arbitrary distribution, and has to output a clustering of the full domain set that is evaluated with respect to the underlying distribution. We provide general conditions on clustering problems that imply the existence of sampling-based clusterings that approximate the optimal clustering. We show that K-median clustering, as well as the Vector Quantization problem, satisfies these conditions. In particular, our results apply to the sampling-based approximate clustering scenario. As a corollary, we get a sampling-based algorithm for the K-median clustering problem that finds an almost optimal set of centers in time depending only on the confidence and accuracy parameters of the approximation, but independent of the input size. Furthermore, in the Euclidean input case, the running time of our algorithm is independent of the Euclidean dimension.
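The sample-then-cluster scheme behind this result can be sketched as follows: draw a uniform i.i.d. sample, solve (approximately) on the sample only, and return those centers for the whole data set, so the clustering work depends on the sample size rather than the input size. The brute-force sub-solver and all names below are illustrative assumptions, standing in for whatever K-median approximation algorithm is actually used on the sample.

```python
import random
from itertools import combinations

def kmedian_cost(points, centres, dist):
    """Sum over points of the distance to the nearest centre."""
    return sum(min(dist(p, c) for c in centres) for p in points)

def sample_kmedian(points, k, sample_size, dist, seed=0):
    """Sampling-based K-median sketch.

    Cluster only an i.i.d. uniform sample, so everything after the
    sampling step costs a function of `sample_size` and `k`, not of
    len(points). Here the sample is clustered by brute force over all
    k-subsets (fine for small samples); any K-median approximation
    algorithm could be substituted for that step.
    """
    rng = random.Random(seed)
    sample = [rng.choice(points) for _ in range(sample_size)]
    best = min(combinations(sample, k),
               key=lambda cs: kmedian_cost(sample, cs, dist))
    return list(best)
```

The returned centers are sample points, as in the discrete K-median setting; the theorems in the paper quantify how large the sample must be for the cost on the full distribution to be near-optimal.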
Sublinear-time algorithms
In Oded Goldreich, editor, Property Testing, volume 6390 of Lecture Notes in Computer Science, 2010
"... In this paper we survey recent (up to end of 2009) advances in the area of sublineartime algorithms. 1 ..."
Abstract

Cited by 11 (2 self)
 Add to MetaCart
In this paper we survey recent (up to the end of 2009) advances in the area of sublinear-time algorithms.
Sublinear-time approximation for clustering via random sampling
In Proceedings of the 31st International Colloquium on Automata, Languages and Programming (ICALP’04), 2004
"... Abstract. In this paper we present a novel analysis of a random sampling approach for three clustering problems in metric spaces: kmedian, minsum kclustering, and balanced kmedian. For all these problems we consider the following simple sampling scheme: select a small sample set of points unifor ..."
Abstract

Cited by 8 (1 self)
 Add to MetaCart
In this paper we present a novel analysis of a random sampling approach for three clustering problems in metric spaces: k-median, min-sum k-clustering, and balanced k-median. For all these problems we consider the following simple sampling scheme: select a small sample of points uniformly at random from V and then run some approximation algorithm on this sample to compute an approximation of the best possible clustering of this set. Our main technical contribution is a significantly strengthened analysis of the approximation guarantee given by this scheme for these clustering problems. The main motivation behind our analyses was to design sublinear-time algorithms for clustering problems. Our second contribution is the development of new approximation algorithms for the aforementioned clustering problems. Using our random sampling approach, we obtain for the first time approximation algorithms whose running time is independent of the input size and depends only on k and the diameter of the metric space.
ALGORITHMS FOR MIXTURE MODELS
2006
"... Mixture models form one of the most fundamental classes of generative models for clustered data. Specific application examples include text classification problems, image segmentation and motion detection, collaborative filtering and many others. However, quite surprisingly, very little had been kno ..."
Abstract
 Add to MetaCart
Mixture models form one of the most fundamental classes of generative models for clustered data. Specific application examples include text classification, image segmentation and motion detection, collaborative filtering, and many others. Surprisingly, however, very little has been known about algorithms with provable performance guarantees within the framework of mixture models. This is the topic we study in this work. Our contribution is twofold. First, for the canonical problem of separating mixtures of continuous distributions in high-dimensional Euclidean space, we provide the first algorithm that can learn distributions with heavy tails, including those with infinite variance and expectation. We formulate necessary conditions and provide an algorithm which guarantees that the underlying mixture model can be learned by observing only polynomially many samples. We also show that for many classes of distributions, our separation conditions are necessary for any algorithm that guarantees accurate reconstruction. Second, for the case of discrete mixture models, we give an efficient polynomial-time algorithm with provable performance guarantees. Recasting our algorithm for the text classification problem immediately yields a very fast unsupervised learning method with excellent classification accuracy.
The K-Modes Algorithm for Clustering
, 1304
"... Many clustering algorithms exist that estimate a cluster centroid, such as Kmeans, Kmedoids or meanshift, but no algorithm seems to exist that clusters data by returning exactly K meaningful modes. We propose a natural definition of a Kmodes objective function by combining the notions of density ..."
Abstract
 Add to MetaCart
Many clustering algorithms exist that estimate a cluster centroid, such as K-means, K-medoids, or mean-shift, but no algorithm seems to exist that clusters data by returning exactly K meaningful modes. We propose a natural definition of a K-modes objective function by combining the notions of density and cluster assignment. The algorithm becomes K-means and K-medoids in the limits of very large and very small scales. Computationally, it is slightly slower than K-means but much faster than mean-shift or K-medoids. Unlike K-means, it is able to find centroids that are valid patterns, truly representative of a cluster, even with non-convex clusters, and appears robust to outliers and to misspecification of the scale and number of clusters. Given a data set x_1, ..., x_N ∈ R^D, we consider clustering algorithms based on centroids, i.e., that estimate a representative c_k ∈ R^D of each cluster k in addition to assigning data points to clusters. Two of the most widely used algorithms of this type are K-means and mean-shift. K-means takes the number of clusters K as a user parameter and tries to minimize the objective function

  min_{R,C} E(R, C) = ∑_{k=1}^{K} ∑_{n=1}^{N} r_{nk} ‖x_n − c_k‖²   s.t.  r_{nk} ∈ {0, 1}
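For reference, the standard K-means objective E(R, C) that this abstract introduces can be evaluated directly; this NumPy sketch (function name and array shapes are our own choices, not from the paper) makes the one-hot responsibility matrix R explicit.

```python
import numpy as np

def kmeans_objective(X, C, R):
    """E(R, C) = sum_k sum_n r_nk * ||x_n - c_k||^2.

    X: (N, D) data matrix; C: (K, D) centroids;
    R: (N, K) one-hot assignment matrix with r_nk in {0, 1}.
    """
    # Squared Euclidean distances between every point and every centroid.
    sq = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
    return float((R * sq).sum())
```

For instance, two 1-D points at 0 and 2 both assigned to a single centroid at 1 give E = 1 + 1 = 2.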
Signature Page................................... iii
2013
"... Copyright Weiran Wang, 2013 All rights reserved. The dissertation of Weiran Wang is approved, and it is acceptable in quality and form for publication on microfilm and electronically: ..."
Abstract
 Add to MetaCart
Copyright Weiran Wang, 2013. All rights reserved. The dissertation of Weiran Wang is approved, and it is acceptable in quality and form for publication on microfilm and electronically: