Results 1–10 of 10
Creating efficient codebooks for visual recognition
 In Proceedings of the IEEE International Conference on Computer Vision
, 2005
Abstract

Cited by 276 (25 self)
Visual codebook based quantization of robust appearance descriptors extracted from local image patches is an effective means of capturing image statistics for texture analysis and scene classification. Codebooks are usually constructed by using a method such as k-means to cluster the descriptor vectors of patches sampled either densely (‘textons’) or sparsely (‘bags of features’ based on keypoints or salience measures) from a set of training images. This works well for texture analysis in homogeneous images, but the images that arise in natural object recognition tasks have far less uniform statistics. We show that for dense sampling, k-means over-adapts to this, clustering centres almost exclusively around the densest few regions in descriptor space and thus failing to code other informative regions. This gives suboptimal codes that are no better than using randomly selected centres. We describe a scalable acceptance-radius based clusterer that generates better codebooks and study its performance on several image classification tasks. We also show that dense representations outperform equivalent keypoint-based ones on these tasks and that SVM or Mutual Information based feature selection starting from a dense codebook further improves the performance.
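The failure mode described above comes from k-means placing centres in proportion to density. A minimal sketch of an acceptance-radius style alternative (a simple online leader-follower scheme; the paper's actual clusterer is more elaborate, and the radius value and 2-D toy data below are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def radius_clusterer(descriptors, r):
    """Toy acceptance-radius clustering: a descriptor spawns a new centre
    only when it lies farther than r from every existing centre. A dense
    region therefore yields few centres instead of attracting most of them."""
    centres = []
    for x in descriptors:
        if not centres or min(np.linalg.norm(x - c) for c in centres) > r:
            centres.append(x)
    return np.array(centres)

rng = np.random.default_rng(0)
# 1000 points in one dense blob plus 20 scattered sparse points
dense = rng.normal(0.0, 0.1, size=(1000, 2))
sparse = rng.uniform(5.0, 10.0, size=(20, 2))
codebook = radius_clusterer(np.vstack([dense, sparse]), r=1.0)
```

The dense blob contributes roughly one centre here, while the sparse region still gets covered, which is exactly the behaviour k-means lacks under dense sampling.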
A Framework for Statistical Clustering with Constant Time Approximation Algorithms for K-Median Clustering
 In COLT
, 2004
Abstract

Cited by 31 (5 self)
We consider a framework in which the clustering algorithm gets as input a sample generated i.i.d. by some unknown arbitrary distribution, and has to output a clustering of the full domain set that is evaluated with respect to the underlying distribution. We provide general conditions on clustering problems that imply the existence of sampling-based clusterings that approximate the optimal clustering. We show that K-median clustering, as well as the Vector Quantization problem, satisfies these conditions. In particular, our results apply to the sampling-based approximate clustering scenario. As a corollary, we get a sampling-based algorithm for the K-median clustering problem that finds an almost optimal set of centers in time depending only on the confidence and accuracy parameters of the approximation, but independent of the input size. Furthermore, in the Euclidean input case, the running time of our algorithm is independent of the Euclidean dimension.
Sublinear-time algorithms
 In Oded Goldreich, editor, Property Testing, volume 6390 of Lecture Notes in Computer Science
, 2010
Abstract

Cited by 20 (2 self)
In this paper we survey recent (up to end of 2009) advances in the area of sublinear-time algorithms.
Sublinear-time approximation for clustering via random sampling
 In Proceedings of the 31st International Colloquium on Automata, Languages and Programming (ICALP’04)
, 2004
Abstract

Cited by 8 (1 self)
In this paper we present a novel analysis of a random sampling approach for three clustering problems in metric spaces: k-median, min-sum k-clustering, and balanced k-median. For all these problems we consider the following simple sampling scheme: select a small sample set of points uniformly at random from V, and then run some approximation algorithm on this sample set to compute an approximation of the best possible clustering of this set. Our main technical contribution is a significantly strengthened analysis of the approximation guarantee of this scheme for the clustering problems. The main motivation behind our analyses was to design sublinear-time algorithms for clustering problems. Our second contribution is the development of new approximation algorithms for the aforementioned clustering problems. Using our random sampling approach, we obtain for the first time approximation algorithms whose running time is independent of the input size and depends only on k and the diameter of the metric space.
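The sampling scheme analysed above can be sketched in a few lines (a naive illustration: the exhaustive k-median solver below is only feasible for tiny samples and would be replaced by a constant-factor approximation algorithm in any real instantiation; the two-blob toy data and the parameter values are assumptions for the demo):

```python
import itertools
import numpy as np

def sample_and_cluster(points, k, sample_size, rng):
    """Generic scheme: (1) draw a uniform sample, (2) solve k-median on the
    sample (here exhaustively over all k-subsets of the sample, i.e. medoids),
    (3) use the resulting centres for the whole input."""
    sample = points[rng.choice(len(points), size=sample_size, replace=False)]
    best_cost, best_centres = np.inf, None
    for cand in itertools.combinations(sample, k):
        c = np.array(cand)
        # k-median cost on the sample: sum of distances to nearest centre
        cost = sum(np.min(np.linalg.norm(c - p, axis=1)) for p in sample)
        if cost < best_cost:
            best_cost, best_centres = cost, c
    return best_centres

rng = np.random.default_rng(1)
cloud = np.vstack([rng.normal(0, 0.2, (500, 2)),
                   rng.normal((4, 4), 0.2, (500, 2))])
centres = sample_and_cluster(cloud, k=2, sample_size=12, rng=rng)
```

Note the running time depends only on the sample size and k, not on the input size, which is the point of the scheme.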
P.: A distributed genetic algorithm for graph-based clustering
 In Man-Machine Interactions
, 2011
Abstract

Cited by 2 (1 self)
Clustering is one of the most prominent data analysis techniques for structuring large datasets and producing a human-understandable overview. In this paper, we focus on the case where the data has many categorical attributes and thus cannot be represented faithfully in Euclidean space. We follow the graph-based paradigm and propose a graph-based genetic algorithm for clustering, whose flexibility can mainly be attributed to the possibility of using various kernels. As our approach can naturally be parallelized, while implementing and testing it we distribute the computations over several CPUs. Although the problem is NP-hard, our experiments show that for well-clusterable data our algorithm scales well. We also perform experiments on real medical data.
Tradeoffs for Space, Time, Data and Risk in Unsupervised Learning
Abstract

Cited by 1 (1 self)
Faced with massive data, is it possible to trade off (statistical) risk and (computational) space and time? This challenge lies at the heart of large-scale machine learning. Using k-means clustering as a prototypical unsupervised learning problem, we show how we can strategically summarize the data (control space) in order to trade off risk and time when data is generated by a probabilistic model. Our summarization is based on coreset constructions from computational geometry. We also develop an algorithm, TRAM, to navigate the space/time/data/risk tradeoff in practice. In particular, we show that for a fixed risk (or data size), as the data size increases (resp. risk increases), the running time of TRAM decreases. Our extensive experiments on real data sets demonstrate the existence and practical utility of such tradeoffs, not only for k-means but also for Gaussian Mixture Models.
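The summarize-then-cluster strategy can be sketched as follows (a deliberately naive stand-in: real coreset constructions use importance sampling with carefully chosen per-point weights, whereas this sketch uses a uniform sample with equal weights n/m; TRAM itself is not reproduced here, and all names and parameters below are illustrative assumptions):

```python
import numpy as np

def weighted_lloyd(points, weights, k, iters, rng):
    """Lloyd-style k-means on a weighted summary: each summary point
    stands in for `weight` original points when centroids are updated."""
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each summary point to its nearest centre
        labels = np.argmin(
            np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2),
            axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                centres[j] = np.average(points[mask], axis=0,
                                        weights=weights[mask])
    return centres

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(-3, 0.3, (2000, 2)),
                  rng.normal(3, 0.3, (2000, 2))])
m = 100                                 # summary size: controls space and time
summary = data[rng.choice(len(data), size=m, replace=False)]
weights = np.full(m, len(data) / m)     # each sampled point represents n/m points
centres = weighted_lloyd(summary, weights, k=2, iters=15, rng=rng)
```

Shrinking m cuts space and time at the cost of a noisier summary, i.e. higher risk, which is the tradeoff the abstract describes.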
The K-Modes Algorithm for Clustering
, 1304
Abstract
Many clustering algorithms exist that estimate a cluster centroid, such as K-means, K-medoids, or mean-shift, but no algorithm seems to exist that clusters data by returning exactly K meaningful modes. We propose a natural definition of a K-modes objective function by combining the notions of density and cluster assignment. The algorithm becomes K-means and K-medoids in the limit of very large and very small scales. Computationally, it is slightly slower than K-means but much faster than mean-shift or K-medoids. Unlike K-means, it is able to find centroids that are valid patterns, truly representative of a cluster, even with non-convex clusters, and appears robust to outliers and misspecification of the scale and number of clusters. Given a dataset x_1, ..., x_N ∈ R^D, we consider clustering algorithms based on centroids, i.e., that estimate a representative c_k ∈ R^D of each cluster k in addition to assigning data points to clusters. Two of the most widely used algorithms of this type are K-means and mean-shift. K-means has the number of clusters K as a user parameter and tries to minimize the objective function

min_{R,C} E(R, C) = ∑_{k=1}^{K} ∑_{n=1}^{N} r_{nk} ‖x_n − c_k‖²   s.t. r_{nk} ∈ {0, 1}, ∑_{k=1}^{K} r_{nk} = 1.
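The K-means objective E(R, C) referred to above can be evaluated directly; the sketch below (toy data, not from the paper) charges each point to its nearest centroid, which is the optimal hard assignment R for fixed centroids C:

```python
import numpy as np

def kmeans_objective(X, C):
    """E(R, C) = sum_k sum_n r_nk * ||x_n - c_k||^2, with the binary
    assignments r_nk chosen optimally: each point goes to its nearest centroid."""
    d2 = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) ** 2
    R = np.zeros_like(d2)
    R[np.arange(len(X)), d2.argmin(axis=1)] = 1.0  # r_nk in {0,1}, sum_k r_nk = 1
    return float((R * d2).sum())

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
C = np.array([[0.0, 1.0], [10.0, 0.0]])
# first two points cost squared distance 1 each; the third sits on a centroid
print(kmeans_objective(X, C))  # → 2.0
```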
Growing the Charging Station Network for Electric Vehicles with Trajectory Data Analytics
Abstract
Abstract—Electric vehicles (EVs) have undergone an explosive increase over recent years, due to their unparalleled advantages over gasoline cars in green transportation and cost efficiency. Such a drastic increase drives a growing need for widely deployed, publicly accessible charging stations. Thus, how to strategically deploy charging stations and charging points becomes an emerging and challenging question for urban planners and electric utility companies. In this paper, by analyzing large-scale electric taxi trajectory data, we make the first attempt to investigate this problem. We develop an optimal charging station deployment (OCSD) framework that takes historical EV taxi trajectory data, road map data, and existing charging station information as input, and performs optimal charging station placement (OCSP) and optimal charging point assignment (OCPA). The OCSP and OCPA optimization components are designed to minimize the average time to the nearest charging station and the average waiting time for an available charging point, respectively. To evaluate the performance of our OCSD framework, we conduct experiments on one month of real EV taxi trajectory data. The evaluation results demonstrate that our OCSD framework can achieve a 26%–94% reduction in average time to find a charging station, and up to two orders of magnitude reduction in waiting time before charging, over baseline methods. Moreover, our results reveal interesting insights in answering the question “Super or small stations?”: when the number of deployable charging points is sufficiently large, more small stations are preferred; and when there are relatively few charging points to deploy, super stations are a wiser choice.
ALGORITHMS FOR MIXTURE MODELS
, 2006
Abstract
Mixture models form one of the most fundamental classes of generative models for clustered data. Specific application examples include text classification, image segmentation and motion detection, collaborative filtering, and many others. However, quite surprisingly, very little had been known about algorithms with provable performance guarantees within the framework of mixture models. This is the topic we study in this work. Our contribution is twofold. First, for the canonical problem of separating mixtures of continuous distributions in high-dimensional Euclidean space, we provide the first algorithm that can learn distributions with heavy tails, including those with infinite variance and expectation. We formulate necessary conditions and provide an algorithm which guarantees that the underlying mixture model can be learned by observing only polynomially many samples. We also show that for many classes of distributions, our separation conditions are necessary for any algorithm which guarantees accurate reconstruction. Second, for the case of discrete mixture models, we give an efficient polynomial-time algorithm with provable performance guarantees. Recasting our algorithm for the text classification problem immediately yields a very fast unsupervised learning method with excellent classification accuracy.