Results 1–10 of 479
Similarity Measures for Text Document Clustering
, 2008
Abstract

Cited by 62 (0 self)
Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. Partitional clustering algorithms have been recognized to be more suitable than hierarchical clustering schemes for processing large datasets. A wide variety of distance functions and similarity measures have been used for clustering, such as squared Euclidean distance, cosine similarity, and relative entropy. In this paper, we compare and analyze the effectiveness of these measures in partitional clustering for text document datasets. Our experiments utilize the standard k-means algorithm, and we report results on seven text document datasets and five distance/similarity measures that have been most commonly used in text clustering.
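As a rough illustration (our own sketch, not code from the paper), the three measure families named in the abstract can be computed on term-frequency vectors as follows:

```python
import math

def euclidean(a, b):
    # Euclidean distance; its square is the distortion k-means minimizes
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # angle-based similarity; insensitive to document length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def relative_entropy(p, q):
    # KL divergence; assumes p and q are probability vectors with q[i] > 0
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)
```

Note how cosine similarity treats a document and a doubled copy of it as identical, which is one reason it behaves differently from Euclidean distance in text clustering.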
Fast Approximate Spectral Clustering
, 2009
Abstract

Cited by 58 (1 self)
Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but which has limited applicability to large-scale problems due to its computational complexity of O(n³), with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the misclustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectrally cluster data sets with a million observations within several minutes.
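The first stage of this framework, the distortion-minimizing local transformation, can be sketched with plain Lloyd iterations. This is our own simplified 1-D illustration of the KASP idea (names are ours, not the paper's): compress n points into m representatives, run the expensive spectral step only on the representatives, and let each original point inherit its representative's label.

```python
import random

def compress(points, m, iters=20, rng=random.Random(0)):
    """Compress n 1-D points to m representatives via local k-means (sketch)."""
    reps = rng.sample(points, m)
    for _ in range(iters):
        groups = [[] for _ in reps]
        for p in points:
            j = min(range(len(reps)), key=lambda i: (p - reps[i]) ** 2)
            groups[j].append(p)
        # move each representative to the mean of its group
        reps = [sum(g) / len(g) if g else reps[j] for j, g in enumerate(groups)]
    # a spectral method would now cluster the m reps; each original point
    # would inherit the cluster label of its nearest representative
    return reps
```

The point of the analysis in the paper is that if this compression has low distortion, the misclustering rate of the downstream spectral step degrades only slightly.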
The Planar k-means Problem is NP-hard
, 2009
Abstract

Cited by 45 (0 self)
In the k-means problem, we are given a finite set S of points in ℝ^m and an integer k ≥ 1, and we want to find k points (centers) so as to minimize the sum of the squared Euclidean distances from each point in S to its nearest center. We show that this well-known problem is NP-hard even for instances in the plane, answering an open question posed by Dasgupta [7].
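The objective itself fits in a few lines; this is a generic illustration of the cost being minimized for planar (2-D) inputs, not code from the paper:

```python
def kmeans_cost(points, centers):
    # sum over points of the squared Euclidean distance to the nearest center
    return sum(
        min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
        for px, py in points
    )
```

Evaluating this cost for fixed centers is trivial; it is choosing the k centers that minimize it which the paper shows to be NP-hard even in the plane.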
Beyond the Euclidean distance: Creating effective visual codebooks using the histogram intersection kernel
Abstract

Cited by 42 (3 self)
Common visual codebook generation methods used in a bag-of-visual-words model, e.g. k-means or a Gaussian Mixture Model, use the Euclidean distance to cluster features into visual code words. However, most popular visual descriptors are histograms of image measurements. It has been shown that the Histogram Intersection Kernel (HIK) is more effective than the Euclidean distance in supervised learning tasks with histogram features. In this paper, we demonstrate that HIK can also be used in an unsupervised manner to significantly improve the generation of visual codebooks. We propose a histogram kernel k-means algorithm which is easy to implement and runs almost as fast as k-means. The HIK codebook has consistently higher recognition accuracy than k-means codebooks by 2–4%. In addition, we propose a one-class SVM formulation to create more effective visual code words, which can achieve even higher accuracy. The proposed method has established new state-of-the-art performance numbers for 3 popular benchmark datasets on object and scene recognition. In addition, we show that the standard k-median clustering method can be used for visual codebook generation and can act as a compromise between the HIK and k-means approaches.
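The kernel at the heart of this work is simple to state; a minimal sketch (ours, not the authors' implementation) of HIK between two histograms:

```python
def histogram_intersection(h1, h2):
    # HIK similarity: sum of bin-wise minima; larger means more similar
    return sum(min(a, b) for a, b in zip(h1, h2))
```

Unlike Euclidean distance, this similarity grows only with mass that two histograms actually share in each bin, which is why it suits histogram-valued descriptors.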
Approximation algorithms for clustering uncertain data
 in PODS Conference
, 2008
Abstract

Cited by 39 (1 self)
There is an increasing quantity of data with uncertainty arising from applications such as sensor network measurements, record linkage, and the output of mining algorithms. This uncertainty is typically formalized as probability density functions over tuple values. Beyond storing and processing such data in a DBMS, it is necessary to perform other data analysis tasks such as data mining. We study the core mining problem of clustering on uncertain data, and define appropriate natural generalizations of standard clustering optimization criteria. Two variations arise, depending on whether a point is automatically associated with its optimal center, or whether it must be assigned to a fixed cluster no matter where it is actually located. For uncertain versions of k-means and k-median, we show reductions to their corresponding weighted versions on data with no uncertainties. These are simple in the unassigned case, but require some care for the assigned version. Our most interesting results are for uncertain k-center, which generalizes both traditional k-center and k-median objectives. We show a variety of bicriteria approximation algorithms. One picks O(kε⁻¹ log² n) centers and achieves a (1 + ε) approximation to the best uncertain k-centers. Another picks 2k centers and achieves a constant-factor approximation. Collectively, these results are the first known guaranteed approximation algorithms for the problems of clustering uncertain data.
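For the unassigned uncertain k-means case, a reduction of this kind rests on a standard identity: the expected squared distance from an uncertain point to a center splits into the squared distance from the point's mean plus the point's variance, so each pdf can be replaced by a certain point at its mean. A 1-D sketch with our own naming (not the paper's code):

```python
def expected_sq_cost(pdf, center):
    # pdf: list of (value, probability) pairs describing one uncertain point
    return sum(p * (x - center) ** 2 for x, p in pdf)

def reduce_to_certain(pdf):
    # E[(X - c)^2] = (E[X] - c)^2 + Var(X): the mean acts as a certain
    # point, and the variance is a constant offset independent of c
    mean = sum(x * p for x, p in pdf)
    var = sum(p * (x - mean) ** 2 for x, p in pdf)
    return mean, var
```

Since the variance term does not depend on the center, minimizing the expected cost over centers is the same as running ordinary (weighted) k-means on the means.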
Sided and symmetrized Bregman centroids
 IEEE Transactions on Information Theory
, 2009
Abstract

Cited by 37 (13 self)
We generalize the notions of centroids (and barycenters) to the broad class of information-theoretic distortion measures called Bregman divergences. Bregman divergences form a rich and versatile family of distances that unifies quadratic Euclidean distances with various well-known statistical entropic measures. Since, apart from the squared Euclidean distance, Bregman divergences are asymmetric, we consider the left-sided and right-sided centroids and the symmetrized centroids as minimizers of average Bregman distortions. We prove that all three centroids are unique and give closed-form solutions for the sided centroids, which are generalized means. Furthermore, we design a provably fast and efficient arbitrarily close approximation algorithm for the symmetrized centroid based on its exact geometric characterization. The geometric approximation algorithm requires only walking on a geodesic linking the two left/right-sided centroids. We report on our implementation for computing entropic centers of image histogram clusters and entropic centers of multivariate normal distributions, which are useful operations for multimedia processing and retrieval. These experiments illustrate that our generic methods compare favorably with former limited ad hoc methods. Index Terms—Centroid, Kullback-Leibler divergence, Bregman divergence, Bregman power divergence, Burbea-Rao divergence,
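For the Kullback-Leibler divergence on discrete distributions (a Bregman divergence), the two sided centroids indeed have closed forms as generalized means: one side is the arithmetic mean, the other a normalized geometric mean. A sketch of our own (we deliberately avoid the left/right naming, which follows a convention fixed in the paper):

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence between discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def arithmetic_centroid(dists):
    # minimizes the average of KL(p_i || c) over the center c
    n = len(dists)
    return [sum(d[j] for d in dists) / n for j in range(len(dists[0]))]

def geometric_centroid(dists):
    # normalized geometric mean: minimizes the average of KL(c || p_i)
    n = len(dists)
    g = [math.exp(sum(math.log(d[j]) for d in dists) / n)
         for j in range(len(dists[0]))]
    z = sum(g)
    return [x / z for x in g]
```

The symmetrized centroid has no such closed form, which is what motivates the paper's geodesic-walk approximation between the two sided centroids.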
deSEO: Combating Search-Result Poisoning
 In Proceedings of the 20th USENIX Security Symposium
, 2011
Abstract

Cited by 36 (3 self)
We perform an in-depth study of SEO attacks that spread malware by poisoning search results for popular queries. Such attacks, although recent, appear to be both widespread and effective. They compromise legitimate Web sites and generate a large number of fake pages targeting trendy keywords. We first dissect one example attack that affects over 5,000 Web domains and attracts over 81,000 user visits. Further, we develop deSEO, a system that automatically detects these attacks. Using large datasets with hundreds of billions of URLs, deSEO successfully identifies multiple malicious SEO campaigns. In particular, applying the URL signatures derived from deSEO, we find that 36% of sampled searches to Google and Bing contained at least one malicious link in the top results at the time of our experiment.
SUN: Top-down saliency using natural statistics
Abstract

Cited by 34 (2 self)
When people try to find particular objects in natural scenes, they make extensive use of knowledge about how and where objects tend to appear in a scene. Although many forms of such “top-down” knowledge have been incorporated into saliency map models of visual search, surprisingly, the role of object appearance has been infrequently investigated. Here we present an appearance-based saliency model derived in a Bayesian framework. We compare our approach with both bottom-up saliency algorithms and the state-of-the-art Contextual Guidance model of Torralba et al. (2006) at predicting human fixations. Although the two top-down approaches use very different types of information, they achieve similar performance, each substantially better than the purely bottom-up models. Our experiments reveal that a simple model of object appearance can predict human fixations quite well, even making the same mistakes as people.
On Centroidal Voronoi Tessellation: Energy Smoothness and Fast Computation
, 2008
Abstract

Cited by 33 (15 self)
Centroidal Voronoi tessellation (CVT) is a fundamental geometric structure that finds many applications in
Streaming k-means approximation
Abstract

Cited by 32 (3 self)
We provide a clustering algorithm that approximately optimizes the k-means objective in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very lightweight in terms of memory and computation, making it applicable to unsupervised learning on massive data sets or resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means (based on the recent k-means++), in which the algorithm is allowed to output more than k centers, and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method.
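The seeding behind the first ingredient, D²-weighted sampling in the style of k-means++, is easy to sketch. This 1-D illustration is ours, not the authors' code; in the paper's batch algorithm the loop would be allowed to emit more than k centers rather than stopping at exactly k:

```python
import random

def d2_seed(points, k, rng=random.Random(0)):
    # pick each new center with probability proportional to its squared
    # distance to the nearest center chosen so far (D^2 sampling)
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc > r:
                centers.append(p)
                break
    return centers
```

Far-away points are exponentially more likely to be picked, which spreads the seeds out and is what gives k-means++-style seeding its approximation guarantee in the batch setting.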