Results 1–6 of 6
Fast and Robust Earth Mover’s Distances
Abstract

Cited by 90 (6 self)
We present a new algorithm for a robust family of Earth Mover’s Distances (EMDs) with thresholded ground distances. The algorithm transforms the flow network of the EMD so that the number of edges is reduced by an order of magnitude. As a result, we compute the EMD an order of magnitude faster than the original algorithm, which makes it possible to compute the EMD on large histograms and databases. In addition, we show that EMDs with thresholded ground distances have many desirable properties. First, they correspond to the way humans perceive distances. Second, they are robust to outlier noise and quantization effects. Third, they are metrics. Finally, experimental results on image retrieval show that thresholding the ground distance of the EMD improves both accuracy and speed.
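The thresholded ground distance described in this abstract can be illustrated with a minimal linear-programming sketch: clip each ground distance at a threshold t before solving the transportation problem. This is only an illustration of the thresholding idea, not the paper’s accelerated flow-network transformation; the function and variable names below are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def thresholded_emd(p, q, D, t):
    """EMD between histograms p and q where the ground distance
    matrix D is clipped at threshold t (hypothetical LP sketch,
    not the paper's flow-network algorithm)."""
    n, m = len(p), len(q)
    C = np.minimum(D, t).ravel()  # thresholded ground distances
    # Equality constraints: flow out of bin i equals p[i],
    # flow into bin j equals q[j].
    A_eq = []
    for i in range(n):
        row = np.zeros((n, m)); row[i, :] = 1
        A_eq.append(row.ravel())
    for j in range(m):
        col = np.zeros((n, m)); col[:, j] = 1
        A_eq.append(col.ravel())
    res = linprog(C, A_eq=np.array(A_eq),
                  b_eq=np.concatenate([p, q]), bounds=(0, None))
    return res.fun

# Two 3-bin histograms on a line, ground distance |i - j|.
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
D = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
print(thresholded_emd(p, q, D, t=2.0))  # 1.0 (no clipping occurs)
print(thresholded_emd(p, q, D, t=1.0))  # 0.5 (long moves capped at 1)
```

Lowering t caps the cost of long-range moves, which is what makes the thresholded distance robust to outlier bins.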
Optimal algorithms for testing closeness of discrete distributions
, 2013
Abstract

Cited by 12 (1 self)
We study the question of closeness testing for two discrete distributions. More precisely, given samples from two distributions p and q over an n-element set, we wish to distinguish whether p = q versus p is at least ε-far from q, in either ℓ1 or ℓ2 distance. Batu et al. [BFR+00, BFR+13] gave the first sublinear-time algorithms for these problems, which matched the lower bounds of [Val11] up to a logarithmic factor in n and a polynomial factor in ε. In this work, we present simple testers for both the ℓ1 and ℓ2 settings, with sample complexity that is information-theoretically optimal up to constant factors, both in the dependence on n and in the dependence on ε; for the ℓ1 testing problem we establish that the sample complexity is Θ(max{n^{2/3}/ε^{4/3}, n^{1/2}/ε^2}).
Comparing Clusterings in Space
"... This paper proposes a new method for comparing clusterings both partitionally and geometrically. Our approach is motivated by the following observation: the vast majority of previous techniques for comparing clusterings are entirely partitional, i.e., they examine assignments of points in set theore ..."
Abstract

Cited by 10 (1 self)
This paper proposes a new method for comparing clusterings both partitionally and geometrically. Our approach is motivated by the following observation: the vast majority of previous techniques for comparing clusterings are entirely partitional, i.e., they examine assignments of points in set-theoretic terms after they have been partitioned. In doing so, these methods ignore the spatial layout of the data, disregarding the fact that this information is responsible for generating the clusterings to begin with. We demonstrate that this leads to a variety of failure modes. Previous comparison techniques often fail to differentiate between significant changes made in the data being clustered. We formulate a new measure for comparing clusterings that combines spatial and partitional information into a single measure using optimization theory. Doing so eliminates pathological conditions in previous approaches. It also removes common limitations, such as requiring that each clustering have the same number of clusters or that both be over identical datasets. This approach is stable, easily implemented, and has strong intuitive appeal. Our measure considers points’ spatial properties as well as their cluster membership assignments. We view a clustering as a partition of a set of points located in a space with an associated distance function. This view is natural, since popular clustering algorithms, e.g., k-means, spectral clustering, affinity propagation, etc., take as input not only a collection of points to be clustered but also a distance function on the space in which the points lie. This distance function may be specified implicitly and it may be transformed by a kernel, but it must be defined one way or another, and its properties are crucial to a clustering algorithm’s output. In contrast, almost all existing clustering comparison techniques ignore the distances between points, treating clusterings as partitions of disembodied atoms.
While this approach has merit under some circumstances, it seems surprising to ignore the distance function entirely.
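The idea of combining spatial and partitional information via optimization can be sketched with a small optimal-transport example: represent each cluster by its centroid and mass, and transport mass between the two clusterings. This is a hedged illustration of the general approach, not the paper’s exact measure, and all names below are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def spatial_clustering_distance(X, labels_a, labels_b):
    """Hypothetical sketch: compare two clusterings of the same
    points X by an optimal-transport cost between cluster centroids,
    with cluster masses as supplies/demands. Not the paper's exact
    formulation."""
    def summarize(labels):
        ids = np.unique(labels)
        cents = np.array([X[labels == k].mean(axis=0) for k in ids])
        masses = np.array([(labels == k).mean() for k in ids])
        return cents, masses
    ca, wa = summarize(labels_a)
    cb, wb = summarize(labels_b)
    # Pairwise centroid distances form the transport cost matrix.
    C = np.linalg.norm(ca[:, None, :] - cb[None, :, :], axis=2)
    n, m = len(wa), len(wb)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1   # supply constraints
    for j in range(m):
        A_eq[n + j, j::m] = 1            # demand constraints
    res = linprog(C.ravel(), A_eq=A_eq,
                  b_eq=np.concatenate([wa, wb]), bounds=(0, None))
    return res.fun

X = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.]])
a = np.array([0, 0, 1, 1])   # identical partitions up to relabeling
b = np.array([1, 1, 0, 0])   # ...so the distance is 0
print(spatial_clustering_distance(X, a, b))
```

Unlike a purely partitional index, this sketch also reports a large distance when a clustering groups spatially distant points together, because the cost matrix is built from the data’s geometry.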
Testing monotone continuous distributions on high-dimensional real cubes
 In Proceedings of the 21st ACM-SIAM Symposium on Discrete Algorithms
, 2010
Abstract

Cited by 7 (0 self)
We study the task of testing properties of probability distributions. We consider a scenario in which we have access to independent samples of an unknown distribution.
Incorporating Spatial Similarity into Ensemble Clustering
Abstract

Cited by 3 (0 self)
This paper addresses a fundamental problem in ensemble clustering – namely, how should one compare the similarity of two clusterings? The vast majority of prior techniques for comparing clusterings are entirely partitional, i.e., they examine assignments of points in set-theoretic terms after they have been partitioned. In doing so, these methods ignore the spatial layout of the data, disregarding the fact that this information is responsible for generating the clusterings to begin with. In this paper, we demonstrate the importance of incorporating spatial information into forming ensemble clusterings. We investigate the use of a recently proposed measure, called CDistance, which uses both spatial and partitional information to compare clusterings. We demonstrate that CDistance can be applied in a well-motivated way to four areas fundamental to existing ensemble techniques: the correspondence problem, subsampling, stability analysis, and diversity detection.
Ground Metric Learning
Abstract

Cited by 3 (1 self)
Optimal transport distances have been used for more than a decade in machine learning to compare histograms of features. They have one parameter: the ground metric, which can be any metric between the features themselves. As is the case for all parameterized distances, optimal transport distances can only prove useful in practice when this parameter is carefully chosen. To date, the only option available to practitioners for setting the ground metric was to rely on a priori knowledge of the features, which considerably limited the scope of application of optimal transport distances. We propose to lift this limitation and consider instead algorithms that can learn the ground metric using only a training set of labeled histograms. We call this approach ground metric learning. We formulate the problem of learning the ground metric as the minimization of the difference of two convex polyhedral functions over a convex set of metric matrices. We follow the presentation of our algorithms with promising experimental results which show that this approach is useful both for retrieval and for binary/multiclass classification tasks.
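To see why the choice of ground metric matters, the sketch below compares the same histogram pair under two different ground metric matrices. The learning procedure itself (difference-of-convex minimization over metric matrices) is not reproduced; this only demonstrates the sensitivity that motivates learning M, and all names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def ot_distance(p, q, M):
    """Optimal-transport distance between histograms p and q under
    ground metric matrix M (minimal LP sketch)."""
    n = len(p)
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1   # row sums = p
        A_eq[n + i, i::n] = 1            # column sums = q
    res = linprog(M.ravel(), A_eq=A_eq,
                  b_eq=np.concatenate([p, q]), bounds=(0, None))
    return res.fun

p = np.array([1., 0., 0.])
q = np.array([0., 0., 1.])
# Linear ground metric |i - j| vs. a 0/1 "uniform" ground metric.
M1 = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
M2 = (M1 > 0).astype(float)
print(ot_distance(p, q, M1), ot_distance(p, q, M2))  # 2.0 1.0
```

The same pair of histograms is twice as far apart under M1 as under M2, which is exactly the degree of freedom that ground metric learning exploits.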