Results 1-10 of 22
Fast and Robust Earth Mover’s Distances
Cited by 90 (6 self)
We present a new algorithm for a robust family of Earth Mover’s Distances (EMDs) with thresholded ground distances. The algorithm transforms the flow network of the EMD so that the number of edges is reduced by an order of magnitude. As a result, we compute the EMD an order of magnitude faster than the original algorithm, which makes it possible to compute the EMD on large histograms and databases. In addition, we show that EMDs with thresholded ground distances have many desirable properties. First, they correspond to the way humans perceive distances. Second, they are robust to outlier noise and quantization effects. Third, they are metrics. Finally, experimental results on image retrieval show that thresholding the ground distance of the EMD improves both accuracy and speed.
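To make the robustness claim concrete, here is a minimal sketch of an EMD with a thresholded ground distance on tiny one-dimensional histograms. It is not the paper's flow-network algorithm: it assumes integer bin masses and equal total mass, uses min(|i - j|, threshold) as the ground distance, and finds the unnormalized EMD by brute force over all matchings of unit masses. The names `ground_distance` and `emd_bruteforce` are illustrative only.

```python
from itertools import permutations


def ground_distance(i, j, threshold):
    """Thresholded ground distance between bins i and j: min(|i - j|, threshold)."""
    return min(abs(i - j), threshold)


def emd_bruteforce(p, q, threshold):
    """Unnormalized EMD between two small integer histograms of equal mass.

    Each unit of mass becomes one point, so the EMD is the cost of the
    cheapest one-to-one matching of source points to target points.
    Only practical for a handful of units (factorial running time).
    """
    if sum(p) != sum(q):
        raise ValueError("histograms must have equal total mass")
    src = [i for i, mass in enumerate(p) for _ in range(mass)]
    dst = [j for j, mass in enumerate(q) for _ in range(mass)]
    return min(
        sum(ground_distance(a, b, threshold) for a, b in zip(src, perm))
        for perm in permutations(dst)
    )


clean = [2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
noisy = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # one unit of mass displaced to the far bin

plain_cost = emd_bruteforce(clean, noisy, threshold=100)  # threshold too large to matter
capped_cost = emd_bruteforce(clean, noisy, threshold=2)   # outlier cost is capped at 2
```

With the large threshold the single displaced unit costs its full travel distance; with the small threshold its contribution is capped, which is the sense in which thresholding limits the influence of outliers.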
The computational hardness of estimating edit distance
 In Proceedings of the Symposium on Foundations of Computer Science
, 2007
Cited by 24 (8 self)
We prove the first nontrivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings. To the best of our knowledge, this is the first computational setting in which the complexity of computing the edit distance is provably larger than that of Hamming distance. Our lower bound exhibits a tradeoff between approximation and communication, asserting, for example, that protocols with O(1) bits of communication can only obtain approximation α ≥ Ω(log d / log log d), where d is the length of the input strings. This case of O(1) communication is of particular importance since it captures constant-size sketches as well as embeddings into spaces like L1 and squared-L2, two prevailing algorithmic approaches for dealing with edit distance. Furthermore, the bound holds not only for strings over alphabet Σ = {0, 1}, but also for strings that are permutations (aka the Ulam metric). Besides being applicable to a much richer class of algorithms than all previous results, our bounds are near-tight in at least one case, namely that of embedding permutations into L1. The proof uses a new technique that relies on Fourier analysis in a rather elementary way.
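For readers unfamiliar with the two distances being compared, the following sketch computes both exactly. It illustrates the definitions only, not the communication lower bound; the function names are illustrative, and the edit distance uses the standard quadratic-time dynamic program.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x, y):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, a in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


x = "0101010101"
y = "1010101010"  # x shifted by one position

hamming_xy = hamming_distance(x, y)
edit_xy = edit_distance(x, y)
```

On this pair every position differs, yet one deletion and one insertion turn x into y. Edit distance accounts for alignment while Hamming distance compares coordinates independently.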
Overcoming the ℓ1 non-embeddability barrier: Algorithms for product metrics
, 2008
Cited by 18 (8 self)
A common approach for solving computational problems over a difficult metric space is to embed the “hard” metric into L1, which admits efficient algorithms and is thus considered an “easy” metric. This approach has proved successful or partially successful for important spaces such as the edit distance, but it also has inherent limitations: it is provably impossible to go below a certain approximation for some metrics. We propose a new approach, of embedding the difficult space into richer host spaces, namely iterated products of standard spaces like ℓ1 and ℓ∞. We show that this class is rich since it contains useful metric spaces with only a constant distortion, and, at the same time, it is tractable and admits efficient algorithms. Using this approach, we obtain for example the first nearest neighbor data structure with O(log log d) approximation for edit distance in non-repetitive strings (the Ulam metric). This approximation is exponentially better than the lower bound for embedding into L1. Furthermore, we give constant factor approximation for two other computational problems. Along the way, we answer positively a question posed in [Ajtai, Jayram, Kumar, and Sivakumar, STOC 2002]. One of our algorithms has already found applications for smoothed edit distance over 0-1 strings [Andoni and Krauthgamer, ICALP 2008].
Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality
, 2012
Cited by 16 (5 self)
We present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces. For data sets of size n living in R^d, the algorithms require space that is only polynomial in n and d, while achieving query times that are sublinear in n and polynomial in d. We also show applications to other high-dimensional geometric problems, such as the approximate minimum spanning tree. The article is based on the material from the authors’ STOC’98 and FOCS’01 papers. It unifies, generalizes and simplifies the results from those papers.
Efficient and Effective Similarity Search over Probabilistic Data based on Earth Mover’s Distance
, 2010
Cited by 12 (3 self)
Probabilistic data is arriving as a new deluge along with technical advances in geographical tracking, multimedia processing, sensor networks and RFID. While similarity search is an important functionality supporting the manipulation of probabilistic data, it raises new challenges for traditional relational databases. The problem stems from the limited effectiveness of the distance metrics supported by existing database systems. On the other hand, some complicated distance operators have proven their value through better distinguishing ability in the probabilistic domain. In this paper, we discuss the similarity search problem with the Earth Mover’s Distance, which is the most successful distance metric on probabilistic histograms and an expensive operator with cubic complexity. We present a new database approach to answer range queries and k-nearest neighbor queries on probabilistic data, on the basis of the Earth Mover’s Distance. Our solution utilizes the primal-dual theory in linear programming and deploys B+ tree index structures for effective candidate pruning. Extensive experiments show that our proposal dramatically improves the scalability of probabilistic databases.
Comparing distributions and shapes using the kernel distance
 In ACM SoCG
, 2011
Approximating edit distance in near-linear time
, 2009
Cited by 8 (3 self)
We show how to compute the edit distance between two strings of length n up to a factor of 2^{Õ(√(log n))} in n^{1+o(1)} time. This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art n^{1/3+o(1)} approximation. Previously, approximation of 2^{Õ(√(log n))} was known only for embedding edit distance into ℓ1, and it is not known if that embedding can be computed in less than quadratic time.
Sublinear Time Algorithms for Earth Mover's Distance
, 2009
Cited by 6 (0 self)
We study the problem of estimating the Earth Mover's Distance (EMD) between probability distributions when given access only to samples. We give closeness testers and additive-error estimators over domains in [0, ∆]^d, with sample complexities independent of domain size, permitting the testability even of continuous distributions over infinite domains. Instead, our algorithms depend on other parameters, such as the diameter of the domain space, which may be significantly smaller. We also prove lower bounds showing the dependencies on these parameters to be essentially optimal. Additionally, we consider whether natural classes of distributions exist for which there are algorithms with better dependence on the dimension, and show that for highly clusterable data, this is indeed the case. Lastly, we consider a variant of the EMD, defined over tree metrics instead of the usual ℓ1 metric, and give optimal algorithms.
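As a rough illustration of working from samples alone, the sketch below computes the EMD between two equal-size sample sets on the line (the case d = 1). It is a naive plug-in estimate, not the paper's testers or estimators. It relies on the standard fact that, in one dimension with ground distance |a - b|, matching the sorted samples in order is an optimal transport plan. The name `emd_1d_from_samples` is illustrative.

```python
def emd_1d_from_samples(xs, ys):
    """EMD between the empirical distributions of two equal-size 1-D sample sets.

    Each sample carries mass 1/len(xs). Sorting both lists and pairing them
    in order gives an optimal matching for the ground distance |a - b|.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("need two non-empty sample sets of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


shifted = emd_1d_from_samples([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])  # every point moves by 1
same_set = emd_1d_from_samples([0.25, 0.75], [0.75, 0.25])       # same points, reordered
```

The estimate depends only on the sample values, not on how finely the domain is discretized, which loosely mirrors the abstract's point that sample complexity can be independent of domain size.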
Homomorphic Fingerprints under Misalignments: Sketching Edit and Shift Distances
, 2013
Cited by 3 (1 self)
Fingerprinting is a widely used technique for efficiently verifying that two files are identical. More generally, linear sketching is a form of lossy compression (based on random projections) that also enables the “dissimilarity” of non-identical files to be estimated. Many sketches have been proposed for dissimilarity measures that decompose coordinate-wise, such as the Hamming distance between alphanumeric strings or the Euclidean distance between vectors. However, virtually nothing is known about sketches that would accommodate alignment errors. With such errors, Hamming or Euclidean distances are rendered useless: a small misalignment may result in a file that looks very dissimilar to the original file according to such measures. In this paper, we present the first linear sketch that is robust to a small number of alignment errors. Specifically, the sketch can be used to determine whether two files are within a small Hamming distance of being a cyclic shift of each other. Furthermore, the sketch is homomorphic with respect to rotations: it is possible to construct the sketch of a cyclic shift of a file given only the sketch of the original file. The relevant dissimilarity measure, known as the shift distance, arises in the context of embedding edit distance, and our result addresses an open problem [26, Question 13] with a rather surprising outcome. Our sketch projects a length n file into D(n) · polylog n dimensions where D(n) ≪ n is the number of divisors of n. The striking fact is that this is near-optimal, i.e., the D(n) dependence is inherent to a problem that is ostensibly about lossy compression. In contrast, we then show that any sketch for estimating the edit distance between two files, even when small, requires sketches whose size is nearly linear in n. This lower bound addresses a longstanding open problem on the low distor