Results 1–10 of 24
Improved lower bounds for embeddings into L1
SIAM J. Comput., 2009
Abstract

Cited by 39 (5 self)
We improve upon recent lower bounds on the minimum distortion of embedding certain finite metric spaces into L1. In particular, we show that for every n ≥ 1, there is an n-point metric space of negative type that requires a distortion of Ω(log log n) for such an embedding, implying the same lower bound on the integrality gap of a well-known semidefinite programming relaxation for sparsest cut. This result builds upon and improves the recent lower bound of (log log n)^{1/6−o(1)} due to Khot and Vishnoi [The unique games conjecture, integrality gap for cut problems and the embeddability of negative type metrics into ℓ1, in Proceedings of the 46th Annual IEEE Symposium
Earth Mover Distance over High-Dimensional Spaces
, 2007
Abstract

Cited by 22 (8 self)
The Earth Mover Distance (EMD) between two equal-size sets of points in R^d is defined to be the minimum cost of a bipartite matching between the two point sets. It is a natural metric for comparing sets of features, and as such, it has received significant interest in computer vision. Motivated by recent developments in that area, we address computational problems involving EMD over high-dimensional point sets. A natural approach is to embed the EMD metric into ℓ1, and use the algorithms designed for the latter space. However, Khot and Naor [KN06] show that any embedding of EMD over the d-dimensional Hamming cube into ℓ1 must incur a distortion of Ω(d), thus practically losing all distance information. We circumvent this roadblock by focusing on sets with cardinalities upper-bounded by a parameter s, and achieve a distortion of only O(log s · log d). Since in applications the feature sets have bounded size, the resulting distortion is much smaller than the Ω(d) lower bound. Our approach is quite general and easily extends to EMD over R^d. We then provide a strong lower bound on the multi-round communication complexity of estimating EMD, which in particular strengthens the known non-embeddability result of [KN06]. Our bound exhibits a smooth tradeoff between approximation and communication, and for example implies that every algorithm that estimates EMD using constant-size sketches can only achieve Ω(log s) approximation.
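To pin down the definition in this abstract, here is a brute-force EMD computation that tries every bipartite matching. It is exponential in the set size and intended only as an illustration of the metric itself, not of the paper's embeddings (real solvers use min-cost flow).

```python
from itertools import permutations

def emd(a, b):
    """Earth Mover Distance between two equal-size point sets in R^d,
    with ℓ1 ground distance, computed by brute force over all
    bipartite matchings (exponential; illustration only)."""
    assert len(a) == len(b)

    def l1(p, q):
        return sum(abs(pi - qi) for pi, qi in zip(p, q))

    # Each permutation of b's indices is one bipartite matching.
    return min(sum(l1(a[i], b[j]) for i, j in enumerate(perm))
               for perm in permutations(range(len(b))))
```

For example, `emd([(0, 0), (1, 1)], [(1, 1), (0, 0)])` is 0, since the optimal matching pairs identical points.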
Overcoming the ℓ1 non-embeddability barrier: Algorithms for product metrics
, 2008
Abstract

Cited by 18 (8 self)
A common approach for solving computational problems over a difficult metric space is to embed the “hard” metric into L1, which admits efficient algorithms and is thus considered an “easy” metric. This approach has proved successful or partially successful for important spaces such as the edit distance, but it also has inherent limitations: it is provably impossible to go below a certain approximation factor for some metrics. We propose a new approach, of embedding the difficult space into richer host spaces, namely iterated products of standard spaces like ℓ1 and ℓ∞. We show that this class is rich, since it contains useful metric spaces with only a constant distortion, and, at the same time, it is tractable and admits efficient algorithms. Using this approach, we obtain for example the first nearest neighbor data structure with O(log log d) approximation for edit distance on non-repetitive strings (the Ulam metric). This approximation is exponentially better than the lower bound for embedding into L1. Furthermore, we give constant-factor approximations for two other computational problems. Along the way, we answer positively a question posed in [Ajtai, Jayram, Kumar, and Sivakumar, STOC 2002]. One of our algorithms has already found applications for smoothed edit distance over 0-1 strings [Andoni and Krauthgamer, ICALP 2008].
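The Ulam metric mentioned in this abstract has a classical exact characterization: the number of single-character moves needed to turn one permutation into another equals n minus the length of their longest common subsequence, which for permutations reduces to a longest-increasing-subsequence computation (some papers instead count insertions and deletions separately, which changes the value by at most a factor of 2). A short sketch of that well-known fact, independent of the paper's data structures:

```python
import bisect

def ulam(x, y):
    """Ulam distance between two permutations of the same symbols:
    the minimum number of character moves (delete one symbol and
    reinsert it elsewhere) turning x into y.  Equals n - LCS(x, y);
    for permutations the LCS is a longest increasing subsequence
    after relabeling, computable in O(n log n) by patience sorting."""
    pos = {c: i for i, c in enumerate(x)}
    seq = [pos[c] for c in y]   # relabel y by each symbol's position in x
    tails = []                  # tails[k] = smallest tail of an IS of length k+1
    for v in seq:
        k = bisect.bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
        else:
            tails[k] = v
    return len(x) - len(tails)
```

For example, `ulam([1, 2, 3, 4], [2, 3, 4, 1])` is 1: moving the symbol 1 to the end suffices.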
The smoothed complexity of edit distance
In Proc. of ICALP, 2008
Abstract

Cited by 14 (3 self)
We initiate the study of the smoothed complexity of sequence alignment by proposing a semi-random model of edit distance between two input strings, generated as follows. First, an adversary chooses two binary strings of length d and a longest common subsequence A of them. Then, every character is perturbed independently with probability p, except that A is perturbed in exactly the same way inside the two strings. We design two efficient algorithms that compute the edit distance on smoothed instances up to a constant factor approximation. The first algorithm runs in near-linear time, namely d^{1+ε} for any fixed ε > 0. The second one runs in time sublinear in d, assuming the edit distance is not too small. These approximation and runtime guarantees are significantly better than the bounds known for worst-case inputs, e.g. the near-linear time algorithm achieving approximation roughly d^{1/3}, due to Batu, Ergün, and Sahinalp [SODA 2006]. Our technical contribution is twofold. First, we rely on finding matches between substrings in the two strings, where two substrings are considered a match if their edit distance is relatively small, a prevailing technique in commonly used heuristics, such as PatternHunter of Ma, Tromp and Li [Bioinformatics, 2002]. Second, we effectively reduce the smoothed edit distance to a simpler variant of (worst-case) edit distance, namely, edit distance on permutations (a.k.a. Ulam’s metric). We are thus able to build on algorithms developed for the Ulam metric, whose much better algorithmic guarantees usually do not carry over to general edit distance.
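The semi-random generation process described in this abstract can be sketched directly. In the sketch below, `alignment` is a hypothetical encoding of the adversary's chosen common subsequence A as index pairs (the paper fixes A abstractly), and strings are lists of bits perturbed by flipping:

```python
import random

def smoothed_instance(x, y, alignment, p, rng=random):
    """Sample a smoothed instance in the semi-random model: x and y
    are the adversary's binary strings (lists of 0/1), and
    `alignment` lists index pairs (i, j) with x[i] == y[j] forming
    the common subsequence A.  Every position is flipped
    independently with probability p, except that positions aligned
    by A receive the same flip in both strings."""
    x, y = list(x), list(y)
    partner = {i: j for i, j in alignment}   # x-index -> y-index inside A
    aligned_y = {j for _, j in alignment}
    for i in range(len(x)):
        if rng.random() < p:
            x[i] ^= 1
            if i in partner:                 # identical perturbation inside A
                y[partner[i]] ^= 1
    for j in range(len(y)):
        if j not in aligned_y and rng.random() < p:
            y[j] ^= 1
    return x, y
```

With p = 0 the adversary's strings are returned unchanged; with p = 1 every position flips, and aligned positions flip identically in both strings, as the model requires.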
Lower bounds for edit distance and product metrics via Poincaré-type inequalities
In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2010
Abstract

Cited by 8 (4 self)
We prove that any sketching protocol for edit distance achieving a constant approximation requires nearly logarithmic (in the strings’ length) communication complexity. This is an exponential improvement over the previous, doubly-logarithmic, lower bound of [Andoni-Krauthgamer, FOCS’07]. Our lower bound also applies to the Ulam distance (edit distance over non-repetitive strings). In this special case, it is polynomially related to the recent upper bound of [Andoni-Indyk-Krauthgamer, SODA’09]. From a technical perspective, we prove a direct-sum theorem for sketching product metrics that is of independent interest. We show that, for any metric X that requires sketch size which is a sufficiently large constant, sketching the max-product metric ℓ^d_∞(X) requires Ω(d) bits. The conclusion, in fact, also holds for arbitrary two-way communication. The proof uses a novel technique for information complexity based on Poincaré inequalities and suggests an intimate connection between non-embeddability, sketching and communication complexity.
Hashing TreeStructured Data: Methods and Applications
Abstract

Cited by 6 (0 self)
In this article we propose a new hashing framework for tree-structured data. Our method maps an unordered tree into a multiset of simple wedge-shaped structures referred to as pivots. By coupling our pivot multisets with the idea of min-wise hashing, we realize a fixed-size signature sketch of the tree-structured datum, yielding an effective mechanism for hashing such data. We discuss several potential pivot structures, study some of their theoretical properties, and examine their implications for tree edit distance and for properties related to perfect hashing. We then empirically demonstrate the efficacy and efficiency of the overall approach on a range of real-world datasets and applications.
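The min-wise hashing step this abstract relies on can be illustrated on a generic multiset. The pivot extraction itself is specific to the paper and not reproduced here, so `items` below stands in for a hypothetical pivot multiset:

```python
import hashlib

def minhash_signature(items, k=16):
    """Fixed-size min-wise-hashing signature of a multiset of hashable
    items (e.g. a pivot multiset extracted from a tree).  Uses k
    independent salted hash functions; two multisets with high
    Jaccard similarity agree on many coordinates in expectation.
    Illustrative sketch only."""
    def h(salt, item):
        data = f"{salt}:{item}".encode()
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    # Coordinate `salt` records the minimum hash value under hash #salt.
    return [min(h(salt, it) for it in items) for salt in range(k)]

def signature_similarity(sig_a, sig_b):
    """Fraction of agreeing coordinates: an unbiased estimate of the
    Jaccard similarity of the underlying (de-duplicated) sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Identical multisets always produce identical signatures, so their estimated similarity is exactly 1.0; disjoint multisets agree on a coordinate only via a hash collision.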
Polylogarithmic Approximation for Edit Distance and the Asymmetric Query Complexity
, 2010
Abstract

Cited by 6 (2 self)
We present a near-linear time algorithm that approximates the edit distance between two strings within a polylogarithmic factor; specifically, for strings of length n and every fixed ε > 0, it can compute a (log n)^{O(1/ε)} approximation in n^{1+ε} time. This is an exponential improvement over the previously known factor, 2^{Õ(√log n)}, with a comparable running time [OR07, AO09]. Previously, no efficient polylogarithmic approximation algorithm was known for any computational task involving edit distance (e.g., nearest neighbor search or sketching). This result arises naturally in the study of a new asymmetric query model. In this model, the input consists of two strings x and y, and an algorithm can access y in an unrestricted manner, while being charged for querying every symbol of x. Indeed, we obtain our main result by designing an algorithm that makes a small number of queries in this model. We then provide a nearly matching lower bound on the number of queries. Our lower bound is the first to expose hardness of edit distance stemming from the input strings being “repetitive”, which means that many of their substrings are approximately identical. Consequently, our lower bound provides the first rigorous separation between edit distance and Ulam distance, which is edit distance on non-repetitive strings, such as permutations.
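For reference, the exact quadratic-time baseline that approximation algorithms like this one are measured against is the classic dynamic program:

```python
def edit_distance(x, y):
    """Exact edit distance (insertions, deletions, substitutions)
    via the classic O(|x||y|) dynamic program, using two rows of
    the DP table for O(|y|) memory."""
    m, n = len(x), len(y)
    prev = list(range(n + 1))            # distances from "" to prefixes of y
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                          # delete x[i-1]
                         cur[j - 1] + 1,                       # insert y[j-1]
                         prev[j - 1] + (x[i - 1] != y[j - 1])) # substitute
        prev = cur
    return prev[n]
```

For example, `edit_distance("kitten", "sitting")` is 3 (two substitutions and one insertion).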
New Sublinear Methods in the Struggle against Classical Problems
, 2010
Abstract

Cited by 5 (0 self)
We study the time and query complexity of approximation algorithms that access only a minuscule fraction of the input, focusing on two classical sources of problems: combinatorial graph optimization and manipulation of strings. The tools we develop find applications outside of the area of sublinear algorithms. For instance, we obtain a more efficient approximation algorithm for edit distance and distributed algorithms for combinatorial problems on graphs that run in a constant number of communication rounds.
Adaptive metric dimensionality reduction
, 2013
Abstract

Cited by 4 (3 self)
We study data-adaptive dimensionality reduction in the context of supervised learning in general metric spaces. Our main statistical contribution is a generalization bound for Lipschitz functions in metric spaces that are doubling, or nearly doubling, which yields a new theoretical explanation for empirically reported improvements gained by preprocessing Euclidean data by PCA (Principal Components Analysis) prior to constructing a linear classifier. On the algorithmic front, we describe an analogue of PCA for metric spaces, namely an efficient procedure that approximates the data’s intrinsic dimension, which is often much lower than the ambient dimension. Our approach thus leverages the dual benefits of low dimensionality: (1) more efficient algorithms, e.g., for proximity search, and (2) more optimistic generalization bounds.
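To make the "intrinsic vs. ambient dimension" point concrete, here is a toy PCA-style estimate for Euclidean data in R^2, using the closed-form eigenvalues of a 2×2 covariance matrix so no external libraries are needed. It is not the paper's metric-space procedure, only an illustration of the phenomenon it exploits:

```python
import math

def intrinsic_dimension_2d(points, threshold=0.01):
    """Crude PCA-style intrinsic-dimension estimate for points in
    R^2: count covariance eigenvalues carrying more than `threshold`
    of the total variance.  Illustration only."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form eigenvalues of the 2x2 covariance matrix.
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    eigs = [tr / 2 + disc, tr / 2 - disc]
    total = sum(eigs) or 1.0
    return sum(e / total > threshold for e in eigs)
```

Points lying on a line in R^2 get intrinsic dimension 1 despite the ambient dimension 2, which is the phenomenon the abstract's generalization bounds exploit.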