Efficient sketches for earthmover distance, with applications
In FOCS, 2009
Cited by 28 (8 self)
We provide the first sublinear sketching algorithm for estimating the planar Earth-Mover Distance with a constant approximation. For sets living in the two-dimensional grid $[\Delta]^2$, we achieve space $\Delta^{\epsilon}$ for approximation $O(1/\epsilon)$, for any desired $0 < \epsilon < 1$. Our sketch has immediate applications to the streaming and nearest neighbor search problems.
The computational hardness of estimating edit distance
In Proceedings of the Symposium on Foundations of Computer Science, 2007
Cited by 24 (8 self)
We prove the first nontrivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings. To the best of our knowledge, this is the first computational setting in which the complexity of computing the edit distance is provably larger than that of Hamming distance. Our lower bound exhibits a tradeoff between approximation and communication, asserting, for example, that protocols with O(1) bits of communication can only obtain approximation α ≥ Ω(log d / log log d), where d is the length of the input strings. This case of O(1) communication is of particular importance since it captures constant-size sketches as well as embeddings into spaces like $L_1$ and squared-$L_2$, two prevailing algorithmic approaches for dealing with edit distance. Furthermore, the bound holds not only for strings over alphabet Σ = {0, 1}, but also for strings that are permutations (aka the Ulam metric). Besides being applicable to a much richer class of algorithms than all previous results, our bounds are near-tight in at least one case, namely of embedding permutations into $L_1$. The proof uses a new technique that relies on Fourier analysis in a rather elementary way.
The data stream space complexity of cascaded norms
In FOCS, 2009
Cited by 17 (7 self)
We consider the problem of estimating cascaded aggregates over a matrix presented as a sequence of updates in a data stream. A cascaded aggregate $P \circ Q$ is defined by evaluating aggregate $Q$ repeatedly over each row of the matrix, and then evaluating aggregate $P$ over the resulting vector of values. This problem was introduced by Cormode and Muthukrishnan, PODS, 2005 [CM]. We analyze the space complexity of estimating cascaded norms on an $n \times d$ matrix to within a small relative error. Let $L_p$ denote the $p$th norm, where $p$ is a nonnegative integer. We abbreviate the cascaded norm $L_k \circ L_p$ by $L_{k,p}$. (1) For any constant $k \ge p \ge 2$, we obtain a 1-pass $\tilde{O}(n^{1-2/k} d^{1-2/p})$-space algorithm for estimating $L_{k,p}$. This is optimal up to polylogarithmic factors and resolves an open question of [CM] regarding the space complexity of $L_{4,2}$. We also obtain 1-pass space-optimal algorithms for estimating $L_{\infty,k}$ and $L_{k,\infty}$. (2) We prove a space lower bound of $\Omega(n^{1-1/k})$ on estimating $L_{k,0}$ and $L_{k,1}$, resolving an open question due to Indyk, IITK Data Streams Workshop (Problem 8), 2006. We also resolve two more questions of [CM] concerning $L_{k,2}$ estimation and block heavy hitter problems. Ganguly, Bansal and Dube (FAW, 2008) claimed an $\tilde{O}(1)$-space algorithm for estimating $L_{k,p}$ for any $k, p \in [0,2]$. Our lower bounds show this claim is incorrect.
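To make the definition of a cascaded norm concrete, the following is a minimal offline sketch in Python that computes the quantity exactly from a fully stored matrix. It is not the streaming algorithm of the paper; the function names `lp_norm` and `cascaded_norm` are hypothetical, and treating the 0th norm as a count of nonzero entries and the ∞ norm as the maximum absolute value are assumptions made for illustration.

```python
import math


def lp_norm(values, p):
    """Return the p-th norm of a sequence of numbers.

    Assumptions for this illustration: p == 0 counts nonzero entries,
    p == math.inf takes the largest absolute value.
    """
    magnitudes = [abs(v) for v in values]
    if p == 0:
        return float(sum(1 for m in magnitudes if m != 0))
    if p == math.inf:
        return float(max(magnitudes, default=0.0))
    return sum(m ** p for m in magnitudes) ** (1.0 / p)


def cascaded_norm(matrix, k, p):
    """Apply the inner p-norm to every row, then the outer k-norm
    to the resulting vector of row values."""
    row_values = [lp_norm(row, p) for row in matrix]
    return lp_norm(row_values, k)


if __name__ == "__main__":
    example = [[3, 4], [0, 0], [6, 8]]
    print(cascaded_norm(example, 1, 2))         # sum of Euclidean row norms
    print(cascaded_norm(example, math.inf, 2))  # largest Euclidean row norm
```

With k = p = 2 this reduces to the Frobenius norm of the matrix, which is a quick sanity check on the composition order.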
The smoothed complexity of edit distance
In Proc. of ICALP, 2008
Cited by 14 (3 self)
We initiate the study of the smoothed complexity of sequence alignment, by proposing a semi-random model of edit distance between two input strings, generated as follows. First, an adversary chooses two binary strings of length d and a longest common subsequence A of them. Then, every character is perturbed independently with probability p, except that A is perturbed in exactly the same way inside the two strings. We design two efficient algorithms that compute the edit distance on smoothed instances up to a constant factor approximation. The first algorithm runs in near-linear time, namely $d^{1+\varepsilon}$ for any fixed $\varepsilon > 0$. The second one runs in time sublinear in d, assuming the edit distance is not too small. These approximation and runtime guarantees are significantly better than the bounds known for worst-case inputs, e.g., the near-linear time algorithm achieving approximation roughly $d^{1/3}$, due to Batu, Ergün, and Sahinalp [SODA 2006]. Our technical contribution is twofold. First, we rely on finding matches between substrings in the two strings, where two substrings are considered a match if their edit distance is relatively small, a prevailing technique in commonly used heuristics, such as PatternHunter of Ma, Tromp and Li [Bioinformatics, 2002]. Second, we effectively reduce the smoothed edit distance to a simpler variant of (worst-case) edit distance, namely, edit distance on permutations (a.k.a. Ulam's metric). We are thus able to build on algorithms developed for the Ulam metric, whose much better algorithmic guarantees usually do not carry over to general edit distance.
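The semi-random generation process described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: "perturbed" is taken to mean a bit flip, the common subsequence A is supplied as a list of index pairs, and the function name `smooth_instance` and its parameters are hypothetical. The sketch does not verify that the supplied matching is actually a longest common subsequence.

```python
import random


def smooth_instance(x, y, matching, p, seed=None):
    """Perturb two binary strings (lists of 0/1) in the semi-random model.

    matching: list of (i, j) pairs with x[i] == y[j], standing in for the
    adversary's chosen common subsequence A (assumed, not checked, to be
    an LCS). Each matched pair shares one coin flip; every other position
    gets its own independent coin flip with probability p.
    """
    rng = random.Random(seed)
    xs, ys = list(x), list(y)
    matched_x = {i for i, _ in matching}
    matched_y = {j for _, j in matching}

    # Matched characters are perturbed in exactly the same way in both strings.
    for i, j in matching:
        if xs[i] != ys[j]:
            raise ValueError("matching pairs must align equal characters")
        if rng.random() < p:
            xs[i] ^= 1
            ys[j] ^= 1

    # Unmatched characters are perturbed independently.
    for i in range(len(xs)):
        if i not in matched_x and rng.random() < p:
            xs[i] ^= 1
    for j in range(len(ys)):
        if j not in matched_y and rng.random() < p:
            ys[j] ^= 1
    return xs, ys


if __name__ == "__main__":
    x = [0, 1, 1, 0]
    y = [1, 1, 0, 0]
    a = [(1, 0), (2, 1), (3, 2)]  # aligns the common subsequence 1, 1, 0
    print(smooth_instance(x, y, a, p=0.25, seed=7))
```

Because matched pairs share a coin, the chosen subsequence remains common to both outputs for any p, which is the property the model relies on.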
Lower Bound Techniques for Data Structures
2008
Cited by 8 (0 self)
We describe new techniques for proving lower bounds on data-structure problems, with the following broad consequences:
• the first Ω(lg n) lower bound for any dynamic problem, improving on a bound that had been standing since 1989;
• for static data structures, the first separation between linear and polynomial space. Specifically, for some problems that have constant query time when polynomial space is allowed, we can show Ω(lg n / lg lg n) bounds when the space is O(n · polylog n).
Using these techniques, we analyze a variety of central data-structure problems, and obtain improved lower bounds for the following:
• the partial-sums problem (a fundamental application of augmented binary search trees);
• the predecessor problem (which is equivalent to IP lookup in Internet routers);
• dynamic trees and dynamic connectivity;
• orthogonal range stabbing;
• orthogonal range counting, and orthogonal range reporting;
• the partial match problem (searching with wildcards);
• (1 + ε)-approximate near neighbor on the hypercube;
• approximate nearest neighbor in the ℓ∞ metric.
Our new techniques lead to surprisingly non-technical proofs. For several problems, we obtain simpler proofs for bounds that were already known.
Lower bounds for edit distance and product metrics via Poincaré-type inequalities
In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2010
Cited by 8 (4 self)
We prove that any sketching protocol for edit distance achieving a constant approximation requires nearly logarithmic (in the strings' length) communication complexity. This is an exponential improvement over the previous, doubly-logarithmic, lower bound of [Andoni-Krauthgamer, FOCS'07]. Our lower bound also applies to the Ulam distance (edit distance over non-repetitive strings). In this special case, it is polynomially related to the recent upper bound of [Andoni-Indyk-Krauthgamer, SODA'09]. From a technical perspective, we prove a direct-sum theorem for sketching product metrics that is of independent interest. We show that, for any metric X that requires sketch size which is a sufficiently large constant, sketching the max-product metric $\ell_\infty^d(X)$ requires Ω(d) bits. The conclusion, in fact, also holds for arbitrary two-way communication. The proof uses a novel technique for information complexity based on Poincaré inequalities and suggests an intimate connection between non-embeddability, sketching and communication complexity. ∗ Work done while at MIT. † Work done while at IBM Almaden.
Block heavy hitters
2008
Cited by 4 (2 self)
We study a natural generalization of the heavy hitters problem in the streaming context. We term this generalization block heavy hitters and define it as follows. We are to stream over a matrix A, and report all rows that are heavy, where a row is heavy if its $\ell_1$ norm is at least a φ fraction of the $\ell_1$ norm of the entire matrix A. In comparison, in the standard heavy hitters problem, we are required to report the matrix entries that are heavy. As is common in streaming, we solve the problem approximately: we return all rows with weight at least φ, but also possibly some other rows that have weight no less than (1 − ɛ)φ. To solve the block heavy hitters problem, we show how to construct a linear sketch of A from which we can recover the heavy rows of A. The block heavy hitters problem has already found applications for other streaming problems. In particular, it is a crucial building block in a streaming algorithm of [AIK08] that constructs a small-size sketch for the Ulam metric, a metric on non-repetitive strings under the edit (Levenshtein) distance.
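The problem statement above can be illustrated with an exact, offline computation over a fully stored matrix. This is only a sketch of the definition, not the linear-sketch construction of the paper. The function names are hypothetical, and "weight" is assumed to mean a row's share of the total $\ell_1$ norm.

```python
def row_weights(matrix):
    """Return (list of row l1 norms, l1 norm of the whole matrix)."""
    weights = [sum(abs(v) for v in row) for row in matrix]
    return weights, sum(weights)


def block_heavy_hitters(matrix, phi):
    """Indices of rows whose l1 norm is at least a phi fraction of the total."""
    weights, total = row_weights(matrix)
    if total == 0:
        return []
    return [i for i, w in enumerate(weights) if w >= phi * total]


def is_valid_approximate_answer(matrix, reported, phi, eps):
    """Check the approximate guarantee: every phi-heavy row is reported,
    and every reported row has weight at least (1 - eps) * phi."""
    weights, total = row_weights(matrix)
    reported = set(reported)
    must_report = set(block_heavy_hitters(matrix, phi))
    allowed = {i for i, w in enumerate(weights)
               if total > 0 and w >= (1 - eps) * phi * total}
    return must_report <= reported <= allowed


if __name__ == "__main__":
    a = [[5, 5], [1, 0], [2, 2]]   # row l1 norms: 10, 1, 4 (total 15)
    print(block_heavy_hitters(a, 0.3))
    print(is_valid_approximate_answer(a, [0, 2], phi=0.3, eps=0.2))
```

The checker makes the slack explicit: rows between (1 − ɛ)φ and φ may or may not be reported, while anything lighter must not be.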
Rademacher-Sketch: A Dimensionality-Reducing Embedding for Sum-Product Norms, with an Application to Earth-Mover Distance
Cited by 4 (2 self)
Consider a sum-product normed space, i.e. a space of the form $Y = \ell_1^n \otimes X$, where X is another normed space. Each element in Y consists of a length-n vector of elements in X, and the norm of an element in Y is the sum of the norms of its coordinates. In this paper we show a constant-distortion embedding from the normed space $\ell_1^n \otimes X$ into a lower-dimensional normed space $\ell_1^{n'} \otimes X$, where $n' \ll n$ is some value that depends on the properties of the normed space X (namely, on its Rademacher dimension). In particular, composing this embedding with another well-known embedding of Indyk [18], we get an $O(1/\epsilon)$-distortion embedding from the earth-mover metric $\mathrm{EMD}_\Delta$ on the grid $[\Delta]^2$ to $\ell_1^{\Delta^{O(\epsilon)}} \otimes \mathrm{EEMD}_{\Delta^{\epsilon}}$ (where EEMD is a norm that generalizes earth-mover distance). This embedding is stronger (and simpler) than the sketching algorithm of Andoni et al. [4], which maps $\mathrm{EMD}_\Delta$ with $O(1/\epsilon)$ approximation into sketches of size $\Delta^{O(\epsilon)}$.
Estimating the longest increasing sequence in polylogarithmic time
Cited by 4 (3 self)
Finding the length of the longest increasing subsequence (LIS) is a classic algorithmic problem. Let n denote the size of the array. Simple O(n log n) time algorithms are known that determine the LIS exactly. In this paper, we develop a randomized approximation algorithm that, for any constant δ > 0, runs in time polylogarithmic in n and estimates the length of the LIS of an array up to an additive error of δn. The algorithm presented in this extended abstract runs in time $(\log n)^{O(1/\delta)}$. In the full paper, we will give an improved version of the algorithm with running time $(\log n)^c (1/\delta)^{O(1/\delta)}$, where the exponent c is independent of δ. Previously, the best known polylogarithmic time algorithms could only achieve an additive n/2-approximation. Our techniques also yield a fast algorithm for estimating the distance to monotonicity to within a small multiplicative factor. The distance of f to monotonicity, $\varepsilon_f$, is equal to $1 - \mathrm{LIS}/n$ (the fractional length of the complement of the LIS). For any δ > 0, we give an algorithm with running time $O((\varepsilon_f^{-1} \log n)^{O(1/\delta)})$ that outputs a (1 + δ)-multiplicative approximation to $\varepsilon_f$. This can be improved so that the exponent is a fixed constant. The previously known polylogarithmic algorithms gave only a 2-approximation.
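For reference, the exact O(n log n) baseline mentioned in this abstract, together with the distance to monotonicity $1 - \mathrm{LIS}/n$, can be sketched as below. This is the standard patience-sorting approach, not the polylogarithmic-time estimator of the paper. The function names are hypothetical, and "increasing" is assumed to mean strictly increasing.

```python
from bisect import bisect_left


def lis_length(values):
    """Length of the longest strictly increasing subsequence, O(n log n).

    tails[i] holds the smallest possible last element of an increasing
    subsequence of length i + 1 seen so far.
    """
    tails = []
    for v in values:
        pos = bisect_left(tails, v)
        if pos == len(tails):
            tails.append(v)   # v extends the longest subsequence found so far
        else:
            tails[pos] = v    # v gives a smaller tail for length pos + 1
    return len(tails)


def distance_to_monotonicity(values):
    """Fraction of entries outside a longest increasing subsequence."""
    n = len(values)
    if n == 0:
        return 0.0
    return 1.0 - lis_length(values) / n


if __name__ == "__main__":
    data = [3, 1, 2, 5, 4]
    print(lis_length(data))
    print(distance_to_monotonicity(data))
```

Both functions read the entire array, which is exactly the linear cost the sublinear estimator in the abstract is designed to avoid.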