Results 1–8 of 8
The computational hardness of estimating edit distance
In Proceedings of the Symposium on Foundations of Computer Science, 2007
Abstract

Cited by 23 (8 self)
We prove the first nontrivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings. To the best of our knowledge, this is the first computational setting in which the complexity of computing the edit distance is provably larger than that of Hamming distance. Our lower bound exhibits a tradeoff between approximation and communication, asserting, for example, that protocols with O(1) bits of communication can only obtain approximation α ≥ Ω(log d / log log d), where d is the length of the input strings. This case of O(1) communication is of particular importance since it captures constant-size sketches as well as embeddings into spaces like L1 and squared-L2, two prevailing algorithmic approaches for dealing with edit distance. Furthermore, the bound holds not only for strings over alphabet Σ = {0, 1}, but also for strings that are permutations (aka the Ulam metric). Besides being applicable to a much richer class of algorithms than all previous results, our bounds are near-tight in at least one case, namely of embedding permutations into L1. The proof uses a new technique that relies on Fourier analysis in a rather elementary way.
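As context for the lower bound above: the baseline being separated from Hamming distance is the classic quadratic-time dynamic program for edit distance. A minimal sketch (the standard textbook algorithm, not this paper's technique):

```python
def levenshtein(x: str, y: str) -> int:
    """Classic O(|x|*|y|)-time, O(|y|)-space dynamic program for edit distance."""
    prev = list(range(len(y) + 1))          # distances from empty prefix of x
    for i, cx in enumerate(x, 1):
        curr = [i]                          # distance from x[:i] to empty y
        for j, cy in enumerate(y, 1):
            curr.append(min(
                prev[j] + 1,                # delete cx
                curr[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]
```

For comparison, the Hamming distance of two equal-length strings is just the number of mismatched positions, computable in a single linear pass.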
The data stream space complexity of cascaded norms
In FOCS, 2009
Abstract

Cited by 17 (7 self)
Abstract — We consider the problem of estimating cascaded aggregates over a matrix presented as a sequence of updates in a data stream. A cascaded aggregate P ∘ Q is defined by evaluating aggregate Q repeatedly over each row of the matrix, and then evaluating aggregate P over the resulting vector of values. This problem was introduced by Cormode and Muthukrishnan, PODS, 2005 [CM]. We analyze the space complexity of estimating cascaded norms on an n × d matrix to within a small relative error. Let L_p denote the p-th norm, where p is a nonnegative integer. We abbreviate the cascaded norm L_k ∘ L_p by L_{k,p}. (1) For any constant k ≥ p ≥ 2, we obtain a 1-pass Õ(n^{1−2/k} d^{1−2/p})-space algorithm for estimating L_{k,p}. This is optimal up to polylogarithmic factors and resolves an open question of [CM] regarding the space complexity of L_{4,2}. We also obtain 1-pass space-optimal algorithms for estimating L_{∞,k} and L_{k,∞}. (2) We prove a space lower bound of Ω(n^{1−1/k}) on estimating L_{k,0} and L_{k,1}, resolving an open question due to Indyk, IITK Data Streams Workshop (Problem 8), 2006. We also resolve two more questions of [CM] concerning L_{k,2} estimation and block heavy hitter problems. Ganguly, Bansal and Dube (FAW, 2008) claimed an Õ(1)-space algorithm for estimating L_{k,p} for any k, p ∈ [0,2]. Our lower bounds show this claim is incorrect.
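To make the L_k ∘ L_p definition concrete, here is a direct (non-streaming) evaluation of a cascaded norm on a small in-memory matrix. The streaming algorithms in the paper approximate this quantity in sublinear space; this sketch, which assumes k, p ≥ 1, makes no such attempt:

```python
def cascaded_norm(matrix, k, p):
    """Exact L_k ∘ L_p: take the L_p norm of each row, then the
    L_k norm of the resulting vector of row norms (assumes k, p >= 1)."""
    row_norms = [sum(abs(v) ** p for v in row) ** (1 / p) for row in matrix]
    return sum(r ** k for r in row_norms) ** (1 / k)
```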
The smoothed complexity of edit distance
In Proc. of ICALP, 2008
Abstract

Cited by 14 (3 self)
We initiate the study of the smoothed complexity of sequence alignment by proposing a semi-random model of edit distance between two input strings, generated as follows. First, an adversary chooses two binary strings of length d and a longest common subsequence A of them. Then, every character is perturbed independently with probability p, except that A is perturbed in exactly the same way inside the two strings. We design two efficient algorithms that compute the edit distance on smoothed instances up to a constant factor approximation. The first algorithm runs in near-linear time, namely d^{1+ε} for any fixed ε > 0. The second one runs in time sublinear in d, assuming the edit distance is not too small. These approximation and runtime guarantees are significantly better than the bounds known for worst-case inputs, e.g. the near-linear time algorithm achieving approximation roughly d^{1/3}, due to Batu, Ergün, and Sahinalp [SODA 2006]. Our technical contribution is twofold. First, we rely on finding matches between substrings in the two strings, where two substrings are considered a match if their edit distance is relatively small, a prevailing technique in commonly used heuristics, such as PatternHunter of Ma, Tromp and Li [Bioinformatics, 2002]. Second, we effectively reduce the smoothed edit distance to a simpler variant of (worst-case) edit distance, namely, edit distance on permutations (a.k.a. Ulam’s metric). We are thus able to build on algorithms developed for the Ulam metric, whose much better algorithmic guarantees usually do not carry over to general edit distance.
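A small simulation of the semi-random generation step described above: each character is flipped independently with probability p, except that positions on the common subsequence A are flipped identically in both strings. The alignment of A is assumed to be given explicitly as index pairs, and all names here are illustrative:

```python
import random

def perturb(x, y, align, p, rng=random.Random(0)):
    """Smoothing step of the semi-random model. align is a list of (i, j)
    index pairs giving a common subsequence A of x and y; fixed seed for
    reproducibility of this sketch."""
    x, y = list(x), list(y)
    a_x = {i for i, _ in align}
    a_y = {j for _, j in align}
    flip = lambda c: '1' if c == '0' else '0'
    for i, j in align:                      # A: one coin, applied to both
        if rng.random() < p:
            x[i], y[j] = flip(x[i]), flip(y[j])
    for i in range(len(x)):                 # remaining positions: independent
        if i not in a_x and rng.random() < p:
            x[i] = flip(x[i])
    for j in range(len(y)):
        if j not in a_y and rng.random() < p:
            y[j] = flip(y[j])
    return ''.join(x), ''.join(y)
```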
The Streaming Complexity of Cycle Counting, Sorting By Reversals, and Other Problems
2010
Abstract

Cited by 5 (1 self)
In this paper we introduce a new technique for proving streaming lower bounds (and one-way communication lower bounds), by reductions from a problem called the Boolean Hidden Hypermatching problem (BHH). BHH is a problem that we introduce and prove the first lower bound for, but it is a generalization of a well-known problem called Boolean Hidden Matching, which was used by Gavinsky et al. to prove separations between quantum communication complexity and one-way randomized communication complexity. The hardness of the BHH problem is inherently one-way: it is easy to solve using logarithmic two-way communication, but requires √n communication if Alice is only allowed to send messages to Bob, and not vice versa. This one-wayness allows us to prove lower bounds, via reductions, for streaming problems and related communication problems whose hardness is also inherently one-way. By designing reductions from BHH, we prove lower bounds for the streaming complexity of approximating the sorting-by-reversals distance, for approximately counting the number of cycles in a 2-regular graph, and for other problems. For example, here is one lower bound that we prove, for a cycle-counting problem: Alice gets a perfect matching EA on a set of n nodes, and Bob gets a perfect matching EB on the same set of nodes. The union EA ∪ EB is a collection of cycles, and the goal is to approximate the number of cycles in this collection. We prove that if Alice is allowed to send o(√n) bits to Bob (and Bob is not allowed to send anything to Alice), then the number of cycles cannot be approximated to within a factor of 1.999, even using a randomized protocol. We prove that it is not even possible to distinguish the case where all cycles are of length 4 from the case where all cycles are of length 8. This lower bound is “natively” one-way: with 4 rounds of communication, it is easy to distinguish these two cases.
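For intuition on why the cycle-counting example is easy without the one-way restriction: with both matchings in hand, one can simply walk the 2-regular union graph, alternating between the two matchings. A minimal offline sketch (not the communication protocol itself; names are illustrative):

```python
def count_cycles(ea, eb):
    """Count cycles in the union of two perfect matchings EA, EB on the
    same node set; each matching is a list of node pairs. The union is
    2-regular, so it decomposes into disjoint even-length cycles."""
    nxt_a, nxt_b = {}, {}
    for u, v in ea:
        nxt_a[u], nxt_a[v] = v, u
    for u, v in eb:
        nxt_b[u], nxt_b[v] = v, u
    seen, cycles = set(), 0
    for start in nxt_a:
        if start in seen:
            continue
        cycles += 1
        node, use_a = start, True           # alternate EA-edge, EB-edge
        while node not in seen:
            seen.add(node)
            node = (nxt_a if use_a else nxt_b)[node]
            use_a = not use_a
    return cycles
```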
Polylogarithmic Approximation for Edit Distance and the Asymmetric Query Complexity
2010
Abstract

Cited by 5 (2 self)
We present a near-linear time algorithm that approximates the edit distance between two strings within a polylogarithmic factor; specifically, for strings of length n and every fixed ε > 0, it can compute a (log n)^{O(1/ε)} approximation in n^{1+ε} time. This is an exponential improvement over the previously known factor, 2^{Õ(√log n)}, with a comparable running time [OR07, AO09]. Previously, no efficient polylogarithmic approximation algorithm was known for any computational task involving edit distance (e.g., nearest neighbor search or sketching). This result arises naturally in the study of a new asymmetric query model. In this model, the input consists of two strings x and y, and an algorithm can access y in an unrestricted manner, while being charged for querying every symbol of x. Indeed, we obtain our main result by designing an algorithm that makes a small number of queries in this model. We then provide a nearly matching lower bound on the number of queries. Our lower bound is the first to expose hardness of edit distance stemming from the input strings being “repetitive”, which means that many of their substrings are approximately identical. Consequently, our lower bound provides the first rigorous separation between edit distance and Ulam distance, which is edit distance on non-repetitive strings, such as permutations.
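The Ulam distance appearing in the last sentence is, unlike general edit distance, easy to compute exactly: under the common definition as the minimum number of character moves between two permutations of the same set, it equals n minus the length of a longest increasing subsequence after relabelling one permutation by positions in the other, computable in O(n log n) by patience sorting. A sketch under that assumption (not this paper's algorithm):

```python
from bisect import bisect_left

def ulam_distance(x, y):
    """Ulam distance between permutations x and y of the same elements,
    via d(x, y) = n - LIS(seq), where seq relabels y by positions in x."""
    pos = {v: i for i, v in enumerate(x)}
    seq = [pos[v] for v in y]
    tails = []                          # patience sorting: tails[i] is the
    for s in seq:                       # smallest tail of an increasing
        i = bisect_left(tails, s)       # subsequence of length i + 1
        if i == len(tails):
            tails.append(s)
        else:
            tails[i] = s
    return len(x) - len(tails)
```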
New Sublinear Methods in the Struggle against Classical Problems
2010
Abstract

Cited by 5 (0 self)
We study the time and query complexity of approximation algorithms that access only a minuscule fraction of the input, focusing on two classical sources of problems: combinatorial graph optimization and manipulation of strings. The tools we develop find applications outside of the area of sublinear algorithms. For instance, we obtain a more efficient approximation algorithm for edit distance and distributed algorithms for combinatorial problems on graphs that run in a constant number of communication rounds.
Research Statement
2008
Abstract
My research lies in the field of theoretical computer science, with a primary focus in computational complexity theory and a secondary focus in (approximation) algorithms. These two areas may seem unconnected on the surface, but are in fact two sides of a coin: one of the chief goals in my complexity research is to establish limits on our ability to solve certain problems with computers, whereas in my work on approximation algorithms I attempt to work around the proven (or seeming) intractability of computational problems that need to be solved for various applications. Both areas place great emphasis on precise mathematical modelling of computational problems and rigorous proofs (rather than experimental evidence) to ensure that the research results remain valid in spite of future advances in computer hardware and software. Finally, both areas draw upon, and contribute to, a common toolkit of ideas and basic techniques, leading to plenty of opportunities for cross-fertilisation. Below, I provide some basic background for both these focus areas. I then identify key themes in my research so far in Section 2, and move on to outlining my most important specific results (rather than exhaustively listing all my results), loosely grouped by topic, in Sections 3 and 4. Finally, in Section 5, I discuss some research directions and specific challenges that I would like to tackle in my future work. Copies of my papers can be found at