Results 1 - 8 of 8
The computational hardness of estimating edit distance
In Proceedings of the Symposium on Foundations of Computer Science, 2007
Abstract

Cited by 24 (8 self)
We prove the first nontrivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings. To the best of our knowledge, this is the first computational setting in which the complexity of computing the edit distance is provably larger than that of Hamming distance. Our lower bound exhibits a tradeoff between approximation and communication, asserting, for example, that protocols with O(1) bits of communication can only obtain approximation α ≥ Ω(log d / log log d), where d is the length of the input strings. This case of O(1) communication is of particular importance since it captures constant-size sketches as well as embeddings into spaces like L1 and squared-L2, two prevailing algorithmic approaches for dealing with edit distance. Furthermore, the bound holds not only for strings over alphabet Σ = {0, 1}, but also for strings that are permutations (aka the Ulam metric). Besides being applicable to a much richer class of algorithms than all previous results, our bounds are near-tight in at least one case, namely of embedding permutations into L1. The proof uses a new technique that relies on Fourier analysis in a rather elementary way.
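For readers unfamiliar with the two distances contrasted in this abstract, the following is a minimal sketch of Hamming distance and of the textbook dynamic program for edit (Levenshtein) distance. It is not taken from the paper; the function names `hamming` and `levenshtein` are chosen here purely for illustration.

```python
def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def levenshtein(x, y):
    """Edit distance with unit-cost insertions, deletions and substitutions.

    Classic O(len(x) * len(y)) dynamic program, keeping only two rows.
    """
    prev = list(range(len(y) + 1))  # distance from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty prefix of y
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


# A shifted binary string: far apart in Hamming distance, close in edit distance.
x = "0101010101"
y = "1010101010"
print(hamming(x, y), levenshtein(x, y))
```

On this pair every position differs, yet deleting the first character of `x` and appending a `0` yields `y`, which gives a rough sense of why edit distance behaves so differently from Hamming distance.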
Bed-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance
Abstract

Cited by 17 (1 self)
Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the Bed-tree, a B+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.
The smoothed complexity of edit distance
In Proc. of ICALP, 2008
Abstract

Cited by 14 (3 self)
We initiate the study of the smoothed complexity of sequence alignment, by proposing a semi-random model of edit distance between two input strings, generated as follows. First, an adversary chooses two binary strings of length d and a longest common subsequence A of them. Then, every character is perturbed independently with probability p, except that A is perturbed in exactly the same way inside the two strings. We design two efficient algorithms that compute the edit distance on smoothed instances up to a constant factor approximation. The first algorithm runs in near-linear time, namely d^{1+ε} for any fixed ε > 0. The second one runs in time sublinear in d, assuming the edit distance is not too small. These approximation and runtime guarantees are significantly better than the bounds known for worst-case inputs, e.g. a near-linear time algorithm achieving approximation roughly d^{1/3}, due to Batu, Ergün, and Sahinalp [SODA 2006]. Our technical contribution is twofold. First, we rely on finding matches between substrings in the two strings, where two substrings are considered a match if their edit distance is relatively small, a prevailing technique in commonly used heuristics, such as PatternHunter of Ma, Tromp and Li [Bioinformatics, 2002]. Second, we effectively reduce the smoothed edit distance to a simpler variant of (worst-case) edit distance, namely, edit distance on permutations (a.k.a. Ulam's metric). We are thus able to build on algorithms developed for the Ulam metric, whose much better algorithmic guarantees usually do not carry over to general edit distance.
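To illustrate why the Ulam metric is algorithmically easier than general edit distance, here is a minimal sketch that is not drawn from the paper. It assumes one common convention, in which the Ulam distance between two permutations of the same symbols is the minimum number of "move one symbol elsewhere" operations, which equals n minus the length of their longest common subsequence. Because no symbol repeats, that subsequence reduces to a longest increasing subsequence and can be found in O(n log n) time. The names `lis_length` and `ulam_distance` are illustrative.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[k] = smallest possible tail of an increasing run of length k+1
    for v in seq:
        k = bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
        else:
            tails[k] = v
    return len(tails)


def ulam_distance(p, q):
    """Minimum number of single-symbol moves turning permutation p into q.

    Assumes p and q are orderings of the same set of distinct symbols.
    """
    if len(set(p)) != len(p) or sorted(p) != sorted(q):
        raise ValueError("inputs must be permutations of the same distinct symbols")
    position_in_q = {sym: i for i, sym in enumerate(q)}
    # Symbols kept in place form a common subsequence, i.e. an increasing
    # run of q-positions when read in p's order.
    return len(p) - lis_length([position_in_q[sym] for sym in p])


print(ulam_distance([1, 2, 3, 4, 5], [2, 3, 4, 5, 1]))  # one move: relocate the 1
```

Under other conventions (for example, counting a move as one deletion plus one insertion) the value changes by at most a constant factor.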
Polylogarithmic Approximation for Edit Distance and the Asymmetric Query Complexity
2010
Abstract

Cited by 6 (2 self)
We present a near-linear time algorithm that approximates the edit distance between two strings within a polylogarithmic factor; specifically, for strings of length n and every fixed ε > 0, it can compute a (log n)^{O(1/ε)} approximation in n^{1+ε} time. This is an exponential improvement over the previously known factor, 2^{Õ(√log n)}, with a comparable running time [OR07, AO09]. Previously, no efficient polylogarithmic approximation algorithm was known for any computational task involving edit distance (e.g., nearest neighbor search or sketching). This result arises naturally in the study of a new asymmetric query model. In this model, the input consists of two strings x and y, and an algorithm can access y in an unrestricted manner, while being charged for querying every symbol of x. Indeed, we obtain our main result by designing an algorithm that makes a small number of queries in this model. We then provide a nearly-matching lower bound on the number of queries. Our lower bound is the first to expose hardness of edit distance stemming from the input strings being "repetitive", which means that many of their substrings are approximately identical. Consequently, our lower bound provides the first rigorous separation between edit distance and Ulam distance, which is edit distance on non-repetitive strings, such as permutations.
Rademacher-Sketch: A Dimensionality-Reducing Embedding for Sum-Product Norms, with an Application to Earth-Mover Distance
Abstract

Cited by 4 (2 self)
Consider a sum-product normed space, i.e. a space of the form Y = ℓ_1^n ⊗ X, where X is another normed space. Each element in Y consists of a length-n vector of elements in X, and the norm of an element in Y is the sum of the norms of its coordinates. In this paper we show a constant-distortion embedding from the normed space ℓ_1^n ⊗ X into a lower-dimensional normed space ℓ_1^{n′} ⊗ X, where n′ ≪ n is some value that depends on the properties of the normed space X (namely, on its Rademacher dimension). In particular, composing this embedding with another well-known embedding of Indyk [18], we get an O(1/ε)-distortion embedding from the earth-mover metric EMD_Δ on the grid [Δ]^2 to ℓ_1^{Δ^{O(ε)}} ⊗ EEMD_{Δ^ε} (where EEMD is a norm that generalizes earth-mover distance). This embedding is stronger (and simpler) than the sketching algorithm of Andoni et al. [4], which maps EMD_Δ with O(1/ε) approximation into sketches of size Δ^{O(ε)}.
Efficient top-k algorithms for approximate substring matching
In SIGMOD Conference, 2013
Abstract

Cited by 2 (0 self)
There is a wide range of applications that need to query a large database of texts to search for similar strings or substrings. Traditional approximate substring matching requires a user to specify a similarity threshold. Without top-k approximate substring matching, users have to repeatedly try different maximum distance threshold values when the proper threshold is unknown in advance. In our paper, we first propose efficient algorithms for finding the top-k approximate substring matches with a given query string in a set of data strings. To reduce the number of expensive distance computations, the proposed algorithms utilize our novel filtering techniques, which take advantage of q-grams and the available inverted q-gram indexes. We conduct extensive experiments with real-life data sets. Our experimental results confirm the effectiveness and scalability of our proposed algorithms.
Categories and Subject Descriptors: H.2 [Database Management]: Systems - query processing, textual databases
Keywords: Top-k approximate substring matching; edit distance; inverted q-gram index
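The abstract does not spell out its filtering techniques, so the sketch below shows only the classic q-gram count filter for whole strings, as a hedged illustration of how q-grams can rule out candidates before an expensive edit distance computation. It relies on the standard observation that one edit operation destroys at most q overlapping q-grams, so two strings within edit distance k must share at least max(|s|, |t|) - q + 1 - k·q q-grams (counted as multisets). The helper names are hypothetical and this is not the paper's algorithm.

```python
from collections import Counter


def qgrams(s, q):
    """Multiset of all length-q substrings of s (empty if s is shorter than q)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(s, t, q):
    """Size of the multiset intersection of the q-grams of s and t."""
    return sum((qgrams(s, q) & qgrams(t, q)).values())


def passes_count_filter(s, t, q, k):
    """False means edit distance(s, t) > k is certain; True means 'still a candidate'."""
    needed = max(len(s), len(t)) - q + 1 - k * q
    return shared_qgrams(s, t, q) >= needed


# "kitten" and "sitting" are at edit distance 3 and share the 2-grams "it" and "tt".
print(shared_qgrams("kitten", "sitting", 2))
print(passes_count_filter("kitten", "sitting", 2, 1))  # pruned for threshold k = 1
print(passes_count_filter("kitten", "sitting", 2, 3))  # kept for threshold k = 3
```

The filter is one-sided: a pair that passes still has to be verified with an exact distance computation, while a pair that fails can be discarded safely.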
The Dyck language edit distance problem in near-linear time
FOCS, 2014
Abstract

Cited by 1 (1 self)
Given a string σ over alphabet Σ and a grammar G defined over the same alphabet, what is the minimum number of repairs (insertions, deletions and substitutions) required to map σ into a valid member of G? The seminal work of Aho and Peterson in 1972 initiated the study of this language edit distance problem, providing a dynamic programming algorithm for context-free languages that runs in O(|G|^2 n^3) time, where n is the string length and |G| is the grammar size. While later improvements reduced the running time to O(|G| n^3), the cubic dependence on the input length remained a major bottleneck for applying these algorithms to their multitude of applications. In this paper, we study the language edit distance problem for a fundamental context-free language, DYCK(s), representing the language of well-balanced parentheses of s different types, that has been pivotal in the development of formal language theory. We provide the very first near-linear time algorithm to tightly approximate the DYCK(s) language edit distance problem for arbitrary s. DYCK(s) language edit distance significantly generalizes the well-studied string edit distance problem, and appears in most applications of language edit distance, ranging from data quality in databases and generating automated error-correcting parsers in compiler optimization to structure prediction problems in biological sequences. Its nondeterministic counterpart is known as the hardest context-free language. Our main result is an algorithm for edit distance computation to DYCK(s) for any positive integer s that runs in O(n^{1+ε} polylog(n)) time and achieves an approximation factor of O((1/ε) β(n) log |OPT|), for any ε > 0. Here OPT is the optimal edit distance to DYCK(s) and β(n) is the best approximation factor known for the simpler problem of string edit distance running in analogous time. If we allow O(n^{1+ε} + |OPT|^2 n^ε) time, then the approximation factor can be reduced to O((1/ε) log |OPT|).
Since the best known near-linear time algorithm for the string edit distance problem has β(n) = polylog(n), under the near-linear time computation model both the DYCK(s) language and string edit distance problems have polylog(n) approximation factors. This comes as a surprise since the former is a significant generalization of the latter and their exact computations via dynamic programming show a stark difference in time complexity. Rather less surprisingly, we show that the framework for efficiently approximating edit distance to DYCK(s) can be utilized for many other languages. We illustrate this by considering various memory checking languages (studied extensively under distributed verification) such as STACK, QUEUE, PQ and DEQUE, which consist of valid transcripts of stacks, queues, priority queues and double-ended queues respectively. Therefore, any language that can be recognized by these data structures can also be repaired efficiently by our algorithm.
Efficient Genome-Wide, Privacy-Preserving Similar Patient Query based on Private Edit Distance
Abstract
Edit distance has proven to be an important and frequently used metric in much human genomic research, with Similar Patient Query (SPQ) being a particularly promising and attractive example. However, due to widespread privacy concerns about revealing personal genomic data, the scope and scale of many novel uses of genome edit distance are substantially limited. While the problem of private genomic edit distance has been studied by the research community for over a decade [5], the state-of-the-art solution [30] is far from being applicable to real genome sequences. In this paper, we propose several private edit distance protocols that feature unprecedentedly high efficiency and precision. Our construction is a combination of a novel genomic edit distance approximation algorithm and a new construction of private set difference size protocols. With the private edit distance based secure SPQ primitive, we propose GENSETS, a genome-wide, privacy-preserving similar patient query system. It is able to support searching large-scale, distributed genome databases across the nation. We have implemented a prototype of GENSETS. The experimental results show that, with a 100 Mbps network connection, it would take GENSETS less than 200 minutes to search through 1 million breast cancer patients (distributed nationwide in 250 hospitals, each having 4000 patients), based on edit distances between their genomes of lengths about 75 million nucleotides each.