Results 11-20 of 133
A sublinear algorithm for weakly approximating edit distance
In Proc. STOC 2003
Cited by 39 (4 self)
We show how to determine whether the edit distance between two given strings is small in sublinear time. Specifically, we present a test which, given two n-character strings A and B, runs in time $o(n)$ and with high probability returns “CLOSE” if their edit distance is $O(n^{\alpha})$, and “FAR” if their edit distance is $\Omega(n)$, where $\alpha$ is a fixed parameter less than 1. Our algorithm for testing the edit distance works by recursively subdividing the strings A and B into smaller substrings and looking for pairs of substrings in A, B with small edit distance. To do this, we query both strings at random places using a special technique for economizing on the samples which does not pick the samples independently and provides better query and overall complexity. As a result, our test runs in time $\tilde{O}(n^{\max\{\alpha/2,\, 2\alpha-1\}})$ for any fixed $\alpha < 1$. Our algorithm thus provides a tradeoff between accuracy and efficiency that is particularly useful when the input data is very large. We also show a lower bound of $\Omega(n^{\alpha/2})$ on the query complexity of every algorithm that distinguishes pairs of strings with edit distance at most $n^{\alpha}$ from those with edit distance at least $n/6$.
EdJoin: An Efficient Algorithm for Similarity Joins With Edit Distance Constraints
2008
Cited by 34 (6 self)
There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with edit distance constraints. Existing approaches are mainly based on converting the edit distance constraint to a weaker constraint on the number of matching q-grams between pairs of strings. In this paper, we propose the novel perspective of investigating mismatching q-grams. Technically, we derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q-grams. A new algorithm, EdJoin, is proposed that exploits the new mismatch-based filtering methods; it achieves substantial reduction of the candidate sizes and hence saves computation time. We demonstrate experimentally that the new algorithm outperforms alternative methods on large-scale real datasets under a wide range of parameter settings.
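
The "weaker constraint on the number of matching q-grams" mentioned in this abstract can be illustrated with the standard count filter: two strings within edit distance tau must share at least max(|s|, |t|) - q + 1 - q*tau q-grams (counted as multisets). The sketch below shows only this baseline filter, not EdJoin's mismatch-based filters; the function names `qgrams` and `passes_count_filter` are hypothetical.

```python
from collections import Counter


def qgrams(s: str, q: int) -> Counter:
    """Return the multiset of all length-q substrings of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_count_filter(s: str, t: str, q: int, tau: int) -> bool:
    """Necessary (not sufficient) condition for edit_distance(s, t) <= tau.

    Each edit operation destroys at most q q-grams, so a pair within
    distance tau must still share at least the number computed below.
    """
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    required = max(len(s), len(t)) - q + 1 - q * tau
    return shared >= required


if __name__ == "__main__":
    # One substitution: survives the filter for tau = 1, pruned for tau = 0.
    print(passes_count_filter("abcdef", "abcxef", q=2, tau=1))
    print(passes_count_filter("abcdef", "abcxef", q=2, tau=0))
```

Pairs that fail the filter can be discarded without computing the edit distance; pairs that pass still need verification.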
Secure and Private Sequence Comparisons
In WPES’03: Proceedings of the 2003 ACM Workshop on Privacy in the Electronic Society, 2003
Cited by 31 (7 self)
We give an efficient protocol for sequence comparisons of the edit-distance kind, such that neither party reveals anything about their private sequence to the other party (other than what can be inferred from the edit distance between their two sequences, which is unavoidable because computing that distance is the purpose of the protocol). The amount of communication done by our protocol is proportional to the time complexity of the best-known algorithm for performing the sequence comparison.
Longest Common Subsequences
In Proc. of 19th MFCS, number 841 in LNCS, 1994
Cited by 29 (1 self)
The length of a longest common subsequence (LLCS) of two or more strings is a useful measure of their similarity. The LLCS of a pair of strings is related to the 'edit distance', or number of mutations/errors/editing steps required in passing from one string to the other. In this talk, we explore some of the combinatorial properties of the sub- and supersequence relations, survey various algorithms for computing the LLCS, and introduce some results on the expected LLCS for pairs of random strings.
1 Introduction
The set $\Sigma^*$ of finite strings over an unordered finite alphabet $\Sigma$ admits of several natural partial orders. Some, such as the substring, prefix, and suffix relations, depend on contiguity and lead to many interesting combinatorial questions with practical applications to string matching. An excellent survey is given by Aho in [1]. In this talk, however, we will focus on the 'subsequence' partial order. We say that $u = u_1 \cdots u_m$ is a subsequence of ...
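
As a small illustration of the LLCS and its relation to edit distance, here is a sketch of the classical dynamic-programming computation. It assumes the restricted edit model in which only insertions and deletions are allowed; under that assumption the distance equals |a| + |b| - 2 * LLCS(a, b). The function names are hypothetical, and this is the textbook quadratic method rather than any particular algorithm surveyed in the talk.

```python
def llcs(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b, in O(|a|*|b|) time."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Edit distance when only insertions and deletions are allowed."""
    return len(a) + len(b) - 2 * llcs(a, b)


if __name__ == "__main__":
    print(llcs("AGGTAB", "GXTXAYB"))
    print(indel_distance("abc", "acb"))
```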
Episode matching
In Proceedings of CPM
Cited by 27 (1 self)
Given two words, text T of length n and episode P of length m, the episode matching problem is to find all minimal length substrings of text T that contain episode P as a subsequence. The respective optimization problem is to find the smallest number w such that text T has a subword of length w which contains episode P. In this paper, we introduce a few efficient offline as well as online algorithms for the entire problem, where by online algorithms we mean algorithms which search from left to right consecutive text symbols only once. We present two alphabet-independent algorithms which work in time $O(nm)$. The offline algorithm operates in $O(1)$ additional space while the online algorithm pays for its property with $O(m)$ additional space. Two other online algorithms have subquadratic time complexity. One of them works in time $O(nm/\log m)$ and $O(m)$ additional space. The other one gives a time/space tradeoff, i.e., it works in time $O(n + s + nm \log\log s / \log(s/m))$ when additional space is limited to $O(s)$. Finally, we present two approximation algorithms for the optimization problem. The offline algorithm is alphabet independent, it has superlinear time complexity $O(n/\epsilon + n \log\log(n/m))$ and it uses only constant space. The online algorithm works in time $O(n/\epsilon + n)$ and uses space $O(m)$. Both approximation algorithms achieve a $1 + \epsilon$ approximation ratio, for any $\epsilon > 0$.
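
The optimization problem in this abstract (the smallest w such that some length-w substring of T contains P as a subsequence) can be sketched with a simple left-to-right scan that uses O(m) extra space and O(nm) time. This is a minimal illustration written for this listing, not necessarily the authors' algorithm; the name `shortest_episode_window` is hypothetical.

```python
from typing import Optional


def shortest_episode_window(text: str, episode: str) -> Optional[int]:
    """Smallest w such that a length-w substring of text contains episode
    as a subsequence; None if no such substring exists."""
    m = len(episode)
    if m == 0:
        return 0
    # start[j]: latest start index of a window ending at the current text
    # position that contains episode[:j] as a subsequence (-1 if none yet).
    start = [-1] * (m + 1)
    best: Optional[int] = None
    for i, c in enumerate(text):
        # Go from long prefixes to short ones so that start[j - 1] still
        # holds the value from before the current symbol was read.
        for j in range(m, 0, -1):
            if c == episode[j - 1]:
                if j == 1:
                    start[1] = i
                elif start[j - 1] != -1:
                    start[j] = start[j - 1]
        if start[m] != -1:
            w = i - start[m] + 1
            if best is None or w < best:
                best = w
    return best


if __name__ == "__main__":
    print(shortest_episode_window("axbxcabc", "abc"))
```

Each text symbol is read once, which matches the abstract's notion of an online algorithm.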
Backward Machine Transliteration by Learning Phonetic Similarity
Sixth Conference on Natural Language Learning, 2002
Cited by 26 (8 self)
In many cross-lingual applications we need to convert a transliterated word into its original word. In this paper, we present a similarity-based framework to model the task of backward transliteration, and provide a learning algorithm to automatically acquire phonetic similarities from a corpus. The learning algorithm is based on the Widrow-Hoff rule with some modifications. The experimental results show that the learning algorithm converges quickly, and the method using acquired phonetic similarities remarkably outperforms previous methods using predefined phonetic similarities or graphic similarities in a corpus of 1574 pairs of English names and transliterated Chinese names. The learning algorithm does not assume any underlying phonological structures or rules, and can be extended to other language pairs once a training corpus and a pronouncing dictionary are available.
A Fast and Practical Bit-Vector Algorithm for the Longest Common Subsequence Problem
Information Processing Letters, 2000
Cited by 25 (2 self)
This paper presents a new practical bit-vector algorithm for solving the well-known Longest Common Subsequence (LCS) problem. Given two strings of length m and n, $n \ge m$, we present an algorithm which determines the length p of an LCS in $O(nm/w)$ time and $O(m/w)$ space, where w is the number of bits in a machine word. This algorithm can be thought of as column-wise "parallelization" of the classical dynamic programming approach. Our algorithm is very efficient in practice, where computing the length of an LCS of two strings can be done in linear time and constant (additional/working) space by assuming that $m \le w$.
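
The column-wise bit-parallel idea can be sketched as follows. This is a commonly cited formulation of bit-vector LLCS (one addition, one subtraction and a few logical operations per text symbol), offered as an illustration and not checked line by line against the paper. A Python integer masked to m bits stands in for the machine words, so the sketch shows the recurrence rather than the $O(nm/w)$ running time; the name `llcs_bitvector` is hypothetical.

```python
def llcs_bitvector(x: str, y: str) -> int:
    """Length of an LCS of x and y using one m-bit vector per DP column."""
    m = len(x)
    mask = (1 << m) - 1

    # match[c] has bit i set exactly when x[i] == c.
    match = {}
    for i, c in enumerate(x):
        match[c] = match.get(c, 0) | (1 << i)

    v = mask  # all ones: no LCS symbol found yet
    for c in y:
        u = v & match.get(c, 0)
        # The addition propagates carries through runs of ones; v - u equals
        # v AND NOT match[c] because u is a subset of the bits of v.
        v = ((v + u) | (v - u)) & mask

    # Every zero bit among the low m bits marks one unit of LCS length.
    return m - bin(v).count("1")


if __name__ == "__main__":
    print(llcs_bitvector("AGGTAB", "GXTXAYB"))
```

In a fixed-width implementation the vector would be split into ceil(m/w) machine words with explicit carry handling between them.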
Low distortion embeddings for edit distance
In Proceedings of the Symposium on Theory of Computing, 2005
Cited by 23 (1 self)
We show that $\{0, 1\}^d$ endowed with edit distance embeds into $\ell_1$ with distortion $2^{O(\sqrt{\log d \log\log d})}$. We further show efficient implementations of the embedding that yield solutions to various computational problems involving edit distance. These include sketching, communication complexity, and nearest neighbor search. For all these problems, we improve upon previous bounds.
Linear-Time Computation of Similarity Measures for Sequential Data
2008
Cited by 23 (17 self)
Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms, enabling peak performances of up to 10^6 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA.
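
To make the embedding idea concrete, the sketch below maps each sequence to its k-gram counts and computes one kernel and one distance over those counts. It uses Python hash maps (`collections.Counter`) rather than the sorted arrays, tries or suffix trees named in the abstract, so it illustrates the embedding only, not the authors' data structures or their running times; all function names are hypothetical.

```python
from collections import Counter


def kgram_counts(seq: str, k: int) -> Counter:
    """Embed a sequence as the multiset of its contiguous k-grams."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def linear_kernel(a: str, b: str, k: int) -> int:
    """Inner product of the two k-gram count vectors."""
    ca, cb = kgram_counts(a, k), kgram_counts(b, k)
    return sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())


def manhattan_distance(a: str, b: str, k: int) -> int:
    """L1 distance between the two k-gram count vectors."""
    ca, cb = kgram_counts(a, k), kgram_counts(b, k)
    # Counter returns 0 for k-grams that are absent from one sequence.
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())


if __name__ == "__main__":
    print(linear_kernel("abab", "abba", k=2))
    print(manhattan_distance("abab", "abba", k=2))
```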
Matching for Run-Length Encoded Strings
1999
Cited by 22 (2 self)
In this paper, we develop significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems. A string S is run-length encoded if it is described as an ordered sequence of pairs $(\sigma, i)$, each consisting of an alphabet symbol $\sigma$ and an integer i. Each pair corresponds to a run in S consisting of i consecutive occurrences of $\sigma$. For example, the string