Results 11-20 of 182
Algorithms and Complexity for Annotated Sequence Analysis
, 1999
Cited by 43 (1 self)
Molecular biologists use algorithms that compare and otherwise analyze sequences that represent genetic and protein molecules. Most of these algorithms, however, operate on the basic sequence and do not incorporate the additional information that is often known about the molecule and its pieces. This research describes schemes to combinatorially annotate this information onto sequences so that it can be analyzed in tandem with the sequence; the overall result would thus reflect both types of information about the sequence. These annotation schemes include adding colours and arcs to the sequence. Colouring a sequence would produce a same-length sequence of colours or other symbols that highlight or label parts of the sequence. Arcs can be used to link sequence symbols (or coloured substrings) to indicate molecular bonds or other relationships. Adding these annotations to sequence analysis problems such as sequence alignment or finding the longest common subsequence can make the problem more complex, often depending on the complexity of the annotation scheme. This research examines the different annotation schemes and the corresponding problems of verifying annotations, creating annotations, and finding the longest common subsequence of pairs of sequences with annotations. This work involves both the conventional complexity framework and parameterized complexity, and includes algorithms and hardness results for both frameworks. Automata and transducers are created for some annotation verification and creation problems. Different restrictions on layered substring and arc annotation are considered to determine what properties an annotation scheme must have to make its incorporation feasible. Extensions to the algorithms that use weighting schemes are explored. Examin...
A sublinear algorithm for weakly approximating edit distance
 In Proc. STOC 2003
, 2003
Cited by 36 (4 self)
We show how to determine whether the edit distance between two given strings is small in sublinear time. Specifically, we present a test which, given two n-character strings A and B, runs in time o(n) and with high probability returns “CLOSE” if their edit distance is O(n^α), and “FAR” if their edit distance is Ω(n), where α is a fixed parameter less than 1. Our algorithm for testing the edit distance works by recursively subdividing the strings A and B into smaller substrings and looking for pairs of substrings in A, B with small edit distance. To do this, we query both strings at random places using a special technique for economizing on the samples which does not pick the samples independently and provides better query and overall complexity. As a result, our test runs in time Õ(n^max{α/2, 2α−1}) for any fixed α < 1. Our algorithm thus provides a tradeoff between accuracy and efficiency that is particularly useful when the input data is very large. We also show a lower bound of Ω(n^(α/2)) on the query complexity of every algorithm that distinguishes pairs of strings with edit distance at most n^α from those with edit distance at least n/6.
Secure and Private Sequence Comparisons
 In WPES’03: Proceedings of the 2003 ACM workshop on Privacy in the electronic society
, 2003
Cited by 36 (7 self)
We give an efficient protocol for sequence comparisons of the edit-distance kind, such that neither party reveals anything about their private sequence to the other party (other than what can be inferred from the edit distance between their two sequences, which is unavoidable because computing that distance is the purpose of the protocol). The amount of communication done by our protocol is proportional to the time complexity of the best-known algorithm for performing the sequence comparison.
Longest Common Subsequences
 In Proc. of 19th MFCS, number 841 in LNCS
, 1994
Cited by 31 (1 self)
The length of a longest common subsequence (LLCS) of two or more strings is a useful measure of their similarity. The LLCS of a pair of strings is related to the 'edit distance', or number of mutations/errors/editing steps required in passing from one string to the other. In this talk, we explore some of the combinatorial properties of the sub- and supersequence relations, survey various algorithms for computing the LLCS, and introduce some results on the expected LLCS for pairs of random strings. 1 Introduction. The set Σ* of finite strings over an unordered finite alphabet Σ admits of several natural partial orders. Some, such as the substring, prefix, and suffix relations, depend on contiguity and lead to many interesting combinatorial questions with practical applications to string-matching. An excellent survey is given by Aho in [1]. In this talk however we will focus on the 'subsequence' partial order. We say that u = u_1 ··· u_m is a subsequence of ...
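To make the link between the LLCS and edit distance concrete, here is a minimal Python sketch. It assumes the classical quadratic dynamic programme and an edit distance restricted to insertions and deletions, for which the distance equals |u| + |v| − 2·LLCS(u, v). The function names `llcs` and `indel_distance` are illustrative and do not come from the paper.

```python
def llcs(u: str, v: str) -> int:
    """Length of a longest common subsequence of u and v.

    Classical dynamic programme: O(|u| * |v|) time, O(|v|) space.
    """
    prev = [0] * (len(v) + 1)  # DP row for the previous prefix of u
    for a in u:
        curr = [0]  # column 0: the empty prefix of v
        for j, b in enumerate(v, start=1):
            if a == b:
                # Extend a common subsequence of the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last symbol of one prefix or the other.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(u: str, v: str) -> int:
    """Edit distance when only insertions and deletions are allowed (assumption)."""
    return len(u) + len(v) - 2 * llcs(u, v)


if __name__ == "__main__":
    print(llcs("ABCBDAB", "BDCABA"))
    print(indel_distance("ABCBDAB", "BDCABA"))
```

Under this restricted cost model every symbol outside a longest common subsequence must be deleted from one string or inserted into the other, which is why the two measures determine each other.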
Backward Machine Transliteration by Learning Phonetic Similarity
 SIXTH CONFERENCE ON NATURAL LANGUAGE LEARNING
, 2002
Cited by 29 (8 self)
In many cross-lingual applications we need to convert a transliterated word into its original word. In this paper, we present a similarity-based framework to model the task of backward transliteration, and provide a learning algorithm to automatically acquire phonetic similarities from a corpus. The learning algorithm is based on the Widrow-Hoff rule with some modifications. The experimental results show that the learning algorithm converges quickly, and the method using acquired phonetic similarities remarkably outperforms previous methods using predefined phonetic similarities or graphic similarities in a corpus of 1574 pairs of English names and transliterated Chinese names. The learning algorithm does not assume any underlying phonological structures or rules, and can be extended to other language pairs once a training corpus and a pronouncing dictionary are available.
Episode matching
 In Proceedings of CPM
Cited by 27 (1 self)
Given two words, text T of length n and episode P of length m, the episode matching problem is to find all minimal length substrings of text T that contain episode P as a subsequence. The respective optimization problem is to find the smallest number w, s.t. text T has a subword of length w which contains episode P. In this paper, we introduce a few efficient offline as well as online algorithms for the entire problem, where by online algorithms we mean algorithms which search from left to right consecutive text symbols only once. We present two alphabet independent algorithms which work in time O(nm). The offline algorithm operates in O(1) additional space while the online algorithm pays for its property with O(m) additional space. Two other online algorithms have subquadratic time complexity. One of them works in time O(nm/log m) and O(m) additional space. The other one gives a time/space tradeoff, i.e., it works in time O(n + s + nm log log s / log(s/m)) when additional space is limited to O(s). Finally, we present two approximation algorithms for the optimization problem. The offline algorithm is alphabet independent, it has superlinear time complexity O(n/ε + n log log(n/m)) and it uses only constant space. The online algorithm works in time O(n/ε + n) and uses space O(m). Both approximation algorithms achieve 1 + ε approximation ratio, for any ε > 0.
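The optimization problem stated in this abstract can be illustrated with a short Python sketch. It is a straightforward left-to-right scan using O(nm) time and O(m) additional space, written only to make the problem definition concrete; it is not claimed to be any of the paper's algorithms, and the name `min_episode_window` is hypothetical.

```python
from typing import Optional


def min_episode_window(text: str, episode: str) -> Optional[int]:
    """Smallest w such that some length-w substring of text contains
    episode as a subsequence; None if there is no such substring."""
    m = len(episode)
    if m == 0:
        return 0  # the empty episode is contained in the empty substring
    # start[j]: latest start index of a window that ends at or before the
    # current text position and contains episode[0..j] as a subsequence,
    # or -1 if no such window has been seen yet.
    start = [-1] * m
    best: Optional[int] = None
    for i, c in enumerate(text):
        # Scan j downwards so that start[j - 1] still refers to positions
        # strictly before i when it is read.
        for j in range(m - 1, -1, -1):
            if episode[j] == c:
                start[j] = i if j == 0 else start[j - 1]
        if start[m - 1] >= 0:
            w = i - start[m - 1] + 1
            if best is None or w < best:
                best = w
    return best


if __name__ == "__main__":
    print(min_episode_window("axbab", "ab"))
```

Each text symbol is read exactly once, from left to right, which matches the abstract's notion of an online algorithm.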
A Fast and Practical Bit-Vector Algorithm for the Longest Common Subsequence Problem
 Information Processing Letters
, 2000
Cited by 27 (2 self)
This paper presents a new practical bit-vector algorithm for solving the well known Longest Common Subsequence (LCS) problem. Given two strings of length m and n, n ≥ m, we present an algorithm which determines the length p of an LCS in O(nm/w) time and O(m/w) space, where w is the number of bits in a machine word. This algorithm can be thought of as column-wise "parallelization" of the classical dynamic programming approach. Our algorithm is very efficient in practice, where computing the length of an LCS of two strings can be done in linear time and constant (additional/working) space by assuming that m ≤ w.
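As a rough illustration of the column-wise bit-parallel idea, the sketch below uses a widely known bit-vector formulation of the LLCS recurrence. It assumes Python's arbitrary-precision integers as a stand-in for the ⌈m/w⌉ machine words, so it shows the update rule rather than the word-level engineering, and it may differ in detail from the paper's algorithm. The name `llcs_bitvector` is illustrative.

```python
def llcs_bitvector(x: str, y: str) -> int:
    """Length of an LCS of x and y using one m-bit vector per column."""
    m = len(x)
    full = (1 << m) - 1  # m one-bits

    # masks[c] has bit i set exactly when x[i] == c.
    masks = {}
    for i, ch in enumerate(x):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    # Bit i of v is 0 exactly when the DP value increases between rows
    # i and i + 1 of the current column; initially there are no increases.
    v = full
    for ch in y:
        u = v & masks.get(ch, 0)
        # One addition, one subtraction and one OR update the whole column;
        # the final mask discards the carry out of the top bit.
        v = ((v + u) | (v - u)) & full

    # The number of zero bits is the LLCS.
    return m - bin(v).count("1")


if __name__ == "__main__":
    print(llcs_bitvector("ABCBDAB", "BDCABA"))
```

When m ≤ w the vector fits in a single machine word, so each symbol of the longer string costs a constant number of word operations, which is the linear-time, constant-space case mentioned in the abstract.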
Linear-Time Computation of Similarity Measures for Sequential Data
, 2008
Cited by 25 (17 self)
Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms, enabling peak performances of up to 10^6 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA.
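The embedding idea can be sketched in a few lines of Python. The example assumes k-grams as the embedding language and uses hash-based counts from the standard library instead of the sorted arrays, tries or suffix trees named in the abstract; the function names, and the choice of a Manhattan distance and a linear kernel as the two measures, are illustrative assumptions.

```python
from collections import Counter


def kgram_counts(s: str, k: int) -> Counter:
    """Embed s as a sparse vector of k-gram occurrence counts."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def manhattan_distance(x: str, y: str, k: int) -> int:
    """Sum of absolute count differences over all k-grams of x and y."""
    cx, cy = kgram_counts(x, k), kgram_counts(y, k)
    # A Counter returns 0 for a k-gram it has not seen.
    return sum(abs(cx[g] - cy[g]) for g in cx.keys() | cy.keys())


def linear_kernel(x: str, y: str, k: int) -> int:
    """Inner product of the two k-gram count vectors."""
    cx, cy = kgram_counts(x, k), kgram_counts(y, k)
    return sum(c * cy[g] for g, c in cx.items())


if __name__ == "__main__":
    print(manhattan_distance("abab", "abba", 2))
    print(linear_kernel("abab", "abba", 2))
```

Both measures only touch k-grams that occur in at least one of the two sequences, so the work is proportional to the sequence lengths rather than to the size of the full k-gram vocabulary.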
SECURE OUTSOURCING OF SEQUENCE COMPARISONS
Cited by 23 (5 self)
Large-scale problems in the physical and life sciences are being revolutionized by Internet computing technologies, like grid computing, that make possible the massive cooperative sharing of computational power, bandwidth, storage, and data. A weak computational device, once connected to such a grid, is no longer limited by its slow speed, small amounts of local storage, and limited bandwidth: It can avail itself of the abundance of these resources that is available elsewhere on the network. An impediment to the use of “computational outsourcing” is that the data in question is often sensitive, e.g., of national security importance, or proprietary and containing commercial secrets, or to be kept private for legal requirements such as the HIPAA legislation, Gramm-Leach-Bliley, or similar laws. This motivates the design of techniques for computational outsourcing in a privacy-preserving manner, i.e., without revealing to the remote agents whose computational power is being used, either one’s data or the outcome of the computation on the data. This paper investigates such secure outsourcing for widely applicable sequence comparison problems, and gives an efficient protocol for a
Matching for Run-Length Encoded Strings
, 1999
Cited by 23 (2 self)
In this paper, we develop significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems. A string S is run-length encoded if it is described as an ordered sequence of pairs (σ, i), each consisting of an alphabet symbol σ and an integer i. Each pair corresponds to a run in S consisting of i consecutive occurrences of σ. For example, the string