On The Closest String and Substring Problems
Journal of the ACM, 2002

Cited by 54 (14 self)
The problem of finding a center string that is 'close' to every given string arises in computational molecular biology and coding theory. This problem has two versions: the Closest String problem and the Closest Substring problem. Given a set of strings S = {s_1, s_2, ..., s_n}, each of length m, the Closest String problem is to find the smallest d and a string s of length m which is within Hamming distance d to each s_i ∈ S. This problem comes from coding theory when we are looking for a code not too far away from a given set of codes. The Closest Substring problem, with an additional input integer L, asks for the smallest d and a string s, of length L, which is within Hamming distance d away from a substring, of length L, of each s_i. This problem is much more elusive than the Closest String problem. The Closest Substring problem is formulated from applications in finding conserved regions, identifying genetic drug targets and generating genetic probes in molecular biology. Whether there are efficient approximation algorithms for both problems are major open questions in this area. We present two polynomial time approximation algorithms with approximation ratio 1 + ε for any small ε to settle both questions.
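To make the two problem statements above concrete, here is a minimal sketch in Python. It is an exhaustive search over all candidate center strings, so it is exponential in the string length and is not the approximation algorithm the abstract refers to. The function names and the explicit `alphabet` parameter are assumptions introduced for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def radius_string(center, strings):
    """Largest Hamming distance from center to any whole input string."""
    return max(hamming(center, s) for s in strings)


def radius_substring(center, strings):
    """Largest distance from center to the best length-L window of each string."""
    L = len(center)
    return max(
        min(hamming(center, s[i:i + L]) for i in range(len(s) - L + 1))
        for s in strings
    )


def closest_string_bruteforce(strings, alphabet):
    """Try every string of length m over the alphabet; return (d, s)."""
    m = len(strings[0])
    candidates = ("".join(c) for c in product(alphabet, repeat=m))
    best = min(candidates, key=lambda c: radius_string(c, strings))
    return radius_string(best, strings), best


def closest_substring_bruteforce(strings, L, alphabet):
    """Try every string of length L over the alphabet; return (d, s)."""
    candidates = ("".join(c) for c in product(alphabet, repeat=L))
    best = min(candidates, key=lambda c: radius_substring(c, strings))
    return radius_substring(best, strings), best
```

The only difference between the two searches is the radius function: Closest String compares the center against whole strings, while Closest Substring may pick the best-matching window of length L in each input string.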
Finding Similar Regions In Many Strings
Journal of Computer and System Sciences, 1999

Cited by 53 (8 self)
Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. We solve three main open questions in this area. Assume that we are given n DNA sequences s_1, ..., s_n. The Consensus Patterns problem, which has been widely studied in bioinformatics research [26, 16, 12, 25, 4, 6, 15, 22, 24, 27], in its simplest form, asks for a region of length L in each s_i, and a median string s of length L so that the total Hamming distance from s to these regions is minimized. We show the problem is NP-hard and give a polynomial time approximation scheme (PTAS) for it. We also give a PTAS for the problem under the original measure of [26, 16, 12, 25]. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NP-hard) version of the important star alignment problem allowing at most constant number of gaps, each of arbitrary length, in each sequence. The Closest String problem [2, 3, 7, 9, 18] asks for the smallest d and a string s which is within Hamming distance d to each s_i. The problem is NP-hard [7, 18]. [3] gives a polynomial time algorithm for constant d. For superlogarithmic d, [2, 9] give efficient approximation algorithms using linear program relaxation techniques. The best polynomial time approximation has ratio 4/3 for all d, given by [18] ([9] also independently claimed the 4/3 ratio but only for superlogarithmic d). We settle the problem with a PTAS. We then give the first nontrivial better-than-2 approximation with ratio 2 − 2/(2|Σ|+1) for the more elusive Closest
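The Consensus Patterns objective described above (total Hamming distance from a median string to one chosen region per sequence) can be illustrated with a small exhaustive search. This is a sketch for tiny inputs only, exponential in the number of sequences, and is not the PTAS from the abstract. It relies on one observation: once the regions are fixed, a column-wise majority string minimizes the total distance. All function names are assumptions.

```python
from collections import Counter
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def column_majority(windows):
    """Most frequent symbol in each column of equal-length windows."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*windows))


def consensus_patterns_exhaustive(seqs, L):
    """Try every choice of one length-L window per sequence.

    Returns (total_distance, median, start_positions) for the best choice.
    """
    starts = [range(len(s) - L + 1) for s in seqs]
    best = None
    for choice in product(*starts):
        windows = [s[i:i + L] for s, i in zip(seqs, choice)]
        median = column_majority(windows)
        cost = sum(hamming(median, w) for w in windows)
        if best is None or cost < best[0]:
            best = (cost, median, choice)
    return best
```

For example, `consensus_patterns_exhaustive(["GGACGT", "ACGTTT", "TACGAA"], 3)` finds the shared region `ACG` with total distance 0. Note the contrast with Closest String, which minimizes the maximum distance rather than the sum.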
Efficient Approximation Algorithms for the Hamming Center Problem
1999

Cited by 27 (2 self)
The Hamming center problem for a set S of k binary strings, each of length n, asks for a binary string β of length n that minimizes the maximum Hamming distance between β and any string in S. The decision version of this problem is known to be NP-complete [6]. We provide several approximation algorithms for the Hamming center problem. Our main result is a randomized (4/3 + ε)-approximation algorithm running in polynomial time if the Hamming radius of S is at least superlogarithmic in k. Furthermore, we show how to find in polynomial time a set B of O(log k) strings of length n such that for each string in S there is at least one string in B within Hamming distance not exceeding the radius of S. 1 Introduction Let Z_2^n be the set of all strings of length n over the alphabet {0, 1}. For any α ∈ Z_2^n we use the notation α[i] to refer to the symbol placed at the ith position of α, where i = 1, ..., n, and we let α[i..j] represent the substring of α starting at position i and endin...
Fixed-Parameter Algorithms for Closest String and Related Problems
ALGORITHMICA, 2003

Cited by 25 (8 self)
Closest String is one of the core problems in the field of consensus word analysis with particular importance for computational biology. Given k strings
Exact Solutions for Closest String and Related Problems
2001

Cited by 16 (4 self)
Closest String is one of the core problems in the field of consensus word analysis with particular importance for computational biology. Given k strings of same length and a positive integer d, find a "closest string" s such that none of the given strings has Hamming distance greater than d from s. Closest String is NP-complete. We show how to solve Closest String in linear time for constant d (the exponential growth is O(d^d)). We extend this result to the closely related problems d-Mismatch and Distinguishing String Selection. Moreover, we discuss fixed parameter tractability for parameter k and give an efficient linear time algorithm for Closest String when k = 3. Finally, the practical usefulness of our findings is substantiated by some experimental results.
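One way to obtain a search whose size grows like the O(d^d) mentioned above is a bounded search tree. The sketch below is an illustration of that general technique, not a transcription of the paper's algorithm, and the function name is an assumption. It starts from the first input string and, whenever some input string is more than d away, branches on d + 1 mismatching positions while spending one unit of a budget of d changes.

```python
def closest_string_fpt(strings, d):
    """Return a string within Hamming distance d of every input, or None.

    Bounded search tree: depth at most d, at most d + 1 branches per node.
    """

    def search(candidate, budget):
        for s in strings:
            mismatches = [
                p for p, (x, y) in enumerate(zip(candidate, s)) if x != y
            ]
            if len(mismatches) > d:
                if budget == 0:
                    return None  # no changes left to repair this string
                # Any solution agrees with s on at least one of any d + 1
                # mismatching positions, so trying d + 1 of them suffices.
                for p in mismatches[:d + 1]:
                    moved = candidate[:p] + s[p] + candidate[p + 1:]
                    found = search(moved, budget - 1)
                    if found is not None:
                        return found
                return None
        return candidate  # every input string is within distance d

    return search(strings[0], d)
```

Each node scans all k strings, so the work per node is linear in the input size, and the tree has at most (d + 1)^d leaves, which is where the dependence on d alone comes from.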
On the Parameterized Intractability of Closest Substring and Related Problems
In Proc. 19th STACS, volume 2285 of LNCS, 2002

Cited by 13 (4 self)
We show that Closest Substring, one of the most important problems in the field of biological sequence analysis, is W[1]-hard with respect to the number k of input strings (even over a binary alphabet). This problem is therefore unlikely to be solvable in time O(f(k) · n^c) for any function f and constant c independent of k; effectively, the problem can be expected to be intractable, in any practical sense, for k ≥ 3. Our result supports the intuition that Closest Substring is computationally much harder than the special case of Closest String, although both problems are NP-complete and both possess polynomial time approximation schemes. We also prove W[1]-hardness for other parameterizations in the case of unbounded alphabet size. Our main W[1]-hardness result generalizes to Consensus Patterns, a problem of similar significance in computational biology.
Genetic design of drugs without side-effects
SIAM Journal on Computing, 2003

Cited by 13 (4 self)
Consider two sets of strings, B (bad genes) and G (good genes), as well as two integers d_b and d_g (d_b ≤ d_g). A frequently occurring problem in computational biology (and other fields) is to find a (distinguishing) substring s of length L that distinguishes the bad strings from good strings, i.e., such that for each string s_i ∈ B there exists a length-L substring t_i of s_i with d(s, t_i) ≤ d_b (close to bad strings), and for every substring u_i of length L of every string g_i ∈ G, d(s, u_i) ≥ d_g (far from good strings). We present a polynomial time approximation scheme to settle the problem; i.e., for any constant ε > 0, the algorithm finds a string s of length L such that for every s_i ∈ B there is a length-L substring t_i of s_i with d(t_i, s) ≤ (1 + ε)d_b, and for every substring u_i of length L of every g_i ∈ G, d(u_i, s) ≥ (1 − ε)d_g if a solution to the original pair (d_b ≤ d_g) exists. Since there is a polynomial number of such pairs (d_b, d_g), we can exhaust all the possibilities in polynomial time to find a good approximation required by the corresponding application problems.
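The two conditions in the abstract above can be checked directly for a given candidate s. The following sketch is only a verifier, not the approximation scheme itself. The function names and the `epsilon` parameter (which relaxes the thresholds to (1 + ε)d_b and (1 − ε)d_g as in the abstract) are assumptions for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def windows(text, L):
    """All length-L substrings of text, left to right."""
    return [text[i:i + L] for i in range(len(text) - L + 1)]


def distinguishes(s, bad, good, d_b, d_g, epsilon=0.0):
    """Check whether s is close to every bad string and far from every good one.

    Close: each bad string has some window within (1 + epsilon) * d_b of s.
    Far: every window of every good string is at least (1 - epsilon) * d_g from s.
    """
    L = len(s)
    close_limit = (1 + epsilon) * d_b
    far_limit = (1 - epsilon) * d_g
    close_to_bad = all(
        # A bad string shorter than L has no window, so it can never be close.
        min((hamming(s, t) for t in windows(b, L)), default=L + 1) <= close_limit
        for b in bad
    )
    far_from_good = all(
        hamming(s, u) >= far_limit for g in good for u in windows(g, L)
    )
    return close_to_bad and far_from_good
```

For instance, with bad strings `["AACGT", "TACGA"]` and good strings `["TTTTT"]`, the candidate `"ACG"` passes for d_b = 0 and d_g = 2, because it occurs exactly in both bad strings and differs from `TTT` in all three positions.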
On the parameterized intractability of motif search problems
Combinatorica, 2006

Cited by 11 (4 self)
We show that Closest Substring, one of the most important problems in the field of biological sequence analysis, is W[1]-hard when parameterized by the number k of input strings (and remains so, even over a binary alphabet). This problem is therefore unlikely to be solvable in time O(f(k) · n^c) for any function f of k and constant c independent of k. The problem can therefore be expected to be intractable, in any practical sense, for k ≥ 3. Our result supports the intuition that Closest Substring is computationally much harder than the special case of Closest String, although both problems are NP-complete. We also prove W[1]-hardness for other parameterizations in the case of unbounded alphabet size. Our W[1]-hardness result for Closest Substring generalizes to Consensus Patterns, a problem of similar significance in computational biology.