## Algorithmic complexity of protein identification: Combinatorics of weighted strings (2004)

Venue: | DISCRETE APPLIED MATHEMATICS, SPECIAL ISSUE ON COMBINATORICS OF SEARCHING, SORTING, AND CODING. (2002) |

Citations: | 4 - 1 self |

### BibTeX

@INPROCEEDINGS{Cieliebak04algorithmiccomplexity,

author = {Mark Cieliebak and Thomas Erlebach and Zsuzsanna Liptak and Jens Stoye and Emo Welzl},

title = {Algorithmic complexity of protein identification: Combinatorics of weighted strings},

booktitle = {DISCRETE APPLIED MATHEMATICS, SPECIAL ISSUE ON COMBINATORICS OF SEARCHING, SORTING, AND CODING. (2002)},

year = {2004},

pages = {27--46},

publisher = {}

}

### Abstract

We investigate a problem from computational biology: Given a constant-size alphabet A with a weight function µ that assigns a positive weight to each letter of A, find an efficient data structure and query algorithm solving the following problem: For a positive weight M and a string σ over A, decide whether σ contains a substring with weight M (One–String Mass Finding Problem). If the answer is yes, then we may in addition require a witness, i.e. indices i ≤ j such that the substring beginning at position i and ending at position j has weight M. We allow preprocessing of the string, and measure efficiency in two parameters: storage space required for the preprocessed data, and running time of the query algorithm for given M. We are interested in data structures and algorithms requiring subquadratic storage space and sublinear query time, where we measure the input size as the length n of the input string. We present two efficient algorithms: LOOKUP solves the problem with O(n) space and O((n / log n) · log log n) time; INTERVAL solves the problem for binary alphabets with O(n) space in O(log n) time. We sketch a third algorithm, CLUSTER, which can be adjusted for a space-time tradeoff but for which we do not yet have a resource analysis. We introduce a function on weighted strings which is closely related to the analysis of algorithms for the One–String Mass Finding Problem: the number of different submasses of a weighted string. We present several properties of this function, including upper and lower bounds. Finally, we introduce two more general variants of the problem and sketch how algorithms may be extended for these variants.
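
To make the problem statement concrete, the following is a minimal sketch of the simplest baseline: a linear scan with no preprocessing that returns a witness. It is not LOOKUP, INTERVAL, or CLUSTER, whose details are not given in the abstract. The function name `find_submass`, the dictionary `mu`, and the use of positive integer weights are assumptions made for illustration; the scan relies only on all weights being positive and on the weight of a substring being the sum of its letters' weights.

```python
from typing import Dict, Optional, Tuple


def find_submass(sigma: str, mu: Dict[str, int], target: int) -> Optional[Tuple[int, int]]:
    """Return 1-based indices (i, j) such that sigma[i..j] has weight `target`,
    or None if no substring has that weight.

    Assumes every weight in `mu` is positive, so growing the window can only
    increase its weight and shrinking it can only decrease it.
    """
    if target <= 0:
        return None
    left = 0      # 0-based start of the current window
    window = 0    # weight of sigma[left..right]
    for right, ch in enumerate(sigma):
        window += mu[ch]
        # Drop letters from the left while the window is too heavy.
        while window > target:
            window -= mu[sigma[left]]
            left += 1
        if window == target:
            return (left + 1, right + 1)
    return None


# Hypothetical two-letter alphabet with weights 1 and 3.
weights = {"a": 1, "b": 3}
print(find_submass("abba", weights, 6))  # witness for the substring "bb"
print(find_submass("abba", weights, 5))  # no substring has weight 5
```

Each query takes time linear in the length of the string and uses no extra storage, which is the trivial end of the space-time tradeoff that the abstract aims to improve on with sublinear query time.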

### Citations

11205 | Computers and Intractability: a Guide to the Theory of NP-completeness - Garey, Johnson - 1979 |
Citation Context ...aturally arises whether a given mass M can be the weight of a string. If the size of the alphabet is variable, then this question is a variant of the Integer Knapsack Problem, and is NP–complete (cf. [GJ79]). If the alphabet size is constant, the question can be solved with a simple Integer Linear Program. With the UDP, we have M(σ) = P(σ) for all σ. Note that this condition never holds if the masses ar...

916 | Algorithms on Strings, Trees, and Sequences - Gusfield - 1997 |
Citation Context ...Likewise, using suffix trees, which can be applied to efficiently solve a large number of complex string problems, does not seem to help. Note, for instance, that the longest common substring problem [Gus97], although at first sight related, has very different characteristics. A problem that may also appear to be close to the present one is maximum segment sum [Ben86]; however, it appears that it does no...

627 | Applied Combinatorics on Words - Lothaire - 2005 |

391 | An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database - Eng, McCormack, et al. - 1994 |

335 | Text Algorithms - Crochemore, Rytter - 1994 |

166 | Computational Molecular Biology: An algorithmic approach - Pevzner - 2000 |

164 | Programming Pearls - Bentley - 2000 |
Citation Context ...the longest common substring problem [Gus97], although at first sight related, has very different characteristics. A problem that may also appear to be close to the present one is maximum segment sum [Ben86]; however, it appears that it does not lead to good solutions, either. Encoding amino acids on a binary alphabet is not feasible here, because that would only allow very restricted mass functions. The...

103 | Error-tolerant identification of peptides in sequence databases by peptide sequence tags - Mann, Wilm - 1994 |

65 | Context-free languages and pushdown automata - Autebert, Boasson - 1997 |
Citation Context ...et A = {a1,... ,as}. Given a string σ, let us define three combinatorial functions on σ. Recall that we denote the multiplicity vector of a string σ by mult(σ) (also referred to as Parikh–vector, see [ABB97]). 1. S(σ) := |{τ | τ ⊑ σ}|, the number of different substrings of σ, 2. P(σ) := |{mult(τ) | τ ⊑ σ}|, the number of different multiplicity vectors of substrings of σ, and 3. M(σ) := |{µ(τ) | τ ⊑ σ}|, ...

56 | Rapid identification of proteins by peptide-mass fingerprinting - Pappin, Hojrup, et al. - 1993 |

55 | SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database - Bafna, Edwards |

40 | Storing a sparse table with O(1) worst case access time - Fredman, Komlós, Szemerédi - 1984 |
Citation Context ...bmasses in σ, which is bounded by O(n²). The time for answering a query is thus O(log n). Since submasses are integers, we can use a hash table instead of a sorted array to store all submasses of σ. [FKS84] present hashing schemes which require storage space linear in the number of elements to be stored, and which allow membership queries in constant time. For the One–String Mass Finding Problem, this y...

39 | Mutation-tolerant protein identification by mass spectrometry - Pevzner, Dancík, et al. - 2000 |
Citation Context ...+ 93, JQCG93, PHB93, MHR93, EMYI94]; some papers dealing with different aspects and modifications of the problem, e.g. the minimum number of masses needed to identify a protein [PHB93], combinatorial [PDT00] or probabilistic [BE01] models for scoring the difference of two mass spectra, or approaches for a correct identification even in the presence of post–translational modifications of the protein [MW94...

30 | Identifying proteins from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases - Henzel, Billeci, et al. - 1993 |

23 | Use of mass spectrometric molecular weight information to identify proteins in sequence databases - Mann, Hojrup, et al. - 1993 |

20 | Efficiency of database search for identification of mutated and modified proteins via mass spectrometry. Genome Res 11(2) - Pevzner, Mulyukov, et al. - 2001 |

20 | Mining genomes: correlating tandem mass spectra of modified and unmodified peptides to sequences in nucleotide databases - Yates, Eng, et al. - 1995 |

16 | Peptide mass maps: A highly informative approach to protein identification - Yates, Speicher, et al. - 1993 |

14 | Protein identification by mass profile fingerprinting - James, Quadroni, et al. - 1993 |

13 | An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database - Eng, McCormack, et al. - 1994 |

13 | Two applications of a probabilistic search technique: Sorting X + Y and building balanced search trees - Fredman - 1975 |
Citation Context ...l structure of the weight function, such as its additivity. For instance, the problem of searching in X + Y, where X and Y are two sets of numbers, turns out to be closely related to our problem (see [Fre75] and [HPSS75]). However, we have been able to extend negative results which have been reached for that problem [CDF90]: We can show that this approach (using the naïve solution without preprocessing) ...

5 | The complexity of searching - Cosnard, Duprat, et al. - 1990 |
Citation Context ...nd Y are two sets of numbers, turns out to be closely related to our problem (see [Fre75] and [HPSS75]). However, we have been able to extend negative results which have been reached for that problem [CDF90]: We can show that this approach (using the naïve solution without preprocessing) cannot lead to an efficient algorithm for our problem. Likewise, using suffix trees, which can be applied to efficient...

1 | The complexity of searching in X + Y and other multisets - Cosnard, Duprat, et al. - 1990 |

1 | Combinatorics on Words - Lothaire - 1997 |

1 | Mining genomes: Correlating tandem mass-spectra of modified and unmodified peptides to sequences in nucleotide databases - Yates, Eng, et al. - 1995 |

1 | Database searching using mass spectrometry data - Yates - 1998 |

1 | The Cartoon Guide to Genetics. HarperPerennial, updated edition - Gonick, Wheelis - 1991 |
Citation Context ...as chapter 11 of the book [Pev00] contain more detailed introductions to this topic. For an introduction to computational biology in general, see [SM97]; for more on molecular biology, [Str88]; while [GW91] is an easy–going introduction to genetics for non–biologists. In this paper, we deal with algorithmic questions that arise if nothing is known about the breaking points, i.e., we assume random fragme...
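
The citation contexts above for [ABB97] and [FKS84] define the functions S(σ), P(σ), and M(σ) and describe storing all submasses of σ for fast membership queries. A brute-force sketch of both ideas might look as follows. The function name `substring_statistics`, the restriction to non-empty substrings, and the integer weights are assumptions for illustration. Python's built-in `set` is used as a stand-in for the hashing scheme of [FKS84]: it gives expected, not worst-case, constant-time membership.

```python
from typing import Dict, Set, Tuple


def substring_statistics(sigma: str, mu: Dict[str, int]) -> Tuple[int, int, Set[int]]:
    """Enumerate all non-empty substrings of sigma and return
    (S, P, masses): the number of distinct substrings, the number of
    distinct multiplicity (Parikh) vectors, and the set of distinct
    submasses, so that M(sigma) == len(masses).
    """
    letters = sorted(mu)              # fixed letter order for the vectors
    substrings: Set[str] = set()
    vectors: Set[Tuple[int, ...]] = set()
    masses: Set[int] = set()
    n = len(sigma)
    for i in range(n):
        counts = dict.fromkeys(letters, 0)
        mass = 0
        for j in range(i, n):
            # Extend the substring sigma[i..j] by one letter.
            ch = sigma[j]
            counts[ch] += 1
            mass += mu[ch]
            substrings.add(sigma[i:j + 1])
            vectors.add(tuple(counts[a] for a in letters))
            masses.add(mass)
    return len(substrings), len(vectors), masses


# Hypothetical two-letter alphabet with weights 1 and 3.
s, p, masses = substring_statistics("abba", {"a": 1, "b": 3})
print(s, p, len(masses))   # S, P, and M for "abba"
print(6 in masses)         # membership query against the stored submasses
```

The stored set can hold up to a quadratic number of submasses, matching the O(n²) bound quoted in the [FKS84] context. Since equal substrings have equal multiplicity vectors, and equal multiplicity vectors give equal masses, the three counts always satisfy M(σ) ≤ P(σ) ≤ S(σ).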