## Finding motifs using random projections (2001)

Citations: 210 (5 self)

### BibTeX

@INPROCEEDINGS{Buhler01findingmotifs,
  author = {Jeremy Buhler and Martin Tompa},
  title = {Finding motifs using random projections},
  booktitle = {},
  year = {2001},
  pages = {69--76}
}

### Abstract

Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif M of length 15, where each planted instance differs from M in 4 positions. Whereas previous algorithms all failed to solve this (15,4)-motif problem, Pevzner and Sze introduced algorithms that succeeded. However, their algorithms failed to solve the considerably more difficult (14,4)-, (16,5)-, and (18,6)-motif problems. We introduce a novel motif discovery algorithm based on the use of random projections of the input’s substrings. Experiments on simulated data demonstrate that this algorithm performs better than existing algorithms and, in particular, typically solves the difficult (14,4)-, (16,5)-, and (18,6)-motif problems quite efficiently. A probabilistic estimate shows that the small values of d for which the algorithm fails to recover the planted (l,d)-motif are in all likelihood inherently impossible to solve. We also present experimental results on realistic biological data by identifying ribosome binding sites in prokaryotes as well as a number of known transcriptional regulatory motifs in eukaryotes.

1. CHALLENGING MOTIF PROBLEMS

Pevzner and Sze [23] considered a very precise version of the motif discovery problem of computational biology, which had also been considered by Sagot [26]. Based on this formulation, they issued an algorithmic challenge:

Planted (l,d)-Motif Problem: Suppose there is a fixed but unknown nucleotide sequence M (the motif) of length l. The problem is to determine M, given t nucleotide sequences each of length n, and each containing a planted variant of M. More precisely, each such planted variant is a substring that is M with exactly d point substitutions.

One instantiation that they labeled “The Challenge Problem” was parameterized as finding a planted (15,4)-motif in t = 20 sequences each of length n = 600. These values of n, t, and l are