Results 1 - 10 of 28
Finding motifs using random projections
, 2001
Cited by 211 (5 self)
Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif M of length 15, where each planted instance differs from M in 4 positions. Whereas previous algorithms all failed to solve this (15,4)-motif problem, Pevzner and Sze introduced algorithms that succeeded. However, their algorithms failed to solve the considerably more difficult (14,4)-, (16,5)-, and (18,6)-motif problems. We introduce a novel motif discovery algorithm based on the use of random projections of the input’s substrings. Experiments on simulated data demonstrate that this algorithm performs better than existing algorithms and, in particular, typically solves the difficult (14,4)-, (16,5)-, and (18,6)-motif problems quite efficiently. A probabilistic estimate shows that the small values of d for which the algorithm fails to recover the planted (l,d)-motif are in all likelihood inherently impossible to solve. We also present experimental results on realistic biological data by identifying ribosome binding sites in prokaryotes as well as a number of known transcriptional regulatory motifs in eukaryotes. 1. CHALLENGING MOTIF PROBLEMS Pevzner and Sze [23] considered a very precise version of the motif discovery problem of computational biology, which had also been considered by Sagot [26]. Based on this formulation, they issued an algorithmic challenge: Planted (l,d)-Motif Problem: Suppose there is a fixed but unknown nucleotide sequence M (the motif) of length l. The problem is to determine M, given t nucleotide sequences each of length n, and each containing a planted variant of M. More precisely, each such planted variant is a substring that is M with exactly d point substitutions. One instantiation that they labeled “The Challenge Problem” was parameterized as finding a planted (15,4)-motif in t = 20 sequences each of length n = 600. These values of n, t, and l are
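The planted (l,d)-motif problem stated in the abstract above can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the authors' implementation: `plant_instance` builds a synthetic instance that follows the problem statement, and `project_into_buckets` shows one plausible reading of "random projections of the input's substrings", namely grouping every length-l substring by the letters found at k randomly chosen positions. The function names, the parameter `k`, and the bucket structure are all hypothetical.

```python
import random
from collections import defaultdict

ALPHABET = "ACGT"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def plant_instance(l, d, t, n, rng):
    """Return (motif, sequences): t random sequences of length n, each
    containing one copy of the motif with exactly d point substitutions."""
    motif = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences = []
    for _ in range(t):
        variant = list(motif)
        # Substitute at d distinct positions, always with a different letter.
        for pos in rng.sample(range(l), d):
            variant[pos] = rng.choice([c for c in ALPHABET if c != motif[pos]])
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        background[start:start + l] = variant
        sequences.append("".join(background))
    return motif, sequences


def project_into_buckets(sequences, l, k, rng):
    """Assumed projection step: pick k of the l positions at random and
    group every length-l substring by its letters at those positions."""
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for i, seq in enumerate(sequences):
        for j in range(len(seq) - l + 1):
            key = "".join(seq[j + p] for p in positions)
            buckets[key].append((i, j))  # (sequence index, start offset)
    return positions, buckets
```

A call such as `plant_instance(15, 4, t, n, random.Random(0))` produces a (15,4) instance for whatever `t` and `n` are chosen. Buckets that collect substrings from many different sequences are natural candidates for closer inspection, because planted variants collide whenever none of their substituted positions falls among the k sampled ones.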
Approaches to the Automatic Discovery of Patterns in Biosequences
, 1995
Cited by 138 (21 self)
This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering those patterns which are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis presented of the ways in which an assessment can be made of the significance and usefulness of the discovered patterns. It is shown that this problem is related to problems studied in the field of machine learning. The largest part of this paper comprises a review of a number of existing methods developed to solve this problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examples are given of patterns that have been discovered...
Probabilistic discovery of time series motifs
, 2003
Cited by 119 (21 self)
Several important time series data mining problems reduce to the core task of finding approximately repeated subsequences in a longer time series. In an earlier work, we formalized the idea of approximately repeated subsequences by introducing the notion of time series motifs. Two limitations of this work were the poor scalability of the motif discovery algorithm and the inability to discover motifs in the presence of noise. Here we address these limitations by introducing a novel algorithm inspired by recent advances in the problem of pattern discovery in biosequences. Our algorithm is probabilistic in nature, but as we show empirically and theoretically, it can find time series motifs with very high probability even in the presence of noise or “don’t care” symbols. Not only is the algorithm fast, but it is an anytime algorithm, producing likely candidate motifs almost immediately and gradually improving the quality of results over time.
Finding Motifs in Time Series
, 2002
Cited by 72 (15 self)
The problem of efficiently locating previously known patterns in a time series database (i.e., query by content) has received much attention and may now largely be regarded as a solved problem. However, from a knowledge discovery viewpoint, a more interesting problem is the enumeration of previously unknown, frequently occurring patterns. We call such patterns "motifs," because of their close analogy to their discrete counterparts in computational biology. An efficient motif discovery algorithm for time series would be useful as a tool for summarizing and visualizing massive time series databases. In addition, it could be used as a subroutine in various other data mining tasks, including the discovery of association rules, clustering, and classification. In this work we carefully motivate, then introduce, a nontrivial definition of time series motifs. We propose an efficient algorithm to discover them, and we demonstrate the utility and efficiency of our approach on several real-world datasets.
Efficient Discovery of Conserved Patterns Using a Pattern Graph
 Comput. Appl. Biosci
, 1997
Cited by 66 (8 self)
Motivation: We have previously reported an algorithm for discovering patterns conserved in sets of related unaligned protein sequences. The algorithm was implemented in a program called Pratt. Pratt allows the user to define a class of patterns (e.g. the degree of ambiguity allowed and the length and number of gaps), and is then guaranteed to find the conserved patterns in this class scoring highest according to a defined fitness measure. In many cases, this version of Pratt was very efficient, but in other cases it was too time consuming to be applied. Hence, a more efficient algorithm was needed. Results: In this paper, we describe a new and improved searching strategy that has two main advantages over the old strategy. First, it allows for easier integration with programs for multiple sequence alignment and database search. Secondly, it makes it possible to use branch-and-bound search, and heuristics, to speed up the search. The new search strategy has been implemented in a new version of the Pratt program. Availability: The source code for the Pratt programs can be
An Exact Method for Finding Short Motifs in Sequences, with Application to the Ribosome Binding Site Problem
 Proc. Seventh Int’l Conf. Intelligent Systems for Molecular Biology
, 1999
Cited by 62 (3 self)
This is an investigation of methods for finding short motifs that only occur in a fraction of the input sequences. Unlike local search techniques that may not reach a global optimum, the method proposed here is guaranteed to produce the motifs with greatest z-scores. This method is illustrated for the Ribosome Binding Site Problem, which is to identify the short mRNA 5′ untranslated sequence that is recognized by the ribosome during initiation of protein synthesis. Experiments were performed to solve this problem for each of fourteen sequenced prokaryotes, by applying the method to the full complement of genes from each. One of the interesting results of this experimentation is evidence that the recognized sequence of the thermophilic archaea A. fulgidus, M. jannaschii, M. thermoautotrophicum, and P. horikoshii may be somewhat different than the well-known Shine-Dalgarno sequence.
Experiencing SAX: A Novel Symbolic Representation of Time Series
 Data Mining and Knowledge Discovery Journal
, 2007
Cited by 51 (13 self)
Many high-level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. First, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Second, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction,
Mining Motifs in Massive Time Series Databases
 In Proceedings of IEEE International Conference on Data Mining (ICDM’02)
, 2002
Cited by 30 (0 self)
The problem of efficiently locating previously known patterns in a time series database (i.e., query by content) has received much attention and may now largely be regarded as a solved problem. However, from a knowledge discovery viewpoint, a more interesting problem is the enumeration of previously unknown, frequently occurring patterns. We call such patterns "motifs", because of their close analogy to their discrete counterparts in computational biology. An efficient motif discovery algorithm for time series would be useful as a tool for summarizing and visualizing massive time series databases. In addition, it could be used as a subroutine in various other data mining tasks, including the discovery of association rules, clustering, and classification.
VOTING ALGORITHMS FOR DISCOVERING LONG MOTIFS
Cited by 28 (6 self)
Pevzner and Sze [14] have introduced the Planted (l,d)-Motif Problem to find similar patterns (motifs) in sequences which represent the promoter region of co-regulated genes. l is the length of the motif and d is the maximum Hamming distance around the similar patterns. Many algorithms have been developed to solve this motif problem. However, these algorithms either have long running times or do not guarantee the motif can be found. In this paper, we introduce new algorithms to solve the motif problem. Our algorithms can find motifs in reasonable time for not only the challenging (9,2)-, (11,3)-, and (15,5)-motif problems but for even longer motifs, say (20,7), (30,11) and (40,15), which have never been seriously attempted by other researchers because of heavy time and space requirements.
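The abstract above does not spell out how the voting works, so the following Python is only a sketch of one common reading, offered as an assumption rather than the authors' algorithm: every length-l substring "votes" for all strings within Hamming distance d of it, each sequence contributes at most one vote per candidate, and a candidate that collects a vote from every sequence is reported. The names `neighbors` and `vote_for_motifs` are hypothetical, and this brute-force form is practical only for small l and d.

```python
from collections import Counter
from itertools import combinations, product


def neighbors(word, d, alphabet="ACGT"):
    """All strings within Hamming distance d of word (including word itself)."""
    result = {word}
    for r in range(1, d + 1):
        for positions in combinations(range(len(word)), r):
            # At each chosen position, try every letter except the original.
            choices = [[c for c in alphabet if c != word[p]] for p in positions]
            for replacement in product(*choices):
                chars = list(word)
                for p, c in zip(positions, replacement):
                    chars[p] = c
                result.add("".join(chars))
    return result


def vote_for_motifs(sequences, l, d):
    """Return every length-l string that lies within Hamming distance d
    of at least one substring in each input sequence."""
    votes = Counter()
    for seq in sequences:
        voted = set()
        for j in range(len(seq) - l + 1):
            voted |= neighbors(seq[j:j + l], d)
        votes.update(voted)  # at most one vote per candidate per sequence
    return {cand for cand, count in votes.items() if count == len(sequences)}
```

For example, with the sequences `"TTACGTACTT"`, `"GGACCTACGG"`, and `"CCACGTTCAA"`, calling `vote_for_motifs(sequences, 6, 1)` returns a set that includes `"ACGTAC"`, since each sequence holds a substring at most one substitution away from it.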
An Exact Algorithm to Identify Motifs in Orthologous Sequences from Multiple Species
 In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology
, 2000
Cited by 24 (3 self)
The identification of sequence motifs is a fundamental method for suggesting good candidates for biologically functional regions such as genes, promoters, splice sites, binding sites, etc. We investigate the following approach to identifying motifs: given a collection of orthologous sequences from multiple species related by a known phylogenetic tree, search for motifs that are well conserved (according to a parsimony measure) in most or all of the species. We present an exact algorithm for solving this problem. We then discuss experimental results on finding promoters of the rbcS gene for a family of 10 plants, on finding promoters of the adh gene for 12 Drosophila species, and on finding promoters of several chloroplast encoded genes.