## Finding Motifs in Time Series (2002)

Citations: 73 (15 self)

### BibTeX

@INPROCEEDINGS{Lin02findingmotifs,
    author = {Jessica Lin and Eamonn Keogh and Stefano Lonardi and Pranav Patel},
    title = {Finding Motifs in Time Series},
    booktitle = {},
    year = {2002},
    pages = {53--68}
}

### Abstract

The problem of efficiently locating previously known patterns in a time series database (i.e., query by content) has received much attention and may now largely be regarded as a solved problem. However, from a knowledge discovery viewpoint, a more interesting problem is the enumeration of previously unknown, frequently occurring patterns. We call such patterns "motifs," because of their close analogy to their discrete counterparts in computational biology. An efficient motif discovery algorithm for time series would be useful as a tool for summarizing and visualizing massive time series databases. In addition, it could be used as a subroutine in various other data mining tasks, including the discovery of association rules, clustering and classification. In this work we carefully motivate, then introduce, a non-trivial definition of time series motifs. We propose an efficient algorithm to discover them, and we demonstrate the utility and efficiency of our approach on several real-world datasets.

### Citations

828 | Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
- Durbin, Eddy, et al.
- 1998
Citation Context ...sting problem is the detection of previously unknown, frequently occurring patterns. We call such patterns motifs, because of their close analogy to their discrete counterparts in computational biology [11, 16, 30, 34, 36]. Figure 1 illustrates an example of a motif discovered in an astronomical database. An efficient motif discovery algorithm for time series would be useful as a tool for summarizing and visualizing ma...

507 | Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment
- Lawrence
- 1993
Citation Context ...eria, algorithms and software have been developed in correspondence. We mention a few representatives of this large family of methods, without claiming to be exhaustive: CONSENSUS [16], GIBBS SAMPLER [26], WINNOWER [30], PROJECTION [36], VERBUMCULUS [4, 28]. These methods have been studied from a rigorous statistical viewpoint (see, e.g., [31] for a review) and also employed successfully in practice (s...

416 | Efficient similarity search in sequence databases
- Agrawal, Faloutsos, et al.
- 1993
Citation Context ... The problem of efficiently locating previously defined patterns in a time series database (i.e., query by content) has received much attention and may now be essentially regarded as a solved problem [1, 8, 13, 21, 22, 23, 35, 40]. However, from a knowledge discovery viewpoint, a more interesting problem is the detection of previously unknown, frequently occurring patterns. We call such patterns motifs, because of their close ...

276 | Identifying DNA and protein patterns with statistically significant alignments of multiple sequences
- Hertz, Stormo
- 1999
Citation Context ...sting problem is the detection of previously unknown, frequently occurring patterns. We call such patterns motifs, because of their close analogy to their discrete counterparts in computational biology [11, 16, 30, 34, 36]. Figure 1 illustrates an example of a motif discovered in an astronomical database. An efficient motif discovery algorithm for time series would be useful as a tool for summarizing and visualizing ma...

246 | Scaling clustering algorithms to large databases
- Bradley, Fayyad, et al.
- 1998
Citation Context ...hm as an online algorithm, then at this point we can report the current motif as a tentative answer, before continuing the search. Such “anytime” behavior is very desirable in a data-mining algorithm [7]. Next, a simple test is performed. If the number of matches to the current best-so-far motif is greater than the largest unexplored neighborhood (line 18), we are done. We can record the best so far ...

231 | Locally adaptive dimensionality reduction for indexing large time series databases
- Keogh, Chakrabarti, et al.
Citation Context ... Measures Having considered various representations of time series data, we can now define distance measures on them. By far the most common distance measure for time series is the Euclidean distance [8, 22, 23, 32, 40]. Given two time series Q and C of the same length n, Eq. 3 defines their Euclidean distance, and Figure 8.A illustrates a visual intuition of the measure. $D(Q, C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$ (3) ...
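
The Euclidean distance described in this context can be illustrated with a minimal Python sketch. The function name `euclidean_distance` and the sample values are assumptions made for illustration; they do not come from the paper.

```python
import math


def euclidean_distance(q, c):
    """Euclidean distance between two equal-length sequences Q and C."""
    if len(q) != len(c):
        # The definition only covers two series of the same length n.
        raise ValueError("Q and C must have the same length n")
    # Square root of the sum of squared pointwise differences.
    return math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))


# Hypothetical sample data, chosen so the result is easy to check by hand:
# sqrt(4^2 + 3^2 + 0^2) = 5
Q = [0.0, 3.0, 1.0]
C = [4.0, 0.0, 1.0]
print(euclidean_distance(Q, C))  # 5.0
```

The length check mirrors the requirement that Q and C have the same length n; comparing series of different lengths would need a different measure.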

210 | Finding motifs using random projections
- Buhler, Tompa
- 2002
Citation Context ...sting problem is the detection of previously unknown, frequently occurring patterns. We call such patterns motifs, because of their close analogy to their discrete counterparts in computational biology [11, 16, 30, 34, 36]. Figure 1 illustrates an example of a motif discovered in an astronomical database. An efficient motif discovery algorithm for time series would be useful as a tool for summarizing and visualizing ma...

201 | Efficient time series matching by wavelets
- Chan, Fu
- 1999
Citation Context ... The problem of efficiently locating previously defined patterns in a time series database (i.e., query by content) has received much attention and may now be essentially regarded as a solved problem [1, 8, 13, 21, 22, 23, 35, 40]. However, from a knowledge discovery viewpoint, a more interesting problem is the detection of previously unknown, frequently occurring patterns. We call such patterns motifs, because of their close ...

186 | Combinatorial approaches to finding subtle signals - Pevzner, Sze - 2000 |

185 | Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies
- Helden, André, et al.
- 1998
Citation Context ... [30], PROJECTION [36], VERBUMCULUS [4, 28]. These methods have been studied from a rigorous statistical viewpoint (see, e.g., [31] for a review) and also employed successfully in practice (see, e.g., [17] and references therein). While there are literally hundreds of papers on discretizing (symbolizing, tokenizing) time series [2, 3, 9, 13, 19, 25, 27] (see [10] for an extensive survey), and dozens of...

173 | Discovering similar multidimensional trajectories - Vlachos, Kollios, et al. - 2002 |

155 | Dimensionality Reduction for fast similarity search in large time series databases
- Keogh, Chakrabarti, et al.
- 2000
Citation Context ... The problem of efficiently locating previously defined patterns in a time series database (i.e., query by content) has received much attention and may now be essentially regarded as a solved problem [1, 8, 13, 21, 22, 23, 35, 40]. However, from a knowledge discovery viewpoint, a more interesting problem is the detection of previously unknown, frequently occurring patterns. We call such patterns motifs, because of their close ...

147 | Fast time sequence indexing for arbitrary Lp forms
- Yi, Faloutsos
- 2000

143 | Rule Discovery from Time Series
- Lin, Mannila, et al.
- 1998
Citation Context ...] for a review) and also employed successfully in practice (see, e.g., [17] and references therein). While there are literally hundreds of papers on discretizing (symbolizing, tokenizing) time series [2, 3, 9, 13, 19, 25, 27] (see [10] for an extensive survey), and dozens of distance measures defined on these representations, none of the techniques allows a distance measure which lower bounds a distance measure defined on...

129 | Some approaches to best-match file searching
- Burkhard, Keller
- 1973
Citation Context ... value of $D(Q, C_b)$, which we now know to be at least 5 units away. The first formalization of this idea for fast searching of nearest neighbors in matrices is generally credited to Burkhard and Keller [5]. More efficient implementations are possible; for example, Shasha and Wang [33] introduced the Approximation Distance Map (ADM) algorithm that takes advantage of an arbitrary set of pre-computed dist...

129 | An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback
- Keogh, Pazzani
- 1998
Citation Context ...ion, seeding the algorithm with motifs rather than random points could speed up convergence [12]. • Several time series classification algorithms work by constructing typical prototypes of each class [24]. While this approach works for small datasets, the construction of the prototypes (which we see as motifs) requires quadratic time, and is thus unable to scale to massive datasets. In this work we car...

107 | Querying shapes of histories
- Agrawal
- 1995
Citation Context ...length n < m of contiguous positions from T, that is, $C = t_p, \ldots, t_{p+n-1}$ for $1 \le p \le m - n + 1$. A task associated with subsequences is to determine if a given subsequence is similar to other subsequences [1, 2, 3, 8, 13, 19, 21, 22, 23, 24, 25, 27, 29, 35, 40]. This idea is formalized in the definition of a match. Definition 3. Match: Given a positive real number R (called range) and a time series T containing a subsequence C beginning at position p and a ...
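
The subsequence definition quoted in this context can be sketched in Python as a sliding window over the series. The function name `subsequences` and the sample series are hypothetical; only the index range 1 ≤ p ≤ m − n + 1 is taken from the quoted text.

```python
def subsequences(T, n):
    """Yield (p, C) for every length-n subsequence C of T.

    p is 1-based, as in the quoted definition: C = t_p, ..., t_{p+n-1}.
    """
    m = len(T)
    if not 0 < n < m:
        raise ValueError("need 0 < n < m")
    for p in range(1, m - n + 2):  # p = 1 .. m - n + 1
        # Python lists are 0-based, so t_p is T[p - 1].
        yield p, T[p - 1 : p - 1 + n]


# Hypothetical series of length m = 5, window length n = 3.
T = [2.0, 4.0, 6.0, 8.0, 10.0]
for p, C in subsequences(T, 3):
    print(p, C)
```

A series of length m yields m − n + 1 subsequences, which is why the number of candidates in a motif search grows with the length of the series.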

83 | Probabilistic and Statistical Properties of Words: An Overview
- Reinert, Schbath, et al.
Citation Context ...ut claiming to be exhaustive: CONSENSUS [16], GIBBS SAMPLER [26], WINNOWER [30], PROJECTION [36], VERBUMCULUS [4, 28]. These methods have been studied from a rigorous statistical viewpoint (see, e.g., [31] for a review) and also employed successfully in practice (see, e.g., [17] and references therein). While there are literally hundreds of papers on discretizing (symbolizing, tokenizing) time series [...

73 | Landmarks: A New Model for Similarity-based Pattern Querying
- Perng, Wang, et al.
- 2000
Citation Context ...length n < m of contiguous positions from T, that is, $C = t_p, \ldots, t_{p+n-1}$ for $1 \le p \le m - n + 1$. A task associated with subsequences is to determine if a given subsequence is similar to other subsequences [1, 2, 3, 8, 13, 19, 21, 22, 23, 24, 25, 27, 29, 35, 40]. This idea is formalized in the definition of a match. Definition 3. Match: Given a positive real number R (called range) and a time series T containing a subsequence C beginning at position p and a ...

65 | A bibliography of temporal, spatial and spatio-temporal data mining research
- Roddick, Spiliopoulou
- 1999
Citation Context ... Measures Having considered various representations of time series data, we can now define distance measures on them. By far the most common distance measure for time series is the Euclidean distance [8, 22, 23, 32, 40]. Given two time series Q and C of the same length n, Eq. 3 defines their Euclidean distance, and Figure 8.A illustrates a visual intuition of the measure. $D(Q, C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$ (3) ...

63 | Deformable markov model templates for time-series pattern matching
- Ge, Smyth
- 2000

63 | Identifying representative trends in massive time series data sets using sketches
- Indyk, Koudas, et al.
- 2000
Citation Context ...ontaining billions of observations [15]. We are typically not interested in any of the global properties of a time series; rather, data miners confine their interest to subsections of the time series [1, 20, 23], which are called subsequences. Definition 2. Subsequence: Given a time series T of length m, a subsequence C of T is a sampling of length n < m of contiguous positions from T, that is, C = tp,…,t p+n...

58 | Distance Measures for Effective Clustering of ARIMA Time-Series
- Kalpakis, Gada, et al.

53 | New Techniques for Best-Match Retrieval
- Shasha, Wang
- 1990
Citation Context ... to measure $D(Q, C_b)$, but in fact we don’t have to do this calculation! We can use the triangular inequality to discover that $D(Q, C_b)$ could not be a match to Q. The triangular inequality requires that [2, 22, 33]: $D(Q, C_a) \le D(Q, C_b) + D(C_a, C_b)$ (8). Filling in the known values gives us $7 \le D(Q, C_b) + 2$ (9). Rearranging the terms gives us $5 \le D(Q, C_b)$ (10). But since we are on...
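
The pruning argument in this context (Eqs. 8 to 10) can be sketched in Python. The function names and the range value `R = 3` are hypothetical, since the quoted context is cut off before it states the range; only the distances 7 and 2 and the resulting bound of 5 appear in the text.

```python
def lower_bound(d_q_ca, d_ca_cb):
    """Lower bound on D(Q, C_b) implied by the triangle inequality.

    From D(Q, C_a) <= D(Q, C_b) + D(C_a, C_b) it follows that
    D(Q, C_b) >= D(Q, C_a) - D(C_a, C_b). The absolute value also covers
    the symmetric case and is likewise a valid bound for any metric.
    """
    return abs(d_q_ca - d_ca_cb)


def can_skip(d_q_ca, d_ca_cb, R):
    """True when C_b cannot lie within range R of Q, so the actual
    distance D(Q, C_b) never needs to be computed."""
    return lower_bound(d_q_ca, d_ca_cb) > R


# Values from the quoted context: D(Q, C_a) = 7 and D(C_a, C_b) = 2.
print(lower_bound(7, 2))    # 5
# R = 3 is an assumed range, used only to show the pruning decision.
print(can_skip(7, 2, R=3))  # True
```

The bound is obtained from two already known distances, so it costs a subtraction rather than a full distance computation over n points.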

49 | Discovery of Temporal Patterns – Learning Rules about the Qualitative Behaviour of Time Series
- Hoppner
Citation Context ...ica, eamonn, stelo, prpatel}@cs.ucr.edu • The discovery of association rules in time series first requires the discovery of motifs (referred to as “primitive shapes” in [9] and “frequent patterns” in [18]). However, the current solution to finding the motifs is either high quality and very expensive, or low quality but cheap [9]. • Several researchers have advocated K-means clustering of time series d...

45 | On clustering fMRI time series - Goutte, Toft, et al. - 1999 |

36 | Monotony of surprise and large-scale quest for unusual words
- Apostolico, Bock, et al.
- 2002
Citation Context ...in correspondence. We mention a few representatives of this large family of methods, without claiming to be exhaustive: CONSENSUS [16], GIBBS SAMPLER [26], WINNOWER [30], PROJECTION [36], VERBUMCULUS [4, 28]. These methods have been studied from a rigorous statistical viewpoint (see, e.g., [31] for a review) and also employed successfully in practice (see, e.g., [17] and references therein). While there a...

34 | Methods for discovering novel motifs in nucleic acid sequences
- Staden
- 1989
Citation Context ...nated against due to structural constraints of genomes or specific reservations for global transcription controls. Pattern discovery in computational biology originated with the work of Rodger Staden [34]. Along this research line, a multitude of patterns have been variously characterized, and criteria, algorithms and software have been developed in correspondence. We mention a few representatives of ...

29 | Adaptive Query Processing for Time-Series Data
- Huang, Yu
- 1999
Citation Context ...length n < m of contiguous positions from T, that is, $C = t_p, \ldots, t_{p+n-1}$ for $1 \le p \le m - n + 1$. A task associated with subsequences is to determine if a given subsequence is similar to other subsequences [1, 2, 3, 8, 13, 19, 21, 22, 23, 24, 25, 27, 29, 35, 40]. This idea is formalized in the definition of a match. Definition 3. Match: Given a positive real number R (called range) and a time series T containing a subsequence C beginning at position p and a ...

26 | Epsilon grid order: An algorithm for the similarity join on massive high-dimensional data
- Böhm, Braunmüller, et al.
Citation Context ...which we intend to extend this work. • As previously noted, we only considered the problem of speeding up main memory search. Techniques for dealing with large disk resident data are highly desirable [6]. • On large datasets, the number of returned motifs may be intimidating; we plan to investigate tools for visualizing and navigating the results of a motif search. • Our motif search algorithm utiliz...

23 | MALM: A framework for mining sequence database at multiple abstraction levels
- Li, Yu, et al.
- 1998

13 | Syntactic recognition of ECG signals by attributed finite automata
- Koski, Juhola, et al.
- 1995

11 | Symbolic analysis of experimental data, Review of Scientific Instruments
- Daw, Finney, et al.
- 2001
Citation Context ...ed successfully in practice (see, e.g., [17] and references therein). While there are literally hundreds of papers on discretizing (symbolizing, tokenizing) time series [2, 3, 9, 13, 19, 25, 27] (see [10] for an extensive survey), and dozens of distance measures defined on these representations, none of the techniques allows a distance measure which lower bounds a distance measure defined on the origi...

10 | Measuring Time Series Similarity through Large Singular Features Revealed with Wavelet Transformation
- Struzik, Siebes
- 1999

10 | Meta-patterns: revealing hidden periodical patterns
- Yang, J, et al.
- 2001
Citation Context ...ated) in the literature. Several researchers in data mining have addressed the discovery of reoccurring patterns in event streams [39], although such data sources are often referred to as time series [38]. The critical difference is that event streams are sequentially ordered variables that are nominal (have no natural ordering) and thus these researchers are concerned with similar subsets, not simila...

8 | Global Detectors of Unusual Words: Design, Implementation, and Applications to Pattern Discovery in Biosequences
- Lonardi
- 2001
Citation Context ...in correspondence. We mention a few representatives of this large family of methods, without claiming to be exhaustive: CONSENSUS [16], GIBBS SAMPLER [26], WINNOWER [30], PROJECTION [36], VERBUMCULUS [4, 28]. These methods have been studied from a rigorous statistical viewpoint (see, e.g., [31] for a review) and also employed successfully in practice (see, e.g., [17] and references therein). While there a...

6 | Using signature files for querying time-series data
- unknown authors
- 1997
Citation Context ...] for a review) and also employed successfully in practice (see, e.g., [17] and references therein). While there are literally hundreds of papers on discretizing (symbolizing, tokenizing) time series [2, 3, 9, 13, 19, 25, 27] (see [10] for an extensive survey), and dozens of distance measures defined on these representations, none of the techniques allows a distance measure which lower bounds a distance measure defined on...

5 | Initialization of iterative refinement clustering algorithms
- unknown authors
- 1998
Citation Context ... the initial points, or how to choose K. Motifs could potentially be used to address both problems. In addition, seeding the algorithm with motifs rather than random points could speed up convergence [12]. • Several time series classification algorithms work by constructing typical prototypes of each class [24]. While this approach works for small datasets, the construction of the prototypes (which we...

3 | Mining the MACHO dataset
- Hegland, Clarke, et al.
- 2002
Citation Context ... interest, time series: Definition 1. Time Series: A time series $T = t_1, \ldots, t_m$ is an ordered set of m real-valued variables. Time series can be very long, sometimes containing billions of observations [15]. We are typically not interested in any of the global properties of a time series; rather, data miners confine their interest to subsections of the time series [1, 20, 23], which are called subsequen...

3 | Querying shapes of histories. In: Proceedings of the 21st International conference on very large databases
- Agrawal, Psaila, et al.
- 1995
Citation Context ...length n < m of contiguous positions from T, that is, $C = t_p, \ldots, t_{p+n-1}$ for $1 \le p \le m - n + 1$. A task associated with subsequences is to determine if a given subsequence is similar to other subsequences [1, 2, 3, 8, 13, 19, 21, 22, 23, 24, 25, 27, 29, 35, 40]. This idea is formalized in the definition of a match. Definition 3. Match: Given a positive real number R (called range) and a time series T containing a subsequence C beginning at position p and a ...

2 | Mining long sequential patterns in a noisy environment
- unknown authors
- 2002
Citation Context ...epeated patterns in time series has not been addressed (or even formulated) in the literature. Several researchers in data mining have addressed the discovery of reoccurring patterns in event streams [39], although such data sources are often referred to as time series [38]. The critical difference is that event streams are sequentially ordered variables that are nominal (have no natural ordering) and...