## On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration (2002)

Venue: | SIGKDD'02 |

Citations: | 237 - 51 self |

### BibTeX

@INPROCEEDINGS{Keogh02onthe,

author = {Eamonn Keogh and Shruti Kasetty},

title = {On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration},

booktitle = {SIGKDD'02},

year = {2002},

pages = {102--111},

publisher = {}

}

### Abstract

... mining time series data. Literally hundreds of papers have introduced new algorithms to index, classify, cluster and segment time series. In this work we make the following claim. Much of this work has very little utility because the contribution made (speed in the case of indexing, accuracy in the case of classification and clustering, model accuracy in the case of segmentation) offers an amount of "improvement" that would have been completely dwarfed by the variance that would have been observed by testing on many real world datasets, or the variance that would have been observed by changing minor (unstated) implementation details. To illustrate our point ...

### Citations

447 | Fast subsequence matching in time-series databases - Faloutsos, Ranganathan, et al. - 1994 |

443 | Efficient Similarity Search in Sequence Databases - Agrawal, Faloutsos, et al. - 1993 |

258 | Exact Indexing of Dynamic Time Warping - Ratanamahatana, Keogh - 2004 |

252 | Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases - Keogh, Chakrabarti, et al. |

Citation Context: ...joys a tenfold speed up when performed on disk because any indexing technique must perform costly random access, whereas sequential scan can take advantage of an optimized linear traverse of the disk [32]. The limited number of rival methods is particularly troubling for papers that introduce a novel similarity measure. Although 29 of the papers surveyed introduce a novel similarity measure, only 12 of ...

217 | Efficient time series matching by wavelets - Chan, Fu - 1999 |

209 | Fast similarity search in the presence of noise, scaling, and translation in time-series databases - Agrawal, Lin, et al. - 1995 |

Citation Context: ...e series is of little utility unless some strawman comparison is used. Many papers ask us to consider the quality of their proposed similarity measure without a single comparison to another technique [2, 4, 8, 24, 31, 38, 39, 41, 42, 46, 57]. This is particularly surprising since the most obvious strawman, Euclidean distance, is trivial to implement (for example, in the Matlab programming language it requires only 19 characters: sqrt(sum(...

189 | Efficient retrieval of similar time sequences under time warping - Yi, Jagadish, et al. - 1998 |

157 | Fast time sequence indexing for arbitrary Lp norms - Yi, Faloutsos - 2000 |

154 | Rule Discovery from Time Series - Das, Lin, et al. |

140 | An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback - Keogh, Pazzani - 1998 |

Citation Context: ...at created the time series has changed [19, 20], or segmentation may simply be performed to create a high level representation of the time series that supports indexing, clustering and classification [20, 30, 31, 37, 39, 42, 44, 46, 48, 52, 57]. As mentioned above, our experiments were conducted on 50 real world, highly diverse datasets. Space limitations prevent us from describing all 50 datasets in detail, so we simply note the following. ...

140 | Similarity-based queries for time series data - Rafiei, Mendelzon - 1997 |

116 | Finding Patterns in Time Series: A Dynamic Programming Approach, in Advances in knowledge discovery and data - Berndt, Clifford - 1996 |

110 | Querying shapes of histories - Agrawal, Psaila, et al. - 1995 |

110 | Event detection from time series data - Guralnik, Srivastava - 1999 |

Citation Context: ...ion queries [57] and relevance feedback [30]. • Support concurrent mining of text and time series [37]. • Support novel clustering and classification algorithms [30]. • Support change point detection [20, 23]. Surprisingly, in spite of the ubiquity of this representation, with the exception of [52], there has been little attempt to understand and compare the algorithms that produce it. Although appearing ...

108 | Efficiently supporting ad hoc queries in large datasets of time sequences - Korn, Jagadish, et al. - 1997 |

Citation Context: ...ction stage. The proposed representations include the Discrete Fourier Transform (DFT) [1, 11, 16, 28, 49, 50], several kinds of Wavelets (DWT) [10, 27, 45, 51, 57, 60], Singular Value Decomposition [32, 35], Adaptive Piecewise Constant Approximation [32], Inner Products [18] and Piecewise Aggregate Approximation (PAA) [61]. The majority of work has focused solely on performance issues, however some auth...

106 | On similarity queries for time series data: Constraint specification and implementation - Goldin, Kanellakis - 1995 |

Citation Context: ...dissimilar. Surprisingly, many of the papers included in the survey, whose main contribution was to introduce a new similarity measure, fail to show even one example of a matching pair of time series [4, 8, 19, 22, 24, 26, 34, 36, 38, 42, 43, 48, 57]. Moreover, showing some examples of matching time series is of little utility unless some strawman comparison is used. Many papers ask us to consider the quality of their proposed similarity measure ...

105 | A Probabilistic Approach to Fast Pattern Matching - Keogh, Smyth - 1997 |

Citation Context: ...at created the time series has changed [19, 20], or segmentation may simply be performed to create a high level representation of the time series that supports indexing, clustering and classification [20, 30, 31, 37, 39, 42, 44, 46, 48, 52, 57]. As mentioned above, our experiments were conducted on 50 real world, highly diverse datasets. Space limitations prevent us from describing all 50 datasets in detail, so we simply note the following. ...

91 | Finding Similar Time Series - Das, Gunopulos, et al. - 1997 |

81 | Approximate queries and representations for large data sequences - Shatkay, Zdonik - 1996 |

66 | Deformable Markov Model Templates for Time-Series Pattern - Ge, Smyth - 2000 |

Citation Context: ...K << n) such that Q closely approximates Q. Note that segmentation has two major uses. It may be performed in order to determine when the underlying model that created the time series has changed [19, 20], or segmentation may simply be performed to create a high level representation of the time series that supports indexing, clustering and classification [20, 30, 31, 37, 39, 42, 44, 46, 48, 52, 57]. A...

65 | Identifying Representative Trends in Massive Time Series Data Sets Using Sketches - Indyk, Koudas, et al. - 2000 |

63 | Similarity search over time series data using wavelets - Popivanov - 2002 |

Citation Context: ...suggest a different approach to the dimensionality reduction stage. The proposed representations include the Discrete Fourier Transform (DFT) [1, 11, 16, 28, 49, 50], several kinds of Wavelets (DWT) [10, 27, 45, 51, 57, 60], Singular Value Decomposition [32, 35], Adaptive Piecewise Constant Approximation [32], Inner Products [18] and Piecewise Aggregate Approximation (PAA) [61]. The majority of work has focused solely o...

61 | Machine learning as experimental science - Kibler, Langley - 1990 |

Citation Context: ...s tested on 1.85 datasets (1.26 real and 0.59 synthetic). These numbers are astonishingly low when you consider that new machine learning algorithms are typically evaluated on at least a dozen datasets [12, 33]. In fact, we feel that the numbers above are optimistic. Of the 30 papers that use two or more datasets, a very significant fraction (64%) use both stock market data and random walk data. However, w...

60 | Distance Measures for Effective Clustering of ARIMA Time-Series - Kalpakis, Gada, et al. |

Citation Context: ...gle linkage method, with 4 different distance measures. Euclidean distance and Dynamic Time Warping are decade old strawmen. The other two approaches have recently been proposed in data mining papers [57, 29]. Dendrograms are particularly attractive since a clustering of M objects summarizes O(M) measurements, however other possibilities of visualizing the quality of a similarity measure included projectin...

59 | Efficient retrieval of similar time sequences using dft - Rafiei, Mendelzon - 1998 |

58 | Pattern Extraction for Time Series Classification - Geurts - 2001 |

57 | FALCON: Feedback adaptive loop for content-based retrieval - Wu, Faloutsos, et al. - 2000 |

54 | Efficient pruning methods for separate-and-conquer rule learning systems - Cohen - 1993 |

Citation Context: ...s tested on 1.85 datasets (1.26 real and 0.59 synthetic). These numbers are astonishingly low when you consider that new machine learning algorithms are typically evaluated on at least a dozen datasets [12, 33]. In fact, we feel that the numbers above are optimistic. Of the 30 papers that use two or more datasets, a very significant fraction (64%) use both stock market data and random walk data. However, w...

52 | A comparison of DFT and DWT based Similarity Search in Time-Series Databases - Wu, Agrawal, et al. |

Citation Context: ...suggest a different approach to the dimensionality reduction stage. The proposed representations include the Discrete Fourier Transform (DFT) [1, 11, 16, 28, 49, 50], several kinds of Wavelets (DWT) [10, 27, 45, 51, 57, 60], Singular Value Decomposition [32, 35], Adaptive Piecewise Constant Approximation [32], Inner Products [18] and Piecewise Aggregate Approximation (PAA) [61]. The majority of work has focused solely o...

51 | Variable length queries for time series data - Kahveci, Singh - 2001 |

Citation Context: ...suggest a different approach to the dimensionality reduction stage. The proposed representations include the Discrete Fourier Transform (DFT) [1, 11, 16, 28, 49, 50], several kinds of Wavelets (DWT) [10, 27, 45, 51, 57, 60], Singular Value Decomposition [32, 35], Adaptive Piecewise Constant Approximation [32], Inner Products [18] and Piecewise Aggregate Approximation (PAA) [61]. The majority of work has focused solely o...

51 | Similarity search for multidimensional data sequences - Lee, Chun, et al. - 2000 |

49 | Mining the stock market: Which measure is best - Gavrilov, Anguelov, et al. |

48 | Approximate nearest neighbor searching in multimedia databases - Ferhatosmanoglu, Tuncel, et al. - 2001 |

48 | Mining of concurrent text and time series - Lavrenko, Schmill, et al. - 2000 |

Citation Context: ...ata to be the same, each paper in the survey is tested on average on only 1.28 different datasets. This number might be reasonable if the contribution had been claimed for only a single type of data [19, 37], or it had been shown that the choice of dataset has little influence on the outcome. However, the choice of dataset has a huge effect on the performance of time series algorithms. We will demonstrat...

46 | Fast time-series searching with scaling and shifting - Chu, Wong - 1999 |

44 | Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers, Supercomputing Review - Bailey - 1991 |

44 | An Index-based approach for similarity search supporting time warping in large sequence databases - Park, Chu - 2001 |

43 | Efficient searches for similar subsequences of different lengths in sequence databases - Park, Chu, et al. - 2000 |

41 | Matching and indexing sequences of different lengths - Bozkaya, Yazdani, et al. - 1997 |

38 | Supporting content-based searches on time series via approximation - Wang, Wang - 2000 |

Citation Context: ...at created the time series has changed [19, 20], or segmentation may simply be performed to create a high level representation of the time series that supports indexing, clustering and classification [20, 30, 31, 37, 39, 42, 44, 46, 48, 52, 57]. As mentioned above, our experiments were conducted on 50 real world, highly diverse datasets. Space limitations prevent us from describing all 50 datasets in detail, so we simply note the following. ...

36 | Supporting fast search in time series for movement patterns in multiple scales - Qu, Wang, et al. - 1998 |

30 | A signature technique for similarity-based queries - Faloutsos, Jagadish, et al. - 1997 |

30 | Adaptive query processing for time-series data - Huang, Yu - 1999 |

26 | MALM: A framework for mining sequence database at multiple abstraction levels. CIKM - Li, Yu, et al. - 1998 |

24 | The Haar wavelet transform in the time series similarity paradigm - Struzik, Siebes - 1999 |

22 | Interactive interpretation of Kohonen maps applied to curves - Debregeas, Hebrail - 1998 |

22 | Segment-Based approach for subsequence searches in sequence databases - Park, SW, et al. - 2001 |

21 | Mining for similarities in aligned time series using wavelets - Huhtala, Kärkkäinen, et al. |

Citation Context: ...cfons [57] 0.380 0.116; Cepstrum [29] 0.570 0.458; String (Suffix Tree) [24] 0.206 0.578; Important Points [46] 0.387 0.478; Edit Distance [8] 0.603 0.622; String Signature [4] 0.444 0.695; Cosine Wavelets [25] 0.130 0.371; Hölder [54] 0.331 0.593; Piecewise Probabilistic [31] 0.202 0.321. The results are quite surprising. None of the proposed techniques can beat the simple strawman. Their error rates are an o...

21 | Fast retrieval of similar subsequences in long sequence databases - Park, Lee, et al. - 1999 |

Citation Context: ...dissimilar. Surprisingly, many of the papers included in the survey, whose main contribution was to introduce a new similarity measure, fail to show even one example of a matching pair of time series [4, 8, 19, 22, 24, 26, 34, 36, 38, 42, 43, 48, 57]. Moreover, showing some examples of matching time series is of little utility unless some strawman comparison is used. Many papers ask us to consider the quality of their proposed similarity measure ...

14 | HotBits: Genuine random numbers, generated by radioactive decay. Online at http://www.fourmilab.com/hotbits - Walker - 1999 |

Citation Context: ...aset. For fairness we used the same 100,000 subsequences for each approach. To ensure randomness in our sampling technique we used true random numbers that were created by a quantum mechanical process [55]. 3.3.1 Demonstration of data bias. The three papers listed above experimented on a maximum of 3 datasets. If we use that number of datasets we can demonstrate essentially any finding we wish. For exampl...