## Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases (2000)

Citations: 172 (18 self)

### BibTeX

    @MISC{Keogh00dimensionalityreduction,
        author = {Eamonn Keogh and Kaushik Chakrabarti and Michael Pazzani and Sharad Mehrotra},
        title = {Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases},
        year = {2000}
    }

### Abstract

The problem of similarity search in large time series databases has attracted much attention recently. It is a non-trivial problem because of the inherent high dimensionality of the data. The most promising solutions involve first performing dimensionality reduction on the data, and then indexing the reduced data with a spatial access method. Three major dimensionality reduction techniques have been proposed: Singular Value Decomposition (SVD), the Discrete Fourier Transform (DFT), and, more recently, the Discrete Wavelet Transform (DWT). In this work we introduce a new dimensionality reduction technique, which we call Piecewise Aggregate Approximation (PAA). We theoretically and empirically compare it to the other techniques and demonstrate its superiority. In addition to being competitive with or faster than the other methods, our approach has numerous other advantages. It is simple to understand and to implement, it allows more flexible distance measures, including weighted Euclidean queries, and the index can be built in linear time.
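As a rough illustration of the technique the abstract names, the sketch below assumes the usual formulation of PAA: a series of length n is cut into N equal-length frames, each frame is replaced by its mean, and distances between reduced series are scaled by sqrt(n/N) so that they never exceed the true Euclidean distance. The abstract itself does not give these formulas, and the function names (`paa`, `paa_distance`) and the requirement that N divides n are assumptions made for this example.

```python
import math


def paa(series, n_segments):
    """Reduce a series of length n to n_segments frame means.

    Assumption for this sketch: n_segments divides len(series) exactly.
    """
    n = len(series)
    if n_segments <= 0 or n % n_segments != 0:
        raise ValueError("n_segments must be positive and divide len(series)")
    frame = n // n_segments
    return [
        sum(series[i * frame:(i + 1) * frame]) / frame
        for i in range(n_segments)
    ]


def euclidean(x, y):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def paa_distance(x_bar, y_bar, n):
    """Distance between two reduced series, scaled by sqrt(n / N).

    n is the length of the original series, N = len(x_bar).
    """
    return math.sqrt(n / len(x_bar)) * euclidean(x_bar, y_bar)


x = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0, 7.0, 5.0]
y = [2.0, 2.0, 5.0, 1.0, 4.0, 6.0, 9.0, 9.0]

x_bar = paa(x, 4)
y_bar = paa(y, 4)

print("reduced x:", x_bar)
print("reduced y:", y_bar)
print("true distance:   ", euclidean(x, y))
print("reduced distance:", paa_distance(x_bar, y_bar, len(x)))
```

Each reduced vector is computed in a single pass over the series, which is consistent with the abstract's claim that the index can be built in linear time.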

### Citations

2352 | R-Trees: A Dynamic Index Structure for Spatial Searching - Guttman - 1984 |

1194 | Pattern Recognition and Neural Networks - Ripley - 1996 |
Citation Context: ...timal transform in several senses, including the following. If we take the SVD of some dataset, then attempt to reconstruct the data, SVD is the (linear) transform that minimizes reconstruction error [25]. Given this we should expect SVD to perform very well for the indexing task. SVD, however, has several drawbacks as an indexing scheme. The most important of these relate to its complexity. The classi...

441 | Fast Subsequence Matching in Time-Series Databases - Faloutsos, Ranganathan, et al. - 1994 |
Citation Context: ...lf terabyte of data and is updated at the rate of several gigabytes a day [21, 32]. Given the magnitude of many time series databases, much research has been devoted to speeding up the search process [1, 2, 3, 6, 11, 14, 17, 18, 19, 22, 23, 24, 30, 35]. The most promising methods are techniques that perform dimensionality reduction on the data, then use spatial access methods to index the data in the transformed space. The technique was introduced ...

438 | Efficient Similarity Search In Sequence Databases - Agrawal, Faloutsos, et al. - 1993 |

432 | FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets - Faloutsos, Lin - 1995 |

215 | Efficient Time Series Matching by Wavelets - Chan, Fu - 1999 |

208 | Fast similarity search in the presence of noise, scaling, and translation in time-series databases - Agrawal, Lin, et al. - 1995 |

186 | Efficient retrieval of similar time sequences under time warping - Yi, Jagadish, et al. - 1998 |

153 | The UCI KDD Archive. http://kdd.ics.uci.edu - Hettich, Bay - 1999 |

153 | Rule discovery from time series - Das, Lin, et al. - 1998 |
Citation Context: ...wn right as a tool for exploring time series databases, and it is also an important subroutine in many KDD applications such as clustering [9], classification [18, 21] and mining of association rules [8]. Time series databases are often extremely large. Consider the MACHO project. This astronomical database contains a half terabyte of data and is updated at the rate of several gigabytes a day [21, 3...

139 | An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback - Keogh, Pazzani - 1998 |

123 | Review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms - Wettschereck, Aha, et al. - 1997 |

106 | The Hybrid Tree: An Index Structure for High Dimensional Feature Spaces - Chakrabarti, Mehrotra - 2000 |
Citation Context: ...ests that time series could be indexed by Spatial Access Methods (SAMs) such as the R-tree and its many variants [12]. However, most SAMs begin to degrade rapidly at dimensionalities greater than 8-12 [5], and realistic queries typically contain 20 to 1,000 datapoints. In order to utilize SAMs it is necessary to first perform dimensionality reduction. In [11] the authors introduced GEneric Multimedia ...

104 | A Probabilistic Approach to Fast Pattern Matching in Time Series Databases - Keogh, Smyth - 1997 |

103 | Dimensionality reduction for similarity searching in dynamic databases - Kanth, Agrawal, et al. - 1998 |
Citation Context: ...till consider DWT in our experimental section. 4.3 Singular Value Decomposition. Although Singular Value Decomposition (SVD) has been successfully used for indexing images and other multimedia objects [16, 34] and has been proposed for time series indexing [6], our paper contains the first actual implementation that the authors are aware of. Singular Value Decomposition differs from the three other propose...

102 | EM Algorithms for PCA and SPCA - Roweis |
Citation Context: ...en much work recently on faster algorithms to compute SVD. Because of space limitations we will not discuss the rival merits of the various approaches. We will simply use the EM approach suggested in [26]. This method has a sound theoretical basis and requires only $O(mnN)$ time and, more importantly for us, only $O(Nn + N^2)$ space. The problem of incremental updating has also received attention from man...

81 | Towards an analysis of indexing schemes - Hellerstein, Papadimitriou, et al. - 1997 |
Citation Context: ...ntal methodology We performed all tests over a range of dimensionalities and query lengths. We chose the range of 8 to 20 dimensions because that is the useful range for most spatial index structures [5, 13]. Because we wished to include the DWT in our experiments, we are limited to query lengths that are an integer power of two. We consider a length of 1024 to be the longest query likely to be encounter...

81 | Approximate queries and representations for large data sequences - Shatkay, Zdonik - 1996 |

55 | New techniques for best-match retrieval - Shasha, Wang - 1990 |
Citation Context: ...ime". 7) The index should be able to handle different distance measures, where appropriate. The sixth requirement is introduced to exclude from consideration techniques like Approximate Distance Maps [28]. This technique involves precomputing the distances between every pair of objects in the database to build a distance matrix, which becomes the index. The triangular inequality can then be used to pr...

46 | The FBI wavelet/scalar quantization standard for gray-scale image compression - Bradley, Brislawn, et al. - 1993 |
Citation Context: .... Figure 5 illustrates this. Recently there has been an explosion of interest in using wavelets for data compression, filtering, analysis and other areas where Fourier methods had previously been used [4]. Not surprisingly, researchers have begun to advocate wavelets for indexing [20]. Chan & Fu produced the breakthrough by producing a distance measure defined on wavelet coefficients which provably sa...

30 | Adaptive query processing for time-series data - Huang, Yu - 1999 |

28 | Studies in astronomical time series analysis: V. Bayesian blocks, a new method to analyze structure in photon counting data - Scargle - 1998 |

27 | Efficient retrieval for browsing large image databases - Wu, Agrawal, et al. - 1996 |
Citation Context: ...al work by Agrawal et al. utilizes the Discrete Fourier Transform (DFT) to perform the dimensionality reduction, but other techniques have been suggested, including Singular Value Decomposition (SVD) [34] and the Discrete Wavelet Transform (DWT) [6]. In this paper we introduce a novel transform to achieve dimensionality reduction. The method is motivated by the simple observation that for most time se...

24 | The Haar wavelet transform in the time series similarity paradigm - Struzik, Siebes - 1999 |

23 | Relevance feedback retrieval of time series data - Keogh, Pazzani - 1999 |

22 | Interactive interpretation of Kohonen maps applied to curves - Debregeas, Hebrail - 1998 |
Citation Context: ...nd scientific databases. Similarity search is useful in its own right as a tool for exploring time series databases, and it is also an important subroutine in many KDD applications such as clustering [9], classification [18, 21] and mining of association rules [8]. Time series databases are often extremely large. Consider the MACHO project. This astronomical database contains a half terabyte of data...

21 | Fast retrieval of similar subsequences in long sequence databases - Park, Lee, et al. - 1999 |

9 | On similarity-based queries for time series data - Rafiei - 1999 |

9 | The Fourier Transform - a Primer - Shatkay |
Citation Context: ...ws queries which are shorter than the length for which the index was built. This very desirable feature is impossible in DFT, SVD and DWT due to translation invariance [29, 31]. The rest of the paper is organized as follows. In Section 2, we state the similarity search problem more formally and review GEMINI, a generic framework that utilizes any dimensionality reduction te... [page footer in the source: KAIS long paper, submitted 5/16/00]

8 | Data-mining massive time series astronomical data sets—A case study - Ng, Huang, et al. - 1998 |
Citation Context: ...ses. Similarity search is useful in its own right as a tool for exploring time series databases, and it is also an important subroutine in many KDD applications such as clustering [9], classification [18, 21] and mining of association rules [8]. Time series databases are often extremely large. Consider the MACHO project. This astronomical database contains a half terabyte of data and is updated at the ra...

2 | An eigenspace update algorithm for image analysis - unknown authors - 1997 |
Citation Context: ...ating has also received attention from many researchers. The fastest exact methods are still linear in m. Much faster approximate methods exist, but they introduce the possibility of false dismissals [7]. [Figure 7: The first eight eigenwaves can be combined in a linear combination to produce X', an approximation of the sequence X.]

1 | Mining for similarities in aligned time series using wavelets - unknown authors - 1999 |
Citation Context: ...). DWT does have some drawbacks, however. It is only defined for sequences whose length is an integral power of two. Although there has been a lot of work on more flexible distance measures using Haar [15, 31], none of these techniques are indexable. [Figure 5: The first eight Haar wavelet bases can be combined in a linear combination to produce X', an approximation of the sequence...]

1 | Fast nearest-neighbor search in medical image databases - Korn, Sidiropoulos, et al. - 1996 |
Citation Context: ...hich can exploit any dimensionality reduction method to allow efficient indexing. The technique was originally introduced for time series, but has been successfully extended to many other types of data [20]. A crucial result in [11] is that the authors proved that in order to guarantee no false dismissals, the distance measure in the index space must satisfy the following condition: $D_{\text{index space}}(A,B) \le D_{\text{true}}(A,B)$ (2...
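
The no-false-dismissals condition quoted in the last citation context can be illustrated with a small filter-and-refine range query. The sketch below is not the paper's implementation: it uses a linear scan in place of a spatial access method, assumes frame means as the reduced representation with a sqrt(n/N) scaling as the lower-bounding distance, and every name in it (`reduce_to_means`, `lower_bound`, `range_query`) is hypothetical.

```python
import math


def euclidean(x, y):
    """True distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def reduce_to_means(series, n_segments):
    """Reduced representation: mean of each equal-length frame (assumed)."""
    frame = len(series) // n_segments
    return [
        sum(series[i * frame:(i + 1) * frame]) / frame
        for i in range(n_segments)
    ]


def lower_bound(q_red, c_red, n):
    """Index-space distance; assumed never to exceed the true distance."""
    return math.sqrt(n / len(q_red)) * euclidean(q_red, c_red)


def range_query(query, database, epsilon, n_segments):
    """Return (indices within epsilon of query, number of full comparisons)."""
    n = len(query)
    q_red = reduce_to_means(query, n_segments)
    hits, full_comparisons = [], 0
    for idx, series in enumerate(database):
        # Filter step: if even the lower bound exceeds epsilon, the true
        # distance must too, so skipping this series cannot lose an answer.
        if lower_bound(q_red, reduce_to_means(series, n_segments), n) > epsilon:
            continue
        # Refine step: confirm the surviving candidate on the raw data.
        full_comparisons += 1
        if euclidean(query, series) <= epsilon:
            hits.append(idx)
    return hits, full_comparisons


database = [
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5],
    [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0],
    [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
]
query = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0]

hits, full_comparisons = range_query(query, database, epsilon=2.0, n_segments=4)
brute_force = [i for i, s in enumerate(database) if euclidean(query, s) <= 2.0]

print("filter-and-refine hits:", hits)
print("brute-force hits:      ", brute_force)
print("full comparisons made: ", full_comparisons, "of", len(database))
```

Because the filter only discards series whose lower bound already exceeds the threshold, the filtered result matches the brute-force scan while touching fewer raw series. In a real deployment the reduced vectors would be precomputed and stored in the index rather than recomputed per query.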