Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures
The last decade has witnessed tremendous growth of interest in applications that deal with querying and mining of time series data. Numerous representation methods for dimensionality reduction and similarity measures geared towards time series have been introduced. Each individual work introducing a particular method has made specific claims and, aside from the occasional theoretical justification, provided quantitative experimental observations. However, for the most part, the comparative aspects of these experiments were too narrowly focused on demonstrating the benefits of the proposed methods over some of the previously introduced ones. In order to provide a comprehensive validation, we conducted an extensive set of time series experiments, reimplementing 8 different representation methods and 9 similarity measures and their variants, and testing their effectiveness on 38 time series data sets from a wide variety of application domains. In this paper, we give an overview of these different techniques and present our comparative experimental findings regarding their effectiveness. Our experiments have provided both a unified validation of some of the existing achievements and, in some cases, suggested that certain claims in the literature may be unduly optimistic.
Fast Time Series Classification Using Numerosity Reduction
 In ICML’06
, 2006
Many algorithms have been proposed for the problem of time series classification. However, it is clear that one-nearest-neighbor with Dynamic Time Warping (DTW) distance is exceptionally difficult to beat. This approach has one weakness, however: it is computationally too demanding for many real-time applications. One way to mitigate this problem is to speed up the DTW calculations. Nonetheless, there is a limit to how much this can help. In this work, we propose an additional technique, numerosity reduction, to speed up one-nearest-neighbor DTW. While the idea of numerosity reduction for nearest-neighbor classifiers has a long history, we show here that we can leverage an original observation about the relationship between dataset size and DTW constraints to produce an extremely compact dataset with little or no loss in accuracy. We test our ideas with a comprehensive set of experiments, and show that they can efficiently produce extremely fast, accurate classifiers.
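The baseline this abstract refers to, one-nearest-neighbor classification under a constrained DTW distance, can be sketched as follows. This is a minimal illustration only: the function names, the squared-difference local cost, and the `window` parameter (a Sakoe-Chiba band half-width) are assumptions chosen for the example, and the numerosity reduction step proposed in the paper is not reproduced here.

```python
import math


def dtw_distance(a, b, window=None):
    """DTW distance between two numeric sequences.

    `window` is an assumed Sakoe-Chiba band half-width; None means unconstrained.
    """
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide so the final cell stays reachable.
    w = max(n, m) if window is None else max(window, abs(n - m))
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])


def classify_1nn(query, train, window=None):
    """Return the label of the training series closest to `query` under DTW.

    `train` is a list of (series, label) pairs.
    """
    best_label, best_dist = None, math.inf
    for series, label in train:
        d = dtw_distance(query, series, window)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```

Every query costs one DTW computation per training series, which is why shrinking the training set (numerosity reduction) and narrowing the warping window both reduce classification time.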
Anytime classification using the nearest neighbor algorithm with applications to stream mining
 IEEE International Conference on Data Mining (ICDM)
, 2006
For many real world problems we must perform classification under widely varying amounts of computational resources. For example, if asked to classify an instance taken from a bursty stream, we may have from milliseconds to minutes to return a class prediction. For such problems an anytime algorithm may be especially useful. In this work we show how we can convert the ubiquitous nearest neighbor classifier into an anytime algorithm that can produce an instant classification, or if given the luxury of additional time, can utilize the extra time to increase classification accuracy. We demonstrate the utility of our approach with a comprehensive set of experiments on data from diverse domains.
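The anytime idea described above can be sketched as a nearest-neighbor loop that may be interrupted and always returns its best answer so far. This is a hedged illustration, not the authors' algorithm: the names `anytime_1nn` and `ranked_train`, the use of Euclidean distance, and the `max_evals` budget (a deterministic stand-in for a time limit) are assumptions, and the training set is assumed to be already ordered so that the most useful instances come first.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def anytime_1nn(query, ranked_train, max_evals):
    """Nearest-neighbor classification under a budget of distance evaluations.

    `ranked_train` is a list of (series, label) pairs, assumed to be sorted
    from most to least useful. `max_evals` is a hypothetical budget that
    plays the role of the available time.
    """
    best_label, best_dist = None, math.inf
    for evals, (series, label) in enumerate(ranked_train):
        if evals >= max_evals:
            break  # interrupted: fall through and return the best answer so far
        d = euclidean(query, series)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```

With a small budget the classifier answers from the first few instances only; with a larger budget it examines more of the training set and its answer can change, which is the anytime behavior the abstract describes.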
Approximate embedding-based subsequence matching of time series
 In SIGMOD ’08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
, 2008
A method for approximate subsequence matching is introduced that significantly improves the efficiency of subsequence matching in large time series data sets under the dynamic time warping (DTW) distance measure. Our method is called EBSM, shorthand for Embedding-Based Subsequence Matching. The key idea is to convert subsequence matching to vector matching using an embedding. This embedding maps each database time series into a sequence of vectors, so that every step of every time series in the database is mapped to a vector. The embedding is computed by applying full dynamic time warping between reference objects and each database time series. At runtime, given a query object, an embedding of that object is computed in the same manner, by running dynamic time warping between the reference objects and the query. Comparing the embedding of the query with the database vectors efficiently identifies relatively few areas of interest in the database sequences. Those areas of interest are then fully explored using the exact DTW-based subsequence matching algorithm. Experiments on a large, public time series data set produce speedups of over one order of magnitude compared to brute-force search, with very small losses (< 1%) in retrieval accuracy.
Accelerating dynamic time warping subsequence search with GPUs and FPGAs
 in Proc. ICDM, 2010
Many time series data mining problems require subsequence similarity search as a subroutine. While this can be performed with any distance measure, and dozens of distance measures have been proposed in the last decade, there is increasing evidence that Dynamic Time Warping (DTW) is the best measure across a wide range of domains. Given DTW’s usefulness and ubiquity, there has been a large community-wide effort to mitigate its relative lethargy. Proposed speedup techniques include early abandoning strategies, lower-bound-based pruning, indexing and embedding. In this work we argue that we are now close to exhausting all possible speedup from software, and that we must turn to hardware-based solutions if we are to tackle the many problems that are currently untenable even with state-of-the-art algorithms running on high-end desktops. With this motivation, we investigate both GPU (Graphics Processing Unit) and FPGA (Field Programmable Gate Array) based acceleration of subsequence similarity search under the DTW measure. As we shall show, our novel algorithms allow GPUs, which are typically bundled with standard desktops, to achieve two orders of magnitude speedup. For problem domains which require even greater scale-up, we show that FPGAs costing just a few thousand dollars can be used to produce four orders of magnitude speedup. We conduct detailed case studies on the classification of astronomical observations and similarity search in commercial agriculture, and demonstrate that our ideas allow us to tackle problems that would be simply untenable otherwise. Keywords: time series; similarity search; dynamic time warping; FPGA; GPU
Finding Motifs in Database of Shapes
 IN PROC. OF SIAM INTERNATIONAL CONFERENCE ON DATA MINING (SDM’07)
, 2007
The problem of efficiently finding images that are similar to a target image has attracted much attention in the image processing community and is rightly considered an information retrieval task. However, the problem of finding structure and regularities in large image datasets is an area in which data mining is beginning to make fundamental contributions. In this work, we consider the new problem of discovering shape motifs, which are approximately repeated shapes within (or between) image collections. As we shall show, shape motifs can have applications in tasks as diverse as anthropology, law enforcement, and historical manuscript mining. Brute force discovery of shape motifs could be untenably slow, especially as many domains may require an expensive rotation invariant distance measure. We introduce an algorithm that is two to three orders of magnitude faster than brute force search, and demonstrate the utility of our approach with several real world datasets from diverse domains.
A bit level representation for time series data mining with shape based similarity
, 2006
Clipping is the process of transforming a real-valued series into a sequence of bits representing whether each data point is above or below the average. In this paper we argue that clipping is a useful and flexible transformation for the exploratory analysis of large time-dependent data sets. We demonstrate how time series stored as bits can be very efficiently compressed and manipulated and that, under some assumptions, the discriminatory power with clipped series is asymptotically equivalent to that achieved with the raw data. Unlike other transformations, clipped series can be compared directly to the raw data series. We show that this means we can form a tight lower bounding metric for Euclidean and Dynamic Time Warping distance and hence efficiently query by content. Clipped data can be used in conjunction with a host of algorithms and statistical tests that naturally follow from the binary nature of the data. A series of experiments illustrates how clipped series can be used in increasingly complex ways to achieve better results than with other popular techniques. The usefulness of the representation is demonstrated by the fact that the results with clipped data are consistently better than those achieved with a Wavelet or Discrete Fourier Transformation at the same compression ratio for both clustering and query by content. The flexibility of the representation is shown by the fact that we can take advantage of a variable run length encoding of clipped series to define an approximation of the Kolmogorov complexity and hence perform
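The clipping transformation defined at the start of this abstract can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, values exactly equal to the mean are mapped to 0 (the abstract does not say how ties are handled), and the Hamming distance shown is just a simple way to compare two clipped series, not the lower bounding metric the paper derives.

```python
def clip(series):
    """Map each value to 1 if it is above the series mean, else 0.

    Assumption: values equal to the mean are mapped to 0.
    """
    mean = sum(series) / len(series)
    return [1 if x > mean else 0 for x in series]


def hamming(bits_a, bits_b):
    """Number of positions at which two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(bits_a, bits_b))


rising = clip([1, 2, 3, 4])   # mean is 2.5
falling = clip([4, 3, 2, 1])  # mean is 2.5
```

Because each point is reduced to a single bit, a clipped series needs far less storage than the raw values and can be compared with cheap bitwise operations.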
Faster Retrieval with a Two-Pass Dynamic-Time-Warping Lower Bound
, 2009
Dynamic Time Warping (DTW) is a popular similarity measure between time series. DTW fails to satisfy the triangle inequality and its computation requires quadratic time. Hence, to find closest neighbors quickly, we use bounding techniques. We can avoid most DTW computations with an inexpensive lower bound (LB_Keogh). We compare LB_Keogh with a tighter lower bound (LB_Improved). We find that LB_Improved-based search is faster. As an example, our approach is 2–3 times faster over random-walk and shape time series.
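The inexpensive lower bound mentioned above (LB Keogh, usually written LB_Keogh) can be sketched as follows for two equal-length series and a warping window of half-width `w`. This is a minimal sketch: the function names are chosen for the example, a squared-difference local cost is assumed, and the tighter two-pass LB_Improved bound from the paper is not implemented here.

```python
import math


def envelope(q, w):
    """Lower and upper envelope of `q`: min and max over a window of half-width `w`."""
    n = len(q)
    lower = [min(q[max(0, i - w): i + w + 1]) for i in range(n)]
    upper = [max(q[max(0, i - w): i + w + 1]) for i in range(n)]
    return lower, upper


def lb_keogh(q, c, w):
    """LB_Keogh lower bound on the DTW distance between `q` and `c`.

    Only the parts of the candidate `c` that fall outside the envelope
    of the query `q` contribute to the bound.
    """
    lower, upper = envelope(q, w)
    total = 0.0
    for x, lo, hi in zip(c, lower, upper):
        if x > hi:
            total += (x - hi) ** 2
        elif x < lo:
            total += (lo - x) ** 2
    return math.sqrt(total)
```

In a nearest-neighbor search, a candidate whose bound already exceeds the best DTW distance found so far can be discarded without running the quadratic-time DTW computation.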
Multiresolution Motif Discovery in Time Series
Time series motif discovery is an important problem with applications in a variety of areas that range from telecommunications to medicine. Several algorithms have been proposed to solve the problem. However, these algorithms rely heavily on expensive random disk accesses or assume the data can fit into main memory. They only consider motifs at a single resolution and are not suited to interactivity. In this work, we tackle the motif discovery problem as an approximate Top-K frequent subsequence discovery problem. We fully exploit the multiresolution capability of the state-of-the-art iSAX representation to obtain motifs at different resolutions. This property yields interactivity, allowing the user to navigate along the Top-K motifs structure. This permits a deeper understanding of the time series database. Further, we apply the
FAST MULTISEGMENT ALIGNMENTS FOR TEMPORAL EXPRESSION PROFILES
We present two heuristics for speeding up a time series alignment algorithm that is related to dynamic time warping (DTW). In previous work, we developed our multisegment alignment algorithm to answer similarity queries for toxicogenomic time-series data. Our multisegment algorithm returns more accurate alignments than DTW at the cost of time complexity; the multisegment algorithm is O(n^5) whereas DTW is O(n^2). The first heuristic we present speeds up our algorithm by a constant factor by restricting alignments to a cone shape in alignment space. The second heuristic restricts the alignments considered to those near one returned by a DTW-like method. This heuristic adjusts the time complexity to O(n^3). Importantly, neither heuristic results in a loss in accuracy.