## iSAX 2.0: Indexing and Mining One Billion Time Series

Citations: 7 (1 self)

### BibTeX

@MISC{Camerra_isax2.0,
  author = {Alessandro Camerra and Themis Palpanas and Jin Shieh and Eamonn Keogh},
  title = {iSAX 2.0: Indexing and Mining One Billion Time Series},
  year = {}
}

### Abstract

There is an increasingly pressing need, by several applications in diverse domains, for techniques able to index and mine very large collections of time series. Examples of such applications come from astronomy, biology, the web, and other domains. It is not unusual for these applications to involve numbers of time series in the order of hundreds of millions to billions. However, none of the relevant techniques proposed in the literature so far has considered data collections much larger than one million time series. In this paper, we describe iSAX 2.0, a data structure designed for indexing and mining truly massive collections of time series. We show that the main bottleneck in mining such massive datasets is the time taken to build the index, and we thus introduce a novel bulk loading mechanism, the first of its kind specifically tailored to a time series index. We show how our method allows mining on datasets that would otherwise be completely untenable, including the first published experiments to index one billion time series, and experiments in mining massive data from domains as diverse as entomology, DNA and web-scale image collections.

Keywords: time series; data mining; representations; indexing

### Citations

241 | 80 million tiny images: A large data set for nonparametric object and scene recognition
- Torralba, Fergus, et al.
Citation Context: …re hundreds of possible distance measures proposed for images, a recent paper has shown that simple Euclidean distance between color histograms is very effective if the training dataset is very large [8]. More generally, there is an increasing understanding that having lots of data without a model can often beat smaller datasets, even if they are accompanied by a sophisticated model [9][10]. Indeed, …

229 | Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases
- Keogh, Chakrabati, et al.
- 2001
Citation Context: …our occasions the best paper winners at SIGKDD/SIGMOD have looked at the problem of indexing time series, with the largest dataset considered by each paper being 500,000 objects [20], 100,000 objects [21], 6,480 objects [1], and 27,000 objects [23]. Thus the 1,000,000,000 objects considered here represent real progress, beyond the inevitable improvements in hardware performance. We further show that t…

155 | Dimensionality Reduction for fast similarity search in large time series databases
- Keogh, Chakrabarti, et al.
- 2000
Citation Context: …space by a vector of real numbers $C = c_1, \ldots, c_w$. The $i$-th element of $C$ is calculated by $c_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w}i} T_j$. Figure 1(ii) shows T converted into this representation (called PAA [22]), reducing the dimensionality from 16 to 4. Note that the PAA coefficients are intrinsically real-valued, and for reasons we will make clear later, it can be advantageous to have discrete coefficients…
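The PAA computation described in this context is easy to sketch; the following is a minimal illustration (the series values are made up for the example and are not the ones in the paper's Figure 1):

```python
def paa(T, w):
    """Piecewise Aggregate Approximation: represent a length-n series T
    by the means of w equal-width segments. This sketch assumes w divides n."""
    n = len(T)
    assert n % w == 0, "this sketch assumes n is divisible by w"
    seg = n // w
    # c_i = (w/n) * sum of the values in the i-th segment, i.e. the segment mean.
    return [sum(T[i * seg:(i + 1) * seg]) / seg for i in range(w)]

# Reduce dimensionality from 16 to 4, mirroring the cited description.
T = [0, 0, 1, 1, 2, 2, 3, 3, 3, 3, 2, 2, 1, 1, 0, 0]
print(paa(T, 4))  # → [0.5, 2.5, 2.5, 0.5]
```

Each coefficient is simply the mean of one segment, which is why PAA is so cheap to compute in a single pass over the data.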

100 | A probabilistic approach to fast pattern matching in time series databases
- Keogh, Smyth
- 1997
Citation Context: …DD/SIGMOD have looked at the problem of indexing time series, with the largest dataset considered by each paper being 500,000 objects [20], 100,000 objects [21], 6,480 objects [1], and 27,000 objects [23]. Thus the 1,000,000,000 objects considered here represent real progress, beyond the inevitable improvements in hardware performance. We further show that the scalability achieved by our ideas allows …

73 | The unreasonable effectiveness of data
- Halevy, Norvig, et al.
- 2009
Citation Context: …is very large [8]. More generally, there is an increasing understanding that having lots of data without a model can often beat smaller datasets, even if they are accompanied by a sophisticated model [9][10]. Indeed, Peter Norvig, Google's research director, recently noted that "All models are wrong, and increasingly you can succeed without them". The ideas introduced in this work offer us a chance t…

60 | Similarity search over time series data using wavelets
- Popivanov, Miller
- 2002
Citation Context: …i.e. dimensionality reduction) of time series data, including Discrete Fourier Transformation [20], Singular Value Decomposition (SVD), Discrete Cosine Transformation, Discrete Wavelet Transformation [26], Piecewise Aggregate Approximation [22], Adaptive Piecewise Constant Approximation [21], and Chebyshev polynomials [1]. However, recent extensive empirical evaluations suggest that on average, there is l…

52 | A generic approach to bulk loading multidimensional index structures
- Bercken, Seeger, et al.
- 1997
Citation Context: …e representation. The problem of bulk loading has been studied in the context of traditional database indices, such as B-trees and R-trees and other multi-dimensional index structures [15][16][17][18][19][27]. For these structures two main approaches have been proposed. First, we have the merge-based techniques [15] that preprocess data into clusters. For each cluster, they proceed with the creation of…
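The merge-based strategy mentioned in this context can be sketched generically. This is a hypothetical illustration of the general idea only (the function names and the toy "sorted list" index are mine, not from any of the cited papers): cluster the data first, build a small index per cluster, then merge the per-cluster indices.

```python
from collections import defaultdict

def merge_based_bulk_load(records, cluster_key, build_index, merge_indices):
    """Generic merge-based bulk load:
    1. preprocess the records into clusters,
    2. build one small index per cluster,
    3. merge the per-cluster indices into the final structure."""
    clusters = defaultdict(list)
    for r in records:                       # step 1: cluster the data
        clusters[cluster_key(r)].append(r)
    sub_indices = [build_index(items)       # step 2: index each cluster
                   for items in clusters.values()]
    return merge_indices(sub_indices)       # step 3: merge

# Toy usage: the "index" is a sorted list; merging concatenates and re-sorts.
idx = merge_based_bulk_load(
    [5, 1, 9, 3, 8],
    cluster_key=lambda x: x % 2,
    build_index=sorted,
    merge_indices=lambda ixs: sorted(sum(ixs, [])),
)
print(idx)  # → [1, 3, 5, 8, 9]
```

The point of the pattern is that each sub-index is built from data that is already grouped, so the expensive random insertions of one-at-a-time loading are avoided.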

52 | Experiencing sax: a novel symbolic representation of time series
- Lin, Keogh, et al.
Citation Context: …entations in terms of fidelity of approximation, and thus indexing power [14]. The approximation we use in this work is intrinsically different from the techniques listed above in that it is discrete [25], rather than real-valued. This discreteness is advantageous in that the average byte used by discrete representations carries much more information than its real-valued counterparts. This allows our …
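The discretization step that turns real-valued PAA coefficients into SAX symbols can be sketched as follows. The breakpoint values below are the standard ones from the SAX literature for cardinality 4 (they cut a standard normal distribution into four equal-probability regions); the function name is mine.

```python
import bisect

# Breakpoints that divide N(0,1) into 4 equal-probability regions.
BREAKPOINTS_4 = [-0.67, 0.0, 0.67]

def sax_symbols(paa_coeffs, breakpoints=BREAKPOINTS_4):
    """Map real-valued PAA coefficients to discrete symbols (integers).
    Each symbol needs only log2(cardinality) bits, which is the source of
    the compact, bit-aware index size discussed in this context."""
    return [bisect.bisect_left(breakpoints, c) for c in paa_coeffs]

print(sax_symbols([-1.2, 0.3, 0.8, -0.1]))  # → [0, 2, 3, 1]
```

With cardinality 4, a whole symbol costs 2 bits, versus 4 or 8 bytes for a raw floating-point coefficient.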

50 | Indexing spatio-temporal trajectories with Chebyshev polynomials. Proceedings of the international conference on Management of data, SIGMOD '04, pp. 599–610, ACM, 2004
- Cai, Ng
Citation Context: …st paper winners at SIGKDD/SIGMOD have looked at the problem of indexing time series, with the largest dataset considered by each paper being 500,000 objects [20], 100,000 objects [21], 6,480 objects [1], and 27,000 objects [23]. Thus the 1,000,000,000 objects considered here represent real progress, beyond the inevitable improvements in hardware performance. We further show that the scalability achi…

24 | iSAX: indexing and mining terabyte sized time series
- Shieh, Keogh
- 2008
Citation Context: …rising conclusion: For all attempts at large scale mining of time series, it is the time complexity of building the index that remains the most significant bottleneck: e.g., a state-of-the-art method [3] needs over 6 days to build an index with 100-million items. Additionally, there is a pressing need to reduce retrieval times, especially as such data is clearly doomed to be disk resident. Once a dim…

17 | The end of theory: the data deluge makes the scientific method obsolete
- Anderson
Citation Context: …very large [8]. More generally, there is an increasing understanding that having lots of data without a model can often beat smaller datasets, even if they are accompanied by a sophisticated model [9][10]. Indeed, Peter Norvig, Google's research director, recently noted that "All models are wrong, and increasingly you can succeed without them". The ideas introduced in this work offer us a chance to te…

14 | GBI: a generalized R-tree bulk-insertion strategy
- Choubey, Chen, et al.
- 1999
Citation Context: …ete nature of the representation. The problem of bulk loading has been studied in the context of traditional database indices, such as B-trees and R-trees and other multi-dimensional index structures [15][16][17][18][19][27]. For these structures two main approaches have been proposed. First, we have the merge-based techniques [15] that preprocess data into clusters. For each cluster, they proceed with…

9 | An initial genetic linkage map of the rhesus macaque (Macaca mulatta) genome using human microsatellite loci
- Rogers, Garcia, et al.
- 2006
Citation Context: …umn corresponding to Macaque 7, the two darkest cells are rows 14 and 15. The first paper to publish a genetic linkage map of the two primates tells us "macaque 7 is homologous to human 14 and human 15" [12]. More generally, this correspondence matrix is at least 95% in agreement with the current consensus on homology between these two primates [12]. This experiment demonstrates that we can easily index …

5 | Improving Performance with Bulk-Inserts in Oracle R-Trees
- An, Kothuri, et al.
- 2003
Citation Context: …nature of the representation. The problem of bulk loading has been studied in the context of traditional database indices, such as B-trees and R-trees and other multi-dimensional index structures [15][16][17][18][19][27]. For these structures two main approaches have been proposed. First, we have the merge-based techniques [15] that preprocess data into clusters. For each cluster, they proceed with the…

2 | The TS-Tree: Efficient Time Series Search and Retrieval
- Assent, Krieger, et al.
- 2008
Citation Context: …-10). Among the segments for which this is true, the algorithm picks the one whose μ value lies closer to a breakpoint (lines 9-10). [Figure 4, "Node splitting policy example": for each of segments 1–4, the threshold μ[i] + 3σ[i] is drawn against the breakpoints for cardinality 2; the accompanying Split() pseudocode begins with mean[ ] = ComputeSymbolMean(), using the highest iSAX representatio…]
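The splitting heuristic described in this context can be reconstructed roughly as follows. This is a hypothetical sketch of the idea only; the function and parameter names are mine, not the paper's, and the real algorithm operates on iSAX words rather than plain floats.

```python
def choose_split_segment(mu, sigma, breakpoints_per_segment):
    """Pick the segment to split on: consider segments whose mean +/- 3
    standard deviations straddles a breakpoint of the next cardinality,
    and among those prefer the segment whose mean lies closest to one.

    mu[i], sigma[i]: mean/stddev of segment i over the node's entries;
    breakpoints_per_segment[i]: candidate breakpoints for segment i.
    Returns the chosen segment index, or None if no segment qualifies."""
    candidates = []
    for i, (m, s) in enumerate(zip(mu, sigma)):
        for b in breakpoints_per_segment[i]:
            if m - 3 * s <= b <= m + 3 * s:          # 3-sigma range crosses b
                candidates.append((abs(m - b), i))   # distance of mean to b
    if not candidates:
        return None
    return min(candidates)[1]  # segment whose mean is closest to a breakpoint

print(choose_split_segment([0.1, 0.9], [0.05, 0.5], [[0.0], [0.67]]))  # → 0
```

The 3σ test is a cheap way to predict that a split at that breakpoint will actually separate the node's entries, rather than sending nearly all of them to one child.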

2 | The AC-DC Correlation Monitor: New EPG Design with Flexible Input Resistors to Detect Both R and emf Components for any Piercing-sucking Hemipteran
- Backus, Bennett
Citation Context: …e significant progress on the problem. As USDA scientist Dr. Elaine Backus recently noted, "Much of what is known today about hemipteran feeding biology … has been learned via use of EPG technology" [6]. However, in spite of the current successes, there is a bottleneck in progress due to the huge volumes of data produced. For example, a single experiment can last up to 24 hours. At 100 Hz that will …

2 | Efficient Bulk Operations on Dynamic R-Trees. Algorithmica 33(1)
- Arge, Hinrichs, Vahrenhold, Vitter
Citation Context: …mory is used for disk buffer management (i.e., buffers corresponding to the leaf level nodes). We also compare to iSAX-BufferTree, which is an adaptation of the Buffered R-Tree bulk loading algorithm [17]. In this case, instead of having buffers only at the first and leaf levels, we also have some buffers at intermediate levels of the index tree. These buffers are of equal size, which depends on the s…

1 | Characterisation of the feeding behaviour of western flower thrips in terms of EPG waveforms
- Kindt, Tjallingii, et al.
- 2003
Citation Context: …tool, which will eventually be made freely available to the entomological community. Let us consider a typical scenario in which the tool may be used. In Figure 9 (bottom) we see a copy of Fig. 2 from [4]. This time series shows a behavior observed in a Western Flower Thrip (Frankliniella occidentalis), an insect which is a vector for more than 20 plant diseases. The Beet Leafhopper (Circulifer tenell…

1 | Assimilation efficiency of free and protein amino acids by H. vitripennis feeding on C. sinensis and V. vinifera. Florida Entomologist
- Andersen, Brodbeck, et al.
- 2009
Citation Context: …Leafhoppers), Homalodisca coagulata first appeared in California around 1993, and has since done several billions of dollars of damage and now threatens California's $34 billion grape industry [5]. In order to understand and ultimately control these harmful behaviors, entomologists glue a thin wire to the insect's back, and then measure fluctuations in voltage level to create an Electrical Pen…

1 | Personal Communication, August 12, 2009
- Walker
Citation Context: … asked us to create an efficient tool for mining massive EPG collections [7]. (Footnote: Xylem is plant sap responsible for the transport of water and soluble mineral nutrients from the roots throughout the plant.) We have used the techniques introduced in this work as a beta version of such a tool, which will eventually be made freely available to the entomological community. Let us consider a typical scenari…

1 | Mouse chromosome 16 - Reeves, Cabin - 1999

1 | Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520–562 - Asif - 2002

1 | Querying and mining of time series data: experimental comparison of representations and distance measures
- unknown authors
- 2008
Citation Context: …hods. However there is an increasing understanding that this is a red herring. It has been forcefully shown that averaged over many datasets, the time series representation makes very little difference [14]. Finally, a unique property of iSAX is its tiny bit-aware index size. This means that an iSAX index is very small compared to the data it indexes, and thus we can fit the entire index in main memory …

1 | An Evaluation of Generic Bulk Loading Techniques
- Van den Bercken, Seeger
Citation Context: …f the representation. The problem of bulk loading has been studied in the context of traditional database indices, such as B-trees and R-trees and other multi-dimensional index structures [15][16][17][18][19][27]. For these structures two main approaches have been proposed. First, we have the merge-based techniques [15] that preprocess data into clusters. For each cluster, they proceed with the creatio…

1 | Single and Bulk Updates in Stratified Trees: An Amortized and Worst-Case Analysis. Computer Science in Perspective, 2003
- Soisalon-Soininen, Widmayer
Citation Context: …presentation. The problem of bulk loading has been studied in the context of traditional database indices, such as B-trees and R-trees and other multi-dimensional index structures [15][16][17][18][19][27]. For these structures two main approaches have been proposed. First, we have the merge-based techniques [15] that preprocess data into clusters. For each cluster, they proceed with the creation of a s…