## Optimal Distance Bounds on Time-Series Data

Citations: 6 (1 self)

### BibTeX

@MISC{Vlachos_optimaldistance,

author = {Michail Vlachos and Suleyman S. Kozat and Philip S. Yu},

title = {Optimal Distance Bounds on Time-Series Data},

year = {}

}

### Abstract

Most data mining operations include an integral search component at their core. For example, the performance of similarity search or classification based on Nearest Neighbors is largely dependent on the underlying compression and distance estimation techniques. As data repositories grow larger, there is an explicit need not only for storing the data in a compressed form, but also for facilitating mining operations directly on the compressed data. Naturally, the quality or tightness of the estimated distances on the compressed objects directly affects the search performance. We motivate our work within the setting of search engine weblog repositories, where keyword demand trends over time are represented and stored as compressed time-series data. Search and analysis over such sequence data have important applications for search engines, including discovery of important news events, keyword recommendation and efficient keyword-to-advertisement mapping. We present new mechanisms for very fast search operations over the compressed time-series data, with specific focus on weblog data. An important contribution of this work is the derivation of optimally tight bounds on the Euclidean distance estimation between compressed sequences. Since our methodology is applicable to sequential data in general, the proposed technique is of independent interest. Additionally, our distance estimation strategy is not tied to a specific compression methodology, but can be applied on top of any orthonormal-based compression technique (Fourier, wavelet, PCA, etc.). The experimental results indicate that the new optimal bounds lead to a significant improvement in the pruning power of search compared to the previous state of the art, in many cases eliminating more than 80% of the candidate search sequences.
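To make the idea of bounding Euclidean distances on compressed sequences concrete, the following is a minimal sketch and not the optimal bound derived in the paper. It assumes an orthonormal Haar transform, a compressed form that keeps the `k` largest-magnitude coefficients together with the energy of the omitted ones, and an uncompressed query. The bounds come from the plain triangle inequality on the omitted coefficients, so they are valid but generally looser than the optimal ones. All function names (`haar`, `compress`, `bounds`, `euclidean`) are hypothetical and chosen for illustration.

```python
import math


def haar(x):
    """Orthonormal Haar transform; len(x) is assumed to be a power of two."""
    approx = [float(v) for v in x]
    details = []
    while len(approx) > 1:
        half = len(approx) // 2
        # Each pair is rotated by an orthonormal 2x2 matrix, so energy is preserved.
        avg = [(approx[2 * i] + approx[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        det = [(approx[2 * i] - approx[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        details = det + details
        approx = avg
    return approx + details


def compress(x, k):
    """Keep the k largest-magnitude coefficients and the norm of the omitted ones."""
    coeffs = haar(x)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = {i: coeffs[i] for i in order[:k]}
    omitted_energy = sum(c * c for i, c in enumerate(coeffs) if i not in kept)
    return kept, math.sqrt(omitted_energy)


def bounds(query, compressed):
    """Lower and upper bounds on the Euclidean distance to the original sequence."""
    kept, err = compressed
    q = haar(query)
    # Contribution of the positions where the data coefficients are known exactly.
    known = sum((q[i] - c) ** 2 for i, c in kept.items())
    # Norm of the query restricted to the positions the compression dropped.
    q_rest = math.sqrt(sum(c * c for i, c in enumerate(q) if i not in kept))
    # Triangle inequality on the unknown part: | q_rest - err | <= dist <= q_rest + err.
    lower = math.sqrt(known + (q_rest - err) ** 2)
    upper = math.sqrt(known + (q_rest + err) ** 2)
    return lower, upper


def euclidean(a, b):
    """Exact Euclidean distance between two uncompressed sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
```

In a nearest-neighbor search, a candidate whose lower bound already exceeds the best distance found so far can be discarded without decompressing it. Tighter bounds discard more candidates, which is the pruning power the abstract refers to.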

### Citations

513 | A quantitative analysis and performance study for similarity-search methods
- Weber, Schek, et al.
- 1998
Citation Context: ...e to further reduce the search space (e.g. the creation of an index on the compressed features). However, the steps that we described are rudimentary in the majority of search and indexing techniques [16, 17]. Additionally, the aforementioned search procedure constitutes a bias-free approach to evaluating the search performance of a technique, since it does not depend on any implementation details. We uti...

287 | Bursty and Hierarchical Structure in Streams
- Kleinberg
Citation Context: ...tical demand pattern. (4) Identification of news events: Query logs can help understand and predict behavioral patterns [5]. Important events usually manifest themselves as bursts in the query demand [6, 7]. News travels fast, and web queries travel even faster. By monitoring increasing demands in a query, search engines can accurately pinpoint developing news events. (5) Advertising impact: The financia...

247 | Exact indexing of dynamic time warping
- Keogh
- 2002
Citation Context: ...e to further reduce the search space (e.g. the creation of an index on the compressed features). However, the steps that we described are rudimentary in the majority of search and indexing techniques [16, 17]. Additionally, the aforementioned search procedure constitutes a bias-free approach to evaluating the search performance of a technique, since it does not depend on any implementation details. We uti...

237 | Locally adaptive dimensionality reduction for indexing large time series databases
- Chakrabarti, Keogh, et al.
- 2002
Citation Context: ... utilize two query examples and depict their approximation under various compression techniques, such as Piecewise Aggregate Approximation (PAA) [18], Adaptive Piecewise Constant Approximation (APCA) [19] (high energy Haar coefficients), Chebyshev polynomials [20], first Fourier coefficients [12] and high energy Fourier coefficients ([7]). We observe that the sequence reconstruction error e is genera...

232 | On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. Data Mining and Knowledge Discovery 7
- Keogh, Kasetty
- 2003
Citation Context: ...arities under arbitrary time-shift. The Periodic Distance can be reverted into a Euclidean computation of the magnitude components in the frequency domain distance measures in the mining bibliography [21]. Here we illustrate briefly how our methodology can be applied on a wider range of linear and even non-linear distance functions. We examine the applicability of the proposed technique under two dist...

151 | Fast time sequence indexing for arbitrary Lp norms
- Yi, Faloutsos
- 2000
Citation Context: ...y is retained even when highly compressing the data. We utilize two query examples and depict their approximation under various compression techniques, such as Piecewise Aggregate Approximation (PAA) [18], Adaptive Piecewise Constant Approximation (APCA) [19] (high energy Haar coefficients), Chebyshev polynomials [20], first Fourier coefficients [12] and high energy Fourier coefficients ([7]). We obse...

115 | Efficient Similarity Search
- Agrawal, Faloutsos, et al.
- 1993
Citation Context: ...ary to all the above approaches, by allowing them to scale up to even larger dataset sizes. In the data-mining community, search on time-series under the Euclidean metric has been studied extensively [12, 13, 14] but, typically, compression using the first Fourier or wavelets are considered. [7] studies the use of diverse sets of coefficients, but this is the first work that offers the tightest possible lower...

62 | Semantic similarity between search engine queries using temporal correlation
- Chien, Immorlica
- 2005
Citation Context: ...Related Work Previous work considered various applications of temporal sequences on weblogs. [8] examines the discovery of causal relationships across query logs by deploying an event causality test. [9], [10] study similarity search and clustering in query data based on metrics such as correlation and periodicity. While the above utilize linear metrics to quantify the similarity, [5] examines the us...

57 | Efficient Retrieval of Similar Time Sequences Using DFT
- Rafiei, Mendelzon
- 1998
Citation Context: ...ary to all the above approaches, by allowing them to scale up to even larger dataset sizes. In the data-mining community, search on time-series under the Euclidean metric has been studied extensively [12, 13, 14] but, typically, compression using the first Fourier or wavelets are considered. [7] studies the use of diverse sets of coefficients, but this is the first work that offers the tightest possible lower...

52 | Indexing spatio-temporal trajectories with Chebyshev polynomials
- Cai, Ng
- 2004
Citation Context: ...nder various compression techniques, such as Piecewise Aggregate Approximation (PAA) [18], Adaptive Piecewise Constant Approximation (APCA) [19] (high energy Haar coefficients), Chebyshev polynomials [20], first Fourier coefficients [12] and high energy Fourier coefficients ([7]). We observe that the sequence reconstruction error e is generally lower when using techniques that utilize the highest ene...

24 | Improving Query Spelling Correction Using Web Search Results
- Chen, Li, et al.
Citation Context: ...ing correction: No dictionary or ontology can cover the wide range of keywords that appear on the web. However, relationships between keywords can be deduced by the systematic study of the query logs [4]. Figure 2(b) illustrates an instance of such an example, for the query ‘florida’ and the misspelled keyword ‘flordia’, which exhibits an almost identical demand pattern. (4) Identification of news ev...

24 | Why We Search: Visualizing and Predicting User Behavior
- Adar, Weld, et al.
- 2007
Citation Context: ...ry ‘florida’ and the misspelled keyword ‘flordia’, which exhibits an almost identical demand pattern. (4) Identification of news events: Query logs can help understand and predict behavioral patterns [5]. Important events usually manifest themselves as bursts in the query demand [6, 7]. News travels fast, and web queries travel even faster. By monitoring increasing demands in a query, search engines c...

14 | Time-dependent semantic similarity measure of queries using historical click-through data
- Zhao, Hoi, et al.
Citation Context: ...a based on metrics such as correlation and periodicity. While the above utilize linear metrics to quantify the similarity, [5] examines the use of non-linear metrics such as Time-Warping. Finally, in [11] the authors examine a similar application of search on temporal logs, but using clickthrough data. However, none of the above work examines how to tailor search based on compressed representations of...

12 | Rotation invariant indexing of shapes and line drawings
- Vlachos, Vagena, et al.
- 2005
Citation Context: ...res are very flexible, because they allow phase-invariant matching, and have been used with great success for a multitude of difficult mining tasks, including the rotation invariant matching of shapes [22, 23]. They can be used as an inexpensive substitute of DTW when speed is of the essence. The periodic distance computes in its core a Euclidean distance in the frequency domain, therefore our bounding techniq...

7 | Multilevel filtering for high dimensional nearest neighbor search
- Wang, Wang
Citation Context: ...ary to all the above approaches, by allowing them to scale up to even larger dataset sizes. In the data-mining community, search on time-series under the Euclidean metric has been studied extensively [12, 13, 14] but, typically, compression using the first Fourier or wavelets are considered. [7] studies the use of diverse sets of coefficients, but this is the first work that offers the tightest possible lower...

2 | Measuring the Meaning
- Lie, Jones, et al.
- 2005
Citation Context: ...ed Work Previous work considered various applications of temporal sequences on weblogs. [8] examines the discovery of causal relationships across query logs by deploying an event causality test. [9], [10] study similarity search and clustering in query data based on metrics such as correlation and periodicity. While the above utilize linear metrics to quantify the similarity, [5] examines the use of n...

2 | Exact Indexing of Shapes under Rotation Invariance with Arbitrary Representations and Distance Measures
- Keogh, Wei, et al.
- 2006
Citation Context: ...res are very flexible, because they allow phase-invariant matching, and have been used with great success for a multitude of difficult mining tasks, including the rotation invariant matching of shapes [22, 23]. They can be used as an inexpensive substitute of DTW when speed is of the essence. The periodic distance computes in its core a Euclidean distance in the frequency domain, therefore our bounding techniq...

1 | Extracting User Behavior by Web
- Otsuka, Toyoda, et al.
- 2004
Citation Context: ... affinity in the demand trends clearly suggests a semantic relation between the specific keywords. Generally speaking, as previous studies note: “user behavior is deeply related to search keyword[s]” [1]. One can distill this behavior, which can prove beneficial in a variety of applications: (1) Search engine optimization: Understanding the semantic similarity between keywords can assist in construct...

1 | Clustering of Search Engine Keywords Using Access Logs
- Otsuka, Kitsuregawa
- 2006
Citation Context: ...cations: (1) Search engine optimization: Understanding the semantic similarity between keywords can assist in constructing more accurate keyword taxonomies and achieving better clustering of keywords [2]. This can serve in providing better search results and ultimately help understand the true relationship between web pages. A number of features can assist in this process, such as repetition in the s...

1 | Examining Repetition
- Sanderson, Dumais
- 2007
Citation Context: ...e in providing better search results and ultimately help understand the true relationship between web pages. A number of features can assist in this process, such as repetition in the search behavior [3], something that is easily conveyed by the temporal representation of the query demand. (2) Keyword recommendation: Related queries are manifested as similar demand patterns. A search engine can explo...