
## Disk Aware Discord Discovery: Finding Unusual Time Series in Terabyte Sized


Citations: 20 (6 self)

### Citations

3800 | Density Estimation for Statistics and Data Analysis
- Silverman
- 1986
Citation Context ...the distance based outliers [14]. The definition can be generalized further to compute the average distance to all k nearest neighbors, which is in fact the non-parametric density estimation approach =-=[20]-=-. The algorithm proposed in this work can easily be adapted with any of these outlier definitions. We use Definition 1 because of its intuitive interpretation. Our choice is further justified by the e... |

3437 | MapReduce: Simplified data processing on large clusters
- Dean, Ghemawat
Citation Context ...doption of these methods across the global data mining community. Recently, however, an intuitive yet extremely scalable framework for parallel data mining has emerged, namely the MapReduce framework =-=[13]-=-. MapReduce is quickly turning into a parallel data mining standard and is already adopted by large companies, such as Google, Yahoo! and Microsoft. The framework operates in two steps. All examples i... |

532 | Fast subsequence matching in time-series databases
- Faloutsos, Ranganathan, et al.
- 1994
Citation Context ...avelets [9]. However, we are working with just the raw data. It is worth explaining why. Most time series data mining algorithms achieve speed-up with the Gemini framework (or some variation thereof) =-=[8]-=-. The basic idea is to approximate the full dataset in main memory, approximately solve the problem at hand, and then make (hopefully few) accesses to the disk to confirm or adjust the solution. Note ... |

516 | LOF: Identifying density-based local outliers.
- Breunig, Kriegel, et al.
- 2000
Citation Context ... distance based algorithm is a parallel modification of Bay’s randomized nested loop algorithm [7], and the density based version is a modification of the popular local outlier factor (LOF) algorithm =-=[9]-=-. Both parallel variants proceed in a similar fashion: First the data space is partitioned across different computers and outliers, local for each computer, are identified. Subsequently, the results f... |

359 | Algorithms for mining distance-based outliers in large datasets.
- Knorr, Ng
- 1998

325 | On the need for time series data mining benchmarks: A survey and empirical demonstration.
- Keogh, Kasetty
- 2002
Citation Context ... all subsequences are stored in the database in the above normalized form. This requirement is imposed so that the nearest neighbor search is invariant to transformations, such as shifting or scaling =-=[12]-=-. 4. Finding Discords In Secondary Storage So far we have introduced the notion of time series discords, which is the focus of the current work. Here, we are going to present an efficient algorithm fo... |

203 | A cost model for nearest neighbor search in highdimensional data space
- Keim, Berchtold, et al.
- 1997
Citation Context ...ion (nndd) of the dataset, and more precisely the number of elements that fall in its tail. Computing the nndd, however, is hard, especially in high dimensional spaces as is the case with time series =-=[4]-=-[21]. The available methods require that random portions of the space are sampled and the nearest neighbor distances in those portions to be computed. Unfortunately, for a robust estimate, this requir... |

185 | Exact discovery of time series motifs.
- Mueen, Keogh, et al.
- 2009
Citation Context ...ead to a conclusion that the subsequence C is not a rare example in the database. In these cases, when p1 and p2 are not “significantly” different, the subsequences C and M are called trivial matches =-=[5]-=-. The positions p1 and p2 are significantly different with respect to a distance function Dist, if there exists a subsequence Q starting at position p3, such that p1 < p3 < p2 and Dist(C, M) < Dist(C,... |

159 | Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule
- Bay, Schwabacher
- 2003
Citation Context ...Lozano et al. [22] propose two algorithms for parallel mining of distance and density based outliers. The distance based algorithm is a parallel modification of Bay’s randomized nested loop algorithm =-=[7]-=-, and the density based version is a modification of the popular local outlier factor (LOF) algorithm [9]. Both parallel variants proceed in a similar fashion: First the data space is partitioned acro... |

108 | HOT SAX: Efficiently finding the most unusual time series subsequence, in Data Mining,
- Keogh, Lin, et al.
- 2005
Citation Context ...t our algorithm can tackle multi-gigabyte data sets containing tens of millions of time series in just a few hours. 2. Related Work And Background The time series discord definition was introduced in =-=[13]-=-. Since then, it has attracted considerable interest and followup work. For example, [6] provide independent confirmation of the utility of discords for discovering abnormal heartbeats, in [3] the aut... |

69 | Fast time series classification using numerosity reduction, in:
- Xi, Keogh, et al.
- 2006
Citation Context ...later use to prune non-outlier examples. They explicitly require that the distance function used be a metric, but for time series data non-metric functions have been demonstrated to be often superior =-=[35]-=-. The method again is intended for a lower dimensional data and is demonstrated to require two linear scans of the disk. Another method that scales to large disk resident datasets, requiring only two ... |

54 | Classifying spatiotemporal object trajectories using unsupervised learning in the coefficient feature space,
- Naftel, Khalid
- 2006
Citation Context ...eak periodicity corresponding to calendar months, this query has a strong periodicity of 29.5 days, corresponds to the synodic month. 6.1.3 Trajectory Data We obtained two trajectory datasets used in =-=[16]-=- and [17] respectively, which have been purposefully created to test anomaly detection in video sequences. The time series are two dimensional (comprised of the x and y coordinates for each data point... |

45 | Mining deviants in a time series database.
- Jagadish, Koudas, et al.
- 1999
Citation Context ....5 hours of CPU time (on a 2.4GHz machine). In contrast, we can handle a dataset of size two million objects with dimensionality 512 in less than two hours, most of which is I/O time. Jagadish et al. =-=[11]-=- produced an influential paper on finding unusual time series (which they call deviants) with a dynamic programming approach. Again this method is quadratic in the length of the time series, and thus ... |

44 | The choice of reference points in best-match file searching
- Shapiro
- 1977
Citation Context ...ce function, using pivots for indexing poses additional challenges too. For example, selecting good pivot points might itself require to first detect a set of outliers. This is so, because as Shapiro =-=[28]-=- suggests good pivots tend to be points that are far from any dense region in the data. The problem of outlier detection in higher dimensional spaces was targeted in an influential paper by Jagadish et... |

33 | Incremental local outlier detection for data streams
- Pokrajac
- 2007
Citation Context ...dicity corresponding to calendar months, this query has a strong periodicity of 29.5 days, corresponds to the synodic month. 6.1.3 Trajectory Data We obtained two trajectory datasets used in [16] and =-=[17]-=- respectively, which have been purposefully created to test anomaly detection in video sequences. The time series are two dimensional (comprised of the x and y coordinates for each data point), and ar... |

24 | Fast mining of distance-based outliers in high dimensional spaces
- Ghoting, Parthasarathy, et al.
- 2006
Citation Context ...ters, including choices for discretization of real variables, a maximum number of iterations for EM (a sub-routine), the number of mixture components, etc. In a sequence of papers Otey and colleagues =-=[10]-=- introduce a series of algorithms for mining distance based outliers. Their approach has many advantages, including the ability to handle both real-valued and discrete data. Furthermore, like our appr... |

22 | Mining Distance-Based Outliers from Large Databases in Any Metric Space
- Tao, Xiao, et al.
- 2006
Citation Context ... dataset of size two million objects with dimensionality 512 in less than an hour, most of which is I/O time. Distance based outliers are also the problem of study in Knorr et al. [21] and Tao et al. =-=[31]-=-. Both works discuss a quadratic (in the dataset size) nested loop algorithm for outlier detection and subsequently suggest ways for its improvement. Knorr et al. [21] propose an algorithm that perfor... |

20 | SAXually Explicit Images: Finding Unusual Shapes
- Wei, Keogh, et al.
- 2006
Citation Context ...k. For example, [6] provide independent confirmation of the utility of discords for discovering abnormal heartbeats, in [3] the authors apply discord discovery to electricity consumption data, and in =-=[24]-=- the authors modify the definition slightly to discover unusual shapes. However, all discord discovery algorithms, and indeed virtually all algorithms for discovering unusual time series under any def... |

20 | Detecting Distance-Based Outliers in Streams of Data.
- Angiulli, Fassetti
- 2007
Citation Context ...sets, requiring only two linear scans of the disk, was recently proposed by Angiulli and colleagues [6]. An extension of the method for online detection of distance based outliers from streaming data =-=[5]-=- has also been presented by the same authors. To perform faster range queries, the algorithm again builds an index in main memory, based on the concept of pivot points. Apart of the already pointed pr... |

16 | EKG Anomaly Detection via Time Series Analysis.
- Chuah, Fu
- 2007
Citation Context ...eries in just a few hours. 2. Related Work And Background The time series discord definition was introduced in [13]. Since then, it has attracted considerable interest and followup work. For example, =-=[6]-=- provide independent confirmation of the utility of discords for discovering abnormal heartbeats, in [3] the authors apply discord discovery to electricity consumption data, and in [24] the authors mo... |

13 | Finding Time Series Discords based on Haar Transform
- Fu, Leung, et al.
Citation Context ...hat will increase sufficiently the initial size |C|. 6. Empirical Evaluation In this section we conduct two kinds of experiments. Although the utility of discords has been noted before, e.g. in [3][6]=-=[9]-=-[13][24], we first provide additional examples of its usefulness for areas where large time series databases are traditionally encountered. Then we empirically demonstrate the scalability of our algor... |

12 | Mix nets: Factored Mixtures of Gaussians in Bayesian Networks with Mixed Continuous and Discrete
- Davies, Moore
- 2000
Citation Context ...ith our ability to collect and store data. There are only a handful of works in the literature that have addressed anomaly detection in datasets of anything like the scale considered in this work. In =-=[7]-=- the authors consider an astronomical data set taken from the Sloan Digital Sky Survey, with 111,456 records and 68 variables. They find anomalies by building a Bayesian network and then looking for o... |

11 | Multilevel filtering for high dimensional nearest neighbor search
- Wang, Wang
- 2000
Citation Context ...bounds are required to prune objects that cannot be the furthest nearest neighbor. While there has been some work on providing upper bounds for time series, these bounds tend to be exceptionally weak =-=[22]-=-. Intuitively this makes sense, there are only so many ways two time series can be similar to each other, hence the ability to tightly lower bound. However, there is a much larger space of possible wa... |

11 | Parallel algorithms for distance-based and density-based outliers,
- Lozano, Acuna
- 2005
Citation Context ...parallel outlier detection problem too. For example, Hung et al. [17] introduce a parallel version of the quadratic nested loop algorithm for distance based outliers, discussed in [21]. Lozano et al. =-=[22]-=- propose two algorithms for parallel mining of distance and density based outliers. The distance based algorithm is a parallel modification of Bay’s randomized nested loop algorithm [7], and the densi... |

8 | Very efficient mining of distance-based outliers
- Angiulli, Fassetti
Citation Context ...ated to require two linear scans of the disk. Another method that scales to large disk resident datasets, requiring only two linear scans of the disk, was recently proposed by Angiulli and colleagues =-=[6]-=-. An extension of the method for online detection of distance based outliers from streaming data [5] has also been presented by the same authors. To perform faster range queries, the algorithm again b... |

8 | Parallel mining of outliers in large database
- Hung, Cheung
- 2002
Citation Context ...ine learning algorithms, through the means of parallelization across grids of multiple computers. A limited number of works target the parallel outlier detection problem too. For example, Hung et al. =-=[17]-=- introduce a parallel version of the quadratic nested loop algorithm for distance based outliers, discussed in [21]. Lozano et al. [22] propose two algorithms for parallel mining of distance and densi... |

6 | Finding outlier light-curves in catalogs of periodic variable stars. Monthly Notices of the Royal Astronomical Society
- Protopapas, Giammarco, et al.
- 2006
Citation Context ...or being anomalous. For instance, the topmost anomaly in every class has ranking 0, the second anomaly has ranking 1, and so on. This ordering is based on the results of the first method presented in =-=[18]-=-. The method is an O(n^2) algorithm that exhaustively computes the similarity (via cross correlation) between each pair of light-curves. The anomaly score for each light-curve is simply the weighted ... |

6 | On estimators of the nearest neighbour distance distribution function for stationary point processes
- Stoyan
- 2006
Citation Context ... (nndd) of the dataset, and more precisely the number of elements that fall in its tail. Computing the nndd, however, is hard, especially in high dimensional spaces as is the case with time series [4]=-=[21]-=-. The available methods require that random portions of the space are sampled and the nearest neighbor distances in those portions to be computed. Unfortunately, for a robust estimate, this requires s... |

5 | Accessing scientific data: Simpler is better
- Riedewald, Agrawal, et al.
- 2003
Citation Context ...id that random access to just 10% of a disk resident dataset takes about the same time as a linear search over the entire data. In fact, recent studies suggest that this gap is widening. For example, =-=[19]-=- notes that the internal data rate of IBM’s hard disks improved from about 4 MB/sec to more than 60 MB/sec. In the same time period, the positioning time only improved from about 18 msec to 9 msec. Th... |

4 | The AAVSO data validation project
- Malatesta, Beck, et al.
- 2005
Citation Context ...time series at hand can fit in main memory. However, for many applications this is not the case. For example, multi-terabyte time series datasets are the norm in astronomy =-=[15]-=-, while the daily volume of web queries logged by search engines is even larger. Confronted with data of such scale current algorithms resort to numerous scans of the external media and are thus intra... |

3 | Mining time series for identifying unusual sub-sequences with applications
- Ameen, Basha
- 2006
Citation Context ...ced in [13]. Since then, it has attracted considerable interest and followup work. For example, [6] provide independent confirmation of the utility of discords for discovering abnormal heartbeats, in =-=[3]-=- the authors apply discord discovery to electricity consumption data, and in [24] the authors modify the definition slightly to discover unusual shapes. However, all discord discovery algorithms, and ... |

3 | Hierarchical agglomerative clustering based t-outlier detection
- Wang, Fortier, et al.
- 2006
Citation Context ...t paper suggests that finding unusual time series in financial datasets could be used to allow diversification of an investment portfolio, which in turn is essential for reducing portfolio volatility =-=[23]-=-. Despite its importance, the detection of unusual time series remains relatively unstudied when data reside on external storage. Most existing approaches demonstrate efficient detection of anomalous ... |