Anomaly Detection: A Survey
, 2007
Cited by 69 (1 self)
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
Disk Aware Discord Discovery: Finding Unusual Time Series in Terabyte Sized
Cited by 11 (4 self)
The problem of finding unusual time series has recently attracted much attention, and several promising methods are now in the literature. However, virtually all proposed methods assume that the data reside in main memory. For many real-world problems this is not the case. For example, in astronomy, multi-terabyte time series datasets are the norm. Most current algorithms, when faced with data that cannot fit in main memory, resort to multiple scans of the disk/tape and are thus intractable. In this work we show how one particular definition of unusual time series, the time series discord, can be discovered with a disk-aware algorithm. The proposed algorithm is exact and requires only two linear scans of the disk with a tiny buffer of main memory. Furthermore, it is very simple to implement. We use the algorithm to provide further evidence of the effectiveness of the discord definition in areas as diverse as astronomy, web query mining, video surveillance, etc., and show the efficiency of our method on datasets which are many orders of magnitude larger than anything else attempted in the literature.
Wat: Finding top-k discords in time series database
- In Proceedings of 7th SIAM International Conference on Data Mining
, 2007
Cited by 5 (0 self)
Finding discords in a time series database is an important problem in a great variety of applications, such as space shuttle telemetry, mechanical industry, biomedicine, and financial data analysis. However, most previous methods for this problem require too many parameter settings, which are difficult for users to choose. The best approach known to us, which has comparatively few parameters, still requires users to choose a word size for the compression of subsequences. In this paper, we propose an algorithm based on the Haar wavelet and an augmented trie to mine the top-K discords from a time series database; it can dynamically determine the word size for compression. Due to the characteristics of the Haar wavelet transform, our algorithm has greater pruning power than previous approaches. Experiments with some annotated datasets attest to both the effectiveness and the efficiency of our algorithm.
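The abstract above relies on the Haar wavelet transform but does not spell it out. As a rough illustration only, the following sketch computes an unnormalized (pairwise averaging) Haar decomposition in Python. The function name `haar_transform` and the power-of-two length restriction are assumptions made for this example; this is not the paper's algorithm and does not include its augmented trie.

```python
def haar_transform(values):
    """Unnormalized Haar decomposition (pairwise averages and half-differences).

    Hypothetical helper for illustration. The input length must be a power
    of two. The result holds the overall mean first, followed by detail
    coefficients ordered from coarsest to finest.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in values]
    length = n
    while length > 1:
        half = length // 2
        # Each pass replaces the current approximation by pairwise averages
        # and stores the corresponding half-differences right after them.
        averages = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        details = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:length] = averages + details
        length = half
    return out


if __name__ == "__main__":
    print(haar_transform([4, 6, 10, 12]))
```

Because the leading coefficients describe the sequence at the coarsest resolution, a prefix of the output acts as a compressed representation whose length can be increased one level at a time. That multi-resolution property is presumably what allows the word size to be chosen dynamically, although the abstract does not give the details.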
Learning from Time Series in the Presence of Noise: Unsupervised and Semi-Supervised Approaches
, 2008
Needless to say, I would not have reached this stage of graduate school if it were not for my advisor, Dr. Eamonn Keogh. I have never worked with another person with so much drive and passion for what they do, and I just hope that I acquired at least part of these qualities too. Eamonn taught me the basic practices and knowledge a data mining researcher needs to have, but for what it's worth, it is his attitude that probably made the biggest impact on me. Every time I talked to him, he was positive and encouraging. Thank you, Eamonn, for being there for me and for the rest of your students! I would like to thank my dissertation committee: Dr. Vassilis Tsotras and Dr. Stefano Lonardi. Vassilis sent me my acceptance letter exactly five years ago, promising that I would enjoy the atmosphere at UC Riverside. I really did! Stefano was there for my most important publication: the first one. From him I learned that every detail matters, that every word needs to be accurately placed. I had three great internships with Yahoo!. I worked with incredible people to whom I am greatly thankful. The first summer my mentor was Dr. Dennis DeCoste. Dennis inspired many of my subsequent interests, such as ensemble learning and support vector
Automated Load Curve Data Cleansing in Power Systems
Abstract: Load curve data refers to the electric energy consumption recorded by meters at certain time intervals at delivery points or end user points, and contains vital information for day-to-day operations, system analysis, system visualization, system reliability performance, energy saving and adequacy in system planning. Unfortunately, it is unavoidable that load curves contain corrupted data and missing data due to various random failure factors in meters and transfer processes. This paper presents techniques based on B-spline smoothing and kernel smoothing to automatically cleanse corrupted and missing data. In implementation, a man–machine dialogue procedure is proposed to enhance the performance. Experimental results on real British Columbia Transmission Corporation (BCTC) load curve data demonstrate the effectiveness of the presented solution. Index Terms: Load management, load modeling, power systems, smoothing methods, power quality.
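To make the kernel smoothing idea concrete, here is a minimal sketch of a Nadaraya-Watson smoother with a Gaussian kernel that fills missing readings in a load curve. The Gaussian kernel, the function name `kernel_smooth`, the use of `None` for missing readings, and the fixed `bandwidth` parameter are all assumptions for this example; the abstract does not say which kernel or bandwidth the paper uses, and the B-spline variant and the man–machine dialogue are not shown.

```python
import math


def kernel_smooth(times, loads, bandwidth):
    """Nadaraya-Watson estimate of the load at every time stamp.

    Hypothetical helper for illustration. `loads` may contain None for
    missing readings; only observed readings contribute to the estimate.
    """
    observed = [(t, y) for t, y in zip(times, loads) if y is not None]
    if not observed:
        raise ValueError("need at least one observed reading")
    smoothed = []
    for t in times:
        # Gaussian weight of each observed reading relative to time t.
        weights = [math.exp(-0.5 * ((t - s) / bandwidth) ** 2)
                   for s, _ in observed]
        total = sum(weights)
        if total == 0.0:
            # All weights underflowed: bandwidth too small for this gap.
            raise ValueError("bandwidth too small to reach any reading")
        estimate = sum(w * y for w, (_, y) in zip(weights, observed)) / total
        smoothed.append(estimate)
    return smoothed


if __name__ == "__main__":
    hours = [0, 1, 2, 3, 4]
    readings = [10.0, 12.0, None, 12.0, 10.0]  # reading at hour 2 is missing
    print(kernel_smooth(hours, readings, bandwidth=1.0))
```

A missing reading is replaced by the smoothed value at its time stamp. Corrupted readings could be flagged in a similar spirit by comparing each observed value with its smoothed estimate, but the threshold for doing so is a design choice the abstract leaves open.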
Faster and Parameter-Free Discord Search in Quasi-Periodic Time Series
Abstract. Time series discord has proven to be a useful concept for time-series anomaly identification. To search for discords, various algorithms have been developed. Most of these algorithms rely on pre-building an index (such as a trie) for subsequences. Users of these algorithms are typically required to choose optimal values for word-length and/or alphabet-size parameters of the index, which are not intuitive. In this paper, we propose an algorithm to directly search for the top-K discords, without the requirement of building an index or tuning external parameters. The algorithm exploits quasi-periodicity present in many time series. For quasi-periodic time series, the algorithm gains significant speedup by reducing the number of calls to the distance function.
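For orientation, the sketch below shows a brute-force baseline for top-K discord search, assuming the commonly used definition of a discord as the subsequence whose distance to its nearest non-overlapping neighbor is largest. It is not the algorithm proposed in the abstract above: it exploits no quasi-periodicity and makes a quadratic number of distance calls, which is exactly the cost such algorithms aim to reduce. The names `top_k_discords` and `euclidean`, the raw (not z-normalized) Euclidean distance, and the tie-breaking by start index are assumptions for this example.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_discords(series, m, k):
    """Brute-force top-k discords of length m (hypothetical baseline).

    Returns up to k (start_index, nearest_neighbor_distance) pairs, most
    anomalous first. Matches that overlap the query (|i - j| < m) are
    ignored, and reported discords do not overlap each other.
    """
    n = len(series) - m + 1
    if m <= 0 or n <= 0:
        raise ValueError("subsequence length must be in 1..len(series)")
    subs = [series[i:i + m] for i in range(n)]
    scored = []
    for i in range(n):
        # Distance to the nearest non-overlapping subsequence, if any.
        nearest = min((euclidean(subs[i], subs[j])
                       for j in range(n) if abs(i - j) >= m), default=None)
        if nearest is not None:
            scored.append((nearest, i))
    # Largest nearest-neighbor distance first; earlier start breaks ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    result = []
    for dist, start in scored:
        if all(abs(start - chosen) >= m for chosen, _ in result):
            result.append((start, dist))
            if len(result) == k:
                break
    return result


if __name__ == "__main__":
    data = [0, 1, 0, 1, 0, 1, 0, 5, 0, 1, 0, 1, 0, 1, 0, 1]
    print(top_k_discords(data, m=2, k=1))
```

In this toy series the spike at index 7 makes the subsequences that contain it the only ones without a close match elsewhere, so they receive the largest nearest-neighbor distance.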
Anomaly Detection for Symbolic Sequences . . .
, 2009
This thesis deals with the problem of anomaly detection for sequence data. Anomaly detection has been a widely researched problem in several application domains such as system health management, intrusion detection, healthcare, bioinformatics, fraud detection, and mechanical fault detection. Traditional anomaly detection techniques analyze each data instance (as a univariate or multivariate record) independently, and ignore the sequential aspect of the data. Often, anomalies in sequences can be detected only by analyzing data instances together as a sequence, and hence cannot be detected by traditional anomaly detection techniques. The problem of anomaly detection for sequence data is a rich area of research for two main reasons. First, sequences can be of different types, e.g., symbolic sequences, time series data, etc., and each type of sequence poses a unique set of problems. Second, anomalies in sequences can be defined in multiple ways and hence there are different problem formulations. In this thesis we focus on solving one particular problem formulation called semi-supervised anomaly detection. We study the problem

