Results 1  10
of
154
Anomaly Detection: A Survey
, 2007
"... Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and c ..."
Abstract

Cited by 511 (5 self)
 Add to MetaCart
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the di®erent directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
Feature Bagging for Outlier Detection
, 2005
"... Outlier detection has recently become an important problem in many industrial and financial applications. In this paper, a novel feature bagging approach for detecting outliers in very large, high dimensional and noisy databases is proposed. It combines results from multiple outlier detection algori ..."
Abstract

Cited by 54 (3 self)
 Add to MetaCart
Outlier detection has recently become an important problem in many industrial and financial applications. In this paper, a novel feature bagging approach for detecting outliers in very large, high dimensional and noisy databases is proposed. It combines results from multiple outlier detection algorithms that are applied using different set of features. Every outlier detection algorithm uses a small subset of features that are randomly selected from the original feature set. As a result, each outlier detector identifies different outliers, and thus assigns to all data records outlier scores that correspond to their probability of being outliers. The outlier scores computed by the individual outlier detection algorithms are then combined in order to find the better quality outliers. Experiments performed on several synthetic and real life data sets show that the proposed methods for combining outputs from multiple outlier detection algorithms provide nontrivial improvements over the base algorithm.
Outlier Detection in Sensor Networks
 MobiHoc'07
, 2007
"... Outlier detection has many important applications in sensor networks, e.g., abnormal event detection, animal behavior change, etc. It is a difficult problem since global information about data distributions must be known to identify outliers. In this paper, we use a histogrambased method for outlie ..."
Abstract

Cited by 34 (1 self)
 Add to MetaCart
Outlier detection has many important applications in sensor networks, e.g., abnormal event detection, animal behavior change, etc. It is a difficult problem since global information about data distributions must be known to identify outliers. In this paper, we use a histogrambased method for outlier detection to reduce communication cost. Rather than collecting all the data in one location for centralized processing, we propose collecting hints (in the form of a histogram) about the data distribution, and using the hints to filter out unnecessary data and identify potential outliers. We show that this method can be used for detecting outliers in terms of two different definitions. Our simulation results show that the histogram method can dramatically reduce the communication cost.
Incremental local outlier detection for data streams
 In Proceedings of IEEE Symposium on Computational Intelligence and Data Mining
, 2007
"... Abstract. Outlier detection has recently become an important problem in many industrial and financial applications. This problem is further complicated by the fact that in many cases, outliers have to be detected from data streams that arrive at an enormous pace. In this paper, an incremental LOF (L ..."
Abstract

Cited by 33 (1 self)
 Add to MetaCart
(Show Context)
Abstract. Outlier detection has recently become an important problem in many industrial and financial applications. This problem is further complicated by the fact that in many cases, outliers have to be detected from data streams that arrive at an enormous pace. In this paper, an incremental LOF (Local Outlier Factor) algorithm, appropriate for detecting outliers in data streams, is proposed. The proposed incremental LOF algorithm provides equivalent detection performance as the iterated static LOF algorithm (applied after insertion of each data record), while requiring significantly less computational time. In addition, the incremental LOF algorithm also dynamically updates the profiles of data points. This is a very important property, since data profiles may change over time. The paper provides theoretical evidence that insertion of a new data point as well as deletion of an old data point influence only limited number of their closest neighbors and thus the number of updates per such insertion/deletion does not depend on the total number of points N in the data set. Our experiments performed on several simulated and real life data sets have demonstrated that the proposed incremental LOF algorithm is computationally efficient, while at the same time very successful in detecting outliers and changes of distributional behavior in various data stream applications. I.
Problem diagnosis in largescale computing environments
 In ACM/IEEE Conf. on Supercomputing (SC
, 2006
"... ..."
(Show Context)
AngleBased Outlier Detection in Highdimensional Data
"... Detectingoutliersinalargesetofdataobjectsisamajor data mining task aiming at finding different mechanisms responsible for different groups of objects in a data set. All existing approaches, however, are based on an assessment of distances (sometimes indirectly by assuming certain distributions) in t ..."
Abstract

Cited by 31 (8 self)
 Add to MetaCart
(Show Context)
Detectingoutliersinalargesetofdataobjectsisamajor data mining task aiming at finding different mechanisms responsible for different groups of objects in a data set. All existing approaches, however, are based on an assessment of distances (sometimes indirectly by assuming certain distributions) in the fulldimensional Euclidean data space. In highdimensional data, these approaches are bound to deteriorate due to the notorious “curse of dimensionality”. In this paper, we propose a novel approach named ABOD (AngleBased Outlier Detection) and some variants assessing the variance in the angles between the difference vectors of a point to the other points. This way, the effects of the “curse of dimensionality ” are alleviated compared to purely distancebased approaches. A main advantage of our new approach is that our method does not rely on any parameter selection influencing the quality of the achieved ranking. In a thorough experimental evaluation, we compare ABOD to the wellestablished distancebased method LOF for various artificial and a real world data set and show ABOD to perform especially well on highdimensional data.
Fast mining of distancebased outliers in high dimensional datasets
 PAKDD 2006. LNCS (LNAI
, 2006
"... Defining outliers by their distance to neighboring data points has been shown to be an effective nonparametric approach to outlier detection. Existing algorithms for mining distancebased outliers do not scale to large, highdimensional data sets. In this paper, we present RBRP, a fast algorithm for ..."
Abstract

Cited by 26 (1 self)
 Add to MetaCart
(Show Context)
Defining outliers by their distance to neighboring data points has been shown to be an effective nonparametric approach to outlier detection. Existing algorithms for mining distancebased outliers do not scale to large, highdimensional data sets. In this paper, we present RBRP, a fast algorithm for mining distancebased outliers, particularly targeted at highdimensional data sets. RBRP scales loglinearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the stateoftheart, often by an order of magnitude.
Isolation Forest
 In ICDM ’08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining. IEEE Computer Society
"... Most existing modelbased approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. This paper proposes a fundamentally different modelbased method that explicitly isolates anomalies instead of profile ..."
Abstract

Cited by 25 (6 self)
 Add to MetaCart
(Show Context)
Most existing modelbased approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. This paper proposes a fundamentally different modelbased method that explicitly isolates anomalies instead of profiles normal points. To our best knowledge, the concept of isolation has not been explored in current literature. The use of isolation enables the proposed method, iForest, to exploit subsampling to an extent that is not feasible in existing methods, creating an algorithm which has a linear time complexity with a low constant and a low memory requirement. Our empirical evaluation shows that iForest performs favourably to ORCA, a nearlinear time complexity distancebased method, LOF and Random Forests in terms of AUC and processing time, and especially in large data sets. iForest also works well in high dimensional problems which have a large number of irrelevant attributes, and in situations where training set does not contain any anomalies. 1
Enhancing Data Analysis with Noise Removal
"... Removing objects that are noise is an important goal of data cleaning as noise hinders most types of data analysis. Most existing data cleaning methods focus on removing noise that is the result of lowlevel data errors that result from an imperfect data collection process, but data objects that a ..."
Abstract

Cited by 23 (5 self)
 Add to MetaCart
Removing objects that are noise is an important goal of data cleaning as noise hinders most types of data analysis. Most existing data cleaning methods focus on removing noise that is the result of lowlevel data errors that result from an imperfect data collection process, but data objects that are irrelevant or only weakly relevant can also significantly hinder data analysis. Thus, if the goal is to enhance the data analysis as much as possible, these objects should also be considered as noise, at least with respect to the underlying analysis. Consequently, there is a need for data cleaning techniques that remove both types of noise. Because data sets can contain large amount of noise, these techniques also need to be able to discard a potentially large fraction of the data. This paper explores four techniques intended for noise removal to enhance data analysis in the presence of high noise levels. Three of