Results 1–10 of 24
Anomaly Detection: A Survey
, 2007
"... Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and c ..."
Abstract

Cited by 186 (4 self)
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easy and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule
, 2003
"... Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic ..."
Abstract

Cited by 104 (4 self)
Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
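The nested loop with randomization and pruning described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and parameters are hypothetical, the outlier score is taken to be the k-th nearest-neighbor distance, and the pruning rule is the one the abstract names, a point is abandoned once its running k-NN distance estimate drops below the score of the weakest top-n outlier found so far.

```python
import math
import random

def top_outliers(data, n=5, k=3):
    """Distance-based outlier mining with a simple pruning rule (sketch).

    Scores each point by its k-th nearest-neighbor distance; a point is
    pruned as soon as that (monotonically shrinking) estimate falls below
    the score of the current n-th best outlier.
    """
    data = list(data)
    random.shuffle(data)      # random order is what makes pruning effective
    cutoff = 0.0              # score of the weakest outlier found so far
    top = []                  # (score, point) pairs, kept sorted descending
    for x in data:
        dists = []            # running k smallest distances from x
        pruned = False
        for y in data:
            if y is x:
                continue
            dists.append(math.dist(x, y))
            dists.sort()
            dists = dists[:k]
            # k-NN distance can only shrink as more points are seen, so
            # once it is below the cutoff, x cannot be a top-n outlier
            if len(dists) == k and dists[-1] < cutoff:
                pruned = True
                break
        if not pruned and len(dists) == k:
            top.append((dists[-1], x))
            top.sort(key=lambda t: -t[0])
            top = top[:n]
            if len(top) == n:
                cutoff = top[-1][0]
    return top
```

In the average case most points are pruned after a handful of distance computations, which is the source of the near linear behavior the abstract reports.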
Outlier mining in large high-dimensional data sets
 IEEE Transactions on Knowledge and Data Engineering
, 2005
"... In this paper a new definition of distancebased outlier and an algorithm, called HilOut, designed to efficiently detect the top n outliers of a large and highdimensional data set are proposed. Given an integer k, the weight of a point is defined as the sum of the distances separating it from its k ..."
Abstract

Cited by 23 (1 self)
In this paper a new definition of distance-based outlier and an algorithm, called HilOut, designed to efficiently detect the top n outliers of a large and high-dimensional data set are proposed. Given an integer k, the weight of a point is defined as the sum of the distances separating it from its k nearest neighbors. Outliers are those points scoring the largest weight values. The HilOut algorithm makes use of the notion of a space-filling curve to linearize the data set, and it consists of two phases. The first phase provides an approximate solution, within a rough factor, after the execution of at most d + 1 sorts and scans of the data set, with temporal cost quadratic in d and linear in N and in k, where d is the number of dimensions of the data set and N is the number of points in the data set. During this phase, the algorithm isolates candidate outlier points and reduces this set at each iteration. If the size of this set becomes n, then the algorithm stops, reporting the exact solution. The second phase calculates the exact solution with a final scan examining further the candidate outliers remaining after the first phase. Experimental results show that the algorithm always stops, reporting the exact solution, during the first phase after far fewer than d + 1 steps. We present both an in-memory and a disk-based implementation of the HilOut algorithm and a thorough scaling analysis for real and synthetic data sets, showing that the algorithm scales well in both cases.
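The weight definition above is easy to state directly. The sketch below computes it by brute force; the Hilbert space-filling-curve linearization that makes HilOut efficient is deliberately omitted, and the function names are illustrative, not from the paper.

```python
import numpy as np

def weights(points, k):
    """Exact (brute-force) version of the HilOut outlier weight:
    the sum of distances from each point to its k nearest neighbors.
    HilOut approximates this cheaply via space-filling curves."""
    pts = np.asarray(points, dtype=float)
    # full pairwise Euclidean distance matrix
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    w = []
    for i in range(len(pts)):
        d = np.delete(dist[i], i)   # exclude the point itself
        d.sort()
        w.append(d[:k].sum())       # sum of k nearest-neighbor distances
    return np.array(w)

def top_n_outliers(points, n, k):
    """Indices of the n points with the largest weight."""
    return list(np.argsort(-weights(points, k))[:n])
```

The brute-force version is quadratic in N; the point of HilOut's two phases is to reach the same top-n answer while scanning the linearized data set at most d + 1 times.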
Enhancing Data Analysis with Noise Removal
"... Removing objects that are noise is an important goal of data cleaning as noise hinders most types of data analysis. Most existing data cleaning methods focus on removing noise that is the result of lowlevel data errors that result from an imperfect data collection process, but data objects that a ..."
Abstract

Cited by 15 (5 self)
Removing objects that are noise is an important goal of data cleaning, as noise hinders most types of data analysis. Most existing data cleaning methods focus on removing noise that is the result of low-level data errors arising from an imperfect data collection process, but data objects that are irrelevant or only weakly relevant can also significantly hinder data analysis. Thus, if the goal is to enhance the data analysis as much as possible, these objects should also be considered as noise, at least with respect to the underlying analysis. Consequently, there is a need for data cleaning techniques that remove both types of noise. Because data sets can contain large amounts of noise, these techniques also need to be able to discard a potentially large fraction of the data. This paper explores four techniques intended for noise removal to enhance data analysis in the presence of high noise levels. Three of ...
Fast mining of distance-based outliers in high-dimensional datasets
 PAKDD 2006. LNCS (LNAI
, 2006
"... Defining outliers by their distance to neighboring data points has been shown to be an effective nonparametric approach to outlier detection. Existing algorithms for mining distancebased outliers do not scale to large, highdimensional data sets. In this paper, we present RBRP, a fast algorithm for ..."
Abstract

Cited by 14 (1 self)
Defining outliers by their distance to neighboring data points has been shown to be an effective nonparametric approach to outlier detection. Existing algorithms for mining distance-based outliers do not scale to large, high-dimensional data sets. In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional data sets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art, often by an order of magnitude.
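The core idea behind RBRP (Recursive Binning and Re-Projection) is to first group points into bins of approximate neighbors, so that when a nested-loop search later looks for each point's nearest neighbors, close neighbors are found early and the pruning cutoff takes effect sooner. The sketch below shows only the binning half, using a few k-means iterations as an assumed stand-in for the paper's recursive binning; function name and parameters are illustrative.

```python
import numpy as np

def assign_bins(points, n_bins=2, seed=0, iters=5):
    """Group points into bins of nearby points (a k-means stand-in for
    RBRP's recursive binning). A distance-based outlier search would
    then scan each point's own bin first, so its k-NN distance estimate
    shrinks quickly and pruning kicks in early. Full RBRP additionally
    recurses on oversized bins and orders points within a bin along
    their principal component."""
    pts = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = pts[rng.choice(len(pts), size=n_bins, replace=False)].copy()
    labels = np.zeros(len(pts), dtype=int)
    for _ in range(iters):                     # a few Lloyd iterations
        d2 = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(d2, axis=1)
        for b in range(n_bins):
            if np.any(labels == b):
                centers[b] = pts[labels == b].mean(axis=0)
    return labels
```

Combined with the pruning rule of nested-loop algorithms, this neighbor-locality preprocessing is what yields the log-linear scaling the abstract claims.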
Mining for misconfigured machines in grid systems
 In KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
, 2006
"... Grid systems are proving increasingly useful for managing the batch computing jobs of organizations. One wellknown example is Intel, whose internally developed NetBatch system manages tens of thousands of machines. The size, heterogeneity, and complexity of grid systems make them very difficult, ho ..."
Abstract

Cited by 10 (0 self)
Grid systems are proving increasingly useful for managing the batch computing jobs of organizations. One well-known example is Intel, whose internally developed NetBatch system manages tens of thousands of machines. The size, heterogeneity, and complexity of grid systems make them very difficult, however, to configure. This often results in misconfigured machines, which may adversely affect the entire system. We investigate a distributed data mining approach for detection of misconfigured machines. Our Grid Monitoring System (GMS) non-intrusively collects data from all sources (log files, system services, etc.) available throughout the grid system. It converts raw data to semantically meaningful data and stores this data on the machine it was obtained from, limiting incurred overhead and allowing scalability. Afterwards, when analysis is requested, a distributed outlier detection algorithm is employed to identify misconfigured machines. The algorithm itself is implemented as a recursive workflow of grid jobs. It is especially suited to grid systems, in which the machines might be unavailable most of the time and often fail altogether.
Angle-Based Outlier Detection in High-dimensional Data
"... Detectingoutliersinalargesetofdataobjectsisamajor data mining task aiming at finding different mechanisms responsible for different groups of objects in a data set. All existing approaches, however, are based on an assessment of distances (sometimes indirectly by assuming certain distributions) in t ..."
Abstract

Cited by 7 (5 self)
Detecting outliers in a large set of data objects is a major data mining task aiming at finding different mechanisms responsible for different groups of objects in a data set. All existing approaches, however, are based on an assessment of distances (sometimes indirectly by assuming certain distributions) in the full-dimensional Euclidean data space. In high-dimensional data, these approaches are bound to deteriorate due to the notorious “curse of dimensionality”. In this paper, we propose a novel approach named ABOD (Angle-Based Outlier Detection) and some variants assessing the variance in the angles between the difference vectors of a point to the other points. This way, the effects of the “curse of dimensionality” are alleviated compared to purely distance-based approaches. A main advantage of our new approach is that our method does not rely on any parameter selection influencing the quality of the achieved ranking. In a thorough experimental evaluation, we compare ABOD to the well-established distance-based method LOF on various artificial data sets and a real-world data set and show ABOD to perform especially well on high-dimensional data.
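The angle-variance idea can be sketched in a few lines. This simplified variant (names are illustrative) computes only the variance of angles over all pairs of difference vectors; the published ABOD score additionally weights each term by the inverse product of the distances involved, which is omitted here.

```python
from itertools import combinations
import numpy as np

def abof(points, i):
    """Simplified angle-based outlier factor: variance of the angles
    between difference vectors from point i to all pairs of other
    points. Small values indicate outliers: a point lying outside
    the data sees all other points under similar angles, while a
    point inside the data sees them in many directions."""
    pts = np.asarray(points, dtype=float)
    a = pts[i]
    others = [p for j, p in enumerate(pts) if j != i]
    angles = []
    for b, c in combinations(others, 2):
        ab, ac = b - a, c - a
        cos = ab @ ac / (np.linalg.norm(ab) * np.linalg.norm(ac))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(np.var(angles))
```

Because angles remain meaningful where Euclidean distances concentrate, this score degrades more gracefully in high dimensions than purely distance-based factors.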
Discovering Cluster-Based Local Outliers
 Pattern Recognition Letters
, 2003
"... In this paper, we present the new definition for outlier: clusterbased local outlier, which is meaningful and provides importance to the local data behavior. A measure for identifying the physical significance of an outlier is designed, which is called CBLOF (ClusterBased Local Outlier Factor). We ..."
Abstract

Cited by 6 (3 self)
In this paper, we present a new definition for outliers: the cluster-based local outlier, which is meaningful and gives importance to local data behavior. A measure for identifying the physical significance of an outlier is designed, called CBLOF (Cluster-Based Local Outlier Factor). We also propose the FindCBLOF algorithm for discovering outliers. The experimental results show that our approach outperformed existing methods at identifying meaningful and interesting outliers.
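Assuming a clustering is already available, the cluster-based scoring idea can be sketched as follows. This is a simplification, not the paper's definition: clusters are split into large and small ones by a coverage threshold alpha, large-cluster points are scored by distance to their own center, small-cluster points by distance to the nearest large cluster's center, and the published CBLOF additionally multiplies each score by the cluster size.

```python
import numpy as np

def cblof_scores(points, labels, alpha=0.9):
    """Sketch of cluster-based local outlier scoring on a precomputed
    clustering. Large clusters (together covering a fraction alpha of
    the data) define 'normal' regions; points in small clusters are
    scored by how far they lie from the nearest large cluster."""
    pts = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    ids, counts = np.unique(labels, return_counts=True)
    large, covered = set(), 0
    for j in np.argsort(-counts):            # largest clusters first
        if covered < alpha * len(pts):
            large.add(int(ids[j]))
        covered += counts[j]
    centers = {int(cid): pts[labels == cid].mean(axis=0) for cid in ids}
    large_centers = np.array([centers[cid] for cid in sorted(large)])
    scores = np.empty(len(pts))
    for i, (p, lab) in enumerate(zip(pts, labels)):
        if int(lab) in large:
            # distance to the point's own (large) cluster center
            scores[i] = np.linalg.norm(p - centers[int(lab)])
        else:
            # distance to the nearest large cluster's center
            scores[i] = np.linalg.norm(large_centers - p, axis=1).min()
    return scores
```

The local flavor comes from scoring each point against cluster structure near it rather than against a single global density model.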
LoOP: Local Outlier Probabilities
, 2009
"... Many outlier detection methods do not merely provide the decision for a single data object being or not being an outlier but give also an outlier score or “outlier factor ” signaling “how much ” the respective data object is an outlier. A major problem for any user not very acquainted with the outli ..."
Abstract

Cited by 6 (3 self)
Many outlier detection methods do not merely provide the decision for a single data object being or not being an outlier but also give an outlier score or “outlier factor” signaling “how much” the respective data object is an outlier. A major problem for any user not very acquainted with the outlier detection method in question is how to interpret this “factor” in order to decide, from the numeric score, whether or not the data object indeed is an outlier. Here, we formulate a local density based outlier detection method providing an outlier “score” in the range [0, 1] that is directly interpretable as the probability of a data object being an outlier.
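A compact sketch of such a probabilistic score, following the LoOP construction as commonly described (λ = 3, brute-force neighbor search; the function name is illustrative): a probabilistic set distance pdist is the quadratic mean of a point's k-NN distances scaled by λ, PLOF compares a point's pdist to that of its neighbors, and the Gaussian error function turns the normalized PLOF into a value in [0, 1].

```python
import math
import numpy as np

def loop_scores(points, k=3, lam=3.0):
    """Probabilistic local outlier scores in [0, 1] (sketch of LoOP)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    nn = np.argsort(dist, axis=1)[:, 1:k + 1]   # k nearest neighbors, self excluded
    # probabilistic set distance: lam * quadratic mean of k-NN distances
    pdist = np.array([lam * math.sqrt(float((dist[i, nn[i]] ** 2).mean()))
                      for i in range(n)])
    # PLOF: the point's pdist relative to its neighbors' mean pdist
    plof = np.array([float(pdist[i] / pdist[nn[i]].mean()) - 1.0
                     for i in range(n)])
    # normalization, then the error function maps to a probability-like score
    nplof = lam * math.sqrt(float((plof ** 2).mean()))
    return np.array([max(0.0, math.erf(p / (nplof * math.sqrt(2.0))))
                     for p in plof])
```

A score near 0 means the point's local density matches its neighborhood; a score near 1 flags it as an outlier, with no method-specific threshold for the user to interpret.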
Interpreting and Unifying Outlier Scores
"... Outlier scores provided by different outlier models differ widely in their meaning, range, and contrast between different outlier models and, hence, are not easily comparable or interpretable. We propose a unification of outlier scores provided by various outlier models and a translation of the arbi ..."
Abstract

Cited by 6 (2 self)
Outlier scores provided by different outlier models differ widely in their meaning, range, and contrast and, hence, are not easily comparable or interpretable. We propose a unification of outlier scores provided by various outlier models and a translation of the arbitrary “outlier factors” to values in the range [0, 1] interpretable as values describing the probability of a data object being an outlier. As an application, we show that this unification facilitates enhanced ensembles for outlier detection.
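One simple instance of such a translation can be sketched as follows (an assumed illustration of Gaussian scaling, one of several possible normalizations, not the paper's full framework): fit a mean and standard deviation to the raw scores and map each score through the Gaussian error function, clipping below-average scores to 0.

```python
import math

def gaussian_scaling(scores):
    """Map raw outlier scores of arbitrary range to [0, 1] by fitting a
    Gaussian to the score sample and applying the error function.
    Scores at or below the mean map to 0; scores far above the mean
    approach 1 and can be read as outlier probabilities."""
    mu = sum(scores) / len(scores)
    var = sum((s - mu) ** 2 for s in scores) / len(scores)
    sigma = math.sqrt(var) or 1.0           # guard against constant scores
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2.0))))
            for s in scores]
```

Because every model's scores land on the same [0, 1] scale, the outputs of different detectors can then be combined in an ensemble, the application the abstract points to.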