LOF: Identifying Density-Based Local Outliers
Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000
Cited by 214 (6 self)
For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances, or outliers, can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful, but cannot otherwise be identified with existing approaches. Finally, a careful performance evaluation of our algorithm confirms that our approach of finding local outliers can be practical.
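The abstract describes LOF only informally. As a rough illustration, the brute-force sketch below follows the commonly cited definition (reachability distance, local reachability density, then the ratio of neighbor densities to the object's own density). The function name `lof_scores` and the parameter `k` are illustrative choices; the sketch assumes no duplicate points and truncates ties at the k-distance, which the full definition does not.

```python
import math

def lof_scores(points, k):
    """Return a LOF-style score for every point (brute force, O(n^2) distances)."""
    n = len(points)
    # neighbors[i]: the k nearest other points to i as (distance, index) pairs.
    # Simplification: ties at the k-distance are truncated, not all included.
    neighbors = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append(dists[:k])
    k_dist = [nb[-1][0] for nb in neighbors]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(i, j) = max(k-distance(j), d(i, j)).
    lrd = []
    for i in range(n):
        total = sum(max(k_dist[j], d) for d, j in neighbors[i])
        lrd.append(len(neighbors[i]) / total)  # assumes total > 0 (no duplicates)

    # LOF: average ratio of each neighbor's density to the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]
```

Points deep inside a uniformly dense cluster score close to 1, while a point that is isolated relative to its neighbors scores well above 1.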
Outlier detection for high dimensional data
2001
Cited by 128 (0 self)
The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are high dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms use concepts of proximity in order to find outliers based on their relationship to the rest of the data. However, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. In fact, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions. Consequently, for high dimensional data, the notion of finding meaningful outliers becomes substantially more complex and non-obvious. In this paper, we discuss new techniques for outlier detection which find the outliers by studying the behavior of projections from the data set.
Distance-Based Outliers: Algorithms and Applications
2000
Cited by 104 (0 self)
This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers can only deal efficiently with two dimensions/attributes of a dataset. In this paper, we study the notion of DB- (Distance-Based) outliers. Specifically, we show that: (i) outlier detection can be done efficiently for large datasets, and for k-dimensional datasets with large values of k (e.g., k ≥ 5); and (ii) outlier detection is a meaningful and important knowledge discovery task. First, we present two simple algorithms, both having a complexity of O(kN^2), k being the dimensionality and N being the number of objects in the dataset. These algorithms readily support datasets with many more than two attributes. Second, we ...
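The abstract does not spell out the outlier definition or the algorithms. The sketch below assumes the usual DB(p, D) formulation, in which an object is an outlier if at least a fraction p of the dataset lies farther than distance D from it, and checks it with a plain nested loop. It is an illustration of the O(kN^2) idea, not a reproduction of either algorithm in the paper; the function name `db_outliers` and the choice to exclude the object itself from the count are assumptions.

```python
import math

def db_outliers(points, p, D):
    """Indices of points with at least a fraction p of the dataset farther than D."""
    n = len(points)
    outliers = []
    for i, a in enumerate(points):
        # Upper bound on how many other objects can still be farther than D from a.
        far_possible = n - 1
        for j, b in enumerate(points):
            if i != j and math.dist(a, b) <= D:
                far_possible -= 1
                if far_possible < p * n:
                    break  # too many close neighbors: cannot be an outlier
        # If the loop ran to completion, far_possible is the exact count.
        if far_possible >= p * n:
            outliers.append(i)
    return outliers
```

Each distance computation costs O(k) for k-dimensional points, and the two loops over N objects give the quadratic factor; the early `break` only helps for objects that sit in dense regions.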
Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule
2003
Cited by 84 (4 self)
Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
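A minimal sketch of the nested loop with a pruning rule might look like the following. It assumes the outlier score is the distance to the k-th nearest neighbor (one common choice; the abstract does not fix the score) and processes one example at a time rather than in blocks. The names `top_outliers`, `n_out`, and `seed` are illustrative.

```python
import heapq
import math
import random

def top_outliers(points, k, n_out, seed=0):
    """Return (score, index) pairs for the n_out points with the largest
    k-th nearest neighbor distance, best first."""
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)  # random scan order makes pruning effective
    top = []      # min-heap of (score, index): best outliers found so far
    cutoff = 0.0  # score of the weakest outlier in `top` once it is full
    for i in range(len(points)):
        nearest = []  # max-heap (negated distances) of the k closest points so far
        pruned = False
        for j in order:
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # -nearest[0] is an upper bound on the final score; it only shrinks.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True  # cannot enter the top n_out: stop scanning
                break
        if not pruned and len(nearest) == k:
            heapq.heappush(top, (-nearest[0], i))
            if len(top) > n_out:
                heapq.heappop(top)
            if len(top) == n_out:
                cutoff = top[0][0]
    return sorted(top, reverse=True)
```

The worst case is still quadratic, but a typical non-outlier finds k close neighbors after scanning only a few randomly ordered examples and is dropped, which matches the abstract's remark that the time to process non-outliers does not depend on the size of the data set.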
Anomaly Detection: A Survey
2007
Cited by 69 (1 self)
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
Mining Top-n Local Outliers in Large Databases
2001
Cited by 39 (6 self)
Outlier detection is an important task in data mining with numerous applications, including credit card fraud detection, video surveillance, etc. A recent work on outlier detection has introduced a novel notion of local outlier in which the degree to which an object is outlying is dependent on the density of its local neighborhood, and each object can be assigned a Local Outlier Factor (LOF) which represents the likelihood of that object being an outlier. Although the concept of local outliers is a useful one, the computation of LOF values for every data object requires a large number of k-nearest neighbor searches and can be computationally expensive. Since most objects are usually not outliers, it is useful to provide users with the option of finding only the n most outstanding local outliers, i.e., the top-n data objects which are most likely to be local outliers according to their LOFs. However, if the pruning is not done carefully, finding top-n outliers could result in the same amount of computation as finding LOF for all objects. In this paper, we propose a novel method to efficiently find the top-n local outliers in large databases. The concept of "micro-cluster" is introduced to compress the data. An efficient micro-cluster-based local outlier mining algorithm is designed based on this concept. As our algorithm can be adversely affected by the overlapping in the micro-clusters, we propose a meaningful cut-plane solution for overlapping data. The formal analysis and experiments show that this method can achieve good performance in finding the most outstanding local outliers.
Overcoming Limitations of Sampling for Aggregation Queries
In ICDE, 2001
Cited by 36 (6 self)
We study the problem of approximately answering aggregation queries using sampling. We observe that uniform sampling performs poorly when the distribution of the aggregated attribute is skewed. To address this issue, we introduce a technique called outlier-indexing. Uniform sampling is also ineffective for queries with low selectivity. We rely on weighted sampling based on workload information to overcome this shortcoming. We demonstrate that a combination of outlier-indexing with weighted sampling can be used to answer aggregation queries with significantly reduced approximation error compared to either uniform sampling or weighted sampling alone. We discuss the implementation of these techniques on Microsoft’s SQL Server, and present experimental results that demonstrate the merits of our techniques.
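To illustrate why separating outliers helps with skewed data, the sketch below estimates a SUM by aggregating a small set of extreme values exactly and scaling up a uniform sample of the remaining values. How the paper actually selects the outlier set is not stated in the abstract; picking the values farthest from the median is an assumption made here for simplicity, as are the names `estimate_sum`, `n_outliers`, and `sample_size`.

```python
import random
import statistics

def estimate_sum(values, n_outliers, sample_size, seed=0):
    """Approximate sum(values): exact over the outliers, sampled over the rest."""
    median = statistics.median(values)
    # Assumed outlier rule: the n_outliers values farthest from the median.
    ranked = sorted(values, key=lambda v: abs(v - median), reverse=True)
    outliers, rest = ranked[:n_outliers], ranked[n_outliers:]

    # Uniform sample of the non-outliers, scaled up to the size of `rest`.
    sample = random.Random(seed).sample(rest, min(sample_size, len(rest)))
    scaled = sum(sample) * len(rest) / len(sample) if sample else 0.0
    return sum(outliers) + scaled
```

With one very large value among many small ones, a plain uniform sample either misses the large value or over-weights it, whereas here it is always counted exactly once and the sample only has to represent the low-variance remainder.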
Unsupervised Learning Techniques for an Intrusion Detection System
2004
Cited by 36 (3 self)
With the continuous evolution of the types of attacks against computer networks, traditional intrusion detection systems, based on pattern matching and static signatures, are increasingly limited by their need for an up-to-date and comprehensive knowledge base. Data mining techniques have been successfully applied in host-based intrusion detection. Applying data mining techniques on raw network data, however, is made difficult by the sheer size of the input; this is usually avoided by discarding the network packet contents. In this paper, we introduce a two-tier architecture to overcome this problem: the first tier is an unsupervised clustering algorithm which reduces the network packets' payload to a tractable size. The second tier is a traditional anomaly detection algorithm, whose efficiency is improved by the availability of data on the packet payload content.
A Unified Notion of Outliers: Properties and Computation
In Proc. of the International Conference on Knowledge Discovery and Data Mining, 1997
Cited by 35 (0 self)
As said in signal processing, "One person's noise is another person's signal." For many applications, such as the exploration of satellite or medical images, and the monitoring of criminal activities in electronic commerce, identifying exceptions can often lead to the discovery of truly unexpected knowledge. In this paper, we study an intuitive notion of outliers. A key contribution of this paper is to show how the proposed notion of outliers unifies or generalizes many existing notions of outliers provided by discordancy tests for standard statistical distributions. Thus, a unified outlier detection system can replace a whole spectrum of statistical discordancy tests with a single module detecting only the kinds of outliers proposed. A second contribution of this paper is the development of an approach to find all outliers in a dataset. The structure underlying this approach resembles a data cube, which has the advantage of facilitating integration with the many OLAP and data mining s...
Fast Computation of 2-Dimensional Depth Contours
1998
Cited by 30 (3 self)
"One person's noise is another person's signal." For many applications, including the detection of credit card frauds and the monitoring of criminal activities in electronic commerce, an important knowledge discovery problem is the detection of exceptional/outlying events. In computational statistics, one well-known approach to detect outlying data points in a 2-D dataset is to assign a depth to each data point. Based on the assigned depths, the data points are organized in layers in the 2-D space, with the expectation that shallow layers are more likely to contain outlying points than are the deep layers. One robust notion of depth, called depth contours, was introduced by Tukey [17,18]. ISODEPTH, developed by Ruts and Rousseeuw [16], is an algorithm that computes 2-D depth contours. In this paper, we give a fast algorithm, called FDC, for computing 2-D depth contours. The idea is that to compute the first k depth contours, it is sufficient to restrict the computation to a...

