## Detecting outliers using transduction and statistical testing (2006)

Venue: | Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining |

Citations: | 11 (1 self) |

### BibTeX

```bibtex
@INPROCEEDINGS{Barbara06detectingoutliers,
  author    = {Daniel Barbará and Carlotta Domeniconi},
  title     = {Detecting outliers using transduction and statistical testing},
  booktitle = {Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  year      = {2006},
  pages     = {54--60},
  publisher = {ACM Press}
}
```

### Abstract

Outlier detection can uncover malicious behavior in fields like intrusion detection and fraud analysis. Although there has been a significant amount of work in outlier detection, most of the algorithms proposed in the literature are based on a particular definition of outliers (e.g., density-based), and use ad-hoc thresholds to detect them. In this paper we present a novel technique to detect outliers with respect to an existing clustering model. However, the test can also be successfully utilized to recognize outliers when the clustering information is not available. Our method is based on Transductive Confidence Machines, which have been previously proposed as a mechanism to provide individual confidence measures on classification decisions. The test uses hypothesis testing to decide whether a point fits each of the clusters of the model. We experimentally demonstrate that the test is highly robust, and produces very few misdiagnosed points, even when no clustering information is available. Furthermore, our experiments demonstrate that the method remains robust when the data are contaminated by outliers. We finally show that our technique can be successfully applied to identify outliers in a noisy data set for which no information is available (e.g., ground truth, clustering structure, etc.). As such, our methodology can bootstrap a clean data set from a noisy one, which can then be used to identify future outliers.

### Citations

1687 | An Introduction to Kolmogorov Complexity and its Applications
- Li, Vitanyi
- 1997
Citation Context: ...class of a point and attach confidence to the estimate. The transductive reliability estimation process has its theoretical foundations in the algorithmic theory of randomness developed by Kolmogorov [20]. Unlike traditional methods in machine learning, transduction can offer measures of reliability to individual examples, and uses very broad assumptions (it only assumes that the data points are indep...

1124 | Multidimensional binary search trees used for associative searching
- Bentley
- 1975
Citation Context: ...ling the data). We expect that this technique will result in a decrease in false positives with respect to the rates obtained by uniform sampling. Multidimensional search structures, such as Kd-trees [3], can also be utilized to speed the search for k-nearest neighbors. A more promising approach is to employ the space-filling curves techniques in [1], based on maintaining lists of the points ordered b...
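
The Kd-tree idea mentioned in this context can be sketched briefly. The following is a generic stdlib-only illustration (the function names `build_kdtree` and `knn` are ours, not from the paper): build the tree by cycling through split axes, then answer k-nearest-neighbor queries with the classic plane-distance pruning rule.

```python
import heapq
import math

def build_kdtree(points, depth=0):
    """Split on one axis per level, with the median point at the node."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def knn(node, query, k, heap=None):
    """k nearest neighbors of `query`; `heap` is a max-heap keyed on
    negated distance, so heap[0] always holds the worst candidate."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    d = math.dist(query, node["point"])
    if len(heap) < k:
        heapq.heappush(heap, (-d, node["point"]))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, node["point"]))
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff <= 0 else (node["right"], node["left"])
    knn(near, query, k, heap)
    # Visit the far subtree only if the splitting plane is closer than the
    # current k-th best distance (the classic pruning rule).
    if len(heap) < k or abs(diff) < -heap[0][0]:
        knn(far, query, k, heap)
    return heap
```

For low dimensions this avoids the brute-force scan; as the citation context notes, in high dimensions space-filling-curve approaches tend to be more promising.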

1104 | A density-based algorithm for discovering clusters in large spatial databases with noise
- Ester, Kriegel, et al.
- 1996
Citation Context: ...ccuracy. Such technique can easily be employed in StrOUD to speed the computation. A large body of work has been published in the area of discovering outliers with respect to clustering models (e.g., [7, 23, 27, 29, 12]). However, most of these algorithms do not aim at the discovery of outliers, but rather offer ways to deal with them. And, in all the cases, the discovery of outliers is done by careful setting of th...

635 | Textures: A Photographic Album for Artists and Designers
- Brodatz
- 1966
Citation Context: ...the Laboratory of Image Processing and Pattern Recognition (INPG-LTIRF) in Grenoble, France, using as the original source the material in [5], and referenced in [10, 11]. The data set contains a large number of classes (11) and a high dimensionality (40). The original aim was to distinguish between 11 different textures (Grass lawn, Presse...

594 | Efficient and effective clustering method for spatial data mining
- Ng, Han
- 1994
Citation Context: ...ccuracy. Such technique can easily be employed in StrOUD to speed the computation. A large body of work has been published in the area of discovering outliers with respect to clustering models (e.g., [7, 23, 27, 29, 12]). However, most of these algorithms do not aim at the discovery of outliers, but rather offer ways to deal with them. And, in all the cases, the discovery of outliers is done by careful setting of th...

563 | CURE: An efficient clustering algorithm for large databases
- Guha, Rastogi, et al.
- 1998
Citation Context: ...ccuracy. Such technique can easily be employed in StrOUD to speed the computation. A large body of work has been published in the area of discovering outliers with respect to clustering models (e.g., [7, 23, 27, 29, 12]). However, most of these algorithms do not aim at the discovery of outliers, but rather offer ways to deal with them. And, in all the cases, the discovery of outliers is done by careful setting of th...

305 | Lof: identifying density-based local outliers
- Breunig, Kriegel, et al.
- 2000
Citation Context: ...ume a clustering model or a known distribution have been proposed. They fall under two categories. The first is distance-based techniques (see [2, 18, 26]). The second is density-based techniques (see [4, 17]). Again, in all these algorithms one must threshold parameters to obtain the set of outliers. The method of [18], which we use here to provide a comparison to ours, uses distance and density calculat...

259 | Algorithms for mining distance-based outliers in large datasets
- Knorr, Ng
- 1998
Citation Context: ...purposes of this paper, we will limit ourselves to this definition. In particular, we use the Euclidean distance to compute the distance between pairs of points (as done in the distance-based method [18] against which we compare ours). The strangeness being the ratio of the sum of the K smallest distances from the same class we propose to put i in, to the sum of the K smallest distances from other cl...
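
The strangeness measure described in this context is easy to state concretely. The following is a hedged sketch (the function name and argument names are ours): the sum of the K smallest Euclidean distances to the candidate cluster, divided by the sum of the K smallest distances to all other points.

```python
import math

def strangeness(point, candidate_cluster, other_points, k):
    """Ratio of the sum of the k smallest distances to the candidate
    cluster over the sum of the k smallest distances to the points of
    all other clusters; values well above 1 suggest the point is strange."""
    d_in = sorted(math.dist(point, p) for p in candidate_cluster)[:k]
    d_out = sorted(math.dist(point, p) for p in other_points)[:k]
    return sum(d_in) / sum(d_out)
```

A point that sits inside the candidate cluster gets a ratio below 1; a point that is far from it but near some other cluster gets a ratio well above 1.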

225 | Efficient algorithms for mining outliers from large data sets
- Ramaswamy, Rastogi, et al.
- 2000
Citation Context: ...arameters. Some outlier detection schemes that do not assume a clustering model or a known distribution have been proposed. They fall under two categories. The first is distance-based techniques (see [2, 18, 26]). The second is density-based techniques (see [4, 17]). Again, in all these algorithms one must threshold parameters to obtain the set of outliers. The method of [18], which we use here to provide a c...

182 | Identification of outliers
- Hawkins
- 1980
Citation Context: ...can reveal clues to solve the problem at hand. In other cases, the sudden appearance of a large number of outliers can point to a change in the underlying process that is generating the data. Hawkins [14] defines an outlier as “an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.” This suggests the possibility of detecting ...

174 | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases
- Sheikholeslami, Chatterjee, et al.
- 1998

117 | Learning relational probability trees
- Neville, Jensen, et al.
- 2003
Citation Context: ...data, this assumption is too strong. We plan to modify our method to deal with correlated data by introducing a technique that has been used successfully in the field of Relational Data Mining (e.g., [16]). Randomization testing [8] is a type of statistical test which involves generating replicates of the actual data set, called pseudo-samples. Randomization has been proven effective in compensating t...

105 | Mining distance-based outliers in near linear time with randomization and a simple pruning rule
- Bay, Schwabacher
- 2003
Citation Context: ...arameters. Some outlier detection schemes that do not assume a clustering model or a known distribution have been proposed. They fall under two categories. The first is distance-based techniques (see [2, 18, 26]). The second is density-based techniques (see [4, 17]). Again, in all these algorithms one must threshold parameters to obtain the set of outliers. The method of [18], which we use here to provide a c...

92 | Randomization Tests
- Edgington
- 1995
Citation Context: ...strong. We plan to modify our method to deal with correlated data by introducing a technique that has been used successfully in the field of Relational Data Mining (e.g., [16]). Randomization testing [8] is a type of statistical test which involves generating replicates of the actual data set, called pseudo-samples. Randomization has been proven effective in compensating the effects of autocorrelatio...
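
Randomization testing, as this context describes, builds pseudo-samples by resampling the observed data and asks how extreme the actual statistic is among them. A minimal sketch (a standard permutation test for a difference in means, not the paper's specific procedure; all names are ours):

```python
import random

def permutation_test(group_a, group_b, n_perm=2000, seed=7):
    """Randomization test: shuffle the pooled values into pseudo-samples
    and count how often the shuffled mean difference is at least as large
    as the observed one. Returns the usual (hits + 1)/(n + 1) p-value."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:n], pooled[n:]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Two well-separated groups yield a tiny p-value; identical groups yield a p-value near 1.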

46 | Space filling curves
- Sagan
- 1994
Citation Context: ...to avoid the costly computation of the K-nearest neighbors of each data point. The technique is based on transforming the d-dimensional points into one-dimensional points lying on a Hilbert or z-curve [25]. The transformation guarantees that two points that are close in the curve are also close in the d space. However, two points in the d space are not necessarily close in the Hilbert curve, so it is n...
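
The linearization step this context describes can be illustrated with the z-curve (Morton order), the simpler of the two curves it names; the Hilbert curve used in [21] has better locality but a more involved encoding. This is a generic sketch (function names are ours), including the shifted-copies idea of keeping several sorted lists:

```python
def morton_key(coords, bits=16):
    """Bit-interleaving: map integer coordinates to a position on the
    z-curve, so multi-dimensional points can be kept in one sorted list."""
    key = 0
    for b in range(bits):
        for dim, c in enumerate(coords):
            key |= ((c >> b) & 1) << (b * len(coords) + dim)
    return key

def shifted_keys(coords, shifts, bits=16):
    """One key per shifted copy of the curve: querying several shifted
    lists compensates for points that are near in space but far on any
    single curve (the multiple-transformation idea in the context)."""
    return [morton_key(tuple(c + s for c in coords), bits) for s in shifts]
```

Sorting points by `morton_key` keeps spatially close points close in the list most of the time; the shifted copies cover the remaining cases.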

25 | The nature of statistical learning theory
- Vapnik
- 1995
Citation Context: ...ifficult as the original problem of finding outliers. This approach can be seen as inducing a model over the normal data and using it to test points. Recently, the field of statistical learning theory [31] has developed alternatives to induction: instead of using all the available points to induce a model, one can use the data (usually a small subset of it) to estimate unknown properties of points that...

23 | Outlier mining in large high-dimensional data sets
- Angiulli, Pizzuti
- 2005
Citation Context: ...new definition will make the strangeness value of a point far away from the cluster considerably larger than the strangeness of points already inside the cluster. This definition has been employed by [1] (see Section 5) as a measure of isolation. Using the α values, we can compute a series of p-values for the new point, one for each cluster y = 1, ··· , c (where c is the number of clusters). We call th...
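
The p-value computation this context refers to has a simple transductive form: among the strangeness (α) values of a cluster's members plus the new point, count the fraction at least as large as the new point's. A hedged sketch (names and the all-clusters rule below are our reading, not verbatim from the paper):

```python
def p_value(cluster_alphas, alpha_new):
    """Transductive p-value: the fraction of strangeness values (the new
    point's own value included) at least as large as the new point's."""
    pool = list(cluster_alphas) + [alpha_new]
    return sum(a >= alpha_new for a in pool) / len(pool)

def is_outlier(alphas_per_cluster, new_alphas, tau=0.05):
    """Diagnose a point as an outlier when its p-value falls below the
    significance threshold tau for every cluster."""
    return all(p_value(a, s) < tau
               for a, s in zip(alphas_per_cluster, new_alphas))
```

A point stranger than every member of every cluster gets the minimum possible p-value, 1/(n+1), in each cluster and is flagged.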

20 | Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator
- Hardin, Rocke
- 2004
Citation Context: ...., it is equivalent to consider as outliers those points that are 3 or more standard deviations away from the centroid, which is the common practice of univariate statistical techniques. Moreover, in [13], which uses a technique based on MCD to find multiple clusters and diagnose outliers, the authors argue that not every data set will give rise to an obvious separation of outliers and non-outliers by...
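
The "3 or more standard deviations away from the centroid" rule of thumb mentioned here can be sketched directly. This is only the common univariate practice the context names, applied to distances-to-centroid; it is not the MCD-based criterion of [13], and the function name is ours:

```python
import math
import statistics

def three_sigma_flags(points, centroid):
    """Flag points whose distance to the centroid exceeds the mean
    distance by more than three standard deviations."""
    dists = [math.dist(p, centroid) for p in points]
    mu, sigma = statistics.fmean(dists), statistics.pstdev(dists)
    return [p for p, d in zip(points, dists) if d > mu + 3 * sigma]
```

As [13] argues, such a fixed cutoff will not separate outliers cleanly on every data set, which motivates the statistical test used in the paper instead.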

19 | Enhancing effectiveness of outlier detections for low density patterns
- Tang, Chen, et al.
- 2002

16 | A unified approach to spatial outliers detection
- Shekhar, Lu, et al.
- 2001
Citation Context: ...riate data, and all of them assume knowledge of the underlying distribution of the data, which in practice is very restrictive. More recently a technique for spatial outlier detection was proposed in [28]. The method uses the difference between the attribute value at location x and the average attribute value of x’s neighbors to determine...
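
The neighbor-difference statistic this context attributes to [28] is straightforward to compute. A minimal sketch (names are ours): given each location's attribute value and its neighbor list, score it by its deviation from the neighborhood average.

```python
def spatial_scores(values, neighbors):
    """Score each location x by values[x] minus the average value over
    x's neighbors; a large score marks a candidate spatial outlier."""
    return {x: values[x] - sum(values[n] for n in nbrs) / len(nbrs)
            for x, nbrs in neighbors.items()}
```

A location whose value far exceeds its neighbors' average stands out with a large positive score, even if that value is unremarkable globally.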

15 | High dimensional similarity search with space filling curves
- Liao, Lopez, et al.
- 2001
Citation Context: ...o points in the d space are not necessarily close in the Hilbert curve, so it is necessary to produce multiple transformations by shifting the curves. The idea of shifted Hilbert curves, described in [21], guarantees that a search for a nearest neighbor of a query point conducted over several lists of linearized points (each list corresponding to a shifted Hilbert curve) finds the true nearest neighbor...

12 | Prediction algorithms and confidence measures based on algorithmic randomness theory
- Gammerman, Vovk
- 2002
Citation Context: ...dea leads to elegant algorithms that use standard statistical tests to compute the confidence on the estimation. Using transduction, researchers have built Transductive Confidence Machines (TCM) (see [9]) which are able to estimate the unknown class of a point and attach confidence to the estimate. The transductive reliability estimation process has its theoretical foundations in the algorithmic theo...

8 | Outliers in statistical data
- Lewis
- 1994
Citation Context: ...the normal observations were generated and testing points for “membership” to this mechanism. Indeed, that is the path that early work in outlier detection followed (in the statistical community; see [19] for a comprehensive review): postulate a model for the probability distribution of normal points (e.g., a Gaussian model), and compute the likelihood of a point being generated by the postulated mode...
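
The classic statistical recipe this context describes (postulate a Gaussian for the normal points, then test membership) reduces in one dimension to a z-score check. A hedged sketch for illustration only (names and the threshold are ours; the surveyed methods are more varied):

```python
import statistics

def gaussian_z_outlier(x, normal_sample, threshold=3.0):
    """Fit a 1-D Gaussian to the 'normal' observations and flag x when
    it lies implausibly many standard deviations from the fitted mean."""
    mu = statistics.fmean(normal_sample)
    sigma = statistics.stdev(normal_sample)
    return abs(x - mu) / sigma > threshold
```

This is exactly the kind of distribution assumption the paper avoids: it works only when the postulated model is right, whereas the transductive test makes no such commitment.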

8 | Transductive confidence machines for pattern recognition
- Proedrou, Nouretdinov, et al.
- 2002
Citation Context: ...” examples are the points already clustered. Transduction has been previously used to offer confidence measures for the decision of labelling a point as belonging to a set of pre-defined classes (see [24, 33, 9]). TCM [9] introduced the computation of the confidence using Algorithmic Randomness Theory [20]. (The first proposed application of Algorithmic Randomness Theory to machine learning problems, howeve...

6 | Machine learning application of algorithmic randomness
- Vovk, Gammerman, et al.
- 1999
Citation Context: ...putation of the confidence using Algorithmic Randomness Theory [20]. (The first proposed application of Algorithmic Randomness Theory to machine learning problems, however, corresponds to Vovk et al. [32].) The confidence measure used in TCM is based upon universal tests for randomness, or their approximation. A Martin-Löf randomness deficiency test [20] based on such tests is a universal version of th...

4 | Fractal Characterization of Web Workloads
- Menasce, Abrahao, et al.
- 2002
Citation Context: ...4.3 The on-line bookstore data: This data set was generated from an e-commerce workload and has been used previously in [22]. The logs correspond to a couple of weekdays in which a large number of HTTP requests were processed. Entries corresponding to images, errors, etc., were deleted and the URLs of the remaining entries...

3 | High order statistics from natural textured images
- Guérin-Dugué, Avilez-Cruz
- 1993
Citation Context: ...the Laboratory of Image Processing and Pattern Recognition (INPG-LTIRF) in Grenoble, France, using as the original source the material in [5], and referenced in [10, 11]. The data set contains a large number of classes (11) and a high dimensionality (40). The original aim was to distinguish between 11 different textures (Grass lawn, Pressed calf leather, Handmade pap...

2 | Deliverable R3-B4-P, Task B4: Benchmarks. Technical report, Elena-NervesII “Enhanced Learning for Evolutive Neural Architecture”. Anonymous FTP: ftp.dice.ucl.ac.be /pub/neural-nets/ELENA/Benchmarks.ps.Z
- Guérin-Dugué
- 1995
Citation Context: ...the Laboratory of Image Processing and Pattern Recognition (INPG-LTIRF) in Grenoble, France, using as the original source the material in [5], and referenced in [10, 11]. The data set contains a large number of classes (11) and a high dimensionality (40). The original aim was to distinguish between 11 different textures (Grass lawn, Pressed calf leather, Handmade pap...

1 | ELENA project data. ftp://ftp.dice.ucl.ac.be/pub/neuralnets/ELENA/databases
Citation Context: ...is possible to use a sample of the normal data to capture outliers without significant loss of accuracy. 4.4 Texture data: This is one of the real data sets of the Elena project, which can be found in [6]. The data set was generated by...

1 | Multivariate Outlier Detection and Robustness
- Hubert, Rousseeuw, et al.
- 2005
Citation Context: ...rmal distribution for the non-spatial attribute value. An exception to the univariate restriction in the statistics techniques is the work of Rousseeuw et al. (whose most recent survey can be found in [15]). Their approach is to find a robust fit model, i.e., one that is similar to the fit that would have been found without the outliers, and use it to diagnose which points do not fit the model well. Th...

1 | Mining top-n local outliers in large databases
- Jin, Tung, et al.
- 2001
Citation Context: ...ume a clustering model or a known distribution have been proposed. They fall under two categories. The first is distance-based techniques (see [2, 18, 26]). The second is density-based techniques (see [4, 17]). Again, in all these algorithms one must threshold parameters to obtain the set of outliers. The method of [18], which we use here to provide a comparison to ours, uses distance and density calculat...

1 | Transductive Confidence
- Ho, Wechsler
- 2003
Citation Context: ...” examples are the points already clustered. Transduction has been previously used to offer confidence measures for the decision of labelling a point as belonging to a set of pre-defined classes (see [24, 33, 9]). TCM [9] introduced the computation of the confidence using Algorithmic Randomness Theory [20]. (The first proposed application of Algorithmic Randomness Theory to machine learning problems, howeve...