Detecting outliers using transduction and statistical testing
 In Proceedings of the 12th Annual SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006
Cited by 12 (1 self)
Outlier detection can uncover malicious behavior in fields like intrusion detection and fraud analysis. Although there has been a significant amount of work in outlier detection, most of the algorithms proposed in the literature are based on a particular definition of outliers (e.g., density-based) and use ad hoc thresholds to detect them. In this paper we present a novel technique to detect outliers with respect to an existing clustering model. However, the test can also be successfully utilized to recognize outliers when the clustering information is not available. Our method is based on Transductive Confidence Machines, which have been previously proposed as a mechanism to provide individual confidence measures on classification decisions. The test uses hypothesis testing to prove or disprove whether a point fits in each of the clusters of the model. We experimentally demonstrate that the test is highly robust and produces very few misdiagnosed points, even when no clustering information is available. Furthermore, our experiments demonstrate the robustness of our method when the data are contaminated by outliers. Finally, we show that our technique can be successfully applied to identify outliers in a noisy data set for which no information is available (e.g., ground truth, clustering structure, etc.). As such, our proposed methodology is capable of bootstrapping a clean data set from a noisy one, which can then be used to identify future outliers.
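The abstract does not spell out the strangeness measure or the exact test, so the following is only a minimal sketch of a transductive, p-value-based membership test. It assumes a k-nearest-neighbour strangeness score and a 0.05 significance level; the function names, `k`, and `significance` are illustrative choices, not the paper's.

```python
import math

def knn_strangeness(point, others, k=3):
    """Sum of distances from `point` to its k nearest neighbours in `others`."""
    dists = sorted(math.dist(point, o) for o in others)
    return sum(dists[:k])

def transductive_p_value(candidate, cluster, k=3):
    """Fraction of points (candidate included) at least as strange as the candidate."""
    augmented = list(cluster) + [candidate]
    scores = []
    for i, p in enumerate(augmented):
        rest = augmented[:i] + augmented[i + 1:]
        scores.append(knn_strangeness(p, rest, k))
    cand_score = scores[-1]
    return sum(1 for s in scores if s >= cand_score) / len(augmented)

def is_outlier(candidate, clusters, significance=0.05, k=3):
    """Flag the candidate only if every cluster rejects it (assumed decision rule)."""
    return all(transductive_p_value(candidate, c, k) <= significance
               for c in clusters)

# Hypothetical data: one cluster laid out on a 5 x 5 grid.
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
far_point_flagged = is_outlier((100.0, 100.0), [grid])
central_point_flagged = is_outlier((2.5, 2.5), [grid])
```

Because the p-value is a rank among the cluster's own scores, no distance threshold has to be chosen by hand; only the significance level is set.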
A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
, 2007
Cited by 3 (1 self)
The term “outlier” can generally be defined as an observation that is significantly different from the other values in a data set. Outliers may be instances of error or indicate events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and further discover interesting and useful knowledge about unusual events within numerous application domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees to select the most suitable technique based on the data set. Furthermore, we highlight the advantages, disadvantages and performance issues of each class of outlier detection techniques under this taxonomy framework.
New Outlier Detection Method Based on Fuzzy Clustering
, 2010
Cited by 3 (0 self)
In this paper, a new efficient method for outlier detection is proposed. The proposed method is based on fuzzy clustering techniques. The c-means algorithm is first performed, then small clusters are determined and considered as outlier clusters. Other outliers are then determined by computing differences between objective function values when points are temporarily removed from the data set. If a noticeable change occurs in the objective function value, the points are considered outliers. Tests were performed on different well-known data sets from the data mining literature. The results showed that the proposed method gave good results.
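The second step described above, measuring how much the objective function changes when a point is temporarily removed, can be sketched as follows. This is an assumption-laden illustration, not the authors' procedure: it uses the standard fuzzy c-means updates with fuzzifier `m = 2`, reruns the clustering from fixed, hand-picked initial centers for each removal, and omits the small-cluster step. All names and data are hypothetical.

```python
import math

def memberships(points, centers, m):
    """Standard fuzzy c-means membership update for fixed centers."""
    power = 2.0 / (m - 1.0)
    table = []
    for p in points:
        # Clip distances away from zero so a point sitting on a center is safe.
        d = [max(math.dist(p, c), 1e-12) for c in centers]
        table.append([1.0 / sum((dj / dk) ** power for dk in d) for dj in d])
    return table

def fcm_objective(points, init_centers, m=2.0, iters=50):
    """Run fuzzy c-means from the given centers; return the final objective value."""
    centers = [tuple(c) for c in init_centers]
    n, dim = len(points), len(points[0])
    for _ in range(iters):
        u = memberships(points, centers, m)
        new_centers = []
        for j in range(len(centers)):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            new_centers.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / total
                for d in range(dim)))
        centers = new_centers
    u = memberships(points, centers, m)
    return sum(u[i][j] ** m * math.dist(points[i], centers[j]) ** 2
               for i in range(n) for j in range(len(centers)))

def objective_drops(points, init_centers, m=2.0):
    """Decrease in the objective when each point is temporarily removed."""
    base = fcm_objective(points, init_centers, m)
    return [base - fcm_objective(points[:i] + points[i + 1:], init_centers, m)
            for i in range(len(points))]

# Hypothetical data: two tight clusters plus one stray point (index 8).
data = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5),
        (10.0, 10.0), (10.5, 10.0), (10.0, 10.5), (10.5, 10.5),
        (5.0, 30.0)]
drops = objective_drops(data, [(0.0, 0.0), (10.0, 10.0)])
most_suspicious = max(range(len(data)), key=lambda i: drops[i])
```

What counts as a "noticeable" change is left open by the abstract; this sketch only ranks points by the size of the drop.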
MODELLING OF CONDITIONAL VARIANCE AND UNCERTAINTY USING INDUSTRIAL PROCESS DATA
, 2006
Cited by 2 (2 self)
Academic dissertation to be presented, with the assent of
The Robust Distance for Similarity Measure Of Content Based Image Retrieval
Cited by 1 (0 self)
Content-based image retrieval (CBIR) is a retrieval technique that uses visual information to retrieve images from collections of digital images. Retrieval is carried out by measuring the similarity between the query image and the images in the database through a similarity measure. Distance is a metric often used as the similarity measure in CBIR. The query image is relevant to an image in the database if the value of the similarity measure is ‘small’. This means that a good CBIR system must be supported by an accurate similarity measure. The classical distance is generated from the arithmetic mean, which is vulnerable to the masking effect. The appearance of extreme data inflates the deviation around the arithmetic mean, which implies that the distance to the extreme data, or outliers, becomes smaller than it is supposed to be. This paper proposes a robust distance for the CBIR process, derived from the measure of multivariate dispersion called vector variance (VV). The minimum vector variance (MVV) estimator has a high breakdown point and is insensitive to outliers. Another good property of VV is that it takes less time to compute than the covariance determinant (CD).
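The masking effect mentioned above can be illustrated in one dimension. The sketch below does not implement the MVV estimator; it swaps in the median and the median absolute deviation (MAD) as a simple robust stand-in, and the data values and the cutoff of 3 are made up for illustration.

```python
import statistics

def classical_score(x, data):
    """Distance from the arithmetic mean in units of the standard deviation."""
    return abs(x - statistics.mean(data)) / statistics.stdev(data)

def robust_score(x, data):
    """Distance from the median in MAD units (stand-in for a robust distance).

    0.6745 rescales the MAD to match the standard deviation for normal data.
    Assumes the MAD is non-zero.
    """
    med = statistics.median(data)
    mad = statistics.median([abs(v - med) for v in data])
    return 0.6745 * abs(x - med) / mad

# Hypothetical sample: seven ordinary values and two extreme ones.
sample = [10, 11, 9, 10, 12, 10, 11, 100, 105]
classical = classical_score(100, sample)  # extremes inflate mean and deviation
robust = robust_score(100, sample)        # median and MAD barely move
```

With a cutoff of 3, the classical score fails to flag the value 100 because the two extreme values inflate the very statistics used to judge them, while the robust score flags it clearly.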
Measuring the Distance from Training Data Set
 in Proc. Int. Symp. on Applied Stochastic Models and Data Analysis, 2005
Cited by 1 (1 self)
In this paper, a new method is proposed for measuring the distance between a training data set and a single, new observation. The novel distance measure reflects the expected squared prediction error when the prediction is based on the k nearest neighbours in the training data set. The simulation shows that the distance measure correlates well with the true expected squared prediction error in practice. The distance measure can be applied, for example, to assessing the uncertainty of prediction.
On Detecting Clustered Anomalies using SCiForest
Cited by 1 (0 self)
Detecting local clustered anomalies is an intricate problem for many existing anomaly detection methods. Distance-based and density-based methods are inherently restricted by their basic assumptions: anomalies are either far from normal points or sparse. Clustered anomalies are able to avoid detection since they defy these assumptions by being dense and, in many cases, in close proximity to normal instances. In this paper, without using any density or distance measure, we propose a new method called SCiForest to detect clustered anomalies. SCiForest separates clustered anomalies from normal points effectively even when clustered anomalies are very close to normal points. It maintains the ability of existing methods to detect scattered anomalies, and it has superior time and space complexities compared with existing distance-based and density-based methods.
Rough set, kernel set and spatiotemporal outlier detection
 IEEE Trans. Knowledge and Data Engineering
Cited by 1 (0 self)
Nowadays, the high availability of data gathered from wireless sensor networks and telecommunication systems has drawn the attention of researchers to the problem of extracting knowledge from spatiotemporal data. Detecting outliers that are grossly different from or inconsistent with the remaining spatiotemporal data set is a major challenge in real-world knowledge discovery and data mining applications. In this paper, we deal with the outlier detection problem in spatiotemporal data and describe a rough set approach that finds the top outliers in an unlabeled spatiotemporal data set. The proposed method, called Rough Outlier Set Extraction (ROSE), relies on a rough set theoretic representation of the outlier set using the rough set approximations, i.e., lower and upper approximations. We have also introduced a new set, named Kernel Set, a subset of the original data set that is able to describe the original data set both in terms of data structure and of obtained results. Experimental results on real-world data sets demonstrate the superiority of ROSE, both in terms of some quantitative indices and outliers detected, over those obtained by various rough fuzzy clustering algorithms and by state-of-the-art outlier detection methods. It is also demonstrated that the kernel set is able to detect the same outlier set in less computational time. Index Terms: spatiotemporal data, outlier detection, spatiotemporal uncertainty management, rough set and granular computing
A Fast Distance-Based Algorithm to Detect Outliers
, 2007
A fast distance-based algorithm for outlier detection is proposed. The proposed algorithm was found to reduce the number of distance calculations compared to the nested-loop algorithm. Tests were performed on different well-known data sets. The results showed that the proposed algorithm gave a reasonable saving in CPU time.
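The abstract does not describe the proposed algorithm itself, so the sketch below shows only the nested-loop baseline it is compared against, with a counter for distance calculations. It assumes a common distance-based definition (a point is an outlier if it has fewer than `min_neighbors` other points within `radius`); both parameter names and the sample data are illustrative.

```python
import math

def nested_loop_outliers(points, radius, min_neighbors):
    """Return (outlier indices, number of distance calculations performed)."""
    outliers = []
    n_dist = 0
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i == j:
                continue
            n_dist += 1
            if math.dist(p, q) <= radius:
                neighbors += 1
                # Early exit: this point can no longer be an outlier.
                if neighbors >= min_neighbors:
                    break
        if neighbors < min_neighbors:
            outliers.append(i)
    return outliers, n_dist

# Hypothetical data: four points on a unit square and one distant point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
found, calcs = nested_loop_outliers(pts, radius=1.5, min_neighbors=2)
```

Even this baseline avoids some of the n(n-1) possible distance calculations through early exit. True outliers always require a full scan, which is the kind of cost a faster algorithm would aim to cut.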
A Fast Algorithm for Robust Mixtures in the Presence of Measurement Errors
 Data Mining and Knowledge Discovery (manuscript)
In experimental sciences, detecting atypical or peculiar data from large sets of measurements has the potential of highlighting candidates of interesting new types of objects that deserve more detailed domain-specific follow-up study. However, measurement data is nearly never free of measurement errors. These errors can generate false outliers that are not truly interesting. Although many approaches exist for finding outliers, they have no means to tell to what extent the peculiarity is not simply due to measurement errors. We have therefore been developing a model-based approach to infer genuine outliers from multivariate data sets in the presence of measurement error information, by explicitly incorporating knowledge of measurement errors into the model. This is based on a probabilistic mixture of hierarchical density models, in which parameter estimation is made feasible by a tree-structured variational EM algorithm. Here, we further develop an algorithmic enhancement to address the scalability of this approach, in order to make it applicable to large data sets. This is achieved by a KD-tree based partitioning of the variational posterior assignments. We conduct extensive