Results 1–10 of 19
Detecting outliers using transduction and statistical testing
In Proceedings of the 12th Annual SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006
Abstract
Cited by 11 (1 self)
Outlier detection can uncover malicious behavior in fields like intrusion detection and fraud analysis. Although there has been a significant amount of work in outlier detection, most of the algorithms proposed in the literature are based on a particular definition of outliers (e.g., density-based) and use ad-hoc thresholds to detect them. In this paper we present a novel technique to detect outliers with respect to an existing clustering model; the test can also be successfully applied when clustering information is not available. Our method is based on Transductive Confidence Machines, which have been previously proposed as a mechanism to provide individual confidence measures on classification decisions. The test uses hypothesis testing to decide whether a point fits each of the clusters of the model. We experimentally demonstrate that the test is highly robust and produces very few misdiagnosed points, even when no clustering information is available. Furthermore, our experiments demonstrate the robustness of our method when the data are contaminated by outliers. We finally show that our technique can be successfully applied to identify outliers in a noisy data set for which no information (e.g., ground truth or clustering structure) is available. As such, our proposed methodology can bootstrap a clean data set from a noisy one, which can then be used to identify future outliers.
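The hypothesis-testing idea in this abstract can be illustrated with a minimal conformal-style p-value test. This is a sketch under assumptions: it uses distance to the sample mean as the nonconformity score (the paper uses cluster-based scores from Transductive Confidence Machines), and `conformal_outlier_test` and its `alpha` threshold are names invented here, not the authors' implementation.

```python
import numpy as np

def conformal_outlier_test(train, x, alpha=0.05):
    """Flag x as an outlier when its p-value falls below alpha.

    Nonconformity score: distance to the training mean -- a stand-in
    for the cluster-based scores used in the paper.
    """
    center = train.mean(axis=0)
    scores = np.linalg.norm(train - center, axis=1)  # training scores
    s_x = np.linalg.norm(x - center)                 # test score
    # p-value: fraction of examples at least as nonconforming as x
    p = (np.sum(scores >= s_x) + 1) / (len(scores) + 1)
    return p < alpha, p

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(200, 2))
is_outlier, p = conformal_outlier_test(data, np.array([6.0, 6.0]))
```

Under these assumptions a far-away point receives a p-value near 1/(n+1) and is rejected, while points from the bulk of the data obtain large p-values and are retained.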
New Outlier Detection Method Based on Fuzzy Clustering, 2010
Abstract
Cited by 3 (0 self)
In this paper, a new efficient method for outlier detection is proposed. The method is based on fuzzy clustering techniques: the c-means algorithm is first performed, then small clusters are determined and considered outlier clusters. Further outliers are then determined by computing the difference in the objective function value when a point is temporarily removed from the data set; if a noticeable change occurs, the point is considered an outlier. Tests were performed on several well-known data sets from the data mining literature, and the results show that the proposed method performs well.
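The two-stage scheme in this abstract can be sketched as follows. Assumptions are labeled loudly: plain k-means stands in for fuzzy c-means, the objective-function change on removing a point is approximated by that point's own squared distance to its centroid, and the function names and thresholds are invented for illustration.

```python
import numpy as np

def kmeans(X, k, iters=25):
    """Plain k-means, used here as a stand-in for fuzzy c-means."""
    # naive deterministic initialisation: evenly spaced sample points
    centers = X[np.linspace(0, len(X) - 2, k).astype(int)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def objective_change_outliers(X, k=2, min_size=5, thresh=3.0):
    labels, centers = kmeans(X, k)
    out = np.zeros(len(X), dtype=bool)
    for j in range(k):                  # stage 1: small clusters are outlier clusters
        if (labels == j).sum() < min_size:
            out[labels == j] = True
    # stage 2: removing a point lowers the objective by (roughly) its own
    # squared distance to its centroid; a noticeable drop marks an outlier
    d2 = ((X - centers[labels]) ** 2).sum(axis=1)
    return out | (d2 > d2.mean() + thresh * d2.std())

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.3, size=(50, 2)),
               rng.normal([5, 5], 0.3, size=(50, 2)),
               [[20.0, 20.0]]])
flags = objective_change_outliers(X)
```

On this toy data the isolated point's removal changes the objective far more than any cluster member's, so it alone is flagged.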
Havinga P: A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets, 2007
Abstract
Cited by 3 (1 self)
The term “outlier” can generally be defined as an observation that is significantly different from the other values in a data set. Outliers may be instances of error or may indicate events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and further discover interesting and useful knowledge about unusual events within numerous application domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees for selecting the most suitable technique for a given data set. Furthermore, we highlight the advantages, disadvantages, and performance issues of each class of outlier detection techniques under this taxonomy framework.
The Robust Distance for Similarity Measure of Content-Based Image Retrieval
Abstract
Cited by 1 (0 self)
Abstract—Content-based image retrieval (CBIR) retrieves digital images using their visual content. Retrieval is carried out by measuring the similarity between a query image and the images in the database through a similarity measure; distance is the metric most often used. A query image is relevant to a database image if the similarity measure is ‘small’, so a good CBIR system must be supported by an accurate similarity measure. The classical distance is computed from the arithmetic mean, which is vulnerable to the masking effect: extreme data inflate the dispersion around the arithmetic mean, so the distance to extreme data (outliers) becomes smaller than it should be. This paper proposes a robust distance for the CBIR process, derived from a measure of multivariate dispersion called vector variance (VV). The minimum vector variance (MVV) estimator has a high breakdown point and is insensitive to outliers. Another good property of VV is that it is faster to compute than the covariance determinant (CD).
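The masking effect the abstract describes can be demonstrated with a much simpler robust estimator than MVV. This sketch substitutes the coordinatewise median and MAD for the paper's minimum-vector-variance estimator; the function names and the 5% contamination setup are illustrative assumptions, not the paper's method.

```python
import numpy as np

def classical_dist(X, x):
    """Distance from the arithmetic mean -- sensitive to masking."""
    return np.linalg.norm(np.asarray(x) - X.mean(axis=0))

def robust_dist(X, x):
    """Distance from the coordinatewise median, scaled by the MAD.

    A simple robust stand-in: the paper's MVV estimator instead picks
    the half-sample minimising the vector variance.
    """
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0) + 1e-12
    return np.linalg.norm((np.asarray(x) - med) / mad)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(95, 2)),
               np.full((5, 2), 10.0)])          # 5% contamination
```

With contamination the mean drifts toward the outliers, shrinking their classical distances, while the median/MAD location barely moves, so the robust distance keeps outliers clearly separated from the bulk.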
LEARNING AND CLEANUP IN A LARGE SCALE MUSIC DATABASE
Abstract
We have collected a database of musical features from radio broadcasts and CD collections (N > 10^5). The database poses a number of hard modelling challenges, including segmentation problems and missing or wrong metadata. We describe our efforts towards cleaning the data using probability density estimation: we train conditional densities for checking the relation between metadata and music features, and unconditional densities for spotting unlikely music features. We show that the rejected samples indeed represent various types of problems in the music data. The models may in some cases assist reconstruction of metadata.
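The "unconditional densities for spotting unlikely features" step can be sketched crudely: fit a density to the features and reject samples in its low-probability tail. A single Gaussian stands in here for the paper's learned density models, and `density_reject` and the quantile cutoff are assumptions made for illustration.

```python
import numpy as np

def density_reject(X, quantile=0.02):
    """Fit one Gaussian to the features and flag samples whose
    log-density falls below a low quantile."""
    mu = X.mean(axis=0)
    cov = np.cov(X.T) + 1e-6 * np.eye(X.shape[1])  # regularise for invertibility
    inv = np.linalg.inv(cov)
    diff = X - mu
    # log-density up to a constant = -0.5 * squared Mahalanobis distance
    logp = -0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff)
    return logp < np.quantile(logp, quantile)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, size=(99, 2)), [[8.0, 8.0]]])
flags = density_reject(X)
```

The far sample receives by far the lowest log-density and falls below the rejection quantile; in the paper's setting such rejected samples are then inspected for segmentation or metadata problems.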
Algorithms for Detecting Distance-Based Outliers in High Dimensions
Abstract
In this paper, we study efficient algorithms for detecting distance-based outliers from a set of n d-dimensional points with respect to two integer threshold parameters k and M, and a radius parameter r. First, we design an algorithm utilizing d-dimensional balls to facilitate fast local search, achieving an O((3.1√(2eπ))^d · kn) time complexity for detecting DB(k, r) outliers. This algorithm improves the best known O((2√(2d)+1)^d · kn) time complexity of the cell-based algorithm in [13, 14]. Second, we develop a randomized QuickSelect strategy and apply it to design two randomized algorithms for detecting outliers with the M largest k-distances or the M largest average k-distances, in time Õ(dnM + (3.1√(2eπ))^d · kn) and Õ(dnM + (1+k) · d · (2√(2eπ))^d), respectively; the Õ() notation suppresses logarithmic factors. We observe that the distance-based outlier detection problem is related to the classical, extensively studied nearest neighbor search problem; however, to the best of our knowledge, the complexity of our algorithms cannot be achieved with the help of existing nearest neighbor search algorithms.
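The DB(k, r) notion these algorithms accelerate can be shown with the simple quadratic baseline they improve upon. A sketch under assumptions: one common reading of DB(k, r) is used (a point is an outlier when fewer than k other points lie within radius r), and `db_outliers` is a name invented here.

```python
import numpy as np

def db_outliers(X, k, r):
    """Naive baseline: a point is reported as a DB(k, r) outlier
    when fewer than k other points lie within distance r of it."""
    out = []
    for i, p in enumerate(X):
        d = np.linalg.norm(X - p, axis=1)
        if np.sum(d <= r) - 1 < k:       # minus 1: exclude the point itself
            out.append(i)
    return out

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 0.2, size=(30, 2)), [[5.0, 5.0]]])
```

This scan is O(dn^2); the paper's ball-based and cell-based structures exist precisely to cut the exponential-in-d constant and the pairwise work of this loop.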
Fast Algorithms for Outlier Detection (© 2008 Science Publications)
Abstract
Abstract: Fast algorithms for detecting outliers (as unusual objects) by their distance to neighboring objects are highly desirable. Two algorithms are proposed to detect outliers quickly: the first is based on the Partial Distance (PD) algorithm and the second is an improved version of it. The proposed algorithms were found to reduce the number of distance calculations compared to the nested-loop method. Key words: Outlier detection, K-Nearest Neighbour (KNN), partial distance, data mining
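The partial-distance idea behind both proposed algorithms is a classic shortcut: while accumulating a squared distance dimension by dimension, abandon the candidate as soon as the partial sum already exceeds the best distance found so far. A minimal nearest-neighbour sketch (the function name and data are illustrative, not the paper's code):

```python
def partial_distance_nn(points, q):
    """Nearest neighbour with the partial-distance shortcut: abort a
    squared-distance sum as soon as it exceeds the best so far."""
    best_d2, best_i = float("inf"), -1
    for i, p in enumerate(points):
        d2 = 0.0
        for a, b in zip(p, q):
            d2 += (a - b) ** 2
            if d2 >= best_d2:            # early exit: p cannot win
                break
        else:                            # ran all dimensions: new best
            best_d2, best_i = d2, i
    return best_i, best_d2

idx, d2 = partial_distance_nn([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]],
                              [0.9, 1.2])
```

The savings grow with dimensionality: distant candidates are rejected after only a few coordinates, which is exactly the reduction in distance calculations the abstract reports over the nested-loop method.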
Discovering Local Outliers using Dynamic Minimum Spanning Tree with Self-Detection of Best Number of Clusters
Abstract
Detecting outliers in a database (as unusual objects) using clustering and distance-based approaches is highly desirable; minimum-spanning-tree-based clustering algorithms are capable of detecting clusters with irregular boundaries. In this paper we propose a new algorithm to detect outliers based on minimum spanning tree clustering and a distance-based approach. Outlier detection is an extremely important task in a wide variety of applications. The algorithm partitions the data set into an optimal number of clusters; small clusters are then determined and considered outliers, and the remaining outliers (if any) are detected within the clusters using a distance-based method. The algorithm uses a new cluster validation criterion, based on a geometric property of the data partition, to find the proper number of clusters. It works in two phases: the first phase creates the optimal number of clusters, whereas the second detects outliers within the clusters. The key feature of our approach is that it combines the best features of distance-based and clustering-based outlier detection to find noise-free/error-free clusters for a given data set without any input parameters.
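The MST-clustering-plus-small-cluster step can be sketched with Kruskal's algorithm: build the minimum spanning tree, drop its long edges to split the data into clusters, and report tiny clusters as outliers. This is a sketch under assumptions: a fixed `edge_cut` replaces the paper's parameter-free cluster-count selection, the second (distance-based) phase is omitted, and all names are invented here.

```python
import numpy as np

def mst_outliers(X, edge_cut, min_size):
    """Kruskal's MST over all pairwise edges; dropping MST edges longer
    than edge_cut splits the tree into clusters, and clusters smaller
    than min_size are reported as outliers."""
    n = len(X)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    edges = sorted((np.linalg.norm(X[i] - X[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    short = []                    # MST edges no longer than the cut
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj       # union: this edge joins the MST
            if w <= edge_cut:
                short.append((i, j))
    parent = list(range(n))       # rebuild components from short edges only
    for i, j in short:
        parent[find(i)] = find(j)
    roots = np.array([find(i) for i in range(n)])
    sizes = {c: int((roots == c).sum()) for c in set(roots)}
    return np.array([sizes[c] < min_size for c in roots])

rng = np.random.default_rng(6)
X = np.vstack([rng.normal([0, 0], 0.1, size=(10, 2)),
               rng.normal([5, 0], 0.1, size=(10, 2)),
               [[10.0, 10.0]]])
flags = mst_outliers(X, edge_cut=1.5, min_size=3)
```

Because the MST is built on all pairwise distances, irregularly shaped clusters survive intact as long as their internal edges stay below the cut, which is the property the abstract credits to MST-based clustering.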
A Fast Algorithm for Robust Mixtures in the Presence of Measurement Errors
Data Mining and Knowledge Discovery (manuscript)
Abstract
Abstract In experimental sciences, detecting atypical or peculiar data from large sets of measurements has the potential of highlighting candidates of interesting new types of objects that deserve more detailed domain-specific follow-up study. However, measurement data are almost never free of measurement errors, and these errors can generate false outliers that are not truly interesting. Although many approaches exist for finding outliers, they have no means to tell to what extent the peculiarity is not simply due to measurement errors. We have therefore been developing a model-based approach to infer genuine outliers from multivariate data sets in the presence of measurement error information, by explicitly incorporating knowledge of measurement errors into the model. It is based on a probabilistic mixture of hierarchical density models, in which parameter estimation is made feasible by a tree-structured variational EM algorithm. Here, we further develop an algorithmic enhancement to address the scalability of this approach, in order to make it applicable to large data sets; this is achieved by a KD-tree based partitioning of the variational posterior assignments. We conduct extensive …