Results 1 - 5 of 5
Distance-Based Outliers: Algorithms and Applications
, 2000
Abstract

Cited by 122 (0 self)
This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers can only deal efficiently with two dimensions/attributes of a dataset. In this paper, we study the notion of DB (Distance-Based) outliers. Specifically, we show that: (i) outlier detection can be done efficiently for large datasets, and for k-dimensional datasets with large values of k (e.g., k ≥ 5); and (ii) outlier detection is a meaningful and important knowledge discovery task. First, we present two simple algorithms, both having a complexity of O(kN^2), k being the dimensionality and N being the number of objects in the dataset. These algorithms readily support datasets with many more than two attributes. Second, we ...
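As a rough illustration of the O(kN^2) cost mentioned in this abstract, the sketch below checks every pair of objects in a naive nested loop. It is not one of the paper's algorithms. It assumes the commonly cited DB(p, D) definition, which is not spelled out in the abstract: an object is an outlier if at least a fraction p of the objects in the dataset lie farther than distance D from it. The function name `db_outliers` and the sample data are hypothetical.

```python
import math


def db_outliers(points, p, d):
    """Return indices of DB(p, d) outliers found by a naive nested loop.

    Assumed definition (not stated in the abstract above): a point is an
    outlier if at least a fraction p of all points lie farther than d from it.
    Each of the N*N distance computations costs O(k), giving O(k * N^2).
    """
    n = len(points)
    outliers = []
    for i, a in enumerate(points):
        # Count the points that lie strictly farther than d from point a.
        far = sum(1 for b in points if math.dist(a, b) > d)
        if far >= p * n:
            outliers.append(i)
    return outliers


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(db_outliers(sample, p=0.75, d=3.0))
```

The quadratic loop is only practical for small N; it is shown here to make the definition concrete, not as an efficient method.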
Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule
, 2003
Abstract

Cited by 103 (4 self)
Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
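The idea in this abstract (a nested loop over randomly ordered data with a pruning rule) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the outlier score is assumed to be the distance to the k-th nearest neighbor, the task is assumed to be finding the top n outliers, and the names `top_outliers`, `k`, `n` and `seed` are hypothetical. An example is dropped as soon as its running k-th-nearest-neighbor distance falls below the score of the weakest outlier found so far, since its final score can only be smaller.

```python
import heapq
import math
import random


def top_outliers(points, k, n, seed=0):
    """Return up to n (score, point) pairs with the largest k-th NN distance."""
    data = list(points)
    if len(data) <= k:
        raise ValueError("need more than k points")
    # Random order makes it likely that close neighbors are met early.
    random.Random(seed).shuffle(data)

    top = []      # min-heap holding the best n (score, point) pairs so far
    cutoff = 0.0  # score of the weakest outlier currently in `top`
    for i, a in enumerate(data):
        nearest = []  # max-heap (negated distances) of the k closest so far
        pruned = False
        for j, b in enumerate(data):
            if i == j:
                continue
            dist = math.dist(a, b)
            if len(nearest) < k:
                heapq.heappush(nearest, -dist)
            elif dist < -nearest[0]:
                heapq.heapreplace(nearest, -dist)
            # Pruning rule: the running k-th NN distance is an upper bound
            # on the final score, so stop once it drops below the cutoff.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        heapq.heappush(top, (-nearest[0], a))
        if len(top) > n:
            heapq.heappop(top)
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)


if __name__ == "__main__":
    grid = [(x, y) for x in range(5) for y in range(5)]
    for score, point in top_outliers(grid + [(20, 20), (-15, 30)], k=2, n=2):
        print(point, round(score, 2))
```

Most non-outliers are pruned after only a few distance computations once the cutoff has risen, which is the intuition behind the near linear behavior described above; the worst case remains quadratic.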
Finding Aggregate Proximity Relationships and Commonalities in Spatial Data Mining
 IEEE Transactions on Knowledge and Data Engineering
, 1996
Abstract

Cited by 38 (5 self)
In this paper, we study two spatial knowledge discovery problems involving proximity relationships between clusters and features. The first problem is: Given a cluster of points, how can we efficiently find features (represented as polygons) that are closest to the majority of points in the cluster? We measure proximity in an aggregate sense due to the nonuniform distribution of points in a cluster (e.g., houses on a map), and the different shapes and sizes of features (e.g., natural or man-made geographic features). The second problem is: Given n clusters of points, how can we extract the aggregate proximity commonalities (i.e., features) that apply to most, if not all, of the n clusters? Regarding the first problem, the main contribution of the paper is the development of Algorithm CRH, which uses geometric approximations (i.e., circles, rectangles, and convex hulls) to filter and select features. Highly scalable and incremental, Algorithm CRH can examine over 50,000 features and their spatial relationships with a given cluster in approximately one second of CPU time. Regarding the second problem, the key contribution is the development of Algorithm GenCom, which makes use of concept generalization to effectively derive many meaningful commonalities that cannot be found otherwise.
The 3W Model and Algebra for Unified Data Mining
, 2000
Abstract

Cited by 21 (1 self)
Real data mining/analysis applications call for a framework which adequately supports knowledge discovery as a multistep process, where the input of one mining operation can be the output of another. Previous studies, primarily focusing on fast computation of one specific mining task at a time, ignore this vital issue. Motivated by
Fast mining of distance-based outliers in high dimensional datasets
 PAKDD 2006. LNCS (LNAI
, 2006
Abstract

Cited by 14 (1 self)
Defining outliers by their distance to neighboring data points has been shown to be an effective nonparametric approach to outlier detection. Existing algorithms for mining distance-based outliers do not scale to large, high-dimensional data sets. In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional data sets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art, often by an order of magnitude.