Distance-Based Outliers: Algorithms and Applications
, 2000
Cited by 104 (0 self)
This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers can only deal efficiently with two dimensions/attributes of a dataset. In this paper, we study the notion of DB- (Distance-Based) outliers. Specifically, we show that: (i) outlier detection can be done efficiently for large datasets, and for k-dimensional datasets with large values of k (e.g., k ≥ 5); and (ii) outlier detection is a meaningful and important knowledge discovery task. First, we present two simple algorithms, both having a complexity of O(kN^2), k being the dimensionality and N being the number of objects in the dataset. These algorithms readily support datasets with many more than two attributes. Second, we ...
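As an illustration of the quadratic cost mentioned in this abstract, here is a minimal brute-force sketch. It assumes the usual DB(p, D) definition (an object is an outlier if at least a fraction p of the objects in the dataset lie farther than distance D from it) and Euclidean distance. The function name `db_outliers` and its parameters are hypothetical; this is not one of the paper's two algorithms, only a naive pairwise comparison with the same O(kN^2) cost.

```python
import math

def db_outliers(points, p, d):
    """Return indices of DB(p, d)-outliers by brute-force pairwise comparison.

    Assumption: an object is an outlier if at least a fraction p of all
    objects lie farther than distance d from it (Euclidean distance).
    """
    n = len(points)
    outliers = []
    for i, a in enumerate(points):
        # N distance computations per object, each costing O(k): O(k N^2) overall.
        far = sum(1 for b in points if math.dist(a, b) > d)
        if far >= p * n:
            outliers.append(i)
    return outliers

# Hypothetical toy data: four clustered points and one distant point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_outliers(sample, p=0.75, d=3.0))
```

On the toy data, only the distant point has enough objects beyond distance 3 to qualify.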
Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule
, 2003
Cited by 84 (4 self)
Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
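The idea in this abstract (nested loop, random scan order, pruning) can be sketched roughly as follows. The abstract does not spell out the details, so several things here are assumptions: the outlier score is taken to be the distance to the k-th nearest neighbor, only the top `n_out` outliers are kept, and the names `top_outliers`, `k`, `n_out`, and `seed` are hypothetical. The sketch also omits the block-wise processing a disk-based implementation would need.

```python
import heapq
import math
import random

def top_outliers(points, k, n_out, seed=0):
    """Top n_out points ranked by distance to their k-th nearest neighbor.

    Assumed scoring and pruning scheme; a sketch, not the paper's code.
    """
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)  # scan candidate neighbors in random order
    top = []       # min-heap of (score, index): the best outliers found so far
    cutoff = 0.0   # score of the weakest outlier currently kept
    for i in range(len(points)):
        nearest = []  # negated distances: max-heap of the k closest seen so far
        pruned = False
        for j in order:
            if j == i:
                continue
            dist = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -dist)
            elif dist < -nearest[0]:
                heapq.heapreplace(nearest, -dist)
            # Pruning rule: the running k-th distance only shrinks, so once it
            # drops below the cutoff this point cannot be a top outlier.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        score = -nearest[0]
        if len(top) < n_out:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_out:
            cutoff = top[0][0]
    return sorted(top, reverse=True)

# Hypothetical toy data: a tight cluster plus one distant point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (0.5, 0.5)]
print([index for _, index in top_outliers(sample, k=2, n_out=1)])
```

Most points are non-outliers and find k close neighbors after scanning only a few candidates, which is why the inner loop usually exits early when the scan order is random.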
Finding Aggregate Proximity Relationships and Commonalities in Spatial Data Mining
- IEEE Transactions on Knowledge and Data Engineering
, 1996
Cited by 37 (5 self)
In this paper, we study two spatial knowledge discovery problems involving proximity relationships between clusters and features. The first problem is: Given a cluster of points, how can we efficiently find features (represented as polygons) that are closest to the majority of points in the cluster? We measure proximity in an aggregate sense due to the non-uniform distribution of points in a cluster (e.g., houses on a map), and the different shapes and sizes of features (e.g., natural or man-made geographic features). The second problem is: Given n clusters of points, how can we extract the aggregate proximity commonalities (i.e., features) that apply to most, if not all, of the n clusters? Regarding the first problem, the main contribution of the paper is the development of Algorithm CRH, which uses geometric approximations (i.e., circles, rectangles, and convex hulls) to filter and select features. Highly scalable and incremental, Algorithm CRH can examine over 50,000 features and their spatial relationships with a given cluster in approximately one second of CPU time. Regarding the second problem, the key contribution is the development of Algorithm GenCom, which makes use of concept generalization to effectively derive many meaningful commonalities that cannot be found otherwise.
The 3W Model and Algebra for Unified Data Mining
, 2000
Cited by 19 (1 self)
Real data mining/analysis applications call for a framework which adequately supports knowledge discovery as a multi-step process, where the input of one mining operation can be the output of another. Previous studies, primarily focusing on fast computation of one specific mining task at a time, ignore this vital issue. Motivated by
Fast mining of distance-based outliers in high dimensional datasets
- PAKDD 2006. LNCS (LNAI
, 2006
Cited by 11 (1 self)
Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. Existing algorithms for mining distance-based outliers do not scale to large, high-dimensional data sets. In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional data sets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art, often by an order of magnitude.

