
## Efficient Algorithms for Mining Outliers from Large Data Sets (2000)

Citations: 322 (0 self)

### Citations

2796 |
Algorithms for Clustering Data
- Jain, Dubes
- 1988
Citation Context ...a clustering algorithm for partitioning the data points is a good choice. A number of clustering algorithms have been proposed in the literature, most of which have at least quadratic time complexity [JD88]. Since N could be quite large, we are more interested in clustering algorithms that can handle large data sets. Among algorithms with lower complexities is the pre-clustering phase of BIRCH [ZRL96], ... |

2049 | Computational Geometry: An Introduction - Preparata, Shamos - 1985 |

1344 |
The Design and Analysis of Spatial Data Structures
- Samet
- 1990
Citation Context ...ty that is linear in the input size and performs a single scan of the database. It stores a compact summarization for each cluster in a CF-tree which is a balanced tree structure similar to an R-tree [Sam89]. For each successive point, it traverses the CF-tree to find the closest cluster, and if the point is within a threshold distance ε of the cluster, it is absorbed into it; else, it starts a new clu... |
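The snippet above describes BIRCH's pre-clustering step: absorb each incoming point into the closest cluster if it lies within a threshold distance, otherwise start a new cluster. A minimal flat sketch of that step (the CF-tree traversal is omitted; the function name and the centroid/count representation are assumptions for illustration):

```python
from math import dist  # Euclidean distance, Python 3.8+

def absorb_or_create(point, centroids, counts, eps):
    """Flat sketch of BIRCH-style pre-clustering: absorb `point` into the
    closest centroid if it is within threshold `eps`, updating the running
    mean; otherwise start a new cluster. Returns the cluster index."""
    if centroids:
        # Find the closest existing cluster (a CF-tree would do this in
        # logarithmic time; here we scan linearly).
        i = min(range(len(centroids)), key=lambda j: dist(point, centroids[j]))
        if dist(point, centroids[i]) <= eps:
            n = counts[i]
            # Update the centroid as a running mean over n + 1 points.
            centroids[i] = tuple((n * c + x) / (n + 1)
                                 for c, x in zip(centroids[i], point))
            counts[i] = n + 1
            return i
    # No cluster close enough: the point seeds a new cluster.
    centroids.append(tuple(point))
    counts.append(1)
    return len(centroids) - 1
```

A point within `eps` of an existing centroid shifts that centroid slightly; a distant point opens a new cluster, which is how outlying points end up in small clusters of their own.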

722 | CURE: An efficient clustering algorithm for large databases
- Guha, Rastogi, et al.
- 1998
Citation Context ...tial process with a lower bound of Ω(N^⌈δ/2⌉)). This makes these techniques infeasible for dimensions > 2. Clustering algorithms like CLARANS [NH94], DBSCAN [EKX95], BIRCH [ZRL96] and CURE [GRS98] consider outliers, but only to the point of ensuring that they do not interfere with the clustering process. Further, the definition of outliers used is in a sense subjective and related to the clust... |

709 | Efficient and effective clustering methods for spatial data mining - Ng, Han - 1994 |

591 | Nearest Neighbor Queries
- Roussopoulos, Kelley, et al.
- 1995
Citation Context ...he minimum distance between point p and rectangle R by MINDIST(p, R). Every point in R is at a distance of at least MINDIST(p, R) from p (see Figure 1(a)). The following definition of MINDIST is from [RKV95]: Definition 3.2: MINDIST(p, R) = Σ_{i=1}^{δ} x_i², where x_i = r_i − p_i if p_i < r_i; x_i = p_i − r_i′ if r_i′ < p_i; and x_i = 0 otherwise. We denote the maximum distance between point p and rec... |
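The MINDIST definition quoted above (squared distance from a point to the nearest face of an axis-aligned rectangle with lower corner r and upper corner r′) can be sketched directly; the tuple-based rectangle representation here is an assumption, not the paper's data structure:

```python
def mindist(p, r_low, r_high):
    """Squared minimum distance between point p and the axis-aligned
    rectangle [r_low, r_high], per the [RKV95] definition: on each axis,
    measure how far p falls outside the rectangle's extent (zero if
    inside), then sum the squares."""
    total = 0.0
    for p_i, lo, hi in zip(p, r_low, r_high):
        if p_i < lo:        # p is below the rectangle on this axis
            x = lo - p_i
        elif hi < p_i:      # p is above the rectangle on this axis
            x = p_i - hi
        else:               # p's coordinate lies within the extent
            x = 0.0
        total += x * x
    return total
```

For example, `mindist((0, 0), (1, 1), (3, 3))` gives 2.0, and a point inside the rectangle gives 0.0, matching the guarantee that every point of R is at least MINDIST(p, R) from p.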

514 |
The R*-tree: an efficient and robust access method for points and rectangles
- Beckmann, Kriegel, et al.
- 1990
Citation Context ...tions. This is expensive computationally, especially if the dimensionality of points is high. The number of distance computations can be substantially reduced by using a spatial index like an R*-tree [BKSS90]. If we have all the points stored in a spatial index like the R*-tree, the following pruning optimization, which was pointed out in [RKV95], can be applied to reduce the number of distance computatio... |

431 |
Outliers in Statistical Data
- Barnett, Lewis
- 1994
Citation Context ...average customer -- specifically, the customers who call very often and generate large telephone bills. The problem of detecting outliers has been extensively studied in the statistics community (see [BL94] for a good survey of statistical techniques). Typically, the user has to model the data points using a statistical distribution, and points a... |

359 | Algorithms for mining distance-based outliers in large datasets
- Knorr, Ng
- 1998
Citation Context ...roblem with these approaches is that in a number of situations, the user might simply not have enough knowledge about the underlying data distribution. In order to overcome this problem, Knorr and Ng [KN98] propose the following distance-based definition for outliers that is both simple and intuitive: A point p in a data set is an outlier with respect to parameters k and d if no more than k points in th... |
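The distance-based definition quoted above (p is an outlier with respect to k and d if no more than k points lie within distance d of p) can be checked with a straightforward nested-loop scan; the function name and tuple representation are assumptions for illustration:

```python
from math import dist  # Euclidean distance, Python 3.8+

def is_outlier(p, data, k, d):
    """Check the [KN98] distance-based definition: p is an outlier with
    respect to parameters k and d if no more than k points of the data
    set (excluding p itself) lie within distance d of p."""
    neighbours = 0
    for q in data:
        if q == p:
            continue
        if dist(p, q) <= d:
            neighbours += 1
            if neighbours > k:
                return False  # too many close neighbours: not an outlier
    return True
```

This is the quadratic-time baseline; the indexed and cell-based algorithms discussed in the surrounding contexts exist precisely to avoid this all-pairs scan on large N.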

263 |
BIRCH: an efficient data clustering method for very large databases
- Zhang, Ramakrishnan, et al.
- 1996
Citation Context ...rently an exponential process with a lower bound of Ω(N^⌈δ/2⌉)). This makes these techniques infeasible for dimensions > 2. Clustering algorithms like CLARANS [NH94], DBSCAN [EKX95], BIRCH [ZRL96] and CURE [GRS98] consider outliers, but only to the point of ensuring that they do not interfere with the clustering process. Further, the definition of outliers used is in a sense subjective and rel... |

226 |
New sampling-based summary statistics for improving approximate query answers
- Gibbons, Matias
- 1998
Citation Context ...he exponential growth in the number of cells as the number of dimensions is increased, the nested loop outperforms the cell-based algorithm for dimensions 4 and higher. The notion of hot list queries [GM98] that return the n most frequently occurring values in a data set is somewhat related to our work. However, instead of ranking points using a scalar function (e.g., number of occurrences), we consider... |

110 | Discovery-driven exploration of OLAP data cubes
- Sarawagi, Agrawal, et al.
- 1998
Citation Context ...em of detecting deviations -- after seeing a series of similar data, an element disturbing the series is considered an exception. Table analysis methods from the statistics literature are employed in [SAM98] to attack the problem of finding exceptions in OLAP data cubes. A detailed value of the data cube is called an exception if it is found to differ significantly from the anticipated value calculated u... |

101 | A Linear Method for Deviation Detection in Large Databases
- Arning, Agrawal, et al.
- 1996
Citation Context ...hat are detected by these algorithms. This is in contrast to our definition of distance-based outliers which is more objective and independent of how clusters in the input data set are identified. In [AAR96], the authors address the problem of detecting deviations -- after seeing a series of similar data, an element disturbing the series is considered an exception. Table analysis methods from the statist... |

80 | Finding Intensional Knowledge of Distance-based Outliers - Knorr, Ng - 1999 |

79 | Computing Depth Contours of Bivariate Point Clouds - Ruts, Rousseeuw - 1996 |

77 | PUBLIC: a decision tree classifier that integrates building and pruning
- Rastogi, Shim
- 1998
Citation Context ... characteristics of the input data that are exhibited by a (typically user-defined) significant portion of the data. Examples of these large patterns include association rules [AMS+95], classification [RS98] and clustering [ZRL96, NH94, EKX95, GRS98]. In this paper, we focus on the converse problem of finding "small patterns" or outliers. An outlier in a set of data is an observation or a point that is co... |

58 | A database interface for clustering in large spatial databases
- Ester, Kriegel, et al.
- 1995
Citation Context ...s which is inherently an exponential process with a lower bound of Ω(N^⌈δ/2⌉)). This makes these techniques infeasible for dimensions > 2. Clustering algorithms like CLARANS [NH94], DBSCAN [EKX95], BIRCH [ZRL96] and CURE [GRS98] consider outliers, but only to the point of ensuring that they do not interfere with the clustering process. Further, the definition of outliers used is in a sense sub... |

2 |
Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context ...to their k-th nearest neighbors. Finally, we note that using the k neighbors of a point in order to draw inferences about its properties has been used before in the context of pattern classification [DH73]. Given a data set in which each point has an associated label, the classification problem is to assign a label to a new point p. The k nearest neighbor rule for classifying point p is to assign it t... |
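The k nearest neighbor rule mentioned in the snippet above (assign p the majority label among its k closest labelled points) can be sketched in a few lines; the function name and the (coordinates, label) pair representation are assumptions for illustration:

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_label(p, labelled_points, k):
    """k nearest neighbor rule in the spirit of [DH73]: find the k
    labelled points closest to p and return the most common label
    among them. Each element of labelled_points is (coords, label)."""
    nearest = sorted(labelled_points, key=lambda pl: dist(p, pl[0]))[:k]
    [(label, _)] = Counter(lbl for _, lbl in nearest).most_common(1)
    return label
```

This is the same use of k-neighborhoods the paper turns around: instead of borrowing labels from the k nearest points, the outlier definition ranks points by their distance to the k-th nearest neighbor.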

1 | LOF: identifying density-based local outliers - Breunig, Kriegel, et al. - 2000 |