## Outlier detection for high dimensional data (2001)

Citations: 173 (4 self)

### BibTeX

@INPROCEEDINGS{Aggarwal01outlierdetection,
    author = {Charu C. Aggarwal},
    title = {Outlier detection for high dimensional data},
    booktitle = {},
    year = {2001},
    pages = {37--46}
}

### Abstract

The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are high dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms use concepts of proximity in order to find outliers based on their relationship to the rest of the data. However, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. In fact, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions. Consequently, for high dimensional data, the notion of finding meaningful outliers becomes substantially more complex and non-obvious. In this paper, we discuss new techniques for outlier detection which find the outliers by studying the behavior of projections from the data set.

### Citations

8299 |
Genetic Algorithms
- Goldberg
- 1989
Citation Context ...is refined in subsequent iterations of the algorithm, and the best set of projections found so far is always maintained by the evolutionary algorithm. Selection: Several alternatives are possible =-=[17]-=- for selection in an evolutionary algorithm; the most popularly known ones are rank selection and fitness proportional selection. The idea is to replicate copies of a solution by ordering them by rank... |

3930 | Optimization by simulated annealing
- Kirkpatrick, Gelatt, et al.
- 1983
Citation Context ...n, and the other species which this individual has to compete with are a group of other solutions to the problems; thus, unlike other optimization methods such as hill climbing or simulated annealing =-=[21]-=- they work with an entire population of current solutions rather than a single solution. This is one of the reasons why evolutionary algorithms are more effective as search methods than either hill-cl... |

3035 |
Adaptation in Natural and Artificial Systems
- Holland
- 1975
Citation Context ...em. In order to overcome this, we will illustrate an innovative use of evolutionary search techniques for the outlier detection problem. 2.1 An Overview of Evolutionary Search Evolutionary Algorithms =-=[20]-=- are methods which imitate the process of organic evolution [12] in order to solve parameter optimization problems. The fundamental idea underlying Darwinian evolution is that in nature, resources are... |

1206 | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
- Ester, Kriegel, et al.
- 1996
Citation Context ...e methods do not work quite as well when the dimensionality is high and the data becomes sparse. Many data-mining algorithms in the literature find outliers as a side-product of clustering algorithms =-=[2, 3, 5, 15, 18, 27]-=-. However, these techniques define outliers as points which do not lie in clusters. Thus, the techniques implicitly define outliers as the background noise in which the clusters are embedded. Another ... |

925 |
An Analysis of the Behavior of a Class of Genetic Adaptive Systems
- De Jong
- 1975
Citation Context ... fitness value. As the process of evolution progresses, the individuals in the population become more and more genetically similar to each other. This phenomenon is referred to as convergence. Dejong =-=[14]-=- defined convergence of a gene as the stage at which 95% of the population had the same value for that gene. The population is said to have converged when all genes have converged. The application of ... |

628 | Efficient and effective clustering methods for spatial data mining - Ng, Han - 1994 |

600 | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications
- Agrawal, Gehrke, et al.
- 1998
Citation Context ...e methods do not work quite as well when the dimensionality is high and the data becomes sparse. Many data-mining algorithms in the literature find outliers as a side-product of clustering algorithms =-=[2, 3, 5, 15, 18, 27]-=-. However, these techniques define outliers as points which do not lie in clusters. Thus, the techniques implicitly define outliers as the background noise in which the clusters are embedded. Another ... |

588 | CURE: an efficient clustering algorithm for large databases
- Guha, Rastogi, et al.
- 1998
Citation Context ...e methods do not work quite as well when the dimensionality is high and the data becomes sparse. Many data-mining algorithms in the literature find outliers as a side-product of clustering algorithms =-=[2, 3, 5, 15, 18, 27]-=-. However, these techniques define outliers as points which do not lie in clusters. Thus, the techniques implicitly define outliers as the background noise in which the clusters are embedded. Another ... |

291 | Algorithms for Mining Distance-Based Outliers in Large Datasets
- Knorr, Ng
Citation Context ...hniques define outliers as points which do not lie in clusters. Thus, the techniques implicitly define outliers as the background noise in which the clusters are embedded. Another class of techniques =-=[7, 10, 13, 22, 23, 25]-=- defines outliers as points which are neither a part of a cluster nor a part of the background noise; rather they are specifically points which behave very differently from the norm. Outliers are more... |

282 |
Outliers in Statistical Data
- Barnett, Lewis
- 1994
Citation Context ...l behavior of the underlying application. The algorithms of this paper determine outliers based only on their deviation value. Many algorithms have been proposed in recent years for outlier detection =-=[7, 8, 10, 22, 23, 25, 26]-=-, but they are not methods which are specifically designed in order to deal with the curse of high dimensionality. The statistics community has studied the concept of outliers quite extensively [8]. I... |

249 | Efficient algorithms for mining outliers from large data sets
- Ramaswamy, Rastogi, et al.
- 2000
Citation Context ...hniques define outliers as points which do not lie in clusters. Thus, the techniques implicitly define outliers as the background noise in which the clusters are embedded. Another class of techniques =-=[7, 10, 13, 22, 23, 25]-=- defines outliers as points which are neither a part of a cluster nor a part of the background noise; rather they are specifically points which behave very differently from the norm. Outliers are more... |

229 |
BIRCH: an efficient data clustering method for very large databases
- Zhang, Ramakrishnan, et al.
- 1996 |

214 |
Identification of outliers
- Hawkins
- 1980
Citation Context ...hniques define outliers as points which do not lie in clusters. Thus, the techniques implicitly define outliers as the background noise in which the clusters are embedded. Another class of techniques =-=[7, 10, 13, 22, 23, 25]-=- defines outliers as points which are neither a part of a cluster nor a part of the background noise; rather they are specifically points which behave very differently from the norm. Outliers are more... |

171 |
Adaptation in Natural and Artificial Systems
- Holland
- 1992
Citation Context ...em. In order to overcome this, we will illustrate an innovative use of evolutionary search techniques for the outlier detection problem. 2.1 An Overview of Evolutionary Search Evolutionary Algorithms =-=[20]-=- are methods which imitate the process of organic evolution [12] in order to solve parameter optimization problems. The fundamental idea underlying Darwinian evolution is that in nature, resources are... |

126 |
Mining Association Rules between
- Agrawal, Imielinski, et al.
- 1993
Citation Context ...are no upward or downward-closed properties in the set of dimensions (along with associated ranges) which are unusually sparse. This is not unexpected: unlike problems such as large itemset detection =-=[6]-=- where one is looking for large aggregate patterns, the problem of finding subsets of dimensions which are sparsely populated has the flavor of finding a needle in haystack, since one is looking for p... |

110 | Local dimensionality reduction: A new approach to indexing high dimensional spaces - Chakrabarti, Mehrotra - 2000 |

68 | Finding intensional knowledge of distance-based outliers
- Knorr, Ng
- 1999
Citation Context ...more applicable to low dimensional versions. On the other hand, we note that most practical data mining applications are likely to arise in the context of a very large number of features. Recent work =-=[23]-=- has discussed the concept of intensional knowledge of distance-based outliers in terms of subsets of attributes. This technique provides excellent interpretability by providing the reasoning behind w... |

37 |
Re-designing Distance Functions and Distance-based Applications for High Dimensional Data
- Aggarwal
- 2001
Citation Context ...t then meaningful clusters cannot be found in the data [2, 3, 5, 11]; similarly it is difficult to detect abnormal deviations. For problems such as clustering and similarity search, it has been shown =-=[1, 2, 3, 5, 11, 19]-=- that by examining the behavior of the data in subspaces, it is possible to design more effective algorithms. This is because different localities of the data are dense with respect to different subse... |

37 | Efficient and Effective Clustering Methods for Spatial Data - Ng, Han - 1994 |

31 |
What is the Nearest Neighbor
- Hinneburg, Aggarwal, et al.
- 2000
Citation Context ...t then meaningful clusters cannot be found in the data [2, 3, 5, 11]; similarly it is difficult to detect abnormal deviations. For problems such as clustering and similarity search, it has been shown =-=[1, 2, 3, 5, 11, 19]-=- that by examining the behavior of the data in subspaces, it is possible to design more effective algorithms. This is because different localities of the data are dense with respect to different subse... |

25 |
Finding Generalized Projected Clusters
- Aggarwal, Yu
- 2000 |

22 | Optimized crossover for the independent set problem
- Aggarwal, Orlin, et al.
- 1997
Citation Context ...plication of evolutionary search procedures should be based on a good understanding of the problem at hand. Typically black-box GA software on straightforward string encodings does not work very well =-=[4]-=-, and it is often a nontrivial task to design the recombinations, selections and mutations which work well for a given problem. In the next section, we will discuss the details of the evolutionary sea... |

17 |
The Origin of Species by Natural Selection
- Darwin
- 1859
Citation Context ...use of evolutionary search techniques for the outlier detection problem. 2.1 An Overview of Evolutionary Search Evolutionary Algorithms [20] are methods which imitate the process of organic evolution =-=[12]-=- in order to solve parameter optimization problems. The fundamental idea underlying Darwinian evolution is that in nature, resources are scarce and this leads to a competition among the species. Conse... |

12 |
CURE: An Efficient Clustering Algorithm for Large Databases
- Guha, Rastogi, et al.
- 1998
Citation Context ...ese methods do not work quite as well when the dimensionality is high and the data becomes sparse. Many data-mining algorithms in the literature find outliers as a side-product of clustering algorithms =-=[2, 3, 5, 15, 18, 27]-=-. However, these techniques define outliers as points which do not lie in clusters. Thus, the techniques implicitly define outliers as the background noise in which the clusters are embedded. Anot... |

11 |
A Linear Method for Deviation Detection
- Arning, Agrawal, et al.
- 1996 |

5 |
Fast algorithms for projected clustering
- Aggarwal, et al.
- 1999 |

2 | Genesis Software Version 5.0, Available at http://www.santafe.edu - Grefenstette |

1 |
Identification of Outliers
- Hawkins
- 1980
Citation Context ...niques define outliers as points which do not lie in clusters. Thus, the techniques implicitly define outliers as the background noise in which the clusters are embedded. Another class of techniques =-=[7, 10, 13, 22, 23, 25]-=- defines outliers as points which are neither a part of a... |