## On the Surprising Behavior of Distance Metrics in High Dimensional Space (2001)

Venue: | Lecture Notes in Computer Science |

Citations: | 154 - 3 self |

### BibTeX

@INPROCEEDINGS{Aggarwal01onthe,

author = {Charu C. Aggarwal and Alexander Hinneburg and Daniel A. Keim},

title = {On the Surprising Behavior of Distance Metrics in High Dimensional Space},

booktitle = {Lecture Notes in Computer Science},

year = {2001},

pages = {420--434},

publisher = {Springer}

}

### Abstract

In recent years, the effect of the curse of high dimensionality has been studied in great detail on several problems such as clustering, nearest neighbor search, and indexing. In high dimensional space the data becomes sparse, and traditional indexing and algorithmic techniques fail from an efficiency and/or effectiveness perspective. Recent research results show that in high dimensional space, the concept of proximity, distance or nearest neighbor may not even be qualitatively meaningful. In this paper, we view the dimensionality curse from the point of view of the distance metrics which are used to measure the similarity between objects. We specifically examine the behavior of the commonly used Lk norm and show that the problem of meaningfulness in high dimensionality is sensitive to the value of k. For example, this means that the Manhattan distance metric (L1 norm) is consistently more preferable than the Euclidean distance metric (L2 norm) for high dimensional data mining applications. Using the intuition derived from our analysis, we introduce and examine a natural extension of the Lk norm to fractional distance metrics. We show that the fractional distance metric provides more meaningful results both from the theoretical and empirical perspective. The results show that fractional distance metrics can significantly improve the effectiveness of standard clustering algorithms such as the k-means algorithm.
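The Lk norm and its fractional extension described in the abstract can be illustrated with a short sketch. This is not the authors' code: the function names `minkowski_distance` and `relative_contrast`, the uniform random data, the choice of the origin as query point, and the sizes used are all assumptions made for illustration. The contrast measure computed here, (Dmax - Dmin) / Dmin over the distances from a query to a point set, is one common way to quantify how well a distance function separates near from far points.

```python
import random


def minkowski_distance(x, y, k):
    """L_k distance between two equal-length sequences.

    For k >= 1 this is the usual L_k norm of the difference vector.
    For 0 < k < 1 it is a "fractional" distance, which no longer
    satisfies the triangle inequality.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)


def relative_contrast(points, query, k):
    """Return (Dmax - Dmin) / Dmin for the distances from query to points."""
    dists = [minkowski_distance(p, query, k) for p in points]
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # Hypothetical experiment: uniform data in [0, 1]^dim, query at the origin.
    rng = random.Random(0)
    dim, n_points = 20, 200
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [0.0] * dim
    for k in (0.5, 1.0, 2.0, 3.0):
        print(f"k={k}: relative contrast = {relative_contrast(points, query, k):.3f}")
```

Running the script for several values of `k` and `dim` gives a rough, informal way to compare how the contrast changes with the norm parameter; it makes no claim to reproduce the paper's experiments.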

### Citations

2221 | R-trees : A Dynamic Index Structure for Spatial Searching
- Guttman
- 1984
Citation Context .... 1 Introduction In recent years, high dimensional search and retrieval have become very well studied problems because of the increased importance of data mining applications [1], [2], [3], [4], [5], [8], [10], [11]. Typically, most real applications which require the use of such techniques comprise very high dimensional data. For such applications, the curse of high dimensionality tends to be a major...

376 | The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries
- Katayama, Satoh
- 1997
Citation Context ...ntroduction In recent years, high dimensional search and retrieval have become very well studied problems because of the increased importance of data mining applications [1], [2], [3], [4], [5], [8], [10], [11]. Typically, most real applications which require the use of such techniques comprise very high dimensional data. For such applications, the curse of high dimensionality tends to be a major obst...

184 | A Cost Model for Nearest Neighbor Search in High-Dimensional Data Space
- Berchtold, Böhm, et al.
- 1997
Citation Context ... algorithm. 1 Introduction In recent years, high dimensional search and retrieval have become very well studied problems because of the increased importance of data mining applications [1], [2], [3], [4], [5], [8], [10], [11]. Typically, most real applications which require the use of such techniques comprise very high dimensional data. For such applications, the curse of high dimensionality tends to...

30 | What is the Nearest Neighbor
- Hinneburg, Aggarwal, et al.
- 2000
Citation Context ...1 $(\|x_i - y_i\|^k)^{1/k}$) in high dimensional space. The $L_k$ norm distance function is also susceptible to the dimensionality curse for many classes of data distributions [6]. Our recent results [9] seem to suggest that the $L_k$-norm may be more relevant for $k = 1$ or $2$ than values of $k \geq 3$. In this paper, we provide some surprising theoretical and experimental results in analyzing the dependency o...

30 | Faloutsos C.: ‘The TV-Tree: An Index Structure for High-Dimensional
- Lin, Jagadish
- 1995
Citation Context ...ction In recent years, high dimensional search and retrieval have become very well studied problems because of the increased importance of data mining applications [1], [2], [3], [4], [5], [8], [10], [11]. Typically, most real applications which require the use of such techniques comprise very high dimensional data. For such applications, the curse of high dimensionality tends to be a major obstacle i...

10 | Nearest Neighbor Query Performance for Unstable Distributions
- Shaft, Goldstein, et al.
- 1998
Citation Context ...n the proof of Lemma 1. We have shown in the proof of the previous result that $\frac{PA_d}{d^{1/k}} \to \left(\frac{1}{k+1}\right)^{1/k}$. Using Slutsky's theorem we can derive that: $\min\left\{\frac{PA_d}{d^{1/k}}, \frac{PB_d}{d^{1/k}}\right\} \to \left(\frac{1}{k+1}\right)^{1/k}$ (7) We have also shown in the previous result that: $\lim_{d \to \infty} E\left[\frac{|PA_d - PB_d|}{d^{1/k - 1/2}}\right] = C \cdot \left(\frac{1}{(k+1)^{1/k}}\right) \sqrt{\left(\frac{1}{2 \cdot k + 1}\right)}$ (8) We can combine the results in Equation 7 and 8 to ...

2 | H.-P.: The Pyramid Technique: Towards Breaking the
- Berchtold, Böhm, et al.
- 1998
Citation Context ...means algorithm. 1 Introduction In recent years, high dimensional search and retrieval have become very well studied problems because of the increased importance of data mining applications [1], [2], [3], [4], [5], [8], [10], [11]. Typically, most real applications which require the use of such techniques comprise very high dimensional data. For such applications, the curse of high dimensionality ten...

1 | Schek H.-J., Blott S.: A Quantitative Analysis and Performance Study for Similarity-Search Methods
- Weber
Citation Context ... as the k-means algorithm. 1 Introduction In recent years, high dimensional search and retrieval have become very well studied problems because of the increased importance of data mining applications [1], [2], [3], [4], [5], [8], [10], [11]. Typically, most real applications which require the use of such techniques comprise very high dimensional data. For such applications, the curse of high dimensio...

1 | Fayyad U., Geiger D.: Density-Based Indexing for Approximate Nearest Neighbor Queries
- Bennett
- 1999
Citation Context ...he k-means algorithm. 1 Introduction In recent years, high dimensional search and retrieval have become very well studied problems because of the increased importance of data mining applications [1], [2], [3], [4], [5], [8], [10], [11]. Typically, most real applications which require the use of such techniques comprise very high dimensional data. For such applications, the curse of high dimensionalit...