## Similarity estimation techniques from rounding algorithms (2002)

Venue: In Proc. of 34th STOC

Citations: 229 (6 self)

### BibTeX

@INPROCEEDINGS{Charikar02similarityestimation,
  author = {Moses S. Charikar},
  title = {Similarity estimation techniques from rounding algorithms},
  booktitle = {In Proc. of 34th STOC},
  year = {2002},
  pages = {380--388},
  publisher = {ACM}
}

### Abstract

A locality sensitive hashing scheme is a distribution on a family $F$ of hash functions operating on a collection of objects, such that for two objects $x, y$,

$$\Pr_{h \in F}[h(x) = h(y)] = \mathrm{sim}(x, y),$$

where $\mathrm{sim}(x, y) \in [0, 1]$ is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering.

Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure $\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$. We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:

1. A collection of vectors with the distance between $\vec{u}$ and $\vec{v}$ measured by $\theta(\vec{u}, \vec{v})/\pi$, where $\theta(\vec{u}, \vec{v})$ is the angle between $\vec{u}$ and $\vec{v}$. This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to min-wise independent permutations for estimating set similarity.
2. A collection of distributions on $n$ points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), a popular distance measure in graphics and vision. Our hash functions map distributions to points in the metric space such that, for distributions $P$ and $Q$,
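
As a rough illustration of the first scheme in the abstract (vectors under angular distance), the sketch below uses random-hyperplane rounding: each hash bit records which side of a random hyperplane through the origin a vector falls on, so two vectors disagree on a given bit with probability equal to their angle divided by π. The abstract does not spell out the hash function, so the sign-of-random-projection form, along with the names `random_hyperplanes`, `sketch`, and `estimate_angle`, is an assumption made here for illustration rather than the paper's exact presentation.

```python
import math
import random


def random_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian normal vectors, one per sketch bit.

    Assumption: hyperplanes pass through the origin and their normals
    have i.i.d. standard normal coordinates.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def sketch(vec, planes):
    """Return one bit per hyperplane: 1 if vec is on its non-negative side."""
    return [
        1 if sum(r_i * x_i for r_i, x_i in zip(r, vec)) >= 0 else 0
        for r in planes
    ]


def estimate_angle(sketch_u, sketch_v):
    """Estimate the angle (in radians) between two vectors from their sketches.

    The fraction of disagreeing bits estimates theta / pi.
    """
    disagreements = sum(a != b for a, b in zip(sketch_u, sketch_v))
    return math.pi * disagreements / len(sketch_u)


if __name__ == "__main__":
    u = [1.0, 0.0, 0.0]
    v = [1.0, 1.0, 0.0]  # 45 degrees away from u
    planes = random_hyperplanes(dim=3, num_bits=4096)
    theta = estimate_angle(sketch(u, planes), sketch(v, planes))
    print("estimated angle:", theta, "true angle:", math.pi / 4)
    print("estimated cosine similarity:", math.cos(theta))
```

The estimate sharpens as `num_bits` grows, since each bit is an independent trial; comparing two sketches needs only a count of differing bits, which is what makes the representation compact.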