## Similarity estimation techniques from rounding algorithms (2002)

### Download Links

- [www.cs.princeton.edu]
- DBLP

### Other Repositories/Bibliography

Venue: In Proc. of 34th STOC

Citations: 230 (6 self)

### BibTeX

@INPROCEEDINGS{Charikar02similarityestimation,
    author = {Moses S. Charikar},
    title = {Similarity estimation techniques from rounding algorithms},
    booktitle = {In Proc. of 34th STOC},
    year = {2002},
    pages = {380--388},
    publisher = {ACM}
}

### Abstract

A locality sensitive hashing scheme is a distribution on a family $F$ of hash functions operating on a collection of objects, such that for two objects $x, y$, $\Pr_{h \in F}[h(x) = h(y)] = \mathrm{sim}(x, y)$, where $\mathrm{sim}(x, y) \in [0, 1]$ is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure $\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$.

We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:

1. A collection of vectors with the distance between $\vec{u}$ and $\vec{v}$ measured by $\theta(\vec{u}, \vec{v})/\pi$, where $\theta(\vec{u}, \vec{v})$ is the angle between $\vec{u}$ and $\vec{v}$. This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to min-wise independent permutations for estimating set similarity.
2. A collection of distributions on $n$ points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), a popular distance measure in graphics and vision. Our hash functions map distributions to points in the metric space such that, for distributions $P$ and $Q$,
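The random hyperplane scheme for vectors described in the abstract can be illustrated with a minimal Python sketch. It assumes the standard construction in which each hash bit is the sign of the inner product with a random Gaussian vector, so two vectors disagree on a bit with probability $\theta/\pi$. The function names (`random_hyperplanes`, `sketch`, `estimate_angle`, `true_angle`), the sketch length of 4096 bits, and the seed are illustrative choices, not taken from the paper.

```python
import math
import random


def random_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian normal vectors in dim dimensions (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def sketch(vec, planes):
    """One bit per hyperplane: 1 if vec lies on the non-negative side, else 0."""
    return [
        1 if sum(r_i * x_i for r_i, x_i in zip(r, vec)) >= 0 else 0
        for r in planes
    ]


def estimate_angle(sketch_u, sketch_v):
    """The fraction of disagreeing bits estimates theta / pi, so scale by pi."""
    disagree = sum(a != b for a, b in zip(sketch_u, sketch_v))
    return math.pi * disagree / len(sketch_u)


def true_angle(u, v):
    """Exact angle between u and v, used only for comparison."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))


planes = random_hyperplanes(dim=3, num_bits=4096, seed=42)
u = [1.0, 0.0, 0.0]
v = [1.0, 1.0, 0.0]
est = estimate_angle(sketch(u, planes), sketch(v, planes))
exact = true_angle(u, v)  # pi / 4 for these two vectors
```

The estimate concentrates around the exact angle as the number of bits grows; the cosine similarity can then be approximated as `math.cos(est)`. Only the bit sketches need to be stored, which is what makes the representation compact.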

### Citations

937 | Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming
- Goemans, Williamson
- 1995
Citation Context ...sensitive hash functions do not exist for certain commonly used similarity measures in information retrieval, the Dice coefficient and the Overlap coefficient. In seminal work, Goemans and Williamson [24] introduced semidefinite programming relaxations as a tool for approximation algorithms. They used the random hyperplane rounding technique to round vector solutions for the MAX-CUT problem. We will s...

713 | Approximate nearest neighbors: towards removing the curse of dimensionality. Proceedings of the thirtieth annual ACM symposium on Theory of computing
- Indyk, Motwani
- 1998
Citation Context ...and reducing the input size significantly. In fact, the minwise independent permutations hashing scheme is a particular instance of a locality sensitive hashing scheme introduced by Indyk and Motwani [31] in their work on nearest neighbor search in high dimensions. Definition 1. A locality sensitive hashing scheme is a distribution on a family F of hash functions operating on a collection of objects, ...

701 | The space complexity of approximating the frequency moments
- Alon, Matias, et al.
- 1999

456 | The earth mover’s distance as a metric for image retrieval
- Rubner, Tomasi, et al.
Citation Context ...the minimum amount of work that must be done in transforming one distribution to the other.) This is a popular metric for images and is used for image similarity, navigating image databases and so on [37, 38, 39, 40, 36, 15, 16, 41, 42]. The idea is to represent an image as a distribution on features with an underlying distance metric on features (e.g. colors in a color spectrum). Since the earth mover distance is expensive to compu...

414 | Syntactic clustering of the Web
- Broder, Glassman, et al.
- 1997
Citation Context ..., 1], measuring the degree of similarity between x and y. sim(x, y) = 1 corresponds to objects x, y that are identical while sim(x, y) = 0 corresponds to objects that are very different. Broder et al [8, 5, 7, 6] introduced the notion of min-wise independent permutations, a technique for constructing such sketching functions for a collection of sets. The similarity measure considered there was sim(A,B) = |A ∩...

342 | On the resemblance and containment of documents
- Broder
- 1997
Citation Context ..., 1], measuring the degree of similarity between x and y. sim(x, y) = 1 corresponds to objects x, y that are identical while sim(x, y) = 0 corresponds to objects that are very different. Broder et al [8, 5, 7, 6] introduced the notion of min-wise independent permutations, a technique for constructing such sketching functions for a collection of sets. The similarity measure considered there was sim(A,B) = |A ∩...

314 | Probabilistic approximation of metric spaces and its algorithmic applications
- Bartal
- 1996
Citation Context ...o not describe LP relaxations for general metrics. Their LP relaxations are defined for Hierarchically well Separated Trees (HSTs). They convert a general metric to such an HST using Bartal’s results [3, 4] on probabilistic approximation of metric spaces via tree metrics. However, it follows from combining ideas in [33] with those in Chekuri et al [11]. Chekuri et al do in fact give an LP relaxation for...

309 | A metric for distributions with applications to image databases
- Rubner, Tomasi, et al.
- 1998
Citation Context ...the minimum amount of work that must be done in transforming one distribution to the other.) This is a popular metric for images and is used for image similarity, navigating image databases and so on [37, 38, 39, 40, 36, 15, 16, 41, 42]. The idea is to represent an image as a distribution on features with an underlying distance metric on features (e.g. colors in a color spectrum). Since the earth mover distance is expensive to compu...

253 | On approximating arbitrary metrices by tree metrics
- Bartal
- 1998
Citation Context ...o not describe LP relaxations for general metrics. Their LP relaxations are defined for Hierarchically well Separated Trees (HSTs). They convert a general metric to such an HST using Bartal’s results [3, 4] on probabilistic approximation of metric spaces via tree metrics. However, it follows from combining ideas in [33] with those in Chekuri et al [11]. Chekuri et al do in fact give an LP relaxation for...

251 | Clustering data streams
- Guha, Mishra, et al.
- 2000
Citation Context ...ss (or a few passes) over the data while using a limited amount of storage space and time. To cite a few examples, Alon et al [2] considered the problem of estimating frequency moments and Guha et al [25] considered the problem of clustering points in a streaming fashion. Many of these streaming algorithms need to represent important aspects of the data they have seen so far in a small amount of space...

195 | Surfing wavelets on streams: One-pass summaries for approximate aggregate queries
- Gilbert, Kotidis, et al.
- 2001
Citation Context ...s. Gibbons and Matias [18] give sketching algorithms producing so called synopsis data structures for various problems including maintaining approximate histograms, hot lists and so on. Gilbert et al [19] give algorithms to compute sketches for data streams so as to estimate any linear projection of the data and use this to get individual point and range estimates. Recently, Gilbert et al [21] gave ef...

193 | Min-wise independent permutations
- Broder, Charikar, et al.
Citation Context ..., 1], measuring the degree of similarity between x and y. sim(x, y) = 1 corresponds to objects x, y that are identical while sim(x, y) = 0 corresponds to objects that are very different. Broder et al [8, 5, 7, 6] introduced the notion of min-wise independent permutations, a technique for constructing such sketching functions for a collection of sets. The similarity measure considered there was sim(A,B) = |A ∩...

136 | Finding interesting associations without support pruning
- Cohen, Datar, et al.
- 2000
Citation Context ...pendent permutations results in efficient data structures for set similarity queries and leads to efficient clustering algorithms. This was exploited later in several experimental papers: Cohen et al [14] for association-rule mining, Haveliwala et al [27] for clustering web documents, Chen et al [13] for selectivity estimation of boolean queries, Chen et al [12] for twig queries, and Gionis et al [22]...

114 | Identifying and Filtering Near-Duplicate Documents
- Broder

108 | Synopsis data structures for massive data sets
- Gibbons, Matias
- 1999
Citation Context ...on the original data set can be estimated by efficient computations on the compact sketches. Building on the ideas of [2], Alon et al [1] give algorithms for estimating join sizes. Gibbons and Matias [18] give sketching algorithms producing so called synopsis data structures for various problems including maintaining approximate histograms, hot lists and so on. Gilbert et al [19] give algorithms to co...

105 | Fast, small-space algorithms for approximate histogram maintenance
- Gilbert, et al.
- 2002
Citation Context ...t et al [19] give algorithms to compute sketches for data streams so as to estimate any linear projection of the data and use this to get individual point and range estimates. Recently, Gilbert et al [21] gave efficient algorithms for the dynamic maintenance of histograms. Their algorithm processes a stream of updates and maintains a small sketch of the data from which the optimal histogram representa...

75 | Perceptual metrics for image database navigation
- Rubner
- 1999
Citation Context ...the minimum amount of work that must be done in transforming one distribution to the other.) This is a popular metric for images and is used for image similarity, navigating image databases and so on [37, 38, 39, 40, 36, 15, 16, 41, 42]. The idea is to represent an image as a distribution on features with an underlying distance metric on features (e.g. colors in a color spectrum). Since the earth mover distance is expensive to compu...

66 | Approximation Algorithms for the 0-Extension Problem
- Calinescu, Karloff, et al.
- 2001
Citation Context ...ur family is based on rounding algorithms for LP relaxations for the problem of classification with pairwise relationships studied by Kleinberg and Tardos [33], and further studied by Calinescu et al [10] and Chekuri et al [11]. Combining a new LP formulation described by Chekuri et al together with a rounding technique of Kleinberg and Tardos, we show a construction of a hash function family which ap...

65 | Approximation algorithms for the metric labeling problem via a new linear programming formulation
- Chekuri, Khanna, et al.
- 2001
Citation Context ...ounding algorithms for LP relaxations for the problem of classification with pairwise relationships studied by Kleinberg and Tardos [33], and further studied by Calinescu et al [10] and Chekuri et al [11]. Combining a new LP formulation described by Chekuri et al together with a rounding technique of Kleinberg and Tardos, we show a construction of a hash function family which approximates the earth mo...

65 | Counting Twig Matches in a Tree
- Chen, Jagadish, et al.
- 2001
Citation Context ...everal experimental papers: Cohen et al [14] for association-rule mining, Haveliwala et al [27] for clustering web documents, Chen et al [13] for selectivity estimation of boolean queries, Chen et al [12] for twig queries, and Gionis et al [22] for indexing set value 1 One question left open in [7] was the issue of compact representation of hash functions in this family; this was settled by Indyk [28]...

56 | A small approximately min-wise independent family of hash functions
- Indyk
Citation Context ...[12] for twig queries, and Gionis et al [22] for indexing set value attributes. (Footnote 1: One question left open in [7] was the issue of compact representation of hash functions in this family; this was settled by Indyk [28], who gave a construction of a small family of minwise independent permutations.) All of this work used the hashing technique for set similarity together with ideas from [31]. We note that ...

48 | Color edge detection with the compass operator
- Ruzon, Tomasi
- 1999

46 | QuickSAND: Quick summary and analysis of network data
- Gilbert, Kotidis, et al.
Citation Context ...n vectors is more direct and is also more general since it applies to general vectors.) We also note that the cosine between vectors can be estimated from known techniques based on random projections [2, 1, 20]. However, the advantage of a locality sensitive hashing based scheme is that this directly yields techniques for nearest neighbor search for the cosine similarity measure. An attractive feature of th...

41 | Scalable techniques for clustering the web
- Haveliwala, Gionis, et al.
- 2000
Citation Context ...ctures for set similarity queries and leads to efficient clustering algorithms. This was exploited later in several experimental papers: Cohen et al [14] for association-rule mining, Haveliwala et al [27] for clustering web documents, Chen et al [13] for selectivity estimation of boolean queries, Chen et al [12] for twig queries, and Gionis et al [22] for indexing set value 1 One question left open in...

39 | The earth mover’s distance under transformation sets
- Cohen, Guibas
- 1999

33 | Selectivity Estimation for Boolean Queries
- Chen, Korn, et al.
- 2000
Citation Context ... efficient clustering algorithms. This was exploited later in several experimental papers: Cohen et al [14] for association-rule mining, Haveliwala et al [27] for clustering web documents, Chen et al [13] for selectivity estimation of boolean queries, Chen et al [12] for twig queries, and Gionis et al [22] for indexing set value 1 One question left open in [7] was the issue of compact representation o...

33 | On Approximate Nearest Neighbors in Non-Euclidean Spaces
- Indyk
- 1998

28 | Derandomized dimensionality reduction with applications
- Engebretsen, Indyk, et al.
- 2002
Citation Context ...n be chosen by picking $O(\log^2 n)$ random bits, i.e. we can restrict the random hyperplanes to be in a family of size $2^{O(\log^2 n)}$. This follows from the techniques in Indyk [30] and Engebretsen et al [17], which in turn use Nisan’s pseudorandom number generator for space bounded computations [35]. We omit the details since they are similar to those in [30, 17]. Using this random hyperplane based hash ...

15 | Efficient and tunable similar set retrieval
- Gionis, Gunopulos, et al.
- 2001
Citation Context ...[14] for association-rule mining, Haveliwala et al [27] for clustering web documents, Chen et al [13] for selectivity estimation of boolean queries, Chen et al [12] for twig queries, and Gionis et al [22] for indexing set value 1 One question left open in [7] was the issue of compact representation of hash functions in this family; this was settled by Indyk [28], who gave a construction of a small fam...

13 | Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields
- Kleinberg, Tardos
Citation Context ...ly for estimating the earth mover distance. Our family is based on rounding algorithms for LP relaxations for the problem of classification with pairwise relationships studied by Kleinberg and Tardos [33], and further studied by Calinescu et al [10] and Chekuri et al [11]. Combining a new LP formulation described by Chekuri et al together with a rounding technique of Kleinberg and Tardos, we show a co...

10 | The Earth Mover’s
- Rubner, Guibas, et al.
- 1997

9 | Improved classification via connectivity information
- Broder, Krauthgamer, et al.
- 2000
Citation Context ...stance based on rounding algorithms for the problem of classification with pairwise relationships, introduced by Kleinberg and Tardos [33]. (A closely related problem was also studied by Broder et al [9]). In designing hash functions to estimate the Earth Mover Distance, we will relax the definition of locality sensitive hashing (1) in three ways. 1. Firstly, the quantity we are trying to estimate is...

9 | Texture metrics
- Rubner, Tomasi
- 1998

8 | Corner detection in textured color images
- Ruzon, Tomasi
- 1999

6 | Tracking Join and
- Alon, Gibbons, et al.
- 1999
Citation Context ...cality sensitive hashing scheme is a distribution on a family $F$ of hash functions operating on a collection of objects, such that for two objects $x, y$, $\Pr_{h \in F}[h(x) = h(y)] = \mathrm{sim}(x, y)$, where $\mathrm{sim}(x, y) \in [0, 1]$ is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches...

6 | The earth mover’s distance: Lower bounds and invariance under translation
- Cohen, Guibas
- 1997

5 | Locality-Preserving Hashing
- Indyk, Motwani, et al.
- 1997

4 | Pseudorandom sequences for space bounded computations
- Nisan
- 1992
Citation Context ... be in a family of size $2^{O(\log^2 n)}$. This follows from the techniques in Indyk [30] and Engebretsen et al [17], which in turn use Nisan’s pseudorandom number generator for space bounded computations [35]. We omit the details since they are similar to those in [30, 17]. Using this random hyperplane based hash function, we obtain a hash function family for set similarity, for a slightly different measu...

3 | A constant factor approximation algorithm for a class of classification problems
- Gupta, Tardos
- 2000
