## On the Resemblance and Containment of Documents (1997)

Venue: | Compression and Complexity of Sequences (SEQUENCES’97) |

Citations: | 397 - 7 self |

### BibTeX

@INPROCEEDINGS{Broder97onthe,
    author = {Andrei Z. Broder},
    title = {On the Resemblance and Containment of Documents},
    booktitle = {Compression and Complexity of Sequences (SEQUENCES’97)},
    year = {1997},
    pages = {21--29},
    publisher = {IEEE Computer Society}
}

### Abstract

Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document.
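The reduction to set intersection described in the abstract can be illustrated with a minimal Python sketch. It assumes the usual shingle-based formulation: each document is mapped to the set S(·) of its contiguous w-word sequences, resemblance is |S(A) ∩ S(B)| / |S(A) ∪ S(B)|, and containment of A in B is |S(A) ∩ S(B)| / |S(A)|. The function names, the default values `w=4` and `s=100`, and the use of `hashlib.blake2b` as a stand-in for Rabin fingerprints are assumptions made for illustration, not details taken from this page.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) of a text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Exact resemblance: |S(A) & S(B)| / |S(A) | S(B)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def containment(a, b, w=4):
    """Exact containment of A in B: |S(A) & S(B)| / |S(A)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 1.0


def fingerprint(shingle):
    """64-bit fingerprint of a shingle (assumption: BLAKE2b, not Rabin)."""
    data = " ".join(shingle).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(text, s=100, w=4):
    """Fixed-size sample: the s smallest fingerprints of the shingle set.

    Each document's sketch is computed independently of any other document.
    """
    return sorted({fingerprint(x) for x in shingles(text, w)})[:s]


def estimate_resemblance(sketch_a, sketch_b, s=100):
    """Estimate resemblance from two sketches only.

    Take the s smallest fingerprints of the combined sketches and count
    the fraction of them that appear in both sketches.
    """
    smallest = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not smallest:
        return 1.0
    common = set(sketch_a) & set(sketch_b)
    return sum(1 for x in smallest if x in common) / len(smallest)


doc_a = "a rose is a rose is a rose"
doc_b = "a rose is a flower which is a rose"
print(resemblance(doc_a, doc_b))
print(containment(doc_a, doc_b))
print(estimate_resemblance(sketch(doc_a), sketch(doc_b)))
```

With these two short example strings, the documents share one of eight distinct 4-word shingles, so the exact resemblance is 1/8. Because both documents have fewer than `s` shingles, the sketches hold every fingerprint and the estimate coincides with the exact value; for longer documents the sketch is a sample and the estimate is only approximate.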

### Citations

1760 | The Probabilistic Method
- Alon, Spencer
- 2000
Citation Context ...ultiple collisions, that is, three or more distinct elements of S having the same image under f. Furthermore, it can be argued that the size of f(S) is fairly well concentrated. By Azuma’s inequality [1] we have $\Pr\left(\bigl|\,|f(S)| - E(|f(S)|)\,\bigr| > \lambda\sqrt{n}\right) < e^{-\lambda^2/2}$. However, for practical situations, ℓ is sufficiently large so that it is simpler to use Markov inequality. (E.g.: the probability that t... |

460 | Syntactic clustering of the Web
- Broder, Glassman, et al.
- 1997
Citation Context ...sters contained only identical documents (5.3 million documents). The remaining 1.5 million clusters contained 7 million documents (a mixture of exact duplicates and similar). For further details see [5]. The basic approach has two aspects: First, resemblance and containment are expressed as set intersection problems (this is explained in Section 2) and second, the relative size of these intersection... |

256 | Fingerprinting by random polynomials
- Rabin
- 1981
Citation Context ...,f(A, B) − rw(A, B)| < 2ℓ−11 . In an actual implementation, f is not totally random, and the probability of collision might be higher. A good choice is to take f to be Rabin’s fingerprinting function [10] in which case the probability of collision of two strings $s_1$ and $s_2$ can be bounded (in an adversarial model for $s_1$ and $s_2$) by $\max(|s_1|, |s_2|)/2^{\ell-1}$ where $|s_1|$ is the length of the string $s_1$ in bits. The... |

234 | Finding similar files in a large file system
- Manber
- 1994
Citation Context ...ndently by Heintze [6], though there are differences in detail and in the precise definition of the measures used. Related sampling mechanisms for determining similarity were also developed by Manber [7] and within the Stanford SCAM project [2, 8, 9]. We tested the ideas discussed above by building a clustering of a collection of over 30,000,000 documents into sets of closely resembling documents. The... |

224 | Min-wise independent permutations
- Broder, Charikar, et al.
- 1998
Citation Context ...ample we need a random permutation $\pi : \{0, 1, \ldots, 2^\ell\} \to \{0, 1, \ldots, 2^\ell\}$. In practice we can imagine that the fingerprints implement $\pi(f(\cdot))$ rather than $f(\cdot)$ or use random linear transformations. (See [4] for an in-depth discussion of this topic.) Using Rabin fingerprints, in which strings are viewed as polynomials over $Z_2$, we can imagine that the underlying polynomial is first multiplied by a suitabl... |

182 | Copy detection mechanisms for digital documents
- Brin, Davis, et al.
- 1995
Citation Context ...differences in detail and in the precise definition of the measures used. Related sampling mechanisms for determining similarity were also developed by Manber [7] and within the Stanford SCAM project [2, 8, 9]. We tested the ideas discussed above by building a clustering of a collection of over 30,000,000 documents into sets of closely resembling documents. The documents were retrieved from a walk of the Wo... |

128 | SCAM: A copy detection mechanism for digital documents
- Shivakumar, Garcia-Molina
- 1995
Citation Context ...differences in detail and in the precise definition of the measures used. Related sampling mechanisms for determining similarity were also developed by Manber [7] and within the Stanford SCAM project [2, 8, 9]. We tested the ideas discussed above by building a clustering of a collection of over 30,000,000 documents into sets of closely resembling documents. The documents were retrieved from a walk of the Wo... |

100 | Some applications of Rabin’s fingerprinting method
- Broder
- 1993
Citation Context ...ible polynomials) rather than some arbitrary hash functions is that their probability of collision is well understood. Furthermore Rabin fingerprints can be computed very efficiently in software (see [3]) and we can take advantage of their algebraic properties when we compute the fingerprints of “sliding windows.” (See section 4.3.) For the clustering experiment discussed in the introduction the si... |

81 | Scalable document fingerprinting
- Heintze
- 1996
Citation Context ...l be explained this problem can be finessed at the cost of a loss of precision. Our approach to determining syntactic similarity is related to the sampling approach developed independently by Heintze [6], though there are differences in detail and in the precise definition of the measures used. Related sampling mechanisms for determining similarity were also developed by Manber [7] and within the Sta... |

74 | Building a scalable and accurate copy detection mechanism
- Shivakumar, Garcia-Molina
- 1996
Citation Context ...differences in detail and in the precise definition of the measures used. Related sampling mechanisms for determining similarity were also developed by Manber [7] and within the Stanford SCAM project [2, 8, 9]. We tested the ideas discussed above by building a clustering of a collection of over 30,000,000 documents into sets of closely resembling documents. The documents were retrieved from a walk of the Wo... |