## Min-wise Independent Permutations (1998)

Venue: Journal of Computer and System Sciences

Citations: 195 (11 self)

### BibTeX

@ARTICLE{Broder98min-wiseindependent,

author = {Andrei Z. Broder and Moses Charikar and Alan M. Frieze and Michael Mitzenmacher},

title = {Min-wise Independent Permutations},

journal = {Journal of Computer and System Sciences},

year = {1998},

volume = {60},

pages = {327--336}

}

### Abstract

We define and study the notion of min-wise independent families of permutations. We say that F ⊆ S_n is min-wise independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we have Pr(min{π(X)} = π(x)) = 1/|X|. In other words, we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of our investigation we have discovered interesting and challenging theoretical questions related to this concept; we present the solutions to some of them and we list the rest as open problems.
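The definition above can be checked by brute force for very small n. The following is a minimal sketch, not code from the paper: the function names, the 0-based ground set `range(n)`, and the cyclic-shift counterexample are all assumptions made for illustration. It also illustrates the standard link to near-duplicate detection: under a min-wise independent family, the probability that two sets share the same minimum image equals |A ∩ B| / |A ∪ B|.

```python
from fractions import Fraction
from itertools import combinations, permutations


def is_min_wise_independent(family, n):
    """Exhaustively test Pr(min pi(X) = pi(x)) = 1/|X| for every X and x in X.

    `family` is a list of permutations of range(n); each permutation is a
    tuple p where p[x] is the image of x. (Hypothetical helper.)
    """
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            for x in subset:
                hits = sum(
                    1 for p in family if min(p[y] for y in subset) == p[x]
                )
                if Fraction(hits, len(family)) != Fraction(1, size):
                    return False
    return True


def min_collision_probability(family, a, b):
    """Fraction of permutations whose minimum image over a equals that over b."""
    hits = sum(
        1 for p in family if min(p[x] for x in a) == min(p[x] for x in b)
    )
    return Fraction(hits, len(family))


n = 4
# The full symmetric group S_n is min-wise independent by symmetry.
full_symmetric_group = list(permutations(range(n)))
# Cyclic shifts form a much smaller family; used here only as a counterexample.
cyclic_shifts = [tuple((x + s) % n for x in range(n)) for s in range(n)]

print(is_min_wise_independent(full_symmetric_group, n))  # True
print(is_min_wise_independent(cyclic_shifts, n))         # False
# |A ∩ B| / |A ∪ B| = 2/4 for A = {0, 1, 2}, B = {1, 2, 3}
print(min_collision_probability(full_symmetric_group, {0, 1, 2}, {1, 2, 3}))  # 1/2
```

The exhaustive check costs time exponential in n, so it is only useful for building intuition; the difficulty the abstract points to is finding families far smaller than S_n that still satisfy (or approximately satisfy) the property.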

### Citations

8622 | Information Theory - Cover, Thomas - 1991 |

Citation Context: ...s fact allows us to obtain a lower bound on the size of F using graph entropy. We review briefly the basic facts about graph entropy. We begin with some standard concepts from information theory (see [15]). Note that in what follows we will use X to be a random variable, and not a set as previously, for notational convenience. Definition 1 (Entropy) Given a random variable X with a finite range, its e...

1693 | The Probabilistic Method - Alon, Spencer - 2000 |

Citation Context: ...rive lower bounds for the approximate problem. 3.1 Existential Upper Bounds. We obtain existential upper bounds on the sizes of approximately min-wise independent families via the probabilistic method [3], by simply choosing a number of random permutations from S_n. Theorem 4: There exist families of size O(n²/ε²) that are approximately min-wise independent and there exist families of size O(k² ln(n/...

853 | An introduction to the theory of numbers - Hardy, Wright - 1979 |

676 | Universal classes of hash functions - Carter, Wegman - 1979 |

Citation Context: ...osen uniformly at random from all possible functions until a suitable one is found but not necessarily if the search is limited to a smaller set of functions. This situation has led Carter and Wegman [13] to the concept of universal hashing. A family of hash functions H is called weakly universal if for any pair of distinct elements x1, x2 ∈ U, if h is chosen uniformly at random from H then Pr(h(x1) = h(x2...

534 | Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web - Karger, Lehman, et al. - 1997 |

Citation Context: ... results appear novel, similar ideas have appeared in the literature. For example, the property of min-wise independence appears to be a key feature of the monotone ranged hash functions described in [19]. Cohen uses the properties of the minimum element of a random permutation to estimate the size of the transitive closure, as well as to solve similar related problems [14]. Given these connections, a...

430 | Introduction to Analytic Number Theory - Apostol - 1976 |

418 | Syntactic clustering of the Web - Broder, Glassman, et al. - 1997 |

Citation Context: ...of documents this is not feasible, and a sampling mechanism per document is necessary. It turns out that the problem can be reduced to a set intersection problem by a process called shingling. (See [7, 11] for details.) Via shingling each document D gets an associated set S_D. For the purpose of the discussion here we can view S_D as a set of natural numbers. (The size of S_D is about equal to the number ...

346 | On the resemblance and containment of documents - Broder - 1997 |

Citation Context: ...of documents this is not feasible, and a sampling mechanism per document is necessary. It turns out that the problem can be reduced to a set intersection problem by a process called shingling. (See [7, 11] for details.) Via shingling each document D gets an associated set S_D. For the purpose of the discussion here we can view S_D as a set of natural numbers. (The size of S_D is about equal to the number ...

290 | The World-Wide Web - Berners-Lee, Cailliau, et al. - 1994 |

Citation Context: ...that such a family (under some relaxations) is essential to the algorithm currently used in practice by the AltaVista Web indexing software [23] to detect and filter near-duplicate documents. The Web [5] has undergone exponential growth since its birth, and this has led to the proliferation of documents that are identical or near identical. Experiments indicate that over 20% of the publicly availabl...

271 | Simple construction of almost k-wise independent random variables. Random Struct - Alon, Goldreich, et al. - 1992 |

Citation Context: ...ntinue recursively. Our construction of these hash functions is based on the explicit construction of almost k-wise independent distributions on N-bit binary strings. We use the following result from [2]: Proposition 1: We can construct a family of N-bit strings which are δ away (in the L1 norm) from k-wise independence, such that log |F| is at most k + 2 log(k log N / (2δ)) + 2. We use this proposition to ...

128 | Dynamic perfect hashing: upper and lower bounds - Dietzfelbinger, Karlin, et al. - 1988 |

Citation Context: ... Furthermore, there exist universal families of size O(|M|²) that can be easily implemented in practice. Thus, universal hash functions are very useful in the design of adaptive hash schemes (see e.g. [12, 16]) and are actually used in commercial high-performance products (see e.g. [24]). Moreover, the concept of pairwise independence has important theoretical applicatio... [Footnote 1: We use log for log2 throughout.]

114 | Identifying and Filtering Near-Duplicate Documents - Broder - 1998 |

Citation Context: ...ents in S̄_A and S̄_B are common. (For a set of documents, we avoid quadratic processing time, because a particular value for any coordinate is usually shared by only a few documents. For details see [7, 8, 11].) In practice, as in the case of hashing discussed above, we have to deal with the sad reality that it is impossible to choose π uniformly at random in S_n. We are thus led to consider smaller familie...

80 | The Art of Computer Programming, Vol. I: Fundamental Algorithms - Knuth - 1968 |

45 | Multilevel adaptive hashing - Broder, Karlin - 1990 |

34 | Pairwise Independence and Derandomization - Luby, Wigderson - 1995 |

Citation Context: ...ce products (see e.g. [24]). Moreover, the concept of pairwise independence has important theoretical applications. [Footnote 1: We use log for log2 throughout.] (See the excellent survey by Luby and Wigderson [22].) It is often convenient to consider permutations rather than functions. Let S_n be the set of all permutations of [n]. We say that a family of permutations F ⊆ S_n is pair-wise independent if for any {x...

31 | Fredman-Komlós bounds and information theory - Körner - 1986 |

Citation Context: ...a. Similarly, we associate all symmetric a-triples satisfied by a permutation σ with the edges of another, smaller graph G_{σ,a}. We then show, using the concept of graph entropy introduced by Körner [20], that many smaller graphs G_{σ,a} are required to cover the edges of the larger graph G_a. This argument will lead to our lower bound.) We now formally define the graphs G_a and G_{σ,a}. Let V(G_a) = V(G_{σ,a})...

18 | A derandomization using min-wise independent permutations - Broder, Charikar, et al. |

Citation Context: ...y version of this work has appeared in [9]. Since then new constructions have been proposed by Indyk [18] and others [25]. The use of min-wise independent families for derandomization is discussed in [10]. 2 Exact Min-Wise Independence. In this section, we provide bounds for the size of families that are exactly min-wise independent. We begin by determining a lower bound, demonstrating that the size of...

14 | Combinatorics: Set Systems - Bollobás - 1986 |

11 | Estimating the size of the transitive closure in linear time - Cohen - 1994 |

Citation Context: ...hash functions described in [19]. Cohen uses the properties of the minimum element of a random permutation to estimate the size of the transitive closure, as well as to solve similar related problems [14]. Given these connections, as well as the history of the development of pairwise independence, we expect that the concept of min-wise independence will prove useful in many future applications. A prel...

11 | The AltaVista Search Revolution: How to Find Anything on the Internet - Seltzer, Ray, et al. - 1996 |

Citation Context: ...s explained below, this definition is motivated by the fact that such a family (under some relaxations) is essential to the algorithm currently used in practice by the AltaVista Web indexing software [23] to detect and filter near-duplicate documents. The Web [5] has undergone exponential growth since its birth, and this has led to the proliferation of documents that are identical or near identical. ...

5 | Is linear hashing good? - Alon, Dietzfelbinger, et al. - 1997 |

Citation Context: ...ular we are interested in linear transformations, since they are used in the AltaVista implementation and are known to perform better in some situations than other pair-wise independent families (see [1]). The way we evaluate this performance is to consider a set X and study the distribution of the minimum of the image of X. It suffices to examine the two elements that are respectively most likely an...

5 | GIGAswitch: A high-performance packet switching platform - Souza, Krishnakumar, et al. - 1994 |