Results 1–10 of 27
Min-Wise Independent Permutations
 Journal of Computer and System Sciences
, 1998
Abstract

Cited by 191 (11 self)
We define and study the notion of min-wise independent families of permutations. We say that F ⊆ S_n is min-wise independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we have Pr(min{π(X)} = π(x)) = 1/|X|. In other words, we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of our investigation we have discovered interesting and challenging theoretical questions related to this concept – we present the solutions to some of them and we list the rest as open problems.
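The defining property lends itself to a quick empirical check. The sketch below (function and variable names are illustrative, not from the paper) samples uniformly from the full symmetric group S_n rather than from a small family F:

```python
import random

def min_element_probabilities(n, X, trials=20000, seed=0):
    """Estimate Pr(min pi(X) = pi(x)) for each x in X, with pi drawn
    uniformly at random from the full symmetric group S_n."""
    rng = random.Random(seed)
    counts = {x: 0 for x in X}
    universe = list(range(n))
    for _ in range(trials):
        pi = universe[:]
        rng.shuffle(pi)  # pi[i] is the image of i under the permutation
        winner = min(X, key=lambda x: pi[x])
        counts[winner] += 1
    return {x: c / trials for x, c in counts.items()}

# Each element of X = {1, 4, 7, 9} should win with probability about 1/4.
probs = min_element_probabilities(n=10, X={1, 4, 7, 9})
```

The paper's point is that small, explicitly constructible families F can approximately achieve this same uniformity, which is what makes the near-duplicate filtering application practical.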
Coding for Computing
 IEEE Transactions on Information Theory
, 1998
Abstract

Cited by 70 (0 self)
A sender communicates with a receiver who wishes to reliably evaluate a function of their combined data. We show that if only the sender can transmit, the number of bits required is a conditional entropy of a naturally defined graph. We also determine the number of bits needed when the communicators exchange two messages. 1 Introduction Let f be a function of two random variables X and Y. A sender P_X knows X, a receiver P_Y knows Y, and both want P_Y to reliably determine f(X, Y). How many bits must P_X transmit? Embedding this communication-complexity scenario (Yao [22]) in the standard information-theoretic setting (Shannon [17]), we assume that (1) f(X, Y) must be determined for a block of many independent (X, Y) instances, (2) P_X transmits after observing the whole block of X instances, (3) a vanishing block error probability is allowed, and (4) the problem's rate L_f(X|Y) is the number of bits transmitted for the block, normalized by the number of instances. Two simple bou...
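The "naturally defined graph" in this setting is commonly called the characteristic (confusability) graph of f. A small illustrative construction, assuming full support and a parity-valued f (all names here are illustrative):

```python
from itertools import combinations

def characteristic_graph(xs, ys, support, f):
    """Connect x1, x2 when some y is jointly possible with both and
    f(x1, y) != f(x2, y): the receiver holding y could not then tell
    which function value is intended without more information."""
    edges = set()
    for x1, x2 in combinations(xs, 2):
        for y in ys:
            if (x1, y) in support and (x2, y) in support and f(x1, y) != f(x2, y):
                edges.add((x1, x2))
                break
    return edges

# x, y range over {0, 1, 2}; the receiver wants f(x, y) = (x + y) % 2;
# every (x, y) pair is possible.
E = characteristic_graph([0, 1, 2], [0, 1, 2],
                         {(x, y) for x in range(3) for y in range(3)},
                         lambda x, y: (x + y) % 2)
```

Only x values of differing parity are confusable here, so the graph is the path 0 – 1 – 2.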
Derandomization, witnesses for Boolean matrix multiplication and construction of perfect hash functions
 Algorithmica
, 1996
Abstract

Cited by 62 (6 self)
Small sample spaces with almost independent random variables are applied to design efficient sequential deterministic algorithms for two problems. The first algorithm, motivated by the attempt to design efficient algorithms for the All Pairs Shortest Path problem using fast matrix multiplication, solves the problem of computing witnesses for the Boolean product of two matrices. That is, if A and B are two n by n matrices, and C = AB is their Boolean product, the algorithm finds for every entry C_ij = 1 a witness: an index k so that A_ik = B_kj = 1. Its running time exceeds that of computing the product of two n by n matrices with small integer entries by a polylogarithmic factor. The second algorithm is a nearly linear time deterministic procedure for constructing a perfect hash function for a given n-subset of {1, ..., m}.
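The witness problem itself is easy to state in code. A cubic-time reference version (all names illustrative; the paper's contribution is doing this deterministically in roughly matrix-multiplication time):

```python
def boolean_product_witnesses(A, B):
    """For C = AB over the Boolean semiring, return W with W[i][j] = some
    index k such that A[i][k] = B[k][j] = 1, or -1 when C[i][j] = 0."""
    n = len(A)
    W = [[-1] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if A[i][k] and B[k][j]:
                    W[i][j] = k  # any such k certifies C[i][j] = 1
                    break
    return W

A = [[1, 0], [0, 1]]
B = [[0, 1], [1, 0]]
W = boolean_product_witnesses(A, B)
```

Entries of W equal to -1 correspond exactly to zero entries of the Boolean product.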
Interactive Communication of Balanced Distributions and of Correlated Files
, 1993
Abstract

Cited by 40 (1 self)
(X, Y) is a pair of random variables distributed over a support set S. Person P_X knows X, Person P_Y knows Y, and both know S. Using a predetermined protocol, they exchange binary messages in order for P_Y to learn X. P_X may or may not learn Y. The m-message complexity, C_m, is the number of information bits that must be transmitted (by both persons) in the worst case if only m messages are allowed. C_∞ is the number of bits required when there is no restriction on the number of messages exchanged. We consider a natural class of random pairs. One parameter is the maximum number of X values possible with a given Y value; the other is the maximum number of Y values possible with a given X value. The random pair (X, Y) is balanced if the two parameters are equal. The following hold for all balanced random pairs. One-way communication requires at most twice the minimum number of bits: C_1 ≤ 2C_∞ + 1. This bound is almost tight: for every α, there is a balanced random pair for which C_1 ≥ 2C_∞ − 6α. Three...
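The balancedness condition can be made concrete. In the sketch below the names mu and eta are illustrative stand-ins for the abstract's two ambiguity parameters:

```python
from collections import defaultdict

def ambiguities(support):
    """For a support set of (x, y) pairs, return (mu, eta):
    mu  = max number of x values possible with one y value,
    eta = max number of y values possible with one x value.
    The pair is 'balanced' when mu == eta."""
    by_y, by_x = defaultdict(set), defaultdict(set)
    for x, y in support:
        by_y[y].add(x)
        by_x[x].add(y)
    mu = max(len(s) for s in by_y.values())
    eta = max(len(s) for s in by_x.values())
    return mu, eta

# A full 2x2 product support: every x is possible with every y,
# so both ambiguities equal 2 and the pair is balanced.
S = {(0, 0), (0, 1), (1, 0), (1, 1)}
mu, eta = ambiguities(S)
```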
Splitters and Near-Optimal Derandomization
, 1995
Abstract

Cited by 39 (2 self)
We present a fairly general method for finding deterministic constructions obeying what we call k-restrictions; this yields structures of size not much larger than the probabilistic bound. The structures constructed by our method include (n, k)-universal sets (a collection of binary vectors of length n such that for any subset of size k of the indices, all 2^k configurations appear) and families of perfect hash functions. The near-optimal constructions of these objects imply the very efficient derandomization of algorithms in learning, of fixed-subgraph finding algorithms, and of near-optimal ΣΠΣ threshold formulae. In addition, they derandomize the reduction showing the hardness of approximation of set cover. They also yield deterministic constructions for a local-coloring protocol, and for exhaustive testing of circuits.
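An (n, k)-universal set is cheap to verify by brute force, which makes the definition concrete; constructing small such sets is the hard part the paper addresses (names below are illustrative):

```python
from itertools import combinations, product

def is_universal(vectors, n, k):
    """Brute-force check that a collection of length-n binary vectors is
    (n, k)-universal: on every set of k indices, all 2^k patterns occur."""
    for idx in combinations(range(n), k):
        seen = {tuple(v[i] for i in idx) for v in vectors}
        if len(seen) < 2 ** k:
            return False
    return True

# All 2^n vectors form a trivially (n, k)-universal set; the point of the
# paper is constructions of size far below 2^n.
allv = [list(p) for p in product([0, 1], repeat=4)]
ok = is_universal(allv, n=4, k=2)
```

Two complementary vectors alone, by contrast, miss the mixed patterns on every index pair, so they fail the check.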
Source Coding and Graph Entropies
 IEEE Transactions on Information Theory
, 1995
Abstract

Cited by 38 (0 self)
A sender wants to accurately convey information to a receiver who has some, possibly related, data. We study the expected number of bits the sender must transmit for one and for multiple instances in two communication scenarios and relate this number to the chromatic and Körner entropies of a naturally defined graph. 1 Introduction We study the expected number of bits a sender must transmit to convey information to a receiver who has some, possibly related, data. We consider single and multiple instances of two related scenarios. This section describes the two scenarios and the results obtained. We begin with the familiar, standard source-coding scenario, dubbed restricted inputs because the inputs are restricted to belong to a given support set. 1.1 Restricted inputs (X, Y) is a pair of random variables distributed over a countable product set X × Y according to a probability distribution p(x, y). A sender P_X knows X while a receiver P_Y knows Y and wants to learn X without e...
Discovering Important Nodes through Graph Entropy: The Case of Enron Email Database
 KDD, Proceedings of the 3rd international workshop on Link discovery
, 2005
Abstract

Cited by 21 (0 self)
A major problem in social network analysis and link discovery is the discovery of hidden organizational structure and selection of interesting influential members based on lowlevel, incomplete and noisy evidence data. To address such a challenge, we exploit an information theoretic model that combines information theory with statistical techniques from area of text mining and natural language processing. The Entropy model identifies the most interesting and important nodes in a graph. We show how entropy models on graphs are relevant to study of information flow in an organization. We review the results of two different experiments which are based on entropy models. The first version of this model has been successfully tested and evaluated on the Enron email dataset.
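As a toy illustration of ranking nodes by an entropy change on removal: the degree-distribution entropy used below is only a simple stand-in (the paper's actual entropy model is more elaborate), and all names are illustrative:

```python
import math
from collections import defaultdict

def degree_entropy(edges):
    """Shannon entropy (bits) of the graph's normalized degree distribution;
    returns 0 for an edgeless graph."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    total = sum(deg.values())
    return -sum((d / total) * math.log2(d / total) for d in deg.values())

def rank_by_entropy_drop(nodes, edges):
    """Order nodes by how much the graph's entropy falls when the node
    (and its incident edges) is removed."""
    base = degree_entropy(edges)
    drops = {}
    for n in nodes:
        rest = [(u, v) for u, v in edges if n not in (u, v)]
        drops[n] = base - degree_entropy(rest)
    return sorted(drops, key=drops.get, reverse=True)

# A star graph: removing the hub destroys the most structure,
# so the hub should rank first.
star = [("hub", f"leaf{i}") for i in range(4)]
ranking = rank_by_entropy_drop(["hub"] + [f"leaf{i}" for i in range(4)], star)
```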
Average-Case Interactive Communication
 IEEE Transactions on Information Theory
, 1996
Abstract

Cited by 13 (3 self)
X and Y are random variables. Person P_X knows X, Person P_Y knows Y, and both know the joint probability distribution of the pair (X, Y). Using a predetermined protocol, they communicate over a binary, error-free channel in order for P_Y to learn X. P_X may or may not learn Y. How many information bits must be transmitted (by both persons) on the average? At least H(X|Y) bits must be transmitted and H(X) + 1 bits always suffice. If the support set of (X, Y) is a Cartesian product of two sets, then H(X) bits must be transmitted. If the random pair (X, Y) is uniformly distributed over its support set, then H(X|Y) + 3 log(H(X|Y) + 1) + 17 bits suffice. Furthermore, this number of bits is achieved when P_X and P_Y exchange four messages (sequences of binary bits). The last two results show that when the arithmetic average number of bits is considered: (1) there is no asymptotic advantage to P_X knowing Y in advance; (2) four messages are asymptotically optimum. By cont...
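The two entropy bounds are easy to evaluate on a small joint distribution; a minimal sketch (names illustrative):

```python
import math
from collections import defaultdict

def entropies(p):
    """Return (H(X), H(X|Y)) in bits for a joint distribution p[(x, y)],
    using the identity H(X|Y) = H(X, Y) - H(Y)."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), q in p.items():
        px[x] += q
        py[y] += q
    def h(dist):
        return -sum(q * math.log2(q) for q in dist if q > 0)
    return h(px.values()), h(p.values()) - h(py.values())

# Uniform over a 2x2 product support: Y reveals nothing about X,
# so H(X|Y) equals H(X) = 1 bit and the bounds above nearly meet.
p = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
hx, hx_given_y = entropies(p)
```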
Three Results on Interactive Communication
 IEEE Transactions on Information Theory
, 1993
Abstract

Cited by 11 (1 self)
X and Y are random variables. Person P_X knows X, Person P_Y knows Y, and both know the underlying probability distribution of the random pair (X, Y). Using a predetermined protocol, they exchange messages over a binary, error-free channel in order for P_Y to learn X. P_X may or may not learn Y. C_m is the number of information bits that must be transmitted (by both persons) in the worst case if only m messages are allowed. C_∞ is the corresponding number of bits when there is no restriction on the number of messages exchanged. We consider three aspects of this problem. C_4: It is known that one-message communication may require exponentially more bits than the minimum possible: for some random pairs, C_1 = 2^(C_∞ − 1). Yet just two messages suffice to reduce communication to almost the minimum: for all random pairs, C_2 ≤ 4C_∞ + 3. We show that, asymptotically, four messages require at most three times the minimum number of bits: for all random pairs, C_4 ≤ 3C_∞ + ...
Generalized Hashing and Parent-Identifying Codes
, 2003
Abstract

Cited by 11 (2 self)
Let C be a code of length n over an alphabet of q letters. For a pair of integers 2 ≤ t < u, C is (t, u)-hashing if for any two subsets T, U ⊆ C satisfying T ⊂ U, |T| = t, |U| = u, there is a coordinate 1 ≤ i ≤ n such that for any x ∈ T and y ∈ U \ {x}, x and y differ in the i-th coordinate. This definition, generalizing the standard notion of a t-hashing family, is motivated by an application in designing the so-called parent-identifying codes, used in digital fingerprinting. In this paper we provide lower and upper bounds on the best possible rate of (t, u)-hashing families for fixed t, u and growing n. We also describe an explicit construction of (t, u)-hashing families. The obtained lower bound on the rate of (t, u)-hashing families is applied to get a new lower bound on the rate of t-parent-identifying codes.
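For small parameters the (t, u)-hashing property can be checked by exhaustive search. The sketch below assumes the separating coordinate must distinguish each x in T from every other codeword of U (names illustrative):

```python
from itertools import combinations

def is_t_u_hashing(code, t, u):
    """Brute-force test: for every U of u codewords and every T within U of
    t codewords, some coordinate i separates each member of T from every
    other codeword of U."""
    n = len(code[0])
    for U in combinations(range(len(code)), u):
        for T in combinations(U, t):
            if not any(all(code[x][i] != code[y][i]
                           for x in T for y in U if y != x)
                       for i in range(n)):
                return False
    return True

# Over alphabet {0, 1, 2}: three codewords pairwise distinct in
# coordinate 0 pass; a binary code of length 2 cannot, since two of any
# three codewords must collide in each coordinate.
ok_pos = is_t_u_hashing(["012", "120", "201"], 2, 3)
ok_neg = is_t_u_hashing(["00", "01", "10"], 2, 3)
```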