Results 1 - 10
of
112
On the Resemblance and Containment of Documents
- In Compression and Complexity of Sequences (SEQUENCES’97
, 1997
"... Given two documents A and B we define two mathematical notions: their resemblance r(A, B)andtheircontainment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be eas ..."
Abstract
-
Cited by 254 (5 self)
- Add to MetaCart
Given two documents A and B we define two mathematical notions: their resemblance r(A, B)andtheircontainment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document.
Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation
, 2000
"... In this paper we show several results obtained by combining the use of stable distributions with pseudorandom generators for bounded space. In particular: ffl we show how to maintain (using only O(log n=ffl 2 ) words of storage) a sketch C(p) of a point p 2 l n 1 under dynamic updates of its coo ..."
Abstract
-
Cited by 205 (13 self)
- Add to MetaCart
In this paper we show several results obtained by combining the use of stable distributions with pseudorandom generators for bounded space. In particular: ffl we show how to maintain (using only O(log n=ffl 2 ) words of storage) a sketch C(p) of a point p 2 l n 1 under dynamic updates of its coordinates, such that given sketches C(p) and C(q) one can estimate jp \Gamma qj 1 up to a factor of (1 + ffl) with large probability. This solves the main open problem of [10]. ffl we obtain another sketch function C 0 which maps l n 1 into a normed space l m 1 (as opposed to C), such that m = m(n) is much smaller than n; to our knowledge this is the first dimensionality reduction lemma for l 1 norm ffl we give an explicit embedding of l n 2 into l n O(log n) 1 with distortion (1 + 1=n \Theta(1) ) and a nonconstructive embedding of l n 2 into l O(n) 1 with distortion (1 + ffl) such that the embedding can be represented using only O(n log 2 n) bits (as opposed to at least...
Informed content delivery across adaptive overlay networks
- In Proceedings of ACM SIGCOMM
, 2002
"... Abstract—Overlay networks have emerged as a powerful and highly flexible method for delivering content. We study how to optimize throughput of large transfers across richly connected, adaptive overlay networks, focusing on the potential of collaborative transfers between peers to supplement ongoing ..."
Abstract
-
Cited by 179 (9 self)
- Add to MetaCart
Abstract—Overlay networks have emerged as a powerful and highly flexible method for delivering content. We study how to optimize throughput of large transfers across richly connected, adaptive overlay networks, focusing on the potential of collaborative transfers between peers to supplement ongoing downloads. First, we make the case for an erasure-resilient encoding of the content. Using the digital fountain encoding approach, end hosts can efficiently reconstruct the original content of size from a subset of any symbols drawn from a large universe of encoding symbols. Such an approach affords reliability and a substantial degree of application-level flexibility, as it seamlessly accommodates connection migration and parallel transfers while providing resilience to packet loss. However, since the sets of encoding symbols acquired by peers during downloads may overlap substantially, care must be taken to enable them to collaborate effectively. Our main contribution is a collection of useful algorithmic tools for efficient summarization and approximate reconciliation of sets of symbols between pairs of collaborating peers, all of which keep message complexity and computation to a minimum. Through simulations and experiments on a prototype implementation, we demonstrate the performance benefits of our informed content-delivery mechanisms and how they complement existing overlay network architectures. Index Terms—Bloom filter, content delivery, digital fountain, erasure code, min-wise sketch, overlay, peer-to-peer, reconciliation. I.
Similarity estimation techniques from rounding algorithms
- In Proc. of 34th STOC
, 2002
"... A locality sensitive hashing scheme is a distribution on a family F of hash functions operating on a collection of objects, such that for two objects x, y, Prh∈F[h(x) = h(y)] = sim(x,y), where sim(x,y) ∈ [0, 1] is some similarity function defined on the collection of objects. Such a scheme leads ..."
Abstract
-
Cited by 135 (6 self)
- Add to MetaCart
A locality sensitive hashing scheme is a distribution on a family F of hash functions operating on a collection of objects, such that for two objects x, y, Prh∈F[h(x) = h(y)] = sim(x,y), where sim(x,y) ∈ [0, 1] is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure sim(A, B) = |A∩B| |A∪B |. We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for: 1. A collection of vectors with the distance between ⃗u and ⃗v measured by θ(⃗u,⃗v)/π, where θ(⃗u,⃗v) is the angle between ⃗u and ⃗v. This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity. 2. A collection of distributions on n points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), (a popular distance measure in graphics and vision). Our hash functions map distributions to points in the metric space such that, for distributions P and Q,
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
, 2008
"... In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The ..."
Abstract
-
Cited by 131 (1 self)
- Add to MetaCart
In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The problem is of significant interest in a wide variety of areas.
Synopsis Data Structures for Massive Data Sets
"... Abstract. Massive data sets with terabytes of data are becoming commonplace. There is an increasing demand for algorithms and data structures that provide fast response times to queries on such data sets. In this paper, we describe a context for algorithmic work relevant to massive data sets and a f ..."
Abstract
-
Cited by 96 (13 self)
- Add to MetaCart
Abstract. Massive data sets with terabytes of data are becoming commonplace. There is an increasing demand for algorithms and data structures that provide fast response times to queries on such data sets. In this paper, we describe a context for algorithmic work relevant to massive data sets and a framework for evaluating such work. We consider the use of "synopsis" data structures, which use very little space and provide fast (typically approximated) answers to queries. The design and analysis of effective synopsis data structures o er many algorithmic challenges. We discuss a number of concrete examples of synopsis data structures, and describe fast algorithms for keeping them up-to-date in the presence of online updates to the data sets.
Counting Twig Matches in a Tree
, 2001
"... We describe efficient algorithms for accurately estimating the number of matches of a small node-labeled tree, i.e., a twig, in a large node-labeled tree, using a summary data structure. This problem is of interest for queries on XML and other hierarchical data, to provide query feedback and for cos ..."
Abstract
-
Cited by 61 (2 self)
- Add to MetaCart
We describe efficient algorithms for accurately estimating the number of matches of a small node-labeled tree, i.e., a twig, in a large node-labeled tree, using a summary data structure. This problem is of interest for queries on XML and other hierarchical data, to provide query feedback and for costbased query optimization. Our summary data structure scalably representsapproximate frequencyinformation about twiglets (i.e., small twigs) in the data tree. Given a twig query, the number of matches is estimated by creating a set of query twiglets, and combining two complementary approaches: Set Hashing, used to estimate the number of matches of each query twiglet, and Maximal Overlap, used to combine the query twiglet estimates into an estimate for the twig query. We propose several estimation algorithms that apply these approaches on query twiglets formed using variations on different twiglet decomposition techniques. We present an extensive experimental evaluation using several real XML...
Evaluating Strategies for Similarity Search on the Web
, 2002
"... Finding pages on the Web that are similar to a query page (Related Pages) is an important component of modern search engines. A variety of strategies have been proposed for answering Related Pages queries, but comparative evaluation by user studies is expensive, especially when large strategy spaces ..."
Abstract
-
Cited by 60 (3 self)
- Add to MetaCart
Finding pages on the Web that are similar to a query page (Related Pages) is an important component of modern search engines. A variety of strategies have been proposed for answering Related Pages queries, but comparative evaluation by user studies is expensive, especially when large strategy spaces must be searched (e.g., when tuning parameters). We present a technique for automatically evaluating strategies using Web hierarchies, such as Open Directory, in place of user feedback. We apply this evaluation methodology to a mix of document representation strategies, including the use of text, anchor-text, and links. We discuss the relative advantages and disadvantages of the various approaches examined. Finally, we describe how to eciently construct a similarity index out of our chosen strategies, and provide sample results from our index.
Testing that distributions are close
- In IEEE Symposium on Foundations of Computer Science
, 2000
"... Given two distributions over an n element set, we wish to check whether these distributions are statistically close by only sampling. We give a sublinear algorithm which uses O(n 2/3 ɛ −4 log n) independent samples from each distribution, runs in time linear in the sample size, makes no assumptions ..."
Abstract
-
Cited by 59 (12 self)
- Add to MetaCart
Given two distributions over an n element set, we wish to check whether these distributions are statistically close by only sampling. We give a sublinear algorithm which uses O(n 2/3 ɛ −4 log n) independent samples from each distribution, runs in time linear in the sample size, makes no assumptions about the structure of the distributions, and distinguishes the cases ɛ when the distance between the distributions is small (less than max ( 2 32 3 √ n, ɛ 4 √)) or large (more n than ɛ) in L1-distance. We also give an Ω(n 2/3 ɛ −2/3) lower bound. Our algorithm has applications to the problem of checking whether a given Markov process is rapidly mixing. We develop sublinear algorithms for this problem as well.

