Similarity estimation techniques from rounding algorithms
In Proc. of 34th STOC, 2002
Cited by 229 (6 self)
A locality sensitive hashing scheme is a distribution on a family F of hash functions operating on a collection of objects, such that for two objects x, y, Pr_{h∈F}[h(x) = h(y)] = sim(x, y), where sim(x, y) ∈ [0, 1] is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Minwise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure sim(A, B) = |A ∩ B| / |A ∪ B|. We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:
1. A collection of vectors with the distance between u and v measured by θ(u, v)/π, where θ(u, v) is the angle between u and v. This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity.
2. A collection of distributions on n points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), a popular distance measure in graphics and vision. Our hash functions map distributions to points in the metric space such that, for distributions P and Q, ...
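The vector scheme in item 1 can be illustrated with a minimal sketch, assuming the usual random-hyperplane construction: each hash bit records which side of a random Gaussian hyperplane a vector falls on, and two vectors disagree on a bit with probability θ(u, v)/π. The function names, the sketch length of 4096 bits, and the fixed seed are illustrative assumptions, not details taken from the paper.

```python
import math
import random

def make_planes(dim, num_bits, seed=0):
    """Draw num_bits random hyperplane normals with Gaussian coordinates."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hyperplane_sketch(vec, planes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [
        1 if sum(r_i * x_i for r_i, x_i in zip(normal, vec)) >= 0 else 0
        for normal in planes
    ]

def estimate_angle(sketch_a, sketch_b):
    """The fraction of differing bits estimates theta/pi; rescale to radians."""
    differing = sum(a != b for a, b in zip(sketch_a, sketch_b))
    return math.pi * differing / len(sketch_a)

# Illustrative data: the true angle between u and v is pi/4.
planes = make_planes(dim=3, num_bits=4096, seed=0)
u = [1.0, 0.0, 0.0]
v = [1.0, 1.0, 0.0]
angle_estimate = estimate_angle(hyperplane_sketch(u, planes),
                                hyperplane_sketch(v, planes))
cosine_estimate = math.cos(angle_estimate)
```

The estimate is a sample mean of independent bits, so its error shrinks roughly with the square root of the sketch length; longer sketches trade space for accuracy.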
Minwise Independent Permutations
Journal of Computer and System Sciences, 1998
Cited by 191 (11 self)
We define and study the notion of minwise independent families of permutations. We say that F ⊆ S_n is minwise independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we have Pr(min{π(X)} = π(x)) = 1/|X|. In other words, we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of our investigation we have discovered interesting and challenging theoretical questions related to this concept; we present the solutions to some of them and we list the rest as open problems.
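As a minimal sketch of why this property matters for near-duplicate detection: when every element of A ∪ B is equally likely to be the minimum under π, the minima of π(A) and π(B) coincide with probability |A ∩ B| / |A ∪ B|. The code below uses fully random permutations of [n], which are trivially minwise independent; practical systems use smaller relaxed families. All names and parameters are illustrative assumptions.

```python
import random

def random_permutations(n, count, seed=0):
    """Draw `count` uniformly random permutations of range(n)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(count):
        perm = list(range(n))
        rng.shuffle(perm)
        perms.append(perm)
    return perms

def minhash_signature(items, permutations):
    """For each permutation, keep the smallest image of any element of the set."""
    return [min(perm[x] for x in items) for perm in permutations]

def estimate_resemblance(sig_a, sig_b):
    """Fraction of permutations under which both sets share the same minimum."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Illustrative sets: |A ∩ B| = 30 and |A ∪ B| = 90, so the resemblance is 1/3.
perms = random_permutations(n=100, count=2000, seed=1)
A = set(range(0, 60))
B = set(range(30, 90))
resemblance = estimate_resemblance(minhash_signature(A, perms),
                                   minhash_signature(B, perms))
```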
Nearest Neighbors in High-Dimensional Spaces
2004
Cited by 75 (2 self)
In this chapter we consider the following problem: given a set P of points in a high-dimensional space, construct a data structure which, given any query point q, finds the point in P closest to q. This problem, called nearest neighbor search, is of significant importance to several areas of computer science, including pattern recognition, searching in multimedial data, vector compression [GG91], computational statistics [DW82], and data mining. Many of these applications involve data sets which are very large (e.g., a database containing Web documents could contain over one billion documents). Moreover, the dimensionality of the points is usually large as well (e.g., on the order of a few hundred). Therefore, it is crucial to design algorithms which scale well with the database size as well as with the dimension. The nearest-neighbor problem is an example of a large class of proximity problems, which, roughly speaking, are problems whose definitions involve the notion of...
Evaluating Strategies for Similarity Search on the Web
2002
Cited by 69 (3 self)
Finding pages on the Web that are similar to a query page (Related Pages) is an important component of modern search engines. A variety of strategies have been proposed for answering Related Pages queries, but comparative evaluation by user studies is expensive, especially when large strategy spaces must be searched (e.g., when tuning parameters). We present a technique for automatically evaluating strategies using Web hierarchies, such as Open Directory, in place of user feedback. We apply this evaluation methodology to a mix of document representation strategies, including the use of text, anchor text, and links. We discuss the relative advantages and disadvantages of the various approaches examined. Finally, we describe how to efficiently construct a similarity index out of our chosen strategies, and provide sample results from our index.
Difference Engine: Harnessing Memory Redundancy in Virtual Machines
Cited by 65 (2 self)
Virtual machine monitors (VMMs) are a popular platform for Internet hosting centers and cloud-based compute services. By multiplexing hardware resources among virtual machines (VMs) running commodity operating systems, VMMs decrease both the capital outlay and management overhead of hosting centers. Appropriate placement and migration policies can take advantage of statistical multiplexing to effectively utilize available processors. However, main memory is not amenable to such multiplexing and is often the primary bottleneck in achieving higher degrees of consolidation. Previous efforts have shown that content-based page sharing provides modest decreases in the memory footprint of VMs running similar operating systems and applications. Our studies show that significant additional gains can be had by leveraging both sub-page level sharing (through page patching) and in-core memory compression. We build Difference Engine, an extension to the Xen virtual machine monitor, to support each of these (in addition to standard copy-on-write full page sharing) and demonstrate substantial savings not only between VMs running similar applications and operating systems (up to 90%), but even across VMs running disparate workloads (up to 65%). In head-to-head memory-savings comparisons, Difference Engine outperforms VMware ESX server by a factor of 1.5 for homogeneous workloads and by a factor of 1.6–2.5 for heterogeneous workloads. In all cases, the performance overhead of Difference Engine is less than 7%.
Redundancy Elimination Within Large Collections of Files
2004
Cited by 64 (2 self)
Ongoing advancements in technology lead to ever-increasing storage capacities. In spite of this, optimizing storage usage can still provide rich dividends. Several techniques based on delta-encoding and duplicate block suppression have been shown to reduce storage overheads, with varying requirements for resources such as computation and memory. We propose a new scheme for storage reduction that reduces data sizes with an effectiveness comparable to the more expensive techniques, but at a cost comparable to the faster but less effective ones. The scheme, called Redundancy Elimination at the Block Level (REBL), leverages the benefits of compression, duplicate block suppression, and delta-encoding to eliminate a broad spectrum of redundant data in a scalable and efficient manner. REBL generally encodes more compactly than compression (up to a factor of 14) and a combination of compression and duplicate suppression (up to a factor of 6.7). REBL also encodes similarly to a technique based on delta-encoding, reducing overall space significantly in one case. Furthermore, REBL uses super-fingerprints, a technique that reduces the data needed to identify similar blocks while dramatically reducing the computational requirements of matching the blocks: it turns comparisons into hash table lookups. As a result, using super-fingerprints to avoid enumerating matching data objects decreases computation in the resemblance detection phase of REBL by up to a couple of orders of magnitude.
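The idea of turning block comparisons into hash table lookups can be sketched as follows. This is a simplified illustration under stated assumptions, not REBL's actual implementation: the feature extractor (salted SHA-1 minima over 8-byte windows), the group size of 4, and every function name are hypothetical choices made for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def features(data, num_features=12, window=8):
    """Hypothetical features: for each salt, the smallest hash over all windows."""
    shingles = {data[i:i + window] for i in range(max(1, len(data) - window + 1))}
    return [
        min(hashlib.sha1(bytes([salt]) + s).digest()[:8] for s in shingles)
        for salt in range(num_features)
    ]

def super_fingerprints(feats, group_size=4):
    """Hash each group of consecutive features into one super-fingerprint."""
    return [
        hashlib.sha1(b"".join(feats[i:i + group_size])).hexdigest()
        for i in range(0, len(feats), group_size)
    ]

def candidate_pairs(blocks):
    """Blocks sharing a super-fingerprint in the same position become candidates."""
    index = defaultdict(set)
    for name, data in blocks.items():
        for position, sfp in enumerate(super_fingerprints(features(data))):
            index[(position, sfp)].add(name)  # one hash table insert per group
    pairs = set()
    for names in index.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

# Illustrative blocks: "a" and "b" are identical, "c" shares no content with them.
blocks = {
    "a": b"the quick brown fox jumps over the lazy dog " * 4,
    "b": b"the quick brown fox jumps over the lazy dog " * 4,
    "c": b"0123456789" * 20,
}
pairs = candidate_pairs(blocks)
```

Because a super-fingerprint matches only when every feature in its group matches, a single lookup stands in for several feature comparisons, and no pair of blocks is ever compared directly unless it collides in the table.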
A comparison of techniques to find mirrored hosts on the WWW
Journal of the American Society for Information Science, 2000
Cited by 56 (5 self)
We compare several algorithms for identifying mirrored hosts on the World Wide Web. The algorithms operate on the basis of URL strings and linkage data: the type of information easily available from web proxies and crawlers. Identification of mirrored hosts can improve web-based information retrieval in several ways. First, by identifying mirrored hosts, search engines can avoid storing and returning duplicate documents. Second, several new information retrieval techniques for the Web make inferences based on the explicit links among hypertext documents; mirroring perturbs their graph model and degrades performance. Third, mirroring information can be used to redirect users to alternate mirror sites to compensate for various failures, and can thus improve the performance of web browsers and proxies. We evaluated 4 classes of “top-down” algorithms for detecting mirrored host pairs (that is, algorithms that are based on page attributes such as URL, IP address, and connectivity, and not on the page content) on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information. Our best approach is one which combines 5 algorithms and achieved a precision of 0.57 for a recall of 0.86 considering 100,000 ranked host pairs.
Application-specific Delta-encoding via Resemblance Detection
2003
Cited by 50 (3 self)
Many objects, such as files, electronic messages, and web pages, contain overlapping content. Numerous past research projects have observed that one can compress one object relative to another one by computing the differences between the two, but these delta-encoding systems have almost invariably required knowledge of a specific relationship between them: most commonly, two versions using the same name at different points in time. We consider cases in which this relationship is determined dynamically, by efficiently determining when a sufficient resemblance exists between two objects in a relatively large collection. We look at specific examples of this technique, namely web pages, email, and files in a file system, and evaluate the potential data reduction and the factors that influence this reduction. We find that delta-encoding using this resemblance detection technique can improve on simple compression by up to a factor of two, depending on workload, and that a small fraction of objects can potentially account for a large portion of these savings.
Scalable Techniques for Clustering the Web
In Proc. of the WebDB Workshop, 2000
Cited by 41 (5 self)
Clustering is one of the most crucial techniques for dealing with the massive amount of information present on the web. Clustering can either be performed once offline, independent of search queries, or performed online on the results of search queries. Our offline approach aims to efficiently cluster similar pages on the web, using the technique of Locality-Sensitive Hashing (LSH), in which web pages are hashed in such a way that similar pages have a much higher probability of collision than dissimilar pages. Our preliminary experiments on the Stanford WebBase have shown that the hash-based scheme can be scaled to millions of URLs.
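One common way to make similar pages collide much more often than dissimilar ones is to concatenate several hash values into a key and repeat this over several independent tables. The arithmetic below is a sketch of that amplification, assuming each individual hash collides with probability equal to the similarity s; the parameters `rows` and `bands` are illustrative and are not taken from the paper.

```python
def collision_probability(similarity, rows, bands):
    """Chance that two items share a key in at least one of `bands` tables,
    where each key concatenates `rows` hashes that each match with
    probability `similarity`."""
    all_rows_match = similarity ** rows
    return 1.0 - (1.0 - all_rows_match) ** bands

# Illustrative settings: 20 tables with 5 concatenated hashes each.
similar_pages = collision_probability(0.8, rows=5, bands=20)
dissimilar_pages = collision_probability(0.3, rows=5, bands=20)
```

Raising `rows` suppresses collisions between dissimilar pages, while raising `bands` recovers collisions between similar ones, so the two parameters together shape how sharply the scheme separates the two groups.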
Efficient semi-streaming algorithms for local triangle counting in massive graphs
In KDD’08, 2008
Cited by 41 (4 self)
In this paper we study the problem of local triangle counting in large graphs. Namely, given a large graph G = (V, E) we want to estimate as accurately as possible the number of triangles incident to every node v ∈ V in the graph. The problem of computing the global number of triangles in a graph has been considered before, but to our knowledge this is the first paper that addresses the problem of local triangle counting with a focus on the efficiency issues arising in massive graphs. The distribution of the local number of triangles and the related local clustering coefficient can be used in many interesting applications. For example, we show that the measures we compute can help to detect the presence of spamming activity in large-scale Web graphs, as well as to provide useful features to assess content quality in social networks. For computing the local number of triangles we propose two approximation algorithms, which are based on the idea of minwise independent permutations (Broder et al. 1998). Our algorithms operate in a semi-streaming fashion, using O(|V|) space in main memory and performing O(log |V|) sequential scans over the edges of the graph. The first algorithm we describe in this paper also uses O(|E|) space in external memory during computation, while the second algorithm uses only main memory. We present the theoretical analysis as well as experimental results in massive graphs demonstrating the practical efficiency of our approach. Luca Becchetti was partially supported by EU Integrated
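The link between minwise permutations and triangle counts can be sketched in memory as follows. For an edge (u, v), the triangles through it number |N(u) ∩ N(v)|, and if J is the Jaccard coefficient of the two neighbourhoods then |N(u) ∩ N(v)| = J (|N(u)| + |N(v)|) / (1 + J). Summing over the neighbours of u and halving gives the local count. This is a plain in-memory illustration, not the paper's semi-streaming algorithms; the function names and the number of hashes are assumptions.

```python
import random

def minhash_signatures(adj, num_hashes, seed=0):
    """For each random node ranking, record the lowest-ranked neighbour of every node."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    sigs = {u: [] for u in nodes}
    for _ in range(num_hashes):
        rank = {u: rng.random() for u in nodes}
        for u in nodes:
            sigs[u].append(min(adj[u], key=rank.__getitem__, default=None))
    return sigs

def estimate_local_triangles(adj, num_hashes=2000, seed=0):
    """Estimate triangles incident to each node from neighbourhood signatures."""
    sigs = minhash_signatures(adj, num_hashes, seed)
    triangles = {}
    for u in adj:
        total = 0.0
        for v in adj[u]:
            matches = sum(a == b for a, b in zip(sigs[u], sigs[v]))
            jaccard = matches / num_hashes
            # Convert the Jaccard estimate into an intersection-size estimate.
            total += jaccard * (len(adj[u]) + len(adj[v])) / (1.0 + jaccard)
        triangles[u] = total / 2.0  # each triangle at u is seen via two edges
    return triangles

# Illustrative graphs given as adjacency sets: one triangle, one 4-cycle.
triangle_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
four_cycle = {0: {2, 3}, 1: {2, 3}, 2: {0, 1}, 3: {0, 1}}
triangle_estimates = estimate_local_triangles(triangle_graph)
cycle_estimates = estimate_local_triangles(four_cycle)
```

In the 4-cycle, adjacent nodes have disjoint neighbourhoods, so no signature entries can match and the estimate is exactly zero; in the triangle graph the estimate for each node fluctuates around one.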