## A Sketch-Based Distance Oracle for Web-Scale Graphs

Citations: 16 (1 self)

### BibTeX

@MISC{Sarma_asketch-based,

author = {Atish Das Sarma and Sreenivas Gollapudi and Rina Panigrahy and Marc Najork},

title = {A Sketch-Based Distance Oracle for Web-Scale Graphs},

year = {}

}

### Abstract

We study the fundamental problem of computing distances between nodes in large graphs such as the web graph and social networks. Our objective is to be able to answer distance queries between pairs of nodes in real time. Since the standard shortest-path algorithms are expensive, our approach moves the time-consuming shortest-path computation offline, and at query time only looks up precomputed values and performs simple and fast computations on these precomputed values. More specifically, during the offline phase we compute and store a small “sketch” for each node in the graph, and at query time we look up the sketches of the source and destination nodes and perform a simple computation using these two sketches to estimate the distance.

Categories and Subject Descriptors: G.2.2 [Graph Theory]: Graph algorithms, path and circuit problems
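The offline/query-time split described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how sketches are built, so the code below assumes a simple landmark scheme on an undirected, unweighted graph: each node's sketch holds its hop distance to a few randomly chosen landmark nodes, and a query combines two sketches through the triangle inequality. The names `bfs_distances`, `build_sketches`, and `estimate_distance` are hypothetical and are not taken from the paper.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def build_sketches(adj, num_landmarks, seed=0):
    """Offline phase (assumed scheme): one BFS per random landmark.

    `adj` maps every node to its neighbours and is assumed symmetric
    (undirected graph). Each node's sketch maps landmark -> hop distance.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    landmarks = rng.sample(nodes, min(num_landmarks, len(nodes)))
    sketches = {u: {} for u in nodes}
    for lm in landmarks:
        for u, d in bfs_distances(adj, lm).items():
            sketches[u][lm] = d
    return sketches


def estimate_distance(sketches, u, v):
    """Query phase: looks only at the two sketches, never at the graph.

    Returns an upper bound on the true distance, or None when the two
    sketches share no landmark.
    """
    if u == v:
        return 0
    su, sv = sketches[u], sketches[v]
    common = su.keys() & sv.keys()
    if not common:
        return None
    # Triangle inequality: d(u, v) <= d(u, w) + d(w, v) for any landmark w.
    return min(su[w] + sv[w] for w in common)
```

Under these assumptions the query cost depends only on the sketch size, not on the graph size, and the estimate never falls below the true distance. How close it gets depends on how the landmarks are chosen, which this sketch leaves to uniform random sampling.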

### Citations

8521 | Introduction to Algorithms
- Cormen, Stein, et al.
- 2001
Citation Context ...in degree of relatedness between the two pages [13, 20]. However, given the large size of these graphs, shortest-path computation is challenging. Running Dijkstra’s well known shortest-path algorithm [8] on a web graph containing tens of billions of nodes and trillions of edges would take several hours, if not days. Moreover, it is not feasible to store a web-scale graph in the main memory of a singl...

414 | Syntactic clustering of the Web
- Broder, Glassman, et al.
- 1997
Citation Context ...tions has been studied extensively and is also used in many applications. For example, several search engines compute a small sketch of documents to detect near-duplicate documents at query time; see [5, 10, 4] and references therein. This eliminates the need to compare large documents, which is time consuming. Instead it is achieved by comparing short sketches of these documents and measuring the similarit...

278 | On Lipschitz embedding of finite metric spaces in Hilbert space
- Bourgain
- 1985
Citation Context ...f each point can be projected onto a small number of dimensions such that distances are approximately preserved, one can store the small dimensional vector as a sketch. The classic result of Bourgain [3] shows how such an embedding can be achieved for certain distance metrics. Another line of work in estimating distances is the study of spanner construction. A spanner is a sparse subgraph of the give...

251 | On approximating arbitrary metrics by tree metrics
- Bartal
- 1998
Citation Context ...ing distance may take a long time. Some theoretically efficient algorithms for spanners are presented by Feigenbaum et al. [12], and Baswana [2]. Other fundamental results in this area include Bartal [1] and Fakcharoenphol et al. [11]. Cohen et al. [6] proposed an approximate distance scheme using 2-hop covers of all paths in a directed (or undirected) graph. However, finding the near optimal 2-hop ...

205 | Approximate distance oracles
- Thorup, Zwick
- 2005
Citation Context ...ary pairs of nodes, in real time. While there is not much empirical work on computing sketches for distances, there are the aforementioned theoretical studies using embeddings [3, 1, 11] and spanners [12, 19, 2] in the algorithms literature. All of these algorithms provably work only on undirected graphs, and some are complicated and probably impractical. The classical result by Bourgain [3] shows how one ca...

114 | Identifying and Filtering Near-Duplicate Documents
- Broder
Citation Context ...tions has been studied extensively and is also used in many applications. For example, several search engines compute a small sketch of documents to detect near-duplicate documents at query time; see [5, 10, 4] and references therein. This eliminates the need to compare large documents, which is time consuming. Instead it is achieved by comparing short sketches of these documents and measuring the similarit...

113 | Random walk computation of similarities between nodes of a graph with application to collaborative filtering
- Fouss, Pirotte, et al.
Citation Context ...ding the shortest sequence of friends that connects one to a celebrity. In the web graph, a short sequence of links between two URLs may indicate a certain degree of relatedness between the two pages [13, 20]. However, given the large size of these graphs, shortest-path computation is challenging. Running Dijkstra’s well known shortest-path algorithm [8] on a web graph containing tens of billions of nodes...

99 | Distance labeling in graphs
- Gavoille, Peleg, et al.
Citation Context ...urpose of distance computation is also called distance labeling. Some papers that study the problem of the size of labels required with each node to allow distance computation include Gavoille et al. [14], Katz et al. [16], and Cohen et al. [7]. The field of metric embedding deals with mapping a set of points from a high-dimensional space to a low-dimensional space, such that the distortion is minimiz...

76 | Reachability and distance queries via 2-hop labels
- Cohen, Halperin, et al.
- 2002
Citation Context ...ally efficient algorithms for spanners are presented by Feigenbaum et al. [12], and Baswana [2]. Other fundamental results in this area include Bartal [1] and Fakcharoenphol et al. [11]. Cohen et al. [6] proposed an approximate distance scheme using 2-hop covers of all paths in a directed (or undirected) graph. However, finding the near optimal 2-hop cover of a given set of paths is expensive in a l...

52 | Graph distances in the streaming model: the value of space
- Feigenbaum, Kannan, et al.
- 2005
Citation Context ...exactly provide a sketch for each node; thus the online algorithm for estimating distance may take a long time. Some theoretically efficient algorithms for spanners are presented by Feigenbaum et al. [12], and Baswana [2]. Other fundamental results in this area include Bartal [1] and Fakcharoenphol et al. [11]. Cohen et al. [6] proposed an approximate distance scheme using 2-hop covers of all paths in...

50 | A tight bound on approximating arbitrary metrics by tree metrics
- Fakcharoenphol, Rao, Talwar
Citation Context ...ime. Some theoretically efficient algorithms for spanners are presented by Feigenbaum et al. [12], and Baswana [2]. Other fundamental results in this area include Bartal [1] and Fakcharoenphol et al. [11]. Cohen et al. [6] proposed an approximate distance scheme using 2-hop covers of all paths in a directed (or undirected) graph. However, finding the near optimal 2-hop cover of a given set of paths i...

35 | Labeling schemes for flow and connectivity
- Katz, Katz, et al.
- 2002
Citation Context ... computation is also called distance labeling. Some papers that study the problem of the size of labels required with each node to allow distance computation include Gavoille et al. [14], Katz et al. [16], and Cohen et al. [7]. The field of metric embedding deals with mapping a set of points from a high-dimensional space to a low-dimensional space, such that the distortion is minimized. If each point ...

14 | On the distortion required for embedding finite metric spaces into normed spaces
- Matousek
- 1996
Citation Context ...y impractical. The classical result by Bourgain [3] shows how one can project a graph onto a low dimensional space, giving a small sketch that approximates distances to a factor of O(log n). Matousek [17] later showed that the same algorithm can be used to get a 2c − 1 factor approximation using sketches of size Õ(n^{1/c}). However, we find in our experiments that Bourgain’s algorithm performs very poor...

14 | A family of dissimilarity measures between nodes generalizing both the shortest-path and the commute-time distances
- Yen, Mantrach, et al.
- 2008
Citation Context ...ding the shortest sequence of friends that connects one to a celebrity. In the web graph, a short sequence of links between two URLs may indicate a certain degree of relatedness between the two pages [13, 20]. However, given the large size of these graphs, shortest-path computation is challenging. Running Dijkstra’s well known shortest-path algorithm [8] on a web graph containing tens of billions of nodes...

12 | Labeling schemes for tree representation
- Cohen, Fraigniaud, et al.
- 2005
Citation Context ...lled distance labeling. Some papers that study the problem of the size of labels required with each node to allow distance computation include Gavoille et al. [14], Katz et al. [16], and Cohen et al. [7]. The field of metric embedding deals with mapping a set of points from a high-dimensional space to a low-dimensional space, such that the distortion is minimized. If each point can be projected onto ...

12 | Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces
- Dong, Charikar, et al.
- 2008
Citation Context ...tions has been studied extensively and is also used in many applications. For example, several search engines compute a small sketch of documents to detect near-duplicate documents at query time; see [5, 10, 4] and references therein. This eliminates the need to compare large documents, which is time consuming. Instead it is achieved by comparing short sketches of these documents and measuring the similarit...

10 | The scalable hyperlink store
- Najork
- 2009
Citation Context ...[Figure 4: Estimates of directed distances using as a function of k] In order to conduct our experiments, we used the Scalable Hyperlink Store [18], a distributed system that partitions the web graph across many SHS servers, with each server maintaining a portion of the graph in main memory. Our implementation consists of an offline phase and an...

7 | Streaming algorithm for graph spanners - single pass and constant processing time per edge
- Baswana
Citation Context ...sketch for each node; thus the online algorithm for estimating distance may take a long time. Some theoretically efficient algorithms for spanners are presented by Feigenbaum et al. [12], and Baswana [2]. Other fundamental results in this area include Bartal [1] and Fakcharoenphol et al. [11]. Cohen et al. [6] proposed an approximate distance scheme using 2-hop covers of all paths in a directed (or u...

4 | Point-to-point shortest path algorithms with preprocessing
- Goldberg
- 2007
Citation Context ...in a large graph. Moreover, the size of such a cover can be as large as Ω(n√m), making their scheme quite hard to implement on large graphs. Several studies, for example the ones by Goldberg et al. [15, 9], have focused on answering exact shortest path queries on road networks. These algorithms make use of a small set of precomputed landmarks and shortcuts and use them at query time to connect a source...

1 | Implementation challenge for shortest paths
- Demetrescu, Goldberg, et al.
- 2008
Citation Context ...in a large graph. Moreover, the size of such a cover can be as large as Ω(n√m), making their scheme quite hard to implement on large graphs. Several studies, for example the ones by Goldberg et al. [15, 9], have focused on answering exact shortest path queries on road networks. These algorithms make use of a small set of precomputed landmarks and shortcuts and use them at query time to connect a source...