Results 1–10 of 16
Naiad: A Timely Dataflow System
Abstract

Cited by 48 (1 self)
Naiad is a distributed system for executing data-parallel, cyclic dataflow programs. It offers the high throughput of batch processors, the low latency of stream processors, and the ability to perform iterative and incremental computations. Although existing systems offer some of these features, applications that require all three have relied on multiple platforms, at the expense of efficiency, maintainability, and simplicity. Naiad resolves the complexities of combining these features in one framework. A new computational model, timely dataflow, underlies Naiad and captures opportunities for parallelism across a wide class of algorithms. This model enriches dataflow computation with timestamps that represent logical points in the computation and provide the basis for an efficient, lightweight coordination mechanism. We show that many powerful high-level programming models can be built on Naiad’s low-level primitives, enabling such diverse tasks as streaming data analysis, iterative machine learning, and interactive graph mining. Naiad outperforms specialized systems in their target application domains, and its unique features enable the development of new high-performance applications.
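The timestamp mechanism the abstract describes can be illustrated with a minimal sketch. All names here are hypothetical (the real Naiad API is richer), and for simplicity timestamps of equal depth are compared componentwise, a conservative stand-in for the paper's could-result-in relation:

```python
# Sketch of timely-dataflow-style timestamps: an epoch followed by one
# loop counter per enclosing loop. Ingress/egress add and remove a
# counter; the feedback edge increments the innermost counter.

def ingress(ts):
    """Entering a loop adds a new innermost counter, starting at 0."""
    return ts + (0,)

def egress(ts):
    """Leaving a loop strips the innermost counter."""
    return ts[:-1]

def feedback(ts):
    """Each trip around the loop increments the innermost counter."""
    return ts[:-1] + (ts[-1] + 1,)

def could_result_in(a, b):
    """Simplified ordering: a message stamped `a` may influence one
    stamped `b` if every component of `a` is <= that of `b`."""
    return len(a) == len(b) and all(x <= y for x, y in zip(a, b))
```

The coordination mechanism the abstract alludes to follows from this order: an operator can be notified that a timestamp is complete once no outstanding message could result in it.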
A Sketch-Based Distance Oracle for Web-Scale Graphs
Abstract

Cited by 31 (2 self)
We study the fundamental problem of computing distances between nodes in large graphs such as the web graph and social networks. Our objective is to be able to answer distance queries between pairs of nodes in real time. Since the standard shortest path algorithms are expensive, our approach moves the time-consuming shortest-path computation offline, and at query time only looks up precomputed values and performs simple and fast computations on these precomputed values. More specifically, during the offline phase we compute and store a small “sketch” for each node in the graph, and at query time we look up the sketches of the source and destination nodes and perform a simple computation using these two sketches to estimate the distance. Categories and Subject Descriptors G.2.2 [Graph Theory]: Graph algorithms, path and circuit problems
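The offline/online split described above can be sketched as follows. This is a deliberately simplified sketch, not the paper's algorithm: seeds are sampled uniformly, each node's sketch stores its distance to every seed, and the query-time estimate is the best two-hop path through a shared seed:

```python
import random
from collections import deque

def bfs(adj, src):
    """Unweighted single-source distances (offline phase)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def build_sketches(adj, num_seeds=2, rng=random):
    """Offline: one BFS per sampled seed; each node keeps seed distances."""
    seeds = rng.sample(list(adj), num_seeds)
    sketches = {u: {} for u in adj}
    for s in seeds:
        for u, d in bfs(adj, s).items():
            sketches[u][s] = d
    return sketches

def estimate_distance(sk_u, sk_v):
    """Online: cheap lookup, no graph traversal — an upper bound on d(u, v)."""
    common = sk_u.keys() & sk_v.keys()
    if not common:
        return float("inf")
    return min(sk_u[s] + sk_v[s] for s in common)
```

With seeds on or near shortest paths the estimate is tight; in the worst case it only upper-bounds the true distance.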
Towards effective partition management for large graphs
 In SIGMOD, 2012
Abstract

Cited by 29 (1 self)
Searching and mining large graphs today is critical to a variety of application domains, ranging from community detection in social networks, to de novo genome sequence assembly. Scalable processing of large graphs requires careful partitioning and distribution of graphs across clusters. In this paper, we investigate the problem of managing large-scale graphs in clusters and study access characteristics of local graph queries such as breadth-first search, random walk, and SPARQL queries, which are popular in real applications. These queries exhibit strong access locality, and therefore require specific data partitioning strategies. In this work, we propose a Self Evolving Distributed Graph Management Environment (Sedge), to minimize inter-machine communication during graph query processing on multiple machines. In order to improve query response time and throughput, Sedge introduces a two-level partition management architecture with complementary primary partitions and dynamic secondary partitions. These two kinds of partitions are able to adapt in real time to changes in query workload. Sedge also includes a set of workload analyzing algorithms whose time complexity is linear or sublinear in graph size. Empirical results show that it significantly improves distributed graph processing on today’s commodity clusters.
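The two-level idea can be sketched with a hypothetical query router (all names are illustrative, not Sedge's API): route to a secondary partition when one fully covers the query's working set, otherwise fall back to the primary partition holding most of it:

```python
def route_query(query_vertices, primary, secondary):
    """Toy two-level routing sketch.
    primary:   dict vertex -> partition id (a disjoint cover of the graph)
    secondary: list of vertex sets (replicated hot cross-partition regions)
    """
    qs = set(query_vertices)
    # Prefer a secondary partition that contains the whole working set:
    # the query then runs with no inter-machine communication.
    for i, part in enumerate(secondary):
        if qs <= part:
            return ("secondary", i)
    # Otherwise, pick the primary partition holding the most query
    # vertices, minimizing (but not eliminating) cross-partition traffic.
    counts = {}
    for v in qs:
        p = primary[v]
        counts[p] = counts.get(p, 0) + 1
    return ("primary", max(counts, key=counts.get))
```

In the real system the secondary partitions are created and retired dynamically as the workload analysis detects hotspots; here they are just given.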
Earlybird: Real-Time Search at Twitter
Abstract

Cited by 26 (5 self)
The web today is increasingly characterized by social and real-time signals, which we believe represent two frontiers in information retrieval. In this paper, we present Earlybird, the core retrieval engine that powers Twitter’s real-time search service. Although Earlybird builds and maintains inverted indexes like nearly all modern retrieval engines, its index structures differ from those built to support traditional web search. We describe these differences and present the rationale behind our design. A key requirement of real-time search is the ability to ingest content rapidly and make it searchable immediately, while concurrently supporting low-latency, high-throughput query evaluation. These demands are met with a single-writer, multiple-reader concurrency model and the targeted use of memory barriers. Earlybird represents a point in the design space of real-time search engines that has worked well for Twitter’s needs. By sharing our experiences, we hope to spur additional interest and innovation in this exciting space.
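The single-writer, multiple-reader pattern can be sketched as an append-only posting list with an explicitly published length. This is a hypothetical Python illustration (Earlybird itself is Java, where a volatile maxDoc write supplies the memory barrier; here the publish step plays that role):

```python
class PostingList:
    """Lock-free single-writer, multiple-reader sketch: the writer
    appends postings, then publishes by advancing a bound; readers
    never look past the last published bound, so they never observe
    a partially ingested document."""

    def __init__(self):
        self._postings = []
        self._published = 0  # plays the role of Earlybird's volatile maxDoc

    def append(self, doc_id):
        """Writer thread only: write the data first, publish second."""
        self._postings.append(doc_id)
        self._published = len(self._postings)

    def snapshot(self):
        """Any reader thread: a consistent prefix of the index."""
        n = self._published
        return self._postings[:n]
```

Because readers load the bound once and traverse only the prefix, the writer needs no lock; new tweets become searchable the instant the bound advances.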
Differential dataflow
Abstract

Cited by 24 (1 self)
Existing computational models for processing continuously changing input data are unable to efficiently support iterative queries except in limited special cases. This makes it difficult to perform complex tasks, such as social-graph analysis on changing data at interactive timescales, which would greatly benefit those analyzing the behavior of services like Twitter. In this paper we introduce a new model called differential computation, which extends traditional incremental computation to allow arbitrarily nested iteration, and explain—with reference to a publicly available prototype system called Naiad—how differential computation can be efficiently implemented in the context of a declarative data-parallel dataflow language. The resulting system makes it easy to program previously intractable algorithms such as incrementally updated strongly connected components, and to integrate them with data transformation operations to obtain practically relevant insights from real data streams.
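The core representational idea can be sketched in a few lines (an illustrative simplification, not the prototype's implementation): a collection is stored as deltas indexed by partially ordered versions, e.g. (epoch, iteration) pairs, and the collection at a version is the sum of all deltas at versions less than or equal to it:

```python
from collections import Counter

def leq(v, w):
    """Product partial order on (epoch, iteration) versions — the
    nested structure that differential computation exploits."""
    return all(a <= b for a, b in zip(v, w))

class DifferentialCollection:
    def __init__(self):
        self.deltas = []  # (record, version, change in multiplicity)

    def update(self, record, version, delta):
        self.deltas.append((record, version, delta))

    def read(self, version):
        """Multiset at `version`: accumulate every delta at a version
        <= it under the partial order."""
        out = Counter()
        for rec, v, d in self.deltas:
            if leq(v, version):
                out[rec] += d
        return +out  # unary + drops zero and negative counts
```

Because versions are only partially ordered, a delta from a new input epoch and a delta from a deeper loop iteration coexist without one overwriting the other, which is what lets iteration and incremental update compose.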
Managing large graphs on multicores with graph awareness.
 In USENIX ATC’12, 2012
Abstract

Cited by 19 (1 self)
Grace is a graph-aware, in-memory, transactional graph management system, specifically built for real-time queries and fast iterative computations. It is designed to run on large multicores, taking advantage of the inherent parallelism to improve its performance. Grace contains a number of graph-specific and multicore-specific optimizations including graph partitioning, careful in-memory vertex ordering, batching of updates, and load balancing. It supports queries, searches, iterative computations, and transactional updates. Grace scales to large graphs (e.g., a Hotmail graph with 320 million vertices) and performs up to two orders of magnitude faster than commercial key-value stores and graph databases.
HipG: Parallel Processing of Large-Scale Graphs
Abstract

Cited by 12 (1 self)
Distributed processing of real-world graphs is challenging due to their size and the inherent irregular structure of graph computations. We present HipG, a distributed framework that facilitates programming parallel graph algorithms by composing the parallel application automatically from user-defined pieces of sequential work on graph nodes. To keep the user code high-level, the framework provides a unified interface for executing methods on local and non-local graph nodes and an abstraction of exclusive execution. The graph computations are managed by logical objects called synchronizers, which we used, for example, to implement distributed divide-and-conquer decomposition into strongly connected components. The code written in HipG is independent of a particular graph representation, to the point that the graph can be created on the fly, i.e., by the algorithm that computes on this graph, which we used to implement a distributed model checker. HipG programs are in general short and elegant; they achieve good portability, memory utilization, and performance.
Microsoft Research at TREC 2009: Web and relevance feedback tracks
 In Proceedings of the 18th Text REtrieval Conference (TREC), NIST Special Publication, 2009
Abstract

Cited by 12 (3 self)
We took part in the Web and Relevance Feedback tracks, using the ClueWeb09 corpus. To process the corpus, we developed a parallel processing pipeline which avoids the generation of an inverted file. We describe the components of the parallel architecture and the pipeline and how we ran the TREC experiments, and we present effectiveness results.
Of Hammers and Nails: An Empirical Comparison of Three Paradigms for Processing Large Graphs
Abstract

Cited by 9 (0 self)
Many phenomena and artifacts such as road networks, social networks and the web can be modeled as large graphs and analyzed using graph algorithms. However, given the size of the underlying graphs, efficient implementation of basic operations such as connected component analysis, approximate shortest paths, and link-based ranking (e.g., PageRank) becomes challenging. This paper presents an empirical study of computations on such large graphs in three well-studied platform models, viz., a relational model, a data-parallel model, and a special-purpose in-memory model. We choose a prototypical member of each platform model and analyze the computational efficiencies and requirements for five basic graph operations used in the analysis of real-world graphs, viz., PageRank, SALSA, Strongly Connected Components (SCC), Weakly Connected Components (WCC), and Approximate Shortest Paths (ASP). Further, we characterize each platform in terms of these computations using model-specific implementations of these algorithms on a large web graph. Our experiments show that there is no single platform that performs best across different classes of operations on large graphs. While relational databases are powerful and flexible tools that support a wide variety of computations, there are computations that benefit from using special-purpose storage systems and others that can exploit data-parallel platforms. Categories and Subject Descriptors G.2.2 [Discrete mathematics]: Graph Theory—Graph algorithms, path and circuit problems; H.2.4 [Database management]: Systems—Distributed
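For concreteness, one of the five benchmarked operations, PageRank, reduces in every platform model to the same power iteration; a minimal in-memory sketch (not any of the paper's actual implementations):

```python
def pagerank(adj, d=0.85, iters=50):
    """Power-iteration PageRank; adj maps node -> list of out-neighbors."""
    n = len(adj)
    rank = {u: 1.0 / n for u in adj}
    for _ in range(iters):
        nxt = {u: (1.0 - d) / n for u in adj}  # teleport mass
        for u, outs in adj.items():
            if not outs:
                # Dangling node: spread its rank uniformly.
                for v in adj:
                    nxt[v] += d * rank[u] / n
            else:
                share = d * rank[u] / len(outs)
                for v in outs:
                    nxt[v] += share
        rank = nxt
    return rank
```

The interesting question the study asks is not this loop itself but how each platform expresses it: as a self-join in SQL, as a sequence of map/reduce passes, or as an in-memory scan over adjacency lists, with very different efficiency profiles.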
Microsoft Research at TREC 2010 Web Track
Abstract

Cited by 3 (1 self)
This paper describes our entry into the TREC 2010 Web track. We extracted and ranked results for both last year’s and this year’s topics from the ClueWeb09 corpus using a parallel processing pipeline that avoids the generation of an inverted file. We describe the components of the parallel architecture and the pipeline and how we ran the TREC experiments, and we present effectiveness results.