Results 1–10 of 39
Discretized Streams: Fault-tolerant streaming computation at scale
 In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP)
, 2013
Abstract

Cited by 32 (5 self)
Many “big data” applications must act on data in real time. Running these applications at ever-larger scales requires parallel platforms that automatically handle faults and stragglers. Unfortunately, current distributed stream processing models provide fault recovery in an expensive manner, requiring hot replication or long recovery times, and do not handle stragglers. We propose a new processing model, discretized streams (D-Streams), that overcomes these challenges. D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes, and tolerates stragglers. We show that they support a rich set of operators while attaining high per-node throughput similar to single-node systems, linear scaling to 100 nodes, sub-second latency, and sub-second fault recovery. Finally, D-Streams can easily be composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes. We implement D-Streams in a system called Spark Streaming.
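The micro-batching idea behind discretized streams can be illustrated with a toy sketch (plain Python, not Spark Streaming's actual API): input records are grouped into short batches and each batch is processed by a deterministic function, so a lost batch result can be recomputed from its input partition rather than kept hot-replicated.

```python
from collections import defaultdict

def discretize(records, interval):
    """Group (timestamp, value) records into fixed-size micro-batches."""
    batches = defaultdict(list)
    for ts, value in records:
        batches[ts // interval].append(value)
    return dict(batches)

def process(batch):
    """A deterministic per-batch computation (here: a simple count)."""
    return len(batch)

records = [(0.5, "a"), (1.2, "b"), (1.9, "c"), (2.4, "d")]
batches = discretize(records, interval=1.0)
counts = {i: process(b) for i, b in batches.items()}

# Determinism is what makes parallel recovery cheap: a lost result can be
# recomputed from the batch's input and is guaranteed to match the original.
assert process(batches[1.0]) == counts[1.0]
```

The determinism of `process` stands in for the deterministic task model the abstract describes; recovery then reduces to re-running tasks in parallel.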
GraphX: Graph Processing in a Distributed Dataflow Framework
 In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14)
, 2014
Abstract

Cited by 16 (0 self)
In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantages of specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphX presents a familiar composable graph abstraction that is sufficient to express existing graph APIs, yet can be implemented using only a few basic dataflow operators (e.g., join, map, group-by). To achieve performance parity with specialized graph systems, GraphX recasts graph-specific optimizations as distributed join optimizations and materialized view maintenance. By leveraging advances in distributed dataflow frameworks, GraphX brings low-cost fault tolerance to graph processing. We evaluate GraphX on real workloads and demonstrate that GraphX achieves an order of magnitude performance gain over the base dataflow framework and matches the performance of specialized graph processing systems while enabling a wider range of computation.
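The claim that a few dataflow operators suffice for graph computation can be sketched as follows (a toy illustration in plain Python, not GraphX's actual API): one PageRank-like step expressed as a join of vertex state onto edges, a map over the resulting triplets, and a group-by-destination aggregation.

```python
# Toy graph: vertex state and an edge table, as two plain collections.
vertices = {1: 1.0, 2: 1.0, 3: 1.0}          # id -> rank
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]      # (src, dst)

# join + map: attach each source's rank contribution to its outgoing edges
out_degree = {}
for src, _ in edges:
    out_degree[src] = out_degree.get(src, 0) + 1
triplets = [(src, dst, vertices[src] / out_degree[src]) for src, dst in edges]

# group-by destination + aggregate: one PageRank-like update step
incoming = {}
for _, dst, contrib in triplets:
    incoming[dst] = incoming.get(dst, 0.0) + contrib

new_ranks = {v: 0.15 + 0.85 * incoming.get(v, 0.0) for v in vertices}
```

In a real dataflow engine each of these loops would be a distributed operator, and the graph-specific optimizations the abstract mentions amount to making the join and group-by cheap.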
Dandelion: A compiler and runtime for heterogeneous systems
 In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, ser. SOSP ’13
, 2013
Abstract

Cited by 7 (0 self)
Computer systems increasingly rely on heterogeneity to achieve greater performance, scalability and energy efficiency. Because heterogeneous systems typically comprise multiple execution contexts with different programming abstractions and runtimes, programming them remains extremely challenging. Dandelion is a system designed to address this programmability challenge for data-parallel applications. Dandelion provides a unified programming model for heterogeneous systems that span diverse execution contexts including CPUs, GPUs, FPGAs, and the cloud. It adopts the .NET LINQ (Language INtegrated Query) approach, integrating data-parallel operators into general-purpose programming languages such as C# and F#. It therefore provides an expressive data model and native language integration for user-defined functions, enabling programmers to write applications using standard high-level languages and development tools. Dandelion automatically and transparently distributes data-parallel portions of a program to available computing resources, including compute clusters for distributed execution and CPU and GPU cores of individual nodes for parallel execution. To enable automatic execution of .NET code on GPUs, Dandelion cross-compiles .NET code to CUDA kernels and uses the PTask runtime [85] to manage GPU execution. This paper discusses the design and implementation of Dandelion, focusing on the distributed CPU and GPU implementation. We evaluate the system using a diverse set of workloads.
All-distances sketches, revisited: HIP estimators for massive graphs analysis
 In Proceedings of the 33rd ACM Symposium on Principles of Database Systems, ACM
, 2014
Abstract

Cited by 7 (4 self)
Graph datasets with billions of edges, such as social and Web graphs, are prevalent. To be feasible, computation on such large graphs should scale linearly with graph size. All-distances sketches (ADSs) are emerging as a powerful tool for scalable computation of some basic properties of individual nodes or the whole graph. ADSs were first proposed two decades ago (Cohen 1994) and more recent algorithms include ANF (Palmer, Gibbons, and Faloutsos 2002) and HyperANF (Boldi, Rosa, and Vigna 2011). A sketch of logarithmic size is computed for each node in the graph, and the computation in total requires only a near-linear number of edge relaxations. From the ADS of a node, we can estimate its neighborhood cardinalities (the number of nodes within some query distance) and closeness centrality. More generally, we can estimate the distance distribution, effective diameter, similarities, and other parameters of the full graph. We make several contributions which facilitate a more effective use of ADSs for scalable analysis of massive graphs. We provide, for the first time, a unified exposition of ADS algorithms and applications. We present the Historic Inverse Probability (HIP) estimators, which are applied to the ADS of a node to estimate a large natural class of queries including neighborhood cardinalities and closeness centralities. We show that our HIP estimators have at most half the variance of previous neighborhood cardinality estimators and that this is essentially optimal. Moreover, HIP obtains a polynomial improvement for more general queries, and the estimators are simple, flexible, unbiased, and elegant. We apply HIP to approximate distinct counting on streams by comparing HIP and the original estimators applied to the HyperLogLog MinHash sketches (Flajolet et al. 2007). We demonstrate significant improvement in estimation quality for this state-of-the-art practical algorithm and also illustrate the ease of applying HIP. Finally, we study the quality of ADS estimation of distance ranges, generalizing the near-linear time factor-2 approximation of the diameter.
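As a rough illustration of the kind of sketch-based distinct counting discussed above, here is a simplified bottom-k MinHash estimator (a toy relative of the MinHash sketches the paper analyzes; it is not the HIP estimator itself, and the (k-1)/t formula is the standard bottom-k cardinality estimate):

```python
import hashlib

def h(item):
    """Hash an item to an effectively uniform value in (0, 1)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def bottom_k_sketch(stream, k):
    """Keep the k smallest distinct hash values seen; sketches are mergeable."""
    return sorted({h(x) for x in stream})[:k]

def estimate_cardinality(sketch, k):
    """If the k-th smallest of n uniform draws is t, n is roughly (k-1)/t."""
    if len(sketch) < k:
        return len(sketch)        # fewer than k distinct items: count is exact
    return (k - 1) / sketch[-1]

stream = [i % 1000 for i in range(100_000)]   # 1000 distinct items
sketch = bottom_k_sketch(stream, k=256)
est = estimate_cardinality(sketch, k=256)     # close to 1000 in expectation
```

The sketch is logarithmic-to-small in size relative to the stream, which is the same space/accuracy trade-off that makes ADS-based estimation feasible on billion-edge graphs.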
GraphX: Unifying DataParallel and GraphParallel Analytics
, 2014
Abstract

Cited by 4 (0 self)
From social networks to language modeling, the growing scale and importance of graph data has driven the development of numerous new graph-parallel systems (e.g., Pregel, GraphLab). By restricting the computation that can be expressed and introducing new techniques to partition and distribute the graph, these systems can efficiently execute iterative graph algorithms orders of magnitude faster than more general data-parallel systems. However, the same restrictions that enable the performance gains also make it difficult to express many of the important stages in a typical graph-analytics pipeline: constructing the graph, modifying its structure, or expressing computation that spans multiple graphs. As a consequence, existing graph analytics pipelines compose graph-parallel and data-parallel systems using external storage systems, leading to extensive data movement and a complicated programming model. To address these challenges we introduce GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. GraphX provides a small, core set of graph-parallel operators expressive enough to implement the Pregel and PowerGraph abstractions, yet simple enough to be cast in relational algebra. GraphX uses a collection of query optimization techniques such as automatic join rewrites to efficiently implement these graph-parallel operators. We evaluate GraphX on real-world graphs and workloads and demonstrate that GraphX achieves performance comparable to specialized graph computation systems, while outperforming them in end-to-end graph pipelines. Moreover, GraphX achieves a balance between expressiveness, performance, and ease of use.
Broom: sweeping out Garbage Collection from Big Data systems
Abstract

Cited by 3 (0 self)
Many popular systems for processing “big data” are implemented in high-level programming languages with automatic memory management via garbage collection (GC). However, high object churn and large heap sizes put severe strain on the garbage collector. As a result, applications underperform significantly: GC increases the runtime of typical data processing tasks by up to 40%. We propose to use region-based memory management instead of GC in distributed data processing systems. In these systems, many objects have clearly defined lifetimes. Hence, it is natural to allocate these objects in fate-sharing regions, obviating the need to scan a large heap. Regions can be memory-safe and could be inferred automatically. Our initial results show that region-based memory management reduces emulated Naiad vertex runtime by 34% for typical data analytics jobs.
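The fate-sharing-region idea can be sketched in a few lines (a toy illustration of the concept, not Broom's implementation, which operates below a managed runtime): objects with a common lifetime are allocated into one region and reclaimed together, with no per-object tracing.

```python
class Region:
    """A toy region: owns a batch of objects and releases them in one step."""
    def __init__(self):
        self._objects = []

    def alloc(self, obj):
        self._objects.append(obj)   # the region, not a GC, owns the object
        return obj

    def free_all(self):
        n = len(self._objects)
        self._objects.clear()       # one bulk release; no heap scan needed
        return n

# e.g. all intermediate records produced by one dataflow vertex invocation
# share a region and die together when the invocation finishes.
region = Region()
for i in range(1000):
    region.alloc({"key": i, "value": i * 2})
released = region.free_all()
```

The design point the abstract argues for is exactly this: when lifetimes are known, reclamation cost is O(1) per region instead of proportional to live-heap size.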
NUMA-aware graph-structured analytics
 In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)
, 2015
Abstract

Cited by 3 (0 self)
Graph-structured analytics has been widely adopted in a number of big data applications such as social computation, web search, and recommendation systems. Though much prior research focuses on scaling graph analytics in distributed environments, the strong desire for performance per core, dollar, and joule has generated considerable interest in processing large-scale graphs on a single server-class machine, which may have several terabytes of RAM and 80 or more cores. However, prior graph-analytics systems are largely neutral to NUMA characteristics and thus have suboptimal performance. This paper presents a detailed study of NUMA characteristics and their impact on the efficiency of graph analytics. Our study uncovers two insights: 1) either random or interleaved allocation of graph data will significantly hamper data locality and parallelism; 2) sequential inter-node (i.e., remote) memory accesses have much higher bandwidth than both intra- and inter-node random ones. Based on these insights, this paper describes Polymer, a NUMA-aware graph-analytics system for multicore machines with two key design decisions. First, Polymer differentially allocates and places the topology data, application-defined data, and mutable runtime states of a graph system according to their access patterns, to minimize remote accesses. Second, for some remaining random accesses, Polymer carefully converts random remote accesses into sequential remote accesses by using lightweight replication of vertices across NUMA nodes. To improve load balance and vertex convergence, Polymer is further built with a hierarchical barrier to boost parallelism and locality, an edge-oriented balanced partitioning for skewed graphs, and adaptive data structures that reflect the proportion of active vertices. A detailed evaluation on an 80-core machine shows that Polymer often outperforms state-of-the-art single-machine graph-analytics systems, including Ligra, X-Stream, and Galois, for a set of popular real-world and synthetic graphs.
MALT: Distributed data-parallelism for existing ML applications
 In Proceedings of the Tenth European Conference on Computer Systems. ACM
, 2015
Abstract

Cited by 2 (0 self)
Machine learning methods, such as SVMs and neural networks, often improve their accuracy by using models with more parameters trained on large numbers of examples. Building such models on a single machine is often impractical because of the large amount of computation required. We introduce MALT, a machine learning library that integrates with existing machine learning software and provides data-parallel machine learning. MALT provides abstractions for fine-grained in-memory updates using one-sided RDMA, limiting data movement costs during incremental model updates. MALT allows machine learning developers to specify the dataflow and apply communication and representation optimizations. Through its general-purpose API, MALT can be used to provide data-parallelism to existing ML applications written in C++ and Lua and based on SVMs, matrix factorization, and neural networks. In our results, we show MALT provides fault tolerance, network efficiency, and speedup to these applications.
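The data-parallel training pattern that such a library supports can be sketched abstractly (a toy gradient-averaging loop in plain Python; MALT's actual mechanism uses one-sided RDMA updates and its own API, neither of which appears here):

```python
def gradient(w, shard):
    """Gradient of mean squared error for a 1-D linear model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train(shards, w=0.0, lr=0.01, steps=100):
    """Each worker computes a gradient on its shard; updates are averaged."""
    for _ in range(steps):
        grads = [gradient(w, s) for s in shards]   # parallel in practice
        w -= lr * sum(grads) / len(grads)          # combined model update
    return w

data = [(x, 3.0 * x) for x in range(1, 9)]         # true weight is 3
shards = [data[:4], data[4:]]                      # one shard per "worker"
w = train(shards)                                  # converges near w = 3
```

The communication cost of exchanging `grads` each step is exactly what the abstract's fine-grained, incremental update mechanism is designed to reduce.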
Parallel Algorithms for Geometric Graph Problems
, 2014
Abstract

Cited by 1 (0 self)
We give algorithms for geometric graph problems in the modern parallel models such as MapReduce [DG04, KSV10, GSZ11, BKS13]. For example, for the Minimum Spanning Tree (MST) problem over a set of points in the two-dimensional space, our algorithm computes a (1 + ε)-approximate MST. Our algorithms work in a constant number of rounds of communication, while using total space and communication proportional to the size of the data (linear space and near-linear time algorithms). In contrast, for general graphs, achieving the same result for MST (or even connectivity) remains a challenging open problem [BKS13], despite drawing significant attention in recent years. We develop a general algorithmic framework that, besides MST, also applies to Earth-Mover Distance (EMD) and the transportation cost problem. Our algorithmic framework has implications beyond the MapReduce model. For example, it yields a new algorithm for computing EMD cost in the plane in near-linear time, n^(1+o(1)). We note that while [SA12b] recently developed a near-linear time algorithm for (1 + ε)-approximating EMD, our algorithm is fundamentally different and, for example, also solves the transportation (cost) problem, raised as an open question in [SA12b]. Furthermore, our algorithm immediately gives a (1 + ε)-approximation algorithm with n^δ space in the streaming-with-sorting model with (1/δ)^O(1) passes. As such, it is tempting to conjecture that the parallel models may also constitute a concrete playground in the quest for efficient algorithms for EMD (and other similar problems) in the vanilla streaming model, a well-known open problem [P07, P49].
Exploiting iterativeness for parallel ML computations
Abstract

Cited by 1 (1 self)
Many large-scale machine learning (ML) applications use iterative algorithms to converge on parameter values that make the chosen model fit the input data. Often, this approach results in the same sequence of accesses to parameters repeating each iteration. This paper shows that these repeating patterns can and should be exploited to improve the efficiency of the parallel and distributed ML applications that will be a mainstay in cloud computing environments. Focusing on the increasingly popular “parameter server” approach to sharing model parameters among worker threads, we describe and demonstrate how the repeating patterns can be exploited. Examples include replacing dynamic cache and server structures with static pre-serialized structures, informing prefetch and partitioning decisions, and determining which data should be cached at each thread to avoid both contention and slow accesses to memory banks attached to other sockets. Experiments show that such exploitation reduces per-iteration time by 33–98% for three real ML workloads, and that these improvements are robust to variation in the patterns over time.
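The repeating-access-pattern idea can be sketched as follows (a hypothetical client-side cache, not the paper's parameter-server implementation): the first iteration records which parameter keys are touched, and later iterations prefetch exactly that set before the computation begins.

```python
class ParamCache:
    """Toy client cache that learns the access pattern from iteration one."""
    def __init__(self, server):
        self.server = server            # stands in for the remote param store
        self.pattern = []               # access sequence seen in iteration 1
        self.recorded = False
        self.local = {}

    def begin_iteration(self):
        if self.recorded:
            # The same keys repeat every iteration, so fetch them in one batch.
            self.local = {k: self.server[k] for k in self.pattern}

    def get(self, key):
        if not self.recorded:
            self.pattern.append(key)    # learn the pattern on the first pass
            return self.server[key]
        return self.local[key]          # served from prefetch, no round trip

    def end_iteration(self):
        self.recorded = True

server = {"w0": 1.0, "w1": 2.0, "w2": 3.0}
cache = ParamCache(server)
for it in range(3):
    cache.begin_iteration()
    total = sum(cache.get(k) for k in ("w0", "w2"))   # repeating pattern
    cache.end_iteration()
```

Static pre-serialized structures and partitioning decisions, as described in the abstract, generalize this same trick: once the pattern is known, per-iteration bookkeeping can be precomputed.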