Results 1 - 10 of 48
Discretized Streams: Fault-tolerant streaming computation at scale
- In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP), 2013
Cited by 45 (6 self)
Many “big data” applications must act on data in real time. Running these applications at ever-larger scales requires parallel platforms that automatically handle faults and stragglers. Unfortunately, current distributed stream processing models provide fault recovery in an expensive manner, requiring hot replication or long recovery times, and do not handle stragglers. We propose a new processing model, discretized streams (D-Streams), that overcomes these challenges. D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes, and tolerates stragglers. We show that they support a rich set of operators while attaining high per-node throughput similar to single-node systems, linear scaling to 100 nodes, sub-second latency, and sub-second fault recovery. Finally, D-Streams can easily be composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes. We implement D-Streams in a system called Spark Streaming.
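The core mechanism this abstract describes can be illustrated with a small, hypothetical sketch (plain Python, not the Spark Streaming API): the stream is cut into short deterministic batches, and each batch result is a pure function of its input, so a lost result can be recomputed in parallel rather than restored from a hot replica.

```python
# A minimal, hypothetical sketch of the D-Streams idea (plain Python, not
# the Spark Streaming API). The stream is discretized into short batches;
# each batch result is a deterministic function of its input, so a lost
# result can be recomputed in parallel rather than restored from a replica.

def discretize(stream, interval):
    """Group (timestamp, record) pairs into batches of `interval` seconds."""
    batches = {}
    for ts, record in stream:
        batches.setdefault(int(ts // interval), []).append(record)
    return [batches[k] for k in sorted(batches)]

def word_count(batch):
    """Deterministic per-batch computation (here, a word count)."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = [(0.1, "a"), (0.4, "b"), (1.2, "a"), (1.9, "a")]
results = [word_count(b) for b in discretize(stream, 1.0)]
print(results)   # [{'a': 1, 'b': 1}, {'a': 2}]
# Recovery: if results[1] is lost, recompute word_count on batch 1 alone.
```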
GraphX: Graph Processing in a Distributed Dataflow Framework
- In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14), USENIX Association, 2014
Cited by 23 (1 self)
In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantages of specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphX presents a familiar composable graph abstraction that is sufficient to express existing graph APIs, yet can be implemented using only a few basic dataflow operators (e.g., join, map, group-by). To achieve performance parity with specialized graph systems, GraphX recasts graph-specific optimizations as distributed join optimizations and materialized view maintenance. By leveraging advances in distributed dataflow frameworks, GraphX brings low-cost fault tolerance to graph processing. We evaluate GraphX on real workloads and demonstrate that GraphX achieves an order of magnitude performance gain over the base dataflow framework and matches the performance of specialized graph processing systems while enabling a wider range of computation.
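The key claim above, that graph operators can be expressed with a few basic dataflow operators, can be sketched in plain Python (illustrative names only, not GraphX's Scala API): the "triplets" view is a two-way join of the edge table with the vertex table, and a message-passing step becomes a group-by aggregation over it.

```python
# Illustrative sketch (invented names, not GraphX's Scala API) of graph
# operators expressed with basic dataflow operators: the "triplets" view
# joins the edge table with the vertex table on both endpoints, and a
# message-passing step is a group-by aggregation over the triplets.

vertices = {1: 0.25, 2: 0.25, 3: 0.5}   # vertex id -> attribute (e.g. rank)
edges = [(1, 2), (2, 3), (3, 1)]        # (src, dst)

# Join edges with the vertex table on both endpoints to form triplets.
triplets = [(src, vertices[src], dst, vertices[dst]) for src, dst in edges]

# One message-passing step (sum of in-neighbour attributes), as a group-by.
incoming = {}
for src, src_attr, dst, _dst_attr in triplets:
    incoming[dst] = incoming.get(dst, 0.0) + src_attr
print(incoming)   # {2: 0.25, 3: 0.25, 1: 0.5}
```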
Scaling distributed machine learning with the parameter server.
- In USENIX OSDI, 2014
Cited by 14 (0 self)
We propose a parameter server framework for distributed machine learning problems. Both data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices. The framework manages asynchronous data communication between nodes, and supports flexible consistency models, elastic scalability, and continuous fault tolerance. To demonstrate the scalability of the proposed framework, we show experimental results on petabytes of real data with billions of examples and parameters on problems ranging from Sparse Logistic Regression to Latent Dirichlet Allocation and Distributed Sketching.
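The pattern this abstract describes can be shown as a hypothetical single-process sketch (names like `ParameterServer`, `push` and `pull` are illustrative, not the paper's API): server state holds the shared parameters; workers pull current values, compute an update on their data shard, and push it back.

```python
# Hypothetical single-process sketch of the parameter-server pattern
# (invented names, not the paper's API): the server holds the shared
# parameter vector; workers pull it, compute updates locally, and push
# sparse updates back, touching only the coordinates they changed.

class ParameterServer:
    def __init__(self, dim):
        self.w = [0.0] * dim          # globally shared parameter vector

    def pull(self):
        return list(self.w)           # workers fetch current parameters

    def push(self, grad, lr=0.1):
        # Sparse update: only the touched coordinates are sent and applied.
        for i, g in grad.items():
            self.w[i] -= lr * g

server = ParameterServer(dim=3)
for worker_grad in [{0: 1.0}, {0: 1.0, 2: -2.0}]:   # two workers' updates
    server.push(worker_grad)
print([round(v, 2) for v in server.pull()])   # [-0.2, 0.0, 0.2]
```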
Dandelion: a compiler and runtime for heterogeneous systems
- In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP), ACM, 2013
Cited by 8 (0 self)
Computer systems increasingly rely on heterogeneity to achieve greater performance, scalability and energy efficiency. Because heterogeneous systems typically comprise multiple execution contexts with different programming abstractions and runtimes, programming them remains extremely challenging. Dandelion is a system designed to address this programmability challenge for data-parallel applications. Dandelion provides a unified programming model for heterogeneous systems that span diverse execution contexts including CPUs, GPUs, FPGAs, and the cloud. It adopts the .NET LINQ (Language INtegrated Query) approach, integrating data-parallel operators into general purpose programming languages such as C# and F#. It therefore provides an expressive data model and native language integration for user-defined functions, enabling programmers to write applications using standard high-level languages and development tools. Dandelion automatically and transparently distributes data-parallel portions of a program to available computing resources, including compute clusters for distributed execution and CPU and GPU cores of individual nodes for parallel execution. To enable automatic execution of .NET code on GPUs, Dandelion cross-compiles .NET code to CUDA kernels and uses the PTask runtime [85] to manage GPU execution. This paper discusses the design and implementation of Dandelion, focusing on the distributed CPU and GPU implementation. We evaluate the system using a diverse set of workloads.
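The language-integrated, data-parallel operator style the abstract describes can be mimicked in a toy Python sketch. This captures the flavour only; it is not Dandelion's .NET implementation, and `select`/`where` are just the LINQ operator names reused here.

```python
# Toy sketch of LINQ-style data-parallel operators embedded in a host
# language, in the spirit described above (not Dandelion's .NET code).
# Because each operator body is a pure function, the runtime is free to
# place it on any execution context; a thread pool stands in for that here.
from multiprocessing.dummy import Pool

def select(data, f):
    with Pool(4) as pool:            # apply f to every element in parallel
        return pool.map(f, data)

def where(data, pred):
    return [x for x in data if pred(x)]

squares_of_evens = select(where(range(10), lambda x: x % 2 == 0),
                          lambda x: x * x)
print(squares_of_evens)   # [0, 4, 16, 36, 64]
```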
All-distances sketches, revisited: Hip estimators for massive graphs analysis
- In Proceedings of the 33rd ACM Symposium on Principles of Database Systems (PODS), ACM, 2014
Cited by 7 (4 self)
Graph datasets with billions of edges, such as social and Web graphs, are prevalent. To be feasible, computation on such large graphs should scale linearly with graph size. All-distances sketches (ADSs) are emerging as a powerful tool for scalable computation of some basic properties of individual nodes or the whole graph. ADSs were first proposed two decades ago (Cohen 1994) and more recent algorithms include ANF (Palmer, Gibbons, and Faloutsos 2002) and hyperANF (Boldi, Rosa, and Vigna 2011). A sketch of logarithmic size is computed for each node in the graph and the computation in total requires only a near linear number of edge relaxations. From the ADS of a node, we can estimate its neighborhood cardinalities (the number of nodes within some query distance) and closeness centrality. More generally we can estimate the distance distribution, effective diameter, similarities, and other parameters of the full graph. We make several contributions which facilitate a more effective use of ADSs for scalable analysis of massive graphs. We provide, for the first time, a unified exposition of ADS algorithms and applications. We present the Historic Inverse Probability (HIP) estimators which are applied to the ADS of a node to estimate a large natural class of queries including neighborhood cardinalities and closeness centralities. We show that our HIP estimators have at most half the variance of previous neighborhood cardinality estimators and that this is essentially optimal. Moreover, HIP obtains a polynomial improvement for more general queries and the estimators are simple, flexible, unbiased, and elegant. We apply HIP for approximate distinct counting on streams by comparing HIP and the original estimators applied to the HyperLogLog Min-Hash sketches (Flajolet et al. 2007). We demonstrate significant improvement in estimation quality for this state-of-the-art practical algorithm and also illustrate the ease of applying HIP.
Finally, we study the quality of ADS estimation of distance ranges, generalizing the near-linear time factor-2 approximation of the diameter.
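For flavour, here is a minimal k-minimum-values (KMV) distinct-counting sketch from the same Min-Hash family that the HIP estimators apply to. This is not the paper's HIP estimator, just the simplest relative of these sketches; all names are illustrative.

```python
# Minimal k-minimum-values (KMV) sketch: hash every item to (0, 1), keep
# the k smallest distinct hash values, and estimate cardinality from the
# k-th smallest. A relative of the Min-Hash sketches discussed above, not
# the HIP estimator itself.
import hashlib

def kmv_estimate(items, k=64):
    """Estimate the number of distinct items from the k smallest hashes."""
    hashes = set()
    for x in items:
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")
        hashes.add(h / 2**64)            # hash to a uniform value in (0, 1)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        return len(smallest)             # fewer than k distinct items seen
    return (k - 1) / smallest[-1]        # classic KMV cardinality estimate

items = [i % 1000 for i in range(10_000)]    # 1000 distinct values
estimate = kmv_estimate(items, k=64)         # typically within ~15% of 1000
```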
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
, 2014
Cited by 4 (0 self)
From social networks to language modeling, the growing scale and importance of graph data has driven the development of numerous new graph-parallel systems (e.g., Pregel, GraphLab). By restricting the computation that can be expressed and introducing new techniques to partition and distribute the graph, these systems can efficiently execute iterative graph algorithms orders of magnitude faster than more general data-parallel systems. However, the same restrictions that enable the performance gains also make it difficult to express many of the important stages in a typical graph-analytics pipeline: constructing the graph, modifying its structure, or expressing computation that spans multiple graphs. As a consequence, existing graph analytics pipelines compose graph-parallel and data-parallel systems using external storage systems, leading to extensive data movement and complicated programming models. To address these challenges we introduce GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. GraphX provides a small, core set of graph-parallel operators expressive enough to implement the Pregel and PowerGraph abstractions, yet simple enough to be cast in relational algebra. GraphX uses a collection of query optimization techniques such as automatic join rewrites to efficiently implement these graph-parallel operators. We evaluate GraphX on real-world graphs and workloads and demonstrate that GraphX achieves performance comparable to specialized graph computation systems, while outperforming them in end-to-end graph pipelines. Moreover, GraphX achieves a balance between expressiveness, performance, and ease of use.
MALT: Distributed Data-Parallelism for Existing ML Applications.
- In Proceedings of the 10th European Conference on Computer Systems (EuroSys), ACM, 2015
Cited by 3 (0 self)
We introduce MALT, a machine learning library that integrates with existing machine learning software and provides peer-to-peer data-parallel machine learning. MALT provides abstractions for fine-grained in-memory updates using one-sided RDMA, limiting data movement costs during incremental model updates. MALT allows machine learning developers to specify the dataflow and apply communication and representation optimizations. In our results, we find that MALT provides fault tolerance, network efficiency and speedup to SVM, matrix factorization and neural networks.
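The peer-to-peer pattern described here can be sketched with invented names: each worker writes its update directly into its peers' receive buffers (a stand-in for one-sided RDMA writes, which need no receiver involvement) and then averages what it has received; there is no central server.

```python
# Invented-name sketch of peer-to-peer data-parallel learning: workers
# write updates straight into per-peer receive buffers (simulating
# one-sided RDMA writes) and average locally; no central parameter server.

NUM_WORKERS = 3
buffers = [[] for _ in range(NUM_WORKERS)]   # per-worker receive buffers

def push_update(sender, update):
    """'One-sided' write of an update into every peer's buffer."""
    for peer in range(NUM_WORKERS):
        if peer != sender:
            buffers[peer].append(update)

def average_model(worker, local):
    """Combine the local value with everything received from peers."""
    received = buffers[worker]
    return (local + sum(received)) / (1 + len(received))

for w, update in enumerate([1.0, 2.0, 3.0]):
    push_update(w, update)
print([average_model(w, v) for w, v in enumerate([1.0, 2.0, 3.0])])
# [2.0, 2.0, 2.0] -- every worker converges to the same averaged model
```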
Broom: sweeping out Garbage Collection from Big Data systems
Cited by 3 (0 self)
Many popular systems for processing “big data” are implemented in high-level programming languages with automatic memory management via garbage collection (GC). However, high object churn and large heap sizes put severe strain on the garbage collector. As a result, applications underperform significantly: GC increases the runtime of typical data processing tasks by up to 40%. We propose to use region-based memory management instead of GC in distributed data processing systems. In these systems, many objects have clearly defined lifetimes. Hence, it is natural to allocate these objects in fate-sharing regions, obviating the need to scan a large heap. Regions can be memory-safe and could be inferred automatically. Our initial results show that region-based memory management reduces emulated Naiad vertex runtime by 34% for typical data analytics jobs.
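The region-based alternative to GC proposed here can be illustrated with a small sketch (the `Region` class name is invented): objects with a shared, clearly defined lifetime are allocated in one region and reclaimed together in a single step, so no large heap ever needs to be scanned.

```python
# Illustrative sketch (class name invented) of region-based memory
# management: fate-sharing objects live in one region and are reclaimed
# together when the region closes, with nothing to scan.

class Region:
    """All objects allocated in this region die together when it closes."""
    def __init__(self):
        self.objects = []

    def alloc(self, obj):
        self.objects.append(obj)
        return obj

    def close(self):
        self.objects.clear()   # frees the whole region at once

# e.g. one region per dataflow-vertex invocation:
region = Region()
batch = [region.alloc({"key": i}) for i in range(3)]
assert len(region.objects) == 3
region.close()                 # the whole batch is reclaimed in one step
assert len(region.objects) == 0
```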
NUMA-aware graph-structured analytics
- In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2015
Cited by 3 (0 self)
Graph-structured analytics has been widely adopted in a number of big data applications such as social computation, web search and recommendation systems. Though much prior research focuses on scaling graph analytics on distributed environments, the strong desire for performance per core, dollar and joule has generated considerable interest in processing large-scale graphs on a single server-class machine, which may have several terabytes of RAM and 80 or more cores. However, prior graph-analytics systems are largely neutral to NUMA characteristics and thus have suboptimal performance. This paper presents a detailed study of NUMA characteristics and their impact on the efficiency of graph analytics. Our study uncovers two insights: 1) either random or interleaved allocation of graph data will significantly hamper data locality and parallelism; 2) sequential inter-node (i.e., remote) memory accesses have much higher bandwidth than both intra- and inter-node random ones. Based on these insights, this paper describes Polymer, a NUMA-aware graph-analytics system on multicore with two key design decisions. First, Polymer differentially allocates and places topology data, application-defined data and mutable runtime states of a graph system according to their access patterns to minimize remote accesses. Second, for some remaining random accesses, Polymer carefully converts random remote accesses into sequential remote accesses, by using lightweight replication of vertices across NUMA nodes. To improve load balance and vertex convergence, Polymer is further built with a hierarchical barrier to boost parallelism and locality, an edge-oriented balanced partitioning for skewed graphs, and adaptive data structures according to the proportion of active vertices.
A detailed evaluation on an 80-core machine shows that Polymer often outperforms the state-of-the-art single-machine graph-analytics systems, including Ligra, X-Stream and Galois, for a set of popular real-world and synthetic graphs.
Queues don’t matter when you can JUMP them!
Cited by 2 (0 self)
QJUMP is a simple and immediately deployable approach to controlling network interference in datacenter networks. Network interference occurs when congestion from throughput-intensive applications causes queueing that delays traffic from latency-sensitive applications. To mitigate network interference, QJUMP applies Internet QoS-inspired techniques to datacenter applications. Each application is assigned to a latency sensitivity level (or class). Packets from higher levels are rate-limited in the end host, but once allowed into the network can “jump-the-queue” over packets from lower levels. In settings with known node counts and link speeds, QJUMP can support service levels ranging from strictly bounded latency (but with low rate) through to line-rate throughput (but with high latency variance). We have implemented QJUMP as a Linux Traffic Control module. We show that QJUMP achieves bounded latency and reduces in-network interference by up to