A Survey of Large-Scale Analytical Query Processing in MapReduce
The VLDB Journal, 2013
"... Enterprises today acquire vast volumes of data from different sources and leverage this information by means of data analysis to support effective decision-making and provide new functionality and services. The key requirement of data analytics is scalability, simply due to the immense volume of dat ..."
Cited by 10 (0 self)
Abstract:
Enterprises today acquire vast volumes of data from different sources and leverage this information by means of data analysis to support effective decision-making and provide new functionality and services. The key requirement of data analytics is scalability, simply due to the immense volume of data that needs to be extracted, processed, and analyzed in a timely fashion. Arguably the most popular framework for contemporary large-scale data analytics is MapReduce, mainly due to its salient features, which include scalability, fault tolerance, ease of programming, and flexibility. However, despite its merits, MapReduce has evident performance limitations in miscellaneous analytical tasks, and this has given rise to a significant body of research that aims at improving its efficiency while maintaining its desirable properties. This survey reviews the state of the art in improving the performance of parallel query processing using MapReduce. The most significant weaknesses and limitations of MapReduce are discussed at a high level, along with techniques for addressing them. A taxonomy is presented for categorizing existing research on MapReduce improvements according to the specific problem they target, and, based on the proposed taxonomy, a classification of existing approaches is provided.
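Since the survey's subject is the MapReduce programming model itself, a minimal self-contained sketch of that model may help as a point of reference: the classic word-count example, with a toy in-process driver. The `run_mapreduce` helper is our own illustration, not any real framework's API.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc_id, text):
    """Map phase: emit (word, 1) for every word in a document."""
    for word in text.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce phase: sum the partial counts for one key."""
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group all intermediate pairs by key, as the framework would.
    pairs = sorted(kv for k, v in inputs for kv in map_fn(k, v))
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        results.extend(reduce_fn(key, (v for _, v in group)))
    return results

print(run_mapreduce([(1, "big data big analytics")], map_fn, reduce_fn))
# [('analytics', 1), ('big', 2), ('data', 1)]
```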
Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can
In Proc. of SIGCOMM, 2015
"... To reduce the impact of network congestion on big data jobs, cluster management frameworks use various heuristics to schedule compute tasks and/or network flows. Most of these schedulers consider the job input data fixed and greed-ily schedule the tasks and flows that are ready to run. How-ever, a l ..."
Cited by 4 (2 self)
Abstract:
To reduce the impact of network congestion on big data jobs, cluster management frameworks use various heuristics to schedule compute tasks and/or network flows. Most of these schedulers consider the job input data fixed and greedily schedule the tasks and flows that are ready to run. However, a large fraction of production jobs are recurring with predictable characteristics, which allows us to plan ahead for them. Coordinating the placement of data and tasks of these jobs significantly improves their network locality and frees up bandwidth, which can be used by other jobs running on the cluster. With this intuition, we develop Corral, a scheduling framework that uses characteristics of future workloads to determine an offline schedule which (i) jointly places data and compute to achieve better data locality, and (ii) isolates jobs both spatially (by scheduling them in different parts of the cluster) and temporally, improving their performance. We implement Corral on Apache YARN and evaluate it on a 210-machine cluster using production workloads. Compared to YARN's capacity scheduler, Corral reduces the makespan of these workloads by up to 33% and the median completion time by up to 56%, with a 20-90% reduction in data transferred across racks.
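The abstract does not reproduce Corral's planner; the toy sketch below only illustrates the flavor of offline joint placement it describes, under invented job sizes, rack capacities, and a simple largest-first greedy rule.

```python
# Toy offline planner in the spirit of Corral (not its actual algorithm):
# place each recurring job's data and tasks on the same rack and stagger
# jobs over time slots to avoid contention. All numbers are made up.

def plan(jobs, racks, slots):
    """jobs: {name: input_gb}; racks: {rack: capacity_gb}; slots: int."""
    schedule = {}                         # job -> (rack, time_slot)
    load = {(r, t): 0 for r in racks for t in range(slots)}
    # Place the biggest recurring jobs first (greedy, largest-first).
    for job, size in sorted(jobs.items(), key=lambda kv: -kv[1]):
        best = min(
            ((r, t) for r in racks for t in range(slots)
             if load[(r, t)] + size <= racks[r]),
            key=lambda rt: load[rt], default=None)
        if best is None:
            raise ValueError(f"no feasible slot for {job}")
        schedule[job] = best
        load[best] += size
    return schedule

jobs = {"hourly_etl": 400, "daily_report": 250, "click_join": 300}
print(plan(jobs, racks={"rack0": 500, "rack1": 500}, slots=2))
```

Each job ends up colocated with its input on one rack and separated from competing jobs in space or time, which is the intuition behind the locality and makespan gains reported above.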
Continuous Cloud-Scale Query Optimization and Processing
In VLDB, 2013
"... ABSTRACT Massive data analysis in cloud-scale data centers plays a crucial role in making critical business decisions. Highlevel scripting languages free developers from understanding various system trade-offs, but introduce new challenges for query optimization. One key optimization challenge is m ..."
Cited by 3 (0 self)
Abstract:
Massive data analysis in cloud-scale data centers plays a crucial role in making critical business decisions. High-level scripting languages free developers from understanding various system trade-offs, but introduce new challenges for query optimization. One key optimization challenge is missing accurate data statistics, typically due to massive data volumes and their distributed nature, complex computation logic, and frequent usage of user-defined functions. In this paper we propose novel techniques to adapt query processing in the Scope system, the cloud-scale computation environment in Microsoft Online Services. We continuously monitor query execution, collect actual runtime statistics, and adapt parallel execution plans as the query executes. We discuss similarities and differences between our approach and alternatives proposed in the context of traditional centralized systems. Experiments on large-scale Scope production clusters show that the proposed techniques systematically solve the challenge of missing or inaccurate data statistics, detect and resolve partition skew and plan-structure issues, and improve query latency several-fold for real workloads. Although we focus on optimizing high-level languages, the same ideas are also applicable to MapReduce systems.
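As an illustration of the continuous re-optimization loop the abstract describes (this is not Scope's code; `Stage`, the `replan` callback, and the 4x divergence threshold are all invented for the sketch): execute a stage, compare observed statistics against the optimizer's estimates, and re-plan the remaining stages when they diverge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Stage:
    name: str
    estimated_rows: int          # optimizer's cardinality estimate
    run: Callable[[], int]       # executes the stage, returns actual rows

def execute_adaptively(stages: List[Stage], replan, threshold: float = 4.0):
    """Run stages in order; re-plan the tail when estimates prove wrong."""
    observed: Dict[str, int] = {}
    remaining = list(stages)
    while remaining:
        stage = remaining.pop(0)
        actual = stage.run()
        observed[stage.name] = actual
        est = max(stage.estimated_rows, 1)
        # Re-optimize when the estimate is off by more than threshold-fold.
        if max(actual / est, est / max(actual, 1)) > threshold:
            remaining = replan(remaining, observed)
    return observed

# Placeholder policy: a real optimizer would pick new parallelism, operator
# order, etc.; here we just reverse the remaining stages to show the hook.
stages = [Stage("scan", 1000, lambda: 9000),
          Stage("join", 500, lambda: 480),
          Stage("agg", 100, lambda: 90)]
print(execute_adaptively(stages, lambda rest, obs: list(reversed(rest))))
```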
Error-bounded Sampling for Analytics on Big Sparse Data
"... Aggregation queries are at the core of business intelligence and data analytics. In the big data era, many scalable shared-nothing systems have been developed to process aggregation queries over massive amount of data. Microsoft’s SCOPE is ..."
Cited by 3 (0 self)
Abstract:
Aggregation queries are at the core of business intelligence and data analytics. In the big data era, many scalable shared-nothing systems have been developed to process aggregation queries over massive amounts of data. Microsoft's SCOPE is ...
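For context on what "error-bounded" means here, the sketch below shows a baseline uniform-sampling SUM estimator with a normal-approximation error bound; on sparse data (mostly zeros) the bound stays wide unless the sample is large, which is exactly the regime this paper targets. All sizes are illustrative.

```python
import math
import random

def estimate_sum(population, sample_size, z=1.96):
    """SUM estimate from a uniform sample, with a ~95% normal-approximation
    half-width (finite-population correction omitted for brevity)."""
    n = len(population)
    sample = random.sample(population, sample_size)
    mean = sum(sample) / sample_size
    var = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
    estimate = n * mean
    half_width = z * n * math.sqrt(var / sample_size)
    return estimate, half_width

# Sparse data: 99% zeros. The few non-zero values dominate the variance,
# so the error bound shrinks slowly as the sample grows.
data = [0] * 9900 + [random.randint(1, 100) for _ in range(100)]
est, err = estimate_sum(data, 1000)
print(f"SUM ~ {est:.0f} +/- {err:.0f} (true SUM = {sum(data)})")
```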
Advanced Join Strategies for Large-Scale Distributed Computation
"... Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets (e.g., search logs, click streams, and web graph data). For cost and performance reasons, processing is typically done on large clusters of thousands of commodity machines by using high level s ..."
Cited by 3 (0 self)
Abstract:
Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets (e.g., search logs, click streams, and web graph data). For cost and performance reasons, processing is typically done on large clusters of thousands of commodity machines using high-level scripting languages. In the recent past, there has been significant progress in adapting well-known techniques from traditional relational DBMSs to this new scenario. However, important challenges remain open. In this paper we study the very common join operation, discuss some unique challenges in the large-scale distributed scenario, and explain how to efficiently and robustly process joins in a distributed way. Specifically, we introduce novel execution strategies that leverage opportunities not available in centralized scenarios, and others that robustly handle data skew. We report experimental validations of our approaches on Scope production clusters, which power the Applications and Services Group at Microsoft.
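The abstract does not spell out its skew-handling strategies; the sketch below shows one generic skew-aware partitioning idea (spreading heavy-hitter keys across partitions and replicating the matching small-side rows), not the paper's specific algorithms. Thresholds and data are invented.

```python
from collections import Counter, defaultdict

def skew_aware_partition(big, small, n_parts=4, hot_threshold=1000):
    """Split hot keys of `big` across all partitions; replicate matching
    `small` rows, so no single partition receives one giant key."""
    freq = Counter(k for k, _ in big)            # sampled in a real system
    hot = {k for k, c in freq.items() if c >= hot_threshold}
    parts = defaultdict(lambda: ([], []))        # part -> (big, small)
    for i, (k, v) in enumerate(big):
        # Hot keys round-robin over all partitions; cold keys hash once.
        p = i % n_parts if k in hot else hash(k) % n_parts
        parts[p][0].append((k, v))
    for k, v in small:
        targets = range(n_parts) if k in hot else [hash(k) % n_parts]
        for p in targets:                        # replicate only if hot
            parts[p][1].append((k, v))
    return parts

def join_partition(big_rows, small_rows):
    index = defaultdict(list)
    for k, v in small_rows:
        index[k].append(v)
    return [(k, bv, sv) for k, bv in big_rows for sv in index[k]]

big = [("hot", i) for i in range(5000)] + [("cold", 1), ("warm", 2)]
small = [("hot", "H"), ("cold", "C"), ("warm", "W")]
parts = skew_aware_partition(big, small)
joined = [row for p in parts.values() for row in join_partition(*p)]
print(len(joined))   # 5002 rows, with "hot" spread over 4 partitions
```

Every partition can then join locally: cold keys land on one partition with both sides, while hot keys meet a replicated copy of the small side wherever their rows were scattered.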
Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
2015
"... Abstract Datacenter-scale computing for analytics workloads is increasingly common. High operational costs force heterogeneous applications to share cluster resources for achieving economy of scale. Scheduling such large and diverse workloads is inherently hard, and existing approaches tackle this ..."
Cited by 1 (0 self)
Abstract:
Datacenter-scale computing for analytics workloads is increasingly common. High operational costs force heterogeneous applications to share cluster resources to achieve economies of scale. Scheduling such large and diverse workloads is inherently hard, and existing approaches tackle this in two alternative ways: (1) centralized solutions offer strict, secure enforcement of scheduling invariants (e.g., fairness, capacity) for heterogeneous applications; (2) distributed solutions offer scalable, efficient scheduling for homogeneous applications. We argue that these solutions are complementary, and advocate a blended approach. Concretely, we propose Mercury, a hybrid resource management framework that supports the full spectrum of scheduling, from centralized to distributed. Mercury exposes a programmatic interface that allows applications to trade off between scheduling overhead and execution guarantees. Our framework harnesses this flexibility by opportunistically utilizing resources to improve task throughput. Experimental results on production-derived workloads show gains of over 35% in task throughput. These benefits can be translated by appropriate application and framework policies into job throughput or job latency improvements. We have implemented and contributed Mercury as an extension of Apache Hadoop/YARN.
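A hypothetical rendering of the trade-off Mercury's programmatic interface exposes; all names here (`Mode`, `ContainerRequest`, `request_containers`) are invented for illustration and are not Mercury's API.

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    GUARANTEED = "central"     # enforced fairness/capacity, slower to grant
    QUEUEABLE = "distributed"  # fast, opportunistic, may be preempted

@dataclass
class ContainerRequest:
    cpu: int
    mem_mb: int
    mode: Mode

def request_containers(job_latency_sensitive: bool, n: int):
    """Policy sketch: latency-sensitive jobs pay the central-scheduling
    cost for guarantees; throughput jobs take fast best-effort containers."""
    mode = Mode.GUARANTEED if job_latency_sensitive else Mode.QUEUEABLE
    return [ContainerRequest(cpu=1, mem_mb=2048, mode=mode)
            for _ in range(n)]

print(request_containers(job_latency_sensitive=False, n=3)[0])
```

Letting the application choose per container is what makes the framework "hybrid": both scheduling paths coexist in one cluster rather than being a global configuration choice.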
Nondeterminism in MapReduce Considered Harmful? An Empirical Study on Non-Commutative Aggregators in MapReduce Programs
In ICSE Companion, 2014
"... ABSTRACT The simplicity of MapReduce introduces unique subtleties that cause hard-to-detect bugs; in particular, the unfixed order of reduce function input is a source of nondeterminism that is harmful if the reduce function is not commutative and sensitive to input order. Our extensive study of pr ..."
Cited by 1 (1 self)
Abstract:
The simplicity of MapReduce introduces unique subtleties that cause hard-to-detect bugs; in particular, the unfixed order of reduce function input is a source of nondeterminism that is harmful if the reduce function is not commutative and sensitive to input order. Our extensive study of production MapReduce programs reveals interesting findings on commutativity, nondeterminism, and correctness. Although non-commutative reduce functions led to five bugs in our sample of well-tested production programs, we surprisingly found that many non-commutative reduce functions are mostly harmless due to, for example, implicit data properties. These findings are instrumental in advancing our understanding of MapReduce program correctness.
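A concrete instance of the bug class the paper studies: a reduce function whose result depends on the (unfixed) order of its input values. The example below is ours, not taken from the paper's sample.

```python
# A reducer that returns the *first* value it sees for a key. MapReduce
# does not fix the arrival order of values, so two runs over identical
# data can disagree.

def first_value_reducer(key, values):   # non-commutative: order-sensitive
    return key, next(iter(values))

def max_reducer(key, values):           # commutative: order-insensitive
    return key, max(values)

values_run1 = [3, 1, 2]     # one possible shuffle order
values_run2 = [1, 2, 3]     # another run, same data, different order
print(first_value_reducer("k", values_run1))   # ('k', 3)
print(first_value_reducer("k", values_run2))   # ('k', 1) -- nondeterministic
print(max_reducer("k", values_run1) == max_reducer("k", values_run2))  # True
```

Note how the non-commutative reducer can still be harmless in practice, e.g., if all values for a key happen to be equal, which matches the paper's "implicit data properties" observation.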
Impression Store: Compressive Sensing-based Storage for Big Data Analytics
"... Abstract For many big data analytics workloads, approximate results suffice. This begs the question, whether and how the underlying system architecture can take advantage of such relaxations, thereby lifting constraints inherent in today's architectures. This position paper explores one of the ..."
Abstract:
For many big data analytics workloads, approximate results suffice. This raises the question of whether and how the underlying system architecture can take advantage of such relaxations, thereby lifting constraints inherent in today's architectures. This position paper explores one possible direction. Impression Store is a distributed storage system with the abstraction of big data vectors. It aggregates updates internally and responds to the retrieval of top-K high-value entries. With proper extension, Impression Store supports various aggregations, top-K queries, and outlier and major-mode detection. While restricted in scope, such queries represent a substantial and important portion of many production workloads. In return, the system has unparalleled scalability; any node in the system can process any query, both reads and updates. The key technique we leverage is compressive sensing, which substantially reduces the amount of active memory state, I/O, and traffic volume needed to achieve such scalability.
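A toy illustration of the compressive-sensing idea, not Impression Store's implementation: a length-n vector with few large entries is maintained only as a short sketch y = Phi x, updates touch only the sketch, and top-K entries are recovered on demand. The sizes and the simple orthogonal-matching-pursuit recovery are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 120                         # full vector size vs. sketch size
Phi = rng.normal(size=(m, n)) / np.sqrt(m)   # random measurement matrix

y = np.zeros(m)                          # the only state we store

def update(index, delta):
    """Apply x[index] += delta using O(m) work on the sketch alone."""
    y[:] += delta * Phi[:, index]

def recover_topk(k):
    """Greedy OMP: repeatedly pick the column most correlated with the
    residual, then re-fit coefficients on the selected support."""
    support, residual = [], y.copy()
    for _ in range(k):
        support.append(int(np.argmax(np.abs(Phi.T @ residual))))
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    return dict(zip(support, coef))

for idx, val in [(7, 50.0), (42, -30.0), (7, 25.0), (900, 60.0)]:
    update(idx, val)
print(recover_topk(3))   # ~ {7: 75.0, 900: 60.0, 42: -30.0}
```

The scalability claim in the abstract follows from this shape of state: since the sketch is a linear function of the data, any node can merge sketches or apply updates without coordinating on the full vector.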
JetScope: Reliable and Interactive Analytics at Cloud Scale
"... ABSTRACT Interactive, reliable, and rich data analytics at cloud scale is a key capability to support low latency data exploration and experimentation over terabytes of data for a wide range of business scenarios. Besides the challenges in massive scalability and low latency distributed query proce ..."
Abstract:
Interactive, reliable, and rich data analytics at cloud scale is a key capability for supporting low-latency data exploration and experimentation over terabytes of data in a wide range of business scenarios. Besides the challenges of massive scalability and low-latency distributed query processing, it is imperative to achieve all these requirements with effective fault tolerance and efficient recovery, as failures and fluctuations are the norm in such a distributed environment. We present a cloud-scale interactive query processing system, called JetScope, developed at Microsoft. The system has a SQL-like declarative scripting language and delivers massive scalability and high performance through advanced optimizations. In order to achieve low latency, the system leverages various access methods, optimizes the delivery of first rows, and maximizes network and scheduling efficiency. The system also provides a fine-grained fault tolerance mechanism which is able to efficiently detect and mitigate failures without significantly impacting query latency and user experience. JetScope has been deployed on hundreds of servers in production at Microsoft, serving a few million queries every day.
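The abstract highlights optimizing the delivery of first rows; the generic sketch below (not JetScope's design) shows why pipelined execution cuts time-to-first-row relative to stage-at-a-time execution that materializes intermediate results.

```python
import time

def scan():
    for i in range(5):
        time.sleep(0.1)          # pretend each row costs 100 ms to produce
        yield i

def staged(rows_fn):
    rows = list(rows_fn())       # materialize fully: first row after ~0.5 s
    return iter(rows)

def pipelined(rows_fn):
    return rows_fn()             # stream rows: first row after ~0.1 s

for name, plan in [("staged", staged), ("pipelined", pipelined)]:
    t0 = time.time()
    first = next(plan(scan))
    print(f"{name}: first row {first} after {time.time() - t0:.2f}s")
```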
PeriSCOPE (Microsoft Bing and Peking University)
"... To minimize the amount of data-shuffling I/O that occurs between the pipeline stages of a distributed dataparallel program, its procedural code must be optimized with full awareness of the pipeline that it executes in. Unfortunately, neither pipeline optimizers nor traditional compilers examine both ..."
Abstract:
To minimize the amount of data-shuffling I/O that occurs between the pipeline stages of a distributed data-parallel program, its procedural code must be optimized with full awareness of the pipeline in which it executes. Unfortunately, neither pipeline optimizers nor traditional compilers examine both the pipeline and the procedural code of a data-parallel program, so programmers must either hand-optimize their program across pipeline stages or live with poor performance. To resolve this tension between performance and programmability, this paper describes PeriSCOPE, which automatically optimizes a data-parallel program's procedural code in the context of data flow that is reconstructed from the program's pipeline topology. Such optimizations eliminate unnecessary code and data, perform early data filtering, and calculate small derived values (e.g., predicates) earlier in the pipeline, so that less data, sometimes much less data, is transferred between pipeline stages. We describe how PeriSCOPE is implemented and evaluate its effectiveness on real production jobs.
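The kind of rewrite PeriSCOPE automates is shown here by hand on invented field names: filtering early and shipping only the fields a later stage uses, so far less data crosses the shuffle boundary.

```python
records = [{"url": f"u{i}", "bytes": i * 10, "status": 200 + i % 2}
           for i in range(6)]

# Unoptimized: the map stage ships whole records across the shuffle;
# the reduce side then filters and projects.
shipped_naive = [r for r in records]                 # 3 fields per record
kept = [(r["url"], r["bytes"])
        for r in shipped_naive if r["status"] == 200]

# Optimized (what the tool would derive): filter early and ship only the
# derived fields that downstream code actually uses.
shipped_opt = [(r["url"], r["bytes"])
               for r in records if r["status"] == 200]

assert kept == shipped_opt                           # same answer
print(f"naive ships {len(shipped_naive)} full records; "
      f"optimized ships {len(shipped_opt)} narrow tuples")
```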