Results 1 - 10
of
23
CIEL: a universal execution engine for distributed data-flow computing
- in Proceedings of the 8th USENIX Symposium on Networked System Design and Implementation (NSDI). USENIX
"... This paper introduces CIEL, a universal execution engine for distributed data-flow programs. Like previous execution engines, CIEL masks the complexity of distributed programming. Unlike those systems, a CIEL job can make data-dependent control-flow decisions, which enables it to compute iterative a ..."
Abstract
-
Cited by 19 (6 self)
- Add to MetaCart
This paper introduces CIEL, a universal execution engine for distributed data-flow programs. Like previous execution engines, CIEL masks the complexity of distributed programming. Unlike those systems, a CIEL job can make data-dependent control-flow decisions, which enables it to compute iterative and recursive algorithms. We have also developed Skywriting, a Turingcomplete scripting language that runs directly on CIEL. The execution engine provides transparent fault tolerance and distribution to Skywriting scripts and highperformance code written in other programming languages. We have deployed CIEL on a cloud computing platform, and demonstrate that it achieves scalable performance for both iterative and non-iterative algorithms. 1
Scripting the cloud with Skywriting
- 2ND USENIX WORKSHOP ON HOT TOPICS IN CLOUD COMPUTING (HOTCLOUD)
, 2010
"... Recent distributed computing frameworks—such as MapReduce, Hadoop and Dryad—have made it simple to exploit multiple machines in a compute cloud. However, these frameworks use coordination languages that are insufficiently expressive for many classes of computation, including iterative and recursive ..."
Abstract
-
Cited by 8 (3 self)
- Add to MetaCart
Recent distributed computing frameworks—such as MapReduce, Hadoop and Dryad—have made it simple to exploit multiple machines in a compute cloud. However, these frameworks use coordination languages that are insufficiently expressive for many classes of computation, including iterative and recursive algorithms. To address this problem, and generalise previous approaches, we introduce Skywriting: a Turing-powerful, purely-functional script language for describing distributed computations. In this paper, we introduce the main features of Skywriting, and outline our novel cooperative task farming execution engine.
RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems
"... MapReduce-based data warehouse systems are playing important roles of supporting big data analytics to understand quickly the dynamics of user behavior trends and their needs in typical Web service providers and social network sites (e.g., Facebook). In such a system, the data placement structure is ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
MapReduce-based data warehouse systems are playing important roles of supporting big data analytics to understand quickly the dynamics of user behavior trends and their needs in typical Web service providers and social network sites (e.g., Facebook). In such a system, the data placement structure is a critical factor that can affect the warehouse performance in a fundamental way. Based on our observations and analysis of Facebook production systems, we have characterized four requirements for the data placement structure: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns. We have examined three commonly accepted data placement structures in conventional databases, namely rowstores, column-stores, and hybrid-stores in the context of large data analysis using MapReduce. We show that they are not very suitable for big data processing in distributed systems. In this paper, we present a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system. With intensive experiments, we show the effectiveness of RCFile in satisfying the four requirements. RCFile has been chosen in Facebook data warehouse system as the default option. It has also been adopted by Hive and Pig, the two most widely used data analysis systems developed in Facebook and Yahoo! I.
HadoopDB in Action: Building Real World Applications
"... HadoopDB is a hybrid of MapReduce and DBMS technologies, designed to meet the growing demand of analyzing massive datasets on very large clusters of machines. Our previous work has shown that HadoopDB approaches parallel databases in performance and still yields the scalability and fault tolerance o ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
HadoopDB is a hybrid of MapReduce and DBMS technologies, designed to meet the growing demand of analyzing massive datasets on very large clusters of machines. Our previous work has shown that HadoopDB approaches parallel databases in performance and still yields the scalability and fault tolerance of MapReduce-based systems. In this demonstration, we focus on HadoopDB’s flexible architecture and versatility with two real world application scenarios: a semantic web data application for protein sequence analysis and a business data warehousing application based on TPC-H. The demonstration offers a thorough walk-through of how to easily build applications on top of HadoopDB.
An Optimization Framework for Map-Reduce Queries
"... We present an effective optimization framework for general SQLlike map-reduce queries, which is based on a novel query algebra and uses a small number of higher-order physical operators that are directly implementable on existing map-reduce systems, such as Hadoop. Although our framework is applicab ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
We present an effective optimization framework for general SQLlike map-reduce queries, which is based on a novel query algebra and uses a small number of higher-order physical operators that are directly implementable on existing map-reduce systems, such as Hadoop. Although our framework is applicable to any SQL-like map-reduce query language, we focus on a powerful query language, called MRQL. Current map-reduce query languages, such as HiveQL and PigLatin, enable users to plug-in custom map-reduce scripts into queries for those jobs that cannot be declaratively coded in the query language, which may result to suboptimal, error-prone, and hard-to-maintain code. In contrast to these languages, MRQL is expressive enough to capture most of these computations in declarative form and at the same time is amenable to optimization. We describe an optimization framework that maps the algebraic forms derived from the MRQL queries to efficient workflows of mapreduce operations that consist of our physical plan operators. We also describe many algebraic optimizations, such as fusing cascading map-reduce jobs into one job and synthesizing a combine function from the reduce function of a map-reduce job. Finally, we report on a prototype system implementation and we show some performance results of evaluating MRQL queries on a small cluster of computers. 1.
Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics
"... MapReduce, especially the Hadoop open-source implementation, has recently emerged as a popular framework for large-scale data analytics. Given the explosion of unstructured data begotten by social media and other web-based applications, we take the position that any modern analytics platform must su ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
MapReduce, especially the Hadoop open-source implementation, has recently emerged as a popular framework for large-scale data analytics. Given the explosion of unstructured data begotten by social media and other web-based applications, we take the position that any modern analytics platform must support operations on free-text fields as first-class citizens. Toward this end, this paper addresses one inefficient aspect of Hadoop-based processing: the need to perform a full scan of the entire dataset, even in cases where it is clearly not necessary to do so. We show that it is possible to leverage a full-text index to optimize selection operations on text fields within records. The idea is simple and intuitive: the full-text index informs the Hadoop execution engine which compressed data blocks contain query terms of interest, and only those data blocks are decompressed and scanned. Experiments with a proof of concept show moderate improvements in end-to-end query running times and substantial savings in terms of cumulative processing time at the worker nodes. We present an analytical model and discuss a number of interesting challenges: some operational, others research in nature.
Author(s) Publication History
, 2003
"... Any software made available via TIMECENTER is provided “as is ” and without any express or implied warranties, including, without limitation, the implied warranty of merchantability and fitness for a particular purpose. The TIMECENTER icon on the cover combines two “arrows. ” These “arrows ” are let ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Any software made available via TIMECENTER is provided “as is ” and without any express or implied warranties, including, without limitation, the implied warranty of merchantability and fitness for a particular purpose. The TIMECENTER icon on the cover combines two “arrows. ” These “arrows ” are letters in the so-called Rune alphabet used one millennium ago by the Vikings, as well as by their precedessors and successors. The Rune alphabet (second phase) has 16 letters, all of which have angular shapes and lack horizontal lines because the primary storage medium was wood. Runes may also be found on jewelry, tools, and weapons and were perceived by many as having magic, hidden powers.
A Latency and Fault-Tolerance Optimizer for Online Parallel Query Plans
"... We address the problem of making online, parallel query plans fault-tolerant: i.e., provide intra-query fault-tolerance without blocking. We develop an approach that not only achieves this goal but does so through the use of different fault-tolerance techniques at different operators within a query ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
We address the problem of making online, parallel query plans fault-tolerant: i.e., provide intra-query fault-tolerance without blocking. We develop an approach that not only achieves this goal but does so through the use of different fault-tolerance techniques at different operators within a query plan. Enabling each operator to use a different faulttolerance strategy leads to a space of fault-tolerance plans amenable to cost-based optimization. We develop FTOpt, a cost-based fault-tolerance optimizer that automatically selects the best strategy for each operator in a query plan in a manner that minimizes the expected processing time with failures for the entire query. We implement our approach in a prototype parallel query-processing engine. Our experiments demonstrate that (1) there is no single best fault-tolerance strategy for all query plans, (2) often hybrid strategies that mix-and-match recovery techniques outperform any uniform strategy, and (3) our optimizer correctly identifies winning fault-tolerance configurations. Categories and Subject Descriptors C.4 [Performance of Systems]: Fault tolerance, modeling techniques; H.2.4 [Database Management]: Systems—
CH-1015 Lausanne
"... A major component of many cloud services is query processing on data stored in the underlying cloud cluster. The traditional techniques for query processing on a cluster are those offered by parallel DBMS. These techniques, however, cannot guarantee high performance for cloud; parallel DBMS lack ade ..."
Abstract
- Add to MetaCart
A major component of many cloud services is query processing on data stored in the underlying cloud cluster. The traditional techniques for query processing on a cluster are those offered by parallel DBMS. These techniques, however, cannot guarantee high performance for cloud; parallel DBMS lack adequate fault tolerance mechanisms in order to deal with non-negligible software and hardware failures. MapReduce, on the other hand, allows query processing solutions that are fault tolerant, but imposes substantial overheads. Therefore, existing technology provides only for edge cases of query processing: (i) at one end, parallel DBMS are appropriate only for very short-running queries, as the likelihood of failure is low during the execution, (ii) and, at the other, MapReduce is appropriate for very long-running queries, as the overhead is relatively small and failures may be frequent during execution. In this paper, we propose an adaptive software architecture, which can effortlessly switch between MapReduce and parallel DBMS in order to efficiently process queries regardless of their response times. Switching between the two architectures is performed in a transparent manner based on a straightforward cost model, which computes the expected execution time in presence of failures. The experimental results show that the adaptive architecture achieves the lowest possible query execution time for various scenarios.
XML Query Optimization in Map-Reduce
"... We present a novel query language for large-scale analysis of XML data on a map-reduce environment, called MRQL, that is expressive enough to capture most common data analysis tasks and at the same time is amenable to optimization. Our evaluation plans are constructed using a small number of higher- ..."
Abstract
- Add to MetaCart
We present a novel query language for large-scale analysis of XML data on a map-reduce environment, called MRQL, that is expressive enough to capture most common data analysis tasks and at the same time is amenable to optimization. Our evaluation plans are constructed using a small number of higher-order physical operators that are directly implementable on existing map-reduce systems, such as Hadoop. We report on a prototype system implementation and we show some preliminary results on evaluating MRQL queries on a small cluster of PCs running Hadoop. 1.

