Starfish: A Self-tuning System for Big Data Analytics
- In CIDR, 2011
Cited by 11 (3 self)
Timely and cost-effective analytics over “Big Data” is now a key ingredient for success in many businesses, scientific and engineering disciplines, and government endeavors. The Hadoop software stack, which consists of an extensible MapReduce execution engine, pluggable distributed storage engines, and a range of procedural to declarative interfaces, is a popular choice for big data analytics. Most practitioners of big data analytics, such as computational scientists, systems researchers, and business analysts, lack the expertise to tune the system to get good performance. Unfortunately, Hadoop’s performance out of the box leaves much to be desired, leading to suboptimal use of resources, time, and money (in pay-as-you-go clouds). We introduce Starfish, a self-tuning system for big data analytics. Starfish builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in Hadoop. While Starfish’s system architecture is guided by work on self-tuning database systems, we discuss how new analysis practices over big data pose new challenges, leading us to different design choices in Starfish.
Towards Optimal Resource Provisioning for Running MapReduce Programs in Public Clouds
Cited by 5 (1 self)
Running MapReduce programs in the public cloud introduces an important problem: how to optimize resource provisioning to minimize the financial charge for a specific job? In this paper, we study the whole process of MapReduce processing and build a cost function that explicitly models the relationship between the amount of input data, the available system resources (Map and Reduce slots), and the complexity of the Reduce function for the target MapReduce job. The model parameters can be learned from test runs with a small number of nodes. Based on this cost model, we can solve a number of decision problems, such as finding the amount of resources that minimizes the financial cost under a time deadline or minimizes the running time under a given financial budget. Experimental results show that this cost model performs well on the tested MapReduce programs.
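The decision problem named in this abstract (cheapest slot allocation that still meets a deadline) can be illustrated with a small sketch. The abstract does not give the paper's cost function, so the `ToyModel` below is a hypothetical stand-in with made-up coefficients `a`, `b`, `c`; the flat per-slot-second price and the exhaustive search are also assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Hypothetical runtime model; NOT the cost function from the paper."""
    a: float  # map-side seconds per GB on a single map slot (assumed)
    b: float  # reduce-side seconds per GB on a single reduce slot (assumed)
    c: float  # fixed job overhead in seconds (assumed)

    def runtime(self, gb, map_slots, reduce_slots):
        # Work is assumed to divide evenly across the slots of each phase.
        return self.a * gb / map_slots + self.b * gb / reduce_slots + self.c


def cheapest_plan(model, gb, deadline_s, price_per_slot_s, max_slots):
    """Exhaustively search slot counts; return (cost, maps, reduces, seconds).

    Returns None when no allocation within max_slots meets the deadline.
    """
    best = None
    for m in range(1, max_slots + 1):
        for r in range(1, max_slots + 1):
            t = model.runtime(gb, m, r)
            if t > deadline_s:
                continue  # violates the deadline, so not a candidate
            cost = price_per_slot_s * (m + r) * t  # pay for every slot for the whole run
            if best is None or cost < best[0]:
                best = (cost, m, r, t)
    return best


if __name__ == "__main__":
    model = ToyModel(a=100.0, b=50.0, c=10.0)
    print(cheapest_plan(model, gb=10, deadline_s=1000, price_per_slot_s=1.0, max_slots=2))
```

In practice the coefficients would be fitted from small test runs, as the abstract describes; the dual problem (fastest plan under a budget) only swaps the constraint and the objective in the loop.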
Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs
Cited by 1 (0 self)
MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature that has been key to the historical success of database systems, namely, cost-based optimization. A major challenge here is that, to the MapReduce system, a program consists of black-box map and reduce functions written in some programming language like C++, Java, Python, or Ruby. We introduce, to our knowledge, the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs. We focus on the optimization opportunities presented by the large space of configuration parameters for these programs. We also introduce a Profiler to collect detailed statistical information from unmodified MapReduce programs, and a What-if Engine for fine-grained cost estimation. All components have been prototyped for the popular Hadoop MapReduce system. The effectiveness of each component is demonstrated through a comprehensive evaluation using representative MapReduce programs from various application domains.
Real-time MapReduce Scheduling
In this paper, we explore the feasibility of enabling the scheduling of mixed hard and soft real-time MapReduce applications. We first present an experimental evaluation of the popular Hadoop MapReduce middleware on the Amazon EC2 cloud. Our evaluation reveals tradeoffs between overall system throughput and execution time predictability, and highlights a number of factors affecting real-time scheduling, such as data placement, concurrent users, and master scheduling overhead. Based on our evaluation study, we present a formal model for capturing real-time MapReduce applications and the Hadoop platform. Using this model, we formulate the offline scheduling of real-time MapReduce jobs on a heterogeneous distributed Hadoop architecture as a constraint satisfaction problem (CSP) and introduce various search strategies for the formulation. We propose an enhancement of MapReduce’s execution model and a range of heuristic techniques for online scheduling. We further outline some of our future directions that apply state-of-the-art techniques from the real-time scheduling literature.
Towards Scalable One-Pass Analytics Using MapReduce
An integral part of many data-intensive applications is the need to collect and analyze enormous datasets efficiently. Concurrent with such application needs is the increasing adoption of MapReduce as a programming model for processing large datasets using a cluster of machines. Current MapReduce systems, however, require the data set to be loaded into the cluster before running analytical queries, and thereby incur high delays to start query processing. Furthermore, existing systems are geared towards batch processing. In this paper, we seek to answer a fundamental question: what architectural changes are necessary to bring the benefits of the MapReduce computation model to incremental, one-pass analytics, i.e., to support stream processing and online aggregation? To answer this question, we first conduct a detailed empirical performance study of current MapReduce implementations including Hadoop and MapReduce Online using a variety of workloads. By doing so, we identify several drawbacks of existing systems for one-pass analytics. Based on the insights from our study, we list key design requirements for incremental one-pass analytics and argue for architectural changes of MapReduce systems to overcome their current limitations. We conclude by sketching an initial design of our new MapReduce-based platform for incremental one-pass analytics and showing promising preliminary results. Keywords: MapReduce; performance analysis; data streams; parallel data processing
Column-Oriented Storage Techniques for MapReduce
Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated in Hadoop in a way that preserves its popular programming APIs. We show that simply using binary storage formats in Hadoop can provide a 3x performance boost over the naive use of text files. We then introduce a column-oriented storage format that is compatible with the replication and scheduling constraints of Hadoop and show that it can speed up MapReduce jobs on real workloads by an order of magnitude. We also show that dealing with complex column types such as arrays, maps, and nested records, which are common in MapReduce jobs, can incur significant CPU overhead. Finally, we introduce a novel skip list column format and lazy record construction strategy that avoids deserializing unwanted records to provide an additional 1.5x performance boost. Experiments on a real intranet crawl are used to show that our column-oriented storage techniques can improve the performance of the map phase in Hadoop by as much as two orders of magnitude.
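The idea of lazy record construction over a column layout can be shown with a toy in-memory sketch. Nothing here reproduces the paper's file format or its skip list structure: `ColumnStore`, `LazyRecord`, and `scan` are hypothetical names, and JSON strings merely stand in for serialized column values so that decoding work can be counted.

```python
import json


class ColumnStore:
    """Toy column layout: each field's values are serialized in their own list."""

    def __init__(self, records, fields):
        self.fields = list(fields)
        self.columns = {f: [json.dumps(r[f]) for r in records] for f in self.fields}
        self.num_rows = len(records)
        self.decoded = 0  # number of values deserialized so far

    def get(self, row, field):
        self.decoded += 1
        return json.loads(self.columns[field][row])


class LazyRecord:
    """Record view that decodes a field only when it is first accessed."""

    def __init__(self, store, row):
        self._store = store
        self._row = row
        self._cache = {}

    def __getitem__(self, field):
        if field not in self._cache:
            self._cache[field] = self._store.get(self._row, field)
        return self._cache[field]


def scan(store):
    """Yield one lazy record per row, as a map function might consume them."""
    for row in range(store.num_rows):
        yield LazyRecord(store, row)


if __name__ == "__main__":
    pages = [
        {"url": "a", "size": 10, "tags": ["x"]},
        {"url": "b", "size": 200, "tags": []},
        {"url": "c", "size": 5, "tags": ["y", "z"]},
    ]
    store = ColumnStore(pages, ["url", "size", "tags"])
    big = [rec["url"] for rec in scan(store) if rec["size"] > 8]
    print(big, store.decoded)
```

In this sketch the filter touches every `size` value but decodes `url` only for matching rows and never decodes `tags`, which is the kind of saving the abstract attributes to avoiding deserialization of unwanted data.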
U N I V E R S I T
Hadoop is a well-known open-source implementation of the MapReduce paradigm, which is founded on the map and reduce algorithmic skeletons. In this thesis we study whether other types of skeletons can be added to Hadoop in order to increase its usability and performance. We investigate how to modify Hadoop so that it accepts other types of jobs and tasks while being as non-intrusive as possible. We pick two data-parallel skeletons, filter and zip, implement one of them (filter), and document why the other (zip) is not a good fit for Hadoop. A further two skeletons (sequential and sort) are then chosen and added to the framework. We develop a library that runs on top of Hadoop and lets the user specify complex chains of jobs in a more natural manner by decoupling the coordination and computation aspects. The skeletons and the library are evaluated in terms of usability and performance and compared with what can be done using the MapReduce implementation of Hadoop. Acknowledgements: Many thanks to my supervisor, Dr. Stratis Viglas, for the prompt feedback and advice.
I V E R
Parallel database management systems are the dominant technology used for large-scale data analysis. The maturity of the query evaluation techniques used by database management systems, combined with the processing power offered by parallelism, is among the reasons for the wide use of the technology. On the other hand, MapReduce is a new technology that is quickly spreading and becoming a commonly used tool for processing large volumes of data. Fault tolerance, parallelism, and scalability are only some of the characteristics that the framework can provide to any system based on it. The basic idea behind this work is to modify the query evaluation techniques used by parallel database management systems in order to use the Hadoop MapReduce framework as the underlying execution engine. For the purposes of this work we have focused on join evaluation. We have designed and implemented three algorithms which modify the data-flow of the MapReduce framework in order to simulate the data-flow that parallel database management systems use to execute query evaluation. More specifically, we have implemented three algorithms that execute parallel hash join: Simple Hash Join is the implementation
Building Wavelet Histograms on Large Data in MapReduce
MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compact, accurate summary of data is essential. Among various data summarization tools, histograms have proven to be particularly important and useful for summarizing data, and the wavelet histogram is one of the most widely used histograms. In this paper, we investigate the problem of building wavelet histograms efficiently on large datasets in MapReduce. We measure the efficiency of the algorithms by both end-to-end running time and communication cost. We demonstrate that straightforward adaptations of existing exact and approximate methods for building wavelet histograms to MapReduce clusters are highly inefficient. To that end, we design new algorithms for computing exact and approximate wavelet histograms and discuss their implementation in MapReduce. We illustrate our techniques in Hadoop, and compare them to baseline solutions with extensive experiments performed in a heterogeneous Hadoop cluster of 16 nodes, using large real and synthetic datasets of up to hundreds of gigabytes. The results suggest significant (often orders of magnitude) performance improvement achieved by our new algorithms.
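For readers unfamiliar with the summary being built, the following is a minimal single-machine sketch of a wavelet histogram: an orthonormal Haar transform of a frequency vector, followed by keeping only the k largest-magnitude coefficients. This is the standard centralized construction, not the distributed algorithms proposed in the paper; the function names and the power-of-two length restriction are assumptions made for the example.

```python
import math


def haar_transform(values):
    """Orthonormal Haar transform of a list whose length is a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    coeffs = [float(v) for v in values]
    s = math.sqrt(2.0)
    length = n
    while length > 1:
        half = length // 2
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) / s for i in range(half)]
        diff = [(coeffs[2 * i] - coeffs[2 * i + 1]) / s for i in range(half)]
        coeffs[:length] = avg + diff  # averages first, then detail coefficients
        length = half
    return coeffs


def inverse_haar(coeffs):
    """Invert haar_transform, rebuilding the (approximate) frequency vector."""
    out = list(coeffs)
    s = math.sqrt(2.0)
    length = 1
    while length < len(out):
        merged = []
        for a, d in zip(out[:length], out[length:2 * length]):
            merged.append((a + d) / s)
            merged.append((a - d) / s)
        out[:2 * length] = merged
        length *= 2
    return out


def top_k(coeffs, k):
    """Zero out all but the k coefficients of largest magnitude."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


if __name__ == "__main__":
    freq = [4, 4, 2, 2, 0, 0, 8, 8]  # made-up frequency vector
    summary = top_k(haar_transform(freq), 3)
    print([round(x, 3) for x in inverse_haar(summary)])
```

Because the transform is orthonormal, retaining the largest-magnitude coefficients minimizes the squared reconstruction error for a given k, which is why a coefficient's magnitude is the usual selection criterion.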

