Data-Intensive Text Processing with MapReduce (2010)

by Jimmy Lin, Chris Dyer

Towards Optimal Resource Provisioning for Running MapReduce Programs in Public Clouds

by Fengguang Tian, Keke Chen
Cited by 5 (1 self)
Running MapReduce programs in the public cloud introduces an important problem: how to optimize resource provisioning to minimize the financial charge for a specific job? In this paper, we study the whole process of MapReduce processing and build up a cost function that explicitly models the relationship between the amount of input data, the available system resources (Map and Reduce slots), and the complexity of the Reduce function for the target MapReduce job. The model parameters can be learned from test runs with a small number of nodes. Based on this cost model, we can solve a number of decision problems, such as finding the amount of resources that minimizes the financial cost under a time deadline, or that minimizes the running time under a given financial budget. Experimental results show that this cost model performs well on tested MapReduce programs.

Design patterns for efficient graph algorithms in mapreduce

by Jimmy Lin, Michael Schatz - In MLG ’10: Proceedings of the Eighth Workshop on Mining and Learning with Graphs, 2010
Cited by 5 (2 self)
Graphs are analyzed in many important contexts, including ranking search results based on the hyperlink structure of the world wide web, module detection of protein-protein interaction networks, and privacy analysis of social networks. Many graphs of interest are difficult to analyze because of their large size, often spanning millions of vertices and billions of edges. As such, researchers have increasingly turned to distributed solutions. In particular, MapReduce has emerged as an enabling technology for large-scale graph processing. However, existing best practices for MapReduce graph algorithms have significant shortcomings that limit performance, especially with respect to partitioning, serializing, and distributing the graph. In this paper, we present three design patterns that address these issues and can be used to accelerate a large class of graph algorithms based on message passing, exemplified by PageRank. Experiments show that the application of our design patterns reduces the running time of PageRank on a web graph with 1.4 billion edges by 69%.
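To make the "message passing" formulation of PageRank concrete, the following is a minimal in-memory sketch of the plain map/reduce iteration. It is not the paper's three design patterns, only the baseline style of algorithm they are meant to accelerate. The function names, the toy graph, and the damping value of 0.85 are illustrative assumptions, and the sketch assumes every vertex has at least one out-link (dangling vertices are not handled).

```python
from collections import defaultdict


def map_phase(graph, ranks):
    """Map step: each vertex sends an equal share of its rank to every out-neighbor."""
    for node, neighbors in graph.items():
        for nbr in neighbors:
            yield nbr, ranks[node] / len(neighbors)


def reduce_phase(messages, nodes, damping=0.85):
    """Reduce step: sum the messages per vertex, then apply the damping factor."""
    incoming = defaultdict(float)
    for node, mass in messages:
        incoming[node] += mass
    n = len(nodes)
    return {node: (1 - damping) / n + damping * incoming[node] for node in nodes}


def pagerank(graph, iterations=20, damping=0.85):
    """Run a fixed number of map/reduce rounds, starting from a uniform rank vector."""
    nodes = list(graph)
    ranks = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        ranks = reduce_phase(map_phase(graph, ranks), nodes, damping)
    return ranks


# Hypothetical toy graph: adjacency lists keyed by vertex name.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

In a real MapReduce job the graph structure itself also has to be passed from mappers to reducers on every iteration, which is one source of the overhead in partitioning, serializing, and distributing the graph that the abstract mentions.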

No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics

by Herodotos Herodotou, Fei Dong, Shivnath Babu
Cited by 2 (1 self)
Infrastructure-as-a-Service (IaaS) cloud platforms have brought two unprecedented changes to cluster provisioning practices. First, any (nonexpert) user can provision a cluster of any size on the cloud within minutes to run her data-processing jobs. The user can terminate the cluster once her jobs complete, and she needs to pay only for the resources used and duration of use. Second, cloud platforms enable users to bypass the traditional middleman—the system administrator—in the cluster-provisioning process. These changes give tremendous power to the user, but place a major burden on her shoulders. The user is now faced regularly with complex cluster sizing problems that involve finding the cluster size, the type of resources to use in the cluster from the large number of choices offered by current IaaS cloud platforms, and the job configurations that best meet the performance needs of her workload. In this paper, we introduce the Elastisizer, a system to which users can express cluster sizing problems as queries in a declarative fashion. The Elastisizer provides reliable answers to these queries using an automated technique that uses a mix of job profiling, estimation using black-box and white-box models, and simulation. We have prototyped the Elastisizer for the Hadoop MapReduce framework, and present a comprehensive evaluation that shows the benefits of the Elastisizer in common scenarios where cluster sizing problems arise.

An Optimization Framework for Map-Reduce Queries

by Leonidas Fegaras, Chengkai Li, Upa Gupta
Cited by 1 (1 self)
We present an effective optimization framework for general SQL-like map-reduce queries, which is based on a novel query algebra and uses a small number of higher-order physical operators that are directly implementable on existing map-reduce systems, such as Hadoop. Although our framework is applicable to any SQL-like map-reduce query language, we focus on a powerful query language, called MRQL. Current map-reduce query languages, such as HiveQL and PigLatin, enable users to plug custom map-reduce scripts into queries for those jobs that cannot be declaratively coded in the query language, which may result in suboptimal, error-prone, and hard-to-maintain code. In contrast to these languages, MRQL is expressive enough to capture most of these computations in declarative form and at the same time is amenable to optimization. We describe an optimization framework that maps the algebraic forms derived from the MRQL queries to efficient workflows of map-reduce operations that consist of our physical plan operators. We also describe many algebraic optimizations, such as fusing cascading map-reduce jobs into one job and synthesizing a combine function from the reduce function of a map-reduce job. Finally, we report on a prototype system implementation and we show some performance results of evaluating MRQL queries on a small cluster of computers.
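The idea of deriving a combine function from a reduce function can be illustrated with word counting. The sketch below is an assumption-laden toy, not MRQL's actual synthesis procedure, which the abstract does not describe. It simply reuses the reducer as a per-split combiner, which is valid here only because summation is associative and commutative. All names (`map_words`, `reduce_counts`, `run_job`) are hypothetical.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit (word, 1) for every word on the line."""
    for word in line.split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce step: sum partial counts. Also safe to use as a combiner for sums."""
    return sum(values)


def run_job(splits, use_combiner):
    """Simulate one job and count how many records cross the shuffle boundary."""
    shuffled = defaultdict(list)
    records_shuffled = 0
    for split in splits:
        local = defaultdict(list)
        for line in split:
            for key, value in map_words(line):
                local[key].append(value)
        for key, values in local.items():
            # With a combiner, each split sends one pre-aggregated record per key.
            out = [reduce_counts(key, values)] if use_combiner else values
            shuffled[key].extend(out)
            records_shuffled += len(out)
    result = {key: reduce_counts(key, values) for key, values in shuffled.items()}
    return result, records_shuffled


# Two hypothetical input splits, each a list of lines.
splits = [["a b a", "b a"], ["a c", "c c a"]]
plain, plain_records = run_job(splits, use_combiner=False)
combined, combined_records = run_job(splits, use_combiner=True)
```

Both runs produce the same counts, but the combined run shuffles one record per key per split rather than one record per word occurrence. A reducer that is not associative and commutative (for example, one computing a median) could not be reused this way.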

CloudVista: Visual Cluster Exploration for Extreme Scale Data in the Cloud

by Keke Chen, Huiqi Xu, Fengguang Tian, Shumin Guo
Cited by 1 (1 self)
The problem of efficient and high-quality clustering of extreme scale datasets with complex clustering structures continues to be one of the most challenging data analysis problems. An innovative use of the data cloud would provide a unique opportunity to address this challenge. In this paper, we propose the CloudVista framework to address (1) the problems caused by using sampling in the existing approaches and (2) the problems with the latency caused by cloud-side processing on interactive cluster visualization. The CloudVista framework aims to explore the entire large dataset stored in the cloud with the help of the visual frame data structure and the previously developed VISTA visualization model. The latency of processing large data is addressed by the RandGen algorithm, which generates a series of related visual frames in the cloud without user intervention, and by a hierarchical exploration model supported by cloud-side subset processing. Experimental study shows this framework is effective and efficient for visually exploring clustering structures for extreme scale datasets stored in the cloud.

MapReduce with Deltas

by R. Lämmel, D. Saile
The MapReduce programming model is extended conservatively to deal with deltas for input data such that recurrent MapReduce computations can be more efficient for the case of input data that changes only slightly over time. That is, the extended model enables more frequent re-execution of MapReduce computations and thereby more up-to-date results in practical applications. Deltas can also be pushed through pipelines of MapReduce computations. The achievable speedup is analyzed and found to be highly predictable. The approach has been implemented in Hadoop, and a code distribution is available online. The correctness of the extended programming model relies on a simple algebraic argument.
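One way to picture recomputation from deltas is word counting, where the result for the changed input can be obtained by merging the old result with the counts computed over only the added and removed lines. The sketch below is a hedged illustration under that assumption. The abstract does not specify the paper's delta representation or its algebraic argument, so the `apply_delta` helper and the restriction to count-like reductions (which can be both added and subtracted) are assumptions of this example.

```python
from collections import Counter


def word_count(lines):
    """Stand-in for a full MapReduce word-count job over the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def apply_delta(old_result, added_lines, removed_lines=()):
    """Update an old result using only the delta, without rereading the full input."""
    new = Counter(old_result)
    new.update(word_count(added_lines))       # add counts from inserted lines
    new.subtract(word_count(removed_lines))   # retract counts from deleted lines
    return Counter({key: value for key, value in new.items() if value != 0})


# Hypothetical input and a small change to it.
base = ["a b", "b c"]
old = word_count(base)
updated = apply_delta(old, added_lines=["c d"], removed_lines=["a b"])
```

Here `updated` equals the result of recomputing from scratch over the changed input `["b c", "c d"]`, while the work done is proportional to the size of the delta rather than the size of the whole input.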

XML Query Optimization in Map-Reduce

by Leonidas Fegaras, Chengkai Li, Upa Gupta, Jijo J. Philip
We present a novel query language for large-scale analysis of XML data in a map-reduce environment, called MRQL, that is expressive enough to capture most common data analysis tasks and at the same time is amenable to optimization. Our evaluation plans are constructed using a small number of higher-order physical operators that are directly implementable on existing map-reduce systems, such as Hadoop. We report on a prototype system implementation and we show some preliminary results on evaluating MRQL queries on a small cluster of PCs running Hadoop.

Parsing in Parallel on Multiple Cores and GPUs

by Mark Johnson
This paper examines the ways in which parallelism can be used to speed the parsing of dense PCFGs. We focus on two kinds of parallelism here: Symmetric Multi-Processing (SMP) parallelism on shared-memory multicore

MapReduce Programming and Cost-based Optimization? Crossing this Chasm with Starfish

by Herodotos Herodotou, Fei Dong, Shivnath Babu
MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature that has been key to the historical success of database systems, namely, cost-based optimization. A major challenge here is that, to the MapReduce system, a program consists of black-box map and reduce functions written in some programming language like C++, Java, Python, or Ruby. Starfish is a self-tuning system for big data analytics that includes, to our knowledge, the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs. Starfish also includes a Profiler to collect detailed statistical information from unmodified MapReduce programs, and a What-if Engine for fine-grained cost estimation. This demonstration will present the profiling, what-if analysis, and cost-based optimization of MapReduce programs in Starfish. We will show how (nonexpert) users can employ the Starfish Visualizer to (a) get a deep understanding of a MapReduce program’s behavior during execution, (b) ask hypothetical questions on how the program’s behavior will change when parameter settings, cluster resources, or input data properties change, and (c) ultimately optimize the program.

U N I V E R S

by Calum Robert, William Clark
This project considers a number of methods for instance/example selection in training data for language models; the most promising were experimented with and evaluated via hypothesis testing. The most successful, an expansion of the perplexity-based work of Roger Moore, was selected for further development due to its good test results and ability to locate related sentences. A number of possible filter methods were produced for improving the performance and results of that method. Each of these filters was tested, returning a decrease in data size of between 2.6% and 75%. The best performing of these filters, with a decrease in data of 57%, was then selected and, after some fine tuning, a combination of it and the original method was tested to gauge its full abilities. The results show that the combination of methods formed a scalable solution to the problem, producing datasets with on average 48% lower perplexity than a baseline approach. The additional optimization features were shown to reduce the running time by between 50% and 60%.

Acknowledgements: Many thanks to my supervisor Miles Osbourne for his advice and guidance and to my colleagues whose opinions helped me gain a full perspective on my work. Also to my proofreaders for dealing with countless unnecessary commas.

Declaration: I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.