Results 1 - 10
of
66
Mapreduce online
, 2009
"... MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture tha ..."
Abstract
-
Cited by 43 (2 self)
- Add to MetaCart
MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see “early returns” from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop and can run unmodified user-defined MapReduce programs. 1
Hive- A Warehousing Solution Over a Map-Reduce Framework
- IN VLDB '09: PROCEEDINGS OF THE VLDB ENDOWMENT
, 2009
"... The size of data sets being collected and analyzed in the
industry for business intelligence is growing rapidly, mak-
ing traditional warehousing solutions prohibitively expen-
sive. Hadoop [3] is a popular open-source map-reduce im-
plementation which is being used as an alternative to store
and pr ..."
Abstract
-
Cited by 41 (1 self)
- Add to MetaCart
The size of data sets being collected and analyzed in the
industry for business intelligence is growing rapidly, mak-
ing traditional warehousing solutions prohibitively expen-
sive. Hadoop [3] is a popular open-source map-reduce im-
plementation which is being used as an alternative to store
and process extremely large data sets on commodity hard-
ware. However, the map-reduce programming model is very
low level and requires developers to write custom programs
which are hard to maintain and reuse.
In this paper, we present Hive, an open-source data ware-
housing solution built on top of Hadoop. Hive supports
queries expressed in a SQL-like declarative language - HiveQL,
which are compiled into map-reduce jobs executed on Hadoop.
In addition, HiveQL supports custom map-reduce scripts to
be plugged into queries. The language includes a type sys-
tem with support for tables containing primitive types, col-
lections like arrays and maps, and nested compositions of
the same. The underlying IO libraries can be extended to
query data in custom formats. Hive also includes a system
catalog, Hive-Metastore, containing schemas and statistics,
which is useful in data exploration and query optimization.
In Facebook, the Hive warehouse contains several thousand
tables with over 700 terabytes of data and is being used ex-
tensively for both reporting and ad-hoc analyses by more
than 100 users.
Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling
- In Proc. EuroSys
, 2010
"... As organizations start to use data-intensive cluster computing systems like Hadoop and Dryad for more applications, there is a growing need to share clusters between users. However, there is a conflict between fairness in scheduling and data locality (placing tasks on nodes that contain their input ..."
Abstract
-
Cited by 29 (9 self)
- Add to MetaCart
As organizations start to use data-intensive cluster computing systems like Hadoop and Dryad for more applications, there is a growing need to share clusters between users. However, there is a conflict between fairness in scheduling and data locality (placing tasks on nodes that contain their input data). We illustrate this problem through our experience designing a fair scheduler for a 600-node Hadoop cluster at Facebook. To address the conflict between locality and fairness, we propose a simple algorithm called delay scheduling: when the job that should be scheduled next according to fairness cannot launch a local task, it waits for a small amount of time, letting other jobs launch tasks instead. We find that delay scheduling achieves nearly optimal data locality in a variety of workloads and can increase throughput by up to 2x while preserving fairness. In addition, the simplicity of delay scheduling makes it applicable under a wide variety of scheduling policies beyond fair sharing.
Large-scale Incremental Processing Using Distributed Transactions and Notifications
- 9th USENIX Symposium on Operating Systems Design and Implementation
"... Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These ta ..."
Abstract
-
Cited by 26 (0 self)
- Add to MetaCart
Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google’s indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency. We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%. 1
Mapreduce: a flexible data processing tool
- Commun. ACM
"... contributed articles doi:10.1145/1629175.1629198 MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs. ..."
Abstract
-
Cited by 24 (0 self)
- Add to MetaCart
contributed articles doi:10.1145/1629175.1629198 MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs.
HaLoop: Efficient Iterative Data Processing on Large Clusters
"... The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph o ..."
Abstract
-
Cited by 22 (1 self)
- Add to MetaCart
The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph of operators. These platforms lack built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking, graph analysis, model fitting, and so on. This paper presents HaLoop, a modified version of the Hadoop MapReduce framework that is designed to serve these applications. HaLoop not only extends MapReduce with programming support for iterative applications, it also dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms. We evaluated HaLoop on real queries and real datasets. Compared with Hadoop, on average, HaLoop reduces query runtimes by 1.85, and shuffles only 4 % of the data between mappers and reducers. 1.
Efficient Parallel Set-Similarity Joins Using MapReduce
"... In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end setsimilarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We eff ..."
Abstract
-
Cited by 16 (1 self)
- Add to MetaCart
In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end setsimilarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop. Categories and Subject Descriptors H.2.4 [Database Management]: Systems—query processing, parallel databases
Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)
"... MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapRe ..."
Abstract
-
Cited by 16 (1 self)
- Add to MetaCart
MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop — an open-source implementation of MapReduce — often does not match the one of a well-configured parallel DBMS. In this paper we propose a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all (Hadoop does not even ‘notice it’). To reach this goal, rather than changing a working system (Hadoop), we inject our technology at the right places through UDFs only and affect Hadoop from inside. This has three important consequences: First, Hadoop++ significantly
Radius Plots for Mining Tera-byte Scale Graphs: Algorithms, Patterns, and Observations
"... Given large, multi-million node graphs (e.g., FaceBook, web-crawls, etc.), how do they evolve over time? How are they connected? What are the central nodes and the outliers of the graphs? We show that the Radius Plot (pdf of node radii) can answer these questions. However, computing the Radius Plot ..."
Abstract
-
Cited by 13 (10 self)
- Add to MetaCart
Given large, multi-million node graphs (e.g., FaceBook, web-crawls, etc.), how do they evolve over time? How are they connected? What are the central nodes and the outliers of the graphs? We show that the Radius Plot (pdf of node radii) can answer these questions. However, computing the Radius Plot is prohibitively expensive for graphs reaching the planetary scale. There are two major contributions in this paper: (a) We propose HADI (HAdoop DIameter and radii estimator), a carefully designed and fine-tuned algorithm to compute the diameter of massive graphs, that runs on the top of the HADOOP /MAPREDUCE system, with excellent scale-up on the number of available machines (b) We run HADI on several real world datasets including YahooWeb (6B edges, 1/8 of a Terabyte), one of the largest public graphs ever analyzed. Thanks to HADI, we report fascinating patterns on large networks, like the surprisingly small effective diameter, the multi-modal/bi-modal shape of the Radius Plot, and its palindrome motion over time. 1

