@TECHREPORT{Murthy09winninga, author = {Arun C. Murthy}, title = {Winning a 60 second dash with a yellow elephant}, institution = {}, year = {2009} }
Abstract
Apache Hadoop is an open source software framework that dramatically simplifies writing distributed, data-intensive applications. It provides a distributed file system, modeled after the Google File System [2], and a map/reduce [1] implementation that manages distributed computation. Jim Gray defined a benchmark to compare large sorting programs. Since the core of map/reduce is a distributed sort, most of the custom code is glue to get the desired behavior.

1 Benchmark Rules

Jim Gray's sort benchmark consists of a set of related benchmarks, each with its own rules. All of the sort benchmarks measure the time to sort different numbers of 100-byte records. The first 10 bytes of each record are the key and the rest is the value. The minute sort must finish end to end in less than a minute. The Gray sort must sort more than 100 terabytes and must run for at least an hour.

• The input data must precisely match the data generated by the C data generator.
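
The record layout described above (100-byte records whose first 10 bytes are the key) can be illustrated with a small single-machine sketch. This is not the Hadoop implementation and not the official C data generator: the function names are hypothetical, the test data is plain random bytes, and the sketch assumes keys are ordered by comparing their raw bytes.

```python
import os

RECORD_LEN = 100  # bytes per record, as stated in the benchmark rules
KEY_LEN = 10      # leading bytes of each record that form the key


def split_records(data: bytes) -> list:
    """Cut a byte string into fixed-size 100-byte records."""
    if len(data) % RECORD_LEN != 0:
        raise ValueError("input length is not a multiple of the record size")
    return [data[i:i + RECORD_LEN] for i in range(0, len(data), RECORD_LEN)]


def sort_records(data: bytes) -> bytes:
    """Sort records by their 10-byte key (assumption: raw byte order)."""
    records = split_records(data)
    records.sort(key=lambda rec: rec[:KEY_LEN])
    return b"".join(records)


def is_sorted(data: bytes) -> bool:
    """Check that keys appear in non-decreasing order."""
    keys = [rec[:KEY_LEN] for rec in split_records(data)]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def make_random_records(count: int) -> bytes:
    """Hypothetical stand-in for test input; NOT the benchmark's generator."""
    return os.urandom(count * RECORD_LEN)
```

A quick check would generate a few thousand records with `make_random_records`, pass them through `sort_records`, and confirm with `is_sorted` that the output is ordered and has the same length as the input. A real benchmark run must instead use input produced by the C data generator, as the rule above requires.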