Results 1 - 3 of 3
Metadata Workloads for Testing Big Storage Systems
Abstract
Efficient namespace metadata management is becoming more important as next-generation file systems are designed for the peta and exascale era. A number of new metadata management schemes have been proposed. However, evaluation of these designs has been insufficient, mainly due to a lack of appropriate namespace metadata traces. Specifically, no Big Data storage system metadata trace is publicly available, and existing traces are a poor replacement. We studied publicly available traces and one Big Data trace from Yahoo! and note some of the differences and their implications for metadata management studies. We discuss the insufficiency of existing evaluation approaches and present a first step towards a statistical metadata workload model that can capture the relevant characteristics of a workload and is suitable for synthetic workload generation.
Performance
Abstract
The norm for data analytics is now to run them on commodity clusters with MapReduce-like abstractions. One only needs to read the popular blogs to see the evidence of this. We believe that we could now say that “nobody ever got fired for using Hadoop on a cluster”! We completely agree that Hadoop on a cluster is the right solution for jobs where the input data is multi-terabyte or larger. However, in this position paper we ask if this is the right path for general purpose data analytics. Evidence suggests that many MapReduce-like jobs process relatively small input data sets (less than 14 GB). Memory has reached a GB/$ ratio such that it is now technically and financially feasible to have servers with 100s of GB of DRAM. We therefore ask, should we be scaling by using single machines with very large memories rather than clusters? We conjecture that, in terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.
Hadoop’s Adolescence: A Comparative Workload Analysis from Three Research Clusters
2012
Abstract
We analyze Hadoop workloads from three different research clusters from an application-level perspective, with two goals: (1) explore new issues in application patterns and user behavior and (2) understand key performance challenges related to IO and load balance. Our analysis suggests that Hadoop usage is still in its adolescence. We see underuse of Hadoop features, extensions, and tools as well as significant opportunities for optimization. We see significant diversity in application styles, including some “interactive” workloads, motivating new tools in the ecosystem. We find that some conventional approaches to improving performance are not especially effective and suggest some alternatives. Overall, we find significant opportunity for simplifying the use and optimization of Hadoop, and make recommendations for future research.

