Citations
5132 | Optimization by simulated annealing
- Kirkpatrick, Gelatt, et al.
- 1983
Citation Context ...optimal specification that can help guide tenant’s decisions about data partitioning on the available storage services for their analytics workload(s). The solver uses a simulated annealing algorithm [29] to systematically search through the solution space and find a desirable tiering plan, given the workload specification, analytics models, and tenants’ goals. 4.2.1 CAST Solver: Modeling The data pla...
3421 | Mapreduce: Simplified data processing on large clusters
- Dean, Ghemawat
- 2004
Citation Context ...enants’ goals such as achieving high utility or reducing deadline miss rates. 4.1 Estimating Analytics Job Performance The well-defined execution phases of the MapReduce parallel programming paradigm [20, 27] implies that the runtime characteristics of analytics jobs can be predicted with high accuracy. Moreover, extensive recent research has focused on data analytics performance prediction [42, 25, 24, 1...
756 | Dryad: distributed data-parallel programs from sequential building blocks
- Isard, Budiu, et al.
Citation Context ..., private, and hybrid clouds for not only web applications, such as Netflix, Instagram and Airbnb, but also modern big data analytics using parallel programming paradigms such as Hadoop [2] and Dryad [26]. Cloud providers such as Amazon Web Services, Google Cloud, and Microsoft Azure, have started providing data analytics platform as a service [1, 8, 12], which is being adopted widely. With the improv...
465 | Advanced engineering mathematics
- Kreyszig
- 1999
Citation Context ...as number of VMs and the estimated runtime based on Equation 1 as parameters. After carefully considering multiple regression models, we find that a third degree polynomial-based cubic Hermite spline [30] is a good fit for the applications and storage services considered in the paper. While we do not delve into details about the model, we show the accuracy of the splines in Figure 2. We also evaluate ...
342 | The hadoop distributed file system
- Shvachko, Kuang, et al.
- 2010
Citation Context ... a flash tier for serving reads is beneficial for HDFS-based HBase workloads with random I/Os. As opposed to HBase I/O characteristics, typical MapReduce-like batch jobs issues large, sequential I/Os [40] and run in multiple stages (map, shuffle, reduce). Hence, lessons learned from HBase tiering are not directly applicable to such analytics workloads. hatS [31] and open source Hadoop community [9]...
328 | Benchmarking cloud serving systems with YCSB
- Cooper, Silberstein, et al.
- 2010
238 | Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
- Zaharia, Chowdhury, et al.
- 2012
Citation Context ... analysis goals. Thus, optimizing the system for such applications, as in Cast, can significantly impact the data analytics field. Dynamic vs. Static Storage Tiering Big data frameworks such as Spark [47] and Impala [11] have been used for real-time interactive analytics, where dynamic storage tiering is likely to be more beneficial. In contrast, our work focuses on traditional batch processing analyt...
210 | Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling.
- Zaharia, Borthakur, et al.
- 2010
Citation Context ...t data placement decisions for the same applications. 3.1.1 Experimental Study Setup We select four representative analytics applications that are typical components of real-world analytics workloads [18, 46] and exhibit diversified I/O and computation characteristics, as listed in Table 2. Sort, Join and Grep are I/O-intensive applications. The execution time of Sort is dominated by the shuffle phase I/O,...
79 | Starfish: A Self-tuning System for Big Data Analytics
- Herodotou, Lim, et al.
- 2011
Citation Context ...d Granularity We next study the impact of cross-job interactions within an analytics workload. While individual job-level optimization and tiering has been the major focus of a number of recent works [31, 32, 44, 25, 24, 17], we argue that this is not sufficient for data placement in the cloud for analytics workloads. To this end, we analyze two typical workload characteristics that have been reported in production workl...
68 | Interactive Analytical Processing in Big Data Systems: A Cross-industry Study of MapReduce Workloads
- Chen, Alspaugh, et al.
- 2012
Citation Context ...eans spends most of the time performing computation. Furthermore, short-term (within hours) and long-term (daily, weekly or monthly) data reuse across jobs is common in production analytics workloads [18, 15]. As reported in [18], 78% of jobs in Cloudera Hadoop workloads involve data reuse. Another distinguishing feature of analytics workloads is the presence of workflows that represents interdependencies...
67 | Towards automatic optimization of MapReduce programs
- Babu
- 2010
Citation Context ...d Granularity We next study the impact of cross-job interactions within an analytics workload. While individual job-level optimization and tiering has been the major focus of a number of recent works [31, 32, 44, 25, 24, 17], we argue that this is not sufficient for data placement in the cloud for analytics workloads. To this end, we analyze two typical workload characteristics that have been reported in production workl...
54 | PACMan: Coordinated Memory Caching for Parallel Jobs.
- Ananthanarayanan, Ghodsi, et al.
- 2012
Citation Context ...eans spends most of the time performing computation. Furthermore, short-term (within hours) and long-term (daily, weekly or monthly) data reuse across jobs is common in production analytics workloads [18, 15]. As reported in [18], 78% of jobs in Cloudera Hadoop workloads involve data reuse. Another distinguishing feature of analytics workloads is the presence of workflows that represents interdependencies...
52 | A simulation approach to evaluating design decisions in MapReduce setups.
- Wang, Butt, et al.
- 2009
Citation Context ...adigm [20, 27] implies that the runtime characteristics of analytics jobs can be predicted with high accuracy. Moreover, extensive recent research has focused on data analytics performance prediction [42, 25, 24, 13, 41, 27]. We leverage and adapt MRCute [27] model in Cast to predict job execution time, due to its ease-of-use, availability, and applicability to our problem domain. Equation 1 defines our performance predi...
51 | Auto-Scaling to Minimize Cost and Meet Application Deadlines in Cloud Workflows
- Mao, Humphrey
Citation Context ...ion platforms. In contrast, we explore the inherent performance and cost trade-off of different storage services in public cloud environments. Analytics Workflow Optimization A large body of research [45, 35, 21, 36, 33] focuses on Hadoop workflow optimizations by integrating workflow-aware scheduler into Hadoop or interfacing Hadoop with a standalone workflow scheduler. Our workflow enhancement is orthogonal and com...
37 | Profiling, what-if analysis, and cost-based optimization of mapreduce programs
Citation Context ...d Granularity We next study the impact of cross-job interactions within an analytics workload. While individual job-level optimization and tiering has been the major focus of a number of recent works [31, 32, 44, 25, 24, 17], we argue that this is not sufficient for data placement in the cloud for analytics workloads. To this end, we analyze two typical workload characteristics that have been reported in production workl...
27 | Cost Effective Storage Using Extent Based Dynamic Tiering.
- Guerra, Pucha, et al.
- 2011
Citation Context ...Cold Data Classification-based Tiering Recent research [28, 34, 43] has focused on improving storage cost and utilization efficiency by placing hot/cold data in different storage tiers. Guerra et al. [22] builds an SSD-based dynamic tiering system to minimize cost and power consumption, and existing works handle file system and block level I/Os (e.g., 4 – 32 KB) for POSIX-style workloads (e.g., server...
25 | Bridging the tenant-provider gap in cloud services
- Jalaparti, Ballani, et al.
- 2012
Citation Context ...nstrate very different access characteristics and data dependencies (described in Section 3); requiring a rethink of how storage tiering is done to benefit these workloads. Other works such as Bazaar [27] and Conductor [44], focus on automating cloud resource deployment to meet cloud tenants’ requirements while reducing deployment cost. Our work takes a thematically similar view, exploring the trade-...
22 | Stubby: A transformation-based optimizer for mapreduce workflows.
- Lim, Herodotou, et al.
- 2012
Citation Context ...ion platforms. In contrast, we explore the inherent performance and cost trade-off of different storage services in public cloud environments. Analytics Workflow Optimization A large body of research [45, 35, 21, 36, 33] focuses on Hadoop workflow optimizations by integrating workflow-aware scheduler into Hadoop or interfacing Hadoop with a standalone workflow scheduler. Our workflow enhancement is orthogonal and com...
22 | A data placement strategy in scientific cloud workflows. Future Generation Computing Systems
- Yuan, Yang, et al.
- 2012
Citation Context ...ion platforms. In contrast, we explore the inherent performance and cost trade-off of different storage services in public cloud environments. Analytics Workflow Optimization A large body of research [45, 35, 21, 36, 33] focuses on Hadoop workflow optimizations by integrating workflow-aware scheduler into Hadoop or interfacing Hadoop with a standalone workflow scheduler. Our workflow enhancement is orthogonal and com...
17 | Scale-up vs scale-out for hadoop: Time to rethink?
- Appuswamy, Gkantsidis, et al.
- 2013
Citation Context ...f the same size as the input data. Others, such as inverted indexing, would require a large capacity for storing intermediate data as significant larger shuffle data is generated during the map phase [16]. The generic Equation 3 accounts for all such scenarios and guarantees that the workload will not fail. Given a specific tiering solution, the estimated total completion time of the workload is defin...
14 | Restore: Reusing results of mapreduce jobs.
- Elghandour, Aboulnaga
- 2012
Citation Context ...ion platforms. In contrast, we explore the inherent performance and cost trade-off of different storage services in public cloud environments. Analytics Workflow Optimization A large body of research [45, 35, 21, 36, 33] focuses on Hadoop workflow optimizations by integrating workflow-aware scheduler into Hadoop or interfacing Hadoop with a standalone workflow scheduler. Our workflow enhancement is orthogonal and com...
12 | Play it again, simmr!
- Verma, Cherkasova, et al.
- 2011
Citation Context ...adigm [20, 27] implies that the runtime characteristics of analytics jobs can be predicted with high accuracy. Moreover, extensive recent research has focused on data analytics performance prediction [42, 25, 24, 13, 41, 27]. We leverage and adapt MRCute [27] model in Cast to predict job execution time, due to its ease-of-use, availability, and applicability to our problem domain. Equation 1 defines our performance predi...
12 | Orchestrating the deployment of computations in the cloud with Conductor.
- Wieder, Bhatotia, et al.
- 2012
Citation Context ...ent access characteristics and data dependencies (described in Section 3); requiring a rethink of how storage tiering is done to benefit these workloads. Other works such as Bazaar [27] and Conductor [44], focus on automating cloud resource deployment to meet cloud tenants’ requirements while reducing deployment cost. Our work takes a thematically similar view, exploring the trade-offs of cloud servi...
11 | Analysis of HDFS under HBase: a Facebook Messages Case Study. In FAST
- Harter, Borthakur
- 2014
Citation Context ... framework for cloud-based data analytics workloads. Fine-Grained Tiering for Analytics Storage tiering has been studied in the context of data-intensive analytics batch applications. Recent analysis [23] demonstrates that adding a flash tier for serving reads is beneficial for HDFS-based HBase workloads with random I/Os. As opposed to HBase I/O characteristics, typical MapReduce-like batch jobs issue...
8 | Evaluating Phase Change Memory for Enterprise Storage Systems: A Study of Caching and Tiering Approaches.
- Kim, Seshadri, et al.
- 2014
Citation Context ... RELATED WORK In the following, we provide a brief background of storage tiering, and categorize and compare previous work with our research. Hot/Cold Data Classification-based Tiering Recent research [28, 34, 43] has focused on improving storage cost and utilization efficiency by placing hot/cold data in different storage tiers. Guerra et al. [22] builds an SSD-based dynamic tiering system to minimize cost an...
6 | Woha: Deadline-aware map-reduce workflow scheduling framework over hadoop clusters
- Li, Hu, et al.
- 2014
Citation Context ...ion platforms. In contrast, we explore the inherent performance and cost trade-off of different storage services in public cloud environments. Analytics Workflow Optimization A large body of research [45, 35, 21, 36, 33] focuses on Hadoop workflow optimizations by integrating workflow-aware scheduler into Hadoop or interfacing Hadoop with a standalone workflow scheduler. Our workflow enhancement is orthogonal and com...
6 | Mixapart: decoupled analytics for shared storage systems.
- Mihailescu, Soundararajan, et al.
- 2012
Citation Context ...argue that this is not sufficient for data placement in the cloud for analytics workloads. To this end, we analyze two typical workload characteristics that have been reported in production workloads [18, 15, 38, 23, 19], namely data reuse across jobs, and dependency between jobs, i.e., workflows, within a workload. Data Reuse across Jobs As reported in the analysis of production workloads from Facebook and Microsoft...
6 | Balancing fairness and efficiency in tiered storage systems with bottleneck-aware allocation.
- Wang, Varman
- 2014
Citation Context ... RELATED WORK In the following, we provide a brief background of storage tiering, and categorize and compare previous work with our research. Hot/Cold Data Classification-based Tiering Recent research [28, 34, 43] has focused on improving storage cost and utilization efficiency by placing hot/cold data in different storage tiers. Guerra et al. [22] builds an SSD-based dynamic tiering system to minimize cost an...
5 | Janus: Optimal Flash Provisioning for Cloud Storage Workloads
- Albrecht, Merchant, et al.
- 2013
Citation Context ...exploring the trade-offs of cloud services, but with a different scope that targets data analytics workloads and leverages their unique characteristics to provide storage tiering. Several systems [14, 37] are specifically designed to tackle flash storage allocation inefficiency in virtualization platforms. In contrast, we explore the inherent performance and cost trade-off of different storage service...
5 | hatS: A Heterogeneity-Aware Tiered Storage for Hadoop
- Krish, Anwar, et al.
- 2014
Citation Context ...h jobs issues large, sequential I/Os [40] and run in multiple stages (map, shuffle, reduce). Hence, lessons learned from HBase tiering are not directly applicable to such analytics workloads. hatS [31] and open source Hadoop community [9] have taken the first steps towards integrating heterogeneous storage devices in HDFS for local clusters. However, the absence of task-level tier-aware scheduling ...
3 | φsched: A heterogeneity-aware hadoop workflow scheduler.
- Krish, Anwar, et al.
- 2014
Citation Context ...d Granularity We next study the impact of cross-job interactions within an analytics workload. While individual job-level optimization and tiering has been the major focus of a number of recent works [31, 32, 44, 25, 24, 17], we argue that this is not sufficient for data placement in the cloud for analytics workloads. To this end, we analyze two typical workload characteristics that have been reported in production workl...
3 | vCacheShare: Automated server flash cache space management in a virtualization environment
- Meng, Zhou, et al.
- 2014
Citation Context ...exploring the trade-offs of cloud services, but with a different scope that targets data analytics workloads and leverages their unique characteristics to provide storage tiering. Several systems [14, 37] are specifically designed to tackle flash storage allocation inefficiency in virtualization platforms. In contrast, we explore the inherent performance and cost trade-off of different storage service...
2 | On the importance of evaluating storage systems’ $costs
- Li, Mukker, et al.
- 2014
Citation Context ... RELATED WORK In the following, we provide a brief background of storage tiering, and categorize and compare previous work with our research. Hot/Cold Data Classification-based Tiering Recent research [28, 34, 43] has focused on improving storage cost and utilization efficiency by placing hot/cold data in different storage tiers. Guerra et al. [22] builds an SSD-based dynamic tiering system to minimize cost an...
2 | Frugal storage for cloud file systems
- Puttaswamy, Nandagopal, et al.
Citation Context ...cloud storage services. Cloud Resource Provisioning Considerable prior work has examined ways to automate resource configuration and provisioning process in the cloud. Frugal Cloud File System (FCFS) [39] is a cost-effective cloud-based file storage that spans multiple cloud storage services. In contrast to POSIX file system workloads, modern analytics jobs (focus of our study) running on parallel pro...