Results 1 - 9 of 9
SimMatrix: SIMulator for MAny-Task computing execution fabRIc at eXascale
Abstract
Cited by 10 (8 self)
Exascale computers (expected to be composed of millions of nodes and billions of threads of execution) will enable the unraveling of significant scientific mysteries. Many-task computing is a distributed paradigm that can potentially address three of the four major challenges of exascale computing, namely Memory/Storage, Concurrency/Locality, and Resiliency. Exascale computing will require efficient job scheduling/management systems several orders of magnitude beyond the state of the art, which tends to have centralized architectures and to be relatively heavy-weight. This paper proposes a light-weight discrete event simulator, SimMatrix, which simulates job scheduling systems comprising millions of nodes and billions of cores/tasks. SimMatrix supports both centralized (e.g. first-in-first-out) and distributed (e.g. work stealing) scheduling. We validated SimMatrix against two real systems, Falkon and MATRIX, with up to 4K cores, running on an IBM Blue Gene/P system, and compared SimMatrix with SimGrid and GridSim in terms of resource consumption at scale. Results show that SimMatrix consumes up to two orders of magnitude less memory per task, and incurs at least one order of magnitude (and up to four orders of magnitude) lower time-per-task overhead. For example, running a workload of 10 billion tasks on 1 million nodes and 1 billion cores required 142 GB of memory and 163 CPU-hours. These relatively low costs at exascale levels of concurrency will enable innovative studies of scheduling algorithms at unprecedented scales.
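The distributed (work stealing) scheduling mode that SimMatrix simulates can be illustrated with a toy round-based sketch. This is not SimMatrix's actual discrete event model; the function and its parameters are hypothetical, and it only shows the core idea that idle nodes pull work from loaded ones:

```python
import random

def simulate_work_stealing(num_nodes, tasks, steal_fraction=0.5, seed=0):
    """Toy round-based work-stealing simulation (illustrative only).
    All tasks start on node 0; each round, every node with work executes
    one task, then each idle node steals a fraction of a random victim's
    queue. Returns the number of rounds until all tasks complete."""
    rng = random.Random(seed)
    queues = [0] * num_nodes
    queues[0] = tasks
    rounds = 0
    while sum(queues) > 0:
        rounds += 1
        # every node with queued work executes one task this round
        for i in range(num_nodes):
            if queues[i] > 0:
                queues[i] -= 1
        # idle nodes pick a random victim and steal part of its queue
        for i in range(num_nodes):
            if queues[i] == 0:
                victim = rng.randrange(num_nodes)
                stolen = int(queues[victim] * steal_fraction)
                queues[victim] -= stolen
                queues[i] += stolen
    return rounds
```

With a single node the simulation degenerates to serial execution; with several nodes, stealing spreads the initial queue and the round count approaches the ideal `tasks / num_nodes`.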
Using Simulation to Explore Distributed Key-Value Stores for Exascale System Services
Abstract
Cited by 6 (4 self)
Most HPC services are still designed around a centralized paradigm and hence are susceptible to scaling issues. P2P services have proved themselves at scale for wide-area internet workloads. Distributed key-value stores (KVS) are widely used as a building block for these services, but are not prevalent in HPC services. In this paper, we simulate KVS for various service architectures and examine the design trade-offs as applied to HPC workloads to support exascale systems. Via simulation, we demonstrate how failure, replication, and consistency models affect performance at scale. Finally, we emphasize the general applicability of KVS to HPC services by feeding real workloads to the simulator.
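A common partitioning scheme underlying the distributed key-value stores discussed above is consistent hashing, which spreads keys over nodes and yields the replica placement that failure/replication studies simulate. The following is a minimal sketch (not the paper's simulator; class and parameter names are assumptions):

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Minimal consistent-hash ring: each node is placed at several
    virtual positions; a key is owned by the first node(s) clockwise
    from the key's hash. Illustrative sketch only."""

    def __init__(self, nodes, vnodes=64):
        # one (hash, node) entry per virtual node, sorted by hash
        self.ring = sorted(
            (self._hash(f"{n}#{v}"), n) for n in nodes for v in range(vnodes)
        )
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def lookup(self, key, replicas=1):
        """Return the first `replicas` distinct nodes clockwise from the key."""
        idx = bisect(self.hashes, self._hash(key)) % len(self.ring)
        owners = []
        while len(owners) < replicas:
            node = self.ring[idx % len(self.ring)][1]
            if node not in owners:
                owners.append(node)
            idx += 1
        return owners
```

Because only keys adjacent to a failed node's ring positions move, node churn relocates a small fraction of the keyspace, which is why KVS designs tolerate failures gracefully at scale.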
CloudKon: a Cloud enabled Distributed tasK executiON framework
Abstract
Cited by 5 (4 self)
Task scheduling and execution over large-scale distributed systems play an important role in achieving good performance and high system utilization. Job management systems need to support applications (e.g. Many-Task Computing (MTC), MapReduce) with a growing number of tasks of finer granularity, driven by the explosion of parallelism in today's hardware, which requires techniques such as over-decomposition to deliver good performance. Our goal in this work is to provide a compact, light-weight, scalable, and distributed task execution framework (CloudKon) that builds upon cloud computing building blocks (Amazon EC2, SQS, and DynamoDB). Most of today's state-of-the-art job execution systems have predominantly master/slave architectures, which have inherent limitations such as scalability issues at extreme scales and single points of failure. On the other hand, distributed job management systems are complex and employ non-trivial load balancing algorithms to maintain good utilization. CloudKon is a distributed job management system that can support millions of tasks from multiple users, delivering over 2X the throughput of other state-of-the-art systems, all with a code base less than 5% the size. Although this work was motivated by supporting MTC applications, we also outline possible support for HPC applications.
Optimizing Load Balancing and Data-Locality with Data-aware Scheduling
Abstract
Cited by 4 (2 self)
Load balancing techniques (e.g. work stealing) are important for obtaining the best performance from distributed task scheduling systems that have multiple schedulers making scheduling decisions. In work stealing, tasks are randomly migrated from heavily loaded schedulers to idle ones. However, for data-intensive applications where tasks are dependent and task execution involves processing a large amount of data, migrating tasks blindly yields poor data-locality and incurs significant data-transfer overhead. This work improves work stealing by using both dedicated and shared queues. Tasks are organized in queues based on task data size and location. We implement our technique in MATRIX, a distributed task scheduler for many-task computing. We leverage a distributed key-value store to organize and scale the task metadata, task dependencies, and data-locality. We evaluate the improved work stealing technique with both applications and micro-benchmarks structured as directed acyclic graphs. Results show that the proposed data-aware work stealing technique performs well.
Keywords: data-intensive computing; data-aware scheduling; work stealing; key-value stores; many-task computing
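The dedicated/shared-queue idea can be sketched as follows. This is an assumption-laden illustration, not MATRIX's implementation: the class, the size threshold, and the policy of routing by data size alone are all hypothetical simplifications of "queues based on task data size and location":

```python
from collections import deque

DATA_THRESHOLD = 1 << 20  # 1 MiB; hypothetical cutoff for "large" task data

class DataAwareScheduler:
    """Sketch of data-aware work stealing: tasks with large input data
    stay in a dedicated queue on the node holding the data and are never
    stolen; small-data tasks go to a shared queue open to stealing."""

    def __init__(self):
        self.dedicated = deque()  # large-data tasks, executed locally
        self.shared = deque()     # small-data tasks, eligible for stealing

    def submit(self, task_id, data_size):
        if data_size >= DATA_THRESHOLD:
            self.dedicated.append(task_id)
        else:
            self.shared.append(task_id)

    def steal_from(self):
        """A remote scheduler may only take small-data work."""
        return self.shared.popleft() if self.shared else None
```

The design point being illustrated: stealing remains available for cheap-to-move tasks, while tasks whose data-transfer cost would dominate execution are pinned to their data.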
Towards scalable distributed workload manager with monitoring-based weakly consistent resource stealing
In: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing
"... ABSTRACT One way to efficiently utilize the coming exascale machines is to support a mixture of applications in various domains, such as traditional large-scale HPC, the ensemble runs, and the finegrained many-task computing (MTC). Delivering high performance in resource allocation, scheduling and ..."
Abstract
Cited by 2 (1 self)
One way to efficiently utilize the coming exascale machines is to support a mixture of applications in various domains, such as traditional large-scale HPC, ensemble runs, and fine-grained many-task computing (MTC). Delivering high performance in resource allocation, scheduling, and launching for all types of jobs has driven us to develop Slurm++, a distributed workload manager directly extended from the Slurm centralized production system. Slurm++ employs multiple controllers, each managing a partition of compute nodes and participating in resource allocation through resource balancing techniques. In this paper, we propose a monitoring-based weakly consistent resource stealing technique to achieve resource balancing in distributed HPC job launch, and implement the technique in Slurm++. We compare Slurm++ with Slurm using micro-benchmark workloads with different job sizes. Slurm++ was 10X faster than Slurm in allocating resources and launching jobs; we expect the performance gap to grow as job sizes and system scales increase in future high-end computing systems.
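The core resource-stealing step can be sketched as below. The function and data shapes are assumptions, not Slurm++'s API; in the actual monitoring-based, weakly consistent design the free-node counts a controller sees may be stale and a steal attempt can fail, which this deterministic sketch omits:

```python
def steal_resources(partitions, requester, needed):
    """Illustrative resource-stealing step: the requesting controller
    scans other partitions' (monitored) free-node lists and takes nodes
    until the request is satisfied. `partitions` maps controller id to
    a mutable list of free node names."""
    granted = []
    for ctrl, free_nodes in partitions.items():
        if ctrl == requester or not free_nodes:
            continue
        take = min(needed - len(granted), len(free_nodes))
        granted.extend(free_nodes[:take])
        del free_nodes[:take]          # nodes leave the victim's free pool
        if len(granted) == needed:
            break
    return granted
```

A request larger than the requester's own partition is thus satisfied by borrowing from lightly loaded controllers, which is the balancing effect the paper measures.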
SCALABLE RESOURCE MANAGEMENT SYSTEM SOFTWARE FOR EXTREME-SCALE DISTRIBUTED SYSTEMS
2015
Exploring the Design Tradeoffs for Extreme-Scale High-Performance Computing System Software
Abstract
Owing to the extreme parallelism and high component failure rates of tomorrow's exascale machines, high-performance computing (HPC) system software will need to be scalable, failure-resistant, and adaptive for sustained system operation and full system utilization. Much existing HPC system software is still designed around a centralized server paradigm and hence is susceptible to scaling issues and single points of failure. In this article, we explore the design tradeoffs for scalable system software at extreme scales. We propose a general system software taxonomy by deconstructing common HPC system software into its basic components. The taxonomy helps us reason about system software as follows: (1) it gives us a systematic way to architect scalable system software by decomposing it into basic components; (2) it allows us to categorize system software based on the features of these components; and (3) it suggests the configuration space to consider for design evaluation via simulations or real implementations. Further, we evaluate different design choices of a representative system software, a key-value store, through simulations of up to millions of nodes. Finally, we show evaluation results for two distributed system software packages, Slurm++ (a distributed HPC resource manager) and MATRIX (a distributed task execution framework), both developed based on insights from this work. We envision that the results in this article will help lay the foundations for developing next-generation HPC system software for extreme scales.
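One of the key-value-store design choices such simulations evaluate is the replication/consistency trade-off, often framed as quorum overlap: with N replicas, a write quorum W and read quorum R guarantee that reads see the latest write when W + R > N. A toy sketch (hypothetical class, not the article's simulator):

```python
class ReplicatedKVS:
    """Toy N-replica store with quorum writes and reads, illustrating
    the W + R > N overlap rule. Writes land on the last W replicas and
    reads consult the first R, so the two sets overlap iff W + R > N."""

    def __init__(self, n=3):
        self.n = n
        self.replicas = [{} for _ in range(n)]

    def write(self, key, value, version, w):
        # only W replicas acknowledge the write synchronously
        for rep in self.replicas[self.n - w:]:
            rep[key] = (version, value)

    def read(self, key, r):
        # collect versioned values from R replicas; newest version wins
        votes = [rep[key] for rep in self.replicas[:r] if key in rep]
        return max(votes)[1] if votes else None
```

With N=3 and W=2, a read with R=2 satisfies W + R > N and observes the write, while R=1 does not overlap the write set and may miss it, which is exactly the stale-read behavior weakly consistent configurations trade for lower latency.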
Exploring Distributed Resource Allocation Techniques in the SLURM Job Management System
Abstract
With the exponential growth of distributed computing systems in both flops and cores, scientific applications are growing more diverse, with a variety of workloads. These workloads include traditional large-scale High Performance Computing MPI jobs and ensemble workloads, such as Many-Task Computing workloads comprised of an extremely large number of tasks of finer granularity, where tasks are defined on a per-core or per-node level and often execute in milliseconds to seconds. Delivering high throughput and low latency for these heterogeneous workloads requires developing distributed job management systems that are orders of magnitude more scalable and available than today's centralized batch-scheduled job management systems. In this paper, we present a distributed job launch prototype, SLURM++, which extends the SLURM resource manager by integrating the ZHT zero-hop distributed key-value store for distributed state management. SLURM++ is comprised of multiple controllers, each managing several SLURM daemons, while ZHT is used to store all the job metadata and the SLURM daemons' state. We compared SLURM with our SLURM++ prototype using a variety of micro-benchmarks of different job sizes (small, medium, and large) at modest scales (500 nodes), with excellent results (10X higher job throughput). Scalability trends suggest performance many orders of magnitude higher on tomorrow's extreme-scale systems.
Keywords: job management systems; job launch; distributed scheduling; key-value stores
Exploring Distributed HPC Scheduling in MATRIX
Abstract
Efficiently scheduling large numbers of jobs over large-scale distributed systems is critical to achieving high system utilization and throughput. Today's state-of-the-art job schedulers mostly follow a centralized master/slave architecture. The problem with this architecture is that it cannot scale efficiently even to petascale and is vulnerable to a single point of failure. This is overcome by the distributed job management system MATRIX (MAny-Task computing execution fabRIc at eXascale), which adopts a work stealing algorithm aimed at load balancing across the distributed system. MATRIX currently supports Many-Task Computing (MTC) workloads. This project aims to extend MATRIX to support High Performance Computing (HPC) workloads, which are long-running jobs that need multiple nodes/cores to run their tasks. It is a challenge to support HPC on a framework built for MTC jobs, since the framework is focused on efficiently scheduling sub-second jobs on available workers. The design for scheduling HPC jobs must be efficient enough not to hamper the efficient execution of MTC tasks.