Job Scheduling in Multiprogrammed Parallel Systems
, 1997
Cited by 145 (15 self)
Scheduling in the context of parallel systems is often thought of in terms of assigning tasks in a program to processors, so as to minimize the makespan. This formulation assumes that the processors are dedicated to the program in question. But when the parallel system is shared by a number of users, this is not necessarily the case. In the context of multiprogrammed parallel machines, scheduling refers to the execution of threads from competing programs. This is an operating system issue, involved with resource allocation, not a program development issue. Scheduling schemes for multiprogrammed parallel systems can be classified as single-level or two-level. Single-level scheduling combines the allocation of processing power with the decision of which thread will use it. Two-level scheduling decouples the two issues: first, processors are allocated to the job, and then the job's threads are scheduled using this pool of processors. The processors of a parallel system can be shared i...
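The two-level scheme described in this abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the equipartition rule at the first level, the round-robin rule at the second level, and the names `allocate_processors` and `schedule_threads` are all assumptions chosen only to show how the two decisions are decoupled.

```python
from collections import deque


def allocate_processors(jobs, total_procs):
    """Level 1 (assumed policy): equipartition processors among jobs.

    `jobs` maps a job name to its list of threads. No job receives more
    processors than it has threads.
    """
    alloc = {name: 0 for name in jobs}
    remaining = total_procs
    progress = True
    # Hand out processors one at a time, cycling over jobs that can use one.
    while remaining > 0 and progress:
        progress = False
        for name, threads in jobs.items():
            if remaining == 0:
                break
            if alloc[name] < len(threads):
                alloc[name] += 1
                remaining -= 1
                progress = True
    return alloc


def schedule_threads(threads, procs):
    """Level 2 (assumed policy): run a job's threads round-robin on its own pool.

    Returns a list of time slots; each slot lists the threads that run together.
    """
    if procs == 0:
        return []
    queue = deque(threads)
    slots = []
    while queue:
        slots.append([queue.popleft() for _ in range(min(procs, len(queue)))])
    return slots


if __name__ == "__main__":
    jobs = {"A": ["a0", "a1", "a2", "a3"], "B": ["b0"]}
    alloc = allocate_processors(jobs, total_procs=4)
    for name, threads in jobs.items():
        print(name, alloc[name], schedule_threads(threads, alloc[name]))
```

A single-level scheduler would instead make one global decision about which thread runs on which processor, without the intermediate per-job pool.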
The Design and Implementation of SOLAR, a Portable Library for Scalable Out-of-Core Linear Algebra Computations
- WORKSHOP ON I/O IN PARALLEL AND DISTRIBUTED SYSTEMS
, 1996
Cited by 61 (4 self)
SOLAR is a portable high-performance library for out-of-core dense matrix computations. It combines portability with high performance by using existing high-performance in-core subroutine libraries and by using an optimized matrix input-output library. SOLAR works on parallel computers, workstations, and personal computers. It supports in-core computations on both shared-memory and distributed-memory machines, and its matrix input-output library supports both conventional I/O interfaces and parallel I/O interfaces. This paper discusses the overall design of SOLAR, its interfaces, and the design of several important subroutines. Experimental results show that, on a single workstation, SOLAR can factor an out-of-core symmetric positive-definite matrix at a rate exceeding 215 Mflops, and an out-of-core general matrix at a rate exceeding 195 Mflops. Less than 16% of the running time is spent on I/O in these computations. These results indicate that SOLAR's portability does not compromise its performance. We expect that the combination of portability, modularity, and the use of a high-level I/O interface will make the library an important platform for research on out-of-core algorithms and on parallel I/O.
Toward Convergence in Job Schedulers for Parallel Supercomputers
- In Job Scheduling Strategies for Parallel Processing
, 1996
Cited by 56 (6 self)
The space of job schedulers for parallel supercomputers is rather fragmented, because different researchers tend to make different assumptions about the goals of the scheduler, the information that is available about the workload, and the operations that the scheduler may perform. We argue that by identifying these assumptions explicitly, it is possible to reach a level of convergence. For example, it is possible to unite most of the different assumptions into a common framework by associating a suitable cost function with the execution of each job. The cost function reflects knowledge about the job and the degree to which it fits the goals of the system. Given such cost functions, scheduling is done to maximize the system's profit.
1 Introduction
Both theoreticians and practitioners have been investigating and implementing various types of schedulers, and analyzing their performance over a wide range of workloads, leading to a large and varied body of knowledge [13]. How...
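The idea of attaching a cost function to each job and scheduling to maximize profit can be made concrete with a small Python sketch. Everything specific here is an assumption for illustration and not the paper's formulation: a single shared resource, value functions that depend only on completion time, exhaustive search over orderings, and the names `total_profit` and `best_order`.

```python
from itertools import permutations


def total_profit(order, jobs):
    """Sum each job's value at its completion time when run back to back.

    `jobs` maps a name to (runtime, value_fn), where value_fn(t) is the
    hypothetical profit earned if the job completes at time t.
    """
    clock = 0.0
    profit = 0.0
    for name in order:
        runtime, value_fn = jobs[name]
        clock += runtime
        profit += value_fn(clock)
    return profit


def best_order(jobs):
    """Exhaustively pick the ordering with the highest total profit.

    Only practical for a handful of jobs; a real scheduler would need a
    heuristic or a structured cost function.
    """
    return max(permutations(jobs), key=lambda order: total_profit(order, jobs))


if __name__ == "__main__":
    # Hypothetical jobs: a short job whose value decays quickly with delay,
    # and a long job whose value is almost insensitive to delay.
    jobs = {
        "interactive": (1.0, lambda t: 10.0 - 5.0 * t),
        "batch": (4.0, lambda t: 8.0 - 0.1 * t),
    }
    order = best_order(jobs)
    print(order, total_profit(order, jobs))
```

Different scheduler goals correspond to different shapes of `value_fn`: a steep decay favors responsiveness, while a flat function lets the job be deferred at little cost.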
Workload Evolution on the Cornell Theory Center IBM SP2
, 1996
Cited by 56 (0 self)
The Cornell Theory Center (CTC) put a 512-node IBM SP2 system into production in early 1995, and extended traces of batch jobs began to be collected in June of that year. An analysis of the workload shows that it has not only grown, but that its characteristics have changed over time. In particular, job duration increased with time, indicative of an expanding production workload. In addition, there was increasing use of parallelism. As the load has increased and larger jobs have become more frequent, the batch management software (IBM's LoadLeveler) has had difficulty in scheduling the requested resources. New policies were established to improve the situation. This paper will profile how the workload has changed over time and give an in-depth look at the maturing workload. It will examine how frequently certain resources are requested and analyze user submittal patterns. It will also describe the policies that were implemented to improve the scheduling situation and their effect on ...
The Theory, Practice, And A Tool For BSP Performance Prediction Applied To A CFD Application
- In Europar'96, volume 1124 of LNCS
, 1996
Cited by 31 (7 self)
The Bulk Synchronous Parallel (BSP) model provides a theoretical framework to accurately predict the execution time of parallel programs. In this paper we describe a BSP programming library that has been developed, and contrast two approaches to analysing performance: (1) a pencil and paper method with a theoretical cost model; (2) a profiling tool that analyses trace information generated during program execution. These approaches are evaluated on an industrial application code that solves fluid dynamics equations around a complex aircraft geometry on an IBM SP2 and SGI PowerChallenge. We show how the tool can be used to explore the communication patterns of the CFD code and accurately predict the performance of the application on any parallel machine.
1 Introduction
The efficient implementation of complex algorithms onto parallel machines is an arduous task. Furthermore, the resulting performance is often only known once this task has been completed. This is unsatisfactory consideri...
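The "pencil and paper" approach mentioned in this abstract can be sketched with the standard BSP cost formula, in which one superstep costs w + h*g + l: w is the maximum local computation, h the maximum number of words sent or received by any processor, g the per-word communication cost, and l the barrier synchronization cost. The Python below is a minimal sketch under that textbook formula; the function names and the numbers in the example are hypothetical and are not the paper's measurements.

```python
def superstep_cost(w, h, g, l):
    """Cost of one BSP superstep: local work + h-relation + barrier."""
    return w + h * g + l


def program_cost(supersteps, g, l):
    """Total predicted cost of a program given (w, h) for each superstep.

    g and l are machine parameters, so the same list of supersteps can be
    re-evaluated for a different machine by changing only g and l.
    """
    return sum(superstep_cost(w, h, g, l) for w, h in supersteps)


if __name__ == "__main__":
    # Hypothetical program: two supersteps, the second with no communication.
    supersteps = [(100, 10), (50, 0)]
    print(program_cost(supersteps, g=2, l=5))
```

Because the program contributes only the (w, h) pairs and the machine contributes only g and l, the prediction carries over to another machine once its two parameters are known.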
Performance and Experience with LAPI - a New High-Performance Communication Library for the IBM RS/6000 SP
- In Proceedings of the International Parallel Processing Symposium
, 1998
Cited by 27 (8 self)
LAPI is a low-level, high-performance communication interface available on the IBM RS/6000 SP system. It provides an active-message-like interface along with remote memory copy and synchronization functionality. It is designed primarily for use by experienced programmers in developing parallel subsystems, libraries and tools, but we also expect power programmers to use it in end-user applications. IBM developed LAPI as a part of a project with Pacific Northwest National Laboratory (PNNL) to optimize the performance of the Global Arrays (GA) toolkit and its applications on the IBM RS/6000 SP. We provide an overview of LAPI characteristics and discuss its differences from other models such as MPI-2. We present some base performance parameters of LAPI, including latency and bandwidth, and compare them with the performance of MPI/MPL. The Global Arrays library from PNNL was ported to LAPI to exploit the performance benefits of this new interface. Experience using LAPI to implement GA and the performance of the resulting library are presented.
Wormhole Routing Techniques for Directly Connected Multicomputer Systems
- ACM Computing Surveys
, 1998
Cited by 23 (0 self)
Wormhole routing has emerged as the most widely used switching technique in massively parallel computers. We present here a detailed survey of various techniques for enhancing the performance and reliability of the wormhole routing schemes in directly connected networks. We start with an overview of the direct network topologies and a comparison of various switching techniques. Next, the characteristics of the wormhole routing mechanism are described in detail along with the theory behind deadlock-free routing. The performance of routing algorithms depends on the selection of the path between the source and the destination, the network traffic, and the router design. The routing algorithms are implemented in the router chips. We outline the router characteristics and describe the functionality of various elements of the router. Depending on the usage of paths between the source and the destination, the routing algorithms are classified as deterministic, fully adaptive, and partially adaptive. ...
Modeling the Communication Performance of the IBM SP2
- In Proc. of 10th IEEE Int. Parallel Processing Symp
, 1996
Cited by 21 (3 self)
The objective of this paper is to develop models that characterize the communication performance of a message-passing multicomputer by taking the IBM SP2 as a case study. The paper evaluates and models the three aspects of the communication performance: scheduling overhead, message-passing time, and synchronization overhead. Performance models are developed for the basic communication patterns, enabling the estimation of the communication times of a message-passing application. Such estimates facilitate activities such as application tuning, selection of the best available implementation technique, and performance comparisons among different multicomputers.
1. Introduction
A distributed-memory multicomputer consists of multiple processor nodes interconnected by a message-passing network. Each processor node is an autonomous computer consisting of a central processing unit (CPU), memory, communication adapter, and, for at least some nodes, mass storage and I/O devices. Figure 1 shows a...
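To show how a message-passing time model of this kind might be used for estimation, here is a minimal Python sketch. The abstract does not give the form of the paper's models, so the sketch assumes the common linear model, time = latency + size / bandwidth, and fits it by ordinary least squares. The function names are hypothetical, and the samples in the example are synthetic values generated from assumed parameters, not SP2 measurements.

```python
def fit_linear_model(samples):
    """Least-squares fit of time = latency + size / bandwidth.

    `samples` is a list of (message_size_bytes, time_seconds) pairs with at
    least two distinct sizes. Returns (latency_seconds, bandwidth_bytes_per_s).
    """
    n = len(samples)
    mean_size = sum(size for size, _ in samples) / n
    mean_time = sum(time for _, time in samples) / n
    sxx = sum((size - mean_size) ** 2 for size, _ in samples)
    sxy = sum((size - mean_size) * (time - mean_time) for size, time in samples)
    slope = sxy / sxx  # seconds per byte
    latency = mean_time - slope * mean_size
    return latency, 1.0 / slope


def predicted_time(size, latency, bandwidth):
    """Estimate the time to send one message of `size` bytes."""
    return latency + size / bandwidth


if __name__ == "__main__":
    # Synthetic samples from an assumed 50 microsecond latency and 30 MB/s link.
    samples = [(s, 50e-6 + s / 30e6) for s in (0, 1000, 10000, 100000)]
    latency, bandwidth = fit_linear_model(samples)
    print(latency, bandwidth, predicted_time(4096, latency, bandwidth))
```

Once the two parameters are fitted, summing `predicted_time` over the messages of an application gives a rough estimate of its point-to-point communication time under the assumed model.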
Processor Allocation in Multiprogrammed Distributed-Memory Parallel Computer Systems
, 1997
Cited by 15 (3 self)
In this paper, we examine three general classes of space-sharing scheduling policies under a workload representative of large-scale scientific computing. These policies differ in the way processors are partitioned among the jobs as well as in the way jobs are prioritized for execution on the partitions. We consider new static, adaptive and dynamic policies that differ from previously proposed policies by exploiting user-supplied information about the resource requirements of submitted jobs. We examine the performance characteristics of these policies from both the system and user perspectives. Our results demonstrate that existing static schemes do not perform well under varying workloads, and that the system scheduling policy for such workloads must distinguish between jobs with large differences in execution times. We show that obtaining good performance under adaptive policies requires some a priori knowledge of the job mix in these systems. We further show that a judiciously para...
Dynamic Resource Management on Distributed Systems Using Reconfigurable Applications
- IBM Journal of Research and Development
, 1997
Cited by 15 (1 self)
Efficient management of distributed resources, under conditions of unpredictable and varying workload, requires enforcement of dynamic resource management policies. Execution of such policies requires relatively fine-grained control over the resources allocated to jobs in the system. Although this is a difficult task using conventional job management and program execution models, reconfigurable applications can be used to make it viable. With reconfigurable applications, it is possible to dynamically change, during the course of program execution, the number of concurrently executing tasks of an application as well as the resources allocated. Thus, reconfigurable applications can adapt to internal changes in resource requirements and to external changes affecting available resources. In this paper, we discuss dynamic management of resources on distributed systems with the help of reconfigurable applications. We first characterize reconfigurable parallel applications. We then present a ...

