Results 1-10 of 37
The Workload on Parallel Supercomputers: Modeling the Characteristics of Rigid Jobs
 Journal of Parallel and Distributed Computing
, 2001
"... The analysis of workloads is important for understanding how systems are used. In addition, workload models are needed as input for the evaluation of new system designs, and for the comparison of system designs. This is especially important in costly largescale parallel systems. Luckily, workloa ..."
Abstract

Cited by 100 (11 self)
The analysis of workloads is important for understanding how systems are used. In addition, workload models are needed as input for the evaluation of new system designs, and for the comparison of system designs. This is especially important in costly large-scale parallel systems. Luckily, workload data is available in the form of accounting logs. Using such logs from three different sites, we analyze and model the job-level workloads with an emphasis on those aspects that are universal to all sites. As many distributions turn out to span a large range, we typically first apply a logarithmic transformation to the data, and then fit it to a novel hyper-Gamma distribution or one of its special cases. This is a generalization of distributions proposed previously, and leads to good goodness-of-fit scores. The parameters for the distribution are found using the iterative EM algorithm. The results of the analysis have been codified in a modeling program that creates a synthetic workload based on the results of the analysis.
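The generation side of the approach above is easy to sketch: a hyper-Gamma is a mixture of Gamma distributions, so a synthetic workload generator draws each value from one of two Gamma components chosen at random. The function name and all parameter values below are illustrative assumptions, not the paper's fitted values.

```python
import random

def sample_hyper_gamma(n, p, shape1, scale1, shape2, scale2, seed=None):
    """Draw n samples from a two-component Gamma mixture (a hyper-Gamma).

    With probability p a sample comes from Gamma(shape1, scale1),
    otherwise from Gamma(shape2, scale2).  A real model would use
    EM-fitted parameters; these are placeholders.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        if rng.random() < p:
            samples.append(rng.gammavariate(shape1, scale1))
        else:
            samples.append(rng.gammavariate(shape2, scale2))
    return samples
```

Because job runtimes and sizes span several orders of magnitude, the mixture would typically be fitted to log-transformed data, as the abstract notes.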
Job Characteristics of a Production Parallel Scientific Workload on the NASA Ames iPSC/860
, 1995
"... . Statistics of a parallel workload on a 128node iPSC/860 located at NASA Ames are presented. It is shown that while the number of sequential jobs dominates the number of parallel jobs, most of the resources (measured in nodeseconds) were consumed by parallel jobs. Moreover, most of the sequen ..."
Abstract

Cited by 95 (24 self)
Statistics of a parallel workload on a 128-node iPSC/860 located at NASA Ames are presented. It is shown that while the number of sequential jobs dominates the number of parallel jobs, most of the resources (measured in node-seconds) were consumed by parallel jobs. Moreover, most of the sequential jobs were for system administration. The average runtime of jobs grew with the number of nodes used, so the total resource requirements of large parallel jobs were larger by more than the number of nodes they used. The job submission rate during peak day activity was somewhat lower than one every two minutes, and the average job size was small. At night, the submission rate was low but job sizes and system utilization were high, mainly due to NQS. Submission rate and utilization over the weekend were lower than on weekdays. The overall utilization was 50%, after accounting for downtime. About 2/3 of the applications were executed repeatedly, some for a significant number of times....
An Extensible Meta-Learning Approach for Scalable and Accurate Inductive Learning
, 1996
"... Much of the research in inductive learning concentrates on problems with relatively small amounts of data. With the coming age of ubiquitous network computing, it is likely that orders of magnitude more data in databases will be available for various learning problems of real world importance. Som ..."
Abstract

Cited by 44 (8 self)
Much of the research in inductive learning concentrates on problems with relatively small amounts of data. With the coming age of ubiquitous network computing, it is likely that orders of magnitude more data in databases will be available for various learning problems of real-world importance. Some learning algorithms assume that the entire data set fits into main memory, which is not feasible for massive amounts of data, especially for applications in data mining. One approach to handling a large data set is to partition the data set into subsets, run the learning algorithm on each of the subsets, and combine the results. Moreover, data can be inherently distributed across multiple sites on the network and merging all the data in one location can be expensive or prohibitive. In this thesis we propose, investigate, and evaluate a meta-learning approach to integrating the results of mul...
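The partition-and-combine idea can be sketched in a few lines: split the data, run a (here deliberately trivial) base learner on each subset, and merge the base predictions with a plurality-vote combiner. All names below are hypothetical, and the thesis's actual arbiter/combiner strategies over real classifiers are far richer than this toy.

```python
from collections import Counter

def partition(data, k):
    # round-robin split into k subsets (stands in for k distributed sites)
    return [data[i::k] for i in range(k)]

def train_base(labels):
    # toy base "learner": predicts the most common label in its partition
    return Counter(labels).most_common(1)[0][0]

def combine(predictions):
    # simplest combiner: plurality vote over the base learners' outputs
    return Counter(predictions).most_common(1)[0][0]

labels = ["spam", "ham", "spam", "spam", "ham", "spam"]
models = [train_base(part) for part in partition(labels, 3)]
final = combine(models)  # each base model votes; the majority wins
```

The point of the scheme is that each base learner only ever touches its own subset, so no single site needs the whole data set in memory.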
Evaluating the Scalability of Distributed Systems
 IEEE Transactions on Parallel and Distributed Systems
, 2000
"... AbstractÐMany distributed systems must be scalable, meaning that they must be economically deployable in a wide range of sizes and configurations. This paper presents a scalability metric based on costeffectiveness, where the effectiveness is a function of the system's throughput and its quality of ..."
Abstract

Cited by 41 (2 self)
Many distributed systems must be scalable, meaning that they must be economically deployable in a wide range of sizes and configurations. This paper presents a scalability metric based on cost-effectiveness, where the effectiveness is a function of the system's throughput and its quality of service. It is part of a framework which also includes a scaling strategy for introducing changes as a function of a scale factor, and an automated virtual design optimization at each scale factor. This is an adaptation of concepts for scalability measures in parallel computing. Scalability is measured by the range of scale factors that give a satisfactory value of the metric, and good scalability is a joint property of the initial design and the scaling strategy. The results give insight into the scaling capacity of the designs, and into how to improve the design. A rapid simple bound on the metric is also described. The metric is demonstrated in this work by applying it to some well-known idealized systems, and to real prototypes of communications software. Index Terms: Scalability, distributed systems, scalability metric, software performance, performance model, layered queuing, performance optimization, replication.
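The shape of such a metric can be sketched directly: productivity at a given scale is throughput weighted by a quality-of-service value, divided by cost, and scalability between two scale factors is the ratio of their productivities. The function names and the threshold interpretation below are illustrative assumptions; consult the paper for its exact formulation.

```python
def productivity(throughput, qos_value, cost):
    """F = T * f / C: throughput T weighted by a QoS value function f
    (e.g. f near 1 when response time meets its target, decaying
    otherwise), per unit cost C of the deployed configuration."""
    return throughput * qos_value / cost

def scalability(f_small, f_large):
    """Ratio of productivity at two scale factors; values near or
    above 1 suggest the design scales cost-effectively."""
    return f_large / f_small
```

Under this reading, "good scalability" means the ratio stays satisfactory over a wide range of scale factors, which matches the abstract's definition.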
Memory Usage in the LANL CM-5 Workload
 In Job Scheduling Strategies for Parallel Processing
, 1997
"... . It is generally agreed that memory requirements should be taken into account in the scheduling of parallel jobs. However, so far the work on combined processor and memory scheduling has not been based on detailed information and measurements. To rectify this problem, we present an analysis of ..."
Abstract

Cited by 24 (7 self)
It is generally agreed that memory requirements should be taken into account in the scheduling of parallel jobs. However, so far the work on combined processor and memory scheduling has not been based on detailed information and measurements. To rectify this problem, we present an analysis of memory usage by a production workload on a large parallel machine, the 1024-node CM-5 installed at Los Alamos National Lab. Our main observations are:
- The distribution of memory requests has strong discrete components, i.e. some sizes are much more popular than others.
- Many jobs use a relatively small fraction of the memory available on each node, so there is some room for time slicing among several memory-resident jobs.
- Larger jobs (using more nodes) tend to use more memory, but it is difficult to characterize the scaling of per-processor memory usage.
1 Introduction: Resource management includes a number of distinct topics, such as scheduling and memory management. Howeve...
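The first observation, strong discrete components, is easy to check on any accounting log: tally the request sizes and flag the few that account for an outsized share of all jobs. The sample log and the 10% threshold below are made up for illustration.

```python
from collections import Counter

def popular_sizes(requests_mb, share=0.10):
    """Return memory-request sizes that each account for at least
    `share` of all jobs -- the 'discrete components' of the distribution."""
    counts = Counter(requests_mb)
    total = len(requests_mb)
    return sorted(s for s, c in counts.items() if c / total >= share)

# hypothetical per-node memory requests (MB) from an accounting log
log = [8, 8, 8, 8, 16, 16, 16, 3, 8, 16, 5, 8, 16, 8, 7]
```

Here `popular_sizes(log)` picks out 8 MB and 16 MB, while the one-off sizes fall below the threshold, mimicking the "some sizes are much more popular than others" effect.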
Scalability Issues Affecting the Design of a Dense Linear Algebra Library
 JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1994
"... This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widelyused LAPACK library to run efficiently on scalable concurrent computers ..."
Abstract

Cited by 23 (12 self)
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block-cyclic data distribution, which is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of the algorithms' performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highl...
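The block-cyclic distribution mentioned above comes down to simple index arithmetic: with block size nb over p processes, consecutive blocks of the global index space are dealt out round-robin. A one-dimensional sketch follows (ScaLAPACK itself applies the 2-D analogue over a process grid); the function name is ours.

```python
def block_cyclic_owner(g, nb, p):
    """Map global index g to (owning process, local index) under a 1-D
    block-cyclic distribution with block size nb over p processes."""
    block = g // nb                        # which block g falls in
    proc = block % p                       # blocks dealt round-robin
    local = (block // p) * nb + (g % nb)   # position in the owner's storage
    return proc, local
```

For example, with nb = 2 and p = 2, global indices 0-3 land on processes 0, 0, 1, 1, and index 4 wraps back to process 0 as its third local element, which is what spreads each matrix panel across all processes for load balance.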
A Methodology and an Evaluation of the SGI Origin2000
 In Proc. of the Intl. Conf. on Measurement and Modeling of Computer Systems (SIGMETRICS)
"... As hardwarecoherent, distributed shared memory (DSM) multiprocessing becomes popular commercially, it is important to evaluate modern realizations to understand how they perform and scale for a range of interesting applications and to identify the nature of the key bottlenecks. This paper evaluates ..."
Abstract

Cited by 18 (5 self)
As hardware-coherent, distributed shared memory (DSM) multiprocessing becomes popular commercially, it is important to evaluate modern realizations to understand how they perform and scale for a range of interesting applications, and to identify the nature of the key bottlenecks. This paper evaluates the SGI Origin2000, the machine that perhaps has the most aggressive communication architecture among the recent cache-coherent offerings, and, in doing so, articulates a sound methodology for evaluating real systems. We examine data access and synchronization microbenchmarks; speedups for different application classes, problem sizes and scaling models; detailed interactions and time breakdowns using performance tools; and the impact of special hardware support. We find that overall the Origin appears to deliver on the promise of cache-coherent shared address space multiprocessing, at least at the 32-processor scale we examine. The machine is quite easy to program for performance and has fewer...
Ultrascalable Implicit Finite Element Analyses in Solid Mechanics with over a Half a Billion Degrees of Freedom
 In ACM/IEEE Proceedings of SC2004: High Performance Networking and Computing
, 2004
"... We present a highly parallel finite element program, Olympus, equipped with an ultrascalable linear solver, Prometheus, applied to microFE bone modeling calculations on an IBM SP Power3. Scalability is demonstrated with scaled speedup studies of a nonlinear analyses of a vertebral body with over a ..."
Abstract

Cited by 18 (0 self)
We present a highly parallel finite element program, Olympus, equipped with an ultrascalable linear solver, Prometheus, applied to micro-FE bone modeling calculations on an IBM SP Power3. Scalability is demonstrated with scaled speedup studies of nonlinear analyses of a vertebral body with over half a billion degrees of freedom. We show parallel scalability with up to 4088 processors on the ASCI White machine. This work is significant in that, in the domain of unstructured implicit finite element analysis in solid mechanics with complex geometry, this is the first demonstration of a highly parallel and efficient application of a mathematically optimal linear solution method: smoothed aggregation algebraic multigrid.
Performance of a Fully Parallel Sparse Solver
 Int. Journal of Supercomputer Applications
, 1996
"... The performance of a fully parallel direct solver for large sparse symmetric positive definite systems of linear equations is demonstrated. The solver is designed for distributedmemory, messagepassing parallel computer systems. All phases of the computation, including symbolic processing as well a ..."
Abstract

Cited by 17 (4 self)
The performance of a fully parallel direct solver for large sparse symmetric positive definite systems of linear equations is demonstrated. The solver is designed for distributed-memory, message-passing parallel computer systems. All phases of the computation, including symbolic processing as well as numeric factorization and triangular solution, are performed in parallel. A parallel Cartesian nested dissection algorithm is used to compute a fill-reducing ordering for the matrix and an appropriate partitioning of the problem across the processors. The separator...
Parallel Application Scheduling on Networks of Workstations
 Journal of Parallel and Distributed Computing
, 1997
"... Parallel applications can be executed using the idle computing capacity of workstation clusters. However, it remains unclear how to most effectively schedule the processors among different applications. Processor scheduling algorithms that were successful for sharedmemory machines have proven to be ..."
Abstract

Cited by 15 (1 self)
Parallel applications can be executed using the idle computing capacity of workstation clusters. However, it remains unclear how to most effectively schedule the processors among different applications. Processor scheduling algorithms that were successful for shared-memory machines have proven to be inadequate for distributed-memory environments due to the high costs of remote memory accesses and redistributing data. We investigate how knowledge of system load and application characteristics can be used in scheduling decisions. We propose the new algorithm AEP(2) which, by properly exploiting both the information types above, performs better than other non-preemptive scheduling rules, and nearly as well as idealized versions of preemptive rules (with free preemption). We conclude that AEP(2) is suitable for use in scheduling parallel applications on networks of workstations.