Results 1 - 9 of 9
Evaluating the Performance of Cache-Affinity Scheduling in Shared-Memory Multiprocessors - Journal of Parallel and Distributed Computing, 1995
Cited by 47 (1 self)
As a process executes on a CPU, it builds up state in that CPU's cache. In multiprogrammed workloads, the opportunity to reuse this state may be lost when a process gets rescheduled, either because intervening processes destroy its cache state or because the process may migrate to another processor. In this paper, we explore affinity scheduling, a technique that helps reduce cache misses by preferentially scheduling a process on a CPU where it has run recently. Our study focuses on a bus-based multiprocessor executing a variety of workloads, including mixes of scientific, software development, and database applications. In addition to quantifying the performance benefits of exploiting affinity, our study is distinctive in that it provides low-level data from a hardware performance monitor that details why the workloads perform as they do. Overall, for the workloads studied, we show that affinity scheduling reduces the number of cache misses by 7-36%, resulting in execution time improv...
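The core idea of affinity scheduling described in this abstract can be illustrated with a minimal sketch. It is not the scheduler evaluated in the paper: the function name `pick_next`, the single shared ready queue, and the `last_cpu` table are all assumptions made for illustration. When a CPU becomes free, the sketch prefers a ready process that last ran on that CPU and otherwise falls back to the head of the queue.

```python
from collections import deque

def pick_next(ready, cpu, last_cpu):
    """Pick a process for `cpu` from the `ready` deque (hypothetical policy).

    Prefer the first ready process whose last CPU was `cpu`, since some of
    its cache state may still be there; otherwise take the queue head.
    Returns None when nothing is ready.
    """
    for i, pid in enumerate(ready):
        if last_cpu.get(pid) == cpu:
            del ready[i]          # remove the chosen process from the queue
            return pid
    return ready.popleft() if ready else None

# Illustrative data: three ready processes and the CPU each last ran on.
ready = deque(["A", "B", "C"])
last_cpu = {"A": 1, "B": 0, "C": 1}
chosen = pick_next(ready, 0, last_cpu)   # CPU 0 is free; "B" last ran there
```

A real scheduler would also have to bound how long a process can be passed over, so that the affinity preference does not starve the process at the head of the queue.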
Coarse-Grain Parallel Programming in Jade, 1991
Cited by 46 (4 self)
This paper presents Jade, a language which allows a programmer to easily express dynamic coarse-grain parallelism. Starting with a sequential program, a programmer augments those sections of code to be parallelized with abstract data usage information. The compiler and run-time system use this information to concurrently execute the program while respecting the program's data dependence constraints. Using Jade can significantly reduce the time and effort required to develop and maintain a parallel version of an imperative application with serial semantics. The paper introduces the basic principles of the language, compares Jade with other existing languages, and presents the performance of a sparse matrix Cholesky factorization algorithm implemented in Jade.
Data Locality and Load Balancing in COOL, 1993
Cited by 43 (2 self)
Large-scale shared memory multiprocessors typically support a multilevel memory hierarchy consisting of per-processor caches, a local portion of shared memory, and remote shared memory. On such machines, the performance of parallel programs is often limited by the high latency of remote memory references. In this paper we explore how knowledge of the underlying memory hierarchy can be used to schedule computation and distribute data structures, and thereby improve data locality. Our study is done in the context of COOL, a concurrent object-oriented language developed at Stanford. We develop abstractions for the programmer to supply optional information about the data reference patterns of the program. This information is used by the runtime system to distribute tasks and objects so that the tasks execute close (in the memory hierarchy) to the objects they reference.
Characterizing the Behavior of Sparse Algorithms on Caches - In Proceedings of Supercomputing ’92, 1992
Cited by 38 (0 self)
Sparse computations constitute one of the most important areas of numerical algebra and scientific computing. While there are many studies on the locality of dense codes, few deal with the locality of sparse codes. Because of indirect addressing, sparse codes exhibit irregular patterns of references. In this paper, the cache behavior of one of the most frequent primitives, SpMxV (sparse matrix-vector multiply), is analyzed. A model of its references is built, and the performance bottlenecks of SpMxV are then analyzed using the model and simulations. The main parameters are identified, and their roles are explained and quantified. This analysis is then used to discuss optimizations of SpMxV. Moreover, a blocking technique that takes into account the specifics of sparse codes is proposed. Keywords: sparse primitives, cache, performance prediction, data locality. 1 Introduction Due to the increasing difference between memory speed and processor speed, it becomes critical to minimize communications bet...
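To make the indirect addressing mentioned in this abstract concrete, here is a small sketch of SpMxV. The abstract does not say which storage format the paper uses; compressed sparse row (CSR) is assumed here, and the names `spmxv_csr`, `row_ptr`, `col_idx` and `values` are illustrative. The access `x[col_idx[k]]` is the indirect reference: which element of `x` is read depends on the sparsity pattern, so the reference stream into `x` is irregular.

```python
def spmxv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form (assumed format).

    row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i;
    col_idx[k] and values[k] give the column and value of nonzero k.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access: the index into x comes from col_idx.
            y[i] += values[k] * x[col_idx[k]]
    return y

# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = spmxv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])  # [7.0, 6.0, 17.0]
```

The arrays `values` and `col_idx` are read sequentially, while reads of `x` jump around according to `col_idx`.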
Algorithmic Performance Studies on Graphics Processing Units
Cited by 2 (0 self)
We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels on the GPU: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify the matrix-matrix multiplication as a first natural entry-point for a minimally invasive integration of GPUs. We investigate the performance on the NVIDIA GeForce 8800 multicore chip, initially designed for intensive gaming applications. We exploit the architectural features of the GeForce 8800 GPU to design an efficient GPU-parallel sparse matrix solver. A prototype approach to leverage the bandwidth and computing power of GPUs for these matrix kernel operations is demonstrated, resulting in an overall performance of over 110 GFlops/s on the desktop for large matrices. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications. Index Terms: Parallel processing, graphics processing units, matrix decomposition, sparse direct solvers, nonlinear optimization.
BOS is Boss: A Case for Bulk-Synchronous Object Systems - Proc. ACM Symposium on Parallel Algorithms and Architectures (SPAA), 1999
Cited by 1 (0 self)
A key issue for parallel systems is the development of useful programming abstractions that can coexist with good performance. We describe a communication library that supports an object-based abstraction with a bulk-synchronous communication style; this is the first time such a library has been proposed and implemented. By restricting the library to the exclusive use of barrier synchronization, we are able to design a simple and easy-to-use object system. By exploiting established techniques based on the bulk-synchronous parallel (BSP) model, we are able to design algorithms and library implementations that work well across platforms. 1 Introduction Portable parallel programming systems should provide useful abstractions without precluding efficient execution. This paper describes a step towards this goal through the use of a communication library called the BSP Object System (BOS). BOS provides the convenience of efficient shared objects in a system optimized for (and restricted to...
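The bulk-synchronous style this abstract refers to, where barriers are the only synchronization, can be sketched as follows. This is not the BOS library or its API: it is a generic illustration using Python's standard `threading.Barrier`, and the names `run_supersteps`, `inbox` and `values` are assumptions. In each superstep every worker sends its value to its right-hand neighbour, waits at a barrier, and only then reads what it received.

```python
import threading

def run_supersteps(n_workers, n_steps):
    """Rotate values around a ring of workers, one position per superstep."""
    values = list(range(n_workers))   # worker r starts with value r
    inbox = [0] * n_workers           # one receive slot per worker
    barrier = threading.Barrier(n_workers)

    def worker(rank):
        for _ in range(n_steps):
            # Communication phase: send my value to the right neighbour.
            inbox[(rank + 1) % n_workers] = values[rank]
            barrier.wait()            # all sends of this superstep are done
            # Computation phase: consume what arrived.
            values[rank] = inbox[rank]
            barrier.wait()            # all reads done before the next sends

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return values

result = run_supersteps(4, 1)   # [3, 0, 1, 2]
```

Because each worker writes only its neighbour's slot before the barrier and reads only its own slot after it, no locks are needed.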
Scheduling Strategies For Sparse Cholesky Factorization On A Shared Virtual Memory Parallel Computer - In International Conference on Parallel Processing, 1994
Cited by 1 (1 self)
To solve a given problem on a distributed memory parallel computer (DMPC), the message passing programming model involves distributing both the data and the computations among the processors. While this can be easily feasible for well structured problems, it can become fairly hard for unstructured ones, like sparse matrix computations, unless you use some runtime support. In this paper, we consider a relatively new approach to implementing the Cholesky factorization on a DMPC, by using a shared virtual memory (SVM). The abstraction of a shared memory on top of a distributed memory allows us to introduce a large-grain factorization algorithm, synchronized with events. Experiments conducted so far show that some scheduling techniques enhance not only the parallelism but the SVM behavior as well, allowing interesting results. 1 INTRODUCTION In many computational kernels, the solution to a sparse linear system is required. This system may arise in various problems, such as the discretizat...
A New Approach to Parallel Sparse Cholesky Factorization on Distributed Memory Parallel Computers, 1993
Cited by 1 (1 self)
Nowadays, programming distributed memory parallel computers (DMPCs) evokes the "no pain, no gain" idea. That is, for a given problem to be solved in parallel, the message passing programming model involves distributing the data and the computations among the processors. While this can be easily feasible for well structured problems, it can become fairly hard on unstructured ones, like sparse matrix computations. In this paper, we consider a relatively new approach to implementing the Cholesky factorization on a DMPC running a shared virtual memory (SVM). The abstraction of a shared memory on top of a distributed memory allows us to introduce a large-grain factorization algorithm, synchronized with events. Several scheduling strategies are compared, and experiments conducted so far show that this approach can provide the power of DMPCs and the ease of programming with shared variables.
Factorisation parallèle de Cholesky pour matrices creuses sur une mémoire virtuelle partagée, 1988
Unité de recherche INRIA Rennes, IRISA, Campus universitaire de Beaulieu, 35042 Rennes Cedex (France). Parallel sparse Cholesky factorization on a shared virtual memory. Abstract: This paper addresses the problem of factoring large sparse matrices using the Cholesky factorization. Several approaches for MIMD architectures are depicted in both the message passing model and the shared variables one. Our goal is to analyze the behavior of the KOAN shared virtual memory with algorithms that exhibit semi-irregular data access patterns. A new parallel approach is presented and experimental results are provided. Key-words: Sparse Cholesky factorization, shared virtual memory, KOAN, sparse matrices, scheduling.

